blog.8-p.info

At the very beginning of my career, I was using Perl. Mixi, the company I worked for, was a Perl shop. They were literally using LAMP (Linux, Apache, MySQL and Perl) at that time. Naturally, I read the source of Perl’s interpreter a bit, so I know a few odd things about Perl.

The Lord of the Rings

Every .c file in the Perl interpreter has a quote from “The Lord of the Rings”. For example, perl.c has:

/*
 *      A ship then new they built for him
 *      of mithril and of elven-glass
 *              --from Bilbo's song of Eärendil
 *
 *     [p.236 of _The Lord of the Rings_, II/i: "Many Meetings"]
 */

regcomp.c has:

/*
 * 'A fair jaw-cracker dwarf-language must be.'            --Samwise Gamgee
 *
 *     [p.285 of _The Lord of the Rings_, II/iii: "The Ring Goes South"]
 */

Perl’s VM has a lot of opcodes

Perl’s opcode.h is long because many built-in functions have their own opcodes internally. chomp has its own opcode, and so does stat.

Some of them share an implementation widely. For example, sin, cos, exp, log and sqrt each have their own opcode (OP_SIN, OP_COS and so on), but all of them eventually go to pp_sin, which switches on the op type.

/* also used for: pp_cos() pp_exp() pp_log() pp_sqrt() */

PP(pp_sin)
{
    dSP; dTARGET;
    int amg_type = fallback_amg;
    const char *neg_report = NULL;
    const int op_type = PL_op->op_type;

    switch (op_type) {
    case OP_SIN:  amg_type = sin_amg; break;
    case OP_COS:  amg_type = cos_amg; break;
    case OP_EXP:  amg_type = exp_amg; break;
    case OP_LOG:  amg_type = log_amg;  neg_report = "log";  break;
    case OP_SQRT: amg_type = sqrt_amg; neg_report = "sqrt"; break;
    }
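
To make the sharing concrete, here is a minimal sketch of the same pattern in a toy VM. Everything prefixed with toy_ or TOY_ is hypothetical and invented for illustration; it is not Perl's code, and the dispatch table is an assumption about how such a VM might be wired, not a description of Perl's internals. Four opcodes point at one handler, and that handler switches on the opcode it was called for. The zero_report variable plays the role neg_report plays in pp_sin: it is set only for the op whose input domain is restricted.

```c
#include <stddef.h>

/* Hypothetical opcodes for a toy VM; these are not Perl's opcodes. */
enum toy_op { TOY_NEGATE, TOY_ABS, TOY_SQUARE, TOY_RECIP, TOY_CLEAR, TOY_OP_COUNT };

struct toy_vm {
    double acc;         /* the single value the toy VM operates on */
    const char *error;  /* name of the op that failed, or NULL */
};

/* One handler shared by four opcodes. It switches on the opcode it was
 * called for, in the spirit of pp_sin switching on PL_op->op_type. */
static void toy_pp_unary(struct toy_vm *vm, enum toy_op op)
{
    /* Set only for ops with a restricted domain (here: no reciprocal of 0). */
    const char *zero_report = (op == TOY_RECIP) ? "recip" : NULL;

    if (zero_report && vm->acc == 0.0) {
        vm->error = zero_report;
        return;
    }
    switch (op) {
    case TOY_NEGATE: vm->acc = -vm->acc; break;
    case TOY_ABS:    vm->acc = vm->acc < 0 ? -vm->acc : vm->acc; break;
    case TOY_SQUARE: vm->acc = vm->acc * vm->acc; break;
    case TOY_RECIP:  vm->acc = 1.0 / vm->acc; break;
    default: break;
    }
}

/* A second, unrelated handler so the table has more than one target. */
static void toy_pp_clear(struct toy_vm *vm, enum toy_op op)
{
    (void)op;
    vm->acc = 0.0;
}

typedef void (*toy_handler)(struct toy_vm *, enum toy_op);

/* Dispatch table indexed by opcode: several slots hold the same function. */
static const toy_handler toy_ppaddr[TOY_OP_COUNT] = {
    [TOY_NEGATE] = toy_pp_unary,
    [TOY_ABS]    = toy_pp_unary,
    [TOY_SQUARE] = toy_pp_unary,
    [TOY_RECIP]  = toy_pp_unary,
    [TOY_CLEAR]  = toy_pp_clear,
};

/* Run ops in order, stopping at the first error. */
static void toy_run(struct toy_vm *vm, const enum toy_op *ops, size_t n)
{
    for (size_t i = 0; i < n && vm->error == NULL; i++)
        toy_ppaddr[ops[i]](vm, ops[i]);
}
```

Starting from an accumulator of 4.0 and running TOY_NEGATE, TOY_ABS, TOY_SQUARE, TOY_RECIP walks through -4, 4, 16 and ends at 0.0625, all through the single toy_pp_unary function. The trade-off is the usual one: less duplicated setup code per op, at the cost of a switch inside the handler.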

A heuristic approach to resolve the grammar’s ambiguity

This is the best.

In Perl’s grammar, s/$numbers[1]/xxx/g; could be parsed as either “embed $numbers here and then treat [1] as a set of characters” or “embed $numbers[1] here”. The interpreter uses a heuristic to decide which one it is. The implementation is in toke.c:

/* S_intuit_more
 * Returns TRUE if there's more to the expression (e.g., a subscript),
 * FALSE otherwise.
 *
 * It deals with "$foo[3]" and /$foo[3]/ and /$foo[0123456789$]+/
 *
 * ->[ and ->{ return TRUE
 * ->$* ->$#* ->@* ->@[ ->@{ return TRUE if postderef_qq is enabled
 * { and [ outside a pattern are always subscripts, so return TRUE
 * if we're outside a pattern and it's not { or [, then return FALSE
 * if we're in a pattern and the first char is a {
 *   {4,5} (any digits around the comma) returns FALSE
 * if we're in a pattern and the first char is a [
 *   [] returns FALSE
 *   [SOMETHING] has a funky algorithm to decide whether it's a
 *      character class or not.  It has to deal with things like
 *      /$foo[-3]/ and /$foo[$bar]/ as well as /$foo[$\d]+/
 * anything else returns TRUE
 */

/* This is the one truly awful dwimmer necessary to conflate C and sed. */

STATIC int
S_intuit_more(pTHX_ char *s, char *e)
{
    PERL_ARGS_ASSERT_INTUIT_MORE;

    if (PL_lex_brackets)
        return TRUE;
    if (*s == '-' && s[1] == '>' && (s[2] == '[' || s[2] == '{'))
        return TRUE;
    if (*s == '-' && s[1] == '>'
     && FEATURE_POSTDEREF_QQ_IS_ENABLED
     && ( (s[2] == '$' && (s[3] == '*' || (s[3] == '#' && s[4] == '*')))
        ||(s[2] == '@' && memCHRs("*[{",s[3])) ))
        return TRUE;
    if (*s != '{' && *s != '[')
        return FALSE;
    PL_parser->sub_no_recover = TRUE;
    if (!PL_lex_inpat)
        return TRUE;
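
The rules listed in the comment can be turned into a much smaller toy version. The sketch below is not Perl's algorithm: toy_intuit_more, toy_is_quantifier and toy_looks_like_subscript are hypothetical names, the lexer state (PL_lex_brackets, the postderef_qq feature) is left out, and the [SOMETHING] case is replaced by a crude stand-in rule of my own (an optional minus sign plus digits, or a $name, directly followed by ] counts as a subscript). It only shows the shape of the decision.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy check: does s start with a quantifier such as {4,5}?
 * Simplification: only digits and at most one comma may appear. */
static bool toy_is_quantifier(const char *s)
{
    size_t digits = 0, commas = 0;

    if (*s != '{')
        return false;
    for (s++; *s && *s != '}'; s++) {
        if (isdigit((unsigned char)*s))
            digits++;
        else if (*s == ',')
            commas++;
        else
            return false;
    }
    return *s == '}' && digits > 0 && commas <= 1;
}

/* Stand-in for the "funky algorithm". s points just past the '['. */
static bool toy_looks_like_subscript(const char *s)
{
    if (*s == '$') {                       /* [$name] */
        s++;
        if (!(isalpha((unsigned char)*s) || *s == '_'))
            return false;
        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
        return *s == ']';
    }
    if (*s == '-')                         /* [-3] */
        s++;
    if (!isdigit((unsigned char)*s))
        return false;
    while (isdigit((unsigned char)*s))
        s++;
    return *s == ']';
}

/* s points at the text right after a variable name such as $foo.
 * Returns true if that text should be read as part of the expression. */
static bool toy_intuit_more(const char *s, bool in_pattern)
{
    if (s[0] == '-' && s[1] == '>' && (s[2] == '[' || s[2] == '{'))
        return true;
    if (*s != '{' && *s != '[')
        return false;
    if (!in_pattern)
        return true;                       /* always a subscript here */
    if (*s == '{')
        return !toy_is_quantifier(s);
    if (s[1] == ']')
        return false;                      /* [] is not a subscript */
    return toy_looks_like_subscript(s + 1);
}
```

With this toy rule, the text after $numbers in s/$numbers[1]/xxx/g is read as a subscript, while the text after $foo in /$foo[0123456789$]+/ is read as a character class, matching the examples in the comment. Many other inputs would need the real function's more careful guessing.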

The whole S_intuit_more function is ~130 lines long. Perl is great.