Skip to content

Conversation

@iabyn
Copy link
Contributor

@iabyn iabyn commented Dec 14, 2025

The main commit in this branch fixes GH #16197.
The follow-up commit does some minor code tidying.

  • This set of changes requires a perldelta entry, and it is included.

GH #16197

The main purpose of this commit is to stop premature scope unwinding
within eval code in regexes. Aside from backtracking on failure, the
scopes of every eval, such as (?{ local $x }), are supposed to accumulate,
and are only unwound en-masse at the very end of the pattern match.
However, a combination of a sub-pattern call, such as (??{...}) or
(?&...), combined with a cut, (?>...), can trigger earlier savestack
popping.

The direct fix for this, as explained below, is to remove this single
line from the EVAL_postponed_A/B: case:

    REGCP_UNWIND(ST.cp); /* LEAVE in disguise */

However, that line is entwined with code which attempts to preserve the
final value of $^R during scope unwinding.  Since that code was kind of
working around the misplaced REGCP_UNWIND(), it needs ripping out and
re-implementing. This has to be done at the same time, so the bulk of
this commit is actually concerned with $^R, even though it isn't the
subject of this ticket. So this commit doesn't change the behaviour of
$^R, but just changes its implementation somewhat.

The $^R issue is that every /(?{...})/ causes the equivalent of

    local $^R = ...;

to be executed. During final exit, the savestack gets unwound, and all
those local's get undone, leaving $^R with the value it had before the
match started. But we promise that after the match,  $^R will hold the
value of the most recent (?{...}). The code which this commit rips out
restored that value in one way; the new code in this commit does it a
different way. Basically, almost the last thing S_regmatch() does is a
LEAVE_SCOPE(orig_savestack_ix); This commit makes it so that the current
value of $^R is copied just before the LEAVE_SCOPE(), and that value is
copied back to $^R just after the LEAVE_SCOPE().

For efficiency, we only do the copy if we've actually set $^R. A mechanism
is also needed to ensure that the temporary copy doesn't leak if we die
during the savestack unwind. This is achieved by holding a pointer to
the copy in the aux_eval struct, which gets processed even if we die.

Now back to the main purpose of this commit, the premature stack unwind
in the presence of a cut with a sub-pattern. This bug has been there
since these features were added. It is instructive to look at a somewhat
idealised overview of the S_regmatch() function from around 5.6.0 (with
some bug fixes from later releases added). This was while the function
was still recursive. It looks approximately like:

    pp_match(...) {
        I32 ix = PL_savestack_ix;
        ... do a match ...
        LEAVE_SCOPE(ix);
    }

    S_regmatch(...) {
        while (scan) {
            switch (OP(scan)) {
            case FOO:
                if (! there's a FOO)
                    return 0;
                I32 ix = PL_savestack_ix;
                if (regmatch(...)) /* recursively match rest of pattern */
                    return 1;
                LEAVE_SCOPE(ix);
                return 0;

            case END:
                if (doing a (??{...}) ) {
                    I32 ix = PL_savestack_ix;
                    if (regmatch(...) { /* recursively match rest of pattern */
                        LEAVE_SCOPE(ix);
                        return 1;
                    }
                    LEAVE_SCOPE(ix);
                    return 0;
                }
                return 1;

            case EVAL:
                ... run the code, then, if its a (?{...}) ...
                I32 ix = PL_savestack_ix;
                if (regmatch(...) {   /* recursively run subpattern */
                    LEAVE_SCOPE(ix)
                    return 1;
                }
                LEAVE_SCOPE(ix);
                return 0;
        }
    }

Here, the FOO: case represents all the various ops which recurse. In
general, they match the next item and then recurse to match the rest of
the pattern.

Note that they all do a LEAVE_SCOPE() only in the *failure* branch.

At the end of a successful match, there is potentially much recursion,
and much stuff on the savestack. When the END op is reached, the series
of 'return 1's causes all the recursion to unwind, while leaving the
savestack untouched. Finally, the caller - such as pp_match() - clears
the savestack. In more recent perls the recursion has been removed and
the final LEAVE_SCOPE() is done within S_regmatch() itself, but the
principle remains the same: no stack freeing is done *during* matching,
and instead there's a single big clean up at the end.

Once (??{...}) enters the picture,  that changes a bit. When the END op
associated with the '...' sub-pattern is reached, regmatch() is called
recursively to process any pattern after the (??{..}); then on success,
while working its way back through the nested regmatch() calls, both the
END and the EVAL code do a LEAVE_SCOPE() in the *success* branch.

This is anomalous, and those two LEAVE_SCOPE()'s are what this commit
removes (although in the current non-recursive regex engine, they are
shared by the same piece of code, so only one had to be removed). By
removing them, this regularises the behaviour of sub-patterns. I can't
think why those LEAVE_SCOPE()s were originally added, and assume it was a
thinko.

Normally it makes no difference whether the savestack is popped near the
end, interleaved with popping all recursive regmatch calls (or
equivalently on non-recursive engines popping the regmatch_state stack),
or whether the savestack is popped only after all the recursion is
exited from.

However, it makes a difference in the presence of a cut, (?>...). Here,
the final op of the sub-pattern '...' is SUCCEED, which rather than
recursing to match anything following the cut block, just returns. The
recursion pops back to the SUSPEND op which started the cut, which then
continues with the op loop as normal.

Thus when about to match the first op following a (?>...), the recursion
*within* the cut has been blown away as if it never happened, but the
accumulated savestack entries (e.g. from evals within the cut block) are
preserved and continue to accumulate.

Now, if there is a (??{...}) sub-pattern, or similarly a (?&FOO), within
the cut, then at the end of the cut, the recursion is unwound, which
includes the stacked EVAL and END recursions, which at this time call
LEAVE_SCOPE(), which frees part of the savestack, even though the
pattern match hasn't ended yet. That's the bug which this commit fixes.

The tests added to pat_re_eval.t check that the scope bug has been
fixed. The test added to svleak.t checks that the new $^R copying code
doesn't leak.
This commit should produce no practical change in functionality.

Currently, S_regtry() notes the position of PL_savestack_ix, calls
S_regmatch(), then pops the savestack back to that position. However,
S_regmatch() also does this just before returning, so it's redundant in
S_regtry(). (A temporary assert confirmed that lastcp == PL_savestack_ix
in S_regtry always while running the test suite).

So this commit removes the REGCP_UNWIND(lastcp) and associated machinery
from S_regtry().

It also regularises the "note current ix; pop back to old ix" code at
the start and end of S_regmatch() to use the standard REGCP_SET() and
REGCP_UNWIND() macros which do the same thing but also produce debugging
messages.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants