codenode

DB<3> x $code

Debugging Perl regex, backtracking

Posted on February 28th, 2010 by Daniel Nichter

This is a two part article about debugging Perl regular expressions using use re 'debug' with emphasis on understanding backtracking.

For parsing SQL statements I use the following “clauser”:

my @clause = grep { defined $_ } ($query =~ m/\G(.+?)(?:$clauses\s+|\Z)/gci);

$clauses is defined as something like:

my $clauses = qr/(FROM|WHERE)/;

When the first word of a SQL statement is removed, that clauser successfully parses almost every query. For example, “SELECT col FROM tbl WHERE id=1″ with “SELECT” removed gets claused as:

   'col '
   'FROM'
   'tbl '
   'WHERE'
   'id=1'

After shifting off the first value of that list, the remain pairs are represent each clause in the statement.

I thought to myself, “Damn, that’s an elegant clauser.” It passes a good suite of tests, too. With code, however, an elgant appearnce does not guarantee an elegant peformance. So I had to ask myself, “Is this regex efficient?” In particular I wanted to know if the pattern caused excessive backtracking.

I began investigating how to debug or more deeply see and understand what Perl’s regex engine does to accomplish a pattern match. I found the easily available info disjointed and obscure, so I decided to synthesize, understand and write a (?:better)? explanation of Perl regex debugging with the goal of understanding how and why the engine is matching a pattern, with emphasis on backtracking.

The pre-reading is perlre, perlretut and perldebguts–particularly the sections on backtracking. Notice what the first doc says,

backtracking, which is currently used (when needed) by all regular non-possessive expression quantifiers, namely * , *? , + , +?, {n,m}, and {n,m}?.

This will be clarified later. And as the last doc says, we get Perl regex debug info with

use re 'debug'

It explains the parts of this pragma’s output but we’ll explore real examples more thoroughly.

First, let’s get clear on what backtracking is. perlre says backtracking is,

[...] if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part–that’s why it’s called backtracking.

It then gives an example that we’ll look at next. In my humble opinion, I don’t think the example demonstrates backtracking very well, so we’ll modify it to better see backtracking in action.

After much reading, thinking and debugging, my definition of backtracking is,

when Perl gives back previously matched quantified expressions because subsequent expressions would cause the whole pattern match to fail

This definition is more similar to what is said in perlretut,

Doing so, however, wouldn’t allow the whole regexp to match, so after backtracking, a* eventually gave back the a and matched the empty string.

Strictly speaking, I think there’s a distinction between backtracking–which is to give back previously matched quantified expressions–and decrementing the string pos(). Is seems that backtracking usually leads to decrementing the string pos, but we’ll see in a later case where these two operations seem to happen independently.

Now let’s use the example from perlre to explore the output of use re 'debug' and see why I think the perlre example does not demonstrate backtracking very well.

   # code
   "Food is on the foo table." =~ m/\b(foo)\s+(\w+)/i

   # re 'debug' output
   Compiling REx "\b(foo)\s+(\w+)"
   Final program:
      1: BOUND (2)
      2: OPEN1 (4)
      4:   EXACTF <foo> (6)
      6: CLOSE1 (8)
      8: PLUS (10)
      9:   SPACE (0)
     10: OPEN2 (12)
     12:   PLUS (14)
     13:     ALNUM (0)
     14: CLOSE2 (16)
     16: END (0)
   stclass BOUND minlen 5
   Matching REx "\b(foo)\s+(\w+)" against "Food is on the foo table."
   Matching stclass BOUND against "Food is on the foo ta" (21 chars)
      0 <> <Food is on>         |  1:BOUND(2)
      0 <> <Food is on>         |  2:OPEN1(4)
      0 <> <Food is on>         |  4:EXACTF <foo>(6)
      3 <Foo> <d is on th>      |  6:CLOSE1(8)
      3 <Foo> <d is on th>      |  8:PLUS(10)
                                     SPACE can match 0 times out of 2147483647...
                                     failed...
      4 <Food> < is on the>     |  1:BOUND(2)
      4 <Food> < is on the>     |  2:OPEN1(4)
      4 <Food> < is on the>     |  4:EXACTF <foo>(6)
                                     failed...
      5 <Food > <is on the >    |  1:BOUND(2)
      5 <Food > <is on the >    |  2:OPEN1(4)
      5 <Food > <is on the >    |  4:EXACTF <foo>(6)
                                     failed...
      7 <od is> < on the fo>    |  1:BOUND(2)
      7 <od is> < on the fo>    |  2:OPEN1(4)
      7 <od is> < on the fo>    |  4:EXACTF <foo>(6)
                                     failed...
      8 <d is > <on the foo>    |  1:BOUND(2)
      8 <d is > <on the foo>    |  2:OPEN1(4)
      8 <d is > <on the foo>    |  4:EXACTF <foo>(6)
                                     failed...
     10 <is on> < the foo t>    |  1:BOUND(2)
     10 <is on> < the foo t>    |  2:OPEN1(4)
     10 <is on> < the foo t>    |  4:EXACTF <foo>(6)
                                     failed...
     11 <s on > <the foo ta>    |  1:BOUND(2)
     11 <s on > <the foo ta>    |  2:OPEN1(4)
     11 <s on > <the foo ta>    |  4:EXACTF <foo>(6)
                                     failed...
     14 <n the> < foo table>    |  1:BOUND(2)
     14 <n the> < foo table>    |  2:OPEN1(4)
     14 <n the> < foo table>    |  4:EXACTF <foo>(6)
                                     failed...
     15 < the > <foo table.>    |  1:BOUND(2)
     15 < the > <foo table.>    |  2:OPEN1(4)
     15 < the > <foo table.>    |  4:EXACTF <foo>(6)
     18 <e foo> < table.>       |  6:CLOSE1(8)
     18 <e foo> < table.>       |  8:PLUS(10)
                                     SPACE can match 1 times out of 2147483647...
     19 <e foo > <table.>       | 10:  OPEN2(12)
     19 <e foo > <table.>       | 12:  PLUS(14)
                                       ALNUM can match 5 times out of 2147483647...
     24 <e foo table> <.>       | 14:    CLOSE2(16)
     24 <e foo table> <.>       | 16:    END(0)
   Match successful!
   Freeing REx: "\b(foo)\s+(\w+)"

The “final program” is important. It’s the roadmap to successfully matching this pattern. Each node ID is numbered on the left and on the right in parenthesis linked to the next node ID in the sequence of events which leads to END if the pattern matches successfully.

Regarding the output of the pattern execution, perldebguts says that each line has the format,
STRING-OFFSET |ID: TYPE, and that “The TYPE info is indented with respect to the backtracking level.” With that in mind, let’s go through this debug dump. The first significant chunk ends at the first occurrence of “failed…” just after the line,

   3 <Foo> <d is on th>      |  8:PLUS(10)

At string pos 3, the first two expressions of the pattern have matched: \b(foo). Now Perl is trying to match the third expression: \s+. You’ll see that the output lists the expression’s quantifier first: PLUS and then, on the next line, the type of expression. It took me some time to get used to this since I read left to right so I’m used to thinking type then quantifier, not the inverse. If this node/expression/type matched we’d jump to node ID 10 which, from the roadmap, is 10: OPEN2 (12). But it doesn’t match so Perl jumps all the way back to node 1 as seen on the line after “failed…”.

Although Perl failed to match \s+ and restarted from node 1, it has not technically backtracked because no quantified expressions were taken. The expressions matched thus far were \b(foo), none of which are quantified. perlre doesn’t say that Perl backtracks at this point, but it seems implied (to me, at least) because it’s talking about backtracking and using this as an example.

Any way, the regex engines marches on until pos 15 at the same level of indention, repeating the same pattern of failing to match 4:EXACTF <foo>(6). At pos 15 Perl runs into “foo” again and the first three expressions match: \b(foo)\s+. At the line,

   19 <e foo > <table.>       | 10:  OPEN2(12)

the indention level increases, indicating that a backtracking “milestone” has been established, so speak, and if the next expression fails to match we’ll have to backtrack and give up everything established by that milestone. At this point that milestone has established \b(foo)\s+ or “foo ” from pos 15-19. Now we try to match the final expression: (\w+) from node 12.

The final expression matches and the indention level increases again, meaning that we have taken \b(foo)\s+(\w+) or “foo table”. That being the whole pattern, node 16 END is called and we’re done.

Note that Perl does not backtrack in this example, which is why I said it does not demonstrate backtracking very well. If the final expression had failed to match, Perl would have backtracked to node 1 and string pos 18. Let’s see this in action by replacing “foo table.” with “foo . foo table.”:

    [snip]
    15 < the > <foo . foo >    |  1:BOUND(2)
    15 < the > <foo . foo >    |  2:OPEN1(4)
    15 < the > <foo . foo >    |  4:EXACTF <foo>(6)
    18 <e foo> < . foo tab>    |  6:CLOSE1(8)
    18 <e foo> < . foo tab>    |  8:PLUS(10)
                                    SPACE can match 1 times out of 2147483647...
    19 < foo > <. foo tabl>    | 10:  OPEN2(12)
    19 < foo > <. foo tabl>    | 12:  PLUS(14)
                                      ALNUM can match 0 times out of 2147483647...
                                      failed...
                                    failed...
    18 <e foo> < . foo tab>    |  1:BOUND(2)
    18 <e foo> < . foo tab>    |  2:OPEN1(4)
    18 <e foo> < . foo tab>    |  4:EXACTF <foo>(6)
                                    failed...
    21 <oo . > <foo table>     |  1:BOUND(2)
    21 <oo . > <foo table>     |  2:OPEN1(4)
    21 <oo . > <foo table>     |  4:EXACTF <foo>(6)
    24 < . foo> < table>       |  6:CLOSE1(8)
    24 < . foo> < table>       |  8:PLUS(10)
                                    SPACE can match 1 times out of 2147483647...
    25 < . foo > <table>       | 10:  OPEN2(12)
    25 < . foo > <table>       | 12:  PLUS(14)
                                      ALNUM can match 5 times out of 2147483647...
    30 < . foo table> <>       | 14:    CLOSE2(16)
    30 < . foo table> <>       | 16:    END(0)
   Match successful!

This example more clearly shows backtracking via changes in indention levels. The two “failed…” lines show that Perl not only backtracked–i.e. it gave back what it had taken for the node IDs up to the failed ALNUM match–but also that it went one string pos back, from 19 to 18, to restart the whole pattern match.

As mentioned earlier, backtracking applies to quantified expressions. The perlre example has two quantified expressions: \s+ and (\w+). Therefore two backtracking levels are seen in the debug output when each of these quantified expressions is taken/matched. The first two expressions, \b and foo are not quantified so as they were matched (several times for \b) new backtrack/indention levels were not created. Let’s see this is in action by changing \b to \s+:

   Matching REx "\s+(foo)\s+(\w+)" against "Food is on the foo table."
   Matching stclass SPACE against "Food is on the foo t" (20 chars)
      4 <Food> < is on the>     |  1:PLUS(3)
                                     SPACE can match 1 times out of 2147483647...
                                     failed...
      7 <od is> < on the fo>    |  1:PLUS(3)
                                     SPACE can match 1 times out of 2147483647...
                                     failed...
     10 <is on> < the foo t>    |  1:PLUS(3)
                                     SPACE can match 1 times out of 2147483647...
                                     failed...
     14 <n the> < foo table>    |  1:PLUS(3)
                                     SPACE can match 1 times out of 2147483647...
     15 < the > <foo table.>    |  3:  OPEN1(5)
     15 < the > <foo table.>    |  5:  EXACTF <foo>(7)
     18 <e foo> < table.>       |  7:  CLOSE1(9)
     18 <e foo> < table.>       |  9:  PLUS(11)
                                       SPACE can match 1 times out of 2147483647...
     19 <e foo > <table.>       | 11:    OPEN2(13)
     19 <e foo > <table.>       | 13:    PLUS(15)
                                         ALNUM can match 5 times out of 2147483647...
     24 <e foo table> <.>       | 15:      CLOSE2(17)
     24 <e foo table> <.>       | 17:      END(0)

Notice: three indention levels for the three quantified expressions.

That takes care of the perlre example of backtracking, so let’s critique the perlretut example:

   # code
   "ab" =~ m/a*ab/;

   # re 'debug' output
   Compiling REx "a*ab"
   synthetic stclass "ANYOF[a]".
   Final program:
      1: STAR (4)
      2:   EXACT <a> (0)
      4: EXACT <ab> (6)
      6: END (0)
   floating "ab" at 0..2147483647 (checking floating) stclass ANYOF[a] minlen 2
   Guessing start of match in sv for REx "a*ab" against "ab"
   Found floating substr "ab" at offset 0...
   start_shift: 0 check_at: 0 s: 0 endpos: 1
   Does not contradict STCLASS...
   Guessed: match at offset 0
   Matching REx "a*ab" against "ab"
   Matching stclass ANYOF[a] against "a" (1 chars)
      0 <> <ab>                 |  1:STAR(4)
                                     EXACT <a> can match 1 times out of 2147483647...
      0 <> <ab>                 |  4:  EXACT <ab>(6)
      2 <ab> <>                 |  6:  END(0)

I’m going to dare to say that following blockquote from perlretut is wrong:

This obviously matches, but in the process of matching, the subexpression a* first grabbed the a . Doing so, however, wouldn’t allow the whole regexp to match, so after backtracking, a* eventually gave back the a and matched the empty string.

I think it’s wrong because the Perl regex debug output does not support the claim that Perl gave back the first ‘a’ for a* so that ab could match. Unless I’m reading the regex debug output incorrectly, the first 2 lines suggest that a* matched and, being a quantified expression, caused an increase in the backtrack/indention level as seen in the last 2 lines. There’s no “failed…” line like we saw in the previous examples.

My explanation–and I’ll disclaim right now that I don’t have direct evidence–is that Perl optimized away node 1, making it a null operation that always matches because zero or more of an expression is the same as a non-existent expression for virtuously lazy langauges like Perl. (Optimized-away nodes are not something I just made up to support my claim.)

In any case, I think the debug output clearly shows that Perl did not backtrack–it gave nothing back–and it matched, instead, due to being smart/optimized.

Since we’re deep in this, let’s go deeper and examine what happens when we change the pattern to "ab" =~ m/(ab)*ab/:

   Compiling REx "(ab)*ab"
   synthetic stclass "ANYOF[a]".
   Final program:
      1: CURLYM[1] {0,32767} (11)
      5:   EXACT <ab> (9)
      9:   SUCCEED (0)
     10: NOTHING (11)
     11: EXACT <ab> (13)
     13: END (0)
   floating "ab" at 0..2147483647 (checking floating) stclass ANYOF[a] minlen 2
   Guessing start of match in sv for REx "(ab)*ab" against "ab"
   Found floating substr "ab" at offset 0...
   start_shift: 0 check_at: 0 s: 0 endpos: 1
   Does not contradict STCLASS...
   Guessed: match at offset 0
   Matching REx "(ab)*ab" against "ab"
   Matching stclass ANYOF[a] against "a" (1 chars)
      0 <> <ab>                 |  1:CURLYM[1] {0,32767}(11)
      0 <> <ab>                 |  5:  EXACT <ab>(9)
      2 <ab> <>                 |  9:  SUCCEED(0)
                                       subpattern success...
                                     CURLYM now matched 1 times, len=2...
      2 <ab> <>                 |  5:  EXACT <ab>(9)
                                       failed...
                                     CURLYM trying tail with matches=1...
                                     CURLYM trying tail with matches=0...
      0 <> <ab>                 | 11:  EXACT <ab>(13)
      2 <ab> <>                 | 13:  END(0)

I can't find mention in official docs if the indention of the CURLM lines is also with respect to backtracking, but my tests suggest they are not. Rather, second and subsequent CURLYM lines appear at the same indention level as the first appearance to which they relate. This way one can keep track of them if there are multiple. Since this is beside the point, I won't reproduce my tests which support this claim.

For now, just imagine the last 3 CURLYM lines are indented one level. The output then suggests that Perl, once again, did not backtrack. What's interesting, though, is that the output does show that it decremented its pos from 2 to 0 after it realized that matching the first expression would prevent the whole pattern from matching.

Therefore we see that decrementing the string position and backtracking are sometimes independent--at least it appears so in this case. Like the previous example, I suspect Perl optimized away the (ab)* expression, but I currently lack the profundity of enlightenment required to hack into Perl internals and prove this.

That does it for the perlretut example of backtracking, and that should be sufficient to cover the much more complex case: the clauser, $query =~ m/\G(.+?)(?:$clauses\s+|\Z)/gci. The clauser regex is analyzed in part 2, which has yet to be written...

Copyright © 2009 codenode. Theme by THAT Agency powered by WordPress.