codenode

"A designer achieves perfection when nothing is left to remove." —Saint-Exupéry

Single pass replace with Perl regex \G anchor

without comments

Let’s say we have the query SELECT a, b, c FROM tbl WHERE id=1 ORDER BY a ASC, b ASC, c ASC and we want to remove the redundant ASC keywords, i.e. replace them with a blank string. This has to be fast and efficient which means a single, non-backtracking pass.

The first temptation might be to write:

1 while $query =~ s/(ORDER BY.+?)\s+ASC/$1/gmsi;

That works but it’s horribly inefficient because it does worse than backtracking: it restarts. After every match Perl restarts the pattern match from the beginning of the string, looking for ORDER BY. If the string is huge then these restarts absolutely kill performance–I’ll demonstrate this later.

It’s often the case, though, that there’s a point in the string after which all potential replaceable substrings will occur. This allows for a single, non-backtracking, non-restarting pass. The key is to anchor Perl at that point and tell it search and replace forward until the end of the string. This is what the \G anchor is used for:

if ( $query =~ m/\bORDER BY /gi ) {
   1 while $query =~ s/\G(.+?)\s+ASC/$1/gmsi && pos $query;
}

The first condition finds the anchor point if it exists. If it doesn’t, then we’re done, but if it does then we begin an anchored search and replace. For the first iteration, the \G causes Perl to search from where the first condition anchored, i.e. just after ORDER BY. The /g (global match) modifier moves the anchor forward after each match so second and subsequent iterations only search forward. Finally, matching never backtracks and never restarts thanks to && pos $query because normally Perl would retry the whole string after hitting its end. use re "debug" shows this to be the case. Instead, the end of the string will cause the pattern match to fail and pos $query will become false, terminating the loop.

The speed difference is enormous. I created a 216M string with a few hundred thousand occurrences of a keyword after an anchor near the end of the text (anchor pos at 70% total string length) and benchmarked how long it took each approach to remove the keywords after the anchor. The first approach took:

^CCommand terminated by signal 2
882.25user 269.79system 19:14.11elapsed 99%CPU

It took so long that I terminated it after 19 minutes. The second approach took:

16.04user 0.64system 0:16.75elapsed 99%CPU

The first approach obviously suffered from having to re-find the anchor point hundreds of thousands of time, each time discarding the first 70% of the string. Conversely, the second approach discarded the first 70% of the string once it found the anchor point and then did a single forward pass to the end of the string.

For small inputs the difference is probably negligible, but for hundreds of millions of small inputs the difference can add up, and that’s the kind of data set this particular code has to deal with quickly and efficiently.

Written by Daniel Nichter

June 24th, 2010 at 4:46 pm

Posted in Perl

Tagged with , , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>