codenode

DB<3> x $code

Single pass replace with Perl regex \G anchor

Posted on June 24th, 2010 by Daniel Nichter

Let’s say we have the query SELECT a, b, c FROM tbl WHERE id=1 ORDER BY a ASC, b ASC, c ASC and we want to remove the redundant ASC keywords, i.e. replace them with a blank string. This has to be fast and efficient which means a single, non-backtracking pass.

The first temptation might be to write:

1 while $query =~ s/(ORDER BY.+?)\s+ASC/$1/gmsi;

That works but it’s horribly inefficient because it does worse than backtracking: it restarts. After every match Perl restarts the pattern match from the beginning of the string, looking for ORDER BY. If the string is huge then these restarts absolutely kill performance–I’ll demonstrate this later.

It’s often the case, though, that there’s a point in the string after which all potential replaceable substrings will occur. This allows for a single, non-backtracking, non-restarting pass. The key is to anchor Perl at that point and tell it search and replace forward until the end of the string. This is what the \G anchor is used for:

if ( $query =~ m/\bORDER BY /gi ) {
   1 while $query =~ s/\G(.+?)\s+ASC/$1/gmsi && pos $query;
}

The first condition finds the anchor point if it exists. If it doesn’t, then we’re done, but if it does then we begin an anchored search and replace. For the first iteration, the \G causes Perl to search from where the first condition anchored, i.e. just after ORDER BY. The /g (global match) modifier moves the anchor forward after each match so second and subsequent iterations only search forward. Finally, matching never backtracks and never restarts thanks to && pos $query because normally Perl would retry the whole string after hitting its end. use re "debug" shows this to be the case. Instead, the end of the string will cause the pattern match to fail and pos $query will become false, terminating the loop.

The speed difference is enormous. I created a 216M string with a few hundred thousand occurrences of a keyword after an anchor near the end of the text (anchor pos at 70% total string length) and benchmarked how long it took each approach to remove the keywords after the anchor. The first approach took:

^CCommand terminated by signal 2
882.25user 269.79system 19:14.11elapsed 99%CPU

It took so long that I terminated it after 19 minutes. The second approach took:

16.04user 0.64system 0:16.75elapsed 99%CPU

The first approach obviously suffered from having to re-find the anchor point hundreds of thousands of time, each time discarding the first 70% of the string. Conversely, the second approach discarded the first 70% of the string once it found the anchor point and then did a single forward pass to the end of the string.

For small inputs the difference is probably negligible, but for hundreds of millions of small inputs the difference can add up, and that’s the kind of data set this particular code has to deal with quickly and efficiently.

Are you a hacker?

Posted on June 2nd, 2010 by Daniel Nichter

A stranger once asked me, “Are you a hacker?”, when he saw me coding on my Linux-running laptop. I didn’t know how to answer him, but after seeing only one guy in the OSBridge Hacker Lounge last night at 21:30 hours when I was finally going home I realized that, no, I’m not a hacker. Being in your teens and actually coding all night are requirements for being a “hacker” proper. The rest of us are just professionals who hack it until 22:00 hours at best then go home for a full nights sleep. “Hacker” whatever at conferences amuse me because let’s face it: being young, owlish, and poor of diet and sleep may have gotten us to this point in our lives, but it does not and will not take us further.

Copyright © 2009 codenode. Theme by THAT Agency powered by WordPress.