Parsing SQL WHERE clause
Posted on July 26th, 2010 by Daniel NichterParsing a SQL WHERE clause is really difficult. A number of people have told me, “looks like grammar”, implying that I should use some traditional grammar/rule-based solution like yacc or bison, but I can’t because Maatkit tools (of which this is a part) must have minimal external dependencies. Plus, I don’t want or need to parse SQL fully; I bet only MySQL can really do that. I just need a 90% solution like the rest of my little Perl SQL parser. So I wrote my own SQL WHERE parser in about 100 lines. I’ll just reproduce the code comments here which explain, in general, how it works:
# This is not your traditional parser, but it works for simple to rather
# complex cases, with a few noted and intentional limitations. First,
# the limitations:
#
# * probably doesn't handle every possible operator (see $op)
# * doesn't care about grouping with parentheses
# * not "fully" tested because the possibilities are infinite
#
# It works in four steps; let's take this WHERE clause as an example:
#
# i="x and y" or j in ("and", "or") and x is not null or a between 1 and 10 and sz="this 'and' foo"
#
# The first step splits the string on and|or, the only two keywords I'm
# aware of that join the separate predicates. This step doesn't care if
# and|or is really between two predicates or in a string or something else.
# The second step is done while the first step is being done: check predicate
# "fragments" (from step 1) for operators; save which ones have and don't
# have at least one operator. So the result of step 1 and 2 is:
#
# PREDICATE FRAGMENT OPERATOR
# ================================ ========
# i="x Y
# and y" N
# or j in (" Y
# and", " N
# or") N
# and x is not null Y
# or a between 1 Y
# and 10 N
# and sz="this ' Y
# and' foo" N
#
# The third step runs through the list of pred frags backwards and joins
# the current frag to the preceding frag if it does not have an operator.
# The result is:
#
# PREDICATE FRAGMENT OPERATOR
# ================================ ========
# i="x and y" Y
# N
# or j in ("and", "or") Y
# N
# N
# and x is not null Y
# or a between 1 and 10 Y
# N
# and sz="this 'and' foo" Y
# N
#
# The fourth step is similar but not shown: pred frags with unbalanced ' or "
# are joined to the preceding pred frag. This fixes cases where a pred frag
# has multiple and|or in a string value; e.g. "foo and bar or dog".
#
# After the pred frags are complete, the parts of these predicates are parsed
# and returned in an arrayref of hashrefs like:
#
# {
# predicate => 'and',
# column => 'id',
# operator => '>=',
# value => '42',
# }
#
# Invalid predicates, or valid ones that we can't parse, will cause
# the sub to die.
The full code is SQLParser.pm which is part of Maatkit.
I’m not sure if I’m using the term “predicate” correctly so don’t quote me, but that’s a trivial concern next to whether the code works or not (it does; it’s tested).