codenode

DB<3> x $code

Parsing SQL WHERE clause

Posted on July 26th, 2010 by Daniel Nichter

Parsing a SQL WHERE clause is really difficult. A number of people have told me, “looks like grammar”, implying that I should use some traditional grammar/rule-based solution like yacc or bison, but I can’t because Maatkit tools (of which this is a part) must have minimal external dependencies. Plus, I don’t want or need to parse SQL fully; I bet only MySQL can really do that. I just need a 90% solution like the rest of my little Perl SQL parser. So I wrote my own SQL WHERE parser in about 100 lines. I’ll just reproduce the code comments here which explain, in general, how it works:


# This is not your traditional parser, but it works for simple to rather
# complex cases, with a few noted and intentional limitations.  First,
# the limitations:
#
#   * probably doesn't handle every possible operator (see $op)
#   * doesn't care about grouping with parentheses
#   * not "fully" tested because the possibilities are infinite
#
# It works in four steps; let's take this WHERE clause as an example:
#
#   i="x and y" or j in ("and", "or") and x is not null or a between 1 and 10 and sz="this 'and' foo"
#
# The first step splits the string on and|or, the only two keywords I'm
# aware of that join the separate predicates.  This step doesn't care if
# and|or is really between two predicates or in a string or something else.
# The second step is done while the first step is being done: check predicate
# "fragments" (from step 1) for operators; save which ones have and don't
# have at least one operator.  So the result of step 1 and 2 is:
#
#   PREDICATE FRAGMENT                OPERATOR
#   ================================  ========
#   i="x                              Y
#   and y"                            N
#   or j in ("                        Y
#   and", "                           N
#   or")                              N
#   and x is not null                 Y
#   or a between 1                    Y
#   and 10                            N
#   and sz="this '                    Y
#   and' foo"                         N
#
# The third step runs through the list of pred frags backwards and joins
# the current frag to the preceding frag if it does not have an operator.
# The result is:
#
#   PREDICATE FRAGMENT                OPERATOR
#   ================================  ========
#   i="x and y"                       Y
#                                     N
#   or j in ("and", "or")             Y
#                                     N
#                                     N
#   and x is not null                 Y
#   or a between 1 and 10             Y
#                                     N
#   and sz="this 'and' foo"           Y
#                                     N
#
# The fourth step is similar but not shown: pred frags with unbalanced ' or "
# are joined to the preceding pred frag.  This fixes cases where a pred frag
# has multiple and|or in a string value; e.g. "foo and bar or dog".
#
# After the pred frags are complete, the parts of these predicates are parsed
# and returned in an arrayref of hashrefs like:
#
#   {
#     predicate => 'and',
#     column    => 'id',
#     operator  => '>=',
#     value     => '42',
#   }
#
# Invalid predicates, or valid ones that we can't parse,  will cause
# the sub to die.

The full code is SQLParser.pm which is part of Maatkit.

I’m not sure if I’m using the term “predicate” correctly so don’t quote me, but that’s a trivial concern next to whether the code works or not (it does; it’s tested).

Perl SQL parser

Posted on March 6th, 2010 by Daniel Nichter

I know “great programmers reuse code”, or some adage like that, but I wrote my own SQL query parser in Perl for various, uninteresting reasons. My goals were:

  1. Single package/file without external dependencies (rules out SQL::Parser)
  2. MySQL-specific, i.e. parse MySQL’s “flexible” syntax
  3. Full FROM/JOIN clause and subquery parsing
  4. Parse nested clauses/statements/(sub)queries
  5. Parse everything into logical structure (logical to me at least)
  6. Extensively tested
  7. Handle 90% of cases

The result is SQLParser.pm (only available via that link, not on CPAN).

It successfully parses:

select
   now(),
   (select foo from bar where id=1)
from
   t1,
   t2
   join (select * from sqt1) as t3 using (`select`)
   join t4 on t4.id=t3.id
where
   c1 > any(
      select col2 as z from sqt2 zz where sqtc<(
         select max(col) from l where col<100
      )
   )
   and s in ("select", "tricky")
   or s <> "select"
group by 1
limit 10

(That syntax is valid but the query is a complete fabrication, i.e. not sane.) Apart from the MySQL source I don’t know of any other publicly available code that can successfully parse a SQL statement like that. Granted, my code has limitations, too, like CASE statements, but my goal was only 90% of cases (relative; what’s 90% for me may not be 90% for you).

The code was easier than I thought at first. Parsing FROM/JOIN clauses and subqueries are the biggest challenges. There’s a fair amount of code comments so you should be able to follow along and understand why and how I do things.

If you know of a better Perl SQL parser that meets my goals, please let me know!

Copyright © 2009 codenode. Theme by THAT Agency powered by WordPress.