Inside a Query Engine (Part 4): Expressions and Precedence
Table of contents
- Why AND and OR are Complicated
- The Building Blocks: A Simple Starting Point
- Understanding the Baseline Parser
- Step 1: Adding AND
- Overview of the code
- Step 2: Introducing OR and Precedence
- How to Think About Precedence
- Walkthrough: The Precedence Ladder
- Deriving the Recursive AST
- Step 1: The AND Component
- Step 2: The OR Component
- Step 3: The Substitution
- The Breakthrough: Why Recursion?
- Step 3: Adding Parentheses
- The Challenge: The Unit Mismatch
- A Failed Attempt: The Naive Fix
- The Breakthrough: Defining a “Primary” Unit
- The Recursive Magic
- From Grammar to Code: The Implementation
- How to Think About This Code
- Conclusion
- References
In the previous parts of this series, we built a Handwritten Parser capable of understanding basic SELECT statements. But real-world queries are rarely that simple. Users want to filter data using complex logic: WHERE (age > 25 OR salary > 50000) AND status = 'active'.
Supporting this requires us to tackle the most challenging part of parsing: Expressions, Precedence, and Parentheses.
Why AND and OR are Complicated
At first glance, adding AND and OR seems like just adding more keywords. However, they introduce the problem of precedence. Just like in math, where 2 + 3 * 4 is 14 (not 20), in SQL, AND usually has higher precedence than OR.
If we don’t encode this precedence into our grammar, the parser won’t know whether A OR B AND C means (A OR B) AND C or A OR (B AND C).
The Building Blocks: A Simple Starting Point
Before we dive into expressions, let’s look at the basic grammar for a SELECT query. This is our foundation:
select
= "SELECT" projection "FROM" identifier [where] [";"] ;
projection
= "*"
| identifier ("," identifier)* ;
where
= "WHERE" clause ;
clause
= identifier operator literal ;
operator
= "=" | ">" | "<" | ">=" | "<=" | "!=" ;
identifier
= IDENTIFIER ;
literal
= LITERAL ;
In this version, the WHERE clause is extremely limited. It only supports a single comparison (e.g., id = 1).
Here is how the implementation looks:
fn parse_select(&mut self) -> Result<Ast, ParseError> {
self.expect_keyword("select")?;
let projection = self.expect_projection()?;
self.expect_keyword("from")?;
let table_name = self.expect_identifier()?;
let where_clause = self.maybe_where_clause()?;
let order_by = self.maybe_order_by()?;
let limit = self.maybe_limit()?;
let _ = self.eat_if(|token| token.is_semicolon());
Ok(Ast::Select {
table_name: table_name.to_string(),
projection,
where_clause,
order_by,
limit,
})
}
fn maybe_where_clause(&mut self) -> Result<Option<WhereClause>, ParseError> {
let is_where_clause = self.eat_if(|token| token.is_keyword("where"));
if is_where_clause {
let where_clause = self.expect_clause()?;
return Ok(Some(where_clause));
}
Ok(None)
}
fn expect_clause(&mut self) -> Result<WhereClause, ParseError> {
let column_name = self.expect_identifier()?;
let operator = self.expect_operator()?;
let literal = self.expect_literal()?;
Ok(WhereClause::Comparison {
column_name,
operator,
literal,
})
}
fn expect_operator(&mut self) -> Result<Operator, ParseError> {
match self.cursor.next() {
Some(token) => Operator::from_token(token),
None => Err(ParseError::UnexpectedEndOfInput),
}
}
fn expect_literal(&mut self) -> Result<Literal, ParseError> {
match self.cursor.next() {
Some(token) => Literal::from_token(token),
None => Err(ParseError::UnexpectedEndOfInput),
}
}
Understanding the Baseline Parser
This code reflects a direct translation of our simple grammar. Each function, like expect_projection or expect_keyword, corresponds to a specific rule.
At this stage:
parse_selectacts as the orchestrator, calling sub-parsers in the exact sequence expected by the SQL syntax.maybe_where_clauseshows the “Maybe” pattern, it checks for theWHEREkeyword and, if missing, returnsOk(None)instead of erroring.expect_clauseis completely linear. It expects exactly one identifier, one operator, and one literal. There is no concept of nesting or grouping yet.
The code for the complete parser corresponding to the above grammar is here. (Note: This was the Parser during the early stages of development.)
Step 1: Adding AND
Let’s expand the WHERE clause to support multiple conditions joined by AND. Surprisingly, the grammar remains relatively simple and linear:
where
= "WHERE" clause ("AND" clause)*
clause
= identifier operator literal
operator
= "=" | ">" | "<" | ">=" | "<=" | "!=" | "LIKE"
Here, WHERE starts with one clause, followed by any number (zero or more) of "AND" clause pairs. This works well for a flat list of conditions, but it doesn’t allow for grouping or alternative logic.
Overview of the code
At this stage, the parser handles AND logic using a simple “collect” loop. It first expects a single condition and then iteratively looks for more AND <condition> pairs.
fn maybe_where_clause(&mut self) -> Result<Option<WhereClause>, ParseError> {
let is_where_clause = self.eat_if(|token| token.is_keyword("where"));
if is_where_clause {
let conditions = self.expect_conditions()?;
return if conditions.len() == 1 {
Ok(Some(WhereClause::Single(
conditions.into_iter().next().unwrap(),
)))
} else {
Ok(Some(WhereClause::And(conditions)))
};
}
Ok(None)
}
fn expect_conditions(&mut self) -> Result<Vec<Condition>, ParseError> {
let condition = self.expect_condition()?;
let mut conditions = Vec::new();
conditions.push(condition);
while self.eat_if(|token| token.matches(TokenType::Keyword, "and")) {
let condition = self.expect_condition()?;
conditions.push(condition);
}
Ok(conditions)
}
fn expect_condition(&mut self) -> Result<Condition, ParseError> {
let column_name = self.expect_identifier()?;
let operator = self.expect_operator()?;
let literal = self.expect_literal()?;
match operator {
BinaryOperator::Like => Ok(Condition::like(&column_name, literal)),
_ => Ok(Condition::comparison(&column_name, operator, literal)),
}
}
In our Rust AST, we model this using a specialized enum:
pub(crate) enum WhereClause {
Single(Condition),
And(Vec<Condition>),
}
#[derive(Debug, Eq, PartialEq)]
pub(crate) enum Condition {
Comparison {
column_name: String,
operator: BinaryOperator,
literal: Literal,
},
Like {
column_name: String,
literal: Literal,
},
}
Introduction of AND was all about iteratively looking for more AND clauses.
Step 2: Introducing OR and Precedence
When we add OR, things get interesting. If we were to naively define WHERE as:
(* BAD GRAMMAR *)
where = clause (("AND" | "OR") clause)*
We’ve created an ambiguous grammar. This grammar doesn’t tell the parser which operation to perform first. In a recursive descent parser, we solve this by tiering our rules.
How to Think About Precedence
The rule is simple: The rule representing the lower-precedence operator delegates to the rule representing the higher-precedence operator.
- Expression (OR): The entry point for logic. It handles
ORoperations and callsand_expression. - And Expression: Handles
ANDoperations. - Clause: The actual comparison (
id = 1).
Here is the grammar that correctly encodes that AND > OR:
where
= "WHERE" expression ;
(* Encodes precedence (AND > OR) *)
expression
= or_expression ;
or_expression
= and_expression ("OR" and_expression)* ;
and_expression
= clause ("AND" clause)* ;
clause
= identifier operator literal
By structuring it this way, the parser is forced to “group” AND operations together before it even considers an OR. This naturally produces an AST that respects the correct order of operations.
Walkthrough: The Precedence Ladder
How does this grammar handle name = 'relop' OR language = 'rust' AND type = 'query-parsing'? Instead of a flat list, let’s look at how the recursion “scoops up” tokens based on the precedence rules:
QUERY: name = 'relop' OR language = 'rust' AND type = 'query-parsing'
1. expect_or_expression (ENTRY POINT)
└── calls expect_and_expression (Left Side)
└── matches "name = 'relop'"
└── does not see "AND", goes up
2. expect_or_expression sees the "OR" keyword
└── calls expect_and_expression (Right Side)
├── matches "language = 'rust'"
├── sees "AND" keyword (HIGHER PRECEDENCE!)
└── matches "type = 'query-parsing'"
└── Result: AND_NODE(lang='rust', type='parsing')
3. expect_or_expression combines them:
└── FINAL AST: OR_NODE(
SINGLE(name='relop'),
AND_NODE(lang='rust', type='parsing')
)
The hierarchy forces the AND to finish its work (grouping the neighbor on the right) before the OR even gets a chance to see the tokens. This is how the parser “knows” to bind AND more tightly.
Deriving the Recursive AST
How do we model this hierarchy in our AST? Let’s try to derive the structure by thinking through a complex query: name = 'relop' AND language = 'rust' OR status = 'active'
Step 1: The AND Component
First, we group the high-precedence AND conditions. This looks like a single logical unit:
AND (
name = 'relop',
language = 'rust'
)
Step 2: The OR Component
Next, we think about the OR operation. It joins our previous AND group with another condition:
OR (
[The AND group from above],
status = 'active'
)
Step 3: The Substitution
When we substitute the AND group back into the OR group, we see the true structure:
OR (
AND (
name = 'relop',
language = 'rust'
),
status = 'active'
)
The Breakthrough: Why Recursion?
This substitution shows us that an OR node must be able to hold an AND node. But if we introduce parentheses, like (A OR B) AND C, then an AND node would need to hold an OR node!
This “circular” requirement, where AND can contain OR, and OR can contain AND, tells us that we cannot use fixed types like Vec<Condition>. Both must instead hold a unified, Recursive Expression:
#[derive(Debug, Eq, PartialEq)]
pub(crate) struct WhereClause(pub(crate) Expression);
pub(crate) enum Expression {
/// A base comparison (e.g., id = 1)
Single(Clause),
/// A group of expressions joined by AND
And(Vec<Expression>),
/// A group of expressions joined by OR
Or(Vec<Expression>),
}
This recursive structure allows arbitrary nesting depth. An And node can now hold a Single(A) and an Or node, which in turn holds Single(B) and Single(C).
For example:
name = 'relop' OR language = 'rust' AND type = 'query-parsing'
Produces the following tree:
Notice how the recursive structure naturally reflects precedence: the AND expression is constructed first and then becomes a child of the OR.
Step 3: Adding Parentheses
Precedence rules provide a “default” behavior, but users need a way to override them. They need parentheses. Let’s try to derive how to support (language = 'rust' OR language = 'go') AND status = 'active' step-by-step:
The Challenge: The Unit Mismatch
Our current and_expression expects to join simple clauses: and_expression = clause ("AND" clause)*
But in (A OR B) AND C, the left side of the AND is not a clause, it is an entire expression nested inside brackets.
A Failed Attempt: The Naive Fix
What if we just allowed optional brackets around a clause?
(* WRONG *)
and_expression = (clause | "(" clause ")") ("AND" (clause | "(" clause ")"))*
This fails immediately. It only allows us to wrap a single comparison (like (id=1)). It wouldn’t allow (A OR B) because an OR operation is not a clause.
The Breakthrough: Defining a “Primary” Unit
We need a unit that can stand on its own as the base of an AND operation. This unit must be flexible enough to be “either a simple comparison OR a portal back to the start.”
We call this the Primary Expression:
- If it’s a simple comparison, match a
clause. - If it starts with
(, match a fullexpressionand then expect a).
This leads us to the final, tiered grammar:
expression
= or_expression ;
or_expression
= and_expression ("OR" and_expression)* ;
and_expression
= primary_expression ("AND" primary_expression)* ;
primary_expression
= clause
| "(" expression ")" ;
The Recursive Magic
Notice the recursive loop: expression -> or_expression -> and_expression -> primary_expression -> expression.
When the parser is deep inside the precedence ladder and sees a (, it uses primary_expression to “reset” the logic. It jumps back to the very top, evaluates the grouped expression, and then returns to where it left off.
From Grammar to Code: The Implementation
In Relop, the implementation of this recursive logic is surprisingly clean. The updated AST for SELECT is:
#[derive(Debug)]
pub(crate) enum Ast {
/// Represents a `SELECT` statement.
Select {
/// The source to select from (table or join).
source: TableSource,
/// The projection (columns or all) to select.
projection: Projection,
/// The WHERE filter criteria.
where_clause: Option<WhereClause>,
/// The ORDER BY clause, defining the columns and directions used to order rows.
order_by: Option<Vec<OrderingKey>>,
/// The LIMIT (max records) to return.
limit: Option<usize>,
},
}
#[derive(Debug, Eq, PartialEq)]
pub(crate) enum TableSource {
Table {
name: String,
alias: Option<String>,
},
...
}
/// `WhereClause` represents the filtering criteria in a SELECT statement.
#[derive(Debug, Eq, PartialEq)]
pub(crate) struct WhereClause(pub(crate) Expression);
#[derive(Debug, Eq, PartialEq)]
pub(crate) enum Expression {
Single(Clause),
And(Vec<Expression>),
Or(Vec<Expression>),
Grouped(Box<Expression>),
}
Here is a deconstructed look at how the parser handles expressions:
impl Parser {
fn expect_expression(&mut self) -> Result<Expression, ParseError> {
self.expect_or_expression()
}
fn expect_or_expression(&mut self) -> Result<Expression, ParseError> {
let expr = self.expect_and_expression()?;
let mut expressions = vec![expr];
while self.eat_if(|token| token.is_keyword("or")) {
expressions.push(self.expect_and_expression()?);
}
if expressions.len() > 1 {
Ok(Expression::or(expressions))
} else {
Ok(expressions.into_iter().next().unwrap())
}
}
fn expect_and_expression(&mut self) -> Result<Expression, ParseError> {
let expr = self.expect_primary_expression()?;
let mut expressions = vec![expr];
while self.eat_if(|token| token.is_keyword("and")) {
expressions.push(self.expect_primary_expression()?);
}
if expressions.len() > 1 {
Ok(Expression::and(expressions))
} else {
Ok(expressions.into_iter().next().unwrap())
}
}
fn expect_primary_expression(&mut self) -> Result<Expression, ParseError> {
if self.eat_if(|token| token.is_left_parentheses()) {
let expr = self.expect_expression()?;
self.expect_right_parentheses()?;
Ok(Expression::grouped(expr))
} else {
Ok(Expression::single(self.expect_clause()?))
}
}
}
How to Think About This Code
Let’s break down the transformation from grammar to code step-by-step:
- Direct Mapping: Notice how each grammar rule (
expression,or_expression,and_expression,primary_expression) has a direct 1-to-1 equivalent in Rust functions. - The Call Hierarchy:
expect_or_expressionstarts by callingexpect_and_expression. This is the core of implicit precedence. The parser must satisfy a potentialANDbranch before it even considers anOR. - The Collection Loop: Both
ANDandORmethods use awhileloop to collect multiple operands.- It starts with one expression.
- It looks for the operator keyword.
- If found, it calls the next highest tier rule and adds the result to the list.
- Tree Simplification: The
if expressions.len() > 1check is a crucial optimization. If only one operand is found (e.g.,id=1), we return it directly rather than wrapping it in a redundantExpression::And,Expression::Ornode. - The Recursive Reset: The
expect_primary_expressionmethod contains the “magic.” When it sees a(, it callsexpect_expression(the top of the ladder). This recursion allows the entire precedence structure to be re-evaluated inside the parentheses.
Conclusion
Handling expressions and precedence is often the most intricate part of building a parser. By encoding precedence directly into the grammar rules and leveraging recursion, we can transform complex nested strings into a perfectly structured tree.
With expressions handled, our parser is now powerful enough to represent almost any logical query. In the next part, we will move beyond syntax and explore the Logical Plan: how the engine transforms this Abstract Syntax Tree into a series of relational operations.
References
- Crafting Interpreters
- Chapters 4,5,6