Building a simple query parser using PEG in Go
In an earlier post, Simple Query Parser, we built a simple query parser using Participle, a parser builder for Go. Parsing expression grammar (PEG) is a type of grammar. The advantage of PEG is that it does not tolerate ambiguous grammar definitions, which makes it better at error reporting.
The Go port of PEG is pigeon. The popular JavaScript port of PEG is pegjs. A good introduction to PEG grammar can be found in the pegjs documentation and also here.
In this article, we will use pigeon to translate a .peg grammar file into .go source. Run the command below to get pigeon on your machine.
go get -u github.com/mna/pigeon
You can run the pigeon command to translate a .peg file to .go like this:
pigeon -o query.go query.peg
A simple query language
As mentioned in the previous article, the format of the simple query language is like this:
item.name=laptop item.spec.ram > 8gb item.spec.ssd=yes item.spec.ssd.capacity > 512gb sort_by:price/asc
This query is used to filter items for sale from an online store.
The first step
Let us first try to parse a simple query like this: something=another
Here is the .peg file for this simple query parser:
{
package qp1
type Field struct{
Key string
Value string
}
}
Field
= key:Identifier _ "=" _ value:Identifier {
return &Field{Key:key.(string), Value:value.(string)}, nil
}
_
= [ \t]*
Identifier
= [a-zA-Z0-9_]+ {
return string(c.text), nil
}
EOF
= !.
The first part of the .peg file between { }
braces is copied directly to the generated .go file. This is the place for any code initializations. In this case, the package definition and some structure definitions are added in this block.
As for grammar, let us start from the bottom. The grammar rule format is like this
rule
= rule definitions
The bottommost lines define end of file. The rule !. means: when there are no more characters to consume, it is the end of the file.
An identifier is defined as a sequence of alpha-numeric characters:
Identifier
= [a-zA-Z0-9_]+ {
return string(c.text), nil
}
This rule says that when such an identifier is found, its matched text (c.text) is converted to a string and returned.
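To see what this character class accepts, here is a small hand-written check. This is an illustration only; the generated parser does this matching for you, and isIdentifier is our own helper, not part of the article's code.

```go
package main

import "fmt"

// isIdentifier reports whether s matches [a-zA-Z0-9_]+,
// the same character class the Identifier rule uses.
func isIdentifier(s string) bool {
	if len(s) == 0 {
		return false
	}
	for _, r := range s {
		ok := r == '_' ||
			(r >= 'a' && r <= 'z') ||
			(r >= 'A' && r <= 'Z') ||
			(r >= '0' && r <= '9')
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isIdentifier("item_name")) // true
	fmt.Println(isIdentifier("item.name")) // false: '.' is not in the class
}
```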
This rule defines whitespace:
_
= [ \t]*
Yes, you can define it in a more verbose and clearer fashion too:
whitespace
= [ \t]*
The definition of Field shows how to define a complete grammar rule:
Field
= key:Identifier _ "=" _ value:Identifier
This rule says: capture the first part of the field, which matches the Identifier definition, into a variable key, and the second part after the = sign, also an identifier, into a variable value.
Field
= key:Identifier _ "=" _ value:Identifier {
return &Field{Key:key.(string), Value:value.(string)}, nil
}
The function body after the definition picks the parsed values, packages them into a Field
struct, and returns it.
Using the parser
There are two steps. First, generate the parser code in Go. For that, run:
pigeon -o query.go query.peg
Then write the code that uses the generated parser:
func main() {
res, err := qp1.Parse("", []byte("some = another"))
if err != nil {
fmt.Println(err)
return
}
result, err := json.MarshalIndent(res, "", " ")
if err != nil {
fmt.Println(err)
return
}
fmt.Printf("output:\n%v\n", string(result))
}
The output will look like this:
{
"Key": "some",
"Value": "another"
}
Repeating definitions
Let us now make a small update: the “key” will allow something similar to object member notation, like this:
item.spec.ssd=something
We need the “item.spec.ssd” part parsed out into a Source structure.
The Field and Source rules are now defined as:
Field
= src:Source _ "=" _ value:Identifier {
return &Field{Key:src.(*Source), Value:value.(string)}, nil
}
Source
= name:Identifier path:("." Identifier)* {
return makeSource(name,path)
}
Make a note of the Source grammar definition: an Identifier followed by ("." Identifier) groups that can repeat zero or more times (* means 0 or more repetitions). So the path is optional and can repeat if required.
Note that it calls a function makeSource, which extracts the parsed values into a Source struct. It needs a number of type assertions, since name and path are of type interface{}; we have to type-assert to extract the parameters.
type Field struct {
Key *Source
Value string
}
type Source struct {
Name string
Path []string
}
func makeSource(name interface{}, path interface{}) (*Source, error) {
// path is the ("." Identifier)* match: a slice of pairs,
// each holding the literal "." and the identifier string
ps := path.([]interface{})
paths := make([]string, 0)
for _, p := range ps {
pa := p.([]interface{})
px := pa[1:] // skip the "." literal at index 0
for _, pi := range px {
paths = append(paths, pi.(string))
}
}
return &Source{Name: name.(string), Path: paths}, nil
}
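To see the shape pigeon hands to makeSource, we can simulate it: each ("." Identifier) match arrives as an []interface{} pair holding the literal "." bytes and the identifier string. This is a standalone sketch for running outside the generated parser; the type and function match the article's code.

```go
package main

import "fmt"

type Source struct {
	Name string
	Path []string
}

func makeSource(name interface{}, path interface{}) (*Source, error) {
	ps := path.([]interface{})
	paths := make([]string, 0)
	for _, p := range ps {
		pa := p.([]interface{})
		px := pa[1:] // skip the "." literal
		for _, pi := range px {
			paths = append(paths, pi.(string))
		}
	}
	return &Source{Name: name.(string), Path: paths}, nil
}

func main() {
	// pigeon delivers ("." Identifier)* as a slice of ["." bytes, string] pairs
	path := []interface{}{
		[]interface{}{[]byte("."), "spec"},
		[]interface{}{[]byte("."), "ssd"},
	}
	src, _ := makeSource("item", path)
	fmt.Printf("%+v\n", *src) // {Name:item Path:[spec ssd]}
}
```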
When the query item.spec.ssd=another is parsed, this is the output structure:
{
"Key": {
"Name": "item",
"Path": [
"spec",
"ssd"
]
},
"Value": "another"
}
Different Value Types
Now we move on to parsing different types of values - quoted strings, unquoted strings, numbers, and booleans.
The grammar first:
Value
= val:(
Measure
/ Float
/ Integer
/ Identifier
/ String
){
return makeValue(val)
}
String
= '"' chars:[^"]* '"' {
return stringFromChars(chars), nil
}
Integer
= [+-]? [0-9]+ {
return strconv.ParseInt(string(c.text), 10, 64)
}
Measure
= number:(Float / Integer) unit:Identifier {
return makeMeasure(number, unit)
}
Float
= [+-]? ([0-9]* "." [0-9]+ ) {
return strconv.ParseFloat(string(c.text), 64)
}
Notice the definition of the String grammar. It extracts the characters between two double quotes. Extraction requires a series of type assertions to get from interface{} slices to bytes and finally to a string. This is the function that builds the string from the character sequence:
func stringFromChars(chars interface{}) string {
str := ""
r := chars.([]interface{})
for _, i := range r {
// each matched character arrives as a []byte slice
j := i.([]uint8)
str += string(j)
}
return str
}
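Written standalone, the same function can be exercised directly by feeding it the []interface{} slice of byte slices that pigeon produces for a [^"]* repetition:

```go
package main

import "fmt"

// stringFromChars rebuilds a string from the []interface{} slice
// pigeon produces for a character-class repetition; each element
// is the []byte for one matched character.
func stringFromChars(chars interface{}) string {
	str := ""
	for _, i := range chars.([]interface{}) {
		str += string(i.([]uint8))
	}
	return str
}

func main() {
	chars := []interface{}{[]uint8("l"), []uint8("c"), []uint8("d")}
	fmt.Println(stringFromChars(chars)) // lcd
}
```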
The Value structure has a member for each data type, like this:
type Value struct {
String string
Int int64
Float float64
Measure *Measure
}
Out of these, the Measure member is peculiar: it holds a number and its unit together, for example 512gb.
Better Generic Value
Having all the field types in one Value struct is not optimal; it takes more space and could be confusing to use. It doesn't provide any advantage either, because we will have to check the type of the value anyway. A better alternative is to use interface{} for Value. Let us update the definitions.
type Field struct {
Key *Source
Value interface{} // String / Int / Float / Measure
}
type Source struct {
Name string
Path []string
}
type Measure struct {
Number interface{} //int64/float64
Units string
}
Then the capturing functions are updated as well:
func makeValue(val interface{}) (interface{}, error) {
return val, nil
}
func makeMeasure(num interface{}, units interface{}) (*Measure, error) {
retVal := &Measure{Number: num, Units: units.(string)}
return retVal, nil
}
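With Value as interface{}, a consumer of the parse result distinguishes the concrete type with a type switch. A minimal sketch; the describe helper is our own illustration, not part of the article's code:

```go
package main

import "fmt"

type Measure struct {
	Number interface{} // int64 or float64
	Units  string
}

// describe is a hypothetical helper showing how a consumer
// would branch on the dynamic type of a parsed value.
func describe(v interface{}) string {
	switch t := v.(type) {
	case string:
		return fmt.Sprintf("string %q", t)
	case int64:
		return fmt.Sprintf("integer %d", t)
	case float64:
		return fmt.Sprintf("float %g", t)
	case *Measure:
		return fmt.Sprintf("measure %v%s", t.Number, t.Units)
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe(int64(512)))                               // integer 512
	fmt.Println(describe(&Measure{Number: int64(512), Units: "gb"})) // measure 512gb
}
```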
Supporting comparison operators
The next improvement is to support other comparison operators, like greater than and less than.
Field
= src:Source _ op:Operator _ value:Value {
return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
}
Operator
= op:(
"<="
/ ">="
/ "="
/ "<"
/ ">"
){
return string(c.text), nil
}
Notice that the longer rules come first: “<=” must come before “<”. If the “<” rule came first, the parser would capture “<” and stop there. This ordering helps avoid ambiguity.
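The effect of that ordering can be imitated with a first-match scan over prefixes, which is exactly how a PEG ordered choice commits. This is an illustration only; matchOperator is our own helper:

```go
package main

import (
	"fmt"
	"strings"
)

// matchOperator returns the first operator in ops that is a
// prefix of the input -- a PEG ordered choice commits the same way.
func matchOperator(input string, ops []string) string {
	for _, op := range ops {
		if strings.HasPrefix(input, op) {
			return op
		}
	}
	return ""
}

func main() {
	longestFirst := []string{"<=", ">=", "=", "<", ">"}
	shortestFirst := []string{"=", "<", ">", "<=", ">="}
	fmt.Println(matchOperator("<=8", longestFirst))  // <=
	fmt.Println(matchOperator("<=8", shortestFirst)) // < (stops too early)
}
```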
Combination operators
We can combine two queries using either AND or OR. For this simple query language, the default combination is AND.
item.name=laptop item.spec.ssd > 512
means: get only the items that are laptops AND whose SSD size is more than 512.
Since the default is AND, let us introduce an operator for OR. Here is a sample query:
item.name=laptop (item.maker=asus | item.maker=coconics)
Let’s first try parsing the OR sequence.
Making a few changes to the grammar makes it possible to parse a sequence of OR field queries.
Query
= f:Field _ oq:OrQuery* _ {
return makeQuery(f,oq)
}
OrQuery
= _ '|' _ f:Field _ {
return f, nil
}
Field
= src:Source _ op:Operator _ value:Value {
return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
}
Note that OrQuery* is the second part of the Query definition, so it will match zero or more | Field expressions. That means Query will match a simple query like item.name=laptop as well.
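The article does not list makeQuery; here is one plausible sketch, under the assumption that a Query simply collects the OR'd alternatives in order (the Query struct is ours, and Key is simplified to a plain string):

```go
package main

import "fmt"

type Field struct {
	Key   string // simplified: the article uses *Source here
	Op    string
	Value interface{}
}

// Query and makeQuery are assumptions; the article does not show
// them. A Query is modeled as the OR'd alternatives in order.
type Query struct {
	Or []*Field
}

func makeQuery(f interface{}, oq interface{}) (*Query, error) {
	q := &Query{Or: []*Field{f.(*Field)}}
	// oq is the OrQuery* match: a slice of the fields after each |
	for _, o := range oq.([]interface{}) {
		q.Or = append(q.Or, o.(*Field))
	}
	return q, nil
}

func main() {
	first := &Field{Key: "item.maker", Op: "=", Value: "asus"}
	rest := []interface{}{&Field{Key: "item.maker", Op: "=", Value: "coconics"}}
	q, _ := makeQuery(first, rest)
	fmt.Println(len(q.Or)) // 2
}
```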
Going Recursive
Consider a query like this:
item.name=laptop item.spec.ram > 8gb ( item.spec.ssd=yes | item.spec.ssd.capacity > 512gb) ( item.maker=asus | item.maker=coconics )
The queries inside the braces can be treated as sub-queries that are queries themselves. The braces can contain braces, just like an arithmetic expression. In order to accommodate queries like this, we have to define the grammar recursively. Let’s edit the grammar:
Query
= _ fq:FieldQuery _ fqs:FieldQuery* _ {
return makeQuery(fq, fqs)
}
FieldQuery
= _ '(' _ q:Query _ ')'_ {
return makeFQFromQuery(q)
}
/ _ f:Field _ {
return makeFQFromField(f)
}
Query is defined by FieldQuery and FieldQuery can be a Query inside braces.
In order to allow any number of Fields followed by any number of OR expressions, let us update the query and include an AndQuery grammar. Here is the complete grammar file:
Query
= aq:AndQuery _ oq:OrQuery* _ {
return makeQuery(aq,oq)
}
OrQuery
= _ '|' _ aq:AndQuery _ {
return aq, nil
}
AndQuery
= _ fq:FieldQuery _ fqs:FieldQuery* _ {
return makeAndQuery(fq, fqs)
}
FieldQuery
= _ '(' _ q:Query _ ')'_ {
return makeFQFromQuery(q)
}
/ _ f:Field _ {
return makeFQFromField(f)
}
Field
= src:Source _ op:Operator _ value:Value {
return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
}
Source
= name:Identifier path:("." Identifier)* {
return makeSource(name,path)
}
Operator
= op:(
"<="
/ ">="
/ "="
/ "<"
/ ">"
){
return string(c.text), nil
}
Value
= val:(
Measure
/ Float
/ Integer
/ Identifier
/ String
){
return makeValue(val)
}
String
= '"' chars:[^"]* '"' {
return stringFromChars(chars), nil
}
Integer
= [+-]? [0-9]+ {
return strconv.ParseInt(string(c.text), 10, 64)
}
Measure
= number:(Float / Integer) unit:Identifier {
return makeMeasure(number, unit)
}
Float
= [+-]? ([0-9]* "." [0-9]+ ) {
return strconv.ParseFloat(string(c.text), 64)
}
Identifier
= [a-zA-Z0-9_]+ {
return string(c.text), nil
}
_
= [ \t]* {
return "", nil
}
EOF
= !.
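The article leaves out the Go structs behind makeQuery, makeAndQuery, and the FieldQuery helpers. One plausible shape for the resulting tree is sketched below, together with a small walker that reports its nesting depth; every name here is an assumption, and the field is simplified to a string:

```go
package main

import "fmt"

// Assumed parse-tree shapes; the article does not show the
// structs behind its helper functions.
type Query struct {
	Or []*AndQuery // alternatives joined by |
}

type AndQuery struct {
	And []*FieldQuery // implicit-AND sequence
}

type FieldQuery struct {
	Sub   *Query // set when the FieldQuery is "(" Query ")"
	Field string // set when it is a plain field (simplified)
}

// depth returns how deeply parenthesised sub-queries nest.
func depth(q *Query) int {
	max := 1
	for _, aq := range q.Or {
		for _, fq := range aq.And {
			if fq.Sub != nil {
				if d := 1 + depth(fq.Sub); d > max {
					max = d
				}
			}
		}
	}
	return max
}

func main() {
	// models: ( a=1 | b=2 ) c=3  -- one level of nesting
	inner := &Query{Or: []*AndQuery{
		{And: []*FieldQuery{{Field: "a=1"}}},
		{And: []*FieldQuery{{Field: "b=2"}}},
	}}
	outer := &Query{Or: []*AndQuery{
		{And: []*FieldQuery{{Sub: inner}, {Field: "c=3"}}},
	}}
	fmt.Println(depth(outer)) // 2
}
```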
We can build an arbitrarily complex parser with a few lines of PEG grammar. Not only is it functional, it is also graceful in spotting errors.