Prasanth Janardhanan

Building a simple query parser using PEG in Go

In another post, Simple Query Parser, we built a simple query parser using Participle, a parser builder for Go. Parsing expression grammar (PEG) is another way of defining a grammar. The advantage of PEG is that it doesn’t tolerate ambiguous grammar definitions, and so it is better at error reporting.

The Go language port of PEG is pigeon. The popular JavaScript port of PEG is pegjs. A good introduction to PEG grammar can be found in the pegjs documentation and also here.

In this article, we will use pigeon to translate a .peg grammar file into .go source. Run the command below to get pigeon on your laptop:

go get -u github.com/mna/pigeon

You can run the pigeon command to translate a .peg file into a .go file like this:

pigeon -o query.go query.peg

A simple query language

As mentioned in the previous article, the format of the simple query language is like this:

item.name=laptop item.spec.ram > 8gb item.spec.ssd=yes item.spec.ssd.capacity > 512gb sort_by:price/asc

This query is used to filter items for sale from an online store.

The first step

Let us first try to parse a simple query like this: something=another

Here is the .peg file for this simple query parser:

{
package qp1

type Field struct{
  Key string
  Value string
}
   
}

Field
  = key:Identifier _ "=" _ value:Identifier {
      return &Field{Key:key.(string), Value:value.(string)}, nil
  }

_
  = [ \t]*
  
Identifier
  = [a-zA-Z0-9_]+ { 
      return string(c.text), nil
}
  
EOF 
  = !.

The first part of the .peg file, between { } braces, is copied directly into the generated .go file. This is the place for any initialization code. In this case, the package declaration and some struct definitions are placed in this block.

As for the grammar, let us start from the bottom. A grammar rule has this format:

rule
   = rule definitions

The bottommost lines define end of file (EOF): the rule !. matches when there are no more characters to consume, which marks the end of the input. An identifier is defined as a sequence of alphanumeric characters:

Identifier 
  = [a-zA-Z0-9_]+ { 
      return string(c.text), nil
}

This rule says that when such an identifier is found, take its matched text, convert it to a string, and return it.

This rule here defines whitespace.

_
  = [ \t]*

Yes, you can define it in a more verbose and clearer fashion too:

whitespace
  = [ \t]*

The definition of Field shows how to define a complete grammar rule:

Field
  = key:Identifier _ "=" _ value:Identifier 

This rule says: capture the first part of the field - the part matching the Identifier definition - into a variable key, and capture the second part - after the = sign, also an identifier - into a variable value.

Field
  = key:Identifier _ "=" _ value:Identifier {
      return &Field{Key:key.(string), Value:value.(string)}, nil
  }

The function body after the definition picks the parsed values, packages them into a Field struct, and returns it.

Using the parser

There are two steps. First, generate the parser code in Go. For that, run:

pigeon -o query.go query.peg
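If you prefer, this generation step can also be wired into go generate. This is an optional convenience, assuming pigeon is on your PATH; add a directive like this to one of your Go source files in the same directory as the grammar:

//go:generate pigeon -o query.go query.peg

Running go generate ./... will then regenerate the parser whenever the grammar changes.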

Then write the code that uses the generated parser (the import path of the generated package depends on your module layout):


package main

import (
	"encoding/json"
	"fmt"

	qp "your/module/path/qp1" // the generated parser package; adjust the path to your module layout
)

func main() {
	res, err := qp.Parse("", []byte("some = another"))
	if err != nil {
		fmt.Println(err)
		return
	}
	result, err := json.MarshalIndent(res, "", "   ")
	if err != nil {
		fmt.Println(err)
		return
	}

	fmt.Printf("output:\n%v\n", string(result))
}

The output will look like this:

{
   "Key": "some",
   "Value": "another"
}

Repeating definitions

Let us now make a small update - the “key” will allow an object-member-style notation like this:

item.spec.ssd=something

We need the “item.spec.ssd” part parsed out into a Source structure.

The Field and Source rules are now defined like this:

Field
  = src:Source _ "=" _ value:Identifier {
      return &Field{Key:src.(*Source), Value:value.(string)}, nil
  }

Source
  = name:Identifier path:("." Identifier)* {
      return makeSource(name,path)
  }

Make a note of the Source grammar definition: an Identifier followed by .Identifier repeated zero or more times (* means zero or more repetitions). So the path part is optional and can repeat as required.

Note that it calls a function makeSource, which extracts the parsed values into a Source struct. It needs a number of type assertions, since name and path arrive as interface{}; we have to assert the concrete types and extract the parameters.

type Field struct {
	Key   *Source
	Value string
}

type Source struct {
	Name string
	Path []string
}

func makeSource(name interface{}, path interface{}) (*Source, error) {
	// path is the result of ("." Identifier)*: one entry per repetition.
	ps := path.([]interface{})

	paths := make([]string, 0)
	for _, p := range ps {
		// Each repetition is itself a slice: the "." literal followed by
		// the Identifier string. Skip the literal and keep the identifier.
		pa := p.([]interface{})
		px := pa[1:]
		for _, pi := range px {
			paths = append(paths, pi.(string))
		}
	}

	return &Source{Name: name.(string), Path: paths}, nil
}

When the query item.spec.ssd=another is parsed, this is the output structure:

{
   "Key": {
      "Name": "item",
      "Path": [
         "spec",
         "ssd"
      ]
   },
   "Value": "another"
}

Different Value Types

Now we move on to parsing different types of values - quoted strings, unquoted strings, integers, floats, and measures such as 8gb.

The grammar first:

Value
  = val:(
      Measure
    / Float
    / Integer
    / Identifier
    / String 
    ){
    return makeValue(val)
  }

String
  = '"' chars:[^"]* '"' {
    return stringFromChars(chars), nil
  } 

Integer
  = [+-]? [0-9]+ {
    return strconv.ParseInt(string(c.text), 10, 64)
  }
  
Measure
  = number:(Float / Integer) unit:Identifier {
    return makeMeasure(number, unit)
  }
  
Float
  = [+-]? ([0-9]* "." [0-9]+ ) {
      return strconv.ParseFloat(string(c.text), 64)
    }
    

Notice the definition of the String grammar rule. It extracts the characters between two double quotes. The extraction requires a series of type assertions to dig through interface{} slices down to bytes and finally a string. This is the function that converts the character sequence to a string:

func stringFromChars(chars interface{}) string {
	// chars is the result of [^"]*: one entry per matched character,
	// each entry holding the matched bytes as a []uint8.
	str := ""
	r := chars.([]interface{})
	for _, i := range r {
		j := i.([]uint8)
		str += string(j[0])
	}
	return str
}

The Value structure has a member for each data type, like this:

type Value struct {
	String  string
	Int     int64
	Float   float64
	Measure *Measure
}

Out of these, the Measure member is peculiar: it holds a number and its unit together, for example 512gb.

Better Generic Value

Having all the value types in one Value struct is not optimal; it takes more space and could be confusing to use. It doesn’t provide any advantage either, because we will have to check the type of the value anyway. A better alternative is to use interface{} for Value (a type-switch sketch follows the updated capturing functions below). Let us update the definitions.

type Field struct {
	Key   *Source
	Value interface{} // String / Int /Float /Measure
}

type Source struct {
	Name string
	Path []string
}

type Measure struct {
	Number interface{} //int64/float64
	Units  string
}

Then the capturing functions are updated as well:

func makeValue(val interface{}) (interface{}, error) {
	return val, nil
}

func makeMeasure(num interface{}, units interface{}) (*Measure, error) {
	retVal := &Measure{Number: num, Units: units.(string)}

	return retVal, nil
}
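Because Value is now an interface{}, code that consumes a parsed Field has to check the concrete type. Here is a small hypothetical sketch of how a caller might branch on it (describeValue is not part of the parser and assumes the fmt package is imported):

func describeValue(f *Field) string {
	switch v := f.Value.(type) {
	case string:
		return fmt.Sprintf("string %q", v)
	case int64:
		return fmt.Sprintf("integer %d", v)
	case float64:
		return fmt.Sprintf("float %g", v)
	case *Measure:
		return fmt.Sprintf("measure %v %s", v.Number, v.Units)
	default:
		return fmt.Sprintf("unexpected type %T", v)
	}
}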

Supporting comparison operators

The next improvement we could attempt is to support other comparison operators, such as greater-than and less-than.

Field
  = src:Source _ op:Operator _ value:Value {
      return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
  }
  
Operator
  = op:(
     "<="
    / ">="
    / "="
    / "<"
    / ">"
  ){
    return string(c.text), nil
  }

Notice that the longer rules come first; “<=” must come before “<”. If the “<” rule came first, the parser would match “<” and stop there, leaving the “=” unconsumed. Ordering the alternatives this way helps keep the grammar unambiguous.
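Since the Field rule now captures an operator, the Field struct needs an Op member. The original snippets don’t show this update; a minimal sketch, assuming Op is stored as a plain string:

type Field struct {
	Key   *Source
	Op    string      // "=", "<", ">", "<=", ">="
	Value interface{} // string / int64 / float64 / *Measure
}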

Combination operators

We can combine two queries using either AND or OR. For this simple query language, the default combinator is AND.

item.name=laptop item.spec.ssd > 512

means: get only the items that are laptops AND whose SSD size is more than 512.

Since the default is AND, let us introduce an operator for OR. Here is a sample query:

item.name=laptop  (item.maker=asus | item.maker=coconics)

Let’s first try parsing the OR sequence.

Making a few changes to the grammar makes it possible to parse a sequence of OR field queries.

Query
  = f:Field _ oq:OrQuery* _ {
    return makeQuery(f,oq)
  }
  
OrQuery 
  = _ '|' _ f:Field _ {
    return f, nil
  }

Field
  = src:Source _ op:Operator _ value:Value {
      return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
  }

Note that OrQuery* is the second part of the Query definition, so it matches zero or more | Field expressions. That means Query will also match a simple query like item.name=laptop. The rule hands the captured fields to a helper, makeQuery; a sketch of it follows.
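The original article does not show makeQuery, so here is a minimal hypothetical sketch for this stage of the grammar: a Query is simply the first Field plus the Fields captured by the OrQuery* repetition (it will be revised again once AndQuery is introduced below):

// Query, at this stage, is a list of fields joined by "|".
type Query struct {
	Or []*Field
}

func makeQuery(f interface{}, oq interface{}) (*Query, error) {
	q := &Query{Or: []*Field{f.(*Field)}}
	// oq is the result of OrQuery*: one entry per "| Field" match.
	for _, o := range oq.([]interface{}) {
		q.Or = append(q.Or, o.(*Field))
	}
	return q, nil
}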

Going Recursive

Consider a query like this:

item.name=laptop item.spec.ram > 8gb  ( item.spec.ssd=yes | item.spec.ssd.capacity > 512gb)  ( item.maker=asus | item.maker=coconics ) 

The queries inside the parentheses can be treated as sub-queries that are full queries themselves, and the parentheses can nest, just like in an arithmetic expression. To accommodate a query like this, we have to define the grammar recursively. Let’s edit the grammar:

Query
  = _ fq:FieldQuery _  fqs:FieldQuery* _ {
    return makeQuery(fq, fqs)
  }
  
FieldQuery
   = _ '(' _ q:Query _ ')'_ {
        return makeFQFromQuery(q)
      }
   / _ f:Field _ {
        return makeFQFromField(f)
     }

Query is defined in terms of FieldQuery, and a FieldQuery can itself be a Query inside parentheses. A sketch of the supporting FieldQuery type and its helpers follows.
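The helpers makeFQFromQuery and makeFQFromField are not shown in the original either; a minimal hypothetical sketch, assuming a FieldQuery holds either a plain field comparison or a nested query:

// FieldQuery is either a single field comparison or a parenthesized sub-query.
type FieldQuery struct {
	Field *Field
	Query *Query
}

func makeFQFromQuery(q interface{}) (*FieldQuery, error) {
	return &FieldQuery{Query: q.(*Query)}, nil
}

func makeFQFromField(f interface{}) (*FieldQuery, error) {
	return &FieldQuery{Field: f.(*Field)}, nil
}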

In order to allow any number of fields combined with any number of OR alternatives, let us update the Query rule and add an AndQuery rule. Here is the complete grammar file:


Query
  = aq:AndQuery _ oq:OrQuery* _ {
    return makeQuery(aq,oq)
  }

OrQuery 
  = _ '|' _ aq:AndQuery _ {
    return aq, nil
  }

AndQuery
  = _ fq:FieldQuery _  fqs:FieldQuery* _ {
    return makeAndQuery(fq, fqs)
  }
  
FieldQuery
   = _ '(' _ q:Query _ ')'_ {
        return makeFQFromQuery(q)
      }
   / _ f:Field _ {
        return makeFQFromField(f)
     }

Field
  = src:Source _ op:Operator _ value:Value {
      return &Field{Key:src.(*Source),Op:op.(string), Value:value}, nil
  }

Source
  = name:Identifier path:("." Identifier)* {
      return makeSource(name,path)
  }

Operator
  = op:(
     "<="
    / ">="
    / "="
    / "<"
    / ">"
  ){
    return string(c.text), nil
  }
  
Value
  = val:(
      Measure
    / Float
    / Integer
    / Identifier
    / String 
    ){
    return makeValue(val)
  }

String
  = '"' chars:[^"]* '"' {
    return stringFromChars(chars), nil
  } 

Integer
  = [+-]? [0-9]+ {
    return strconv.ParseInt(string(c.text), 10, 64)
  }
  
Measure
  = number:(Float / Integer) unit:Identifier {
    return makeMeasure(number, unit)
  }
  
Float
  = [+-]? ([0-9]* "." [0-9]+ ) {
      return strconv.ParseFloat(string(c.text), 64)
    }

Identifier
  = [a-zA-Z0-9_]+ { 
      return string(c.text), nil
}

_
  = [ \t]* { 
      return "", nil
 }
  
EOF 
  = !.
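The complete grammar relies on makeQuery and makeAndQuery, which are not listed in the original either. A hypothetical sketch that matches the final rules - a Query is a list of AndQuery alternatives joined by |, and an AndQuery is a list of FieldQuery terms that are implicitly AND-ed:

// Query is a list of AndQuery alternatives joined by "|".
type Query struct {
	Or []*AndQuery
}

// AndQuery is a list of FieldQuery terms that must all match.
type AndQuery struct {
	And []*FieldQuery
}

func makeQuery(aq interface{}, oq interface{}) (*Query, error) {
	q := &Query{Or: []*AndQuery{aq.(*AndQuery)}}
	// oq is the result of OrQuery*: one entry per "| AndQuery" match.
	for _, o := range oq.([]interface{}) {
		q.Or = append(q.Or, o.(*AndQuery))
	}
	return q, nil
}

func makeAndQuery(fq interface{}, fqs interface{}) (*AndQuery, error) {
	a := &AndQuery{And: []*FieldQuery{fq.(*FieldQuery)}}
	// fqs is the result of FieldQuery*: the remaining AND-ed terms.
	for _, f := range fqs.([]interface{}) {
		a.And = append(a.And, f.(*FieldQuery))
	}
	return a, nil
}

With these in place, parsing the complex example query from the beginning of this section yields a nested tree of Query, AndQuery, and FieldQuery values that mirrors the parenthesized structure.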
  

We can build an arbitrarily complex parser with just a few lines of PEG grammar. Not only is it functional, it is also graceful in spotting errors.