Prasanth Janardhanan

Let's build a search query parser in Go

Mini languages are great. Makes it easy to express what you want in a concise manner. Sometimes a complex UI can be replaced with a mini-language. Some time back, google used to support a simple language in the search queries. For example: “some phrase” +required -not_required . Alas! they stopped it and Google search is less cool ever since. I would count regular expressions also as a mini-language.

Imagine we are building an online store that allows searching for products using a simple but structured query language. You can search for a laptop with more than 8GB RAM and more than 512GB SSD using this query:

item.name=laptop item.spec.ram > 8gb item.spec.ssd=yes item.spec.ssd.capacity > 512gb sort_by:price/asc

The first simplest solution that comes to mind would be to use the strings.Split() Sure, the first version can be done using a set of Split() calls. However, the users of the online shop loved this feature and soon you are asked to support queries such as this:

item.name=laptop (item.maker=asus | item.maker=coconics) item.spec.ssd=yes

Let us build a mini-compiler for parsing such a query.

Building a compiler in Go

There are many options to get you started.

participle makes it all simple - just by defining the structs for the parser, you can create the compiler. Should be well suited for a simple query language that we are building.

Starting simple - simple key-value pairs

Let us make an MVP(Minimum Viable Product or MiniViaPro) version of the parser first.

This should parse queries where the key and value are separated by equal to sign

package v1

import (
	participle "github.com/alecthomas/participle/v2"
)

type Query struct {
	Fields []*Field `@@*`
}

type Field struct {
	Key   string `@Ident "="`
	Value *Value `@@`
}

type Value struct {
	Number *float64 `  @Float | @Int`
	String *string  `| @String`
	Bool   *bool    `| ( @"true" | "false" )`
	Nil    bool     `| @"nil"`
}

var parser = participle.MustBuild(&Query{})

func ParseQuery(q string) (*Query, error) {
	var expr Query

	err := parser.ParseString("", q, &expr)
	if err != nil {
		return nil, err
	}

	return &expr, nil
}

What does the structure definition mean?

Let us try parsing this query:

a=100 b="something"

The returned query structure is like this:

 {
   "Fields": [
      {
         "Key": "a",
         "Value": {
            "Number": 100,
            "String": null,
            "Bool": null,
            "Nil": false
         }
      },
      {
         "Key": "b",
         "Value": {
            "Number": null,
            "String": "\"something\"",
            "Bool": null,
            "Nil": false
         }
      }
   ]
}

Not bad for a five-minute code!

Notice the string value “something” ? It contains the quotes as well. It is easy to get rid of that by updating the options:

var parser = participle.MustBuild(&Query{},
	participle.Unquote("String"),
)

Let us expand this grammar to include other operators such as greater than and less than.


type Field struct {
	Key   string `@Ident`
	Op    string `@("=" | "<" | ">" | "<" "=" | ">" "=" )`
	Value *Value `@@`
}

Added an Op (for operator) member to the Field. The operator can be < or > or <= or >=

Let us now try with this query:

item="laptop" price <= 10000 memory > 512 

But, this throws an error like this:

Error 1:22: unexpected token "=" (expected Value)

Why so? because when the parser see < symbol, it matches with the first rule and it does not match the second one.

This will work as expected :


type Field struct {
	Key   string `@Ident`
	Op    string `@("=" | "<" "=" | "<" | ">" "=" | ">" )`
	Value *Value `@@`
}

The resulting AST for our previous query:

{
   "Fields": [
      {
         "Key": "item",
         "Op": "=",
         "Value": {
            "Number": null,
            "String": "laptop",
            "Bool": null,
            "Nil": false
         }
      },
      {
         "Key": "price",
         "Op": "\u003c=",
         "Value": {
            "Number": 10000,
            "String": null,
            "Bool": null,
            "Nil": false
         }
      },
      {
         "Key": "memory",
         "Op": "\u003e",
         "Value": {
            "Number": 512,
            "String": null,
            "Bool": null,
            "Nil": false
         }
      }
   ]
}

Make the key name as a JSON path

In our original query spec, we mentioned the item name as item.name and item price as item.price. So we have to follow the spec. Let us modify the grammar

type Field struct {
	Source *Source `@@`
	Op     string  `@("=" | "<" "=" | "<" | ">" "=" | ">"  )`
	Value  *Value  `@@`
}

type Source struct {
	Name string   `@Ident`
	Path []string `("." @Ident)*`
}

The Source has a name and then the path to the attribute. The path can contain 0 or more path parts.

Let’s try the query:

item.name="laptop" item.price <= 10000 item.spec.memory > 512 

and we get the AST as:

{
   "Fields": [
      {
         "Source": {
            "Name": "item",
            "Path": [
               "name"
            ]
         },
         "Op": "=",
         "Value": {
            "Number": null,
            "String": "laptop",
            "Bool": null,
            "Nil": false
         }
      },
      {
         "Source": {
            "Name": "item",
            "Path": [
               "price"
            ]
         },
         "Op": "\u003c=",
         "Value": {
            "Number": 10000,
            "String": null,
            "Bool": null,
            "Nil": false
         }
      },
      {
         "Source": {
            "Name": "item",
            "Path": [
               "spec",
               "memory"
            ]
         },
         "Op": "\u003e",
         "Value": {
            "Number": 512,
            "String": null,
            "Bool": null,
            "Nil": false
         }
      }
   ]
}

String without quotes

One simple improvement we can do at this point is to make the quotes optional for the string. For example, item.name=laptop instead of item.name=“laptop”. Let us try using the @Ident (identifier) token for this purpose

type Value struct {
	Number *float64 `  @Float | @Int`
	String *string  `| @String | @Ident `
	Bool   *bool    `| ( @"true" | "false" )`
	Nil    bool     `| @"nil"`
}

Custom lexical grammar

So far we used the built-in lexical tokens. We can define our own lexical grammar. This may make the parser definition a bit more cleaner. We will move the comparison operators to the lexical grammar.

Here is the custom lexical grammar definition:

var queryLexer = lexer.Must(stateful.NewSimple([]stateful.Rule{
	{"Ident", `[a-zA-Z]\w*`, nil},
	{"Number", `[-+]?(\d*\.)?\d+`, nil},
	{"Comparison", `[=]|[<>]=?`, nil},
	{"Dot", `\.`, nil},
	{"String", `(?:[a-zA-Z0-9_\-\.]+)|(?:\"(?:[^\"]|\\.)*\")`, nil},
	{"Whitespace", `[ \t]+`, nil},
}))

The lexical grammar is nothing but a set of regular expressions. We name each type of token. Then we use the names in the parser grammar definition.

Now we update the parser grammar definitions to reach the final code :

package v4

import (
	participle "github.com/alecthomas/participle/v2"
	"github.com/alecthomas/participle/v2/lexer"
	"github.com/alecthomas/participle/v2/lexer/stateful"
)

type Query struct {
	Fields []*Field `@@*`
}

type Field struct {
	Source *Source `@@`
	Op     string  `@Comparison`
	Value  *Value  `@@`
}

type Source struct {
	Name string   `@Ident`
	Path []string `(Dot @Ident)*`
}

type Value struct {
	Number *float64 `  @Number`
	String *string  `| @String | @Ident`
	Bool   *bool    `| ( @"true" | "false" )`
	Nil    bool     `| @"nil"`
}

var queryLexer = lexer.Must(stateful.NewSimple([]stateful.Rule{
	{"Ident", `[a-zA-Z]\w*`, nil},
	{"Number", `[-+]?(\d*\.)?\d+`, nil},
	{"Comparison", `[=]|[<>]=?`, nil},
	{"Dot", `\.`, nil},
	{"String", `\"(?:[^\"]|\\.)*\"`, nil},
	{"Whitespace", `[ \t]+`, nil},
}))

var parser = participle.MustBuild(&Query{},
	participle.Lexer(queryLexer),
	participle.Elide("Whitespace"),
	participle.UseLookahead(2),
	participle.Unquote("String"),
)

func ParseQuery(q string) (*Query, error) {
	var expr Query

	err := parser.ParseString("", q, &expr)
	if err != nil {
		return nil, err
	}

	return &expr, nil
}

We have not yet reached the point where we can compile the original query style proposed in the beginning yet. Let us take stock of what remains

item.name=laptop item.spec.ram > 8gb item.spec.ssd=yes item.spec.ssd.capacity > 512gb sort_by:price/asc
  1. Capture the unit of the measurement ( 512gb )
  2. true, yes y all carries the same meaning ( ssd=yes being valid query)
  3. combine queries using the OR operator recursively as in:
    item.name=laptop (item.maker=asus | item.maker=coconics) item.spec.ssd=yes
    
  4. Handling errors better so mistakes are gracefully handled

The next article explains how to make a fairly complicated parser using PEG parser.