17.4: Case Study — A JSON Parser


In this section we illustrate PetitParser through the development of a JSON parser. JSON is a lightweight data-interchange format defined at http://www.json.org. We are going to use the specification on this website to define our own JSON parser.

JSON is a simple format based on nested key/value pairs and arrays. The following example is taken from Wikipedia (http://en.Wikipedia.org/wiki/JSON).

Code $\PageIndex{1}$ (Pharo): An example of JSON

{ "firstName" : "John",
  "lastName" : "Smith",
  "age" : 25,
  "address" :
      { "streetAddress" : "21 2nd Street",
        "city" : "New York",
        "state" : "NY",
        "postalCode" : "10021" },
  "phoneNumber":
      [
          { "type" : "home",
            "number" : "212 555-1234" },
          { "type" : "fax",
            "number" : "646 555-4567" } ] }

JSON consists of object definitions (between curly braces “{}”) and arrays (between square brackets “[]”). An object definition is a set of key/value pairs whereas an array is a list of values. The previous JSON example then represents an object (a person) with several key/value pairs (e.g., for the person’s first name, last name, and age). The address of the person is represented by another object while the phone number is represented by an array of objects.

First we define a grammar as a subclass of PPCompositeParser. Let us call it PPJsonGrammar.

Code $\PageIndex{2}$ (Pharo): Defining the JSON grammar class

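PPCompositeParser subclass: #PPJsonGrammar
    instanceVariableNames: 'object members pair array elements value stringToken string char charEscape charOctal charNormal numberToken number trueToken falseToken nullToken'
    classVariableNames: 'CharacterTable'
    poolDictionaries: ''
    category: 'PetitJson-Core'

PPCompositeParser expects each production to be declared as an instance variable with the same name as the method that defines it, so the instance variables listed above anticipate the productions defined in the rest of this section. We also define the CharacterTable class variable since we will later use it to parse strings.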

Figure $\PageIndex{1}$: Syntax diagram representation for the JSON object parser defined in Code $\PageIndex{3}$.

Figure $\PageIndex{2}$: Syntax diagram representation for the JSON array parser defined in Code $\PageIndex{4}$.

Parsing objects and arrays

The syntax diagrams for JSON objects and arrays are in Figure $\PageIndex{1}$ and Figure $\PageIndex{2}$. A PetitParser parser for JSON objects can be defined with the following code:

Code $\PageIndex{3}$ (Pharo): Defining the JSON parser for object as represented in Figure $\PageIndex{1}$

PPJsonGrammar>>object
    ^ ${ asParser token trim , members optional , $} asParser token trim

PPJsonGrammar>>members
    ^ pair separatedBy: $, asParser token trim

PPJsonGrammar>>pair
    ^ stringToken , $: asParser token trim , value

The only new thing here is the call to the PPParser>>separatedBy: convenience method, which answers a new parser that parses the receiver (a pair here) one or more times, separated by its parameter parser (a comma here).
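To see separatedBy: in isolation, the following workspace sketch applies it to a throwaway parser for comma-separated digits. The variable name list is only illustrative and is not part of PPJsonGrammar.

```smalltalk
| list |
"One or more digits, separated by commas."
list := #digit asParser separatedBy: $, asParser.

"Succeeds: the answered Array interleaves the parsed digits with the separators."
list parse: '1,2,3'.

"Fails: separatedBy: requires at least one occurrence of the receiver."
(list parse: '') isPetitFailure.
```

Because at least one element is required, the object production wraps members in optional so that an empty object is still accepted.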

Arrays are simpler to parse, as shown in Code $\PageIndex{4}$.

Code $\PageIndex{4}$ (Pharo): Defining the JSON parser for array as represented in Figure $\PageIndex{2}$

PPJsonGrammar>>array
    ^ $[ asParser token trim ,
        elements optional ,
    $] asParser token trim

PPJsonGrammar>>elements
    ^ value separatedBy: $, asParser token trim
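Both the object and the array productions wrap their punctuation in token trim. As a small workspace sketch of what that buys us, the snippet below compares a bare character parser with its trimmed token counterpart on input surrounded by spaces.

```smalltalk
"A bare character parser does not skip whitespace, so this answers false."
$, asParser matches: ' , '.

"The trimmed token skips whitespace on both sides, so this answers true."
$, asParser token trim matches: ' , '.
```

This is why the grammar accepts JSON text regardless of how it is indented or spaced around braces, brackets, colons, and commas.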

Parsing values

In JSON, a value is either a string, a number, an object, an array, a Boolean (true or false), or null. The value parser is defined as below and represented in Figure $\PageIndex{3}$:

Code $\PageIndex{5}$ (Pharo): Defining the JSON parser for value as represented in Figure $\PageIndex{3}$

PPJsonGrammar>>value
    ^ stringToken / numberToken / object / array /
        trueToken / falseToken / nullToken

Figure $\PageIndex{3}$: Syntax diagram representation for the JSON value parser defined in Code $\PageIndex{5}$.

Parsing a string requires more work. A string starts and ends with double quotes. Between the quotes is a sequence of characters, each of which is either an escape character, a Unicode escape (called charOctal in the code below, although its digits are hexadecimal, not octal), or a normal character. An escape character is a backslash immediately followed by a special character (e.g., '\n' for a new line). A Unicode escape is a backslash, immediately followed by the letter 'u', immediately followed by 4 hexadecimal digits. Finally, a normal character is any character except a double quote (which ends the string) and a backslash (which introduces an escape).

Code $\PageIndex{6}$ (Pharo): Defining the JSON parser for string as represented in Figure $\PageIndex{4}$

PPJsonGrammar>>stringToken
    ^ string token trim
PPJsonGrammar>>string
    ^ $" asParser , char star , $" asParser
PPJsonGrammar>>char
    ^ charEscape / charOctal / charNormal
PPJsonGrammar>>charEscape
    ^ $\ asParser , (PPPredicateObjectParser anyOf: (String withAll: CharacterTable keys))
PPJsonGrammar>>charOctal
    ^ '\u' asParser , (#hex asParser min: 4 max: 4)
PPJsonGrammar>>charNormal
    ^ PPPredicateObjectParser anyExceptAnyOf: '\"'

The special characters allowed after a backslash, and their meanings, are defined in the CharacterTable dictionary, which we initialize in the class-side initialize method. Note that a class-side initialize method is called automatically when the class is loaded into the system. If you have just created the initialize method, the class was loaded without it, so the method has not been run yet. To execute it, evaluate PPJsonGrammar initialize in a workspace.

Figure $\PageIndex{4}$: Syntax diagram representation for the JSON string parser defined in Code $\PageIndex{6}$.

Code $\PageIndex{7}$ (Pharo): Defining the JSON special characters and their meaning

PPJsonGrammar class>>initialize
    CharacterTable := Dictionary new.
    CharacterTable
        at: $\ put: $\;
        at: $/ put: $/;
        at: $" put: $";
        at: $b put: Character backspace;
        at: $f put: Character newPage;
        at: $n put: Character lf;
        at: $r put: Character cr;
        at: $t put: Character tab

Parsing numbers is only slightly simpler: a number can be positive or negative, and integral or decimal. Additionally, a number can be followed by an exponent part, as in floating-point notation (e.g., 2.5e-3).

Code $\PageIndex{8}$ (Pharo): Defining the JSON parser for number as represented in Figure $\PageIndex{5}$

PPJsonGrammar>>numberToken
    ^ number token trim
PPJsonGrammar>>number
    ^ $- asParser optional ,
        ($0 asParser / #digit asParser plus) ,
        ($. asParser , #digit asParser plus) optional ,
        (($e asParser / $E asParser) ,
            ($- asParser / $+ asParser) optional ,
            #digit asParser plus) optional

Figure $\PageIndex{5}$: Syntax diagram representation for the JSON number parser defined in Code $\PageIndex{8}$.

The attentive reader will have noticed a small difference between the syntax diagram in Figure $\PageIndex{5}$ and the code in Code $\PageIndex{8}$. Numbers in JSON cannot contain leading zeros: i.e., strings such as "01" do not represent valid numbers. The syntax diagram makes that explicit by allowing either a 0 or a digit between 1 and 9 followed by further digits. In the above code, the rule is implicit and relies on the fact that the parser combinator / is ordered: the parser on the right of / is only tried if the parser on the left fails. Thus, ($0 asParser / #digit asParser plus) accepts either a single 0 or a sequence of digits not starting with 0. For an input such as "01", the left alternative consumes the 0 and leaves the 1 unconsumed, so the enclosing rule (an object, an array, or the final end of input check) fails.
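The effect of the ordered choice can be checked in a workspace without the rest of the grammar. In this sketch, intPart is a hypothetical stand-alone copy of the integer part of the number production, and end is appended so that leftover input counts as a failure.

```smalltalk
| intPart |
"Stand-alone copy of the integer part used in PPJsonGrammar>>number."
intPart := $0 asParser / #digit asParser plus.

"A lone zero is accepted by the left alternative."
intPart end matches: '0'.

"The left alternative fails on $4, so the right one consumes both digits."
intPart end matches: '42'.

"The left alternative consumes the 0 and the choice is not retried,
 so the trailing 1 makes the end check fail."
intPart end matches: '01'.
```

The first two expressions answer true and the last one answers false, which is the behavior the JSON specification requires for leading zeros.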

The other parsers are fairly trivial:

Code $\PageIndex{9}$ (Pharo): Defining missing JSON parsers

PPJsonGrammar>>falseToken
    ^ 'false' asParser token trim
PPJsonGrammar>>nullToken
    ^ 'null' asParser token trim
PPJsonGrammar>>trueToken
    ^ 'true' asParser token trim

The only piece missing is the start parser.

Code $\PageIndex{10}$ (Pharo): Defining the JSON start parser as being a value (Figure $\PageIndex{3}$) with nothing following

PPJsonGrammar>>start
    ^ value end
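With the start production in place, the grammar can be tried out from a workspace. The following is a minimal sketch, assuming that all the methods above have been compiled, that every production is declared as an instance variable of PPJsonGrammar, and that the sample inputs (which are made up for illustration) are typed in as shown.

```smalltalk
| grammar result |
"Populate CharacterTable in case the class-side initialize has not run yet."
PPJsonGrammar initialize.

grammar := PPJsonGrammar new.

"Well-formed input: the result is a nested structure of tokens and arrays,
 since the grammar does not yet transform what it parses."
result := grammar parse: '{ "age" : 25, "tags" : [ true, null ] }'.
result isPetitFailure.

"Leading zero: rejected, as discussed for the number production."
(grammar parse: '{ "age" : 01 }') isPetitFailure.

"Trailing comma: rejected, because separatedBy: expects a value after each comma."
(grammar parse: '[ 1, 2, ]') isPetitFailure.
```

The first isPetitFailure answers false and the other two answer true. Because start is defined as value end, any input with text left over after a complete value is reported as a failure rather than being silently accepted.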