17.4: Case Study — A JSON Parser
In this section we illustrate PetitParser through the development of a JSON parser. JSON is a lightweight data-interchange format defined in http://www.json.org. We are going to use the specification on this website to define our own JSON parser.
JSON is a simple format based on nested pairs and arrays. The following script gives an example taken from Wikipedia http://en.Wikipedia.org/wiki/JSON.
Code \(\PageIndex{1}\) (Pharo): An example of JSON
{ "firstName" : "John",
  "lastName" : "Smith",
  "age" : 25,
  "address" :
    { "streetAddress" : "21 2nd Street",
      "city" : "New York",
      "state" : "NY",
      "postalCode" : "10021" },
  "phoneNumber":
    [
      { "type" : "home",
        "number" : "212 555-1234" },
      { "type" : "fax",
        "number" : "646 555-4567" } ] }
JSON consists of object definitions (between curly braces “{}”) and arrays (between square brackets “[]”). An object definition is a set of key/value pairs whereas an array is a list of values. The previous JSON example then represents an object (a person) with several key/value pairs (e.g., for the person’s first name, last name, and age). The address of the person is represented by another object while the phone number is represented by an array of objects.
First we define a grammar as a subclass of PPCompositeParser. Let us call it PPJsonGrammar.
Code \(\PageIndex{2}\) (Pharo): Defining the JSON grammar class
PPCompositeParser subclass: #PPJsonGrammar
    instanceVariableNames: ''
    classVariableNames: 'CharacterTable'
    poolDictionaries: ''
    category: 'PetitJson-Core'
We define the CharacterTable class variable since we will later use it to parse strings.


Parsing objects and arrays
The syntax diagrams for JSON objects and arrays are in Figure \(\PageIndex{1}\) and Figure \(\PageIndex{2}\). A PetitParser can be defined for JSON objects with the following code:
Code \(\PageIndex{3}\) (Pharo): Defining the JSON parser for object as represented in Figure \(\PageIndex{1}\)
PPJsonGrammar>>object
    ^ ${ asParser token trim , members optional , $} asParser token trim

PPJsonGrammar>>members
    ^ pair separatedBy: $, asParser token trim

PPJsonGrammar>>pair
    ^ stringToken , $: asParser token trim , value
The only new thing here is the call to the PPParser>>separatedBy: convenience method, which answers a new parser that parses the receiver (a pair here) one or more times, separated by its parameter parser (a comma here).
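As a quick check of what separatedBy: accepts, the following workspace sketch builds a comma-separated list of single digits. It assumes PetitParser is loaded; the temporary name digits is hypothetical, and end is appended so that matches: only answers true when the whole input is consumed.

```smalltalk
| digits |
"A list of digits separated by commas, with blanks allowed around each comma"
digits := (#digit asParser separatedBy: $, asParser token trim) end.

digits matches: '1'.         "true: a single element needs no separator"
digits matches: '1, 2 ,3'.   "true: trim skips the blanks around each comma"
digits matches: '1, 2,'.     "false: a separator must be followed by an element"
digits matches: ''.          "false: at least one element is required"
```

Because an empty input is rejected, the object rule above wraps members in optional so that an empty object {} is still accepted.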
Arrays are much simpler to parse, as shown in Code \(\PageIndex{4}\).
Code \(\PageIndex{4}\) (Pharo): Defining the JSON parser for array as represented in Figure \(\PageIndex{2}\)
PPJsonGrammar>>array
    ^ $[ asParser token trim , elements optional , $] asParser token trim

PPJsonGrammar>>elements
    ^ value separatedBy: $, asParser token trim
Parsing values
In JSON, a value is either a string, a number, an object, an array, a Boolean (true or false), or null. The value parser is defined as below and represented in Figure \(\PageIndex{3}\):
Code \(\PageIndex{5}\) (Pharo): Defining the JSON parser for value as represented in Figure \(\PageIndex{3}\)
PPJsonGrammar>>value
    ^ stringToken / numberToken / object / array / trueToken / falseToken / nullToken

A string requires quite some work to parse. A string starts and ends with double quotes. What is inside these double quotes is a sequence of characters. Each character is either an escape character, an octal character, or a normal character. An escape character is composed of a backslash immediately followed by a special character (e.g., '\n' to get a new line in the string). An octal character (the name used in the code; it is in fact a Unicode escape) is composed of a backslash, immediately followed by the letter 'u', immediately followed by 4 hexadecimal digits. Finally, a normal character is any character except a double quote (used to end the string) and a backslash (used to introduce an escape character).
Code \(\PageIndex{6}\) (Pharo): Defining the JSON parser for string as represented in Figure \(\PageIndex{4}\)
PPJsonGrammar>>stringToken
    ^ string token trim

PPJsonGrammar>>string
    ^ $" asParser , char star , $" asParser

PPJsonGrammar>>char
    ^ charEscape / charOctal / charNormal

PPJsonGrammar>>charEscape
    ^ $\ asParser , (PPPredicateObjectParser anyOf: (String withAll: CharacterTable keys))

PPJsonGrammar>>charOctal
    ^ '\u' asParser , (#hex asParser min: 4 max: 4)

PPJsonGrammar>>charNormal
    ^ PPPredicateObjectParser anyExceptAnyOf: '\"'
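To see how these pieces combine, here is a minimal workspace sketch that rebuilds a simplified string parser outside the grammar class. It assumes PetitParser is loaded and deliberately leaves out charEscape so that it does not depend on CharacterTable; the temporary names charNormal, charUnicode and string are hypothetical stand-ins for the productions above.

```smalltalk
| charNormal charUnicode string |
"Any character except a backslash and a double quote"
charNormal := PPPredicateObjectParser anyExceptAnyOf: '\"'.
"A backslash, the letter u, then exactly four hexadecimal digits"
charUnicode := '\u' asParser , (#hex asParser min: 4 max: 4).
"A string is a quoted sequence of such characters; end requires the whole input"
string := ($" asParser , (charUnicode / charNormal) star , $" asParser) end.

string matches: '"caf\u00e9"'.    "true"
string matches: '"unterminated'.  "false: the closing quote is missing"
string matches: '"bad \q"'.       "false: this backslash starts no known sequence"
```

In the full grammar, charEscape is tried first and handles the remaining backslash sequences listed in CharacterTable.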
The special characters allowed after a backslash and their meanings are defined in the CharacterTable dictionary, which we initialize in the initialize class method. Note that a class-side initialize method is called when the class is loaded into the system. If you have just created the initialize method, the class was loaded without it, so the method has not run yet. To execute it, you should evaluate PPJsonGrammar initialize in your workspace.

Code \(\PageIndex{7}\) (Pharo): Defining the JSON special characters and their meaning
PPJsonGrammar class>>initialize
    CharacterTable := Dictionary new.
    CharacterTable
        at: $\ put: $\;
        at: $/ put: $/;
        at: $" put: $";
        at: $b put: Character backspace;
        at: $f put: Character newPage;
        at: $n put: Character lf;
        at: $r put: Character cr;
        at: $t put: Character tab
Parsing numbers is only slightly simpler, as a number can be positive or negative, and integral or decimal. Additionally, a number can be written in floating-point notation with an exponent (e.g., 2.5e3).
Code \(\PageIndex{8}\) (Pharo): Defining the JSON parser for number as represented in Figure \(\PageIndex{5}\)
PPJsonGrammar>>numberToken
    ^ number token trim

PPJsonGrammar>>number
    ^ $- asParser optional ,
      ($0 asParser / #digit asParser plus) ,
      ($. asParser , #digit asParser plus) optional ,
      (($e asParser / $E asParser) , ($- asParser / $+ asParser) optional , #digit asParser plus) optional

The attentive reader will have noticed a small difference between the syntax diagram in Figure \(\PageIndex{5}\) and the code in Code \(\PageIndex{8}\). Numbers in JSON cannot contain leading zeros: strings such as "01" do not represent valid numbers. The syntax diagram makes that explicit by allowing either a 0 or a digit between 1 and 9. In the above code, the rule is implicit and relies on the fact that the parser combinator / is ordered: the parser on the right of / is only tried if the parser on the left fails. Thus, ($0 asParser / #digit asParser plus) accepts either a single 0 or a sequence of digits not starting with 0. On an input such as "01", the left alternative consumes the 0 and the right one is never tried, so the 1 is left unconsumed and the surrounding parse fails.
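The effect of the ordered choice can be checked in a workspace with the sketch below. It assumes PetitParser is loaded; the temporary name intPart is hypothetical, and end is appended to stand in for whatever must follow the integer part in a complete parse.

```smalltalk
| intPart |
"Integer part of a JSON number, required to consume the whole input"
intPart := ($0 asParser / #digit asParser plus) end.

intPart matches: '0'.    "true: the left alternative consumes the 0"
intPart matches: '42'.   "true: the left alternative fails on 4, the right one takes over"
intPart matches: '01'.   "false: the left alternative succeeds on 0, then end fails on the 1"
```

Note that the right alternative alone would happily accept 01; it is the ordering that prevents it from ever being tried on an input starting with 0.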
The other parsers are fairly trivial:
Code \(\PageIndex{9}\) (Pharo): Defining missing JSON parsers
PPJsonGrammar>>falseToken
    ^ 'false' asParser token trim

PPJsonGrammar>>nullToken
    ^ 'null' asParser token trim

PPJsonGrammar>>trueToken
    ^ 'true' asParser token trim
The only piece missing is the start parser.
Code \(\PageIndex{10}\) (Pharo): Defining the JSON start parser as being a value (Figure \(\PageIndex{3}\)) with nothing following
PPJsonGrammar>>start
    ^ value end
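With the start parser in place, the grammar can be tried from a workspace. The following is a minimal sketch, assuming that every production of this section has been compiled in PPJsonGrammar; the temporary names parser and result and the sample inputs are only illustrations.

```smalltalk
| parser result |
"Fill CharacterTable in case the class-side initialize has not run yet"
PPJsonGrammar initialize.
parser := PPJsonGrammar new.

"A well-formed document: the answer is not a failure object"
result := parser parse: '{ "name" : "Ada", "tags" : [ 1, 2.5e3, true, null ] }'.
result isPetitFailure.    "expected: false"

"A number with a leading zero is rejected"
(parser parse: '{ "age" : 01 }') isPetitFailure.    "expected: true"
```

Since the grammar defines no actions, a successful parse should answer nested collections of tokens rather than dictionaries and arrays; turning them into domain objects would be the job of a separate parser subclass.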