JSON parser with PHP and Bison
Read this post if you don't know what Bison is.
Today I'll try to parse JSON into AST and compare it with the native PHP function json_decode()
.
To test our parser I will use this JSON file:
test.json
{
"fieldString": "string",
"fieldNumber": 99,
"fieldBoolTrue": true,
"fieldBoolFalse": false,
"fieldNull": null,
"fieldEmptyArray": [],
"fieldEmptyObject": {},
"fieldArray": [
"string",
99,
true,
false,
null,
{},
[]
]
}
First, we need to install PHP dependencies.
composer require --dev mrsuh/php-bison-skeleton
composer require mrsuh/tree-printer
composer require doctrine/lexer
- mrsuh/php-bison-skeleton - to build PHP parser with Bison
-
mrsuh/tree-printer - to print
AST
- doctrine/lexer - to parse text into tokens
We will store our files like this:
.
├── /ast-parser
├── /bin
│ └── parse.php # entry point to parse JSON
├── /lib
│ └── parser.php # generated file
├── /src
│ ├── Lexer.php
│ └── Node.php # AST node
└── grammar.y
The Node
class must implement Mrsuh\Tree\NodeInterface
to print AST
.
src/Node.php
<?php
namespace App;
use Mrsuh\Tree\NodeInterface;
class Node implements NodeInterface
{
private string $name;
private string $value;
/** @var Node[] */
private array $children;
public function __construct(string $name, string $value, array $children = [])
{
$this->name = $name;
$this->value = $value;
$this->children = $children;
}
public function getChildren(): array
{
return $this->children;
}
public function __toString(): string
{
if (!empty($this->value)) {
return sprintf("%s: '%s'", $this->name, $this->value);
}
return $this->name;
}
}
I'll use the Doctrine lexer library. It helps to parse complex text.
src/Lexer.php
<?php
namespace App;
use Doctrine\Common\Lexer\AbstractLexer;
class Lexer extends AbstractLexer implements LexerInterface
{
...
protected function getCatchablePatterns(): array
{
return [
'\:',
'\{',
'\}',
'\[',
'\]',
'\,',
"\"[^\"]+\"",
'true',
'false',
'null',
];
}
protected function getNonCatchablePatterns(): array
{
return [
' ',
'\n'
];
}
protected function getType(&$value): int
{
if (in_array($value, [':', '{', '}', '[', ']', ','], true)) {
return ord($value);
}
if (is_numeric($value)) {
return LexerInterface::T_NUMBER;
}
switch (strtolower($value)) {
case 'true':
case 'false':
return LexerInterface::T_BOOL;
case 'null':
return LexerInterface::T_NULL;
}
return LexerInterface::T_STRING;
}
...
}
For example, Lexer
will translate the JSON
{
"array": [
"string",
99,
true,
false,
null
]
}
into this:
word | token |
---|---|
{ | ASCII (123) |
"array" | LexerInterface::T_STRING (258) |
: | ASCII (58) |
[ | ASCII (91) |
"string" | LexerInterface::T_STRING (258) |
, | ASCII (44) |
99 | LexerInterface::T_NUMBER (259) |
, | ASCII (44) |
true | LexerInterface::T_BOOL (260) |
, | ASCII (44) |
false | LexerInterface::T_BOOL (260) |
, | ASCII (44) |
null | LexerInterface::T_NULL (261) |
, | ASCII (44) |
] | ASCII (93) |
} | ASCII (125) |
LexerInterface::YYEOF (0) |
Time to create grammar.y
file and build lib/parser.php
PHP already has the native function json_decode()
and it uses Bison to generate a C parser.
I think we can get ready Bison grammar file from the php-src repository and modify it.
The grammar file is very small because JSON
standard is very simple.
We will use block %code parser
to define variables and methods to store AST
into the Parser
class.
grammar.y
%define api.parser.class {Parser}
%define api.namespace {App}
%code parser {
private Node $ast;
public function setAst(Node $ast): void { $this->ast = $ast; }
public function getAst(): Node { return $this->ast; }
}
%token T_STRING
%token T_NUMBER
%token T_BOOL
%token T_NULL
%%
start:
value { self::setAst($1); }
;
object:
'{' members '}' { $$ = $2; }
;
members:
%empty { $$ = []; }
| member { $$ = [$1]; }
| members ',' member { $$ = $1; $$[] = $3; }
;
member:
T_STRING ':' value { $$ = new Node('T_STRING', $1, [$3]); }
;
array:
'[' elements ']' { $$ = $2; }
;
elements:
%empty { $$ = []; }
| value { $$ = [$1]; }
| elements ',' value { $$ = $1; $$[] = $3; }
;
value:
object { $$ = new Node('T_OBJECT', '', $1); }
| array { $$ = new Node('T_ARRAY', '', $1); }
| T_STRING { $$ = new Node('T_STRING', $1); }
| T_NUMBER { $$ = new Node('T_NUMBER', $1); }
| T_BOOL { $$ = new Node('T_BOOL', $1); }
| T_NULL { $$ = new Node('T_NULL', $1); }
;
%%
bison -S vendor/mrsuh/php-bison-skeleton/src/php-skel.m4 -o lib/parser.php grammar.y
Command options:
-
-S vendor/mrsuh/php-bison-skeleton/src/php-skel.m4
- path toskeleton
file -
-o parser.php
- output parser file -
grammar.y
- our grammar file
The final PHP file is the entry point for the parser.
bin/parse.php
<?php
require_once __DIR__ . '/../vendor/autoload.php';
use App\Parser;
use App\Lexer;
use Mrsuh\Tree\Printer;
$lexer = new Lexer(fopen($argv[1], 'r'));
$parser = new Parser($lexer);
if (!$parser->parse()) {
exit(1);
}
$printer = new Printer();
$printer->print($parser->getAst());
Autoload for generated lib/parser.php
file.
composer.json
{
"autoload": {
"psr-4": {
"App\\": "src/"
},
"files": ["lib/parser.php"]
},
...
}
Finally, we can test our parser.
php bin/parse.php test.json
.
├── T_OBJECT
├── T_STRING: 'fieldString'
│ └── T_STRING: 'string'
├── T_STRING: 'fieldNumber'
│ └── T_NUMBER: '99'
├── T_STRING: 'fieldBoolTrue'
│ └── T_BOOL: 'true'
├── T_STRING: 'fieldBoolFalse'
│ └── T_BOOL: 'false'
├── T_STRING: 'fieldNull'
│ └── T_NULL: 'null'
├── T_STRING: 'fieldEmptyArray'
│ └── T_ARRAY
├── T_STRING: 'fieldEmptyObject'
│ └── T_OBJECT
└── T_STRING: 'fieldArray'
└── T_ARRAY
├── T_STRING: 'string'
├── T_NUMBER: '99'
├── T_BOOL: 'true'
├── T_BOOL: 'false'
├── T_NULL: 'null'
├── T_OBJECT
└── T_ARRAY
It works!
I think it will be cool if we compare the native json_decode()
function and our parser.
First, I need a JSON file for benchmarks. I can get JSON info about Bulbasaur pokemon from API https://pokeapi.co.
curl 'https://pokeapi.co/api/v2/pokemon/bulbasaur' > bench.json
The file weight is 215KB.
We need to modify our grammar.y
file to avoid Node
creating.
grammar-bench.y
...
value:
object { $$ = $1; }
| array { $$ = $1; }
| T_STRING { $$ = $1; }
| T_NUMBER { $$ = $1; }
| T_BOOL { $$ = $1; }
| T_NULL { $$ = $1; }
...
bison -S ../../src/php-skel.m4 -o lib/parser.php grammar-bench.y
We are ready to start the comparison.
PHP 8.2
php vendor/bin/phpbench run tests --report=my-report
+-------------+----------+----------+--------+
| subject | mem_peak | mode | rstdev |
+-------------+----------+----------+--------+
| benchNative | 2.539mb | 1.570ms | ±0.89% |
| benchBison | 12.443mb | 84.283ms | ±1.08% |
+-------------+----------+----------+--------+
PHP 8.1
php vendor/bin/phpbench run tests --report=my-report
+-------------+----------+----------+--------+
| subject | mem_peak | mode | rstdev |
+-------------+----------+----------+--------+
| benchNative | 2.593mb | 1.595ms | ±0.68% |
| benchBison | 18.471mb | 87.471ms | ±0.68% |
+-------------+----------+----------+--------+
PHP 8.0
php vendor/bin/phpbench run tests --report=my-report
+-------------+----------+----------+--------+
| subject | mem_peak | mode | rstdev |
+-------------+----------+----------+--------+
| benchNative | 2.700mb | 1.586ms | ±0.90% |
| benchBison | 18.578mb | 87.533ms | ±0.83% |
+-------------+----------+----------+--------+
PHP 7.4
php vendor/bin/phpbench run tests --report=my-report
+-------------+----------+-----------+--------+
| subject | mem_peak | mode | rstdev |
+-------------+----------+-----------+--------+
| benchNative | 2.857mb | 1.725ms | ±1.00% |
| benchBison | 18.735mb | 105.099ms | ±0.91% |
+-------------+----------+-----------+--------+
PHP Bison parser shows the best result with PHP 8.2.
It is ~56 times slower than the native json_decode()
function.
I hope it was interesting for you!
You can get the parser source code here and test it by yourself.
Some useful links: