How I Wrote a PHP Skeleton for Bison

devm.io/php/php-skeleton-bison-generics

Do you dream of generics in PHP?

I wanted them so much that I wrote a library that brings generics to PHP.

<?php

namespace App;

class Box<T> {

    private ?T $data = null;

    public function set(T $data): void {
        $this->data = $data;
    }

    public function get(): ?T {
        return $this->data;
    }
}

If you are interested, you can try it. It needs only plain PHP, with no extensions.

In this article, though, I want to tell you about a very important part of my library: the AST parser.

I use the very popular nikic/php-parser library, which many other tools also rely on.

It builds an AST from source code like this:

<?php

namespace App;

class Test
{
    public function test($foo) {}
}

.
├── ZEND_AST_STMT_LIST
    ├── ZEND_AST_NAMESPACE
    │   └── ZEND_AST_ZVAL 'App'
    └── ZEND_AST_CLASS 'Test'
        └── ZEND_AST_STMT_LIST
            └── ZEND_AST_METHOD 'test'
                └── ZEND_AST_PARAM_LIST
                    └── ZEND_AST_PARAM
                        └── ZEND_AST_ZVAL 'foo'

Every AST parser has a lexical analyzer, a syntax analyzer, and an AST builder. Usually, these are grouped into a Lexer and a Parser.

You don't need to write the Lexer and the Parser from scratch.

To build a Lexer, you can use existing tools.

How do Lexers work?

A Lexer splits text into tokens.

For example, the PHP engine's Lexer is generated with re2c.

php-src Lexer example

/*!re2c
re2c:yyfill:check = 0;
LNUM    [0-9]+(_[0-9]+)*
DNUM    ({LNUM}?"."{LNUM})|({LNUM}"."{LNUM}?)

<ST_IN_SCRIPTING>"exit" {
    RETURN_TOKEN_WITH_IDENT(T_EXIT);
}

<ST_IN_SCRIPTING>"return" {
    RETURN_TOKEN_WITH_IDENT(T_RETURN);
}
*/
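Each re2c rule above pairs a pattern with an action that returns a token. The same idea can be sketched in plain PHP with regular expressions. This is only a toy under stated assumptions: the names `RULES` and `tokenize()` are hypothetical, the rule set covers just a few tokens, and it is not how php-src or php-parser implement their Lexers.

```php
<?php
// Hypothetical toy rule table: token name => pattern.
// \G anchors each pattern at the current offset, so rules only match "here".
const RULES = [
    'T_WHITESPACE' => '/\G\s+/',
    'T_RETURN'     => '/\Greturn\b/',
    'T_EXIT'       => '/\Gexit\b/',
    'T_LNUMBER'    => '/\G[0-9]+(_[0-9]+)*/',
];

function tokenize(string $input): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach (RULES as $name => $pattern) {
            if (preg_match($pattern, $input, $match, 0, $offset) === 1) {
                $tokens[] = [$name, $match[0]];
                $offset += strlen($match[0]);
                continue 2; // restart rule matching at the new offset
            }
        }
        // No rule matched: emit the single character as its own token.
        $tokens[] = [$input[$offset], $input[$offset]];
        $offset++;
    }

    return $tokens;
}

foreach (tokenize('return 1_000;') as [$name, $text]) {
    echo $name, ' ', json_encode($text), PHP_EOL;
}
```

A generated Lexer compiles all patterns into one state machine instead of trying them one by one, which is a large part of why tools like re2c are used.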

Below you can see PHP code and the tokens the Lexer produces for it.

<?php    |   T_OPEN_TAG
         |   T_WHITESPACE
$a = 1;  |   T_VARIABLE T_WHITESPACE = T_WHITESPACE T_LNUMBER ;
         |   T_WHITESPACE
echo $a; |   T_ECHO T_WHITESPACE T_VARIABLE ;

We can treat the PHP engine's Lexer and the php-parser Lexer as equivalent, because the function token_get_all() calls the engine's re2c-generated lexer under the hood.

Once the Lexer has produced tokens, we need a Parser to build the AST.
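If you want to see the token stream yourself, a minimal sketch using PHP's built-in tokenizer could look like this. It assumes the tokenizer extension is available (it is enabled by default); the variable names are only illustrative.

```php
<?php
// Sample source to tokenize; the dollar signs are escaped so PHP does not
// interpolate them inside the double-quoted string.
$code = "<?php\n\$a = 1;\necho \$a;";

$names = [];
foreach (token_get_all($code) as $token) {
    // Multi-character tokens arrive as [id, text, line];
    // single characters such as "=" and ";" arrive as plain strings.
    $names[] = is_array($token) ? token_name($token[0]) : $token;
}

echo implode(' ', $names), PHP_EOL;
```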

To build a Parser, you can use an existing parser generator.

How do parser generators work?

A generator takes your grammar.y BNF file, parses it, extracts all definitions, and then constructs a bunch of tables like this:

$yytable = [
    6, 3, 7, 20, 8, 51, 28, 1, 52, 4,
    9, 13, 10, 29, 15, 30, 18, 31, 16, 19,
    32, 22, 33, 34, 23, 24, 35, 11, 37, 25,
    21, 38, 39, 26, 45, 0, 40, 42, 0, 43,
    41, 0, 0, 49, 0, 0, 0, 0, 0, 47,
    48, 0, 50, 0, 53, 54
];

This data is then passed to a template called a Skeleton.

For Bison, a Skeleton is a special file written in the M4 macro language that renders your parser file.

Out of the box, Bison ships Skeletons for C, C++, D, and Java.
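To make the role of such tables more concrete, here is a hand-written toy table-driven parser for sums like `1 + 2 + 3`. It is only a sketch: the grammar, the `ACTION`, `GOTO_EXPR` and `RULE_LENGTH` tables, and the `parseSum()` function are all hypothetical, and the tables are written out in a readable form, whereas the tables Bison generates are packed into compact integer arrays.

```php
<?php
// Toy grammar (for illustration only):
//   rule 1: expr -> expr '+' NUM
//   rule 2: expr -> NUM
// ACTION[state][token]: ['s', n] = shift and enter state n,
// ['r', n] = reduce by rule n, ['acc'] = accept the input.
const ACTION = [
    0 => ['NUM' => ['s', 1]],
    1 => ['+' => ['r', 2], '$' => ['r', 2]],
    2 => ['+' => ['s', 3], '$' => ['acc']],
    3 => ['NUM' => ['s', 4]],
    4 => ['+' => ['r', 1], '$' => ['r', 1]],
];
// State to enter after reducing to expr, keyed by the state left on top.
const GOTO_EXPR = [0 => 2];
// Number of right-hand-side symbols per rule.
const RULE_LENGTH = [1 => 3, 2 => 1];

function parseSum(array $tokens): int
{
    $tokens[] = ['$', null]; // end-of-input marker
    $states = [0];           // state stack
    $values = [];            // semantic value stack
    $pos = 0;

    while (true) {
        [$type, $value] = $tokens[$pos];
        $row = ACTION[$states[count($states) - 1]];
        $action = $row[$type] ?? null;

        if ($action === null) {
            throw new RuntimeException("Unexpected token $type");
        }
        if ($action[0] === 'acc') {
            return $values[0];
        }
        if ($action[0] === 's') {
            $states[] = $action[1];
            $values[] = $value;
            $pos++;
            continue;
        }

        // Reduce: pop the right-hand side, run the rule's action, push the result.
        $rule = $action[1];
        $rhs = array_splice($values, -RULE_LENGTH[$rule]);
        array_splice($states, -RULE_LENGTH[$rule]);
        $values[] = $rule === 1 ? $rhs[0] + $rhs[2] : $rhs[0];
        $states[] = GOTO_EXPR[$states[count($states) - 1]];
    }
}

echo parseSum([['NUM', 1], ['+', null], ['NUM', 2]]), PHP_EOL; // prints 3
```

The loop never changes when the grammar changes; only the tables do. That is why a generator can keep the driver code in a reusable Skeleton and fill in the tables for each grammar.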

The PHP engine and php-parser use different parser generators (Bison and KmYacc, respectively), but their grammar files are very similar.

php-src grammar example

statement:
    |   T_BREAK optional_expr ';'    { $$ = zend_ast_create(ZEND_AST_BREAK, $2); }
    |   T_CONTINUE optional_expr ';' { $$ = zend_ast_create(ZEND_AST_CONTINUE, $2); }
    |   T_RETURN optional_expr ';'   { $$ = zend_ast_create(ZEND_AST_RETURN, $2); }

php-parser grammar example

non_empty_statement:
    |   T_BREAK optional_expr semi    { $$ = Stmt\Break_[$2]; }
    |   T_CONTINUE optional_expr semi { $$ = Stmt\Continue_[$2]; }
    |   T_RETURN optional_expr semi   { $$ = Stmt\Return_[$2]; }
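In both grammars, the code in braces is a semantic action that runs when the rule is reduced: `$2` is the value of the second symbol on the right-hand side, and `$$` is the value of the rule's result. A rough sketch of what the return rule's action amounts to is shown below. The `ReturnNode` class and the `reduceReturn()` function are hypothetical stand-ins, not the real php-parser classes.

```php
<?php
// Hypothetical AST node; real node classes carry more information
// (for example source positions).
final class ReturnNode
{
    /** @var mixed value of the optional expression, or null */
    public $expr;

    public function __construct($expr)
    {
        $this->expr = $expr;
    }
}

// Reduce "T_RETURN optional_expr ';'": $rhs holds the semantic values of the
// three right-hand-side symbols, so $rhs[1] plays the role of $2.
function reduceReturn(array $rhs): ReturnNode
{
    return new ReturnNode($rhs[1]); // the returned node plays the role of $$
}

$node = reduceReturn(['return', '$a', ';']);
var_dump($node->expr);
```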

To summarize: the Lexer turns source code into tokens, and the Parser, which a generator renders from a grammar file through a Skeleton, turns those tokens into an AST.

I had been thinking about replacing KmYacc with Bison in php-parser.

It would be great for the PHP engine and php-parser to use the same tool for the same job.

Even the fact that Bison has no PHP Skeleton didn't stop me.

I decided to create my own skeleton.

I translated the Java skeleton to PHP. It took me a few months.

Translating Java code to PHP is not very hard, but only when the code is not wrapped in M4 macros and does not depend on many options.

Java-skeleton example

]b4_yystype[ lval = yylexer.getLVal();]b4_locations_if([[
]b4_location_type[ yyloc = new ]b4_location_type[(yylexer.getStartPos(), yylexer.getEndPos());
status = push_parse(token, lval, yyloc);]], [[
status = push_parse(token, lval);]])[

PHP-skeleton example

/** @@var ]b4_yystype[ */
$lval = $this->yylexer->getLVal();]b4_locations_if([[
/** @@var ]b4_location_type[ */
$yyloc = new ]b4_location_type[($this->yylexer->getStartPos(), $this->yylexer->getEndPos());
$status = $this->push_parse($token, $lval, $yyloc);]], [[
$status = $this->push_parse($token, $lval);]])[

After a few months and many automated tests, the PHP skeleton was ready!

[php-bison-skeleton] composer test
> php vendor/bin/phpunit
PHPUnit 9.6.5 by Sebastian Bergmann and contributors.

................................................................. 65 / 72 ( 90%)
.......                                                           72 / 72 (100%)

Time: 00:04.037, Memory: 6.00 MB

OK (72 tests, 384 assertions)

Then I tried to replace KmYacc with Bison.

You can reproduce the replacement with the steps:

Great! The parser is ready.

Time to compare PHP parser generated with Bison and KmYacc.

I ran benchmarks with three different file sizes on different PHP versions (lower is better):

PHP file 684 bytes

PHP file 8.8 kilobytes

PHP file 329 kilobytes

The parser generated with Bison turned out to be slower than the parser generated with KmYacc.

I tried to optimize the generated parser code, but that gave an improvement of at most ~15 percent. Not that much.

In the end, I replaced KmYacc with Bison in php-parser, but it does not work as well as I had imagined.

Now I have a well-working php-skeleton for Bison.

Maybe next time I'll try to replace KmYacc with ANTLR.

You can find php-bison-skeleton, along with many examples and tests, in this repository.

Thank you for your time. I hope you found this article useful.