In this article I’ll show you how you can create the abstract syntax tree (AST) of an input stream while parsing it. A parser usually reads various tokens from its input by using a lexer as a helper coroutine and tries to match various grammar rules that specify the syntax of a language (the source language).

Usually, parsers while parsing an input stream of tokens create an AST that can be used later in the compilation process. In computer science, an AST, or just syntax tree, is a tree representation of the abstract syntactic structure of source code written in a programming language. An AST is usually the result of the syntax analysis phase of a compiler. It often serves as an intermediate representation of the program through several stages that the compiler requires, and has a strong impact on the final output of the compiler.

I’ll demonstrate this concept by building a hypothetical advanced calculator which seems that is more a programming language. At first, let’s present some Bison code structures grabbed from the parser of the calculator.

Here follows the union type for the “yylval” that stores the values of the various symbols:

%union {
  ... more symbol data types ...

  struct ast_node * ast;

  double number;

  /* which operator to use (for relational and equality operators). */
  int operator;

  struct symbol_node * symbol;

  ... more symbol data types ...
}

Let’s explain the various variables of the union type.

struct ast_node * ast :

This is a pointer to an AST node and is used for the non-terminal symbols of the grammar.

double number :

Since we are demonstrating an advanced calculator we deal with floating-point numbers.

int operator :

This is just an integer denoting which relational or equality operator is used.

struct symbol_node * symbol :

This is a pointer to a symbol node (identifier) indicating either a variable or a function.

Let’s continue by presenting some of the token declarations:

... more token declarations ...

%token <number> NUMBER
%token <symbol> NAME

... more token declarations ...

The NUMBER token has “number” as its type which is just a floating-point number.

The NAME token has “symbol” as its type which is just a pointer to a symbol node (identifier).

Next, follows the precedence declarations of the various tokens:

... more precedence declarations ...

%nonassoc NO_ELSE
%nonassoc ELSE
%left ','
%right '='
%left <operator> EQUALITY
%left <operator> RELATIONAL
%left '+' '-'
%left '*' '/'
%right UMINUS
%left '(' ')'

... more precedence declarations ...

These declarations specify the precedence level and the associativity property of the various tokens.

The EQUALITY and RELATIONAL tokens have type “operator” which is just an integer indicating the operator.

The UMINUS is just a pseudo token which is used for the unary ‘-‘ operator to support negative expressions.

The ELSE and NO_ELSE tokens are just a hack for solving the dangling else problem in “if/else” statements.

Now, let’s present the non-terminal symbol declarations:

... more non-terminal symbol declarations ...

%type <ast> exp exp_list statement selection_statement iteration_statement

... more non-terminal symbol declarations ...

The non-terminal symbols have “ast” as their type which is just a pointer to an AST node.

These AST nodes are used to construct the syntax tree of the input.

Here are some BNF rules from the grammar of the advanced calculator:

%%
... more grammar rules ...

exp
  : exp RELATIONAL exp    { $$ = new_ast_relational_node ($2, $1, $3); }
  | exp EQUALITY exp      { $$ = new_ast_equality_node ($2, $1, $3); }
  | exp '+' exp           { $$ = new_ast_node ('+', $1, $3); }
  | exp '-' exp           { $$ = new_ast_node ('-', $1, $3);}
  | exp '*' exp           { $$ = new_ast_node ('*', $1, $3); }
  | exp '/' exp           { $$ = new_ast_node ('/', $1, $3); }
  | '(' exp ')'           { $$ = $2; }
  | '-' exp %prec UMINUS  { $$ = new_ast_node ('M', $2, NULL); }
  | NUMBER                { $$ = new_ast_number_node ($1); }
  | NAME                  { $$ = new_ast_symbol_reference_node ($1); }
  | NAME '=' exp          { $$ = new_ast_assignment_node ($1, $3); }
  | NAME '(' ')'          { $$ = new_ast_function_node ($1, NULL); }
  | NAME '(' exp_list ')' { $$ = new_ast_function_node ($1, $3); }
;

exp_list
  : exp
  | exp ',' exp_list { $$ = new_ast_node ('L', $1, $3); }
;

statement
  : selection_statement
  | iteration_statement
  | ... other statements ...
;

selection_statement
  : IF '(' expression ')' statement %prec NO_ELSE
    {
      $$ = new_ast_if_node ($3, $5, NULL);
    }
  | IF '(' expression ')' statement ELSE statement
    {
      $$ = new_ast_if_node ($3, $5, $7);
    }
;

iteration_statement
  : WHILE '(' expression ')' statement { $$ = new_ast_while_node ($3, $5); }
;

... more grammar rules ...
%%

Please, take account that in most actions of the BNF rules we call functions with prefix “new_ast_”. These functions create AST nodes for the various code structures. While the parser parses its input, AST nodes are created for the code structures of the input. So, at the end of the parsing process when the last non-terminal symbol is reduced and the input is syntactically valid we have a complete syntax tree which is one of the most important results of the syntax analysis.

Let’s continue the article by specifying the various data types that we have seen so far.

Here are the enumerations for the relational and equality operators:

enum relational_operator
{
  LESS,
  LESS_OR_EQUAL,
  GREATER,
  GREATER_OR_EQUAL
};

enum equality_operator
{
  EQUAL,
  NOT_EQUAL
};

Next, follows the symbol node:

struct symbol_node
{
  char * name;

  double value;

  struct ast_node * function;

  ... more function related information (list of dummy arguments) ...
};

The symbol nodes are usually created in the lexical analysis and are maintained in a symbol table which is globally visible. In the lexer we can identify identifiers (NAME tokens) with the following Flex rule (the regular expression is for matching identifiers):

[a-zA-Z][a-zA-Z0-9]* { yylval.symbol = lookup (yytext); return NAME; }

The “lookup” function has a great functionality. First, it tries to get a symbol node from the symbol table by using the name of the identifier. If the symbol node exists, the function returns it. Otherwise, it creates a new symbol node in the symbol table for the specified name of identifier and returns it. Since the input search key is the name of the identifier (which is a string value) a hash table data structure providing O(1) time complexity could be a nice implementation of the symbol table.

A symbol node denotes an identifier with a specific name. Also, a symbol node can be either a variable which can have a value (floating-point number) or a function which can have a body of code (AST node). Any further function related information is extracted and injected in the appropriate symbol node from the parser after the lexical analysis. Please, take account that a function needs also a nullable list of dummy parameters. This list can be used later for validation when parsing function calls. However, this process is beyond the purpose of the article.

Next, follows the various AST nodes:

struct ast_node // for binary/unary operators and expression lists
{
  int node_type;

  struct ast_node * left;

  struct ast_node * right;
};

struct ast_relational_node // for relational operators
{
  int node_type;

  enum relational_operator operator;

  struct ast_node * left;

  struct ast_node * right;
};

struct ast_equality_node // for equality operators
{
  int node_type;

  enum equality_operator operator;

  struct ast_node * left;

  struct ast_node * right;
};

struct ast_function_node // for function calls
{
  int node_type;

  struct ast_node * arguments;

  struct symbol_node * symbol;
};

struct ast_symbol_reference_node // for symbol references
{
  int node_type;

  struct symbol_node * symbol;
};

struct ast_if_node // for "if/else" statements
{
  int node_type;

  struct ast_node * condition;

  struct ast_node * if_branch;

  struct ast_node * else_branch;
};

struct ast_while_node // for "while" statements
{
  int node_type;

  struct ast_node * condition;

  struct ast_node * while_branch;
};

struct ast_assignment_node // for assignment expressions
{
  int node_type;

  struct symbol_node * symbol;

  struct ast_node * value;
};

struct ast_number_node // for constant floating-point numbers
{
  int node_type;

  double value;
};

You should notice that all AST nodes have a “node_type” variable indicating the type of the node.

This is important since this is the only information needed in order to manage an AST node appropriately.

We continue with the various AST functions with prefix “new_ast_” that create AST nodes:

struct ast_node *
new_ast_node (int node_type,
              struct ast_node * left,
              struct ast_node * right)
{
  struct ast_node * ast_node =
    emalloc (sizeof (struct ast_node));

  ast_node->node_type = node_type;

  ast_node->left = left;
  ast_node->right = right;

  return ast_node;
}

struct ast_node *
new_ast_relational_node (enum relational_operator operator,
                         struct ast_node * left,
                         struct ast_node * right)
{
  struct ast_relational_node * ast_node =
    emalloc (sizeof (struct ast_relational_node));

  ast_node->node_type = 'R';

  ast_node->operator = operator;
  ast_node->left = left;
  ast_node->right = right;

  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_equality_node (enum equality_operator operator,
                       struct ast_node * left,
                       struct ast_node * right)
{
  struct ast_equality_node * ast_node =
    emalloc (sizeof (struct ast_equality_node));

  ast_node->node_type = 'E';

  ast_node->operator = operator;
  ast_node->left = left;
  ast_node->right = right;

  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_function_node (struct symbol_node * symbol,
                       struct ast_node * arguments)
{
  struct ast_function_node * ast_node =
    emalloc (sizeof (struct ast_function_node));

  ast_node->node_type = 'F';

  ast_node->arguments = arguments;
  ast_node->symbol = symbol;

  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_symbol_reference_node (struct symbol_node * symbol)
{
  struct ast_symbol_reference_node * ast_node =
    emalloc (sizeof (struct ast_symbol_reference_node));

  ast_node->node_type = 'S';

  ast_node->symbol = symbol;

  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_if_node (struct ast_node * condition,
                 struct ast_node * if_branch,
                 struct ast_node * else_branch)
{
  struct ast_if_node * ast_node =
    emalloc (sizeof (struct ast_if_node));

  ast_node->node_type = 'I';

  ast_node->condition = condition;
  ast_node->if_branch = if_branch;
  ast_node->else_branch = else_branch;
  
  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_while_node (struct ast_node * condition,
                    struct ast_node * while_branch)
{
  struct ast_while_node * ast_node =
    emalloc (sizeof (struct ast_while_node));

  ast_node->node_type = 'W';

  ast_node->condition = condition;
  ast_node->while_branch = while_branch;
  
  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_assignment_node (struct symbol_node * symbol,
                         struct ast_node * value)
{
  struct ast_assignment_node * ast_node =
    emalloc (sizeof (struct ast_assignment_node));

  ast_node->node_type = 'A';

  ast_node->symbol = symbol;
  ast_node->value = value;

  return (struct ast_node *) ast_node;
}

struct ast_node *
new_ast_number_node (double value)
{
  struct ast_number_node * ast_node =
    emalloc (sizeof (struct ast_number_node));

  ast_node->node_type = 'N';

  ast_node->value = value;

  return (struct ast_node *) ast_node;
}

Next, I would like to present to you a recursive function that is used in the advanced calculator in order to deallocate safely the memory taken from the operating system for the needs of the syntax tree. The call of this function assures us that the program will not have memory leaks and memory taken will be returned back to the operating system when we really don’t need it any more.

Here follows the function:

void
free_ast_tree (struct ast_node * ast_tree)
{
  if (!ast_tree) return;

  switch (ast_tree->node_type)
  {
    /* two sub trees */
    case '+':
    case '-':
    case '*':
    case '/':
    case 'L':
      free_ast_tree (ast_tree->right);

    /* one sub tree */
    case 'M':
      free_ast_tree (ast_tree->left);

    /* no sub trees */
    case 'S':
    case 'N':
      break;

    case 'R':
      {
        struct ast_relational_node * node =
          (struct ast_relational_node *) ast_tree;

        free_ast_tree (node->left);

        free_ast_tree (node->right);
      }
      break;

    case 'E':
      {
        struct ast_equality_node * node =
          (struct ast_equality_node *) ast_tree;

        free_ast_tree (node->left);

        free_ast_tree (node->right);
      }
      break;

    case 'A':
      {
        struct ast_assignment_node * node =
          (struct ast_assignment_node *) ast_tree;

        free_ast_tree (node->value);
      }
      break;

    case 'I':
      {
        struct ast_if_node * node =
          (struct ast_if_node *) ast_tree;

        free_ast_tree (node->condition);

        if (node->if_branch)
        {
          free_ast_tree (node->if_branch);
        }

        if (node->else_branch)
        {
          free_ast_tree (node->else_branch);
        }
      }
      break;

    case 'W':
      {
        struct ast_while_node * node =
          (struct ast_while_node *) ast_tree;

        free_ast_tree (node->condition);

        if (node->while_branch)
        {
          free_ast_tree (node->while_branch);
        }
      }
      break;

    case 'F':
      {
        struct ast_function_node * node =
          (struct ast_function_node *) ast_tree;

        if (node->arguments)
        {
          free_ast_tree (node->arguments);
        }
      }
      break;

    default:
      eprintf ("Error: Bad node type '%c' to free!\n", ast_tree->node_type);
  }

  free (ast_tree);
}

Please, take account that the symbol table contains various symbol nodes that have been created by using also dynamic allocated memory. At some point of the execution this memory should also be freed. This operation is one of the responsibilities of the symbol table and we will not present it here since it is beyond the purpose of the article. However, you might notice that a symbol node indicating a function has a body code which is an AST node. This AST node can be freed safely with the presented recursive function.

We close the article by presenting two helper wrapper functions used in the parser:

void *
emalloc (size_t size)
{
  void * pointer = malloc (size);

  if (!pointer)
    eprintf ("Error: malloc(%u) failed!\n", size);

  return pointer;
}

void
eprintf (char * format, ...)
{
  va_list args;

  va_start (args, format);

  vfprintf (stderr, format, args);

  va_end (args);

  exit (EXIT_FAILURE);
}

Happy Hacking!