Using literal character tokens when designing lexers and parsers.

Filed under: Articles, Compilers & Interpreters

May 11, 2014

Sometimes, while exploring the source code of various free software Flex lexers and Bison parsers, I see named token declarations for single-character tokens.

Here are some of the characters involved, which I will use later for demonstration:

"+", "-", "*", "/", "=", "|", "(", ")"

Developers usually declare named tokens for these characters. I believe there is no need to declare literal character tokens unless we need to declare the type of their values. Rather than giving every token a name, we can use a single quoted character as a token, with the character's code (its ASCII value) serving as the token number. Bison starts the numbers for named tokens at 258, so collisions cannot occur. By convention, a literal character token represents the input consisting of that same character; for example, the token '+' represents the input character +. In practice, literal character tokens are therefore used only for punctuation and operators.
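
The numbering argument can be sketched in plain C. This is only an illustration under assumptions: the enum imitates the kind of numbering Bison generates for named tokens (NUMBER and NAME are borrowed from the grammar later in this article, and the explicit 258 follows the figure quoted above), while is_literal_token is a hypothetical helper of mine, not part of Bison.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical named tokens, numbered the way the text describes:
   they start at 258, above every single-character code. */
enum token_kind { NUMBER = 258, NAME };

/* Returns 1 if tok is a literal character token (its number is the
   character's own code), 0 if it is a named token. */
int is_literal_token(int tok)
{
    assert(tok > 0);  /* 0 conventionally signals end of input */
    return tok <= UCHAR_MAX;
}
```

Calling is_literal_token('+') yields 1 and is_literal_token(NUMBER) yields 0, because every character code fits below the first named token.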

A common idiom is to handle all single-character operators with one rule that returns yytext[0], the matched character itself, as the token. Here is a snippet of a simple Flex lexer that uses this idiom:

%%
... more lexer rules ...

"+" |
"-" |
"*" |
"/" |
"=" |
"|" |
"(" |
")" { return yytext[0]; }

... more lexer rules ...
%%
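
To make visible what this rule hands to the parser, here is a hand-written C stand-in for it. It is a sketch, not what Flex generates: the function name next_operator_token, the cursor parameter, and the -1 result for characters outside the rule are all my own assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hand-written stand-in for the Flex rule (hypothetical, not Flex output).
   Skips whitespace, then returns the operator character itself as the
   token number, 0 at end of input, or -1 for any other character. */
int next_operator_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;

    while (isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        *cursor = p;
        return 0;
    }

    int c = (unsigned char)*p;
    *cursor = p + 1;
    /* The same eight characters as in the Flex rule. */
    return strchr("+-*/=|()", c) != NULL ? c : -1;
}
```

Scanning the input "( + )" with repeated calls returns '(', '+', ')' and then 0, so the caller can compare tokens directly against character literals.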

A Bison parser can likewise use literal character tokens directly, as single quoted characters, in its BNF grammar rules. The following snippet shows a grammar rule that describes an expression in a programming language:

%%
... more grammar rules ...

exp
  : exp '+' exp           { $$ = new_ast_node ('+', $1, $3); }
  | exp '-' exp           { $$ = new_ast_node ('-', $1, $3); }
  | exp '*' exp           { $$ = new_ast_node ('*', $1, $3); }
  | exp '/' exp           { $$ = new_ast_node ('/', $1, $3); }
  | '|' exp               { $$ = new_ast_node ('|', $2, NULL); }
  | '(' exp ')'           { $$ = $2; }
  | '-' exp %prec UMINUS  { $$ = new_ast_node ('M', $2, NULL); }
  | NUMBER                { $$ = new_ast_number_node ($1); }
  | NAME                  { $$ = new_ast_symbol_reference_node ($1); }
  | NAME '=' exp          { $$ = new_ast_assignment_node ($1, $3); }
  | NAME '(' ')'          { $$ = new_ast_function_node ($1, NULL); }
  | NAME '(' exp_list ')' { $$ = new_ast_function_node ($1, $3); }
;

... more grammar rules ...
%%
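
The actions above pass the literal character on as the node type, so the same character codes can drive later stages too. The following C sketch shows one way that could look. Everything in it is an assumption, since the helper definitions are not part of the snippet: the struct layout, the 'K' tag for constants, the double parameter of new_ast_number_node, and the eval_ast function are all hypothetical, and only the arithmetic operators and 'M' are handled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node layout; the grammar snippet does not define it. */
struct ast {
    int node_type;      /* a literal character such as '+', or 'K' for a constant */
    struct ast *left;
    struct ast *right;
    double value;       /* meaningful only when node_type == 'K' */
};

struct ast *new_ast_node(int node_type, struct ast *left, struct ast *right)
{
    struct ast *node = malloc(sizeof *node);
    assert(node != NULL);
    node->node_type = node_type;
    node->left = left;
    node->right = right;
    node->value = 0.0;
    return node;
}

/* Assumed signature: the grammar passes $1 without showing its type. */
struct ast *new_ast_number_node(double value)
{
    struct ast *node = new_ast_node('K', NULL, NULL);
    node->value = value;
    return node;
}

/* Evaluates a tree by switching on the character codes stored by the parser. */
double eval_ast(const struct ast *node)
{
    switch (node->node_type) {
    case 'K': return node->value;
    case '+': return eval_ast(node->left) + eval_ast(node->right);
    case '-': return eval_ast(node->left) - eval_ast(node->right);
    case '*': return eval_ast(node->left) * eval_ast(node->right);
    case '/': return eval_ast(node->left) / eval_ast(node->right);
    case 'M': return -eval_ast(node->left);   /* unary minus */
    default:  abort();                         /* node kinds this sketch omits */
    }
}
```

Because the node type is just the character, the switch labels read like the grammar itself, and no separate enumeration of operator kinds is needed.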

Tags: compiler, flex, free software, grammar, lexer, literal, open source, token

Efstathios Chatzikyriakidis