\input texinfo.tex @c -*-texinfo-*-
@c %**start of header
@include version.texi
@settitle Lexical Analysis With Flex, for Flex @value{VERSION}
@set authors Vern Paxson, Will Estes and John Millaway
@c "Macro Hooks" index
@defindex hk
@c "Options" index
@defindex op
@dircategory Programming
@direntry
* flex: (flex).      Fast lexical analyzer generator (lex replacement).
@end direntry
@c %**end of header
@copying

The flex manual is placed under the same licensing conditions as the
rest of flex:
Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012
The Flex Project.
Copyright @copyright{} 1990, 1997 The Regents of the University of California.
All rights reserved.
This code is derived from software contributed to Berkeley by
Vern Paxson.
The United States Government has rights in this work pursuant
to contract no. DE-AC03-76SF00098 between the United States
Department of Energy and the University of California.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
@enumerate
@item
Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

@item
Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
@end enumerate
Neither the name of the University nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
@end copying
@titlepage
@title Lexical Analysis with Flex
@subtitle Edition @value{EDITION}, @value{UPDATED}
@author @value{authors}
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@ifnottex
@node Top, Copyright, (dir), (dir)
@top flex
This manual describes @code{flex}, a tool for generating programs that
perform pattern-matching on text. The manual includes both tutorial and
reference sections.
This edition of @cite{The flex Manual} documents @code{flex} version
@value{VERSION}. It was last updated on @value{UPDATED}.
This manual was written by @value{authors}.
@menu
* Copyright::
* Reporting Bugs::
* Introduction::
* Simple Examples::
* Format::
* Patterns::
* Matching::
* Actions::
* Generated Scanner::
* Start Conditions::
* Multiple Input Buffers::
* EOF::
* Misc Macros::
* User Values::
* Yacc::
* Scanner Options::
* Performance::
* Cxx::
* Reentrant::
* Lex and Posix::
* Memory Management::
* Serialized Tables::
* Diagnostics::
* Limitations::
* Bibliography::
* FAQ::
* Appendices::
* Indices::
@detailmenu
 --- The Detailed Node Listing ---
Format of the Input File
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
Scanner Options
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
Reentrant C Scanners
* Reentrant Uses::
* Reentrant Overview::
* Reentrant Example::
* Reentrant Detail::
* Reentrant Functions::
The Reentrant API in Detail
* Specify Reentrant::
* Extra Reentrant Argument::
* Global Replacement::
* Init and Destroy Functions::
* Accessor Methods::
* Extra Data::
* About yyscan_t::
Memory Management
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
Serialized Tables
* Creating Serialized Tables::
* Loading and Unloading Serialized Tables::
* Tables File Format::
FAQ

* When was flex born?::
* How do I expand backslash-escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesn't yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isn't working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesn't flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* I need to scan if-then-else blocks and while loops::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
* What is the difference between YYLEX_PARAM and YY_DECL?::
* Why do I get "conflicting types for yylex" error?::
* How do I access the values set in a Flex action from within a Bison action?::
Appendices

* Makefiles and Flex::
* Bison Bridge::
* M4 Dependency::
* Common Patterns::
Indices

* Concept Index::
* Index of Functions and Macros::
* Index of Variables::
* Index of Data Types::
* Index of Hooks::
* Index of Scanner Options::
@end detailmenu
@end menu
@end ifnottex
@node Copyright, Reporting Bugs, Top, Top
@chapter Copyright
@cindex copyright of flex
@cindex distributing flex

@insertcopying
@node Reporting Bugs, Introduction, Copyright, Top
@chapter Reporting Bugs
@cindex bugs, reporting
@cindex reporting bugs
If you find a bug in @code{flex}, please report it using
the SourceForge Bug Tracking facilities which can be found on
@url{,flex's SourceForge Page}.
@node Introduction, Simple Examples, Reporting Bugs, Top
@chapter Introduction
@cindex scanner, definition of
@code{flex} is a tool for generating @dfn{scanners}. A scanner is a
program which recognizes lexical patterns in text. The @code{flex}
program reads the given input files, or its standard input if no file
names are given, for a description of a scanner to generate. The
description is in the form of pairs of regular expressions and C code,
called @dfn{rules}. @code{flex} generates as output a C source file,
@file{lex.yy.c} by default, which defines a routine @code{yylex()}.
This file can be compiled and linked with the flex runtime library to
produce an executable. When the executable is run, it analyzes its
input for occurrences of the regular expressions. Whenever it finds
one, it executes the corresponding C code.
@node Simple Examples, Format, Introduction, Top
@chapter Some Simple Examples
First some simple examples to get the flavor of how one uses
@code{flex}.
@cindex username expansion
The following @code{flex} input specifies a scanner which, when it
encounters the string @samp{username} will replace it with the user's
login name:
@example
@verbatim
    %%
    username    printf( "%s", getlogin() );
@end verbatim
@end example
@cindex default rule
@cindex rules, default
By default, any text not matched by a @code{flex} scanner is copied to
the output, so the net effect of this scanner is to copy its input file
to its output with each occurrence of @samp{username} expanded. In this
input, there is just one rule. @samp{username} is the @dfn{pattern} and
the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the
beginning of the rules.
Here's another simple example:
@cindex counting characters and lines
@example
@verbatim
            int num_lines = 0, num_chars = 0;

    %%
    \n      ++num_lines; ++num_chars;
    .       ++num_chars;

    %%

    int main()
            {
            yylex();
            printf( "# of lines = %d, # of chars = %d\n",
                    num_lines, num_chars );
            }
@end verbatim
@end example
This scanner counts the number of characters and the number of lines in
its input. It produces no output other than the final report on the
character and line counts. The first line declares two globals,
@code{num_lines} and @code{num_chars}, which are accessible both inside
@code{yylex()} and in the @code{main()} routine declared after the
second @samp{%%}. There are two rules, one which matches a newline
(@samp{\n}) and increments both the line count and the character count,
and one which matches any character other than a newline (indicated by
the @samp{.} regular expression).
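
For readers who prefer to see the same logic in plain C, the two rules
behave roughly like the hand-written loop below. This is only an
illustrative sketch, not the code that @code{flex} generates; the type
@code{struct counts} and the function @code{count_chars_and_lines} are
names invented for this example.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch; not part of flex. */
struct counts {
    int num_lines;
    int num_chars;
};

/* Mimics the two rules: a newline bumps both counters, and any other
 * character (the "." rule) bumps only the character counter. */
static struct counts count_chars_and_lines(const char *input)
{
    struct counts c = { 0, 0 };

    for (size_t i = 0; input[i] != '\0'; ++i) {
        if (input[i] == '\n')
            ++c.num_lines;
        ++c.num_chars;
    }
    return c;
}
```

Unlike this sketch, the generated scanner reads from @code{yyin} and can
process NUL bytes, so the loop above should be read only as a model of
what the two actions compute.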
A somewhat more complicated example:
@cindex Pascal-like language
@example
@verbatim
    /* scanner for a toy Pascal-like language */

    %{
    /* need this for the calls to atoi() and atof() below */
    #include <stdlib.h>
    %}

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*

    %%

    {DIGIT}+    {
                printf( "An integer: %s (%d)\n", yytext,
                        atoi( yytext ) );
                }

    {DIGIT}+"."{DIGIT}*        {
                printf( "A float: %s (%g)\n", yytext,
                        atof( yytext ) );
                }

    if|then|begin|end|procedure|function        {
                printf( "A keyword: %s\n", yytext );
                }

    {ID}        printf( "An identifier: %s\n", yytext );

    "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

    "{"[^}\n]*"}"     /* eat up one-line comments */

    [ \t\n]+          /* eat up whitespace */

    .           printf( "Unrecognized character: %s\n", yytext );

    %%

    int main( int argc, char **argv )
        {
        ++argv, --argc;  /* skip over program name */
        if ( argc > 0 )
                yyin = fopen( argv[0], "r" );
        else
                yyin = stdin;

        yylex();
        }
@end verbatim
@end example
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of @dfn{tokens} and reports on what it has
found.

The details of this example will be explained in the following
sections.
@node Format, Patterns, Simple Examples, Top
@chapter Format of the Input File
@cindex format of flex input
@cindex input, format of
@cindex file format
@cindex sections of flex input
The @code{flex} input file consists of three sections, separated by a
line containing only @samp{%%}.
@cindex format of input file
@example
@verbatim
    definitions
    %%
    rules
    %%
    user code
@end verbatim
@end example
@menu
* Definitions Section::
* Rules Section::
* User Code Section::
* Comments in the Input::
@end menu
@node Definitions Section, Rules Section, Format, Format
@section Format of the Definitions Section
@cindex input file, Definitions section
@cindex Definitions, in flex input
The @dfn{definitions section} contains declarations of simple @dfn{name}
definitions to simplify the scanner specification, and declarations of
@dfn{start conditions}, which are explained in a later section.
@cindex aliases, how to define
@cindex pattern aliases, how to define
Name definitions have the form:
@example
@verbatim
    name definition
@end verbatim
@end example
The @samp{name} is a word beginning with a letter or an underscore
(@samp{_}) followed by zero or more letters, digits, @samp{_}, or
@samp{-} (dash). The definition is taken to begin at the first
non-whitespace character following the name and to continue to the end of
the line. The definition can subsequently be referred to using
@samp{@{name@}}, which will expand to @samp{(definition)}. For example,
@cindex pattern aliases, defining
@cindex defining pattern aliases
@example
@verbatim
    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
@end verbatim
@end example

defines @samp{DIGIT} to be a regular expression which matches a single
digit, and @samp{ID} to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits. A subsequent reference to

@cindex pattern aliases, use of
@example
@verbatim
    {DIGIT}+"."{DIGIT}*
@end verbatim
@end example

is identical to

@example
@verbatim
    ([0-9])+"."([0-9])*
@end verbatim
@end example
and matches one-or-more digits followed by a @samp{.} followed by
zero-or-more digits.
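
The substitution of a parenthesized definition for each
@samp{@{name@}} reference can be modeled with a few lines of C. The
sketch below is an assumption-laden illustration, not the way
@code{flex} is implemented: the table @code{defs} and the functions
@code{lookup} and @code{expand_names} are hypothetical, and the code
ignores quoting and character classes, which a real implementation
would have to respect.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical definition table for this sketch. */
struct name_def {
    const char *name;
    const char *definition;
};

static const struct name_def defs[] = {
    { "DIGIT", "[0-9]" },
    { "ID",    "[a-z][a-z0-9]*" },
};

/* Find the definition for a name of the given length, or NULL. */
static const char *lookup(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof defs / sizeof defs[0]; ++i)
        if (strlen(defs[i].name) == len &&
            strncmp(defs[i].name, name, len) == 0)
            return defs[i].definition;
    return NULL;
}

/* Copy pattern into out, replacing each known {name} with (definition).
 * Returns 0 on success, or -1 if out is too small. */
static int expand_names(const char *pattern, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    while (*pattern != '\0') {
        const char *close = NULL;
        const char *def = NULL;
        int written;

        /* A name starts with a letter or an underscore. */
        if (pattern[0] == '{' &&
            (isalpha((unsigned char)pattern[1]) || pattern[1] == '_')) {
            close = strchr(pattern, '}');
            if (close != NULL)
                def = lookup(pattern + 1, (size_t)(close - pattern - 1));
        }

        if (def != NULL) {
            written = snprintf(out + used, out_size - used, "(%s)", def);
            pattern = close + 1;
        } else {
            written = snprintf(out + used, out_size - used, "%c", *pattern);
            ++pattern;
        }
        if (written < 0 || (size_t)written >= out_size - used)
            return -1;
        used += (size_t)written;
    }
    return 0;
}
```

The added parentheses are what keep a following operator, such as the
@samp{+} in the example above, applied to the whole definition rather
than to its last element.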
@cindex comments in flex input
An unindented comment (i.e., a line
beginning with @samp{/*}) is copied verbatim to the output up
to the next @samp{*/}.
@cindex %@{ and %@}, in Definitions Section
@cindex embedding C code in flex input
@cindex C code in flex input
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is also copied verbatim to the output (with the %@{ and %@} symbols
removed). The %@{ and %@} symbols must appear unindented on lines by
themselves.
@cindex %top
A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except
that the code in a @code{%top} block is relocated to the @emph{top} of the
generated file, before any flex definitions @footnote{Actually,
@code{yyIN_HEADER} is defined before the @samp{%top} block.}.
The @code{%top} block is useful when you want certain preprocessor macros to be
defined or certain files to be included before the generated code.
The single characters @samp{@{} and @samp{@}} are used to delimit the
@code{%top} block, as shown in the example below:

@example
@verbatim
    %top{
        /* This code goes at the "top" of the generated file. */
        #include <stdint.h>
        #include <inttypes.h>
    }
@end verbatim
@end example
Multiple @code{%top} blocks are allowed, and their order is preserved.
@node Rules Section, User Code Section, Definitions Section, Format
@section Format of the Rules Section
@cindex input file, Rules Section
@cindex rules, in flex input
The @dfn{rules} section of the @code{flex} input contains a series of
rules of the form:
@example
@verbatim
    pattern   action
@end verbatim
@end example
where the pattern must be unindented and the action must begin
on the same line.
@xref{Patterns}, for a further description of patterns and actions.
In the rules section, any indented or %@{ %@} enclosed text appearing
before the first rule may be used to declare variables which are local
to the scanning routine and (after the declarations) code which is to be
executed whenever the scanning routine is entered. Other indented or
%@{ %@} text in the rule section is still copied to the output, but its
meaning is not well-defined and it may well cause compile-time errors
(this feature is present for @acronym{POSIX} compliance. @xref{Lex and
Posix}, for other such features).
Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}}
is copied verbatim to the output (with the %@{ and %@} symbols removed).
The %@{ and %@} symbols must appear unindented on lines by themselves.
@node User Code Section, Comments in the Input, Rules Section, Format
@section Format of the User Code Section
@cindex input file, user code Section
@cindex user code, in flex input
The user code section is simply copied to @file{lex.yy.c} verbatim. It
is used for companion routines which call or are called by the scanner.
The presence of this section is optional; if it is missing, the second
@samp{%%} in the input file may be skipped, too.
@node Comments in the Input, , User Code Section, Format
@section Comments in the Input
@cindex comments, syntax of
Flex supports C-style comments, that is, anything between @samp{/*} and
@samp{*/} is
considered a comment. Whenever flex encounters a comment, it copies the
entire comment verbatim to the generated source code. Comments may
appear just about anywhere, but with the following exceptions:
@itemize
@cindex comments, in rules section
@item
Comments may not appear in the Rules Section wherever flex is expecting
a regular expression. This means comments may not appear at the
beginning of a line, or immediately following a list of scanner states.
@item
Comments may not appear on an @samp{%option} line in the Definitions
Section.
@end itemize
If you want to follow a simple rule, then always begin a comment on a
new line, with one or more whitespace characters before the initial
@samp{/*}. This rule will work anywhere in the input file.
All the comments in the following example are valid:
@cindex comments, valid uses of
@cindex comments in the input
@example
@verbatim
%{
/* code block */
%}

/* Definitions Section */

%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */

        /* Rules Section (indented) */
ruleC   ECHO;
ruleD   ECHO;
%{
/* code block */
%}
%%
/* User Code Section */
@end verbatim
@end example
@node Patterns, Matching, Format, Top
@chapter Patterns
@cindex patterns, in rules section
@cindex regular expressions, in patterns
The patterns in the input (see @ref{Rules Section}) are written using an
extended set of regular expressions. These are:
@cindex patterns, syntax
@table @samp
@item x
match the character 'x'
@item .
any character (byte) except newline
@cindex [] in patterns
@cindex character classes in patterns, syntax of
@cindex POSIX, character classes in patterns, syntax of
@item [xyz]
a @dfn{character class}; in this case, the pattern
matches either an 'x', a 'y', or a 'z'
@cindex ranges in patterns
@item [abj-oZ]
a "character class" with a range in it; matches
an 'a', a 'b', any letter from 'j' through 'o',
or a 'Z'
@cindex ranges in patterns, negating
@cindex negating ranges in patterns
@item [^A-Z]
a "negated character class", i.e., any character
but those in the class. In this case, any
character EXCEPT an uppercase letter.
@item [^A-Z\n]
any character EXCEPT an uppercase letter or
a newline
@item [a-z]@{-@}[aeiou]
the lowercase consonants
@item r*
zero or more r's, where r is any regular expression
@item r+
one or more r's
@item r?
zero or one r's (that is, ``an optional r'')
@cindex braces in patterns
@item r@{2,5@}
anywhere from two to five r's
@item r@{2,@}
two or more r's
@item r@{4@}
exactly 4 r's
@cindex pattern aliases, expansion of
@item @{name@}
the expansion of the @samp{name} definition
@cindex literal text in patterns, syntax of
@cindex verbatim text in patterns, syntax of
@item "[xyz]\"foo"
the literal string: @samp{[xyz]"foo}
@cindex escape sequences in patterns, syntax of
@item \X
if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or
@samp{v}, then the ANSI-C interpretation of @samp{\X}. Otherwise, a
literal @samp{X} (used to escape operators such as @samp{*})
@cindex NULL character in patterns, syntax of
@item \0
a NUL character (ASCII code 0)
@cindex octal characters in patterns
@item \123
the character with octal value 123
@item \x2a
the character with hexadecimal value 2a
@item (r)
match an @samp{r}; parentheses are used to override precedence (see below)
@item (?r-s:pattern)
apply option @samp{r} and omit option @samp{s} while interpreting pattern.
Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}.
@samp{i} means case-insensitive. @samp{-i} means case-sensitive.
@samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever.
@samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}.
@samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless
it is backslash-escaped, contained within @samp{""}s, or appears inside a
character class.
The following are all valid:
@verbatim
(?:foo)         same as  (foo)
(?i:ab7)        same as  ([aA][bB]7)
(?-i:ab)        same as  (ab)
(?s:.)          same as  [\x00-\xFF]
(?-s:.)         same as  [^\n]
(?ix-s: a . b)  same as  ([Aa][^\n][bB])
(?x:a  b)       same as  ("ab")
(?x:a\ b)       same as  ("a b")
(?x:a" "b)      same as  ("a b")
(?x:a[ ]b)      same as  ("a b")
(?x:a
    /* comment */
    b
    c)          same as  (abc)
@end verbatim
@item (?# comment )
omit everything within @samp{()}. The first @samp{)}
character encountered ends the pattern. It is not possible for the comment
to contain a @samp{)} character. The comment may span lines.
@cindex concatenation, in patterns
@item rs
the regular expression @samp{r} followed by the regular expression @samp{s}; called
@dfn{concatenation}
@item r|s
either an @samp{r} or an @samp{s}
@cindex trailing context, in patterns
@item r/s
an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is
included when determining whether this rule is the longest match, but is
then returned to the input before the action is executed. So the action
only sees the text matched by @samp{r}. This type of pattern is called
@dfn{trailing context}. (There are some combinations of @samp{r/s} that flex
cannot match correctly. @xref{Limitations}, regarding dangerous trailing
context.)
@cindex beginning of line, in patterns
@cindex BOL, in patterns
@item ^r
an @samp{r}, but only at the beginning of a line (i.e.,
when just starting to scan, or right after a
newline has been scanned).
@cindex end of line, in patterns
@cindex EOL, in patterns
@item r$
an @samp{r}, but only at the end of a line (i.e., just before a
newline). Equivalent to @samp{r/\n}.
@cindex newline, matching in patterns
Note that @code{flex}'s notion of ``newline'' is exactly
whatever the C compiler used to compile @code{flex}
interprets @samp{\n} as; in particular, on some DOS
systems you must either filter out @samp{\r}s in the
input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}.
@cindex start conditions, in patterns
@item <s>r
an @samp{r}, but only in start condition @code{s} (see @ref{Start
Conditions} for discussion of start conditions).
@item <s1,s2,s3>r
same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}.
@item <*>r
an @samp{r} in any start condition, even an exclusive one.
@cindex end of file, in patterns
@cindex EOF in patterns, syntax of
@item <<EOF>>
an end-of-file.
@item <s1,s2><<EOF>>
an end-of-file when in start condition @code{s1} or @code{s2}
@end table
Note that inside of a character class, all regular expression operators
lose their special meaning except escape (@samp{\}) and the character class
operators, @samp{-}, @samp{]]}, and, at the beginning of the class, @samp{^}.
@cindex patterns, precedence of operators
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence (see special note on the
precedence of the repeat operator, @samp{@{@}}, under the documentation
for the @samp{--posix} POSIX compliance option). For example,
@cindex patterns, grouping and precedence
@example
@verbatim
    foo|bar*
@end verbatim
@end example

is the same as

@example
@verbatim
    (foo)|(ba(r*))
@end verbatim
@end example
since the @samp{*} operator has higher precedence than concatenation,
and concatenation higher than alternation (@samp{|}). This pattern
therefore matches @emph{either} the string @samp{foo} @emph{or} the
string @samp{ba} followed by zero-or-more @samp{r}'s. To match
@samp{foo} or zero-or-more repetitions of the string @samp{bar}, use:
@example
@verbatim
    foo|(bar)*
@end verbatim
@end example

And to match a sequence of zero or more repetitions of @samp{foo} and
@samp{bar}:

@cindex patterns, repetitions with grouping
@example
@verbatim
    (foo|bar)*
@end verbatim
@end example
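
POSIX extended regular expressions give repetition, concatenation, and
alternation the same relative precedence, so the grouping rules can be
checked with @code{regcomp()} and @code{regexec()} from
@file{<regex.h>}. The sketch below assumes a POSIX system. The helper
@code{full_match} is a name invented for this example, and it wraps the
pattern in @samp{^(...)$} only to force a whole-string match, which is
something the POSIX matcher needs and @code{flex} does not.

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole of text matches the POSIX
 * extended regular expression pattern, 0 if it does not, and -1 if the
 * pattern is too long or does not compile. */
static int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int matched;
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);

    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    matched = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

With this helper, @code{full_match("foo|bar*", "barrr")} reports a match
while @code{full_match("foo|bar*", "barbar")} does not, and
@code{full_match("foo|(bar)*", "barbar")} reports a match.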
@cindex character classes in patterns
In addition to characters and ranges of characters, character classes
can also contain @dfn{character class expressions}. These are
expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which
themselves must appear between the @samp{[} and @samp{]} of the
character class. Other elements may occur inside the character class,
too). The valid expressions are:
@cindex patterns, valid character classes
@example
@verbatim
    [:alnum:] [:alpha:] [:blank:]
    [:cntrl:] [:digit:] [:graph:]
    [:lower:] [:print:] [:punct:]
    [:space:] [:upper:] [:xdigit:]
@end verbatim
@end example
These expressions all designate a set of characters equivalent to the
corresponding standard C @code{isXXX} function. For example,
@samp{[:alnum:]} designates those characters for which @code{isalnum()}
returns true - i.e., any alphabetic or numeric character. Some systems
don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a
blank or a tab.
For example, the following character classes are all equivalent:
@cindex character classes, equivalence of
@cindex patterns, character class equivalence
@example
@verbatim
    [[:alnum:]]
    [[:alpha:][:digit:]]
    [[:alpha:][0-9]]
    [a-zA-Z0-9]
@end verbatim
@end example
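
Because these expressions are defined in terms of the C @code{isXXX}
functions, the equivalence can be checked directly in C. The sketch
below assumes the program runs in the default "C" locale on an
ASCII-compatible system; both function names are invented for this
example.

```c
#include <ctype.h>

/* Returns 1 if isalnum() agrees with isalpha() || isdigit() for every
 * byte value, which corresponds to [[:alnum:]] versus
 * [[:alpha:][:digit:]]. */
static int alnum_matches_alpha_or_digit(void)
{
    for (int c = 0; c <= 255; ++c) {
        int alnum = isalnum(c) != 0;
        int alpha_or_digit = (isalpha(c) != 0) || (isdigit(c) != 0);

        if (alnum != alpha_or_digit)
            return 0;
    }
    return 1;
}

/* Returns 1 if isalnum() selects exactly the bytes in [a-zA-Z0-9].
 * This holds in the "C" locale; other locales may add letters. */
static int alnum_matches_ascii_ranges(void)
{
    for (int c = 0; c <= 255; ++c) {
        int in_ranges = (c >= 'a' && c <= 'z') ||
                        (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9');

        if ((isalnum(c) != 0) != in_ranges)
            return 0;
    }
    return 1;
}
```

The second check is the one that depends on the locale, which is why an
explicit range and a character class expression can select different
characters when @code{flex} is run in another locale.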
A word of caution. Character classes are expanded immediately when seen in the @code{flex} input.
This means the character classes are sensitive to the locale in which @code{flex}
is executed, and the resulting scanner will not be sensitive to the runtime locale.
This may or may not be desirable.
@itemize
@cindex case-insensitive, effect on character classes
@item If your scanner is case-insensitive (the @samp{-i} flag), then
@samp{[:upper:]} and @samp{[:lower:]} are equivalent to
@samp{[:alpha:]}.
@anchor{case and character ranges}
@item Character classes with ranges, such as @samp{[a-Z]}, should be used with
caution in a case-insensitive scanner if the range spans upper or lowercase
characters. Flex does not know if you want to fold all upper and lowercase
characters together, or if you want the literal numeric range specified (with
no case folding). When in doubt, flex will assume that you meant the literal
numeric range, and will issue a warning. The exception to this rule is a
character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you
want case-folding to occur. Here are some examples with the @samp{-i} flag
enabled:
@multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}}
@item Range @tab Result @tab Literal Range @tab Alternate Range
@item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab
@item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]}
@item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]}
@item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]}
@end multitable
@item
@cindex end of line, in negated character classes
@cindex EOL, in negated character classes
A negated character class such as the example @samp{[^A-Z]} above
@emph{will} match a newline unless @samp{\n} (or an equivalent escape
sequence) is one of the characters explicitly present in the negated
character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other
regular expression tools treat negated character classes, but
unfortunately the inconsistency is historically entrenched. Matching
newlines means that a pattern like @samp{[^"]*} can match the entire
input unless there's another quote in the input.
@item
Flex allows negation of character class expressions by prepending @samp{^} to
the POSIX character class name.

@example
@verbatim
    [:^alnum:] [:^alpha:] [:^blank:]
    [:^cntrl:] [:^digit:] [:^graph:]
    [:^lower:] [:^print:] [:^punct:]
    [:^space:] [:^upper:] [:^xdigit:]
@end verbatim
@end example
Flex will issue a warning if the expressions @samp{[:^upper:]} and
@samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is
unclear. The current behavior is to skip them entirely, but this may change
without notice in future revisions of flex.
@item
The @samp{@{-@}} operator computes the difference of two character classes. For
example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class
@samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is
just the single character @samp{a}). The @samp{@{-@}} operator is left
associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful
not to accidentally create an empty set, which will never match.
@item
The @samp{@{+@}} operator computes the union of two character classes. For
example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator
is useful when preceded by the result of a difference operation, as in,
@samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to
@samp{[A-Zq]} in the "C" locale.
@item
@cindex trailing context, limits of
@cindex ^ as non-special character in patterns
@cindex $ as normal character in patterns
A rule can have at most one instance of trailing context (the @samp{/} operator
or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns
can only occur at the beginning of a pattern, and, as well as with @samp{/} and @samp{$},
cannot be grouped inside parentheses. A @samp{^} which does not occur at
the beginning of a rule or a @samp{$} which does not occur at the end of
a rule loses its special properties and is treated as a normal character.
The following are invalid:
@cindex patterns, invalid trailing context
@example
@verbatim
    foo/bar$
    <sc1>foo<sc2>bar
@end verbatim
@end example
Note that the first of these can be written @samp{foo/bar\n}.
The following will result in @samp{$} or @samp{^} being treated as a normal character:
@cindex patterns, special characters treated as non-special
@example
@verbatim
    foo|(bar$)
    foo|^bar
@end verbatim
@end example
If the desired meaning is a @samp{foo} or a
@samp{bar}-followed-by-a-newline, the following could be used (the
special @code{|} action is explained below, @pxref{Actions}):
@cindex patterns, end of line
@example
@verbatim
    foo      |
    bar$     /* action goes here */
@end verbatim
@end example

A similar trick will work for matching a @samp{foo} or a
@samp{bar}-at-the-beginning-of-a-line.
@end itemize
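
The character class operations described in this chapter (negation,
@samp{@{-@}}, and @samp{@{+@}}) are ordinary set operations over the 256
possible byte values. The following sketch models them in C. It is
meant only as an illustration under that assumption: the type
@code{struct byte_set} and all of the function names are invented here
and are not taken from the @code{flex} sources.

```c
#include <string.h>

#define BYTE_VALUES 256

/* Hypothetical representation: one flag per possible byte value. */
struct byte_set {
    unsigned char member[BYTE_VALUES];
};

/* The class [lo-hi]. */
static struct byte_set set_range(unsigned char lo, unsigned char hi)
{
    struct byte_set s;

    memset(&s, 0, sizeof s);
    for (int c = lo; c <= hi; ++c)
        s.member[c] = 1;
    return s;
}

/* A negated class such as [^A-Z]: every byte not in a. */
static struct byte_set set_negate(struct byte_set a)
{
    for (int c = 0; c < BYTE_VALUES; ++c)
        a.member[c] = !a.member[c];
    return a;
}

/* a{-}b: bytes that are in a but not in b. */
static struct byte_set set_difference(struct byte_set a, struct byte_set b)
{
    for (int c = 0; c < BYTE_VALUES; ++c)
        a.member[c] = a.member[c] && !b.member[c];
    return a;
}

/* a{+}b: bytes that are in a or in b. */
static struct byte_set set_union(struct byte_set a, struct byte_set b)
{
    for (int c = 0; c < BYTE_VALUES; ++c)
        a.member[c] = a.member[c] || b.member[c];
    return a;
}

static int set_contains(struct byte_set a, unsigned char c)
{
    return a.member[c] != 0;
}

static int set_count(struct byte_set a)
{
    int n = 0;

    for (int c = 0; c < BYTE_VALUES; ++c)
        n += a.member[c] != 0;
    return n;
}
```

In this model, the negation of @code{set_range('A', 'Z')} contains the
newline byte, which mirrors the behavior of @samp{[^A-Z]}, and the
difference of @code{set_range('a', 'c')} and @code{set_range('b', 'z')}
contains only @samp{a}.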
@node Matching, Actions, Patterns, Top
@chapter How the Input Is Matched
@cindex patterns, matching
@cindex input, matching
@cindex trailing context, matching
@cindex matching, and trailing context
@cindex matching, length of
@cindex matching, multiple matches
When the generated scanner is run, it analyzes its input looking for
strings which match any of its patterns. If it finds more than one
match, it takes the one matching the most text (for trailing context
rules, this includes the length of the trailing part, even though it
will then be returned to the input). If it finds two or more matches of
the same length, the rule listed first in the @code{flex} input file is
chosen.
@cindex token
@cindex yytext
@cindex yyleng
Once the match is determined, the text corresponding to the match
(called the @dfn{token}) is made available in the global character
pointer @code{yytext}, and its length in the global integer
@code{yyleng}. The @dfn{action} corresponding to the matched pattern is
then executed (@pxref{Actions}), and then the remaining input is scanned
for another match.
@cindex default rule
If no match is found, then the @dfn{default rule} is executed: the next
character in the input is considered matched and copied to the standard
output. Thus, the simplest valid @code{flex} input is:
@cindex minimal scanner
@example
@verbatim
    %%
@end verbatim
@end example
which generates a scanner that simply copies its input (one character at
a time) to its output.
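
The selection policy described in this chapter (longest match wins, ties
go to the rule listed first, and the default rule consumes one
character when nothing matches) can be modeled in a few lines of C. The
sketch below is a simplified illustration: a generated scanner uses a
single DFA rather than trying each rule in turn, and the three matcher
functions and @code{choose_rule} are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical matcher returns the length of the longest prefix of
 * input that its pattern matches, or 0 if there is no match. */
typedef size_t (*matcher)(const char *input);

static size_t match_keyword_if(const char *input)   /* if */
{
    return strncmp(input, "if", 2) == 0 ? 2 : 0;
}

static size_t match_identifier(const char *input)   /* [a-z]+ */
{
    size_t n = 0;

    while (islower((unsigned char)input[n]))
        ++n;
    return n;
}

static size_t match_integer(const char *input)      /* [0-9]+ */
{
    size_t n = 0;

    while (isdigit((unsigned char)input[n]))
        ++n;
    return n;
}

/* Rules in the order they would appear in the input file. */
static const matcher rules[] = {
    match_keyword_if, match_identifier, match_integer
};

/* Returns the index of the chosen rule, or -1 when only the default
 * rule applies. *length receives the number of characters consumed. */
static int choose_rule(const char *input, size_t *length)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
        size_t len = rules[i](input);

        /* Strictly greater: on a tie the earlier rule is kept. */
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }

    if (best >= 0)
        *length = best_len;
    else
        *length = (input[0] != '\0') ? 1 : 0;  /* default rule: one char */
    return best;
}
```

For the input @samp{if} the keyword rule and the identifier rule both
match two characters, so the keyword rule, which is listed first, is
chosen; for @samp{iffy} the identifier rule matches more text and wins.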
@cindex yytext, two types of
@cindex %array, use of
@cindex %pointer, use of
@vindex yytext
Note that @code{yytext} can be defined in two different ways: either as
a character @emph{pointer} or as a character @emph{array}. You can
control which definition @code{flex} uses by including one of the
special directives @code{%pointer} or @code{%array} in the first
(definitions) section of your flex input. The default is
@code{%pointer}, unless you use the @samp{-l} lex compatibility option,
in which case @code{yytext} will be an array. The advantage of using
@code{%pointer} is substantially faster scanning and no buffer overflow
when matching very large tokens (unless you run out of dynamic memory).
The disadvantage is that you are restricted in how your actions can
modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()}
function destroy the present contents of @code{yytext}, which can be a
considerable porting headache when moving between different @code{lex}
versions.
@cindex %array, advantages of
The advantage of @code{%array} is that you can then modify @code{yytext}
to your heart's content, and calls to @code{unput()} do not destroy
@code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex}
programs sometimes access @code{yytext} externally using declarations of
the form:
@example
@verbatim
    extern char yytext[];
@end verbatim
@end example
This definition is erroneous when used with @code{%pointer}, but correct
for @code{%array}.
The @code{%array} declaration defines @code{yytext} to be an array of
@code{YYLMAX} characters, which defaults to a fairly large value. You
can change the size by simply #define'ing @code{YYLMAX} to a different
value in the first section of your @code{flex} input. As mentioned
above, with @code{%pointer} yytext grows dynamically to accommodate
large tokens. While this means your @code{%pointer} scanner can
accommodate very large tokens (such as matching entire blocks of
comments), bear in mind that each time the scanner must resize
@code{yytext} it also must rescan the entire token from the beginning,
so matching such tokens can prove slow. @code{yytext} presently does
@emph{not} dynamically grow if a call to @code{unput()} results in too
much text being pushed back; instead, a run-time error results.
@cindex %array, with C++
Also note that you cannot use @code{%array} with C++ scanner classes.
@node Actions, Generated Scanner, Matching, Top
@chapter Actions
@cindex actions
Each pattern in a rule has a corresponding @dfn{action}, which can be
any arbitrary C statement. The pattern ends at the first non-escaped
whitespace character; the remainder of the line is its action. If the
action is empty, then when the pattern is matched the input token is
simply discarded. For example, here is the specification for a program
which deletes all occurrences of @samp{zap me} from its input:
@cindex deleting lines from input
@example
@verbatim
    %%
    "zap me"
@end verbatim
@end example
This example will copy all other characters in the input to the output
since they will be matched by the default rule.
Here is a program which compresses multiple blanks and tabs down to a
single blank, and throws away whitespace found at the end of a line:
@cindex whitespace, compressing
@cindex compressing whitespace
@example
@verbatim
    %%
    [ \t]+        putchar( ' ' );
    [ \t]+$       /* ignore this token */
@end verbatim
@end example
@cindex %@{ and %@}, in Rules Section
@cindex actions, use of @{ and @}
@cindex actions, embedded C strings
@cindex C-strings, in actions
@cindex comments, in actions
If the action contains a @samp{@{}, then the action spans till the
balancing @samp{@}} is found, and the action may cross multiple lines.
@code{flex} knows about C strings and comments and won't be fooled by
braces found within them, but also allows actions to begin with
@samp{%@{} and will consider the action to be all the text up to the
next @samp{%@}} (regardless of ordinary braces inside the action).
@cindex |, in actions
An action consisting solely of a vertical bar (@samp{|}) means ``same as the
action for the next rule''. See below for an illustration.
Actions can include arbitrary C code, including @code{return} statements
to return a value to whatever routine called @code{yylex()}. Each time
@code{yylex()} is called it continues processing tokens from where it
last left off until it either reaches the end of the file or executes a
@code{return}.
@cindex yytext, modification of
Actions are free to modify @code{yytext} except for lengthening it
(adding characters to its end--these will overwrite later characters in
the input stream). This however does not apply when using @code{%array}
(@pxref{Matching}). In that case, @code{yytext} may be freely modified
in any way.
@cindex yyleng, modification of
@cindex yymore, and yyleng
Actions are free to modify @code{yyleng} except they should not do so if
the action also includes use of @code{yymore()} (see below).
@cindex preprocessor macros, for use in actions
There are a number of special directives which can be included within an
action:
@table @code
@item ECHO
@cindex ECHO
copies @code{yytext} to the scanner's output.
@item BEGIN
@cindex BEGIN
followed by the name of a start condition places the scanner in the
corresponding start condition (see below).
@item REJECT
@cindex REJECT
directs the scanner to proceed on to the ``second best'' rule which
matched the input (or a prefix of the input). The rule is chosen as
described above in @ref{Matching}, and @code{yytext} and @code{yyleng}
set up appropriately. It may either be one which matched as much text
as the originally chosen rule but came later in the @code{flex} input
file, or one which matched less text. For example, the following will
both count the words in the input and call the routine @code{special()}
whenever @samp{frob} is seen:
@example
@verbatim
            int word_count = 0;
    %%

    frob        special(); REJECT;
    [^ \t\n]+   ++word_count;
@end verbatim
@end example
Without the @code{REJECT}, any occurrences of @samp{frob} in the input
would not be counted as words, since the scanner normally executes only
one action per token. Multiple uses of @code{REJECT} are allowed, each
one finding the next best choice to the currently active rule. For
example, when the following scanner scans the token @samp{abcd}, it will
write @samp{abcdabcaba} to the output:
@cindex REJECT, calling multiple times
@cindex |, use of
@example
@verbatim
    %%
    a        |
    ab       |
    abc      |
    abcd     ECHO; REJECT;
    .|\n     /* eat up any unmatched character */
@end verbatim
@end example
The first three rules share the fourth's action since they use the
special @samp{|} action.
@code{REJECT} is a particularly expensive feature in terms of scanner
performance; if it is used in @emph{any} of the scanner's actions it
will slow down @emph{all} of the scanner's matching. Furthermore,
@code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options
(@pxref{Scanner Options}).
Note also that unlike the other special actions, @code{REJECT} is a
@emph{branch}. Code immediately following it in the action will
@emph{not} be executed.
@item yymore()
@cindex yymore()
tells the scanner that the next time it matches a rule, the
corresponding token should be @emph{appended} onto the current value of
@code{yytext} rather than replacing it. For example, given the input
@samp{mega-kludge} the following will write @samp{mega-mega-kludge} to
the output:
@cindex yymore(), mega-kludge
@cindex yymore() to append token to previous token
@example
@verbatim
    %%
    mega-    ECHO; yymore();
    kludge   ECHO;
@end verbatim
@end example
First @samp{mega-} is matched and echoed to the output. Then @samp{kludge}
is matched, but the previous @samp{mega-} is still hanging around at the
beginning of @code{yytext} so the @code{ECHO} for the @samp{kludge} rule
will actually write @samp{mega-kludge}.
@end table
@cindex yymore, performance penalty of
Two notes regarding use of @code{yymore()}. First, @code{yymore()}
depends on the value of @code{yyleng} correctly reflecting the size of
the current token, so you must not modify @code{yyleng} if you are using
@code{yymore()}. Second, the presence of @code{yymore()} in the
scanner's action entails a minor performance penalty in the scanner's
matching speed.
@cindex yyless()
@code{yyless(n)} returns all but the first @code{n} characters of the
current token back to the input stream, where they will be rescanned
when the scanner looks for the next match. @code{yytext} and
@code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now
be equal to @code{n}). For example, on the input @samp{foobar} the
following will write out @samp{foobarbar}:
@cindex yyless(), pushing back characters
@cindex pushing back characters with yyless
@example
@verbatim
    %%
    foobar    ECHO; yyless(3);
    [a-z]+    ECHO;
@end verbatim
@end example
An argument of 0 to @code{yyless()} will cause the entire current input
string to be scanned again. Unless you've changed how the scanner will
subsequently process its input (using @code{BEGIN}, for example), this
will result in an endless loop.
Note that @code{yyless()} is a macro and can only be used in the flex
input file, not from other source files.
@cindex unput()
@cindex pushing back characters with unput
@code{unput(c)} puts the character @code{c} back onto the input stream.
It will be the next character scanned. The following action will take
the current token and cause it to be rescanned enclosed in parentheses.
@cindex unput(), pushing back characters
@cindex pushing back characters with unput()
@example
@verbatim
    {
    int i;
    /* Copy yytext because unput() trashes yytext */
    char *yycopy = strdup( yytext );
    unput( ')' );
    for ( i = yyleng - 1; i >= 0; --i )
        unput( yycopy[i] );
    unput( '(' );
    free( yycopy );
    }
@end verbatim
@end example
Note that since each @code{unput()} puts the given character back at the
@emph{beginning} of the input stream, pushing back strings must be done
back-to-front.
@cindex %pointer, and unput()
@cindex unput(), and %pointer
An important potential problem when using @code{unput()} is that if you
are using @code{%pointer} (the default), a call to @code{unput()}
@emph{destroys} the contents of @code{yytext}, starting with its
rightmost character and devouring one character to the left with each
call. If you need the value of @code{yytext} preserved after a call to
@code{unput()} (as in the above example), you must either first copy it
elsewhere, or build your scanner using @code{%array} instead
(@pxref{Matching}).
@cindex pushing back EOF
@cindex EOF, pushing back
Finally, note that you cannot put back @samp{EOF} to attempt to mark the
input stream with an end-of-file.
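
The back-to-front ordering that @code{unput()} requires can be seen in a
small stand-alone model. The sketch below does not use @code{flex} at
all: @code{struct pushback} and the @code{pb_} functions are
hypothetical names for a toy last-in, first-out buffer that behaves like
the front of the scanner's input.

```c
#include <stddef.h>

/* Toy model of the front of the input stream; not flex's own data
 * structure. The most recently pushed character is read first. */
struct pushback {
    char data[64];
    size_t len;
};

/* Like unput(c): c becomes the next character to be read.
 * Returns 0 on success, -1 if the toy buffer is full. */
int pb_unput(struct pushback *pb, char c)
{
    if (pb->len >= sizeof pb->data)
        return -1;
    pb->data[pb->len++] = c;
    return 0;
}

/* Push back the n characters of s so that they are re-read in their
 * original order: the last character has to be pushed first. */
int pb_unput_string(struct pushback *pb, const char *s, size_t n)
{
    while (n > 0)
        if (pb_unput(pb, s[--n]) != 0)
            return -1;
    return 0;
}

/* Read the next character, or -1 when nothing is left. */
int pb_next(struct pushback *pb)
{
    return pb->len ? (unsigned char)pb->data[--pb->len] : -1;
}
```

Pushing @samp{abc} with @code{pb_unput_string()} and then reading three
times yields @samp{a}, @samp{b}, @samp{c}; pushing the characters in
forward order would yield them reversed.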
@cindex input()
@code{input()} reads the next character from the input stream. For
example, the following is one way to eat up C comments:
@cindex comments, discarding
@cindex discarding C comments
@example
@verbatim
    %%
    "/*"        {
                int c;

                for ( ; ; )
                    {
                    while ( (c = input()) != '*' &&
                            c != EOF )
                        ;    /* eat up text of comment */

                    if ( c == '*' )
                        {
                        while ( (c = input()) == '*' )
                            ;    /* skip a run of '*'s */
                        if ( c == '/' )
                            break;    /* found the end */
                        }

                    if ( c == EOF )
                        {
                        error( "EOF in comment" );
                        break;
                        }
                    }
                }
@end verbatim
@end example
@cindex input(), and C++
@cindex yyinput()
(Note that if the scanner is compiled using @code{C++}, then
@code{input()} is instead referred to as @b{yyinput()}, in order to
avoid a name clash with the @code{C++} stream by the name of
@code{input}.)
@cindex flushing the internal buffer
@code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that
the next time the scanner attempts to match a token, it will first
refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}).
This action is a special case of the more general
@code{yy_flush_buffer()} function, described below (@pxref{Multiple
Input Buffers}).
@cindex yyterminate()
@cindex terminating with yyterminate()
@cindex exiting with yyterminate()
@cindex halting with yyterminate()
@code{yyterminate()} can be used in lieu of a return statement in an
action. It terminates the scanner and returns a 0 to the scanner's
caller, indicating ``all done''. By default, @code{yyterminate()} is
also called when an end-of-file is encountered. It is a macro and may
be redefined.
@node Generated Scanner, Start Conditions, Actions, Top
@chapter The Generated Scanner
@cindex yylex(), in generated scanner
The output of @code{flex} is the file @file{lex.yy.c}, which contains
the scanning routine @code{yylex()}, a number of tables used by it for
matching tokens, and a number of auxiliary routines and macros. By
default, @code{yylex()} is declared as follows:
@example
@verbatim
    int yylex()
        {
        ... various definitions and the actions in here ...
        }
@end verbatim
@end example
@cindex yylex(), overriding
(If your environment supports function prototypes, then it will be
@code{int yylex( void )}.) This definition may be changed by defining
the @code{YY_DECL} macro. For example, you could use:
@cindex yylex, overriding the prototype of
@example
@verbatim
    #define YY_DECL float lexscan( a, b ) float a, b;
@end verbatim
@end example
to give the scanning routine the name @code{lexscan}, returning a float,
and taking two floats as arguments. Note that if you give arguments to
the scanning routine using a K&R-style/non-prototyped function
declaration, you must terminate the definition with a semi-colon (;).
@code{flex} generates @samp{C99} function definitions by
default. However flex does have the ability to generate obsolete, er,
@samp{traditional}, function definitions. This is to support
bootstrapping gcc on old systems. Unfortunately, traditional
definitions prevent us from using any standard data types smaller than
int (such as short, char, or bool) as function arguments. For this
reason, future versions of @code{flex} may generate standard C99 code
only, leaving K&R-style functions to the historians. Currently, if you
do @strong{not} want @samp{C99} definitions, then you must use
@code{%option noansi-definitions}.
@cindex stdin, default for yyin
@cindex yyin
Whenever @code{yylex()} is called, it scans tokens from the global input
file @file{yyin} (which defaults to stdin). It continues until it
either reaches an end-of-file (at which point it returns the value 0) or
one of its actions executes a @code{return} statement.
@cindex EOF and yyrestart()
@cindex end-of-file, and yyrestart()
@cindex yyrestart()
If the scanner reaches an end-of-file, subsequent calls are undefined
unless either @file{yyin} is pointed at a new input file (in which case
scanning continues from that file), or @code{yyrestart()} is called.
@code{yyrestart()} takes one argument, a @code{FILE *} pointer (which
can be NULL, if you've set up @code{YY_INPUT} to scan from a source other
than @code{yyin}), and initializes @file{yyin} for scanning from that
file. Essentially there is no difference between just assigning
@file{yyin} to a new input file or using @code{yyrestart()} to do so;
the latter is available for compatibility with previous versions of
@code{flex}, and because it can be used to switch input files in the
middle of scanning. It can also be used to throw away the current input
buffer, by calling it with an argument of @file{yyin}; but it would be
better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that
@code{yyrestart()} does @emph{not} reset the start condition to
@code{INITIAL} (@pxref{Start Conditions}).
@cindex RETURN, within actions
If @code{yylex()} stops scanning due to executing a @code{return}
statement in one of the actions, the scanner may then be called again
and it will resume scanning where it left off.
@cindex YY_INPUT
By default (and for purposes of efficiency), the scanner uses
block-reads rather than simple @code{getc()} calls to read characters
from @file{yyin}. The nature of how it gets its input can be controlled
by defining the @code{YY_INPUT} macro. The calling sequence for
@code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action
is to place up to @code{max_size} characters in the character array
@code{buf} and return in the integer variable @code{result} either the
number of characters read or the constant @code{YY_NULL} (0 on Unix
systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from
the global file-pointer @file{yyin}.
@cindex YY_INPUT, overriding
Here is a sample definition of @code{YY_INPUT} (in the definitions
section of the input file):
@example
@verbatim
    %{
    #define YY_INPUT(buf,result,max_size) \
        { \
        int c = getchar(); \
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
        }
    %}
@end verbatim
@end example
This definition will change the input processing to occur one character
at a time.
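
The contract described above (store at most @code{max_size} characters
in @code{buf}, and report either the count or zero for end of input) can
also be met by a reader that takes its text from memory. The following
is a minimal sketch under stated assumptions: @code{struct mem_source}
and @code{mem_read} are hypothetical names, not part of @code{flex}, and
a return value of 0 plays the role of @code{YY_NULL}.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: a block of text and a read position. */
struct mem_source {
    const char *text;
    size_t pos;
    size_t len;
};

/* Copy up to max_size bytes into buf and return how many were copied.
 * A return value of 0 signals end of input, as YY_NULL would. */
size_t mem_read(struct mem_source *src, char *buf, size_t max_size)
{
    size_t left = src->len - src->pos;
    size_t n = left < max_size ? left : max_size;

    memcpy(buf, src->text + src->pos, n);
    src->pos += n;
    return n;
}
```

A @code{YY_INPUT} definition could then assign the return value of
@code{mem_read()} to @code{result}. For the common case of scanning a
string, the ready-made routines in @ref{Scanning Strings} are usually
the simpler choice.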
@cindex yywrap()
When the scanner receives an end-of-file indication from @code{YY_INPUT}, it
then checks the @code{yywrap()} function. If @code{yywrap()} returns
false (zero), then it is assumed that the function has gone ahead and
set up @file{yyin} to point to another input file, and scanning
continues. If it returns true (non-zero), then the scanner terminates,
returning 0 to its caller. Note that in either case, the start
condition remains unchanged; it does @emph{not} revert to
@code{INITIAL}.
@cindex yywrap, default for
@cindex noyywrap, %option
@cindex %option noyywrap
If you do not supply your own version of @code{yywrap()}, then you must
either use @code{%option noyywrap} (in which case the scanner behaves as
though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to
obtain the default version of the routine, which always returns 1.
For scanning from in-memory buffers (e.g., scanning strings), see
@ref{Scanning Strings}. @xref{Multiple Input Buffers}.
@cindex ECHO, and yyout
@cindex yyout
@cindex stdout, as default for yyout
The scanner writes its @code{ECHO} output to the @file{yyout} global
(default, @file{stdout}), which may be redefined by the user simply by
assigning it to some other @code{FILE} pointer.
@node Start Conditions, Multiple Input Buffers, Generated Scanner, Top
@chapter Start Conditions
@cindex start conditions
@code{flex} provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with @samp{<sc>} will only be active
when the scanner is in the @dfn{start condition} named @code{sc}. For
example,
@c proofread edit stopped here
@example
@verbatim
    <STRING>[^"]*        { /* eat up the string body ... */
                ...
                }
@end verbatim
@end example
will be active only when the scanner is in the @code{STRING} start
condition, and
@cindex start conditions, multiple
@example
@verbatim
    <INITIAL,STRING,QUOTE>\.        { /* handle an escape ... */
                ...
                }
@end verbatim
@end example
will be active only when the current start condition is either
@code{INITIAL}, @code{STRING}, or @code{QUOTE}.
@cindex start conditions, inclusive v.s.@: exclusive
Start conditions are declared in the definitions (first) section of the
input using unindented lines beginning with either @samp{%s} or
@samp{%x} followed by a list of names. The former declares
@dfn{inclusive} start conditions, the latter @dfn{exclusive} start
conditions. A start condition is activated using the @code{BEGIN}
action. Until the next @code{BEGIN} action is executed, rules with the
given start condition will be active and rules with other start
conditions will be inactive. If the start condition is inclusive, then
rules with no start conditions at all will also be active. If it is
exclusive, then @emph{only} rules qualified with the start condition
will be active. A set of rules contingent on the same exclusive start
condition describe a scanner which is independent of any of the other
rules in the @code{flex} input. Because of this, exclusive start
conditions make it easy to specify ``mini-scanners'' which scan portions
of the input that are syntactically different from the rest (e.g.,
comments).
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two. The set of rules:
@cindex start conditions, inclusive
@example
@verbatim
    %s example
    %%

    <example>foo   do_something();

    bar            something_else();
@end verbatim
@end example
is equivalent to
@cindex start conditions, exclusive
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <INITIAL,example>bar    something_else();
@end verbatim
@end example
Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in
the second example wouldn't be active (i.e., couldn't match) when in
start condition @code{example}. If we just used @code{<example>} to
qualify @code{bar}, though, then it would only be active in
@code{example} and not in @code{INITIAL}, while in the first example
it's active in both, because in the first example the @code{example}
start condition is an inclusive @code{(%s)} start condition.
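
The activation rule for inclusive and exclusive start conditions can be
written down as a small predicate. This is an illustrative model only:
@code{struct rule}, @code{rule_is_active} and @code{INITIAL_SC} are
hypothetical names, and @code{flex} itself builds this information into
its tables rather than testing it at run time.

```c
#include <stdbool.h>
#include <stddef.h>

enum { INITIAL_SC = 0 };    /* stand-in for flex's INITIAL */

/* A rule and the start conditions named in its <...> prefix. */
struct rule {
    const int *conds;       /* NULL for an unqualified rule */
    size_t n_conds;         /* 0 for an unqualified rule */
};

/* Is rule r active when the scanner is in start condition `current`?
 * `current_is_exclusive` says whether `current` was declared with %x. */
bool rule_is_active(const struct rule *r, int current,
                    bool current_is_exclusive)
{
    if (r->n_conds == 0)
        /* Unqualified rules are active in INITIAL and in every
         * inclusive (%s) condition, never in an exclusive (%x) one. */
        return current == INITIAL_SC || !current_is_exclusive;

    for (size_t i = 0; i < r->n_conds; ++i)
        if (r->conds[i] == current)
            return true;
    return false;
}
```

In terms of the examples above, the unqualified @code{bar} rule is
active in @code{example} only when @code{example} is declared with
@samp{%s}, while the @code{<example>foo} rule is active in
@code{example} either way.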
@cindex start conditions, special wildcard condition
Also note that the special start-condition specifier @code{<*>}
matches every start condition. Thus, the above example could also
have been written:
@cindex start conditions, use of wildcard condition (<*>)
@example
@verbatim
    %x example
    %%

    <example>foo   do_something();

    <*>bar         something_else();
@end verbatim
@end example
The default rule (to @code{ECHO} any unmatched character) remains active
in start conditions. It is equivalent to:
@cindex start conditions, behavior of default rule
@example
@verbatim
    <*>.|\n     ECHO;
@end verbatim
@end example
@cindex BEGIN, explanation
@findex BEGIN
@vindex INITIAL
@code{BEGIN(0)} returns to the original state where only the rules with
no start conditions are active. This state can also be referred to as
the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is
equivalent to @code{BEGIN(0)}. (The parentheses around the start
condition name are not required but are considered good style.)
@code{BEGIN} actions can also be given as indented code at the beginning
of the rules section. For example, the following will cause the scanner
to enter the @code{SPECIAL} start condition whenever @code{yylex()} is
called and the global variable @code{enter_special} is true:
@cindex start conditions, using BEGIN
@example
@verbatim
            int enter_special;

    %x SPECIAL
    %%
            if ( enter_special )
                BEGIN(SPECIAL);

    ...more rules follow...
@end verbatim
@end example
To illustrate the uses of start conditions, here is a scanner which
provides two different interpretations of a string like @samp{123.456}.
By default it will treat it as three tokens, the integer @samp{123}, a
dot (@samp{.}), and the integer @samp{456}. But if the string is
preceded earlier in the line by the string @samp{expect-floats} it will
treat it as a single token, the floating-point number @samp{123.456}:
@cindex start conditions, for different interpretations of same input
@example
@verbatim
    %{
    #include <stdlib.h>    /* atof(), atoi() */
    %}
    %s expect

    %%
    expect-floats        BEGIN(expect);

    <expect>[0-9]+"."[0-9]+      {
                printf( "found a float, = %f\n",
                        atof( yytext ) );
                }
    <expect>\n           {
                /* that's the end of the line, so
                 * we need another "expect-floats"
                 * before we'll recognize any more
                 * numbers
                 */
                BEGIN(INITIAL);
                }

    [0-9]+      {
                printf( "found an integer, = %d\n",
                        atoi( yytext ) );
                }

    "."         printf( "found a dot\n" );
@end verbatim
@end example
@cindex comments, example of scanning C comments
Here is a scanner which recognizes (and discards) C comments while
maintaining a count of the current input line.
@cindex recognizing C comments
@example
@verbatim
    %x comment
    %%
            int line_num = 1;

    "/*"         BEGIN(comment);

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(INITIAL);
@end verbatim
@end example
This scanner goes to a bit of trouble to match as much
text as possible with each rule. In general, when attempting to write
a high-speed scanner try to match as much as possible in each rule, as
it's a big win.
Note that start condition names are really integer values and
can be stored as such. Thus, the above could be extended in the
following fashion:
@cindex start conditions, integer values
@cindex using integer values of start condition names
@example
@verbatim
    %x comment foo
    %%
            int line_num = 1;
            int comment_caller;

    "/*"         {
                 comment_caller = INITIAL;
                 BEGIN(comment);
                 }

    <foo>"/*"    {
                 comment_caller = foo;
                 BEGIN(comment);
                 }

    <comment>[^*\n]*        /* eat anything that's not a '*' */
    <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
    <comment>\n             ++line_num;
    <comment>"*"+"/"        BEGIN(comment_caller);
@end verbatim
@end example
@cindex YY_START, example
Furthermore, you can access the current start condition using the
integer-valued @code{YY_START} macro. For example, the above
assignments to @code{comment_caller} could instead be written
@cindex getting current start state with YY_START
@example
@verbatim
    comment_caller = YY_START;
@end verbatim
@end example
@vindex YY_START
Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that
is what's used by AT&T @code{lex}).
For historical reasons, start conditions do not have their own
name-space within the generated scanner. The start condition names are
unmodified in the generated scanner and generated header.
@xref{option-header}. @xref{option-prefix}.
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences (but
not including checking for a string that's too long):
@cindex matching C-style double-quoted strings
@example
@verbatim
    %x str

    %%
            char string_buf[MAX_STR_CONST];
            char *string_buf_ptr;

    \"      string_buf_ptr = string_buf; BEGIN(str);

    <str>\"        { /* saw closing quote - all done */
            BEGIN(INITIAL);
            *string_buf_ptr = '\0';
            /* return string constant token type and
             * value to parser
             */
            }

    <str>\n        {
            /* error - unterminated string constant */
            /* generate error message */
            }

    <str>\\[0-7]{1,3} {
            /* octal escape sequence */
            unsigned int result;    /* %o expects an unsigned int */

            (void) sscanf( yytext + 1, "%o", &result );

            if ( result > 0xff )
                {
                /* error, constant is out-of-bounds */
                }

            *string_buf_ptr++ = result;
            }

    <str>\\[0-9]+ {
            /* generate error - bad escape sequence; something
             * like '\48' or '\0777777'
             */
            }

    <str>\\n  *string_buf_ptr++ = '\n';
    <str>\\t  *string_buf_ptr++ = '\t';
    <str>\\r  *string_buf_ptr++ = '\r';
    <str>\\b  *string_buf_ptr++ = '\b';
    <str>\\f  *string_buf_ptr++ = '\f';

    <str>\\(.|\n)  *string_buf_ptr++ = yytext[1];

    <str>[^\\\n\"]+        {
            char *yptr = yytext;

            while ( *yptr )
                    *string_buf_ptr++ = *yptr++;
            }
@end verbatim
@end example
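
The octal-escape rule above can be factored into a helper that is easy
to test on its own. The following is a sketch, assuming a hypothetical
function named @code{decode_octal_escape} that receives the matched text
(backslash included) and reports out-of-range values instead of storing
them.

```c
#include <stdio.h>

/* Hypothetical helper: decode an octal escape such as "\101".
 * Returns the byte value (0..255), or -1 if the text is not a
 * backslash followed by an octal digit, or if the value exceeds 0xff. */
int decode_octal_escape(const char *escape)
{
    unsigned int value;

    if (escape[0] != '\\' || escape[1] < '0' || escape[1] > '7')
        return -1;

    /* At most three octal digits, as in the pattern \\[0-7]{1,3}. */
    if (sscanf(escape + 1, "%3o", &value) != 1)
        return -1;

    return value > 0xff ? -1 : (int)value;
}
```

An action could then call @code{decode_octal_escape( yytext )} and
append the result to @code{string_buf} only when it is not negative.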
@cindex start condition, applying to multiple patterns
Often, such as in some of the examples above, you wind up writing a
whole bunch of rules all preceded by the same start condition(s). Flex
makes this a little easier and cleaner by introducing a notion of start
condition @dfn{scope}. A start condition scope is begun with:
@example
@verbatim
    <SCs>{
@end verbatim
@end example
where @code{SCs} is a list of one or more start conditions. Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}. So, for example,
@cindex extended scope of start conditions
@example
@verbatim
    <ESC>{
        "\\n"   return '\n';
        "\\r"   return '\r';
        "\\f"   return '\f';
        "\\0"   return '\0';
    }
@end verbatim
@end example
is equivalent to:
@example
@verbatim
    <ESC>"\\n"  return '\n';
    <ESC>"\\r"  return '\r';
    <ESC>"\\f"  return '\f';
    <ESC>"\\0"  return '\0';
@end verbatim
@end example
Start condition scopes may be nested.
@cindex stacks, routines for manipulating
@cindex start conditions, use of a stack
The following routines are available for manipulating stacks of start conditions:
@deftypefun void yy_push_state ( int @code{new_state} )
pushes the current start condition onto the top of the start condition
stack and switches to
@code{new_state}
as though you had used
@code{BEGIN new_state}
(recall that start condition names are also integers).
@end deftypefun
@deftypefun void yy_pop_state ()
pops the top of the stack and switches to it via
@code{BEGIN}.
@end deftypefun
@deftypefun int yy_top_state ()
returns the top of the stack without altering the stack's contents.
@end deftypefun
@cindex memory, for start condition stacks
The start condition stack grows dynamically and so has no built-in size
limitation. If memory is exhausted, program execution aborts.
To use start condition stacks, your scanner must include a @code{%option
stack} directive (@pxref{Scanner Options}).
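
The behavior of the three stack routines can be modelled outside a
scanner. The sketch below is an assumption-laden illustration, not
@code{flex}'s implementation: @code{struct sc_stack} and the @code{sc_}
functions are hypothetical names, and aborting on an empty stack is a
choice made here for simplicity.

```c
#include <stdlib.h>

/* Model of the scanner's start condition plus the stack of saved ones. */
struct sc_stack {
    int current;        /* the active start condition */
    int *saved;         /* previously active conditions, oldest first */
    size_t depth;
    size_t capacity;
};

/* Like yy_push_state(): save the current condition, then switch. */
void sc_push(struct sc_stack *s, int new_state)
{
    if (s->depth == s->capacity) {
        size_t cap = s->capacity ? s->capacity * 2 : 8;
        int *p = realloc(s->saved, cap * sizeof *p);
        if (!p)
            abort();    /* out of memory: execution aborts */
        s->saved = p;
        s->capacity = cap;
    }
    s->saved[s->depth++] = s->current;
    s->current = new_state;
}

/* Like yy_pop_state(): switch back to the most recently saved one. */
void sc_pop(struct sc_stack *s)
{
    if (s->depth == 0)
        abort();        /* nothing to pop */
    s->current = s->saved[--s->depth];
}

/* Like yy_top_state(): peek at the top without changing anything. */
int sc_top(const struct sc_stack *s)
{
    if (s->depth == 0)
        abort();
    return s->saved[s->depth - 1];
}
```

Starting in condition 0, @code{sc_push(&s, 3)} makes 3 the current
condition and leaves 0 on top of the stack; a later @code{sc_pop(&s)}
returns the scanner to 0.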
@node Multiple Input Buffers, EOF, Start Conditions, Top
@chapter Multiple Input Buffers
@cindex multiple input streams
Some scanners (such as those which support ``include'' files) require
reading from several input streams. As @code{flex} scanners do a large
amount of buffering, one cannot control where the next input will be
read from by simply writing a @code{YY_INPUT()} which is sensitive to
the scanning context. @code{YY_INPUT()} is only called when the scanner
reaches the end of its buffer, which may be a long time after scanning a
statement such as an @code{include} statement which requires switching
the input source.
To negotiate these sorts of problems, @code{flex} provides a mechanism
for creating and switching between multiple input buffers. An input
buffer is created by using:
@cindex memory, allocating input buffers
@deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size )
@end deftypefun
which takes a @code{FILE} pointer and a size and creates a buffer
associated with the given file and large enough to hold @code{size}
characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It
returns a @code{YY_BUFFER_STATE} handle, which may then be passed to
other routines (see below).
The @code{YY_BUFFER_STATE} type is a
pointer to an opaque @code{struct yy_buffer_state} structure, so you may
safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE)
0)} if you wish, and also refer to the opaque structure in order to
correctly declare input buffers in source files other than that of your
scanner. Note that the @code{FILE} pointer in the call to
@code{yy_create_buffer} is only used as the value of @file{yyin} seen by
@code{YY_INPUT}. If you redefine @code{YY_INPUT()} so it no longer uses
@file{yyin}, then you can safely pass a NULL @code{FILE} pointer to
@code{yy_create_buffer}. You select a particular buffer to scan from
using:
@deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer )
@end deftypefun
The above function switches the scanner's input buffer so subsequent tokens
will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may
be used by @code{yywrap()} to set things up for continued scanning, instead of
opening a new file and pointing @file{yyin} at it. If you are looking for a
stack of input buffers, then you want to use @code{yypush_buffer_state()}
instead of this function. Note also that switching input sources via either
@code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the
start condition.
@cindex memory, deleting input buffers
@deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun
is used to reclaim the storage associated with a buffer. (@code{buffer}
can be NULL, in which case the routine does nothing.) You can also clear
the current contents of a buffer using:
@cindex pushing an input buffer
@cindex stack, input buffer push
@deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer )
@end deftypefun
This function pushes the new buffer state onto an internal stack. The pushed
state becomes the new current state. The stack is maintained by flex and will
grow as required. This function is intended to be used instead of
@code{yy_switch_to_buffer}, when you want to change states, but preserve the
current state for later use.
@cindex popping an input buffer
@cindex stack, input buffer pop
@deftypefun void yypop_buffer_state ( )
@end deftypefun
This function removes the current state from the top of the stack, and deletes
it by calling @code{yy_delete_buffer}. The next state on the stack, if any,
becomes the new current state.
@cindex clearing an input buffer
@cindex flushing an input buffer
@deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer )
@end deftypefun
This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the
buffer, it will first fill the buffer anew using
@code{YY_INPUT()}.
@deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size )
@end deftypefun
is an alias for @code{yy_create_buffer()},
provided for compatibility with the C++ use of @code{new} and
@code{delete} for creating and destroying dynamic objects.
@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the @code{YY_CURRENT_BUFFER} macro returns a
@code{YY_BUFFER_STATE} handle to the
current buffer. It should not be used as an lvalue.
@cindex EOF, example using multiple input buffers
Here are two examples of using these features for writing a scanner
which expands include files (the
@code{<<EOF>>}
feature is discussed below).
This first example uses @code{yypush_buffer_state} and
@code{yypop_buffer_state}. Flex
maintains the stack internally.
@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;

    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            yyin = fopen( yytext, "r" );

            if ( ! yyin )
                error( ... );

            yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE ));

            BEGIN(INITIAL);
            }

    <<EOF>> {
            yypop_buffer_state();

            if ( !YY_CURRENT_BUFFER )
                yyterminate();
            }
@end verbatim
@end example
The second example, below, does the same thing as the previous example did, but
manages its own input buffer stack manually (instead of letting flex do it).
@cindex handling include files with multiple input buffers
@example
@verbatim
    /* the "incl" state is used for picking up the name
     * of an include file
     */
    %x incl
    %{
    #define MAX_INCLUDE_DEPTH 10   /* arbitrary limit for this example */
    YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
    int include_stack_ptr = 0;
    %}
    %%
    include             BEGIN(incl);

    [a-z]+              ECHO;
    [^a-z\n]*\n?        ECHO;
    <incl>[ \t]*      /* eat the whitespace */
    <incl>[^ \t\n]+   { /* got the include file name */
            if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
                {
                fprintf( stderr, "Includes nested too deeply" );
                exit( 1 );
                }
            include_stack[include_stack_ptr++] = YY_CURRENT_BUFFER;
            yyin = fopen( yytext, "r" );
            if ( ! yyin )
                error( ... );
            yy_switch_to_buffer(
                yy_create_buffer( yyin, YY_BUF_SIZE ) );
            BEGIN(INITIAL);
            }

    <<EOF>> {
            if ( --include_stack_ptr < 0 )
                yyterminate();
            else
                {
                yy_delete_buffer( YY_CURRENT_BUFFER );
                yy_switch_to_buffer(
                     include_stack[include_stack_ptr] );
                }
            }
@end verbatim
@end example
@anchor{Scanning Strings}
@cindex strings, scanning strings instead of files
The following routines are available for setting up input buffers for
scanning in-memory strings instead of files. All of them create a new
input buffer for scanning the string, and return a corresponding
@code{YY_BUFFER_STATE} handle (which you should delete with
@code{yy_delete_buffer()} when done with it). They also switch to the
new buffer using @code{yy_switch_to_buffer()}, so the next call to
@code{yylex()} will start scanning the string.
@deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str )
scans a NUL-terminated string.
@end deftypefun
@deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len )
scans @code{len} bytes (including possibly @code{NUL}s) starting at location
@code{bytes}.
@end deftypefun
Note that both of these functions create and scan a @emph{copy} of the
string or bytes. (This may be desirable, since @code{yylex()} modifies
the contents of the buffer it is scanning.) You can avoid the copy by
using:
@deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size)
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL). These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-3]}, inclusive.
@end deftypefun
If you fail to set up @code{base} in this manner (i.e., forget the final
two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()}
returns a NULL pointer instead of creating a new input buffer.
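As a sketch of the setup @code{yy_scan_buffer()} expects, the helper below
copies a C string into a buffer that ends in two NUL bytes. The name
@code{make_scan_buffer} is hypothetical and not part of flex; only the
buffer layout is taken from the description above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of flex): copies `text` into a freshly
 * allocated buffer that ends with the two NUL bytes yy_scan_buffer()
 * requires.  On success *size_out holds the total size, including the
 * two terminators.  Returns NULL if allocation fails. */
char *make_scan_buffer(const char *text, size_t *size_out)
{
    size_t len;
    char *base;

    assert(text != NULL && size_out != NULL);
    len = strlen(text);
    base = malloc(len + 2);
    if (base == NULL)
        return NULL;
    memcpy(base, text, len);
    base[len] = '\0';       /* first YY_END_OF_BUFFER_CHAR */
    base[len + 1] = '\0';   /* second YY_END_OF_BUFFER_CHAR */
    *size_out = len + 2;
    return base;
}
```

Inside a scanner, the result would then be handed on with something like
@code{yy_scan_buffer(base, (yy_size_t) size)}, checking the returned handle
against NULL before calling @code{yylex()}.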
@deftp {Data type} yy_size_t
is an integral type to which you can cast an integer expression
reflecting the size of the buffer.
@end deftp
@node EOF, Misc Macros, Multiple Input Buffers, Top
@chapter End-of-File Rules
@cindex EOF, explanation
The special rule @code{<<EOF>>} indicates
actions which are to be taken when an end-of-file is
encountered and @code{yywrap()} returns non-zero (i.e., indicates
no further files to process). The action must finish
by doing one of the following things:
@itemize
@item
@findex YY_NEW_FILE (now obsolete)
assigning @file{yyin} to a new input file (in previous versions of
@code{flex}, after doing the assignment you had to call the special
action @code{YY_NEW_FILE}. This is no longer necessary.)
@item
executing a @code{return} statement;
@item
executing the special @code{yyterminate()} action.
@item
or, switching to a new buffer using @code{yy_switch_to_buffer()} as
shown in the example above.
@end itemize
@code{<<EOF>>} rules may not be used with other patterns; they may only be
qualified with a list of start conditions. If an unqualified @code{<<EOF>>}
rule is given, it applies to @emph{all} start conditions which do not
already have @code{<<EOF>>} actions. To specify an @code{<<EOF>>} rule for
only the initial start condition, use:
@example
@verbatim
    <INITIAL><<EOF>>
@end verbatim
@end example
These rules are useful for catching things like unclosed comments. An
example:
@cindex <<EOF>>, use of
@example
@verbatim
    %x quote
    %%
    ...other rules for dealing with quotes...

    <quote><<EOF>>   {
             error( "unterminated quote" );
             yyterminate();
             }
    <<EOF>>  {
             if ( *++filelist )
                 yyin = fopen( *filelist, "r" );
             else
                 yyterminate();
             }
@end verbatim
@end example
@node Misc Macros, User Values, EOF, Top
@chapter Miscellaneous Macros
The macro @code{YY_USER_ACTION} can be defined to provide an action
which is always executed prior to the matched rule's action. For
example, it could be #define'd to call a routine to convert yytext to
lower-case. When @code{YY_USER_ACTION} is invoked, the variable
@code{yy_act} gives the number of the matched rule (rules are numbered
starting with 1). Suppose you want to profile how often each of your
rules is matched. The following would do the trick:
@cindex YY_USER_ACTION to track each time a rule is matched
@example
@verbatim
    #define YY_USER_ACTION ++ctr[yy_act]
@end verbatim
@end example
@vindex YY_NUM_RULES
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}, so a correct
declaration for @code{ctr} is:
@example
@verbatim
    int ctr[YY_NUM_RULES];
@end verbatim
@end example
@hkindex YY_USER_INIT
The macro @code{YY_USER_INIT} may be defined to provide an action which
is always executed before the first scan (and before the scanner's
internal initializations are done). For example, it could be used to
call a routine to read in a data table or open a logging file.
@findex yy_set_interactive
The macro @code{yy_set_interactive(is_interactive)} can be used to
control whether the current buffer is considered @dfn{interactive}. An
interactive buffer is processed more slowly, but must be used when the
scanner's input source is indeed interactive to avoid problems due to
waiting to fill buffers (see the discussion of the @samp{-I} flag in
@ref{Scanner Options}). A non-zero value in the macro invocation marks
the buffer as interactive, a zero value as non-interactive. Note that
use of this macro overrides @code{%option always-interactive} or
@code{%option never-interactive} (@pxref{Scanner Options}).
@code{yy_set_interactive()} must be invoked prior to beginning to scan
the buffer that is (or is not) to be considered interactive.
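One plausible way to choose the argument is to ask the operating system
whether the input is a terminal. The sketch below assumes a POSIX system
(@code{isatty()} and @code{fileno()} are not ANSI C), and the helper name
@code{input_is_interactive} is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L   /* ask for fileno() under strict ISO modes */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>               /* isatty(); POSIX, not ANSI C */

/* Hypothetical helper: returns 1 when `fp` is attached to a terminal,
 * 0 otherwise (regular file, pipe, and so on). */
int input_is_interactive(FILE *fp)
{
    assert(fp != NULL);
    return isatty(fileno(fp)) ? 1 : 0;
}
```

Before scanning a new buffer, a scanner could then call
@code{yy_set_interactive(input_is_interactive(yyin))}.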
@cindex BOL, setting it
@findex yy_set_bol
The macro @code{yy_set_bol(at_bol)} controls whether the next token match
in the current buffer is treated as though it were at the beginning of a
line. A non-zero macro argument makes rules anchored with @samp{^} active,
while a zero argument makes @samp{^} rules inactive.
@cindex BOL, checking the BOL flag
@findex YY_AT_BOL
The macro @code{YY_AT_BOL()} returns true if the next token scanned from
the current buffer will have @samp{^} rules active, false otherwise.
@cindex actions, redefining YY_BREAK
@hkindex YY_BREAK
In the generated scanner, the actions are all gathered in one large
switch statement and separated using @code{YY_BREAK}, which may be
redefined. By default, it is simply a @code{break}, to separate each
rule's action from the following rule's. Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!). This avoids unreachable-statement warnings in cases where
a rule's action ends with @code{return}, which makes the following
@code{YY_BREAK} unreachable.
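The effect of an empty @code{YY_BREAK} can be seen in a simplified model of
the action switch. This is an illustration only, not the code flex
generates; the token codes and the function name @code{dispatch} are
hypothetical.

```c
#include <assert.h>

/* Hypothetical token codes, for illustration only. */
enum { TOK_WORD = 1, TOK_NUMBER = 2 };

/* With YY_BREAK defined as empty, nothing follows each action, so every
 * action must itself end with `return` (or `break`). */
#define YY_BREAK /* empty */

/* Simplified stand-in for the scanner's action switch. */
int dispatch(int yy_act)
{
    assert(yy_act >= 0);
    switch (yy_act) {
    case 1:
        return TOK_WORD;
        YY_BREAK
    case 2:
        return TOK_NUMBER;
        YY_BREAK
    default:
        return 0;
    }
}
```

With the default definition, each @code{YY_BREAK} here would expand to a
@code{break} placed directly after a @code{return}, which is the statement
some compilers flag as unreachable.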
@node User Values, Yacc, Misc Macros, Top
@chapter Values Available To the User
This chapter summarizes the various values available to the user in the
rule actions.
@table @code
@vindex yytext
@item char *yytext
holds the text of the current token. It may be modified but not
lengthened (you cannot append characters to the end).
@cindex yytext, default array size
@cindex array, default size for yytext
@vindex YYLMAX
If the special directive @code{%array} appears in the first section of
the scanner description, then @code{yytext} is instead declared
@code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition
that you can redefine in the first section if you don't like the default
value (generally 8KB). Using @code{%array} results in somewhat slower
scanners, but the value of @code{yytext} becomes immune to calls to
@code{unput()}, which potentially destroy its value when @code{yytext} is
a character pointer. The opposite of @code{%array} is @code{%pointer},
which is the default.
@cindex C++ and %array
You cannot use @code{%array} when generating C++ scanner classes (the
@samp{-+} flag).
@vindex yyleng
@item int yyleng
holds the length of the current token.
@vindex yyin
@item FILE *yyin
is the file which by default @code{flex} reads from. It may be
redefined but doing so only makes sense before scanning begins or after
an EOF has been encountered. Changing it in the midst of scanning will
have unexpected results since @code{flex} buffers its input; use
@code{yyrestart()} instead. Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.
@findex yyrestart
@item void yyrestart( FILE *new_file )
may be called to point @file{yyin} at the new input file. The
switch-over to the new file is immediate (any previously buffered-up
input is lost). Note that calling @code{yyrestart()} with @file{yyin}
as an argument thus throws away the current input buffer and continues
scanning the same input file.
@vindex yyout
@item FILE *yyout
is the file to which @code{ECHO} actions are done. It can be reassigned
by the user.
@vindex YY_CURRENT_BUFFER
@item YY_CURRENT_BUFFER
returns a @code{YY_BUFFER_STATE} handle to the current buffer.
@vindex YY_START
@item YY_START
returns an integer value corresponding to the current start condition.
You can subsequently use this value with @code{BEGIN} to return to that
start condition.
@end table
@node Yacc, Scanner Options, User Values, Top
@chapter Interfacing with Yacc
@cindex yacc, interface
@vindex yylval, with yacc
One of the main uses of @code{flex} is as a companion to the @code{yacc}
parser-generator. @code{yacc} parsers expect to call a routine named
@code{yylex()} to find the next input token. The routine is supposed to
return the type of the next token as well as putting any associated
value in the global @code{yylval}. To use @code{flex} with @code{yacc},
one specifies the @samp{-d} option to @code{yacc} to instruct it to
generate the file @file{y.tab.h} containing definitions of all the
@code{%token}s appearing in the @code{yacc} input. This file is then
included in the @code{flex} scanner. For example, if one of the tokens
is @code{TOK_NUMBER}, part of the scanner might look like:
@cindex yacc interface
@example
@verbatim
    %{
    #include "y.tab.h"
    %}
    %%
    [0-9]+        yylval = atoi( yytext ); return TOK_NUMBER;
@end verbatim
@end example
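The action above uses @code{atoi()}, whose behavior is undefined when the
digits do not fit in an @code{int}. Where that matters, a range-checked
conversion along the following lines could be used instead. This is a
sketch, and the helper name @code{token_to_int} is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts the decimal digits in `text` to an int.
 * Returns 1 and stores the value in *value_out on success, or 0 if there
 * are no digits or the number does not fit in an int. */
int token_to_int(const char *text, int *value_out)
{
    char *end;
    long v;

    assert(text != NULL && value_out != NULL);
    errno = 0;
    v = strtol(text, &end, 10);
    if (end == text || errno == ERANGE || v > INT_MAX || v < INT_MIN)
        return 0;
    *value_out = (int) v;
    return 1;
}
```

An action could call @code{token_to_int(yytext, &yylval)} and report an
error when it returns 0, assuming @code{yylval} is an @code{int} as in the
example above.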
@node Scanner Options, Performance, Yacc, Top
@chapter Scanner Options
@cindex command-line options
@cindex options, command-line
@cindex arguments, command-line
The various @code{flex} options are categorized by function in the following
menu. If you want to look up a particular option by name, see @ref{Index of Scanner Options}.
@menu
* Options for Specifying Filenames::
* Options Affecting Scanner Behavior::
* Code-Level And API Options::
* Options for Scanner Speed and Size::
* Debugging Options::
* Miscellaneous Options::
@end menu
Even though there are many scanner options, a typical scanner might only
specify the following options:
@example
@verbatim
    %option   8bit reentrant bison-bridge
    %option   warn nodefault
    %option   yylineno
    %option   outfile="scanner.c" header-file="scanner.h"
@end verbatim
@end example
The first line specifies the general type of scanner we want. The second line
specifies that we are being careful. The third line asks flex to track line
numbers. The last line tells flex what to name the files. (The options can be
specified in any order; they are grouped here only for readability.)
@code{flex} also provides a mechanism for controlling options within the
scanner specification itself, rather than from the flex command-line.
This is done by including @code{%option} directives in the first section
of the scanner specification. You can specify multiple options with a
single @code{%option} directive, and multiple directives in the first
section of your flex input file.
Most options are given simply as names, optionally preceded by the
word @samp{no} (with no intervening whitespace) to negate their meaning.
The names are the same as their long-option equivalents (but without the
leading @samp{--}).
@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features. The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g., @code{%option
reject}) to indicate the feature is indeed used, or unsetting them to
indicate it actually is not used (e.g., @code{%option noyymore}).
A number of options are available for lint purists who want to suppress
the appearance of unneeded routines in the generated scanner. Each of
the following, if unset (e.g., @code{%option nounput}), results in the
corresponding routine not appearing in the generated scanner:
@example
@verbatim
    input, unput
    yy_push_state, yy_pop_state, yy_top_state
    yy_scan_buffer, yy_scan_bytes, yy_scan_string

    yyget_extra, yyset_extra, yyget_leng, yyget_text,
    yyget_lineno, yyset_lineno, yyget_in, yyset_in,
    yyget_out, yyset_out, yyget_lval, yyset_lval,
    yyget_lloc, yyset_lloc, yyget_debug, yyset_debug
@end verbatim
@end example
(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).
@node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options
@section Options for Specifying Filenames
@table @samp
@opindex ---header-file
@opindex header-file
@item --header-file=FILE, @code{%option header-file="FILE"}
instructs flex to write a C header to @file{FILE}. This file contains
function prototypes, extern variables, and types used by the scanner.
Only the external API is exported by the header file. Many macros that
are usable from within scanner actions are not exported to the header
file. This is due to namespace problems and the goal of a clean
external API.
While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy}
is substituted with the appropriate prefix.
The @samp{--header-file} option is not compatible with the @samp{--c++} option,
since the C++ scanner provides its own header in @file{yyFlexLexer.h}.
@opindex -o
@opindex ---outfile
@opindex outfile
@item -oFILE, --outfile=FILE, @code{%option outfile="FILE"}
directs flex to write the scanner to the file @file{FILE} instead of
@file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option,
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file @file{FILE}.
@opindex -t
@opindex ---stdout
@opindex stdout
@item -t, --stdout, @code{%option stdout}
instructs @code{flex} to write the scanner it generates to standard
output instead of @file{lex.yy.c}.
@opindex ---skel
@item -SFILE, --skel=FILE
overrides the default skeleton file from which @code{flex}
constructs its scanners. You'll never need this option unless you are doing
@code{flex} maintenance or development.
@opindex ---tables-file
@opindex tables-file
@item --tables-file=FILE
writes serialized scanner DFA tables to @file{FILE}. The generated scanner
will not contain the tables, and requires them to be loaded at runtime.
@opindex ---tables-verify
@opindex tables-verify
@item --tables-verify
This option is for flex development. We document it here in case you stumble
upon it by accident or in case you suspect some inconsistency in the serialized
tables. Flex will serialize the scanner dfa tables but will also generate the
in-code tables as it normally does. At runtime, the scanner will verify that
the serialized tables match the in-code tables, instead of loading them.
@end table
@node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options
@section Options Affecting Scanner Behavior
@table @samp
@opindex -i
@opindex ---case-insensitive
@opindex case-insensitive
@item -i, --case-insensitive, @code{%option case-insensitive}
instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The
case of letters given in the @code{flex} input patterns will be ignored,
and tokens in the input will be matched regardless of case. The matched
text given in @code{yytext} will have the preserved case (i.e., it will
not be folded). For tricky behavior, see @ref{case and character ranges}.
@opindex -l
@opindex ---lex-compat
@opindex lex-compat
@item -l, --lex-compat, @code{%option lex-compat}
turns on maximum compatibility with the original AT&T @code{lex}
implementation. Note that this does not mean @emph{full} compatibility.
Use of this option costs a considerable amount of performance, and it
cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or
@samp{-CF} options. For details on the compatibilities it provides, see
@ref{Lex and Posix}. This option also results in the name
@code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner.
@opindex -B
@opindex ---batch
@opindex batch
@item -B, --batch, @code{%option batch}
instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of
@emph{interactive} scanners generated by @samp{--interactive} (see below). In
general, you use @samp{-B} when you are @emph{certain} that your scanner
will never be used interactively, and you want to squeeze a
@emph{little} more performance out of it. If your goal is instead to
squeeze out a @emph{lot} more performance, you should be using the
@samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically
anyway.
@opindex -I
@opindex ---interactive
@opindex interactive
@item -I, --interactive, @code{%option interactive}
instructs @code{flex} to generate an @i{interactive} scanner. An
interactive scanner is one that only looks ahead to decide what token
has been matched if it absolutely must. It turns out that always
looking one extra character ahead, even if the scanner has already seen
enough text to disambiguate the current token, is a bit faster than only
looking ahead when necessary. But scanners that always look ahead give
dreadful interactive performance; for example, when a user types a
newline, it is not recognized as a newline token until they enter
@emph{another} token, which often means typing in another whole line.
@code{flex} scanners default to @code{interactive} unless you use the
@samp{-Cf} or @samp{-CF} table-compression options
(@pxref{Performance}). That's because if you're looking for
high-performance you should be using one of these options, so if you
didn't, @code{flex} assumes you'd rather trade off a bit of run-time
performance for intuitive interactive behavior. Note also that you
@emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or
@samp{-CF}. Thus, this option is not really needed; it is on by default
for all those cases in which it is allowed.
You can force a scanner to @emph{not} be interactive by using
@samp{--batch}.
@opindex -7
@opindex ---7bit
@opindex 7bit
@item -7, --7bit, @code{%option 7bit}
instructs @code{flex} to generate a 7-bit scanner, i.e., one which can
only recognize 7-bit characters in its input. The advantage of using
@samp{--7bit} is that the scanner's tables can be up to half the size of
those generated using the @samp{--8bit} option. The disadvantage is that such
scanners often hang or crash if their input contains an 8-bit character.
Note, however, that unless you generate your scanner using the
@samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit}
will save only a small amount of table space, and make your scanner
considerably less portable. @code{Flex}'s default behavior is to
generate an 8-bit scanner unless you use the @samp{-Cf} or @samp{-CF} options,
in which case @code{flex} defaults to generating 7-bit scanners unless
your site was always configured to generate 8-bit scanners (as will
often be the case with non-USA sites). You can tell whether flex
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in
the @samp{--verbose} output as described above.
Note that if you use @samp{-Cfe} or @samp{-CFe} @code{flex} still
defaults to generating an 8-bit scanner, since usually with these
compression options full 8-bit tables are not much more expensive than
7-bit tables.
@opindex -8
@opindex ---8bit
@opindex 8bit
@item -8, --8bit, @code{%option 8bit}
instructs @code{flex} to generate an 8-bit scanner, i.e., one which can
recognize 8-bit characters. This flag is only needed for scanners
generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to
generating an 8-bit scanner anyway.
See the discussion of @samp{--7bit}
above for @code{flex}'s default behavior and the tradeoffs between 7-bit
and 8-bit scanners.
@opindex ---default
@opindex default
@item --default, @code{%option default}
generates the default rule.
@opindex ---always-interactive
@opindex always-interactive
@item --always-interactive, @code{%option always-interactive}
instructs flex to generate a scanner which always considers its input
@emph{interactive}. Normally, on each new input file the scanner calls
@code{isatty()} in an attempt to determine whether the scanner's input
source is interactive and thus should be read a character at a time.
When this option is used, however, then no such call is made.
@opindex ---never-interactive
@opindex never-interactive
@item --never-interactive, @code{%option never-interactive}
instructs flex to generate a scanner which never considers its input
interactive. This is the opposite of @code{always-interactive}.
@opindex -X
@opindex ---posix
@opindex posix
@item -X, --posix, @code{%option posix}
turns on maximum compatibility with the POSIX 1003.2-1992 definition of
@code{lex}. Since @code{flex} was originally designed to implement the
POSIX definition of @code{lex} this generally involves very few changes
in behavior. At the current writing the known differences between
@code{flex} and the POSIX standard are:
@itemize
@item
In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower
precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}).
Most POSIX utilities use an Extended Regular Expression (ERE) precedence
in which the repeat operator binds more tightly than concatenation
(which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex}
gives the repeat operator higher precedence than concatenation,
which matches the ERE processing of other POSIX utilities. When either
@samp{--posix} or @samp{-l} is specified, @code{flex} uses the
traditional AT&T and POSIX-compliant precedence, in which concatenation
has higher precedence than the repeat operator.
@end itemize
@opindex ---stack
@opindex stack
@item --stack, @code{%option stack}
enables the use of
start condition stacks (@pxref{Start Conditions}).
@opindex ---stdinit
@opindex stdinit
@item --stdinit, @code{%option stdinit}
if set (i.e., @code{%option stdinit}), initializes @code{yyin} and
@code{yyout} to @file{stdin} and @file{stdout}, instead of the default of
@file{NULL}. Some existing @code{lex} programs depend on this behavior,
even though it is not compliant with ANSI C, which does not require
@file{stdin} and @file{stdout} to be compile-time constant. In a
reentrant scanner, however, this is not a problem since initialization
is performed in @code{yylex_init} at runtime.
@opindex ---yylineno
@opindex yylineno
@item --yylineno, @code{%option yylineno}
directs @code{flex} to generate a scanner
that maintains the number of the current line read from its input in the
global variable @code{yylineno}. This option is implied by @code{%option
lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is
accessible regardless of the value of @code{%option yylineno}; however, its
value is not modified by @code{flex} unless @code{%option yylineno} is enabled.
@opindex ---yywrap
@opindex yywrap
@item --yywrap, @code{%option yywrap}
if unset (i.e., @samp{--noyywrap}), makes the scanner not call
@code{yywrap()} upon an end-of-file, but simply assume that there are no
more files to scan (until the user points @file{yyin} at a new file and
calls @code{yylex()} again).
@end table
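The two readings of @samp{ab@{3@}} described under @samp{--posix} above can
be made concrete with a small sketch. It is not a regular-expression engine
and is not taken from flex; it only expands a literal followed by a repeat
count under each precedence rule. Both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Repeat binds tighter than concatenation (the flex default):
 * only the last character of `lit` is repeated n times.
 * Returns 0 on success, -1 if `out` is too small. */
int expand_repeat_tight(const char *lit, int n, char *out, size_t cap)
{
    size_t len = strlen(lit);
    size_t need;

    assert(len > 0 && n >= 0);
    need = (len - 1) + (size_t) n;
    if (need + 1 > cap)
        return -1;
    memcpy(out, lit, len - 1);
    memset(out + len - 1, lit[len - 1], (size_t) n);
    out[need] = '\0';
    return 0;
}

/* Concatenation binds tighter than repeat (--posix or -l):
 * the whole literal is repeated n times.
 * Returns 0 on success, -1 if `out` is too small. */
int expand_repeat_loose(const char *lit, int n, char *out, size_t cap)
{
    size_t len = strlen(lit);
    size_t need = len * (size_t) n;
    int i;

    assert(n >= 0);
    if (need + 1 > cap)
        return -1;
    for (i = 0; i < n; ++i)
        memcpy(out + (size_t) i * len, lit, len);
    out[need] = '\0';
    return 0;
}
```

Called with the literal @code{"ab"} and a count of 3, the first function
produces @samp{abbb} and the second @samp{ababab}, matching the two cases
in the text.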
@node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options
@section Code-Level And API Options
@table @samp
@opindex ---option-ansi-definitions
@opindex ansi-definitions
@item --ansi-definitions, @code{%option ansi-definitions}
instructs flex to generate ANSI C99 definitions for functions.
This option is enabled by default.
If @code{%option noansi-definitions} is specified, then the obsolete style
is generated.
@opindex ---option-ansi-prototypes
@opindex ansi-prototypes
@item --ansi-prototypes, @code{%option ansi-prototypes}
instructs flex to generate ANSI C99 prototypes for functions.
This option is enabled by default.
If @code{noansi-prototypes} is specified, then
prototypes will have empty parameter lists.
@opindex ---bison-bridge
@opindex bison-bridge
@item --bison-bridge, @code{%option bison-bridge}
instructs flex to generate a C scanner that is
meant to be called by a
@code{GNU bison}
parser. The scanner has minor API changes for @code{bison}
compatibility. In particular, the declaration of @code{yylex}
is modified to take an additional parameter, @code{yylval}.
@xref{Bison Bridge}.
@opindex ---bison-locations
@opindex bison-locations
@item --bison-locations, @code{%option bison-locations}
instructs flex that
@code{GNU bison} @code{%locations} are being used.
This means @code{yylex} will be passed
an additional parameter, @code{yylloc}. This option
implies @code{%option bison-bridge}.
@xref{Bison Bridge}.
@opindex -L
@opindex ---noline
@opindex noline
@item -L, --noline, @code{%option noline}
instructs @code{flex} not to generate @code{#line} directives. Without this
option, @code{flex} peppers the generated scanner
with @code{#line} directives so error messages in the actions will be correctly
located with respect to either the original @code{flex}
input file (if the errors are due to code in the input file), or
@file{lex.yy.c} (if the errors are @code{flex}'s
fault; you should report these sorts of errors to the email address
given in @ref{Reporting Bugs}).
@opindex -R
@opindex ---reentrant
@opindex reentrant
@item -R, --reentrant, @code{%option reentrant}
instructs flex to generate a reentrant C scanner. The generated scanner
may safely be used in a multi-threaded environment. The API for a
reentrant scanner is different from that of a non-reentrant scanner
(@pxref{Reentrant}). Because of the API difference between
reentrant and non-reentrant @code{flex} scanners, non-reentrant flex
code must be modified before it is suitable for use with this option.
This option is not compatible with the @samp{--c++} option.
The option @samp{--reentrant} does not affect the performance of
the scanner.
@opindex -+
@opindex ---c++
@opindex c++
@item -+, --c++, @code{%option c++}
specifies that you want flex to generate a C++
scanner class. @xref{Cxx}, for details.
@opindex ---array
@opindex array
@item --array, @code{%option array}
specifies that you want @code{yytext} to be an array instead of a @code{char *}.
@opindex ---pointer
@opindex pointer
@item --pointer, @code{%option pointer}
specifies that @code{yytext} should be a @code{char *}, not an array.
This is the default.
@opindex -P
@opindex ---prefix
@opindex prefix
@item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"}
changes the default @samp{yy} prefix used by @code{flex} for all
globally-visible variable and function names to instead be
@samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of
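@code{yytext} to @code{footext}. It also changes the name of the default
output file from @file{lex.yy.c} to @file{lex.foo.c}. Here is a partial
list of the names affected:
@example
@verbatim
    yy_create_buffer
    yy_delete_buffer
    yy_flush_buffer
    yy_switch_to_buffer
    yyin
    yyleng
    yylex
    yylineno
    yyout
    yyrestart
    yytext
    yywrap
@end verbatim
@end example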
(If you are using a C++ scanner, then only @code{yywrap} and
@code{yyFlexLexer} are affected.) Within your scanner itself, you can
still refer to the global variables and functions using either version
of their name; but externally, they have the modified name.
This option lets you easily link together multiple @code{flex}
programs into the same executable. Note, though, that using this
option also renames @code{yywrap()}, so you now @emph{must} either
provide your own (appropriately-named) version of the routine for your
scanner, or use @code{%option noyywrap}, as linking with @samp{-lfl}
no longer provides one for you by default.
@opindex ---main
@opindex main
@item --main, @code{%option main}
directs flex to provide a default @code{main()} program for the
scanner, which simply calls @code{yylex()}. This option implies
@code{noyywrap} (see below).
@opindex ---nounistd
@opindex nounistd
@item --nounistd, @code{%option nounistd}
suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option
is meant to target environments in which @file{unistd.h} does not exist. Be aware
that certain options may cause flex to generate code that relies on functions
normally found in @file{unistd.h} (e.g., @code{isatty()} and @code{read()}).
If you wish to use these functions, you will have to inform your compiler where
to find them.
@xref{option-always-interactive}. @xref{option-read}.
@opindex ---yyclass
@opindex yyclass
@item --yyclass=NAME, @code{%option yyclass="NAME"}
only applies when generating a C++ scanner (the @samp{--c++} option). It
informs @code{flex} that you have derived @code{NAME} as a subclass of
@code{yyFlexLexer}, so @code{flex} will place your actions in the member
function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It
also generates a @code{yyFlexLexer::yylex()} member function that emits
a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if
called. @xref{Cxx}.
@end table
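The remark under @samp{--prefix} that both versions of a name work inside
the scanner, while only the modified name is visible externally, can be
pictured with a simplified model. It assumes that the renaming is done with
one preprocessor macro per affected name; the variable contents and the
function @code{current_text} are made up for the illustration.

```c
#include <assert.h>

/* Simplified model of what --prefix=foo does to one name: the familiar
 * spelling becomes a macro for the prefixed one.  (Assumption: a real
 * generated scanner would have one such line per affected name.) */
#define yytext footext

/* Because of the macro, this line actually defines the global `footext`. */
const char *yytext = "token";

/* Code inside the scanner may use either spelling. */
const char *current_text(void)
{
    assert(&yytext == &footext);   /* both names denote the same object */
    return yytext;
}
```

Other source files, which do not see the macro, have to refer to the
variable as @code{footext}.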
@node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options
@section Options for Scanner Speed and Size
@table @samp
@item -C[aefFmr]
controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.
@table @samp
@opindex -C
@item -C
A lone @samp{-C} specifies that the scanner tables should be compressed
but neither equivalence classes nor meta-equivalence classes should be
used.
@opindex -Ca
@opindex ---align
@opindex align
@item -Ca, --align, @code{%option align}
(``align'') instructs flex to trade off larger tables in the
generated scanner for faster performance because the elements of
the tables are better aligned for memory access and computation. On some
RISC architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords. This option can
quadruple the size of the tables used by your scanner.
@opindex -Ce
@opindex ---ecs
@opindex ecs
@item -Ce, --ecs, @code{%option ecs}
directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets
of characters which have identical lexical properties (for example, if
the only appearance of digits in the @code{flex} input is in the
character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be
put in the same equivalence class). Equivalence classes usually give
dramatic reductions in the final table/object file sizes (typically a
factor of 2-5) and are pretty cheap performance-wise (one array look-up
per character scanned).
@opindex -Cf
@item -Cf
specifies that the @dfn{full} scanner tables should be generated;
@code{flex} should not compress the tables by taking advantage of
similar transition functions for different states.
@opindex -CF
@item -CF
specifies that the alternate fast scanner representation (described
below under the @samp{--fast} flag) should be used. This option cannot be
used with @samp{--c++}.
@opindex -Cm
@opindex ---meta-ecs
@opindex meta-ecs
@item -Cm, --meta-ecs, @code{%option meta-ecs}
directs @code{flex} to construct
@dfn{meta-equivalence classes},
which are sets of equivalence classes (or characters, if equivalence
classes are not being used) that are commonly used together. Meta-equivalence
classes are often a big win when using compressed tables, but they
have a moderate performance impact (one or two @code{if} tests and one
array look-up per character scanned).
@opindex -Cr
@opindex ---read
@opindex read
@item -Cr, --read, @code{%option read}
causes the generated scanner to @emph{bypass} use of the standard I/O
library (@code{stdio}) for input. Instead of calling @code{fread()} or
@code{getc()}, the scanner will use the @code{read()} system call,
resulting in a performance gain which varies from system to system, but
in general is probably negligible unless you are also using @samp{-Cf}
or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for
example, you read from @file{yyin} using @code{stdio} prior to calling
the scanner (because the scanner will miss whatever text your previous
reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect
if you define @code{YY_INPUT()} (@pxref{Generated Scanner}).
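The buffering hazard described above can be reproduced outside of
@code{flex}. The following C sketch is an illustration only: the function
name @code{bytes_hidden_by_stdio} is invented here, and the result depends
on the C library using a buffered @code{stdio} stream for regular files,
which is typical but not guaranteed. It reads one character with
@code{getc()} and then checks how much of the remaining text a direct
@code{read()} can still see.

```c
#define _POSIX_C_SOURCE 200809L  /* fileno() and read() in strict ISO C modes */
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative helper (not part of flex).  Writes "hello world" to a
 * temporary file, reads one character through stdio, then calls read()
 * on the underlying descriptor.  Returns the number of bytes after the
 * first character that read() could NOT see because stdio had already
 * pulled them into its own buffer, or -1 if the setup failed. */
int bytes_hidden_by_stdio(void)
{
    static const char text[] = "hello world";     /* 11 bytes */
    const int remaining = (int)(sizeof text - 1) - 1;  /* bytes after 'h' */
    char buf[32];
    ssize_t got;
    FILE *fp = tmpfile();

    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return -1;
    }
    rewind(fp);             /* flush the write and seek back to the start */

    if (getc(fp) != 'h') {  /* stdio usually fills a whole buffer here */
        fclose(fp);
        return -1;
    }

    got = read(fileno(fp), buf, sizeof buf);  /* bypasses the stdio buffer */
    fclose(fp);
    if (got < 0)
        return -1;

    assert(got <= remaining);
    return remaining - (int)got;
}
```

On a system where @code{stdio} buffers the file, the function returns a
positive count: those bytes sit in the @code{stdio} buffer, which is
exactly the text a @samp{-Cr} scanner would miss.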
@end table
The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense
together - there is no opportunity for meta-equivalence classes if the
table is not being compressed. Otherwise the options may be freely
mixed, and are cumulative.
The default setting is @samp{-Cem}, which specifies that @code{flex}
should generate equivalence classes and meta-equivalence classes. This
setting provides the highest degree of table compression. You can trade
table size for scanner speed, with the following generally being true:
@example
@verbatim
slowest & smallest
-Cem
-Cm
-Ce
-C
-C{f,F}e
-C{f,F}
-C{f,F}a
fastest & largest
@end verbatim
@end example
Note that scanners with the smallest tables are usually generated and
compiled the quickest, so during development you will usually want to
use the default, maximal compression.
@samp{-Cfe} is often a good compromise between speed and size for
production scanners.
@opindex -f
@opindex ---full
@opindex full
@item -f, --full, @code{%option full}
specifies a @dfn{fast scanner}.
No table compression is done and @code{stdio} is bypassed.
The result is large but fast. This option is equivalent to
@samp{-Cfr} (see above).
@opindex -F
@opindex ---fast
@opindex fast
@item -F, --fast, @code{%option fast}
specifies that the @emph{fast} scanner table representation should be
used (and @code{stdio} bypassed). This representation is about as fast
as the full table representation @samp{--full}, and for some sets of
patterns will be considerably smaller (and for others, larger). In
general, if the pattern set contains both @emph{keywords} and a
catch-all, @emph{identifier} rule, such as in the set:
@example
@verbatim
"case" return TOK_CASE;
"switch" return TOK_SWITCH;
"default" return TOK_DEFAULT;
[a-z]+ return TOK_ID;
@end verbatim
@end example
then you're better off using the full table representation. If only
the @emph{identifier} rule is present and you then use a hash table or some such
to detect the keywords, you're better off using
@samp{--fast}.
This option is equivalent to @samp{-CFr}. It cannot be used
with @samp{--c++}.
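A minimal sketch of the second approach follows. It keeps only the
identifier rule in the scanner and classifies the matched text afterwards.
For brevity it uses a sorted table with @code{bsearch()} rather than a
hash table; the helper @code{classify_identifier} and the numeric token
values are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder token codes; a real scanner would take these from its
 * parser's header file. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* The table must stay sorted by name for bsearch() to work. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

/* bsearch() comparison: the key is a C string, the element a keyword. */
static int compare_keyword(const void *key, const void *elem)
{
    const struct keyword *kw = (const struct keyword *)elem;
    return strcmp((const char *)key, kw->name);
}

/* Hypothetical helper: return the keyword token for text matched by the
 * identifier rule, or TOK_ID when the text is not a keyword. */
int classify_identifier(const char *text)
{
    const struct keyword *kw;

    assert(text != NULL);
    kw = (const struct keyword *)bsearch(text, keywords,
            sizeof keywords / sizeof keywords[0],
            sizeof keywords[0], compare_keyword);
    return kw != NULL ? kw->token : TOK_ID;
}
```

The single scanner rule would then read something like
@code{[a-z]+  return classify_identifier(yytext);}.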
@end table
@node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options
@section Debugging Options
@table @samp
@opindex -b
@opindex ---backup
@opindex backup
@item -b, --backup, @code{%option backup}
Generate backing-up information to @file{lex.backup}. This is a list of
scanner states which require backing up and the input characters on
which they do so. By adding rules one can remove backing-up states. If
@emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF}
is used, the generated scanner will run faster (see the @samp{--perf-report} flag).
Only users who wish to squeeze every last cycle out of their scanners
need worry about this option (@pxref{Performance}).
@opindex -d
@opindex ---debug
@opindex debug
@item -d, --debug, @code{%option debug}
makes the generated scanner run in @dfn{debug} mode. Whenever a pattern
is recognized and the global variable @code{yy_flex_debug} is non-zero
(which is the default), the scanner will write to @file{stderr} a line
of the form:
@example
@verbatim
--accepting rule at line 53 ("the matched text")
@end verbatim
@end example
The line number refers to the location of the rule in the file defining
the scanner (i.e., the file that was fed to flex). Messages are also
generated when the scanner backs up, accepts the default rule, reaches
the end of its input buffer (or encounters a NUL; at this point, the two
look the same as far as the scanner's concerned), or reaches an
end-of-file.
@opindex -p
@opindex ---perf-report
@opindex perf-report
@item -p, --perf-report, @code{%option perf-report}
generates a performance report to @file{stderr}. The report consists of
comments regarding features of the @code{flex} input file which will
cause a serious loss of performance in the resulting scanner. If you
give the flag twice, you will also get comments regarding features that
lead to minor performance losses.
Note that the use of @code{REJECT} and
variable trailing context (@pxref{Limitations}) entails a substantial
performance penalty; use of @code{yymore()}, the @samp{^} operator, and
the @samp{--interactive} flag entails minor performance penalties.
@opindex -s
@opindex ---nodefault
@opindex nodefault
@item -s, --nodefault, @code{%option nodefault}
causes the @emph{default rule} (that unmatched scanner input is echoed
to @file{stdout}) to be suppressed. If the scanner encounters input
that does not match any of its rules, it aborts with an error. This
option is useful for finding holes in a scanner's rule set.
@opindex -T
@opindex ---trace
@opindex trace
@item -T, --trace, @code{%option trace}
makes @code{flex} run in @dfn{trace} mode. It will generate a lot of
messages to @file{stderr} concerning the form of the input and the
resultant non-deterministic and deterministic finite automata. This
option is mostly for use in maintaining @code{flex}.
@opindex -w
@opindex ---nowarn
@opindex nowarn
@item -w, --nowarn, @code{%option nowarn}
suppresses warning messages.
@opindex -v
@opindex ---verbose
@opindex verbose
@item -v, --verbose, @code{%option verbose}
specifies that @code{flex} should write to @file{stderr} a summary of
statistics regarding the scanner it generates. Most of the statistics
are meaningless to the casual @code{flex} user, but the first line
identifies the version of @code{flex} (same as reported by @samp{--version}),
and the next line the flags used when generating the scanner, including
those that are on by default.
@opindex ---warn
@opindex warn
@item --warn, @code{%option warn}
warns about certain things. In particular, if the default rule can be
matched but no default rule has been given, @code{flex} will warn you.
We recommend using this option always.
@end table
@node Miscellaneous Options, , Debugging Options, Scanner Options
@section Miscellaneous Options
@table @samp
@opindex -c
@item -c
A do-nothing option included for POSIX compliance.
@opindex -h
@opindex ---help
@item -h, -?, --help
generates a ``help'' summary of @code{flex}'s options to @file{stdout}
and then exits.
@opindex -n
@item -n
Another do-nothing option included for
POSIX compliance.
@opindex -V
@opindex ---version
@item -V, --version
prints the version number to @file{stdout} and exits.
@end table
@node Performance, Cxx, Scanner Options, Top
@chapter Performance Considerations
@cindex performance, considerations
The main design goal of @code{flex} is that it generate high-performance
scanners. It has been optimized for dealing well with large sets of
rules. Aside from the effects on scanner speed of the table compression
@samp{-C} options outlined above, there are a number of options/actions
which degrade performance. These are, from most expensive to least:
@cindex REJECT, performance costs
@cindex yylineno, performance costs
@cindex trailing context, performance costs
@example
@verbatim
REJECT
arbitrary trailing context
pattern sets that require backing up
%option yylineno
%option interactive
%option always-interactive
^ beginning-of-line operator
yymore()
@end verbatim
@end example
with the first two being quite expensive and the last two being
quite cheap. Note also that @code{unput()} is implemented as a routine
call that potentially does quite a bit of work, while @code{yyless()} is
a quite-cheap macro. So if you are just putting back some excess text
you scanned, use @code{yyless()}.
@code{REJECT} should be avoided at all costs when performance is
important. It is a particularly expensive option.
There is one case when @code{%option yylineno} can be expensive: when
your patterns match long tokens that could @emph{possibly} contain a newline
character. There is no performance penalty for rules that cannot possibly
match newlines, since @code{flex} does not need to check them for newlines. In
general, you should avoid rules such as @code{[^f]+}, which match very long
tokens, including newlines, and may possibly match your entire file! A better
approach is to separate @code{[^f]+} into two rules:
@example
@verbatim
%option yylineno
%%
[^f\n]+
\n+
@end verbatim
@end example
The above scanner does not incur a performance penalty.
@cindex patterns, tuning for performance
@cindex performance, backing up
@cindex backing up, example of eliminating
Getting rid of backing up is messy and often may be an enormous amount
of work for a complicated scanner. In principle, one begins by using
the @samp{-b} flag to generate a @file{lex.backup} file. For example,
on the input:
@cindex backing up, eliminating
@example
@verbatim
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
@end verbatim
@end example
the file looks like:
@example
@verbatim
State #6 is non-accepting -
associated rule line numbers:
2 3
out-transitions: [ o ]
jam-transitions: EOF [ \001-n p-\177 ]

State #8 is non-accepting -
associated rule line numbers:
3
out-transitions: [ a ]
jam-transitions: EOF [ \001-` b-\177 ]

State #9 is non-accepting -
associated rule line numbers:
3
out-transitions: [ r ]
jam-transitions: EOF [ \001-q s-\177 ]

Compressed tables always back up.
@end verbatim
@end example
The first few lines tell us that there's a scanner state in which it can
make a transition on an 'o' but not on any other character, and that in
that state the currently scanned text does not match any rule. The
state occurs when trying to match the rules found at lines 2 and 3 in
the input file. If the scanner is in that state and then reads
something other than an 'o', it will have to back up to find a rule
which is matched. With a bit of headscratching one can see that this
must be the state it's in when it has seen @samp{fo}. When this has
happened, if anything other than another @samp{o} is seen, the scanner
will have to back up to simply match the @samp{f} (by the default rule).
The comment regarding State #8 indicates there's a problem when
@samp{foob} has been scanned. Indeed, on any character other than an
@samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly,
the comment for State #9 concerns when @samp{fooba} has been scanned and
an @samp{r} does not follow.
The final comment reminds us that there's no point going to all the
trouble of removing backing up from the rules unless we're using
@samp{-Cf} or @samp{-CF}, since there's no performance gain doing so
with compressed scanners.
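To make the backing-up behaviour concrete, here is a hand-written C
sketch of longest-match scanning for just the two rules @samp{foo} and
@samp{foobar} plus the default single-character rule. It is not the code
@code{flex} generates; the function @code{scan_one_token}, the
@code{struct match} type and the rule constants are invented for this
illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { RULE_NONE = 0, RULE_FOO, RULE_FOOBAR, RULE_DEFAULT };

struct match {
    int    rule;       /* rule that finally matched */
    size_t length;     /* length of the matched text */
    size_t backed_up;  /* characters consumed beyond the match, then returned */
};

/* Scan one token from a non-empty, NUL-terminated input string. */
struct match scan_one_token(const char *input)
{
    static const char longest[] = "foobar";
    struct match m = { RULE_NONE, 0, 0 };
    size_t pos = 0;

    assert(input != NULL && input[0] != '\0');

    /* The automaton has a single path: the prefixes of "foobar". */
    while (pos < sizeof longest - 1 && input[pos] == longest[pos]) {
        pos++;
        /* Record the most recent accepting state as we pass it. */
        if (pos == 1) { m.rule = RULE_DEFAULT; m.length = 1; }  /* "f"      */
        if (pos == 3) { m.rule = RULE_FOO;     m.length = 3; }  /* "foo"    */
        if (pos == 6) { m.rule = RULE_FOOBAR;  m.length = 6; }  /* "foobar" */
    }

    if (m.rule == RULE_NONE) {
        /* First character was not 'f': the default rule matches it. */
        m.rule = RULE_DEFAULT;
        m.length = 1;
        pos = 1;
    }

    /* If we stopped in a non-accepting state ("fo", "foob", "fooba"),
     * the scanner must return to the last accepting state. */
    m.backed_up = pos - m.length;
    return m;
}
```

For the input @samp{fox} the sketch consumes @samp{fo}, jams on
@samp{x}, and backs up one character to match @samp{f} by the default
rule. For @samp{foobaz} it backs up two characters to accept
@samp{foo}. These are the situations reported for states #6 and #9.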
@cindex error rules, to eliminate backing up
The way to remove the backing up is to add ``error'' rules:
@cindex backing up, eliminating by adding error rules
@example
@verbatim
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
fooba |
foob |
fo {
/* false alarm, not really a keyword */
return TOK_ID;
}
@end verbatim
@end example
Eliminating backing up among a list of keywords can also be done using a
``catch-all'' rule:
@cindex backing up, eliminating with catch-all rule
@example
@verbatim
%%
foo return TOK_KEYWORD;
foobar return TOK_KEYWORD;
[a-z]+ return TOK_ID;
@end verbatim
@end example
This is usually the best solution when appropriate.
Backing-up messages tend to cascade. With a complicated set of rules
it's not uncommon to get hundreds of messages. If one can decipher
them, though, it often only takes a dozen or so rules to eliminate the
backing up, though it's easy to make a mistake and have an error rule
accidentally match a valid token. (A possible future @code{flex} feature
will be to automatically add rules to eliminate backing up.)
It's important to keep in mind that you gain the benefits of eliminating
backing up only if you eliminate @emph{every} instance of backing up.
Leaving just one means you gain nothing.
@emph{Variable} trailing context (where neither the leading nor the trailing
part has a fixed length) entails almost the same performance
loss as @code{REJECT} (i.e., substantial). So when possible a rule
like:
@cindex trailing context, variable length
@example
@verbatim
%%
mouse|rat/(cat|dog) run();
@end verbatim
@end example
is better written:
@example
@verbatim
%%
mouse/cat|dog run();
rat/cat|dog run();
@end verbatim
@end example
or as
@example
@verbatim
%%
mouse|rat/cat run();
mouse|rat/dog run();
@end verbatim
@end example
Note that here the special '|' action does @emph{not} provide any
savings, and can even make things worse (@pxref{Limitations}).
Another area where the user can increase a scanner's performance (and
one that's easier to implement) arises from the fact that the longer the
tokens matched, the faster the scanner will run. This is because with
long tokens the processing of most input characters takes place in the
(short) inner scanning loop, and does not often have to go through the
additional work of setting up the scanning environment (e.g.,
@code{yytext}) for the action. Recall the scanner for C comments:
@cindex performance optimization, matching longer tokens
@example
@verbatim
%x comment
%%
        int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>"*"+[^*/\n]*
<comment>\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
@end verbatim
@end example
This could be sped up by writing it as:
@example
@verbatim
%x comment
%%
        int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>[^*\n]*\n ++line_num;
<comment>"*"+[^*/\n]*
<comment>"*"+[^*/\n]*\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
@end verbatim
@end example
Now instead of each newline requiring the processing of another action,
recognizing the newlines is distributed over the other rules to keep the
matched text as long as possible. Note that @emph{adding} rules does
@emph{not} slow down the scanner! The speed of the scanner is
independent of the number of rules or (modulo the considerations given
at the beginning of this section) how complicated the rules are with
regard to operators such as @samp{*} and @samp{|}.
@cindex keywords, for performance
@cindex performance, using keywords
A final example in speeding up a scanner: suppose you want to scan
through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the
keywords. A natural first approach is:
@cindex performance optimization, recognizing keywords
@example
@verbatim
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it's a keyword */
.|\n /* it's not a keyword */
@end verbatim
@end example
To eliminate the back-tracking, introduce a catch-all rule:
@example
@verbatim
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it's a keyword */
[a-z]+ |
.|\n /* it's not a keyword */
@end verbatim
@end example
Now, if it's guaranteed that there's exactly one word per line, then we
can reduce the total number of matches by a half by merging in the
recognition of newlines with that of the other tokens:
@example
@verbatim
%%
asm\n |
auto\n |
break\n |
... etc ...
volatile\n |
while\n /* it's a keyword */
[a-z]+\n |
.|\n /* it's not a keyword */
@end verbatim
@end example
One has to be careful here, as we have now reintroduced backing up
into the scanner. In particular, while
@emph{we}
know that there will never be any characters in the input stream
other than letters or newlines,
@code{flex}
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like @samp{auto} and then the next character
is something other than a newline or a letter. Previously it would
then just match the @samp{auto} rule and be done, but now it has no @samp{auto}
rule, only an @samp{auto\n} rule. To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
@emph{care}
how it's classified, we can introduce one more catch-all rule, this
one which doesn't include a newline:
@example
@verbatim
%%
asm\n |
auto\n |
break\n |
... etc ...
volatile\n |
while\n /* it's a keyword */
[a-z]+\n |
[a-z]+ |
.|\n /* it's not a keyword */
@end verbatim
@end example
Compiled with @samp{-Cf}, this is about as fast as one can get a
@code{flex} scanner to go for this particular problem.