| \input texinfo.tex @c -*-texinfo-*- |
| @c %**start of header |
| @setfilename flex.info |
| @include version.texi |
| @settitle Lexical Analysis With Flex, for Flex @value{VERSION} |
| @set authors Vern Paxson, Will Estes and John Millaway |
| @c "Macro Hooks" index |
| @defindex hk |
| @c "Options" index |
| @defindex op |
| @dircategory Programming |
| @direntry |
| * flex: (flex). Fast lexical analyzer generator (lex replacement). |
| @end direntry |
| @c %**end of header |
| |
| @copying |
| |
| The flex manual is placed under the same licensing conditions as the |
| rest of flex: |
| |
| Copyright @copyright{} 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2012 |
| The Flex Project. |
| |
| Copyright @copyright{} 1990, 1997 The Regents of the University of California. |
| All rights reserved. |
| |
| This code is derived from software contributed to Berkeley by |
| Vern Paxson. |
| |
| The United States Government has rights in this work pursuant |
| to contract no. DE-AC03-76SF00098 between the United States |
| Department of Energy and the University of California. |
| |
| Redistribution and use in source and binary forms, with or without |
| modification, are permitted provided that the following conditions |
| are met: |
| |
| @enumerate |
| @item |
| Redistributions of source code must retain the above copyright |
| notice, this list of conditions and the following disclaimer. |
| |
| @item |
| Redistributions in binary form must reproduce the above copyright |
| notice, this list of conditions and the following disclaimer in the |
| documentation and/or other materials provided with the distribution. |
| @end enumerate |
| |
| Neither the name of the University nor the names of its contributors |
| may be used to endorse or promote products derived from this software |
| without specific prior written permission. |
| |
| THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR |
| IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED |
| WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR |
| PURPOSE. |
| @end copying |
| |
| @titlepage |
| @title Lexical Analysis with Flex |
| @subtitle Edition @value{EDITION}, @value{UPDATED} |
| @author @value{authors} |
| @page |
| @vskip 0pt plus 1filll |
| @insertcopying |
| @end titlepage |
| @contents |
| @ifnottex |
| @node Top, Copyright, (dir), (dir) |
| @top flex |
| |
| This manual describes @code{flex}, a tool for generating programs that |
| perform pattern-matching on text. The manual includes both tutorial and |
| reference sections. |
| |
| This edition of @cite{The flex Manual} documents @code{flex} version |
| @value{VERSION}. It was last updated on @value{UPDATED}. |
| |
| This manual was written by @value{authors}. |
| |
| @menu |
| * Copyright:: |
| * Reporting Bugs:: |
| * Introduction:: |
| * Simple Examples:: |
| * Format:: |
| * Patterns:: |
| * Matching:: |
| * Actions:: |
| * Generated Scanner:: |
| * Start Conditions:: |
| * Multiple Input Buffers:: |
| * EOF:: |
| * Misc Macros:: |
| * User Values:: |
| * Yacc:: |
| * Scanner Options:: |
| * Performance:: |
| * Cxx:: |
| * Reentrant:: |
| * Lex and Posix:: |
| * Memory Management:: |
| * Serialized Tables:: |
| * Diagnostics:: |
| * Limitations:: |
| * Bibliography:: |
| * FAQ:: |
| * Appendices:: |
| * Indices:: |
| |
| @detailmenu |
| --- The Detailed Node Listing --- |
| |
| Format of the Input File |
| |
| * Definitions Section:: |
| * Rules Section:: |
| * User Code Section:: |
| * Comments in the Input:: |
| |
| Scanner Options |
| |
| * Options for Specifying Filenames:: |
| * Options Affecting Scanner Behavior:: |
| * Code-Level And API Options:: |
| * Options for Scanner Speed and Size:: |
| * Debugging Options:: |
| * Miscellaneous Options:: |
| |
| Reentrant C Scanners |
| |
| * Reentrant Uses:: |
| * Reentrant Overview:: |
| * Reentrant Example:: |
| * Reentrant Detail:: |
| * Reentrant Functions:: |
| |
| The Reentrant API in Detail |
| |
| * Specify Reentrant:: |
| * Extra Reentrant Argument:: |
| * Global Replacement:: |
| * Init and Destroy Functions:: |
| * Accessor Methods:: |
| * Extra Data:: |
| * About yyscan_t:: |
| |
| Memory Management |
| |
| * The Default Memory Management:: |
| * Overriding The Default Memory Management:: |
| * A Note About yytext And Memory:: |
| |
| Serialized Tables |
| |
| * Creating Serialized Tables:: |
| * Loading and Unloading Serialized Tables:: |
| * Tables File Format:: |
| |
| FAQ |
| |
| * When was flex born?:: |
| * How do I expand backslash-escape sequences in C-style quoted strings?:: |
| * Why do flex scanners call fileno if it is not ANSI compatible?:: |
| * Does flex support recursive pattern definitions?:: |
| * How do I skip huge chunks of input (tens of megabytes) while using flex?:: |
| * Flex is not matching my patterns in the same order that I defined them.:: |
| * My actions are executing out of order or sometimes not at all.:: |
| * How can I have multiple input sources feed into the same scanner at the same time?:: |
| * Can I build nested parsers that work with the same input file?:: |
| * How can I match text only at the end of a file?:: |
| * How can I make REJECT cascade across start condition boundaries?:: |
| * Why cant I use fast or full tables with interactive mode?:: |
| * How much faster is -F or -f than -C?:: |
| * If I have a simple grammar cant I just parse it with flex?:: |
| * Why doesn't yyrestart() set the start state back to INITIAL?:: |
| * How can I match C-style comments?:: |
| * The period isn't working the way I expected.:: |
| * Can I get the flex manual in another format?:: |
| * Does there exist a "faster" NDFA->DFA algorithm?:: |
| * How does flex compile the DFA so quickly?:: |
| * How can I use more than 8192 rules?:: |
| * How do I abandon a file in the middle of a scan and switch to a new file?:: |
| * How do I execute code only during initialization (only before the first scan)?:: |
| * How do I execute code at termination?:: |
| * Where else can I find help?:: |
| * Can I include comments in the "rules" section of the file?:: |
| * I get an error about undefined yywrap().:: |
| * How can I change the matching pattern at run time?:: |
| * How can I expand macros in the input?:: |
| * How can I build a two-pass scanner?:: |
| * How do I match any string not matched in the preceding rules?:: |
| * I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: |
| * Is there a way to make flex treat NULL like a regular character?:: |
| * Whenever flex can not match the input it says "flex scanner jammed".:: |
| * Why doesn't flex have non-greedy operators like perl does?:: |
| * Memory leak - 16386 bytes allocated by malloc.:: |
| * How do I track the byte offset for lseek()?:: |
| * How do I use my own I/O classes in a C++ scanner?:: |
| * How do I skip as many chars as possible?:: |
| * deleteme00:: |
| * Are certain equivalent patterns faster than others?:: |
| * Is backing up a big deal?:: |
| * Can I fake multi-byte character support?:: |
| * deleteme01:: |
| * Can you discuss some flex internals?:: |
| * unput() messes up yy_at_bol:: |
| * The | operator is not doing what I want:: |
| * Why can't flex understand this variable trailing context pattern?:: |
| * The ^ operator isn't working:: |
| * Trailing context is getting confused with trailing optional patterns:: |
| * Is flex GNU or not?:: |
| * ERASEME53:: |
| * I need to scan if-then-else blocks and while loops:: |
| * ERASEME55:: |
| * ERASEME56:: |
| * ERASEME57:: |
| * Is there a repository for flex scanners?:: |
| * How can I conditionally compile or preprocess my flex input file?:: |
| * Where can I find grammars for lex and yacc?:: |
| * I get an end-of-buffer message for each character scanned.:: |
| * unnamed-faq-62:: |
| * unnamed-faq-63:: |
| * unnamed-faq-64:: |
| * unnamed-faq-65:: |
| * unnamed-faq-66:: |
| * unnamed-faq-67:: |
| * unnamed-faq-68:: |
| * unnamed-faq-69:: |
| * unnamed-faq-70:: |
| * unnamed-faq-71:: |
| * unnamed-faq-72:: |
| * unnamed-faq-73:: |
| * unnamed-faq-74:: |
| * unnamed-faq-75:: |
| * unnamed-faq-76:: |
| * unnamed-faq-77:: |
| * unnamed-faq-78:: |
| * unnamed-faq-79:: |
| * unnamed-faq-80:: |
| * unnamed-faq-81:: |
| * unnamed-faq-82:: |
| * unnamed-faq-83:: |
| * unnamed-faq-84:: |
| * unnamed-faq-85:: |
| * unnamed-faq-86:: |
| * unnamed-faq-87:: |
| * unnamed-faq-88:: |
| * unnamed-faq-90:: |
| * unnamed-faq-91:: |
| * unnamed-faq-92:: |
| * unnamed-faq-93:: |
| * unnamed-faq-94:: |
| * unnamed-faq-95:: |
| * unnamed-faq-96:: |
| * unnamed-faq-97:: |
| * unnamed-faq-98:: |
| * unnamed-faq-99:: |
| * unnamed-faq-100:: |
| * unnamed-faq-101:: |
| * What is the difference between YYLEX_PARAM and YY_DECL?:: |
| * Why do I get "conflicting types for yylex" error?:: |
| * How do I access the values set in a Flex action from within a Bison action?:: |
| |
| Appendices |
| |
| * Makefiles and Flex:: |
| * Bison Bridge:: |
| * M4 Dependency:: |
| * Common Patterns:: |
| |
| Indices |
| |
| * Concept Index:: |
| * Index of Functions and Macros:: |
| * Index of Variables:: |
| * Index of Data Types:: |
| * Index of Hooks:: |
| * Index of Scanner Options:: |
| |
| @end detailmenu |
| @end menu |
| @end ifnottex |
| @node Copyright, Reporting Bugs, Top, Top |
| @chapter Copyright |
| |
| @cindex copyright of flex |
| @cindex distributing flex |
| @insertcopying |
| |
| @node Reporting Bugs, Introduction, Copyright, Top |
| @chapter Reporting Bugs |
| |
| @cindex bugs, reporting |
| @cindex reporting bugs |
| |
| If you find a bug in @code{flex}, please report it using |
| the SourceForge Bug Tracking facilities which can be found on |
| @url{http://sourceforge.net/projects/flex,flex's SourceForge Page}. |
| |
| @node Introduction, Simple Examples, Reporting Bugs, Top |
| @chapter Introduction |
| |
| @cindex scanner, definition of |
| @code{flex} is a tool for generating @dfn{scanners}. A scanner is a |
| program which recognizes lexical patterns in text. The @code{flex} |
| program reads the given input files, or its standard input if no file |
| names are given, for a description of a scanner to generate. The |
| description is in the form of pairs of regular expressions and C code, |
| called @dfn{rules}. @code{flex} generates as output a C source file, |
| @file{lex.yy.c} by default, which defines a routine @code{yylex()}. |
| This file can be compiled and linked with the flex runtime library to |
| produce an executable. When the executable is run, it analyzes its |
| input for occurrences of the regular expressions. Whenever it finds |
| one, it executes the corresponding C code. |
| |
| @node Simple Examples, Format, Introduction, Top |
| @chapter Some Simple Examples |
| |
| First some simple examples to get the flavor of how one uses |
| @code{flex}. |
| |
| @cindex username expansion |
| The following @code{flex} input specifies a scanner which, when it |
| encounters the string @samp{username}, will replace it with the user's |
| login name: |
| |
| @example |
| @verbatim |
| %% |
| username printf( "%s", getlogin() ); |
| @end verbatim |
| @end example |
| |
| @cindex default rule |
| @cindex rules, default |
| By default, any text not matched by a @code{flex} scanner is copied to |
| the output, so the net effect of this scanner is to copy its input file |
| to its output with each occurrence of @samp{username} expanded. In this |
| input, there is just one rule. @samp{username} is the @dfn{pattern} and |
| the @samp{printf} is the @dfn{action}. The @samp{%%} symbol marks the |
| beginning of the rules. |
| |
| Here's another simple example: |
| |
| @cindex counting characters and lines |
| @example |
| @verbatim |
| int num_lines = 0, num_chars = 0; |
| |
| %% |
| \n ++num_lines; ++num_chars; |
| . ++num_chars; |
| |
| %% |
| |
| int main() |
| { |
| yylex(); |
| printf( "# of lines = %d, # of chars = %d\n", |
| num_lines, num_chars ); |
| } |
| @end verbatim |
| @end example |
| |
| This scanner counts the number of characters and the number of lines in |
| its input. It produces no output other than the final report on the |
| character and line counts. The first line declares two globals, |
| @code{num_lines} and @code{num_chars}, which are accessible both inside |
| @code{yylex()} and in the @code{main()} routine declared after the |
| second @samp{%%}. There are two rules, one which matches a newline |
| (@samp{\n}) and increments both the line count and the character count, |
| and one which matches any character other than a newline (indicated by |
| the @samp{.} regular expression). |
| |
| A somewhat more complicated example: |
| |
| @cindex Pascal-like language |
| @example |
| @verbatim |
| /* scanner for a toy Pascal-like language */ |
| |
| %{ |
| /* atoi() and atof(), used below, are declared in <stdlib.h> */ |
| #include <stdlib.h> |
| %} |
| |
| DIGIT [0-9] |
| ID [a-z][a-z0-9]* |
| |
| %% |
| |
| {DIGIT}+ { |
| printf( "An integer: %s (%d)\n", yytext, |
| atoi( yytext ) ); |
| } |
| |
| {DIGIT}+"."{DIGIT}* { |
| printf( "A float: %s (%g)\n", yytext, |
| atof( yytext ) ); |
| } |
| |
| if|then|begin|end|procedure|function { |
| printf( "A keyword: %s\n", yytext ); |
| } |
| |
| {ID} printf( "An identifier: %s\n", yytext ); |
| |
| "+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext ); |
| |
| "{"[^}\n]*"}" /* eat up one-line comments */ |
| |
| [ \t\n]+ /* eat up whitespace */ |
| |
| . printf( "Unrecognized character: %s\n", yytext ); |
| |
| %% |
| |
| int main( int argc, char **argv ) |
| { |
| ++argv, --argc; /* skip over program name */ |
| if ( argc > 0 ) |
| yyin = fopen( argv[0], "r" ); |
| else |
| yyin = stdin; |
| |
| yylex(); |
| } |
| @end verbatim |
| @end example |
| |
| This is the beginning of a simple scanner for a language like Pascal. |
| It identifies different types of @dfn{tokens} and reports on what it has |
| seen. |
| |
| The details of this example will be explained in the following |
| sections. |
| |
| @node Format, Patterns, Simple Examples, Top |
| @chapter Format of the Input File |
| |
| |
| @cindex format of flex input |
| @cindex input, format of |
| @cindex file format |
| @cindex sections of flex input |
| |
| The @code{flex} input file consists of three sections, separated by a |
| line containing only @samp{%%}. |
| |
| @cindex format of input file |
| @example |
| @verbatim |
| definitions |
| %% |
| rules |
| %% |
| user code |
| @end verbatim |
| @end example |
| |
| @menu |
| * Definitions Section:: |
| * Rules Section:: |
| * User Code Section:: |
| * Comments in the Input:: |
| @end menu |
| |
| @node Definitions Section, Rules Section, Format, Format |
| @section Format of the Definitions Section |
| |
| @cindex input file, Definitions section |
| @cindex Definitions, in flex input |
| The @dfn{definitions section} contains declarations of simple @dfn{name} |
| definitions to simplify the scanner specification, and declarations of |
| @dfn{start conditions}, which are explained in a later section. |
| |
| @cindex aliases, how to define |
| @cindex pattern aliases, how to define |
| Name definitions have the form: |
| |
| @example |
| @verbatim |
| name definition |
| @end verbatim |
| @end example |
| |
| The @samp{name} is a word beginning with a letter or an underscore |
| (@samp{_}) followed by zero or more letters, digits, @samp{_}, or |
| @samp{-} (dash). The definition is taken to begin at the first |
| non-whitespace character following the name and continuing to the end of |
| the line. The definition can subsequently be referred to using |
| @samp{@{name@}}, which will expand to @samp{(definition)}. For example, |
| |
| @cindex pattern aliases, defining |
| @cindex defining pattern aliases |
| @example |
| @verbatim |
| DIGIT [0-9] |
| ID [a-z][a-z0-9]* |
| @end verbatim |
| @end example |
| |
| defines @samp{DIGIT} to be a regular expression which matches a single |
| digit, and @samp{ID} to be a regular expression which matches a letter |
| followed by zero-or-more letters-or-digits. A subsequent reference to |
| |
| @cindex pattern aliases, use of |
| @example |
| @verbatim |
| {DIGIT}+"."{DIGIT}* |
| @end verbatim |
| @end example |
| |
| is identical to |
| |
| @example |
| @verbatim |
| ([0-9])+"."([0-9])* |
| @end verbatim |
| @end example |
| |
| and matches one-or-more digits followed by a @samp{.} followed by |
| zero-or-more digits. |
| |
| @cindex comments in flex input |
| An unindented comment (i.e., a line |
| beginning with @samp{/*}) is copied verbatim to the output up |
| to the next @samp{*/}. |
| |
| @cindex %@{ and %@}, in Definitions Section |
| @cindex embedding C code in flex input |
| @cindex C code in flex input |
| Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} |
| is also copied verbatim to the output (with the %@{ and %@} symbols |
| removed). The %@{ and %@} symbols must appear unindented on lines by |
| themselves. |
| |
| @cindex %top |
| |
| A @code{%top} block is similar to a @samp{%@{} ... @samp{%@}} block, except |
| that the code in a @code{%top} block is relocated to the @emph{top} of the |
| generated file, before any flex definitions @footnote{Actually, |
| @code{yyIN_HEADER} is defined before the @samp{%top} block.}. |
| The @code{%top} block is useful when you want certain preprocessor macros to be |
| defined or certain files to be included before the generated code. |
| The single characters, @samp{@{} and @samp{@}} are used to delimit the |
| @code{%top} block, as shown in the example below: |
| |
| @example |
| @verbatim |
| %top{ |
| /* This code goes at the "top" of the generated file. */ |
| #include <stdint.h> |
| #include <inttypes.h> |
| } |
| @end verbatim |
| @end example |
| |
| Multiple @code{%top} blocks are allowed, and their order is preserved. |
| |
| @node Rules Section, User Code Section, Definitions Section, Format |
| @section Format of the Rules Section |
| |
| @cindex input file, Rules Section |
| @cindex rules, in flex input |
| The @dfn{rules} section of the @code{flex} input contains a series of |
| rules of the form: |
| |
| @example |
| @verbatim |
| pattern action |
| @end verbatim |
| @end example |
| |
| where the pattern must be unindented and the action must begin |
| on the same line. |
| @xref{Patterns}, for a further description of patterns and actions. |
| |
| In the rules section, any indented or %@{ %@} enclosed text appearing |
| before the first rule may be used to declare variables which are local |
| to the scanning routine and (after the declarations) code which is to be |
| executed whenever the scanning routine is entered. Other indented or |
| %@{ %@} text in the rule section is still copied to the output, but its |
| meaning is not well-defined and it may well cause compile-time errors |
| (this feature is present for @acronym{POSIX} compliance. @xref{Lex and |
| Posix}, for other such features). |
| |
| Any @emph{indented} text or text enclosed in @samp{%@{} and @samp{%@}} |
| is copied verbatim to the output (with the %@{ and %@} symbols removed). |
| The %@{ and %@} symbols must appear unindented on lines by themselves. |
| |
| @node User Code Section, Comments in the Input, Rules Section, Format |
| @section Format of the User Code Section |
| |
| @cindex input file, user code Section |
| @cindex user code, in flex input |
| The user code section is simply copied to @file{lex.yy.c} verbatim. It |
| is used for companion routines which call or are called by the scanner. |
| The presence of this section is optional; if it is missing, the second |
| @samp{%%} in the input file may be skipped, too. |
| |
| @node Comments in the Input, , User Code Section, Format |
| @section Comments in the Input |
| |
| @cindex comments, syntax of |
| Flex supports C-style comments, that is, anything between @samp{/*} and |
| @samp{*/} is |
| considered a comment. Whenever flex encounters a comment, it copies the |
| entire comment verbatim to the generated source code. Comments may |
| appear just about anywhere, but with the following exceptions: |
| |
| @itemize |
| @cindex comments, in rules section |
| @item |
| Comments may not appear in the Rules Section wherever flex is expecting |
| a regular expression. This means comments may not appear at the |
| beginning of a line, or immediately following a list of scanner states. |
| @item |
| Comments may not appear on an @samp{%option} line in the Definitions |
| Section. |
| @end itemize |
| |
| If you want to follow a simple rule, then always begin a comment on a |
| new line, with one or more whitespace characters before the initial |
| @samp{/*}. This rule will work anywhere in the input file. |
| |
| All the comments in the following example are valid: |
| |
| @cindex comments, valid uses of |
| @cindex comments in the input |
| @example |
| @verbatim |
| %{ |
| /* code block */ |
| %} |
| |
| /* Definitions Section */ |
| %x STATE_X |
| |
| %% |
| /* Rules Section */ |
| ruleA /* after regex */ { /* code block */ } /* after code block */ |
| /* Rules Section (indented) */ |
| <STATE_X>{ |
| ruleC ECHO; |
| ruleD ECHO; |
| %{ |
| /* code block */ |
| %} |
| } |
| %% |
| /* User Code Section */ |
| |
| @end verbatim |
| @end example |
| |
| @node Patterns, Matching, Format, Top |
| @chapter Patterns |
| |
| @cindex patterns, in rules section |
| @cindex regular expressions, in patterns |
| The patterns in the input (see @ref{Rules Section}) are written using an |
| extended set of regular expressions. These are: |
| |
| @cindex patterns, syntax |
| @table @samp |
| @item x |
| match the character 'x' |
| |
| @item . |
| any character (byte) except newline |
| |
| @cindex [] in patterns |
| @cindex character classes in patterns, syntax of |
| @cindex POSIX, character classes in patterns, syntax of |
| @item [xyz] |
| a @dfn{character class}; in this case, the pattern |
| matches either an 'x', a 'y', or a 'z' |
| |
| @cindex ranges in patterns |
| @item [abj-oZ] |
| a "character class" with a range in it; matches |
| an 'a', a 'b', any letter from 'j' through 'o', |
| or a 'Z' |
| |
| @cindex ranges in patterns, negating |
| @cindex negating ranges in patterns |
| @item [^A-Z] |
| a "negated character class", i.e., any character |
| but those in the class. In this case, any |
| character EXCEPT an uppercase letter. |
| |
| @item [^A-Z\n] |
| any character EXCEPT an uppercase letter or |
| a newline |
| |
| @item [a-z]@{-@}[aeiou] |
| the lowercase consonants |
| |
| @item r* |
| zero or more r's, where r is any regular expression |
| |
| @item r+ |
| one or more r's |
| |
| @item r? |
| zero or one r's (that is, ``an optional r'') |
| |
| @cindex braces in patterns |
| @item r@{2,5@} |
| anywhere from two to five r's |
| |
| @item r@{2,@} |
| two or more r's |
| |
| @item r@{4@} |
| exactly 4 r's |
| |
| @cindex pattern aliases, expansion of |
| @item @{name@} |
| the expansion of the @samp{name} definition |
| (@pxref{Format}). |
| |
| @cindex literal text in patterns, syntax of |
| @cindex verbatim text in patterns, syntax of |
| @item "[xyz]\"foo" |
| the literal string: @samp{[xyz]"foo} |
| |
| @cindex escape sequences in patterns, syntax of |
| @item \X |
| if X is @samp{a}, @samp{b}, @samp{f}, @samp{n}, @samp{r}, @samp{t}, or |
| @samp{v}, then the ANSI-C interpretation of @samp{\X}. Otherwise, a |
| literal @samp{X} (used to escape operators such as @samp{*}) |
| |
| @cindex NUL character in patterns, syntax of |
| @item \0 |
| a NUL character (ASCII code 0) |
| |
| @cindex octal characters in patterns |
| @item \123 |
| the character with octal value 123 |
| |
| @item \x2a |
| the character with hexadecimal value 2a |
| |
| @item (r) |
| match an @samp{r}; parentheses are used to override precedence (see below) |
| |
| @item (?r-s:pattern) |
| apply option @samp{r} and omit option @samp{s} while interpreting pattern. |
| Options may be zero or more of the characters @samp{i}, @samp{s}, or @samp{x}. |
| |
| @samp{i} means case-insensitive. @samp{-i} means case-sensitive. |
| |
| @samp{s} alters the meaning of the @samp{.} syntax to match any single byte whatsoever. |
| @samp{-s} alters the meaning of @samp{.} to match any byte except @samp{\n}. |
| |
| @samp{x} ignores comments and whitespace in patterns. Whitespace is ignored unless |
| it is backslash-escaped, contained within @samp{""}s, or appears inside a |
| character class. |
| |
| The following are all valid: |
| |
| @verbatim |
| (?:foo) same as (foo) |
| (?i:ab7) same as ([aA][bB]7) |
| (?-i:ab) same as (ab) |
| (?s:.) same as [\x00-\xFF] |
| (?-s:.) same as [^\n] |
| (?ix-s: a . b) same as ([Aa][^\n][bB]) |
| (?x:a b) same as ("ab") |
| (?x:a\ b) same as ("a b") |
| (?x:a" "b) same as ("a b") |
| (?x:a[ ]b) same as ("a b") |
| (?x:a |
| /* comment */ |
| b |
| c) same as (abc) |
| @end verbatim |
| |
| @item (?# comment ) |
| omit everything within @samp{()}. The first @samp{)} |
| character encountered ends the pattern. It is not possible for the comment |
| to contain a @samp{)} character. The comment may span lines. |
| |
| @cindex concatenation, in patterns |
| @item rs |
| the regular expression @samp{r} followed by the regular expression @samp{s}; called |
| @dfn{concatenation} |
| |
| @item r|s |
| either an @samp{r} or an @samp{s} |
| |
| @cindex trailing context, in patterns |
| @item r/s |
| an @samp{r} but only if it is followed by an @samp{s}. The text matched by @samp{s} is |
| included when determining whether this rule is the longest match, but is |
| then returned to the input before the action is executed. So the action |
| only sees the text matched by @samp{r}. This type of pattern is called |
| @dfn{trailing context}. (There are some combinations of @samp{r/s} that flex |
| cannot match correctly. @xref{Limitations}, regarding dangerous trailing |
| context.) |
| |
| @cindex beginning of line, in patterns |
| @cindex BOL, in patterns |
| @item ^r |
| an @samp{r}, but only at the beginning of a line (i.e., |
| when just starting to scan, or right after a |
| newline has been scanned). |
| |
| @cindex end of line, in patterns |
| @cindex EOL, in patterns |
| @item r$ |
| an @samp{r}, but only at the end of a line (i.e., just before a |
| newline). Equivalent to @samp{r/\n}. |
| |
| @cindex newline, matching in patterns |
| Note that @code{flex}'s notion of ``newline'' is exactly |
| whatever the C compiler used to compile @code{flex} |
| interprets @samp{\n} as; in particular, on some DOS |
| systems you must either filter out @samp{\r}s in the |
| input yourself, or explicitly use @samp{r/\r\n} for @samp{r$}. |
| |
| @cindex start conditions, in patterns |
| @item <s>r |
| an @samp{r}, but only in start condition @code{s} (see @ref{Start |
| Conditions} for discussion of start conditions). |
| |
| @item <s1,s2,s3>r |
| same, but in any of start conditions @code{s1}, @code{s2}, or @code{s3}. |
| |
| @item <*>r |
| an @samp{r} in any start condition, even an exclusive one. |
| |
| @cindex end of file, in patterns |
| @cindex EOF in patterns, syntax of |
| @item <<EOF>> |
| an end-of-file. |
| |
| @item <s1,s2><<EOF>> |
| an end-of-file when in start condition @code{s1} or @code{s2} |
| @end table |
| |
| Note that inside of a character class, all regular expression operators |
| lose their special meaning except escape (@samp{\}) and the character class |
| operators, @samp{-}, @samp{]}, and, at the beginning of the class, @samp{^}. |
| |
| @cindex patterns, precedence of operators |
| The regular expressions listed above are grouped according to |
| precedence, from highest precedence at the top to lowest at the bottom. |
| Those grouped together have equal precedence (see special note on the |
| precedence of the repeat operator, @samp{@{@}}, under the documentation |
| for the @samp{--posix} POSIX compliance option). For example, |
| |
| @cindex patterns, grouping and precedence |
| @example |
| @verbatim |
| foo|bar* |
| @end verbatim |
| @end example |
| |
| is the same as |
| |
| @example |
| @verbatim |
| (foo)|(ba(r*)) |
| @end verbatim |
| @end example |
| |
| since the @samp{*} operator has higher precedence than concatenation, |
| and concatenation higher than alternation (@samp{|}). This pattern |
| therefore matches @emph{either} the string @samp{foo} @emph{or} the |
| string @samp{ba} followed by zero-or-more @samp{r}'s. To match |
| @samp{foo} or zero-or-more repetitions of the string @samp{bar}, use: |
| |
| @example |
| @verbatim |
| foo|(bar)* |
| @end verbatim |
| @end example |
| |
| And to match a sequence of zero or more repetitions of @samp{foo} and |
| @samp{bar}: |
| |
| @cindex patterns, repetitions with grouping |
| @example |
| @verbatim |
| (foo|bar)* |
| @end verbatim |
| @end example |
| |
| @cindex character classes in patterns |
| In addition to characters and ranges of characters, character classes |
| can also contain @dfn{character class expressions}. These are |
| expressions enclosed inside @samp{[:} and @samp{:]} delimiters (which |
| themselves must appear between the @samp{[} and @samp{]} of the |
| character class. Other elements may occur inside the character class, |
| too). The valid expressions are: |
| |
| @cindex patterns, valid character classes |
| @example |
| @verbatim |
| [:alnum:] [:alpha:] [:blank:] |
| [:cntrl:] [:digit:] [:graph:] |
| [:lower:] [:print:] [:punct:] |
| [:space:] [:upper:] [:xdigit:] |
| @end verbatim |
| @end example |
| |
| These expressions all designate a set of characters equivalent to the |
| corresponding standard C @code{isXXX} function. For example, |
| @samp{[:alnum:]} designates those characters for which @code{isalnum()} |
| returns true - i.e., any alphabetic or numeric character. Some systems |
| don't provide @code{isblank()}, so flex defines @samp{[:blank:]} as a |
| space or a tab. |
| |
| For example, the following character classes are all equivalent: |
| |
| @cindex character classes, equivalence of |
| @cindex patterns, character class equivalence |
| @example |
| @verbatim |
| [[:alnum:]] |
| [[:alpha:][:digit:]] |
| [[:alpha:][0-9]] |
| [a-zA-Z0-9] |
| @end verbatim |
| @end example |
| |
| A word of caution. Character classes are expanded immediately when seen in the @code{flex} input. |
| This means the character classes are sensitive to the locale in which @code{flex} |
| is executed, and the resulting scanner will not be sensitive to the runtime locale. |
| This may or may not be desirable. |
| |
| |
| @itemize |
| @cindex case-insensitive, effect on character classes |
| @item If your scanner is case-insensitive (the @samp{-i} flag), then |
| @samp{[:upper:]} and @samp{[:lower:]} are equivalent to |
| @samp{[:alpha:]}. |
| |
| @anchor{case and character ranges} |
| @item Character classes with ranges, such as @samp{[A-z]}, should be used with |
| caution in a case-insensitive scanner if the range spans upper or lowercase |
| characters. Flex does not know if you want to fold all upper and lowercase |
| characters together, or if you want the literal numeric range specified (with |
| no case folding). When in doubt, flex will assume that you meant the literal |
| numeric range, and will issue a warning. The exception to this rule is a |
| character range such as @samp{[a-z]} or @samp{[S-W]} where it is obvious that you |
| want case-folding to occur. Here are some examples with the @samp{-i} flag |
| enabled: |
| |
| @multitable {@samp{[a-zA-Z]}} {ambiguous} {@samp{[A-Z\[\\\]_`a-t]}} {@samp{[@@A-Z\[\\\]_`abc]}} |
| @item Range @tab Result @tab Literal Range @tab Alternate Range |
| @item @samp{[a-t]} @tab ok @tab @samp{[a-tA-T]} @tab |
| @item @samp{[A-T]} @tab ok @tab @samp{[a-tA-T]} @tab |
| @item @samp{[A-t]} @tab ambiguous @tab @samp{[A-Z\[\\\]_`a-t]} @tab @samp{[a-tA-T]} |
| @item @samp{[_-@{]} @tab ambiguous @tab @samp{[_`a-z@{]} @tab @samp{[_`a-zA-Z@{]} |
| @item @samp{[@@-C]} @tab ambiguous @tab @samp{[@@ABC]} @tab @samp{[@@A-Z\[\\\]_`abc]} |
| @end multitable |
| |
| @cindex end of line, in negated character classes |
| @cindex EOL, in negated character classes |
| @item |
| A negated character class such as the example @samp{[^A-Z]} above |
| @emph{will} match a newline unless @samp{\n} (or an equivalent escape |
| sequence) is one of the characters explicitly present in the negated |
| character class (e.g., @samp{[^A-Z\n]}). This is unlike how many other |
| regular expression tools treat negated character classes, but |
| unfortunately the inconsistency is historically entrenched. Matching |
| newlines means that a pattern like @samp{[^"]*} can match the entire |
| input unless there's another quote in the input. |
| |
| Flex allows negation of character class expressions by prepending @samp{^} to |
| the POSIX character class name. |
| |
| @example |
| @verbatim |
| [:^alnum:] [:^alpha:] [:^blank:] |
| [:^cntrl:] [:^digit:] [:^graph:] |
| [:^lower:] [:^print:] [:^punct:] |
| [:^space:] [:^upper:] [:^xdigit:] |
| @end verbatim |
| @end example |
| |
| Flex will issue a warning if the expressions @samp{[:^upper:]} and |
| @samp{[:^lower:]} appear in a case-insensitive scanner, since their meaning is |
| unclear. The current behavior is to skip them entirely, but this may change |
| without notice in future revisions of flex. |
| |
| @item |
| |
| The @samp{@{-@}} operator computes the difference of two character classes. For |
| example, @samp{[a-c]@{-@}[b-z]} represents all the characters in the class |
| @samp{[a-c]} that are not in the class @samp{[b-z]} (which in this case, is |
| just the single character @samp{a}). The @samp{@{-@}} operator is left |
| associative, so @samp{[abc]@{-@}[b]@{-@}[c]} is the same as @samp{[a]}. Be careful |
| not to accidentally create an empty set, which will never match. |
| |
| @item |
| |
| The @samp{@{+@}} operator computes the union of two character classes. For |
| example, @samp{[a-z]@{+@}[0-9]} is the same as @samp{[a-z0-9]}. This operator |
| is useful when preceded by the result of a difference operation, as in, |
| @samp{[[:alpha:]]@{-@}[[:lower:]]@{+@}[q]}, which is equivalent to |
| @samp{[A-Zq]} in the "C" locale. |
| |
| @cindex trailing context, limits of |
| @cindex ^ as non-special character in patterns |
| @cindex $ as normal character in patterns |
| @item |
| A rule can have at most one instance of trailing context (the @samp{/} operator |
| or the @samp{$} operator). The start condition, @samp{^}, and @samp{<<EOF>>} patterns |
| can only occur at the beginning of a pattern and, like @samp{/} and @samp{$}, |
| cannot be grouped inside parentheses. A @samp{^} which does not occur at |
| the beginning of a rule or a @samp{$} which does not occur at the end of |
| a rule loses its special properties and is treated as a normal character. |
| |
| @item |
| The following are invalid: |
| |
| @cindex patterns, invalid trailing context |
| @example |
| @verbatim |
| foo/bar$ |
| <sc1>foo<sc2>bar |
| @end verbatim |
| @end example |
| |
| Note that the first of these can be written @samp{foo/bar\n}. |
| |
| @item |
| The following will result in @samp{$} or @samp{^} being treated as a normal character: |
| |
| @cindex patterns, special characters treated as non-special |
| @example |
| @verbatim |
| foo|(bar$) |
| foo|^bar |
| @end verbatim |
| @end example |
| |
| If the desired meaning is a @samp{foo} or a |
| @samp{bar}-followed-by-a-newline, the following could be used (the |
| special @code{|} action is explained below, @pxref{Actions}): |
| |
| @cindex patterns, end of line |
| @example |
| @verbatim |
| foo | |
| bar$ /* action goes here */ |
| @end verbatim |
| @end example |
| |
| A similar trick will work for matching a @samp{foo} or a |
| @samp{bar}-at-the-beginning-of-a-line. |
| @end itemize |
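| |
| As a hedged illustration of the set operators described above, the |
| following sketch (not taken from the @code{flex} distribution) prints |
| each run of lowercase consonants and discards everything else. The |
| parentheses around the difference expression are there only to make the |
| scope of @samp{+} explicit, and the @code{main()} routine is an assumed |
| minimal driver: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| ([a-z]{-}[aeiou])+ printf( "consonants: %s\n", yytext ); |
| .|\n /* discard everything else */ |
| %% |
| int main( void ) |
| { |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |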
| |
| @node Matching, Actions, Patterns, Top |
| @chapter How the Input Is Matched |
| |
| @cindex patterns, matching |
| @cindex input, matching |
| @cindex trailing context, matching |
| @cindex matching, and trailing context |
| @cindex matching, length of |
| @cindex matching, multiple matches |
| When the generated scanner is run, it analyzes its input looking for |
| strings which match any of its patterns. If it finds more than one |
| match, it takes the one matching the most text (for trailing context |
| rules, this includes the length of the trailing part, even though it |
| will then be returned to the input). If it finds two or more matches of |
| the same length, the rule listed first in the @code{flex} input file is |
| chosen. |
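| |
| Both tie-breaking rules can be seen in a small sketch such as the |
| following (the labels it prints are illustrative only). On the input |
| @samp{if iffy}, the text @samp{if} is matched by both rules with the same |
| length, so the rule listed first wins and the scanner reports a keyword; |
| for @samp{iffy} the second rule matches more text, so it is chosen |
| instead: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| if printf( "keyword: %s\n", yytext ); |
| [a-z]+ printf( "identifier: %s\n", yytext ); |
| [ \t\n]+ /* skip whitespace */ |
| %% |
| int main( void ) |
| { |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |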
| |
| @cindex token |
| @cindex yytext |
| @cindex yyleng |
| Once the match is determined, the text corresponding to the match |
| (called the @dfn{token}) is made available in the global character |
| pointer @code{yytext}, and its length in the global integer |
| @code{yyleng}. The @dfn{action} corresponding to the matched pattern is |
| then executed (@pxref{Actions}), and then the remaining input is scanned |
| for another match. |
| |
| @cindex default rule |
| If no match is found, then the @dfn{default rule} is executed: the next |
| character in the input is considered matched and copied to the standard |
| output. Thus, the simplest valid @code{flex} input is: |
| |
| @cindex minimal scanner |
| @example |
| @verbatim |
| %% |
| @end verbatim |
| @end example |
| |
| which generates a scanner that simply copies its input (one character at |
| a time) to its output. |
| |
| @cindex yytext, two types of |
| @cindex %array, use of |
| @cindex %pointer, use of |
| @vindex yytext |
| Note that @code{yytext} can be defined in two different ways: either as |
| a character @emph{pointer} or as a character @emph{array}. You can |
| control which definition @code{flex} uses by including one of the |
| special directives @code{%pointer} or @code{%array} in the first |
| (definitions) section of your flex input. The default is |
| @code{%pointer}, unless you use the @samp{-l} lex compatibility option, |
| in which case @code{yytext} will be an array. The advantage of using |
| @code{%pointer} is substantially faster scanning and no buffer overflow |
| when matching very large tokens (unless you run out of dynamic memory). |
| The disadvantage is that you are restricted in how your actions can |
| modify @code{yytext} (@pxref{Actions}), and calls to the @code{unput()} |
| function destroy the present contents of @code{yytext}, which can be a |
| considerable porting headache when moving between different @code{lex} |
| versions. |
| |
| @cindex %array, advantages of |
| The advantage of @code{%array} is that you can then modify @code{yytext} |
| to your heart's content, and calls to @code{unput()} do not destroy |
| @code{yytext} (@pxref{Actions}). Furthermore, existing @code{lex} |
| programs sometimes access @code{yytext} externally using declarations of |
| the form: |
| |
| @example |
| @verbatim |
| extern char yytext[]; |
| @end verbatim |
| @end example |
| |
| This definition is erroneous when used with @code{%pointer}, but correct |
| for @code{%array}. |
| |
| The @code{%array} declaration defines @code{yytext} to be an array of |
| @code{YYLMAX} characters, which defaults to a fairly large value. You |
| can change the size by simply @code{#define}-ing @code{YYLMAX} to a different |
| value in the first section of your @code{flex} input. As mentioned |
| above, with @code{%pointer} @code{yytext} grows dynamically to accommodate |
| large tokens. While this means your @code{%pointer} scanner can |
| accommodate very large tokens (such as matching entire blocks of |
| comments), bear in mind that each time the scanner must resize |
| @code{yytext} it also must rescan the entire token from the beginning, |
| so matching such tokens can prove slow. @code{yytext} presently does |
| @emph{not} dynamically grow if a call to @code{unput()} results in too |
| much text being pushed back; instead, a run-time error results. |
| |
| @cindex %array, with C++ |
| Also note that you cannot use @code{%array} with C++ scanner classes |
| (@pxref{Cxx}). |
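| |
| The following is a minimal sketch of a @code{%array} scanner, offered as |
| an illustration rather than a recommended idiom. Because @code{yytext} |
| is then a separate array of @code{YYLMAX} characters, the action can |
| lengthen the token in place, which would not be safe with |
| @code{%pointer}. The bounds check against @code{YYLMAX} is an added |
| precaution, not something the scanner requires: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %array |
| %{ |
| #include <string.h> |
| %} |
| %% |
| [a-z]+ { |
|     /* Safe only with %array: appending to yytext does not |
|      * touch the scanner's input buffer. Room is needed for |
|      * the extra character and the terminating NUL. */ |
|     if ( yyleng + 2 <= YYLMAX ) |
|         { |
|         strcat( yytext, "!" ); |
|         puts( yytext ); |
|         } |
|     } |
| .|\n /* discard everything else */ |
| %% |
| int main( void ) |
| { |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |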
| |
| @node Actions, Generated Scanner, Matching, Top |
| @chapter Actions |
| |
| @cindex actions |
| Each pattern in a rule has a corresponding @dfn{action}, which can be |
| any arbitrary C statement. The pattern ends at the first non-escaped |
| whitespace character; the remainder of the line is its action. If the |
| action is empty, then when the pattern is matched the input token is |
| simply discarded. For example, here is the specification for a program |
| which deletes all occurrences of @samp{zap me} from its input: |
| |
| @cindex deleting lines from input |
| @example |
| @verbatim |
| %% |
| "zap me" |
| @end verbatim |
| @end example |
| |
| This example will copy all other characters in the input to the output |
| since they will be matched by the default rule. |
| |
| Here is a program which compresses multiple blanks and tabs down to a |
| single blank, and throws away whitespace found at the end of a line: |
| |
| @cindex whitespace, compressing |
| @cindex compressing whitespace |
| @example |
| @verbatim |
| %% |
| [ \t]+ putchar( ' ' ); |
| [ \t]+$ /* ignore this token */ |
| @end verbatim |
| @end example |
| |
| @cindex %@{ and %@}, in Rules Section |
| @cindex actions, use of @{ and @} |
| @cindex actions, embedded C strings |
| @cindex C-strings, in actions |
| @cindex comments, in actions |
| If the action contains a @samp{@{}, then the action spans till the |
| balancing @samp{@}} is found, and the action may cross multiple lines. |
| @code{flex} knows about C strings and comments and won't be fooled by |
| braces found within them, but also allows actions to begin with |
| @samp{%@{} and will consider the action to be all the text up to the |
| next @samp{%@}} (regardless of ordinary braces inside the action). |
| |
| @cindex |, in actions |
| An action consisting solely of a vertical bar (@samp{|}) means ``same as the |
| action for the next rule''. See below for an illustration. |
| |
| Actions can include arbitrary C code, including @code{return} statements |
| to return a value to whatever routine called @code{yylex()}. Each time |
| @code{yylex()} is called it continues processing tokens from where it |
| last left off until it either reaches the end of the file or executes a |
| return. |
| |
| @cindex yytext, modification of |
| Actions are free to modify @code{yytext} except for lengthening it |
| (adding characters to its end--these will overwrite later characters in |
| the input stream). This however does not apply when using @code{%array} |
| (@pxref{Matching}). In that case, @code{yytext} may be freely modified |
| in any way. |
| |
| @cindex yyleng, modification of |
| @cindex yymore, and yyleng |
| Actions are free to modify @code{yyleng} except they should not do so if |
| the action also includes use of @code{yymore()} (see below). |
| |
| @cindex preprocessor macros, for use in actions |
| There are a number of special directives which can be included within an |
| action: |
| |
| @table @code |
| @item ECHO |
| @cindex ECHO |
| copies @code{yytext} to the scanner's output. |
| |
| @item BEGIN |
| @cindex BEGIN |
| followed by the name of a start condition places the scanner in the |
| corresponding start condition (see below). |
| |
| @item REJECT |
| @cindex REJECT |
| directs the scanner to proceed on to the ``second best'' rule which |
| matched the input (or a prefix of the input). The rule is chosen as |
| described above in @ref{Matching}, and @code{yytext} and @code{yyleng} |
| set up appropriately. It may either be one which matched as much text |
| as the originally chosen rule but came later in the @code{flex} input |
| file, or one which matched less text. For example, the following will |
| both count the words in the input and call the routine @code{special()} |
| whenever @samp{frob} is seen: |
| |
| @example |
| @verbatim |
| int word_count = 0; |
| %% |
| |
| frob special(); REJECT; |
| [^ \t\n]+ ++word_count; |
| @end verbatim |
| @end example |
| |
| Without the @code{REJECT}, any occurrences of @samp{frob} in the input |
| would not be counted as words, since the scanner normally executes only |
| one action per token. Multiple uses of @code{REJECT} are allowed, each |
| one finding the next best choice to the currently active rule. For |
| example, when the following scanner scans the token @samp{abcd}, it will |
| write @samp{abcdabcaba} to the output: |
| |
| @cindex REJECT, calling multiple times |
| @cindex |, use of |
| @example |
| @verbatim |
| %% |
| a | |
| ab | |
| abc | |
| abcd ECHO; REJECT; |
| .|\n /* eat up any unmatched character */ |
| @end verbatim |
| @end example |
| |
| The first three rules share the fourth's action since they use the |
| special @samp{|} action. |
| |
| @code{REJECT} is a particularly expensive feature in terms of scanner |
| performance; if it is used in @emph{any} of the scanner's actions it |
| will slow down @emph{all} of the scanner's matching. Furthermore, |
| @code{REJECT} cannot be used with the @samp{-Cf} or @samp{-CF} options |
| (@pxref{Scanner Options}). |
| |
| Note also that unlike the other special actions, @code{REJECT} is a |
| @emph{branch}. Code immediately following it in the action will |
| @emph{not} be executed. |
| |
| @item yymore() |
| @cindex yymore() |
| tells the scanner that the next time it matches a rule, the |
| corresponding token should be @emph{appended} onto the current value of |
| @code{yytext} rather than replacing it. For example, given the input |
| @samp{mega-kludge} the following will write @samp{mega-mega-kludge} to |
| the output: |
| |
| @cindex yymore(), mega-kludge |
| @cindex yymore() to append token to previous token |
| @example |
| @verbatim |
| %% |
| mega- ECHO; yymore(); |
| kludge ECHO; |
| @end verbatim |
| @end example |
| |
| First @samp{mega-} is matched and echoed to the output. Then @samp{kludge} |
| is matched, but the previous @samp{mega-} is still hanging around at the |
| beginning of |
| @code{yytext} |
| so the |
| @code{ECHO} |
| for the @samp{kludge} rule will actually write @samp{mega-kludge}. |
| @end table |
| |
| @cindex yymore, performance penalty of |
| Two notes regarding use of @code{yymore()}. First, @code{yymore()} |
| depends on the value of @code{yyleng} correctly reflecting the size of |
| the current token, so you must not modify @code{yyleng} if you are using |
| @code{yymore()}. Second, the presence of @code{yymore()} in the |
| scanner's action entails a minor performance penalty in the scanner's |
| matching speed. |
| |
| @cindex yyless() |
| @code{yyless(n)} returns all but the first @code{n} characters of the |
| current token back to the input stream, where they will be rescanned |
| when the scanner looks for the next match. @code{yytext} and |
| @code{yyleng} are adjusted appropriately (e.g., @code{yyleng} will now |
| be equal to @code{n}). For example, on the input @samp{foobar} the |
| following will write out @samp{foobarbar}: |
| |
| @cindex yyless(), pushing back characters |
| @cindex pushing back characters with yyless |
| @example |
| @verbatim |
| %% |
| foobar ECHO; yyless(3); |
| [a-z]+ ECHO; |
| @end verbatim |
| @end example |
| |
| An argument of 0 to @code{yyless()} will cause the entire current input |
| string to be scanned again. Unless you've changed how the scanner will |
| subsequently process its input (using @code{BEGIN}, for example), this |
| will result in an endless loop. |
| |
| Note that @code{yyless()} is a macro and can only be used in the flex |
| input file, not from other source files. |
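| |
| A further sketch, intended only to show the mechanics: the first rule |
| below deliberately matches one character too many and then uses |
| @code{yyless()} to hand the trailing dot back. On the input @samp{12.} |
| it reports the integer @samp{12} and then a dot. Trailing context |
| (the @samp{/} operator) is usually the more natural way to express this: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| [0-9]+"." { |
|     yyless( yyleng - 1 ); /* give the '.' back */ |
|     printf( "integer: %s\n", yytext ); |
|     } |
| "." printf( "dot\n" ); |
| [0-9]+ printf( "integer: %s\n", yytext ); |
| \n /* skip newlines */ |
| %% |
| int main( void ) |
| { |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |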
| |
| @cindex unput() |
| @cindex pushing back characters with unput |
| @code{unput(c)} puts the character @code{c} back onto the input stream. |
| It will be the next character scanned. The following action will take |
| the current token and cause it to be rescanned enclosed in parentheses. |
| |
| @cindex unput(), pushing back characters |
| @cindex pushing back characters with unput() |
| @example |
| @verbatim |
| { |
| int i; |
| /* Copy yytext because unput() trashes yytext */ |
| char *yycopy = strdup( yytext ); |
| unput( ')' ); |
| for ( i = yyleng - 1; i >= 0; --i ) |
| unput( yycopy[i] ); |
| unput( '(' ); |
| free( yycopy ); |
| } |
| @end verbatim |
| @end example |
| |
| Note that since each @code{unput()} puts the given character back at the |
| @emph{beginning} of the input stream, pushing back strings must be done |
| back-to-front. |
| |
| @cindex %pointer, and unput() |
| @cindex unput(), and %pointer |
| An important potential problem when using @code{unput()} is that if you |
| are using @code{%pointer} (the default), a call to @code{unput()} |
| @emph{destroys} the contents of @code{yytext}, starting with its |
| rightmost character and devouring one character to the left with each |
| call. If you need the value of @code{yytext} preserved after a call to |
| @code{unput()} (as in the above example), you must either first copy it |
| elsewhere, or build your scanner using @code{%array} instead |
| (@pxref{Matching}). |
| |
| @cindex pushing back EOF |
| @cindex EOF, pushing back |
| Finally, note that you cannot put back @samp{EOF} to attempt to mark the |
| input stream with an end-of-file. |
| |
| @cindex input() |
| @code{input()} reads the next character from the input stream. For |
| example, the following is one way to eat up C comments: |
| |
| @cindex comments, discarding |
| @cindex discarding C comments |
| @example |
| @verbatim |
| %% |
| "/*" { |
| register int c; |
| |
| for ( ; ; ) |
| { |
| while ( (c = input()) != '*' && |
| c != EOF ) |
| ; /* eat up text of comment */ |
| |
| if ( c == '*' ) |
| { |
| while ( (c = input()) == '*' ) |
| ; |
| if ( c == '/' ) |
| break; /* found the end */ |
| } |
| |
| if ( c == EOF ) |
| { |
| error( "EOF in comment" ); |
| break; |
| } |
| } |
| } |
| @end verbatim |
| @end example |
| |
| @cindex input(), and C++ |
| @cindex yyinput() |
| (Note that if the scanner is compiled using @code{C++}, then |
| @code{input()} is instead referred to as @b{yyinput()}, in order to |
| avoid a name clash with the @code{C++} stream by the name of |
| @code{input}.) |
| |
| @cindex flushing the internal buffer |
| @cindex YY_FLUSH_BUFFER |
| @code{YY_FLUSH_BUFFER;} flushes the scanner's internal buffer so that |
| the next time the scanner attempts to match a token, it will first |
| refill the buffer using @code{YY_INPUT()} (@pxref{Generated Scanner}). |
| This action is a special case of the more general |
| @code{yy_flush_buffer()} function, described below (@pxref{Multiple |
| Input Buffers}). |
| |
| @cindex yyterminate() |
| @cindex terminating with yyterminate() |
| @cindex exiting with yyterminate() |
| @cindex halting with yyterminate() |
| @code{yyterminate()} can be used in lieu of a return statement in an |
| action. It terminates the scanner and returns a 0 to the scanner's |
| caller, indicating ``all done''. By default, @code{yyterminate()} is |
| also called when an end-of-file is encountered. It is a macro and may |
| be redefined. |
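| |
| As a small sketch, the scanner below copies its input to its output |
| until it sees a line consisting of the marker @samp{__END__}, at which |
| point it stops as though end-of-file had been reached. The marker is a |
| hypothetical choice made for this example: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| ^"__END__"\n yyterminate(); /* yylex() returns 0 here */ |
| .|\n ECHO; |
| %% |
| int main( void ) |
| { |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |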
| |
| @node Generated Scanner, Start Conditions, Actions, Top |
| @chapter The Generated Scanner |
| |
| @cindex yylex(), in generated scanner |
| The output of @code{flex} is the file @file{lex.yy.c}, which contains |
| the scanning routine @code{yylex()}, a number of tables used by it for |
| matching tokens, and a number of auxiliary routines and macros. By |
| default, @code{yylex()} is declared as follows: |
| |
| @example |
| @verbatim |
| int yylex() |
| { |
| ... various definitions and the actions in here ... |
| } |
| @end verbatim |
| @end example |
| |
| @cindex yylex(), overriding |
| (If your environment supports function prototypes, then it will be |
| @code{int yylex( void )}.) This definition may be changed by defining |
| the @code{YY_DECL} macro. For example, you could use: |
| |
| @cindex yylex, overriding the prototype of |
| @example |
| @verbatim |
| #define YY_DECL float lexscan( a, b ) float a, b; |
| @end verbatim |
| @end example |
| |
| to give the scanning routine the name @code{lexscan}, returning a float, |
| and taking two floats as arguments. Note that if you give arguments to |
| the scanning routine using a K&R-style/non-prototyped function |
| declaration, you must terminate the definition with a semicolon (;). |
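| |
| With a prototype-style definition no trailing semicolon is needed. The |
| following sketch declares the same hypothetical @code{lexscan} routine |
| that way; it is placed in the definitions section so that it is seen |
| before the scanner's own default for @code{YY_DECL}: |
| |
| @example |
| @verbatim |
| %{ |
| /* Prototype-style equivalent of the K&R form shown above. */ |
| #define YY_DECL float lexscan( float a, float b ) |
| %} |
| @end verbatim |
| @end example |
| |
| Any code calling the routine would then need a matching declaration, |
| such as @code{float lexscan( float a, float b );}. |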
| |
| @code{flex} generates @samp{C99} function definitions by |
| default. However flex does have the ability to generate obsolete, er, |
| @samp{traditional}, function definitions. This is to support |
| bootstrapping gcc on old systems. Unfortunately, traditional |
| definitions prevent us from using any standard data types smaller than |
| int (such as short, char, or bool) as function arguments. For this |
| reason, future versions of @code{flex} may generate standard C99 code |
| only, leaving K&R-style functions to the historians. Currently, if you |
| do @strong{not} want @samp{C99} definitions, then you must use |
| @code{%option noansi-definitions}. |
| |
| @cindex stdin, default for yyin |
| @cindex yyin |
| Whenever @code{yylex()} is called, it scans tokens from the global input |
| file @file{yyin} (which defaults to stdin). It continues until it |
| either reaches an end-of-file (at which point it returns the value 0) or |
| one of its actions executes a @code{return} statement. |
| |
| @cindex EOF and yyrestart() |
| @cindex end-of-file, and yyrestart() |
| @cindex yyrestart() |
| If the scanner reaches an end-of-file, subsequent calls are undefined |
| unless either @file{yyin} is pointed at a new input file (in which case |
| scanning continues from that file), or @code{yyrestart()} is called. |
| @code{yyrestart()} takes one argument, a @code{FILE *} pointer (which |
| can be NULL, if you've set up @code{YY_INPUT} to scan from a source other |
| than @code{yyin}), and initializes @file{yyin} for scanning from that |
| file. Essentially there is no difference between just assigning |
| @file{yyin} to a new input file or using @code{yyrestart()} to do so; |
| the latter is available for compatibility with previous versions of |
| @code{flex}, and because it can be used to switch input files in the |
| middle of scanning. It can also be used to throw away the current input |
| buffer, by calling it with an argument of @file{yyin}; but it would be |
| better to use @code{YY_FLUSH_BUFFER} (@pxref{Actions}). Note that |
| @code{yyrestart()} does @emph{not} reset the start condition to |
| @code{INITIAL} (@pxref{Start Conditions}). |
| |
| @cindex RETURN, within actions |
| If @code{yylex()} stops scanning due to executing a @code{return} |
| statement in one of the actions, the scanner may then be called again |
| and it will resume scanning where it left off. |
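| |
| The following sketch of a driver ties these points together. It counts |
| the words in each file named on the command line, calling |
| @code{yyrestart()} to switch files and calling @code{yylex()} repeatedly |
| until it returns 0. The token code @code{WORD} and the driver itself are |
| assumptions made for the example: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %{ |
| #define WORD 1 /* hypothetical token code */ |
| %} |
| %% |
| [a-zA-Z]+ return WORD; |
| .|\n /* ignore everything else */ |
| %% |
| int main( int argc, char **argv ) |
| { |
|     int i; |
| |
|     for ( i = 1; i < argc; ++i ) |
|         { |
|         FILE *fp = fopen( argv[i], "r" ); |
|         int words = 0; |
| |
|         if ( ! fp ) |
|             { |
|             perror( argv[i] ); |
|             continue; |
|             } |
| |
|         yyrestart( fp ); /* scan from the new file */ |
|         while ( yylex() != 0 ) /* resumes after each return */ |
|             ++words; |
|         printf( "%s: %d words\n", argv[i], words ); |
|         fclose( fp ); |
|         } |
|     return 0; |
| } |
| @end verbatim |
| @end example |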
| |
| @cindex YY_INPUT |
| By default (and for purposes of efficiency), the scanner uses |
| block-reads rather than simple @code{getc()} calls to read characters |
| from @file{yyin}. The nature of how it gets its input can be controlled |
| by defining the @code{YY_INPUT} macro. The calling sequence for |
| @code{YY_INPUT()} is @code{YY_INPUT(buf,result,max_size)}. Its action |
| is to place up to @code{max_size} characters in the character array |
| @code{buf} and return in the integer variable @code{result} either the |
| number of characters read or the constant @code{YY_NULL} (0 on Unix |
| systems) to indicate @samp{EOF}. The default @code{YY_INPUT} reads from |
| the global file-pointer @file{yyin}. |
| |
| @cindex YY_INPUT, overriding |
| Here is a sample definition of @code{YY_INPUT} (in the definitions |
| section of the input file): |
| |
| @example |
| @verbatim |
| %{ |
| #define YY_INPUT(buf,result,max_size) \ |
| { \ |
| int c = getchar(); \ |
| result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \ |
| } |
| %} |
| @end verbatim |
| @end example |
| |
| This definition will change the input processing to occur one character |
| at a time. |
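| |
| As a further sketch, @code{YY_INPUT} can also hand the scanner text from |
| memory in blocks of up to @code{max_size} characters. The variables |
| @code{my_src} and @code{my_pos} are hypothetical names introduced for |
| this example, and the buffer routines described in |
| @ref{Scanning Strings} may well be the better tool for real use: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %{ |
| #include <string.h> |
| |
| /* Hypothetical in-memory source and current read position. */ |
| static const char *my_src = "hello world\n"; |
| static size_t my_pos = 0; |
| |
| #define YY_INPUT(buf,result,max_size) \ |
|     { \ |
|     size_t my_n = strlen( my_src + my_pos ); \ |
|     if ( my_n > (size_t) (max_size) ) \ |
|         my_n = (size_t) (max_size); \ |
|     memcpy( (buf), my_src + my_pos, my_n ); \ |
|     my_pos += my_n; \ |
|     result = ( my_n == 0 ) ? YY_NULL : my_n; \ |
|     } |
| %} |
| %% |
| %% |
| int main( void ) |
| { |
|     return yylex(); /* default rule copies my_src to the output */ |
| } |
| @end verbatim |
| @end example |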
| |
| @cindex yywrap() |
| When the scanner receives an end-of-file indication from @code{YY_INPUT}, it |
| then checks the @code{yywrap()} function. If @code{yywrap()} returns |
| false (zero), then it is assumed that the function has gone ahead and |
| set up @file{yyin} to point to another input file, and scanning |
| continues. If it returns true (non-zero), then the scanner terminates, |
| returning 0 to its caller. Note that in either case, the start |
| condition remains unchanged; it does @emph{not} revert to |
| @code{INITIAL}. |
| |
| @cindex yywrap, default for |
| @cindex noyywrap, %option |
| @cindex %option noyywrap |
| If you do not supply your own version of @code{yywrap()}, then you must |
| either use @code{%option noyywrap} (in which case the scanner behaves as |
| though @code{yywrap()} returned 1), or you must link with @samp{-lfl} to |
| obtain the default version of the routine, which always returns 1. |
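| |
| A user-supplied @code{yywrap()} might look like the following sketch, |
| which moves on to the next file named on the command line each time the |
| current one is exhausted. The variable @code{next_file} and the driver |
| are assumptions made for the example, and files that cannot be opened |
| are silently skipped: |
| |
| @example |
| @verbatim |
| %{ |
| /* Hypothetical NULL-terminated list of the files still to be |
|  * scanned; main() sets it up before scanning starts. */ |
| static char **next_file; |
| %} |
| %% |
| .|\n ECHO; |
| %% |
| int yywrap( void ) |
| { |
|     if ( yyin && yyin != stdin ) |
|         fclose( yyin ); |
| |
|     while ( next_file && *next_file ) |
|         { |
|         yyin = fopen( *next_file++, "r" ); |
|         if ( yyin ) |
|             return 0; /* more input: keep scanning */ |
|         } |
|     return 1; /* no more files: terminate */ |
| } |
| |
| int main( int argc, char **argv ) |
| { |
|     (void) argc; |
|     next_file = argv + 1; |
|     if ( yywrap() ) /* open the first file, if any */ |
|         return 0; |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |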
| |
| For scanning from in-memory buffers (e.g., scanning strings), see |
| @ref{Scanning Strings}. @xref{Multiple Input Buffers}. |
| |
| @cindex ECHO, and yyout |
| @cindex yyout |
| @cindex stdout, as default for yyout |
| The scanner writes its @code{ECHO} output to the @file{yyout} global |
| (default, @file{stdout}), which may be redefined by the user simply by |
| assigning it to some other @code{FILE} pointer. |
| |
| @node Start Conditions, Multiple Input Buffers, Generated Scanner, Top |
| @chapter Start Conditions |
| |
| @cindex start conditions |
| @code{flex} provides a mechanism for conditionally activating rules. |
| Any rule whose pattern is prefixed with @samp{<sc>} will only be active |
| when the scanner is in the @dfn{start condition} named @code{sc}. For |
| example, |
| |
| @example |
| @verbatim |
| <STRING>[^"]* { /* eat up the string body ... */ |
| ... |
| } |
| @end verbatim |
| @end example |
| |
| will be active only when the scanner is in the @code{STRING} start |
| condition, and |
| |
| @cindex start conditions, multiple |
| @example |
| @verbatim |
| <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */ |
| ... |
| } |
| @end verbatim |
| @end example |
| |
| will be active only when the current start condition is either |
| @code{INITIAL}, @code{STRING}, or @code{QUOTE}. |
| |
| @cindex start conditions, inclusive vs.@: exclusive |
| Start conditions are declared in the definitions (first) section of the |
| input using unindented lines beginning with either @samp{%s} or |
| @samp{%x} followed by a list of names. The former declares |
| @dfn{inclusive} start conditions, the latter @dfn{exclusive} start |
| conditions. A start condition is activated using the @code{BEGIN} |
| action. Until the next @code{BEGIN} action is executed, rules with the |
| given start condition will be active and rules with other start |
| conditions will be inactive. If the start condition is inclusive, then |
| rules with no start conditions at all will also be active. If it is |
| exclusive, then @emph{only} rules qualified with the start condition |
| will be active. A set of rules contingent on the same exclusive start |
| condition describe a scanner which is independent of any of the other |
| rules in the @code{flex} input. Because of this, exclusive start |
| conditions make it easy to specify ``mini-scanners'' which scan portions |
| of the input that are syntactically different from the rest (e.g., |
| comments). |
| |
| If the distinction between inclusive and exclusive start conditions |
| is still a little vague, here's a simple example illustrating the |
| connection between the two. The set of rules: |
| |
| @cindex start conditions, inclusive |
| @example |
| @verbatim |
| %s example |
| %% |
| |
| <example>foo do_something(); |
| |
| bar something_else(); |
| @end verbatim |
| @end example |
| |
| is equivalent to |
| |
| @cindex start conditions, exclusive |
| @example |
| @verbatim |
| %x example |
| %% |
| |
| <example>foo do_something(); |
| |
| <INITIAL,example>bar something_else(); |
| @end verbatim |
| @end example |
| |
| Without the @code{<INITIAL,example>} qualifier, the @code{bar} pattern in |
| the second example wouldn't be active (i.e., couldn't match) when in |
| start condition @code{example}. If we just used @code{<example>} to |
| qualify @code{bar}, though, then it would only be active in |
| @code{example} and not in @code{INITIAL}, while in the first example |
| it's active in both, because in the first example the @code{example} |
| start condition is an inclusive (@code{%s}) start condition. |
| |
| @cindex start conditions, special wildcard condition |
| Also note that the special start-condition specifier |
| @code{<*>} |
| matches every start condition. Thus, the above example could also |
| have been written: |
| |
| @cindex start conditions, use of wildcard condition (<*>) |
| @example |
| @verbatim |
| %x example |
| %% |
| |
| <example>foo do_something(); |
| |
| <*>bar something_else(); |
| @end verbatim |
| @end example |
| |
| The default rule (to @code{ECHO} any unmatched character) remains active |
| in start conditions. It is equivalent to: |
| |
| @cindex start conditions, behavior of default rule |
| @example |
| @verbatim |
| <*>.|\n ECHO; |
| @end verbatim |
| @end example |
| |
| @cindex BEGIN, explanation |
| @findex BEGIN |
| @vindex INITIAL |
| @code{BEGIN(0)} returns to the original state where only the rules with |
| no start conditions are active. This state can also be referred to as |
| the start-condition @code{INITIAL}, so @code{BEGIN(INITIAL)} is |
| equivalent to @code{BEGIN(0)}. (The parentheses around the start |
| condition name are not required but are considered good style.) |
| |
| @code{BEGIN} actions can also be given as indented code at the beginning |
| of the rules section. For example, the following will cause the scanner |
| to enter the @code{SPECIAL} start condition whenever @code{yylex()} is |
| called and the global variable @code{enter_special} is true: |
| |
| @cindex start conditions, using BEGIN |
| @example |
| @verbatim |
| int enter_special; |
| |
| %x SPECIAL |
| %% |
| if ( enter_special ) |
| BEGIN(SPECIAL); |
| |
| <SPECIAL>blahblahblah |
| ...more rules follow... |
| @end verbatim |
| @end example |
| |
| To illustrate the uses of start conditions, here is a scanner which |
| provides two different interpretations of a string like @samp{123.456}. |
| By default it will treat it as three tokens, the integer @samp{123}, a |
| dot (@samp{.}), and the integer @samp{456}. But if the string is |
| preceded earlier in the line by the string @samp{expect-floats} it will |
| treat it as a single token, the floating-point number @samp{123.456}: |
| |
| @cindex start conditions, for different interpretations of same input |
| @example |
| @verbatim |
| %{ |
| #include <math.h> |
| #include <stdlib.h> /* atof(), atoi() */ |
| %} |
| %s expect |
| |
| %% |
| expect-floats BEGIN(expect); |
| |
| <expect>[0-9]+"."[0-9]+ { |
| printf( "found a float, = %f\n", |
| atof( yytext ) ); |
| } |
| <expect>\n { |
| /* that's the end of the line, so |
| * we need another "expect-floats" |
| * before we'll recognize any more |
| * numbers |
| */ |
| BEGIN(INITIAL); |
| } |
| |
| [0-9]+ { |
| printf( "found an integer, = %d\n", |
| atoi( yytext ) ); |
| } |
| |
| "." printf( "found a dot\n" ); |
| @end verbatim |
| @end example |
| |
| @cindex comments, example of scanning C comments |
| Here is a scanner which recognizes (and discards) C comments while |
| maintaining a count of the current input line. |
| |
| @cindex recognizing C comments |
| @example |
| @verbatim |
| %x comment |
| %% |
| int line_num = 1; |
| |
| "/*" BEGIN(comment); |
| |
| <comment>[^*\n]* /* eat anything that's not a '*' */ |
| <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ |
| <comment>\n ++line_num; |
| <comment>"*"+"/" BEGIN(INITIAL); |
| @end verbatim |
| @end example |
| |
| This scanner goes to a bit of trouble to match as much |
| text as possible with each rule. In general, when attempting to write |
| a high-speed scanner, try to match as much as possible in each rule, as |
| it's a big win. |
| |
| Note that start-conditions names are really integer values and |
| can be stored as such. Thus, the above could be extended in the |
| following fashion: |
| |
| @cindex start conditions, integer values |
| @cindex using integer values of start condition names |
| @example |
| @verbatim |
| %x comment foo |
| %% |
        int line_num = 1;
        int comment_caller;
| |
| "/*" { |
| comment_caller = INITIAL; |
| BEGIN(comment); |
| } |
| |
| ... |
| |
| <foo>"/*" { |
| comment_caller = foo; |
| BEGIN(comment); |
| } |
| |
| <comment>[^*\n]* /* eat anything that's not a '*' */ |
| <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */ |
| <comment>\n ++line_num; |
| <comment>"*"+"/" BEGIN(comment_caller); |
| @end verbatim |
| @end example |
| |
| @cindex YY_START, example |
| Furthermore, you can access the current start condition using the |
| integer-valued @code{YY_START} macro. For example, the above |
| assignments to @code{comment_caller} could instead be written |
| |
| @cindex getting current start state with YY_START |
| @example |
| @verbatim |
| comment_caller = YY_START; |
| @end verbatim |
| @end example |
| |
| @vindex YY_START |
| Flex provides @code{YYSTATE} as an alias for @code{YY_START} (since that |
| is what's used by AT&T @code{lex}). |
| |
| For historical reasons, start conditions do not have their own |
| name-space within the generated scanner. The start condition names are |
| unmodified in the generated scanner and generated header. |
| @xref{option-header}. @xref{option-prefix}. |
| |
| |
| |
| Finally, here's an example of how to match C-style quoted strings using |
| exclusive start conditions, including expanded escape sequences (but |
| not including checking for a string that's too long): |
| |
| @cindex matching C-style double-quoted strings |
| @example |
| @verbatim |
| %x str |
| |
| %% |
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;
| |
| |
| \" string_buf_ptr = string_buf; BEGIN(str); |
| |
| <str>\" { /* saw closing quote - all done */ |
| BEGIN(INITIAL); |
| *string_buf_ptr = '\0'; |
| /* return string constant token type and |
| * value to parser |
| */ |
| } |
| |
| <str>\n { |
| /* error - unterminated string constant */ |
| /* generate error message */ |
| } |
| |
| <str>\\[0-7]{1,3} { |
        /* octal escape sequence */
        unsigned int result;

        (void) sscanf( yytext + 1, "%o", &result );

        if ( result > 0xff )
                {
                /* error, constant is out-of-bounds */
                }
        else
                *string_buf_ptr++ = (char) result;
| } |
| |
| <str>\\[0-9]+ { |
| /* generate error - bad escape sequence; something |
| * like '\48' or '\0777777' |
| */ |
| } |
| |
| <str>\\n *string_buf_ptr++ = '\n'; |
| <str>\\t *string_buf_ptr++ = '\t'; |
| <str>\\r *string_buf_ptr++ = '\r'; |
| <str>\\b *string_buf_ptr++ = '\b'; |
| <str>\\f *string_buf_ptr++ = '\f'; |
| |
| <str>\\(.|\n) *string_buf_ptr++ = yytext[1]; |
| |
| <str>[^\\\n\"]+ { |
| char *yptr = yytext; |
| |
| while ( *yptr ) |
| *string_buf_ptr++ = *yptr++; |
| } |
| @end verbatim |
| @end example |
| |
| @cindex start condition, applying to multiple patterns |
| Often, such as in some of the examples above, you wind up writing a |
| whole bunch of rules all preceded by the same start condition(s). Flex |
| makes this a little easier and cleaner by introducing a notion of start |
| condition @dfn{scope}. A start condition scope is begun with: |
| |
| @example |
| @verbatim |
| <SCs>{ |
| @end verbatim |
| @end example |
| |
where @code{SCs} is a list of one or more start conditions.  Inside the
start condition scope, every rule automatically has the prefix
@code{<SCs>} applied to it, until a @samp{@}} which matches the initial
@samp{@{}.  So, for example,
| |
| @cindex extended scope of start conditions |
| @example |
| @verbatim |
| <ESC>{ |
| "\\n" return '\n'; |
| "\\r" return '\r'; |
| "\\f" return '\f'; |
| "\\0" return '\0'; |
| } |
| @end verbatim |
| @end example |
| |
| is equivalent to: |
| |
| @example |
| @verbatim |
| <ESC>"\\n" return '\n'; |
| <ESC>"\\r" return '\r'; |
| <ESC>"\\f" return '\f'; |
| <ESC>"\\0" return '\0'; |
| @end verbatim |
| @end example |
| |
| Start condition scopes may be nested. |
| |
| @cindex stacks, routines for manipulating |
| @cindex start conditions, use of a stack |
| |
| The following routines are available for manipulating stacks of start conditions: |
| |
| @deftypefun void yy_push_state ( int @code{new_state} ) |
| pushes the current start condition onto the top of the start condition |
| stack and switches to |
| @code{new_state} |
| as though you had used |
| @code{BEGIN new_state} |
| (recall that start condition names are also integers). |
| @end deftypefun |
| |
| @deftypefun void yy_pop_state () |
| pops the top of the stack and switches to it via |
| @code{BEGIN}. |
| @end deftypefun |
| |
| @deftypefun int yy_top_state () |
| returns the top of the stack without altering the stack's contents. |
| @end deftypefun |
| |
| @cindex memory, for start condition stacks |
| The start condition stack grows dynamically and so has no built-in size |
| limitation. If memory is exhausted, program execution aborts. |
| |
| To use start condition stacks, your scanner must include a @code{%option |
| stack} directive (@pxref{Scanner Options}). |
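
As a rough sketch of how these routines fit together, the comment
scanner shown earlier could use the stack instead of a
@code{comment_caller} variable.  The @samp{enter-foo} rule is a
hypothetical trigger, added here only so that the @code{foo} condition
can be reached, and @code{%option noyy_top_state} is assumed merely to
leave out a routine this sketch does not call:

@example
@verbatim
%option stack noyywrap noyy_top_state
%x comment foo

%%
        int line_num = 1;

"enter-foo"             BEGIN(foo);  /* hypothetical trigger */

<INITIAL,foo>"/*"       yy_push_state(comment);

<comment>{
[^*\n]*         /* eat anything that's not a '*' */
"*"+[^*/\n]*    /* eat up '*'s not followed by '/'s */
\n              ++line_num;
"*"+"/"         yy_pop_state();  /* back to whoever pushed */
}

%%
int main( void )
    {
    return yylex();
    }
@end verbatim
@end example

Because @code{yy_pop_state()} restores whichever condition was active
when the comment began, no per-caller bookkeeping is needed.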
| |
| @node Multiple Input Buffers, EOF, Start Conditions, Top |
| @chapter Multiple Input Buffers |
| |
| @cindex multiple input streams |
| Some scanners (such as those which support ``include'' files) require |
| reading from several input streams. As @code{flex} scanners do a large |
| amount of buffering, one cannot control where the next input will be |
| read from by simply writing a @code{YY_INPUT()} which is sensitive to |
| the scanning context. @code{YY_INPUT()} is only called when the scanner |
| reaches the end of its buffer, which may be a long time after scanning a |
| statement such as an @code{include} statement which requires switching |
| the input source. |
| |
| To negotiate these sorts of problems, @code{flex} provides a mechanism |
| for creating and switching between multiple input buffers. An input |
| buffer is created by using: |
| |
| @cindex memory, allocating input buffers |
| @deftypefun YY_BUFFER_STATE yy_create_buffer ( FILE *file, int size ) |
| @end deftypefun |
| |
| which takes a @code{FILE} pointer and a size and creates a buffer |
| associated with the given file and large enough to hold @code{size} |
| characters (when in doubt, use @code{YY_BUF_SIZE} for the size). It |
| returns a @code{YY_BUFFER_STATE} handle, which may then be passed to |
| other routines (see below). |
| @tindex YY_BUFFER_STATE |
| The @code{YY_BUFFER_STATE} type is a |
| pointer to an opaque @code{struct yy_buffer_state} structure, so you may |
| safely initialize @code{YY_BUFFER_STATE} variables to @code{((YY_BUFFER_STATE) |
| 0)} if you wish, and also refer to the opaque structure in order to |
| correctly declare input buffers in source files other than that of your |
| scanner. Note that the @code{FILE} pointer in the call to |
| @code{yy_create_buffer} is only used as the value of @file{yyin} seen by |
| @code{YY_INPUT}. If you redefine @code{YY_INPUT()} so it no longer uses |
| @file{yyin}, then you can safely pass a NULL @code{FILE} pointer to |
| @code{yy_create_buffer}. You select a particular buffer to scan from |
| using: |
| |
| @deftypefun void yy_switch_to_buffer ( YY_BUFFER_STATE new_buffer ) |
| @end deftypefun |
| |
| The above function switches the scanner's input buffer so subsequent tokens |
| will come from @code{new_buffer}. Note that @code{yy_switch_to_buffer()} may |
| be used by @code{yywrap()} to set things up for continued scanning, instead of |
| opening a new file and pointing @file{yyin} at it. If you are looking for a |
| stack of input buffers, then you want to use @code{yypush_buffer_state()} |
| instead of this function. Note also that switching input sources via either |
| @code{yy_switch_to_buffer()} or @code{yywrap()} does @emph{not} change the |
| start condition. |
| |
| @cindex memory, deleting input buffers |
| @deftypefun void yy_delete_buffer ( YY_BUFFER_STATE buffer ) |
| @end deftypefun |
| |
is used to reclaim the storage associated with a buffer.  (@code{buffer}
can be NULL, in which case the routine does nothing.)  You can also clear
the current contents of a buffer using @code{yy_flush_buffer()},
described below.
| |
| @cindex pushing an input buffer |
| @cindex stack, input buffer push |
| @deftypefun void yypush_buffer_state ( YY_BUFFER_STATE buffer ) |
| @end deftypefun |
| |
| This function pushes the new buffer state onto an internal stack. The pushed |
| state becomes the new current state. The stack is maintained by flex and will |
| grow as required. This function is intended to be used instead of |
| @code{yy_switch_to_buffer}, when you want to change states, but preserve the |
| current state for later use. |
| |
| @cindex popping an input buffer |
| @cindex stack, input buffer pop |
| @deftypefun void yypop_buffer_state ( ) |
| @end deftypefun |
| |
| This function removes the current state from the top of the stack, and deletes |
| it by calling @code{yy_delete_buffer}. The next state on the stack, if any, |
| becomes the new current state. |
| |
| @cindex clearing an input buffer |
| @cindex flushing an input buffer |
| @deftypefun void yy_flush_buffer ( YY_BUFFER_STATE buffer ) |
| @end deftypefun |
| |
| This function discards the buffer's contents, |
| so the next time the scanner attempts to match a token from the |
| buffer, it will first fill the buffer anew using |
| @code{YY_INPUT()}. |
| |
| @deftypefun YY_BUFFER_STATE yy_new_buffer ( FILE *file, int size ) |
| @end deftypefun |
| |
| is an alias for @code{yy_create_buffer()}, |
| provided for compatibility with the C++ use of @code{new} and |
| @code{delete} for creating and destroying dynamic objects. |
| |
@cindex YY_CURRENT_BUFFER, and multiple buffers
Finally, the @code{YY_CURRENT_BUFFER} macro returns a
@code{YY_BUFFER_STATE} handle to the current buffer.  It should not be
used as an lvalue.
| |
| @cindex EOF, example using multiple input buffers |
| Here are two examples of using these features for writing a scanner |
| which expands include files (the |
| @code{<<EOF>>} |
| feature is discussed below). |
| |
This first example uses @code{yypush_buffer_state()} and
@code{yypop_buffer_state()}.  @code{flex} maintains the stack internally.
| |
| @cindex handling include files with multiple input buffers |
| @example |
| @verbatim |
| /* the "incl" state is used for picking up the name |
| * of an include file |
| */ |
| %x incl |
| %% |
| include BEGIN(incl); |
| |
| [a-z]+ ECHO; |
| [^a-z\n]*\n? ECHO; |
| |
| <incl>[ \t]* /* eat the whitespace */ |
| <incl>[^ \t\n]+ { /* got the include file name */ |
| yyin = fopen( yytext, "r" ); |
| |
| if ( ! yyin ) |
| error( ... ); |
| |
| yypush_buffer_state(yy_create_buffer( yyin, YY_BUF_SIZE )); |
| |
| BEGIN(INITIAL); |
| } |
| |
| <<EOF>> { |
| yypop_buffer_state(); |
| |
| if ( !YY_CURRENT_BUFFER ) |
| { |
| yyterminate(); |
| } |
| } |
| @end verbatim |
| @end example |
| |
| The second example, below, does the same thing as the previous example did, but |
| manages its own input buffer stack manually (instead of letting flex do it). |
| |
| @cindex handling include files with multiple input buffers |
| @example |
| @verbatim |
| /* the "incl" state is used for picking up the name |
| * of an include file |
| */ |
| %x incl |
| |
| %{ |
| #define MAX_INCLUDE_DEPTH 10 |
| YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH]; |
| int include_stack_ptr = 0; |
| %} |
| |
| %% |
| include BEGIN(incl); |
| |
| [a-z]+ ECHO; |
| [^a-z\n]*\n? ECHO; |
| |
| <incl>[ \t]* /* eat the whitespace */ |
| <incl>[^ \t\n]+ { /* got the include file name */ |
| if ( include_stack_ptr >= MAX_INCLUDE_DEPTH ) |
| { |
| fprintf( stderr, "Includes nested too deeply" ); |
| exit( 1 ); |
| } |
| |
| include_stack[include_stack_ptr++] = |
| YY_CURRENT_BUFFER; |
| |
| yyin = fopen( yytext, "r" ); |
| |
| if ( ! yyin ) |
| error( ... ); |
| |
| yy_switch_to_buffer( |
| yy_create_buffer( yyin, YY_BUF_SIZE ) ); |
| |
| BEGIN(INITIAL); |
| } |
| |
| <<EOF>> { |
        if ( --include_stack_ptr < 0 )
| { |
| yyterminate(); |
| } |
| |
| else |
| { |
| yy_delete_buffer( YY_CURRENT_BUFFER ); |
| yy_switch_to_buffer( |
| include_stack[include_stack_ptr] ); |
| } |
| } |
| @end verbatim |
| @end example |
| |
| @anchor{Scanning Strings} |
| @cindex strings, scanning strings instead of files |
| The following routines are available for setting up input buffers for |
| scanning in-memory strings instead of files. All of them create a new |
| input buffer for scanning the string, and return a corresponding |
| @code{YY_BUFFER_STATE} handle (which you should delete with |
| @code{yy_delete_buffer()} when done with it). They also switch to the |
| new buffer using @code{yy_switch_to_buffer()}, so the next call to |
| @code{yylex()} will start scanning the string. |
| |
| @deftypefun YY_BUFFER_STATE yy_scan_string ( const char *str ) |
| scans a NUL-terminated string. |
| @end deftypefun |
| |
| @deftypefun YY_BUFFER_STATE yy_scan_bytes ( const char *bytes, int len ) |
| scans @code{len} bytes (including possibly @code{NUL}s) starting at location |
| @code{bytes}. |
| @end deftypefun |
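
As a minimal sketch of @code{yy_scan_string()} in use, the following
self-contained scanner reads its input from a string literal rather than
from @file{yyin}.  The three rules and the input text are illustrative
assumptions, not part of @code{flex} itself:

@example
@verbatim
%option noyywrap

%%
[0-9]+      printf( "number: %s\n", yytext );
[ \t\n]+    /* skip whitespace */
.           printf( "other: %s\n", yytext );

%%
int main( void )
    {
    /* creates a buffer holding a copy of the string
     * and switches to it
     */
    YY_BUFFER_STATE buf = yy_scan_string( "12 + 34" );

    yylex();                  /* scans until end of the string */
    yy_delete_buffer( buf );  /* release the copy */
    return 0;
    }
@end verbatim
@end example

With these rules the scanner reports the number @samp{12}, the other
character @samp{+}, and the number @samp{34}, in that order.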
| |
| Note that both of these functions create and scan a @emph{copy} of the |
| string or bytes. (This may be desirable, since @code{yylex()} modifies |
| the contents of the buffer it is scanning.) You can avoid the copy by |
| using: |
| |
| @vindex YY_END_OF_BUFFER_CHAR |
| @deftypefun YY_BUFFER_STATE yy_scan_buffer (char *base, yy_size_t size) |
which scans in place the buffer starting at @code{base}, consisting of
@code{size} bytes, the last two bytes of which @emph{must} be
@code{YY_END_OF_BUFFER_CHAR} (ASCII NUL).  These last two bytes are not
scanned; thus, scanning consists of @code{base[0]} through
@code{base[size-3]}, inclusive.
| @end deftypefun |
| |
| If you fail to set up @code{base} in this manner (i.e., forget the final |
| two @code{YY_END_OF_BUFFER_CHAR} bytes), then @code{yy_scan_buffer()} |
| returns a NULL pointer instead of creating a new input buffer. |
| |
| @deftp {Data type} yy_size_t |
| is an integral type to which you can cast an integer expression |
| reflecting the size of the buffer. |
| @end deftp |
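
The buffer layout that @code{yy_scan_buffer()} expects can be sketched
as follows.  The rules and the buffer contents are assumptions chosen
for illustration.  The explicit @samp{\0} in the literal plus the
terminator the C compiler appends supply the two required trailing
@code{YY_END_OF_BUFFER_CHAR} bytes:

@example
@verbatim
%option noyywrap

%%
[a-z]+      printf( "word: %s\n", yytext );
.|\n        /* ignore everything else */

%%
int main( void )
    {
    /* 11 characters of text followed by two NUL bytes */
    char buf[] = "hello world\0";
    YY_BUFFER_STATE b = yy_scan_buffer( buf, sizeof buf );

    if ( ! b )
        return 1;   /* buffer was not set up correctly */

    yylex();        /* scans buf in place, without copying */
    yy_delete_buffer( b );
    return 0;
    }
@end verbatim
@end example

Since the scanner works directly on @code{buf}, the array must be
writable and must outlive the scan.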
| |
| @node EOF, Misc Macros, Multiple Input Buffers, Top |
| @chapter End-of-File Rules |
| |
| @cindex EOF, explanation |
| The special rule @code{<<EOF>>} indicates |
| actions which are to be taken when an end-of-file is |
| encountered and @code{yywrap()} returns non-zero (i.e., indicates |
| no further files to process). The action must finish |
| by doing one of the following things: |
| |
| @itemize |
| @item |
| @findex YY_NEW_FILE (now obsolete) |
| assigning @file{yyin} to a new input file (in previous versions of |
| @code{flex}, after doing the assignment you had to call the special |
| action @code{YY_NEW_FILE}. This is no longer necessary.) |
| |
| @item |
| executing a @code{return} statement; |
| |
| @item |
| executing the special @code{yyterminate()} action. |
| |
| @item |
| or, switching to a new buffer using @code{yy_switch_to_buffer()} as |
| shown in the example above. |
| @end itemize |
| |
| <<EOF>> rules may not be used with other patterns; they may only be |
| qualified with a list of start conditions. If an unqualified <<EOF>> |
| rule is given, it applies to @emph{all} start conditions which do not |
| already have <<EOF>> actions. To specify an <<EOF>> rule for only the |
| initial start condition, use: |
| |
| @example |
| @verbatim |
| <INITIAL><<EOF>> |
| @end verbatim |
| @end example |
| |
| These rules are useful for catching things like unclosed comments. An |
| example: |
| |
| @cindex <<EOF>>, use of |
| @example |
| @verbatim |
| %x quote |
| %% |
| |
| ...other rules for dealing with quotes... |
| |
| <quote><<EOF>> { |
| error( "unterminated quote" ); |
| yyterminate(); |
| } |
| <<EOF>> { |
| if ( *++filelist ) |
| yyin = fopen( *filelist, "r" ); |
| else |
| yyterminate(); |
| } |
| @end verbatim |
| @end example |
| |
| @node Misc Macros, User Values, EOF, Top |
| @chapter Miscellaneous Macros |
| |
| @hkindex YY_USER_ACTION |
| The macro @code{YY_USER_ACTION} can be defined to provide an action |
| which is always executed prior to the matched rule's action. For |
| example, it could be #define'd to call a routine to convert yytext to |
| lower-case. When @code{YY_USER_ACTION} is invoked, the variable |
| @code{yy_act} gives the number of the matched rule (rules are numbered |
| starting with 1). Suppose you want to profile how often each of your |
| rules is matched. The following would do the trick: |
| |
| @cindex YY_USER_ACTION to track each time a rule is matched |
| @example |
| @verbatim |
    #define YY_USER_ACTION ++ctr[yy_act];
| @end verbatim |
| @end example |
| |
| @vindex YY_NUM_RULES |
where @code{ctr} is an array to hold the counts for the different rules.
Note that the macro @code{YY_NUM_RULES} gives the total number of rules
(including the default rule), even if you use @samp{-s}.  Because rules
are numbered starting with 1, @code{yy_act} can be as large as
@code{YY_NUM_RULES}, so a safe declaration for @code{ctr} is:

@example
@verbatim
    int ctr[YY_NUM_RULES + 1];
@end verbatim
@end example
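
Putting these pieces together, a complete profiling scanner might look
like the sketch below.  The three rules are placeholders, the array is
sized one larger than @code{YY_NUM_RULES} so that every rule number is a
valid index, and the sketch assumes that first-section code is emitted
after the definition of @code{YY_NUM_RULES} in the generated scanner:

@example
@verbatim
%option noyywrap

%{
/* count every match, indexed by rule number */
#define YY_USER_ACTION ++ctr[yy_act];
int ctr[YY_NUM_RULES + 1];
%}

%%
[0-9]+      /* rule 1: numbers */
[a-z]+      /* rule 2: words */
.|\n        /* rule 3: everything else */

%%
int main( void )
    {
    int i;

    yylex();

    for ( i = 1; i <= YY_NUM_RULES; ++i )
        printf( "rule %d matched %d times\n", i, ctr[i] );

    return 0;
    }
@end verbatim
@end example

The last line printed corresponds to the default rule, which never
fires here because the third rule already matches any single character.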
| |
| @hkindex YY_USER_INIT |
| The macro @code{YY_USER_INIT} may be defined to provide an action which |
| is always executed before the first scan (and before the scanner's |
| internal initializations are done). For example, it could be used to |
| call a routine to read in a data table or open a logging file. |
| |
| @findex yy_set_interactive |
| The macro @code{yy_set_interactive(is_interactive)} can be used to |
| control whether the current buffer is considered @dfn{interactive}. An |
| interactive buffer is processed more slowly, but must be used when the |
| scanner's input source is indeed interactive to avoid problems due to |
| waiting to fill buffers (see the discussion of the @samp{-I} flag in |
| @ref{Scanner Options}). A non-zero value in the macro invocation marks |
| the buffer as interactive, a zero value as non-interactive. Note that |
| use of this macro overrides @code{%option always-interactive} or |
| @code{%option never-interactive} (@pxref{Scanner Options}). |
| @code{yy_set_interactive()} must be invoked prior to beginning to scan |
| the buffer that is (or is not) to be considered interactive. |
| |
| @cindex BOL, setting it |
| @findex yy_set_bol |
| The macro @code{yy_set_bol(at_bol)} can be used to control whether the |
| current buffer's scanning context for the next token match is done as |
| though at the beginning of a line. A non-zero macro argument makes |
| rules anchored with @samp{^} active, while a zero argument makes |
| @samp{^} rules inactive. |
| |
| @cindex BOL, checking the BOL flag |
| @findex YY_AT_BOL |
| The macro @code{YY_AT_BOL()} returns true if the next token scanned from |
| the current buffer will have @samp{^} rules active, false otherwise. |
| |
| @cindex actions, redefining YY_BREAK |
| @hkindex YY_BREAK |
| In the generated scanner, the actions are all gathered in one large |
| switch statement and separated using @code{YY_BREAK}, which may be |
| redefined. By default, it is simply a @code{break}, to separate each |
rule's action from the following rule's.  Redefining @code{YY_BREAK}
allows, for example, C++ users to @code{#define YY_BREAK} to do nothing
(while being very careful that every rule ends with a @code{break} or a
@code{return}!).  This avoids unreachable-statement warnings in cases
where a rule's action ends with @code{return}, which makes the
@code{YY_BREAK} that follows it unreachable.
| |
| @node User Values, Yacc, Misc Macros, Top |
| @chapter Values Available To the User |
| |
| This chapter summarizes the various values available to the user in the |
| rule actions. |
| |
| @table @code |
| @vindex yytext |
| @item char *yytext |
| holds the text of the current token. It may be modified but not |
| lengthened (you cannot append characters to the end). |
| |
| @cindex yytext, default array size |
| @cindex array, default size for yytext |
| @vindex YYLMAX |
| If the special directive @code{%array} appears in the first section of |
| the scanner description, then @code{yytext} is instead declared |
| @code{char yytext[YYLMAX]}, where @code{YYLMAX} is a macro definition |
| that you can redefine in the first section if you don't like the default |
| value (generally 8KB). Using @code{%array} results in somewhat slower |
| scanners, but the value of @code{yytext} becomes immune to calls to |
| @code{unput()}, which potentially destroy its value when @code{yytext} is |
| a character pointer. The opposite of @code{%array} is @code{%pointer}, |
| which is the default. |
| |
| @cindex C++ and %array |
| You cannot use @code{%array} when generating C++ scanner classes (the |
| @samp{-+} flag). |
| |
| @vindex yyleng |
| @item int yyleng |
| holds the length of the current token. |
| |
| @vindex yyin |
| @item FILE *yyin |
| is the file which by default @code{flex} reads from. It may be |
| redefined but doing so only makes sense before scanning begins or after |
| an EOF has been encountered. Changing it in the midst of scanning will |
| have unexpected results since @code{flex} buffers its input; use |
@code{yyrestart()} instead.  Once scanning terminates because an
end-of-file has been seen, you can point @file{yyin} at the new input
file and then call the scanner again to continue scanning.
| |
| @findex yyrestart |
| @item void yyrestart( FILE *new_file ) |
| may be called to point @file{yyin} at the new input file. The |
| switch-over to the new file is immediate (any previously buffered-up |
| input is lost). Note that calling @code{yyrestart()} with @file{yyin} |
| as an argument thus throws away the current input buffer and continues |
| scanning the same input file. |
| |
| @vindex yyout |
| @item FILE *yyout |
| is the file to which @code{ECHO} actions are done. It can be reassigned |
| by the user. |
| |
| @vindex YY_CURRENT_BUFFER |
| @item YY_CURRENT_BUFFER |
| returns a @code{YY_BUFFER_STATE} handle to the current buffer. |
| |
| @vindex YY_START |
| @item YY_START |
| returns an integer value corresponding to the current start condition. |
| You can subsequently use this value with @code{BEGIN} to return to that |
| start condition. |
| @end table |
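
To illustrate how several of these values work together, here is a
small sketch that counts the lines in each file named on the command
line, using @code{yyrestart()} to move from one file to the next.  The
@code{lines} counter and the two rules are assumptions made for the
example:

@example
@verbatim
%option noyywrap

%{
int lines = 0;
%}

%%
\n      ++lines;
.       /* ignore everything else */

%%
int main( int argc, char **argv )
    {
    int i;

    for ( i = 1; i < argc; ++i )
        {
        FILE *f = fopen( argv[i], "r" );

        if ( ! f )
            {
            perror( argv[i] );
            continue;
            }

        yyrestart( f );  /* point yyin at f, dropping old input */
        yylex();         /* returns when end-of-file is reached */
        fclose( f );
        }

    printf( "%d\n", lines );
    return 0;
    }
@end verbatim
@end example

Calling @code{yyrestart()} rather than assigning @file{yyin} directly
ensures that nothing left over in the previous file's buffer is scanned
as part of the next file.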
| |
| @node Yacc, Scanner Options, User Values, Top |
| @chapter Interfacing with Yacc |
| |
| @cindex yacc, interface |
| |
| @vindex yylval, with yacc |
| One of the main uses of @code{flex} is as a companion to the @code{yacc} |
| parser-generator. @code{yacc} parsers expect to call a routine named |
| @code{yylex()} to find the next input token. The routine is supposed to |
| return the type of the next token as well as putting any associated |
| value in the global @code{yylval}. To use @code{flex} with @code{yacc}, |
| one specifies the @samp{-d} option to @code{yacc} to instruct it to |
| generate the file @file{y.tab.h} containing definitions of all the |
@code{%token}s appearing in the @code{yacc} input.  This file is then
| included in the @code{flex} scanner. For example, if one of the tokens |
| is @code{TOK_NUMBER}, part of the scanner might look like: |
| |
| @cindex yacc interface |
| @example |
| @verbatim |
| %{ |
| #include "y.tab.h" |
| %} |
| |
| %% |
| |
| [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER; |
| @end verbatim |
| @end example |
| |
| @node Scanner Options, Performance, Yacc, Top |
| @chapter Scanner Options |
| |
| @cindex command-line options |
| @cindex options, command-line |
| @cindex arguments, command-line |
| |
The various @code{flex} options are categorized by function in the following
menu.  To look up a particular option by name, see @ref{Index of Scanner Options}.
| |
| @menu |
| * Options for Specifying Filenames:: |
| * Options Affecting Scanner Behavior:: |
| * Code-Level And API Options:: |
| * Options for Scanner Speed and Size:: |
| * Debugging Options:: |
| * Miscellaneous Options:: |
| @end menu |
| |
| Even though there are many scanner options, a typical scanner might only |
| specify the following options: |
| |
| @example |
| @verbatim |
| %option 8bit reentrant bison-bridge |
| %option warn nodefault |
| %option yylineno |
| %option outfile="scanner.c" header-file="scanner.h" |
| @end verbatim |
| @end example |
| |
| The first line specifies the general type of scanner we want. The second line |
| specifies that we are being careful. The third line asks flex to track line |
| numbers. The last line tells flex what to name the files. (The options can be |
| specified in any order. We just divided them.) |
| |
| @code{flex} also provides a mechanism for controlling options within the |
| scanner specification itself, rather than from the flex command-line. |
| This is done by including @code{%option} directives in the first section |
| of the scanner specification. You can specify multiple options with a |
| single @code{%option} directive, and multiple directives in the first |
| section of your flex input file. |
| |
| Most options are given simply as names, optionally preceded by the |
| word @samp{no} (with no intervening whitespace) to negate their meaning. |
| The names are the same as their long-option equivalents (but without the |
| leading @samp{--} ). |
| |
@code{flex} scans your rule actions to determine whether you use the
@code{REJECT} or @code{yymore()} features.  The @code{reject} and
@code{yymore} options are available to override its decision as to
whether you use these features, either by setting them (e.g.,
@code{%option reject}) to indicate the feature is indeed used, or
unsetting them to indicate it actually is not used (e.g.,
@code{%option noyymore}).
| |
| |
| A number of options are available for lint purists who want to suppress |
| the appearance of unneeded routines in the generated scanner. Each of |
| the following, if unset (e.g., @code{%option nounput}), results in the |
| corresponding routine not appearing in the generated scanner: |
| |
| @example |
| @verbatim |
| input, unput |
| yy_push_state, yy_pop_state, yy_top_state |
| yy_scan_buffer, yy_scan_bytes, yy_scan_string |
| |
| yyget_extra, yyset_extra, yyget_leng, yyget_text, |
| yyget_lineno, yyset_lineno, yyget_in, yyset_in, |
| yyget_out, yyset_out, yyget_lval, yyset_lval, |
| yyget_lloc, yyset_lloc, yyget_debug, yyset_debug |
| @end verbatim |
| @end example |
| |
(though @code{yy_push_state()} and friends won't appear anyway unless
you use @code{%option stack}).
| |
| @node Options for Specifying Filenames, Options Affecting Scanner Behavior, Scanner Options, Scanner Options |
| @section Options for Specifying Filenames |
| |
| @table @samp |
| |
| @anchor{option-header} |
| @opindex ---header-file |
| @opindex header-file |
| @item --header-file=FILE, @code{%option header-file="FILE"} |
| instructs flex to write a C header to @file{FILE}. This file contains |
| function prototypes, extern variables, and types used by the scanner. |
| Only the external API is exported by the header file. Many macros that |
| are usable from within scanner actions are not exported to the header |
| file. This is due to namespace problems and the goal of a clean |
| external API. |
| |
| While in the header, the macro @code{yyIN_HEADER} is defined, where @samp{yy} |
| is substituted with the appropriate prefix. |
| |
| The @samp{--header-file} option is not compatible with the @samp{--c++} option, |
| since the C++ scanner provides its own header in @file{yyFlexLexer.h}. |
| |
| |
| |
| @anchor{option-outfile} |
| @opindex -o |
| @opindex ---outfile |
| @opindex outfile |
| @item -oFILE, --outfile=FILE, @code{%option outfile="FILE"} |
| directs flex to write the scanner to the file @file{FILE} instead of |
| @file{lex.yy.c}. If you combine @samp{--outfile} with the @samp{--stdout} option, |
then the scanner is written to @file{stdout} but its @code{#line}
directives (see the @samp{-L} option) refer to the file
@file{FILE}.
| |
| |
| |
| @anchor{option-stdout} |
| @opindex -t |
| @opindex ---stdout |
| @opindex stdout |
| @item -t, --stdout, @code{%option stdout} |
| instructs @code{flex} to write the scanner it generates to standard |
| output instead of @file{lex.yy.c}. |
| |
| |
| |
| @opindex ---skel |
| @item -SFILE, --skel=FILE |
| overrides the default skeleton file from which |
| @code{flex} |
| constructs its scanners. You'll never need this option unless you are doing |
| @code{flex} |
| maintenance or development. |
| |
| @opindex ---tables-file |
| @opindex tables-file |
| @item --tables-file=FILE |
Write serialized scanner DFA tables to @file{FILE}.  The generated scanner will not
| contain the tables, and requires them to be loaded at runtime. |
| @xref{serialization}. |
| |
| @opindex ---tables-verify |
| @opindex tables-verify |
| @item --tables-verify |
| This option is for flex development. We document it here in case you stumble |
| upon it by accident or in case you suspect some inconsistency in the serialized |
tables.  Flex will serialize the scanner DFA tables but will also generate the
| in-code tables as it normally does. At runtime, the scanner will verify that |
| the serialized tables match the in-code tables, instead of loading them. |
| |
| @end table |
| |
| @node Options Affecting Scanner Behavior, Code-Level And API Options, Options for Specifying Filenames, Scanner Options |
| @section Options Affecting Scanner Behavior |
| |
| @table @samp |
| @anchor{option-case-insensitive} |
| @opindex -i |
| @opindex ---case-insensitive |
| @opindex case-insensitive |
| @item -i, --case-insensitive, @code{%option case-insensitive} |
| instructs @code{flex} to generate a @dfn{case-insensitive} scanner. The |
| case of letters given in the @code{flex} input patterns will be ignored, |
| and tokens in the input will be matched regardless of case. The matched |
| text given in @code{yytext} will have the preserved case (i.e., it will |
| not be folded). For tricky behavior, see @ref{case and character ranges}. |
| |
| |
| |
| @anchor{option-lex-compat} |
| @opindex -l |
| @opindex ---lex-compat |
| @opindex lex-compat |
| @item -l, --lex-compat, @code{%option lex-compat} |
| turns on maximum compatibility with the original AT&T @code{lex} |
| implementation. Note that this does not mean @emph{full} compatibility. |
| Use of this option costs a considerable amount of performance, and it |
| cannot be used with the @samp{--c++}, @samp{--full}, @samp{--fast}, @samp{-Cf}, or |
| @samp{-CF} options. For details on the compatibilities it provides, see |
| @ref{Lex and Posix}. This option also results in the name |
| @code{YY_FLEX_LEX_COMPAT} being @code{#define}'d in the generated scanner. |
| |
| |
| |
| @anchor{option-batch} |
| @opindex -B |
| @opindex ---batch |
| @opindex batch |
| @item -B, --batch, @code{%option batch} |
| instructs @code{flex} to generate a @dfn{batch} scanner, the opposite of |
| @emph{interactive} scanners generated by @samp{--interactive} (see below). In |
| general, you use @samp{-B} when you are @emph{certain} that your scanner |
| will never be used interactively, and you want to squeeze a |
| @emph{little} more performance out of it. If your goal is instead to |
| squeeze out a @emph{lot} more performance, you should be using the |
| @samp{-Cf} or @samp{-CF} options, which turn on @samp{--batch} automatically |
| anyway. |
| |
| |
| |
| @anchor{option-interactive} |
| @opindex -I |
| @opindex ---interactive |
| @opindex interactive |
| @item -I, --interactive, @code{%option interactive} |
| instructs @code{flex} to generate an @i{interactive} scanner. An |
| interactive scanner is one that only looks ahead to decide what token |
| has been matched if it absolutely must. It turns out that always |
| looking one extra character ahead, even if the scanner has already seen |
| enough text to disambiguate the current token, is a bit faster than only |
| looking ahead when necessary. But scanners that always look ahead give |
| dreadful interactive performance; for example, when a user types a |
| newline, it is not recognized as a newline token until they enter |
| @emph{another} token, which often means typing in another whole line. |
| |
| @code{flex} scanners default to @code{interactive} unless you use the |
| @samp{-Cf} or @samp{-CF} table-compression options |
| (@pxref{Performance}). That's because if you're looking for |
| high-performance you should be using one of these options, so if you |
| didn't, @code{flex} assumes you'd rather trade off a bit of run-time |
| performance for intuitive interactive behavior. Note also that you |
| @emph{cannot} use @samp{--interactive} in conjunction with @samp{-Cf} or |
| @samp{-CF}. Thus, this option is not really needed; it is on by default |
| for all those cases in which it is allowed. |
| |
| You can force a scanner to |
| @emph{not} |
| be interactive by using |
| @samp{--batch}. |
| |
| |
| |
| @anchor{option-7bit} |
| @opindex -7 |
| @opindex ---7bit |
| @opindex 7bit |
| @item -7, --7bit, @code{%option 7bit} |
| instructs @code{flex} to generate a 7-bit scanner, i.e., one which can |
| only recognize 7-bit characters in its input. The advantage of using |
| @samp{--7bit} is that the scanner's tables can be up to half the size of |
| those generated using @samp{--8bit}. The disadvantage is that such |
| scanners often hang or crash if their input contains an 8-bit character. |
| |
| Note, however, that unless you generate your scanner using the |
| @samp{-Cf} or @samp{-CF} table compression options, use of @samp{--7bit} |
| will save only a small amount of table space, and make your scanner |
| considerably less portable. @code{Flex}'s default behavior is to |
| generate an 8-bit scanner unless you use @samp{-Cf} or @samp{-CF}, |
| in which case @code{flex} defaults to generating 7-bit scanners unless |
| your site was always configured to generate 8-bit scanners (as will |
| often be the case with non-USA sites). You can tell whether flex |
| generated a 7-bit or an 8-bit scanner by inspecting the flag summary in |
| the @samp{--verbose} output as described above. |
| |
| Note that if you use @samp{-Cfe} or @samp{-CFe}, @code{flex} still |
| defaults to generating an 8-bit scanner, since usually with these |
| compression options full 8-bit tables are not much more expensive than |
| 7-bit tables. |
| |
| |
| |
| @anchor{option-8bit} |
| @opindex -8 |
| @opindex ---8bit |
| @opindex 8bit |
| @item -8, --8bit, @code{%option 8bit} |
| instructs @code{flex} to generate an 8-bit scanner, i.e., one which can |
| recognize 8-bit characters. This flag is only needed for scanners |
| generated using @samp{-Cf} or @samp{-CF}, as otherwise flex defaults to |
| generating an 8-bit scanner anyway. |
| |
| See the discussion of |
| @samp{--7bit} |
| above for @code{flex}'s default behavior and the tradeoffs between 7-bit |
| and 8-bit scanners. |
| |
| |
| |
| @anchor{option-default} |
| @opindex ---default |
| @opindex default |
| @item --default, @code{%option default} |
| generates the default rule. |
| |
| |
| |
| @anchor{option-always-interactive} |
| @opindex ---always-interactive |
| @opindex always-interactive |
| @item --always-interactive, @code{%option always-interactive} |
| instructs flex to generate a scanner which always considers its input |
| @emph{interactive}. Normally, on each new input file the scanner calls |
| @code{isatty()} in an attempt to determine whether the scanner's input |
| source is interactive and thus should be read a character at a time. |
| When this option is used, however, then no such call is made. |
| |
| |
| |
| @opindex ---never-interactive |
| @item --never-interactive, @code{%option never-interactive} |
| instructs flex to generate a scanner which never considers its input |
| interactive. This is the opposite of @code{always-interactive}. |
| |
| |
| @anchor{option-posix} |
| @opindex -X |
| @opindex ---posix |
| @opindex posix |
| @item -X, --posix, @code{%option posix} |
| turns on maximum compatibility with the POSIX 1003.2-1992 definition of |
| @code{lex}. Since @code{flex} was originally designed to implement the |
| POSIX definition of @code{lex} this generally involves very few changes |
| in behavior. At the current writing the known differences between |
| @code{flex} and the POSIX standard are: |
| |
| @itemize |
| @item |
| In POSIX and AT&T @code{lex}, the repeat operator, @samp{@{@}}, has lower |
| precedence than concatenation (thus @samp{ab@{3@}} yields @samp{ababab}). |
| Most POSIX utilities use an Extended Regular Expression (ERE) precedence |
| that has the precedence of the repeat operator higher than concatenation |
| (which causes @samp{ab@{3@}} to yield @samp{abbb}). By default, @code{flex} |
| places the precedence of the repeat operator higher than concatenation |
| which matches the ERE processing of other POSIX utilities. When either |
| @samp{--posix} or @samp{-l} is specified, @code{flex} will use the |
| traditional AT&T and POSIX-compliant precedence for the repeat operator |
| where concatenation has higher precedence than the repeat operator. |
| @end itemize |
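| |
| If a pattern must mean the same thing in both modes, parentheses make |
| the grouping explicit. The following is a minimal sketch; the token |
| names @code{TOK_AB3} and @code{TOK_ABBB} are hypothetical and are |
| assumed to be defined elsewhere: |
| |
| @example |
| @verbatim |
| %% |
| (ab){3}    return TOK_AB3;    /* ababab, with or without --posix */ |
| a(b{3})    return TOK_ABBB;   /* abbb, with or without --posix */ |
| @end verbatim |
| @end example |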
| |
| |
| @anchor{option-stack} |
| @opindex ---stack |
| @opindex stack |
| @item --stack, @code{%option stack} |
| enables the use of |
| start condition stacks (@pxref{Start Conditions}). |
| |
| |
| |
| @anchor{option-stdinit} |
| @opindex ---stdinit |
| @opindex stdinit |
| @item --stdinit, @code{%option stdinit} |
| if set (i.e., @b{%option stdinit}) initializes @code{yyin} and |
| @code{yyout} to @file{stdin} and @file{stdout}, instead of the default of |
| @file{NULL}. Some existing @code{lex} programs depend on this behavior, |
| even though it is not compliant with ANSI C, which does not require |
| @file{stdin} and @file{stdout} to be compile-time constant. In a |
| reentrant scanner, however, this is not a problem since initialization |
| is performed in @code{yylex_init} at runtime. |
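| |
| A program that should not depend on this option can assign the streams |
| itself before the first call to @code{yylex()}. The following is only |
| a sketch: it assumes a non-reentrant C scanner, and taking the input |
| file name from the first command-line argument is an arbitrary choice |
| made for illustration: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| %% |
| int main(int argc, char **argv) |
| { |
|     if (argc > 1) { |
|         yyin = fopen(argv[1], "r"); |
|         if (yyin == NULL) { |
|             perror(argv[1]); |
|             return 1; |
|         } |
|     } else { |
|         yyin = stdin;       /* explicit; does not rely on stdinit */ |
|     } |
|     yyout = stdout; |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |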
| |
| |
| |
| @anchor{option-yylineno} |
| @opindex ---yylineno |
| @opindex yylineno |
| @item --yylineno, @code{%option yylineno} |
| directs @code{flex} to generate a scanner |
| that maintains the number of the current line read from its input in the |
| global variable @code{yylineno}. This option is implied by @code{%option |
| lex-compat}. In a reentrant C scanner, the macro @code{yylineno} is |
| accessible regardless of the value of @code{%option yylineno}; however, its |
| value is not modified by @code{flex} unless @code{%option yylineno} is enabled. |
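| |
| As an illustration, the following sketch uses @code{yylineno} to |
| locate unexpected input. The rules and the message text are |
| assumptions chosen for the example, not part of @code{flex}: |
| |
| @example |
| @verbatim |
| %option yylineno noyywrap |
| %% |
| [a-z]+      /* identifier; no action */ |
| [ \t\n]+    /* white space; yylineno advances on each newline */ |
| .           fprintf(stderr, "line %d: unexpected '%s'\n", yylineno, yytext); |
| @end verbatim |
| @end example |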
| |
| |
| |
| @anchor{option-yywrap} |
| @opindex ---yywrap |
| @opindex yywrap |
| @item --yywrap, @code{%option yywrap} |
| if unset (i.e., @code{--noyywrap}), makes the scanner not call |
| @code{yywrap()} upon an end-of-file, but simply assume that there are no |
| more files to scan (until the user points @file{yyin} at a new file and |
| calls @code{yylex()} again). |
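| |
| For a scanner that reads a single input, the two usual choices can be |
| sketched as follows. Either the call is disabled: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| @end verbatim |
| @end example |
| |
| or the option is left at its default and the routine is supplied in |
| the user code section, where a return value of 1 reports that no |
| further input follows: |
| |
| @example |
| @verbatim |
| %% |
| %% |
| int yywrap(void) |
| { |
|     return 1;   /* no more files to scan */ |
| } |
| @end verbatim |
| @end example |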
| |
| @end table |
| |
| @node Code-Level And API Options, Options for Scanner Speed and Size, Options Affecting Scanner Behavior, Scanner Options |
| @section Code-Level And API Options |
| |
| @table @samp |
| |
| @anchor{option-ansi-definitions} |
| @opindex ---option-ansi-definitions |
| @opindex ansi-definitions |
| @item --ansi-definitions, @code{%option ansi-definitions} |
| instructs flex to generate ANSI C99 definitions for functions. |
| This option is enabled by default. |
| If @code{%option noansi-definitions} is specified, then the obsolete style |
| is generated. |
| |
| @anchor{option-ansi-prototypes} |
| @opindex ---option-ansi-prototypes |
| @opindex ansi-prototypes |
| @item --ansi-prototypes, @code{%option ansi-prototypes} |
| instructs flex to generate ANSI C99 prototypes for functions. |
| This option is enabled by default. |
| If @code{noansi-prototypes} is specified, then |
| prototypes will have empty parameter lists. |
| |
| @anchor{option-bison-bridge} |
| @opindex ---bison-bridge |
| @opindex bison-bridge |
| @item --bison-bridge, @code{%option bison-bridge} |
| instructs flex to generate a C scanner that is |
| meant to be called by a |
| @code{GNU bison} |
| parser. The scanner has minor API changes for |
| @code{bison} |
| compatibility. In particular, the declaration of |
| @code{yylex} |
| is modified to take an additional parameter, |
| @code{yylval}. |
| @xref{Bison Bridge}. |
| |
| @anchor{option-bison-locations} |
| @opindex ---bison-locations |
| @opindex bison-locations |
| @item --bison-locations, @code{%option bison-locations} |
| instructs flex that |
| @code{GNU bison} @code{%locations} are being used. |
| This means @code{yylex} will be passed |
| an additional parameter, @code{yylloc}. This option |
| implies @code{%option bison-bridge}. |
| @xref{Bison Bridge}. |
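| |
| With the bridge enabled, an action stores the semantic value through |
| the @code{yylval} pointer. The following is a sketch under stated |
| assumptions: the scanner is reentrant and is paired with a pure |
| @code{bison} parser, the header name @file{parser.tab.h} is |
| hypothetical, and the semantic type is assumed to have an integer |
| member @code{num} and the grammar a token @code{NUMBER}: |
| |
| @example |
| @verbatim |
| %option reentrant bison-bridge noyywrap |
| %{ |
| #include <stdlib.h>        /* atoi() */ |
| #include "parser.tab.h"    /* hypothetical: declares YYSTYPE and NUMBER */ |
| %} |
| %% |
| [0-9]+      { yylval->num = atoi(yytext); return NUMBER; } |
| [ \t\n]+    /* skip white space */ |
| @end verbatim |
| @end example |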
| |
| @anchor{option-noline} |
| @opindex -L |
| @opindex ---noline |
| @opindex noline |
| @item -L, --noline, @code{%option noline} |
| instructs |
| @code{flex} |
| not to generate |
| @code{#line} |
| directives. Without this option, |
| @code{flex} |
| peppers the generated scanner |
| with @code{#line} directives so error messages in the actions will be correctly |
| located with respect to either the original |
| @code{flex} |
| input file (if the errors are due to code in the input file), or |
| @file{lex.yy.c} |
| (if the errors are |
| @code{flex}'s |
| fault -- you should report these sorts of errors to the email address |
| given in @ref{Reporting Bugs}). |
| |
| |
| |
| @anchor{option-reentrant} |
| @opindex -R |
| @opindex ---reentrant |
| @opindex reentrant |
| @item -R, --reentrant, @code{%option reentrant} |
| instructs flex to generate a reentrant C scanner. The generated scanner |
| may safely be used in a multi-threaded environment. The API for a |
| reentrant scanner is different than for a non-reentrant scanner |
| (@pxref{Reentrant}). Because of the API difference between |
| reentrant and non-reentrant @code{flex} scanners, non-reentrant flex |
| code must be modified before it is suitable for use with this option. |
| This option is not compatible with the @samp{--c++} option. |
| |
| The option @samp{--reentrant} does not affect the performance of |
| the scanner. |
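| |
| A minimal sketch of driving a reentrant scanner follows. The single |
| rule is an arbitrary example; the scanner state is created with |
| @code{yylex_init}, passed to @code{yylex}, and released with |
| @code{yylex_destroy}: |
| |
| @example |
| @verbatim |
| %option reentrant noyywrap |
| %% |
| [a-z]+    ECHO; |
| %% |
| int main(void) |
| { |
|     yyscan_t scanner; |
| |
|     if (yylex_init(&scanner) != 0) |
|         return 1;               /* could not allocate scanner state */ |
|     yylex(scanner);             /* reads stdin by default */ |
|     yylex_destroy(scanner); |
|     return 0; |
| } |
| @end verbatim |
| @end example |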
| |
| |
| |
| @anchor{option-c++} |
| @opindex -+ |
| @opindex ---c++ |
| @opindex c++ |
| @item -+, --c++, @code{%option c++} |
| specifies that you want flex to generate a C++ |
| scanner class. @xref{Cxx}, for |
| details. |
| |
| |
| |
| @anchor{option-array} |
| @opindex ---array |
| @opindex array |
| @item --array, @code{%option array} |
| specifies that you want @code{yytext} to be an array instead of a @code{char *}. |
| |
| |
| |
| @anchor{option-pointer} |
| @opindex ---pointer |
| @opindex pointer |
| @item --pointer, @code{%option pointer} |
| specifies that @code{yytext} should be a @code{char *}, not an array. |
| This is the default. |
| |
| |
| |
| @anchor{option-prefix} |
| @opindex -P |
| @opindex ---prefix |
| @opindex prefix |
| @item -PPREFIX, --prefix=PREFIX, @code{%option prefix="PREFIX"} |
| changes the default @samp{yy} prefix used by @code{flex} for all |
| globally-visible variable and function names to instead be |
| @samp{PREFIX}. For example, @samp{--prefix=foo} changes the name of |
| @code{yytext} to @code{footext}. It also changes the name of the default |
| output file from @file{lex.yy.c} to @file{lex.foo.c}. Here is a partial |
| list of the names affected: |
| |
| @example |
| @verbatim |
| yy_create_buffer |
| yy_delete_buffer |
| yy_flex_debug |
| yy_init_buffer |
| yy_flush_buffer |
| yy_load_buffer_state |
| yy_switch_to_buffer |
| yyin |
| yyleng |
| yylex |
| yylineno |
| yyout |
| yyrestart |
| yytext |
| yywrap |
| yyalloc |
| yyrealloc |
| yyfree |
| @end verbatim |
| @end example |
| |
| (If you are using a C++ scanner, then only @code{yywrap} and |
| @code{yyFlexLexer} are affected.) Within your scanner itself, you can |
| still refer to the global variables and functions using either version |
| of their name; but externally, they have the modified name. |
| |
| This option lets you easily link together multiple |
| @code{flex} |
| programs into the same executable. Note, though, that using this |
| option also renames |
| @code{yywrap()}, |
| so you now |
| @emph{must} |
| either |
| provide your own (appropriately-named) version of the routine for your |
| scanner, or use |
| @code{%option noyywrap}, |
| as linking with |
| @samp{-lfl} |
| no longer provides one for you by default. |
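| |
| As a sketch of the intended use, suppose two scanner sources (the |
| file names @file{foo.l} and @file{bar.l} are hypothetical) each select |
| their own prefix and disable @code{yywrap()}. The first would begin: |
| |
| @example |
| @verbatim |
| %option prefix="foo" noyywrap |
| @end verbatim |
| @end example |
| |
| and the second would use @samp{prefix="bar"} in the same way. A |
| calling program can then refer to both scanners by their external |
| names. Taking the two input files from the command line is an |
| assumption made for the example: |
| |
| @example |
| @verbatim |
| #include <stdio.h> |
| |
| extern FILE *fooin, *barin;     /* the renamed yyin of each scanner */ |
| extern int foolex(void);        /* the renamed yylex of each scanner */ |
| extern int barlex(void); |
| |
| int main(int argc, char **argv) |
| { |
|     if (argc < 3) |
|         return 1; |
|     fooin = fopen(argv[1], "r"); |
|     barin = fopen(argv[2], "r"); |
|     if (fooin == NULL || barin == NULL) |
|         return 1; |
|     foolex(); |
|     barlex(); |
|     return 0; |
| } |
| @end verbatim |
| @end example |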
| |
| |
| |
| @anchor{option-main} |
| @opindex ---main |
| @opindex main |
| @item --main, @code{%option main} |
| directs flex to provide a default @code{main()} program for the |
| scanner, which simply calls @code{yylex()}. This option implies |
| @code{noyywrap} (see below). |
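| |
| For example, the following complete scanner (a sketch; the rule is an |
| arbitrary choice) needs no user code section at all. It replaces each |
| run of blanks and tabs with a single blank and lets the default rule |
| echo everything else: |
| |
| @example |
| @verbatim |
| %option main |
| %% |
| [ \t]+    putchar(' '); |
| @end verbatim |
| @end example |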
| |
| |
| |
| @anchor{option-nounistd} |
| @opindex ---nounistd |
| @opindex nounistd |
| @item --nounistd, @code{%option nounistd} |
| suppresses inclusion of the non-ANSI header file @file{unistd.h}. This option |
| is meant to target environments in which @file{unistd.h} does not exist. Be aware |
| that certain options may cause flex to generate code that relies on functions |
| normally found in @file{unistd.h} (e.g., @code{isatty()}, @code{read()}). |
| If you wish to use these functions, you will have to inform your compiler where |
| to find them. |
| @xref{option-always-interactive}. @xref{option-read}. |
| |
| |
| |
| @anchor{option-yyclass} |
| @opindex ---yyclass |
| @opindex yyclass |
| @item --yyclass=NAME, @code{%option yyclass="NAME"} |
| only applies when generating a C++ scanner (the @samp{--c++} option). It |
| informs @code{flex} that you have derived @code{NAME} as a subclass of |
| @code{yyFlexLexer}, so @code{flex} will place your actions in the member |
| function @code{NAME::yylex()} instead of @code{yyFlexLexer::yylex()}. It |
| also generates a @code{yyFlexLexer::yylex()} member function that emits |
| a run-time error (by invoking @code{yyFlexLexer::LexerError()}) if |
| called. @xref{Cxx}. |
| |
| @end table |
| |
| @node Options for Scanner Speed and Size, Debugging Options, Code-Level And API Options, Scanner Options |
| @section Options for Scanner Speed and Size |
| |
| @table @samp |
| |
| @item -C[aefFmr] |
| controls the degree of table compression and, more generally, trade-offs |
| between small scanners and fast scanners. |
| |
| @table @samp |
| @opindex -C |
| @item -C |
| A lone @samp{-C} specifies that the scanner tables should be compressed |
| but neither equivalence classes nor meta-equivalence classes should be |
| used. |
| |
| @anchor{option-align} |
| @opindex -Ca |
| @opindex ---align |
| @opindex align |
| @item -Ca, --align, @code{%option align} |
| (``align'') instructs flex to trade off larger tables in the |
| generated scanner for faster performance because the elements of |
| the tables are better aligned for memory access and computation. On some |
| RISC architectures, fetching and manipulating longwords is more efficient |
| than with smaller-sized units such as shortwords. This option can |
| quadruple the size of the tables used by your scanner. |
| |
| @anchor{option-ecs} |
| @opindex -Ce |
| @opindex ---ecs |
| @opindex ecs |
| @item -Ce, --ecs, @code{%option ecs} |
| directs @code{flex} to construct @dfn{equivalence classes}, i.e., sets |
| of characters which have identical lexical properties (for example, if |
| the only appearance of digits in the @code{flex} input is in the |
| character class ``[0-9]'' then the digits '0', '1', ..., '9' will all be |
| put in the same equivalence class). Equivalence classes usually give |
| dramatic reductions in the final table/object file sizes (typically a |
| factor of 2-5) and are pretty cheap performance-wise (one array look-up |
| per character scanned). |
| |
| @opindex -Cf |
| @item -Cf |
| specifies that the @dfn{full} scanner tables should be generated - |
| @code{flex} should not compress the tables by taking advantage of |
| similar transition functions for different states. |
| |
| @opindex -CF |
| @item -CF |
| specifies that the alternate fast scanner representation (described |
| above under the @samp{--fast} flag) should be used. This option cannot be |
| used with @samp{--c++}. |
| |
| @anchor{option-meta-ecs} |
| @opindex -Cm |
| @opindex ---meta-ecs |
| @opindex meta-ecs |
| @item -Cm, --meta-ecs, @code{%option meta-ecs} |
| directs |
| @code{flex} |
| to construct |
| @dfn{meta-equivalence classes}, |
| which are sets of equivalence classes (or characters, if equivalence |
| classes are not being used) that are commonly used together. Meta-equivalence |
| classes are often a big win when using compressed tables, but they |
| have a moderate performance impact (one or two @code{if} tests and one |
| array look-up per character scanned). |
| |
| @anchor{option-read} |
| @opindex -Cr |
| @opindex ---read |
| @opindex read |
| @item -Cr, --read, @code{%option read} |
| causes the generated scanner to @emph{bypass} use of the standard I/O |
| library (@code{stdio}) for input. Instead of calling @code{fread()} or |
| @code{getc()}, the scanner will use the @code{read()} system call, |
| resulting in a performance gain which varies from system to system, but |
| in general is probably negligible unless you are also using @samp{-Cf} |
| or @samp{-CF}. Using @samp{-Cr} can cause strange behavior if, for |
| example, you read from @file{yyin} using @code{stdio} prior to calling |
| the scanner (because the scanner will miss whatever text your previous |
| reads left in the @code{stdio} input buffer). @samp{-Cr} has no effect |
| if you define @code{YY_INPUT()} (@pxref{Generated Scanner}). |
| @end table |
| |
| The options @samp{-Cf} or @samp{-CF} and @samp{-Cm} do not make sense |
| together - there is no opportunity for meta-equivalence classes if the |
| table is not being compressed. Otherwise the options may be freely |
| mixed, and are cumulative. |
| |
| The default setting is @samp{-Cem}, which specifies that @code{flex} |
| should generate equivalence classes and meta-equivalence classes. This |
| setting provides the highest degree of table compression. You can trade |
| off faster-executing scanners at the cost of larger tables with the |
| following generally being true: |
| |
| @example |
| @verbatim |
| slowest & smallest |
| -Cem |
| -Cm |
| -Ce |
| -C |
| -C{f,F}e |
| -C{f,F} |
| -C{f,F}a |
| fastest & largest |
| @end verbatim |
| @end example |
| |
| Note that scanners with the smallest tables are usually generated and |
| compiled the quickest, so during development you will usually want to |
| use the default, maximal compression. |
| |
| @samp{-Cfe} is often a good compromise between speed and size for |
| production scanners. |
| |
| @anchor{option-full} |
| @opindex -f |
| @opindex ---full |
| @opindex full |
| @item -f, --full, @code{%option full} |
| specifies a |
| @dfn{fast scanner}. |
| No table compression is done and @code{stdio} is bypassed. |
| The result is large but fast. This option is equivalent to |
| @samp{-Cfr}. |
| |
| |
| @anchor{option-fast} |
| @opindex -F |
| @opindex ---fast |
| @opindex fast |
| @item -F, --fast, @code{%option fast} |
| specifies that the @emph{fast} scanner table representation should be |
| used (and @code{stdio} bypassed). This representation is about as fast |
| as the full table representation @samp{--full}, and for some sets of |
| patterns will be considerably smaller (and for others, larger). In |
| general, if the pattern set contains both @emph{keywords} and a |
| catch-all, @emph{identifier} rule, such as in the set: |
| |
| @example |
| @verbatim |
| "case" return TOK_CASE; |
| "switch" return TOK_SWITCH; |
| ... |
| "default" return TOK_DEFAULT; |
| [a-z]+ return TOK_ID; |
| @end verbatim |
| @end example |
| |
| then you're better off using the full table representation. If only |
| the @emph{identifier} rule is present and you then use a hash table or some such |
| to detect the keywords, you're better off using |
| @samp{--fast}. |
| |
| This option is equivalent to @samp{-CFr}. It cannot be used |
| with @samp{--c++}. |
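| |
| The second arrangement can be sketched as follows. This is an |
| illustration only: it substitutes a sorted table searched with |
| @code{bsearch()} for the hash table mentioned above, and the token |
| values and the helper names @code{kw_cmp} and @code{lookup} are |
| hypothetical: |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %{ |
| #include <stdlib.h>     /* bsearch() */ |
| #include <string.h>     /* strcmp() */ |
| |
| /* Hypothetical token values, normally supplied by the parser. */ |
| enum { TOK_ID = 258, TOK_CASE, TOK_DEFAULT, TOK_SWITCH }; |
| |
| /* Keyword table; entries must stay sorted by name for bsearch(). */ |
| static const struct kw { const char *name; int tok; } keywords[] = { |
|     { "case",    TOK_CASE }, |
|     { "default", TOK_DEFAULT }, |
|     { "switch",  TOK_SWITCH }, |
| }; |
| |
| static int kw_cmp(const void *key, const void *elem) |
| { |
|     return strcmp((const char *) key, ((const struct kw *) elem)->name); |
| } |
| |
| /* Returns the keyword token for text, or TOK_ID if it is not a keyword. */ |
| static int lookup(const char *text) |
| { |
|     const struct kw *k = bsearch(text, keywords, |
|                                  sizeof keywords / sizeof keywords[0], |
|                                  sizeof keywords[0], kw_cmp); |
|     return k ? k->tok : TOK_ID; |
| } |
| %} |
| %% |
| [a-z]+    return lookup(yytext); |
| @end verbatim |
| @end example |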
| |
| @end table |
| |
| @node Debugging Options, Miscellaneous Options, Options for Scanner Speed and Size, Scanner Options |
| @section Debugging Options |
| |
| @table @samp |
| |
| @anchor{option-backup} |
| @opindex -b |
| @opindex ---backup |
| @opindex backup |
| @item -b, --backup, @code{%option backup} |
| Generate backing-up information to @file{lex.backup}. This is a list of |
| scanner states which require backing up and the input characters on |
| which they do so. By adding rules one can remove backing-up states. If |
| @emph{all} backing-up states are eliminated and @samp{-Cf} or @samp{-CF} |
| is used, the generated scanner will run faster (see the @samp{--perf-report} flag). |
| Only users who wish to squeeze every last cycle out of their scanners |
| need worry about this option (@pxref{Performance}). |
| |
| |
| |
| @anchor{option-debug} |
| @opindex -d |
| @opindex ---debug |
| @opindex debug |
| @item -d, --debug, @code{%option debug} |
| makes the generated scanner run in @dfn{debug} mode. Whenever a pattern |
| is recognized and the global variable @code{yy_flex_debug} is non-zero |
| (which is the default), the scanner will write to @file{stderr} a line |
| of the form: |
| |
| @example |
| @verbatim |
| -accepting rule at line 53 ("the matched text") |
| @end verbatim |
| @end example |
| |
| The line number refers to the location of the rule in the file defining |
| the scanner (i.e., the file that was fed to flex). Messages are also |
| generated when the scanner backs up, accepts the default rule, reaches |
| the end of its input buffer (or encounters a NUL; at this point, the two |
| look the same as far as the scanner's concerned), or reaches an |
| end-of-file. |
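| |
| Because the messages depend on @code{yy_flex_debug} at run time, a |
| scanner built with this option can still run quietly. The sketch |
| below turns tracing on only when an environment variable is set; the |
| variable name @code{SCANNER_TRACE} and the rules are hypothetical: |
| |
| @example |
| @verbatim |
| %option debug noyywrap |
| %{ |
| #include <stdlib.h>     /* getenv() */ |
| %} |
| %% |
| [a-z]+      /* identifier; no action */ |
| [ \t\n]+    /* white space; no action */ |
| %% |
| int main(void) |
| { |
|     /* Trace only when the (hypothetical) variable SCANNER_TRACE is set. */ |
|     yy_flex_debug = getenv("SCANNER_TRACE") != NULL; |
|     return yylex(); |
| } |
| @end verbatim |
| @end example |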
| |
| |
| |
| @anchor{option-perf-report} |
| @opindex -p |
| @opindex ---perf-report |
| @opindex perf-report |
| @item -p, --perf-report, @code{%option perf-report} |
| generates a performance report to @file{stderr}. The report consists of |
| comments regarding features of the @code{flex} input file which will |
| cause a serious loss of performance in the resulting scanner. If you |
| give the flag twice, you will also get comments regarding features that |
| lead to minor performance losses. |
| |
| Note that the use of @code{REJECT}, and |
| variable trailing context (@pxref{Limitations}) entails a substantial |
| performance penalty; use of @code{yymore()}, the @samp{^} operator, and |
| the @samp{--interactive} flag entail minor performance penalties. |
| |
| |
| |
| @anchor{option-nodefault} |
| @opindex -s |
| @opindex ---nodefault |
| @opindex nodefault |
| @item -s, --nodefault, @code{%option nodefault} |
| causes the @emph{default rule} (that unmatched scanner input is echoed |
| to @file{stdout}) to be suppressed. If the scanner encounters input |
| that does not match any of its rules, it aborts with an error. This |
| option is useful for finding holes in a scanner's rule set. |
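| |
| One way to use the option is to close every hole with an explicit |
| last rule, so that unexpected input is reported rather than echoed. |
| In this sketch the token name @code{TOK_ID} and the message text are |
| hypothetical: |
| |
| @example |
| @verbatim |
| %option nodefault noyywrap |
| %% |
| [a-z]+      return TOK_ID; |
| [ \t\n]+    /* skip white space */ |
| .           fprintf(stderr, "unexpected character '%s'\n", yytext); |
| @end verbatim |
| @end example |
| |
| Without the final rule, input such as a digit would match nothing and |
| the scanner would abort. |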
| |
| |
| |
| @anchor{option-trace} |
| @opindex -T |
| @opindex ---trace |
| @opindex trace |
| @item -T, --trace, @code{%option trace} |
| makes @code{flex} run in @dfn{trace} mode. It will generate a lot of |
| messages to @file{stderr} concerning the form of the input and the |
| resultant non-deterministic and deterministic finite automata. This |
| option is mostly for use in maintaining @code{flex}. |
| |
| |
| |
| @anchor{option-nowarn} |
| @opindex -w |
| @opindex ---nowarn |
| @opindex nowarn |
| @item -w, --nowarn, @code{%option nowarn} |
| suppresses warning messages. |
| |
| |
| |
| @anchor{option-verbose} |
| @opindex -v |
| @opindex ---verbose |
| @opindex verbose |
| @item -v, --verbose, @code{%option verbose} |
| specifies that @code{flex} should write to @file{stderr} a summary of |
| statistics regarding the scanner it generates. Most of the statistics |
| are meaningless to the casual @code{flex} user, but the first line |
| identifies the version of @code{flex} (same as reported by @samp{--version}), |
| and the next line the flags used when generating the scanner, including |
| those that are on by default. |
| |
| |
| |
| @anchor{option-warn} |
| @opindex ---warn |
| @opindex warn |
| @item --warn, @code{%option warn} |
| warns about certain things. In particular, if the default rule can be |
| matched but no default rule has been given, @code{flex} will warn you. |
| We recommend using this option always. |
| |
| @end table |
| |
| @node Miscellaneous Options, , Debugging Options, Scanner Options |
| @section Miscellaneous Options |
| |
| @table @samp |
| @opindex -c |
| @item -c |
| A do-nothing option included for POSIX compliance. |
| |
| @opindex -h |
| @opindex ---help |
| @item -h, -?, --help |
| generates a ``help'' summary of @code{flex}'s options to @file{stdout} |
| and then exits. |
| |
| @opindex -n |
| @item -n |
| Another do-nothing option included for |
| POSIX compliance. |
| |
| @opindex -V |
| @opindex ---version |
| @item -V, --version |
| prints the version number to @file{stdout} and exits. |
| |
| @end table |
| |
| |
| @node Performance, Cxx, Scanner Options, Top |
| @chapter Performance Considerations |
| |
| @cindex performance, considerations |
| The main design goal of @code{flex} is that it generate high-performance |
| scanners. It has been optimized for dealing well with large sets of |
| rules. Aside from the effects on scanner speed of the table compression |
| @samp{-C} options outlined above, there are a number of options/actions |
| which degrade performance. These are, from most expensive to least: |
| |
| @cindex REJECT, performance costs |
| @cindex yylineno, performance costs |
| @cindex trailing context, performance costs |
| @example |
| @verbatim |
| REJECT |
| arbitrary trailing context |
| |
| pattern sets that require backing up |
| %option yylineno |
| %array |
| |
| %option interactive |
| %option always-interactive |
| |
| ^ beginning-of-line operator |
| yymore() |
| @end verbatim |
| @end example |
| |
| with the first two being quite expensive and the last two being |
| quite cheap. Note also that @code{unput()} is implemented as a routine |
| call that potentially does quite a bit of work, while @code{yyless()} is |
| a quite-cheap macro. So if you are just putting back some excess text |
| you scanned, use @code{yyless()}. |
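| |
| As a brief illustration (a sketch with arbitrary rules), the first |
| action below echoes the whole match and then returns all but its |
| first three characters to the input, where they are scanned again: |
| |
| @example |
| @verbatim |
| %% |
| foobar    { ECHO; yyless(3); } |
| [a-z]+    ECHO; |
| @end verbatim |
| @end example |
| |
| On the input @samp{foobar} this writes @samp{foobarbar}. |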
| |
| @code{REJECT} should be avoided at all costs when performance is |
| important. It is a particularly expensive option. |
| |
| There is one case when @code{%option yylineno} can be expensive. That is when |
| your patterns match long tokens that could @emph{possibly} contain a newline |
| character. There is no performance penalty for rules that cannot possibly |
| match newlines, since flex does not need to check them for newlines. In |
| general, you should avoid rules such as @code{[^f]+}, which match very long |
| tokens, including newlines, and may possibly match your entire file! A better |
| approach is to separate @code{[^f]+} into two rules: |
| |
| @example |
| @verbatim |
| %option yylineno |
| %% |
| [^f\n]+ |
| \n+ |
| @end verbatim |
| @end example |
| |
| The above scanner does not incur a performance penalty. |
| |
| @cindex patterns, tuning for performance |
| @cindex performance, backing up |
| @cindex backing up, example of eliminating |
| Getting rid of backing up is messy and often may be an enormous amount |
| of work for a complicated scanner. In principle, one begins by using |
| the @samp{-b} flag to generate a @file{lex.backup} file. For example, |
| on the input: |
| |
| @cindex backing up, eliminating |
| @example |
| @verbatim |
| %% |
| foo return TOK_KEYWORD; |
| foobar return TOK_KEYWORD; |
| @end verbatim |
| @end example |
| |
| the file looks like: |
| |
| @example |
| @verbatim |
| State #6 is non-accepting - |
| associated rule line numbers: |
| 2 3 |
| out-transitions: [ o ] |
| jam-transitions: EOF [ \001-n p-\177 ] |
| |
| State #8 is non-accepting - |
| associated rule line numbers: |
| 3 |
| out-transitions: [ a ] |
| jam-transitions: EOF [ \001-` b-\177 ] |
| |
| State #9 is non-accepting - |
| associated rule line numbers: |
| 3 |
| out-transitions: [ r ] |
| jam-transitions: EOF [ \001-q s-\177 ] |
| |
| Compressed tables always back up. |
| @end verbatim |
| @end example |
| |
| The first few lines tell us that there's a scanner state in which it can |
| make a transition on an 'o' but not on any other character, and that in |
| that state the currently scanned text does not match any rule. The |
| state occurs when trying to match the rules found at lines 2 and 3 in |
| the input file. If the scanner is in that state and then reads |
| something other than an 'o', it will have to back up to find a rule |
| which is matched. With a bit of headscratching one can see that this |
| must be the state it's in when it has seen @samp{fo}. When this has |
| happened, if anything other than another @samp{o} is seen, the scanner |
| will have to back up to simply match the @samp{f} (by the default rule). |
| |
| The comment regarding State #8 indicates there's a problem when |
| @samp{foob} has been scanned. Indeed, on any character other than an |
| @samp{a}, the scanner will have to back up to accept @samp{foo}. Similarly, |
| the comment for State #9 concerns when @samp{fooba} has been scanned and |
| an @samp{r} does not follow. |
| |
| The final comment reminds us that there's no point going to all the |
| trouble of removing backing up from the rules unless we're using |
| @samp{-Cf} or @samp{-CF}, since there's no performance gain doing so |
| with compressed scanners. |
| |
| @cindex error rules, to eliminate backing up |
| The way to remove the backing up is to add ``error'' rules: |
| |
| @cindex backing up, eliminating by adding error rules |
| @example |
| @verbatim |
| %% |
| foo return TOK_KEYWORD; |
| foobar return TOK_KEYWORD; |
| |
| fooba | |
| foob | |
| fo { |
| /* false alarm, not really a keyword */ |
| return TOK_ID; |
| } |
| @end verbatim |
| @end example |
| |
| Eliminating backing up among a list of keywords can also be done using a |
| ``catch-all'' rule: |
| |
| @cindex backing up, eliminating with catch-all rule |
| @example |
| @verbatim |
| %% |
| foo return TOK_KEYWORD; |
| foobar return TOK_KEYWORD; |
| |
| [a-z]+ return TOK_ID; |
| @end verbatim |
| @end example |
| |
| This is usually the best solution when appropriate. |
| |
| Backing up messages tend to cascade. With a complicated set of rules |
| it's not uncommon to get hundreds of messages. If one can decipher |
| them, though, it often only takes a dozen or so rules to eliminate the |
| backing up, though it's easy to make a mistake and have an error rule |
| accidentally match a valid token. (A possible future @code{flex} feature |
| will be to automatically add rules to eliminate backing up.) |
| |
| It's important to keep in mind that you gain the benefits of eliminating |
| backing up only if you eliminate @emph{every} instance of backing up. |
| Leaving just one means you gain nothing. |
| |
| @emph{Variable} trailing context (where both the leading and trailing |
| parts do not have a fixed length) entails almost the same performance |
| loss as @code{REJECT} (i.e., substantial). So when possible a rule |
| like: |
| |
| @cindex trailing context, variable length |
| @example |
| @verbatim |
| %% |
| mouse|rat/(cat|dog) run(); |
| @end verbatim |
| @end example |
| |
| is better written: |
| |
| @example |
| @verbatim |
| %% |
| mouse/cat|dog run(); |
| rat/cat|dog run(); |
| @end verbatim |
| @end example |
| |
| or as |
| |
| @example |
| @verbatim |
| %% |
| mouse|rat/cat run(); |
| mouse|rat/dog run(); |
| @end verbatim |
| @end example |
| |
| Note that here the special '|' action does @emph{not} provide any |
| savings, and can even make things worse (@pxref{Limitations}). |
| |
| Another area where the user can increase a scanner's performance (and |
| one that's easier to implement) arises from the fact that the longer the |
| tokens matched, the faster the scanner will run. This is because with |
| long tokens the processing of most input characters takes place in the |
| (short) inner scanning loop, and does not often have to go through the |
| additional work of setting up the scanning environment (e.g., |
| @code{yytext}) for the action. Recall the scanner for C comments: |
| |
| @cindex performance optimization, matching longer tokens |
| @example |
| @verbatim |
| %x comment |
| %% |
| int line_num = 1; |
| |
| "/*" BEGIN(comment); |
| |
| <comment>[^*\n]* |
| <comment>"*"+[^*/\n]* |
| <comment>\n ++line_num; |
| <comment>"*"+"/" BEGIN(INITIAL); |
| @end verbatim |
| @end example |
| |
| This could be sped up by writing it as: |
| |
| @example |
| @verbatim |
| %x comment |
| %% |
| int line_num = 1; |
| |
| "/*" BEGIN(comment); |
| |
| <comment>[^*\n]* |
| <comment>[^*\n]*\n ++line_num; |
| <comment>"*"+[^*/\n]* |
| <comment>"*"+[^*/\n]*\n ++line_num; |
| <comment>"*"+"/" BEGIN(INITIAL); |
| @end verbatim |
| @end example |
| |
| Now instead of each newline requiring the processing of another action, |
| recognizing the newlines is distributed over the other rules to keep the |
| matched text as long as possible. Note that @emph{adding} rules does |
| @emph{not} slow down the scanner! The speed of the scanner is |
| independent of the number of rules or (modulo the considerations given |
| at the beginning of this section) how complicated the rules are with |
| regard to operators such as @samp{*} and @samp{|}. |
| |
| @cindex keywords, for performance |
| @cindex performance, using keywords |
| A final example in speeding up a scanner: suppose you want to scan |
| through a file containing identifiers and keywords, one per line |
| and with no other extraneous characters, and recognize all the |
| keywords. A natural first approach is: |
| |
| @cindex performance optimization, recognizing keywords |
| @example |
| @verbatim |
| %% |
| asm | |
| auto | |
| break | |
| ... etc ... |
| volatile | |
| while /* it's a keyword */ |
| |
| .|\n /* it's not a keyword */ |
| @end verbatim |
| @end example |
| |
| To eliminate the back-tracking, introduce a catch-all rule: |
| |
| @example |
| @verbatim |
| %% |
| asm | |
| auto | |
| break | |
| ... etc ... |
| volatile | |
| while /* it's a keyword */ |
| |
| [a-z]+ | |
| .|\n /* it's not a keyword */ |
| @end verbatim |
| @end example |
| |
| Now, if it's guaranteed that there's exactly one word per line, then we |
| can reduce the total number of matches by half by merging in the |
| recognition of newlines with that of the other tokens: |
| |
| @example |
| @verbatim |
| %% |
| asm\n | |
| auto\n | |
| break\n | |
| ... etc ... |
| volatile\n | |
| while\n /* it's a keyword */ |
| |
| [a-z]+\n | |
| .|\n /* it's not a keyword */ |
| @end verbatim |
| @end example |
| |
| One has to be careful here, as we have now reintroduced backing up |
| into the scanner. In particular, while |
| @emph{we} |
| know that there will never be any characters in the input stream |
| other than letters or newlines, |
| @code{flex} |
| can't figure this out, and it will plan for possibly needing to back up |
| when it has scanned a token like @samp{auto} and then the next character |
| is something other than a newline or a letter. Previously it would |
| then just match the @samp{auto} rule and be done, but now it has no @samp{auto} |
| rule, only a @samp{auto\n} rule. To eliminate the possibility of backing up, |
| we could either duplicate all rules but without final newlines, or, |
| since we never expect to encounter such an input and therefore don't |
| care how it's classified, we can introduce one more catch-all rule, this |
| one which doesn't include a newline: |
| |
| @example |
| @verbatim |
| %% |
| asm\n | |
| auto\n | |
| break\n | |
| ... etc ... |
| volatile\n | |
| while\n /* it's a keyword */ |
| |
| [a-z]+\n | |
| [a-z]+ | |
| .|\n /* it's not a keyword */ |
| @end verbatim |
| @end example |
| |
| Compiled with @samp{-Cf}, this is about as fast as one can get a |
| @code{flex} scanner to go for this particular problem. |
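| |
| As a rough illustration, the final set of rules can be embedded in a |
| complete scanner along the following lines. This is only a sketch: it |
| lists three keywords rather than the full set, the counters |
| @code{keywords} and @code{others} and the @code{main} function are |
| hypothetical additions for demonstration, and the input is assumed to |
| consist of lowercase words, one per line. |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %{ |
| #include <stdio.h> |
| |
| /* Hypothetical counters, not part of the original example. */ |
| static long keywords = 0; |
| static long others = 0; |
| %} |
| %% |
| auto\n          | |
| break\n         | |
| while\n         ++keywords;     /* it's a keyword */ |
| |
| [a-z]+\n        | |
| [a-z]+          | |
| .|\n            ++others;       /* it's not a keyword */ |
| %% |
| int main( void ) |
| { |
|     yylex(); |
|     printf( "%ld keywords, %ld others\n", keywords, others ); |
|     return 0; |
| } |
| @end verbatim |
| @end example |
| |
| Every state reached after reading one or more letters is accepting here |
| (through the @samp{[a-z]+} rule), which is why no backing up should be |
| needed. |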
| |
| A final note: @code{flex} is slow when matching @code{NUL}s, |
| particularly when a token contains multiple @code{NUL}s. It's best to |
| write rules which match @emph{short} amounts of text if it's anticipated |
| that the text will often include @code{NUL}s. |
| |
| Another final note regarding performance: as mentioned in |
| @ref{Matching}, dynamically resizing @code{yytext} to accommodate huge |
| tokens is a slow process because it presently requires that the (huge) |
| token be rescanned from the beginning. Thus if performance is vital, |
| you should attempt to match ``large'' quantities of text but not |
| ``huge'' quantities, where the cutoff between the two is at about 8K |
| characters per token. |
| |
| @node Cxx, Reentrant, Performance, Top |
| @chapter Generating C++ Scanners |
| |
| @cindex c++, experimental form of scanner class |
| @cindex experimental form of c++ scanner class |
| @strong{IMPORTANT}: the present form of the scanning class is @emph{experimental} |
| and may change considerably between major releases. |
| |
| @cindex C++ |
| @cindex member functions, C++ |
| @cindex methods, c++ |
| @code{flex} provides two different ways to generate scanners for use |
| with C++. The first way is to simply compile a scanner generated by |
| @code{flex} using a C++ compiler instead of a C compiler. You should |
| not encounter any compilation errors (@pxref{Reporting Bugs}). You can |
| then use C++ code in your rule actions instead of C code. Note that the |
| default input source for your scanner remains @file{yyin}, and default |
| echoing is still done to @file{yyout}. Both of these remain @code{FILE |
| *} variables and not C++ @emph{streams}. |
| |
| You can also use @code{flex} to generate a C++ scanner class, using the |
| @samp{-+} option (or, equivalently, @code{%option c++}), which is |
| automatically specified if the name of the @code{flex} executable ends |
| in a @samp{+}, such as @code{flex++}. When using this option, @code{flex} |
| defaults to generating the scanner to the file @file{lex.yy.cc} instead |
| of @file{lex.yy.c}. The generated scanner includes the header file |
| @file{FlexLexer.h}, which defines the interface to two C++ classes. |
| |
| The first class, |
| @code{FlexLexer}, |
| provides an abstract base class defining the general scanner class |
| interface. It provides the following member functions: |
| |
| @table @code |
| @findex YYText (C++ only) |
| @item const char* YYText() |
| returns the text of the most recently matched token, the equivalent of |
| @code{yytext}. |
| |
| @findex YYLeng (C++ only) |
| @item int YYLeng() |
| returns the length of the most recently matched token, the equivalent of |
| @code{yyleng}. |
| |
| @findex lineno (C++ only) |
| @item int lineno() const |
| returns the current input line number (see @code{%option yylineno}), or |
| @code{1} if @code{%option yylineno} was not used. |
| |
| @findex set_debug (C++ only) |
| @item void set_debug( int flag ) |
| sets the debugging flag for the scanner, equivalent to assigning to |
| @code{yy_flex_debug} (@pxref{Scanner Options}). Note that you must build |
| the scanner using @code{%option debug} to include debugging information |
| in it. |
| |
| @findex debug (C++ only) |
| @item int debug() const |
| returns the current setting of the debugging flag. |
| @end table |
| |
| Also provided are member functions equivalent to |
| @code{yy_switch_to_buffer()}, @code{yy_create_buffer()} (though the |
| first argument is an @code{istream*} object pointer and not a |
| @code{FILE*}), @code{yy_flush_buffer()}, @code{yy_delete_buffer()}, and |
| @code{yyrestart()} (again, the first argument is an @code{istream*} |
| object pointer). |
| |
| @tindex yyFlexLexer (C++ only) |
| @tindex FlexLexer (C++ only) |
| The second class defined in @file{FlexLexer.h} is @code{yyFlexLexer}, |
| which is derived from @code{FlexLexer}. It defines the following |
| additional member functions: |
| |
| @table @code |
| @findex yyFlexLexer constructor (C++ only) |
| @item yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 ) |
| constructs a @code{yyFlexLexer} object using the given streams for input |
| and output. If not specified, the streams default to @code{cin} and |
| @code{cout}, respectively. |
| |
| @findex yylex (C++ version) |
| @item virtual int yylex() |
| performs the same role as @code{yylex()} does for ordinary @code{flex} |
| scanners: it scans the input stream, consuming tokens, until a rule's |
| action returns a value. If you derive a subclass @code{S} from |
| @code{yyFlexLexer} and want to access the member functions and variables |
| of @code{S} inside @code{yylex()}, then you need to use @code{%option |
| yyclass="S"} to inform @code{flex} that you will be using that subclass |
| instead of @code{yyFlexLexer}. In this case, rather than generating |
| @code{yyFlexLexer::yylex()}, @code{flex} generates @code{S::yylex()} |
| (and also generates a dummy @code{yyFlexLexer::yylex()} that calls |
| @code{yyFlexLexer::LexerError()} if called). |
| |
| @findex switch_streams (C++ only) |
| @item virtual void switch_streams(istream* new_in = 0, ostream* new_out = 0) |
| reassigns @code{yyin} to @code{new_in} (if non-null) and @code{yyout} to |
| @code{new_out} (if non-null), deleting the previous input buffer if |
| @code{yyin} is reassigned. |
| |
| @item int yylex( istream* new_in, ostream* new_out = 0 ) |
| first switches the input streams via @code{switch_streams( new_in, |
| new_out )} and then returns the value of @code{yylex()}. |
| @end table |
| |
| In addition, @code{yyFlexLexer} defines the following protected virtual |
| functions which you can redefine in derived classes to tailor the |
| scanner: |
| |
| @table @code |
| @findex LexerInput (C++ only) |
| @item virtual int LexerInput( char* buf, int max_size ) |
| reads up to @code{max_size} characters into @code{buf} and returns the |
| number of characters read. To indicate end-of-input, return 0 |
| characters. Note that @code{interactive} scanners (see the @samp{-B} |
| and @samp{-I} flags in @ref{Scanner Options}) define the macro |
| @code{YY_INTERACTIVE}. If you redefine @code{LexerInput()} and need to |
| take different actions depending on whether or not the scanner might be |
| scanning an interactive input source, you can test for the presence of |
| this name via @code{#ifdef} statements. |
| |
| @findex LexerOutput (C++ only) |
| @item virtual void LexerOutput( const char* buf, int size ) |
| writes out @code{size} characters from the buffer @code{buf}, which, while |
| @code{NUL}-terminated, may also contain internal @code{NUL}s if the |
| scanner's rules can match text with @code{NUL}s in them. |
| |
| @cindex error reporting, in C++ |
| @findex LexerError (C++ only) |
| @item virtual void LexerError( const char* msg ) |
| reports a fatal error message. The default version of this function |
| writes the message to the stream @code{cerr} and exits. |
| @end table |
| |
| Note that a @code{yyFlexLexer} object contains its @emph{entire} |
| scanning state. Thus you can use such objects to create reentrant |
| scanners, but see also @ref{Reentrant}. You can instantiate multiple |
| instances of the same @code{yyFlexLexer} class, and you can also combine |
| multiple C++ scanner classes together in the same program using the |
| @samp{-P} option discussed above. |
| |
| Finally, note that the @code{%array} feature is not available to C++ |
| scanner classes; you must use @code{%pointer} (the default). |
| |
| Here is an example of a simple C++ scanner: |
| |
| @cindex C++ scanners, use of |
| @example |
| @verbatim |
| // An example of using the flex C++ scanner class. |
| |
| %{ |
| #include <iostream> |
| using namespace std; |
| int mylineno = 0; |
| %} |
| |
| %option noyywrap |
| |
| string \"[^\n"]+\" |
| |
| ws [ \t]+ |
| |
| alpha [A-Za-z] |
| dig [0-9] |
| name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])* |
| num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)? |
| num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)? |
| number {num1}|{num2} |
| |
| %% |
| |
| {ws} /* skip blanks and tabs */ |
| |
| "/*" { |
| int c; |
| |
| while((c = yyinput()) != 0) |
| { |
| if(c == '\n') |
| ++mylineno; |
| |
| else if(c == '*') |
| { |
| if((c = yyinput()) == '/') |
| break; |
| else |
| unput(c); |
| } |
| } |
| } |
| |
| {number} cout << "number " << YYText() << '\n'; |
| |
| \n mylineno++; |
| |
| {name} cout << "name " << YYText() << '\n'; |
| |
| {string} cout << "string " << YYText() << '\n'; |
| |
| %% |
| |
| int main( int /* argc */, char** /* argv */ ) |
| { |
| FlexLexer* lexer = new yyFlexLexer; |
| while(lexer->yylex() != 0) |
| ; |
| return 0; |
| } |
| @end verbatim |
| @end example |
| |
| @cindex C++, multiple different scanners |
| If you want to create multiple (different) lexer classes, you use the |
| @samp{-P} flag (or the @code{prefix=} option) to rename each |
| @code{yyFlexLexer} to some other @samp{xxFlexLexer}. You then can |
| include @file{<FlexLexer.h>} in your other sources once per lexer class, |
| first renaming @code{yyFlexLexer} as follows: |
| |
| @cindex include files, with C++ |
| @cindex header files, with C++ |
| @cindex C++ scanners, including multiple scanners |
| @example |
| @verbatim |
| #undef yyFlexLexer |
| #define yyFlexLexer xxFlexLexer |
| #include <FlexLexer.h> |
| |
| #undef yyFlexLexer |
| #define yyFlexLexer zzFlexLexer |
| #include <FlexLexer.h> |
| @end verbatim |
| @end example |
| |
| if, for example, you used @code{%option prefix="xx"} for one of your |
| scanners and @code{%option prefix="zz"} for the other. |
| |
| @node Reentrant, Lex and Posix, Cxx, Top |
| @chapter Reentrant C Scanners |
| |
| @cindex reentrant, explanation |
| @code{flex} has the ability to generate a reentrant C scanner. This is |
| accomplished by specifying @code{%option reentrant} (@samp{-R}). The generated |
| scanner is both portable and safe to use in one or more separate threads of |
| control. The most common use for reentrant scanners is from within |
| multi-threaded applications. Any thread may create and execute a reentrant |
| @code{flex} scanner without the need for synchronization with other threads. |
| |
| @menu |
| * Reentrant Uses:: |
| * Reentrant Overview:: |
| * Reentrant Example:: |
| * Reentrant Detail:: |
| * Reentrant Functions:: |
| @end menu |
| |
| @node Reentrant Uses, Reentrant Overview, Reentrant, Reentrant |
| @section Uses for Reentrant Scanners |
| |
| However, there are other uses for a reentrant scanner. For example, you |
| could scan two or more files simultaneously to implement a @code{diff} at |
| the token level (i.e., instead of at the character level): |
| |
| @cindex reentrant scanners, multiple interleaved scanners |
| @example |
| @verbatim |
| /* Example of maintaining more than one active scanner. */ |
| |
| int tok1, tok2;   /* declared outside the loop so the condition can see them */ |
| |
| do { |
|     tok1 = yylex( scanner_1 ); |
|     tok2 = yylex( scanner_2 ); |
| |
|     if( tok1 != tok2 ) |
|         printf("Files are different.\n"); |
| |
| } while ( tok1 && tok2 ); |
| @end verbatim |
| @end example |
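| |
| The fragment above can be expanded into a complete program roughly as |
| follows. This is a minimal sketch under stated assumptions rather than a |
| definitive implementation: the token codes 1 and 2, the rule set, and |
| the file name @file{tokdiff.l} are hypothetical, and the comparison |
| stops at the first difference. |
| |
| @example |
| @verbatim |
| /* tokdiff.l: a sketch of a token-level file comparison. */ |
| %option reentrant noyywrap |
| %{ |
| #include <stdio.h> |
| #include <string.h> |
| %} |
| %% |
| [ \t\n]+            /* whitespace separates tokens and is skipped */ |
| [A-Za-z_0-9]+       return 1;   /* hypothetical token code: word */ |
| .                   return 2;   /* hypothetical token code: other */ |
| %% |
| int main( int argc, char *argv[] ) |
| { |
|     yyscan_t scanner_1, scanner_2; |
|     FILE *in_1, *in_2; |
|     int tok1, tok2, differ = 0; |
| |
|     if ( argc != 3 |
|          || ( in_1 = fopen( argv[1], "r" ) ) == NULL |
|          || ( in_2 = fopen( argv[2], "r" ) ) == NULL ) { |
|         fprintf( stderr, "usage: tokdiff FILE1 FILE2\n" ); |
|         return 2; |
|     } |
| |
|     if ( yylex_init( &scanner_1 ) != 0 || yylex_init( &scanner_2 ) != 0 ) { |
|         perror( "yylex_init" ); |
|         return 2; |
|     } |
|     yyset_in( in_1, scanner_1 ); |
|     yyset_in( in_2, scanner_2 ); |
| |
|     do { |
|         tok1 = yylex( scanner_1 ); |
|         tok2 = yylex( scanner_2 ); |
| |
|         /* Tokens differ if their codes differ or, when both scanners |
|            produced a token, if the matched text differs. */ |
|         if ( tok1 != tok2 |
|              || ( tok1 && strcmp( yyget_text( scanner_1 ), |
|                                   yyget_text( scanner_2 ) ) != 0 ) ) { |
|             differ = 1; |
|             break; |
|         } |
|     } while ( tok1 && tok2 ); |
| |
|     puts( differ ? "Files are different." : "Files are the same." ); |
| |
|     yylex_destroy( scanner_1 ); |
|     yylex_destroy( scanner_2 ); |
|     fclose( in_1 ); |
|     fclose( in_2 ); |
|     return differ; |
| } |
| @end verbatim |
| @end example |
| |
| Each scanner keeps its own buffer and position, so the two calls to |
| @code{yylex} can be interleaved freely. |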
| |
| Another use for a reentrant scanner is recursion. |
| (Note that a recursive scanner can also be created using a non-reentrant scanner and |
| buffer states. @xref{Multiple Input Buffers}.) |
| |
| The following crude scanner supports the @samp{eval} command by invoking |
| another instance of itself. |
| |
| @cindex reentrant scanners, recursive invocation |
| @example |
| @verbatim |
| /* Example of recursive invocation. */ |
| |
| %option reentrant |
| |
| %% |
| "eval(".+")" { |
| yyscan_t scanner; |
| YY_BUFFER_STATE buf; |
| |
| yylex_init( &scanner ); |
| yytext[yyleng-1] = ' '; |
| |
| buf = yy_scan_string( yytext + 5, scanner ); |
| yylex( scanner ); |
| |
| yy_delete_buffer(buf,scanner); |
| yylex_destroy( scanner ); |
| } |
| ... |
| %% |
| @end verbatim |
| @end example |
| |
| @node Reentrant Overview, Reentrant Example, Reentrant Uses, Reentrant |
| @section An Overview of the Reentrant API |
| |
| @cindex reentrant, API explanation |
| The API for reentrant scanners is different than for non-reentrant |
| scanners. Here is a quick overview of the API: |
| |
| @itemize |
| @item |
| @code{%option reentrant} must be specified. |
| |
| @item |
| All functions take one additional argument: @code{yyscanner} |
| |
| @item |
| All global variables are replaced by their macro equivalents. |
| (We tell you this because it may be important to you during debugging.) |
| |
| @item |
| @code{yylex_init} and @code{yylex_destroy} must be called before and |
| after @code{yylex}, respectively. |
| |
| @item |
| Accessor methods (get/set functions) provide access to common |
| @code{flex} variables. |
| |
| @item |
| User-specific data can be stored in @code{yyextra}. |
| @end itemize |
| |
| @node Reentrant Example, Reentrant Detail, Reentrant Overview, Reentrant |
| @section Reentrant Example |
| |
| First, an example of a reentrant scanner: |
| @cindex reentrant, example of |
| @example |
| @verbatim |
| /* This scanner prints "//" comments. */ |
| |
| %option reentrant stack noyywrap |
| %x COMMENT |
| |
| %% |
| |
| "//" yy_push_state( COMMENT, yyscanner); |
| .|\n |
| |
| <COMMENT>\n yy_pop_state( yyscanner ); |
| <COMMENT>[^\n]+ fprintf( yyout, "%s\n", yytext); |
| |
| %% |
| |
| int main ( int argc, char * argv[] ) |
| { |
| yyscan_t scanner; |
| |
| yylex_init ( &scanner ); |
| yylex ( scanner ); |
| yylex_destroy ( scanner ); |
| return 0; |
| } |
| @end verbatim |
| @end example |
| |
| @node Reentrant Detail, Reentrant Functions, Reentrant Example, Reentrant |
| @section The Reentrant API in Detail |
| |
| Here are the things you need to do or know to use the reentrant C API of |
| @code{flex}. |
| |
| @menu |
| * Specify Reentrant:: |
| * Extra Reentrant Argument:: |
| * Global Replacement:: |
| * Init and Destroy Functions:: |
| * Accessor Methods:: |
| * Extra Data:: |
| * About yyscan_t:: |
| @end menu |
| |
| @node Specify Reentrant, Extra Reentrant Argument, Reentrant Detail, Reentrant Detail |
| @subsection Declaring a Scanner As Reentrant |
| |
| @code{%option reentrant} (@samp{--reentrant}) must be specified. |
| |
| Notice that @code{%option reentrant} is specified in the above example |
| (@pxref{Reentrant Example}). Had this option not been specified, |
| @code{flex} would have happily generated a non-reentrant scanner without |
| complaining. You may explicitly specify @code{%option noreentrant}, if |
| you do @emph{not} want a reentrant scanner, although it is not |
| necessary. The default is to generate a non-reentrant scanner. |
| |
| @node Extra Reentrant Argument, Global Replacement, Specify Reentrant, Reentrant Detail |
| @subsection The Extra Argument |
| |
| @cindex reentrant, calling functions |
| @vindex yyscanner (reentrant only) |
| All functions take one additional argument: @code{yyscanner}. |
| |
| Notice that the calls to @code{yy_push_state} and @code{yy_pop_state} |
| both have an argument, @code{yyscanner}, that is not present in a |
| non-reentrant scanner. Here are the declarations of |
| @code{yy_push_state} and @code{yy_pop_state} in the reentrant scanner: |
| |
| @example |
| @verbatim |
| static void yy_push_state ( int new_state , yyscan_t yyscanner ) ; |
| static void yy_pop_state ( yyscan_t yyscanner ) ; |
| @end verbatim |
| @end example |
| |
| Notice that the argument @code{yyscanner} appears in the declaration of |
| both functions. In fact, all @code{flex} functions in a reentrant |
| scanner have this additional argument. It is always the last argument |
| in the argument list, it is always of type @code{yyscan_t} (which is |
| typedef'd to @code{void *}) and it is |
| always named @code{yyscanner}. As you may have guessed, |
| @code{yyscanner} is a pointer to an opaque data structure encapsulating |
| the current state of the scanner. For a list of function declarations, |
| see @ref{Reentrant Functions}. Note that preprocessor macros, such as |
| @code{BEGIN}, @code{ECHO}, and @code{REJECT}, do not take this |
| additional argument. |
| |
| @node Global Replacement, Init and Destroy Functions, Extra Reentrant Argument, Reentrant Detail |
| @subsection Global Variables Replaced By Macros |
| |
| @cindex reentrant, accessing flex variables |
| All global variables in traditional flex have been replaced by macro equivalents. |
| |
| Note that in the above example, @code{yyout} and @code{yytext} are |
| not plain variables. These are macros that will expand to their equivalent lvalue. |
| All of the familiar @code{flex} globals have been replaced by their macro |
| equivalents. In particular, @code{yytext}, @code{yyleng}, @code{yylineno}, |
| @code{yyin}, @code{yyout}, @code{yyextra}, @code{yylval}, and @code{yylloc} |
| are macros. You may safely use these macros in actions as if they were plain |
| variables. We only tell you this so you don't expect to link to these variables |
| externally. Currently, each macro expands to a member of an internal struct, e.g., |
| |
| @example |
| @verbatim |
| #define yytext (((struct yyguts_t*)yyscanner)->yytext_r) |
| @end verbatim |
| @end example |
| |
| One important thing to remember about @code{yytext} and friends is that, |
| because @code{yytext} is not a global variable in a reentrant scanner, |
| you cannot access it directly from outside an action or from other |
| functions. You must use an accessor method, e.g., @code{yyget_text}, to |
| accomplish this (@pxref{Accessor Methods}). |
| |
| @node Init and Destroy Functions, Accessor Methods, Global Replacement, Reentrant Detail |
| @subsection Init and Destroy Functions |
| |
| @cindex memory, considerations for reentrant scanners |
| @cindex reentrant, initialization |
| @findex yylex_init |
| @findex yylex_destroy |
| |
| @code{yylex_init} and @code{yylex_destroy} must be called before and |
| after @code{yylex}, respectively. |
| |
| @example |
| @verbatim |
| int yylex_init ( yyscan_t * ptr_yy_globals ) ; |
| int yylex_init_extra ( YY_EXTRA_TYPE user_defined, yyscan_t * ptr_yy_globals ) ; |
| int yylex ( yyscan_t yyscanner ) ; |
| int yylex_destroy ( yyscan_t yyscanner ) ; |
| @end verbatim |
| @end example |
| |
| The function @code{yylex_init} must be called before calling any other |
| function. The argument to @code{yylex_init} is the address of an |
| uninitialized pointer to be filled in by @code{yylex_init}, overwriting |
| any previous contents. The function @code{yylex_init_extra} may be used |
| instead, taking as its first argument a variable of type @code{YY_EXTRA_TYPE}. |
| @xref{Extra Data}, for more details. |
| |
| The value stored in @code{ptr_yy_globals} should |
| thereafter be passed to @code{yylex} and @code{yylex_destroy}. Flex |
| does not save the argument passed to @code{yylex_init}, so it is safe to |
| pass the address of a local pointer to @code{yylex_init} so long as it remains |
| in scope for the duration of all calls to the scanner, up to and including |
| the call to @code{yylex_destroy}. |
| |
| The function |
| @code{yylex} should be familiar to you by now. The reentrant version |
| takes one argument, which is the value returned (via an argument) by |
| @code{yylex_init}. Otherwise, it behaves the same as the non-reentrant |
| version of @code{yylex}. |
| |
| Both @code{yylex_init} and @code{yylex_init_extra} return 0 (zero) on success, |
| or non-zero on failure, in which case @code{errno} is set to one of the following values: |
| |
| @itemize |
| @item ENOMEM |
| Memory allocation error. @xref{memory-management}. |
| @item EINVAL |
| Invalid argument. |
| @end itemize |
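| |
| A caller can test the return value and consult @code{errno} roughly as |
| shown here. This is a sketch intended for the user code section of a |
| reentrant scanner (so that @code{yyscan_t} and @code{yylex_init} are |
| already declared); the helper name @code{create_scanner} is hypothetical. |
| |
| @example |
| @verbatim |
| #include <errno.h> |
| #include <stdio.h> |
| #include <string.h> |
| |
| /* Hypothetical helper: returns 0 on success, -1 on failure. */ |
| static int create_scanner( yyscan_t *scanner ) |
| { |
|     if ( yylex_init( scanner ) != 0 ) { |
|         if ( errno == ENOMEM ) |
|             fprintf( stderr, "yylex_init: out of memory\n" ); |
|         else |
|             fprintf( stderr, "yylex_init: %s\n", strerror( errno ) ); |
|         return -1; |
|     } |
|     return 0; |
| } |
| @end verbatim |
| @end example |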
| |
| |
| The function @code{yylex_destroy} should be |
| called to free resources used by the scanner. After @code{yylex_destroy} |
| is called, the contents of @code{yyscanner} should not be used. Of |
| course, there is no need to destroy a scanner if you plan to reuse it. |
| A @code{flex} scanner (both reentrant and non-reentrant) may be |
| restarted by calling @code{yyrestart}. |
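| |
| As an illustration of reuse, one scanner can be pointed at several input |
| files in turn. The following is a sketch for the user code section of a |
| reentrant scanner; the helper name @code{scan_files} is hypothetical, and |
| the loop body only marks where tokens would be handled. |
| |
| @example |
| @verbatim |
| /* Hypothetical helper: scans each named file in turn with one scanner. */ |
| static int scan_files( int count, char *names[] ) |
| { |
|     yyscan_t scanner; |
|     int i; |
| |
|     if ( yylex_init( &scanner ) != 0 ) |
|         return -1; |
| |
|     for ( i = 0; i < count; ++i ) { |
|         FILE *in = fopen( names[i], "r" ); |
| |
|         if ( in == NULL ) |
|             continue;               /* skip files that cannot be opened */ |
| |
|         yyrestart( in, scanner );   /* reuse the scanner on the new file */ |
|         while ( yylex( scanner ) > 0 ) |
|             ;                       /* tokens would be processed here */ |
|         fclose( in ); |
|     } |
| |
|     yylex_destroy( scanner );       /* destroy only once, when finished */ |
|     return 0; |
| } |
| @end verbatim |
| @end example |
| |
| Keep in mind that @code{yyrestart} does not reset the start condition to |
| @code{INITIAL}. |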
| |
| Below is an example of a program that creates a scanner, uses it, then destroys |
| it when done: |
| |
| @example |
| @verbatim |
| int main () |
| { |
| yyscan_t scanner; |
| int tok; |
| |
| yylex_init(&scanner); |
| |
| while ((tok=yylex(scanner)) > 0) |
| printf("tok=%d yytext=%s\n", tok, yyget_text(scanner)); |
| |
| yylex_destroy(scanner); |
| return 0; |
| } |
| @end verbatim |
| @end example |
| |
| @node Accessor Methods, Extra Data, Init and Destroy Functions, Reentrant Detail |
| @subsection Accessing Variables with Reentrant Scanners |
| |
| @cindex reentrant, accessor functions |
| Accessor methods (get/set functions) provide access to common |
| @code{flex} variables. |
| |
| Many scanners that you build will be part of a larger project. Portions |
| of your project will need access to @code{flex} values, such as |
| @code{yytext}. In a non-reentrant scanner, these values are global, so |
| there is no problem accessing them. However, in a reentrant scanner, there are no |
| global @code{flex} values. You can not access them directly. Instead, |
| you must access @code{flex} values using accessor methods (get/set |
| functions). Each accessor method is named @code{yyget_NAME} or |
| @code{yyset_NAME}, where @code{NAME} is the name of the @code{flex} |
| variable you want. For example: |
| |
| @cindex accessor functions, use of |
| @example |
| @verbatim |
| /* Replace the last character of yytext with a NUL. */ |
| void chop ( yyscan_t scanner ) |
| { |
| int len = yyget_leng( scanner ); |
| yyget_text( scanner )[len - 1] = '\0'; |
| } |
| @end verbatim |
| @end example |
| |
| The above code may be called from within an action like this: |
| |
| @example |
| @verbatim |
| %% |
| .+\n { chop( yyscanner );} |
| @end verbatim |
| @end example |
| |
| You may find that @code{%option header-file} is particularly useful for generating |
| prototypes of all the accessor functions. @xref{option-header}. |
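| |
| For instance, assuming a scanner specification that contains |
| @code{%option reentrant} and @code{%option header-file="scanner.h"}, a |
| separate C source file could use the scanner as sketched below. The file |
| names @file{scanner.h} and @file{main.c} are only examples. |
| |
| @example |
| @verbatim |
| /* main.c: hypothetical client of a reentrant scanner whose prototypes |
|    were written to scanner.h by %option header-file="scanner.h". */ |
| #include <stdio.h> |
| #include "scanner.h" |
| |
| int main( void ) |
| { |
|     yyscan_t scanner; |
|     int tok; |
| |
|     if ( yylex_init( &scanner ) != 0 ) { |
|         perror( "yylex_init" ); |
|         return 1; |
|     } |
| |
|     /* Outside the scanner, values are reached through accessors only. */ |
|     while ( ( tok = yylex( scanner ) ) > 0 ) |
|         printf( "tok=%d text=%s leng=%d\n", tok, |
|                 yyget_text( scanner ), (int) yyget_leng( scanner ) ); |
| |
|     yylex_destroy( scanner ); |
|     return 0; |
| } |
| @end verbatim |
| @end example |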
| |
| @node Extra Data, About yyscan_t, Accessor Methods, Reentrant Detail |
| @subsection Extra Data |
| |
| @cindex reentrant, extra data |
| @vindex yyextra |
| User-specific data can be stored in @code{yyextra}. |
| |
| In a reentrant scanner, it is unwise to use global variables to |
| communicate with or maintain state between different pieces of your program. |
| However, you may need access to external data or invoke external functions |
| from within the scanner actions. |
| Likewise, you may need to pass information to your scanner |
| (e.g., open file descriptors, or database connections). |
| In a non-reentrant scanner, the only way to do this would be through the |
| use of global variables. |
| @code{Flex} allows you to store arbitrary, ``extra'' data in a scanner. |
| This data is accessible through the accessor methods |
| @code{yyget_extra} and @code{yyset_extra} |
| from outside the scanner, and through the shortcut macro |
| @code{yyextra} |
| from within the scanner itself. They are defined as follows: |
| |
| @tindex YY_EXTRA_TYPE (reentrant only) |
| @findex yyget_extra |
| @findex yyset_extra |
| @example |
| @verbatim |
| #define YY_EXTRA_TYPE void* |
| YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); |
| void yyset_extra ( YY_EXTRA_TYPE arbitrary_data , yyscan_t scanner); |
| @end verbatim |
| @end example |
| |
| In addition, an extra form of @code{yylex_init} is provided, |
| @code{yylex_init_extra}. This function is provided so that the |
| @code{yyextra} value can be accessed from within the very first call to |
| @code{yyalloc}, which allocates the scanner itself. |
| |
| By default, @code{YY_EXTRA_TYPE} is defined as type @code{void *}. You |
| may redefine this type using @code{%option extra-type="your_type"} in |
| the scanner: |
| |
| @cindex YY_EXTRA_TYPE, defining your own type |
| @example |
| @verbatim |
| /* An example of overriding YY_EXTRA_TYPE. */ |
| %{ |
| #include <sys/stat.h> |
| #include <unistd.h> |
| %} |
| %option reentrant |
| %option extra-type="struct stat *" |
| %% |
| |
| __filesize__ printf( "%ld", (long) yyextra->st_size ); |
| __lastmod__ printf( "%ld", (long) yyextra->st_mtime ); |
| %% |
| void scan_file( char* filename ) |
| { |
| yyscan_t scanner; |
| struct stat buf; |
| FILE *in; |
| |
| in = fopen( filename, "r" ); |
| stat( filename, &buf ); |
| |
| yylex_init_extra( &buf, &scanner ); |
| yyset_in( in, scanner ); |
| yylex( scanner ); |
| yylex_destroy( scanner ); |
| |
| fclose( in ); |
| } |
| @end verbatim |
| @end example |
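| |
| The extra data can equally be a structure of your own that the actions |
| update. The following sketch counts words and lines per scanner instance |
| instead of in global variables; the structure @code{struct counts} and |
| its members are hypothetical, and input is read from @code{stdin}. |
| |
| @example |
| @verbatim |
| %{ |
| #include <stdio.h> |
| |
| /* Hypothetical per-scanner state, reached through yyextra. */ |
| struct counts { long words; long lines; }; |
| %} |
| %option reentrant noyywrap |
| %option extra-type="struct counts *" |
| %% |
| [^ \t\n]+       yyextra->words++; |
| \n              yyextra->lines++; |
| [ \t]+          /* skip blanks and tabs */ |
| %% |
| int main( void ) |
| { |
|     struct counts c = { 0, 0 }; |
|     yyscan_t scanner; |
| |
|     if ( yylex_init_extra( &c, &scanner ) != 0 ) { |
|         perror( "yylex_init_extra" ); |
|         return 1; |
|     } |
| |
|     yylex( scanner ); |
|     yylex_destroy( scanner ); |
| |
|     printf( "%ld words, %ld lines\n", c.words, c.lines ); |
|     return 0; |
| } |
| @end verbatim |
| @end example |
| |
| Because the counters live in the caller's structure, several such |
| scanners can run side by side without sharing any state. |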
| |
| |
| @node About yyscan_t, , Extra Data, Reentrant Detail |
| @subsection About yyscan_t |
| |
| @tindex yyscan_t (reentrant only) |
| @code{yyscan_t} is defined as: |
| |
| @example |
| @verbatim |
| typedef void* yyscan_t; |
| @end verbatim |
| @end example |
| |
| It is initialized by @code{yylex_init()} to point to |
| an internal structure. You should never access this value |
| directly. In particular, you should never attempt to free it; |
| use @code{yylex_destroy()} instead. |
| |
| @node Reentrant Functions, , Reentrant Detail, Reentrant |
| @section Functions and Macros Available in Reentrant C Scanners |
| |
| The following functions are available in a reentrant scanner: |
| |
| @findex yyget_text |
| @findex yyget_leng |
| @findex yyget_in |
| @findex yyget_out |
| @findex yyget_lineno |
| @findex yyset_in |
| @findex yyset_out |
| @findex yyset_lineno |
| @findex yyget_debug |
| @findex yyset_debug |
| @findex yyget_extra |
| @findex yyset_extra |
| |
| @example |
| @verbatim |
| char *yyget_text ( yyscan_t scanner ); |
| int yyget_leng ( yyscan_t scanner ); |
| FILE *yyget_in ( yyscan_t scanner ); |
| FILE *yyget_out ( yyscan_t scanner ); |
| int yyget_lineno ( yyscan_t scanner ); |
| YY_EXTRA_TYPE yyget_extra ( yyscan_t scanner ); |
| int yyget_debug ( yyscan_t scanner ); |
| |
| void yyset_debug ( int flag, yyscan_t scanner ); |
| void yyset_in ( FILE * in_str , yyscan_t scanner ); |
| void yyset_out ( FILE * out_str , yyscan_t scanner ); |
| void yyset_lineno ( int line_number , yyscan_t scanner ); |
| void yyset_extra ( YY_EXTRA_TYPE user_defined , yyscan_t scanner ); |
| @end verbatim |
| @end example |
| |
| There are no ``set'' functions for @code{yytext} and @code{yyleng}. This is intentional. |
| |
| The following macro shortcuts are available in actions in a reentrant |
| scanner: |
| |
| @example |
| @verbatim |
| yytext |
| yyleng |
| yyin |
| yyout |
| yylineno |
| yyextra |
| yy_flex_debug |
| @end verbatim |
| @end example |
| |
| @cindex yylineno, in a reentrant scanner |
| In a reentrant C scanner, support for @code{yylineno} is always present |
| (i.e., you may access @code{yylineno}), but the value is never modified by |
| @code{flex} unless @code{%option yylineno} is enabled. This is to allow |
| the user to maintain the line count independently of @code{flex}. |
| |
| @anchor{bison-functions} |
| The following functions and macros are made available when @code{%option |
| bison-bridge} (@samp{--bison-bridge}) is specified: |
| |
| @example |
| @verbatim |
| YYSTYPE * yyget_lval ( yyscan_t scanner ); |
| void yyset_lval ( YYSTYPE * yylvalp , yyscan_t scanner ); |
| yylval |
| @end verbatim |
| @end example |
| |
| The following functions and macros are made available |
| when @code{%option bison-locations} (@samp{--bison-locations}) is specified: |
| |
| @example |
| @verbatim |
| YYLTYPE *yyget_lloc ( yyscan_t scanner ); |
| void yyset_lloc ( YYLTYPE * yyllocp , yyscan_t scanner ); |
| yylloc |
| @end verbatim |
| @end example |
| |
| Support for @code{yylval} assumes that @code{YYSTYPE} is a valid type. Support for |
| @code{yylloc} assumes that @code{YYLTYPE} is a valid type. Typically, these types are |
| generated by @code{bison}, and are included in section 1 of the @code{flex} |
| input. |
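| |
| The following sketch shows how @code{yylval} might be used in a scanner |
| built with @code{%option bison-bridge}. It is an illustration only: the |
| @code{YYSTYPE} union and the @code{NUMBER} token code are stand-ins for |
| declarations that would normally come from a @code{bison}-generated |
| header, and the @code{main} function takes the place of a parser. |
| |
| @example |
| @verbatim |
| %{ |
| #include <stdio.h> |
| #include <stdlib.h> |
| |
| /* Stand-ins for declarations that a bison-generated header would |
|    normally provide; they are assumptions made for this sketch. */ |
| typedef union { int ival; } YYSTYPE; |
| enum { NUMBER = 258 }; |
| %} |
| %option reentrant bison-bridge noyywrap |
| %% |
| [0-9]+      { yylval->ival = atoi( yytext ); return NUMBER; } |
| .|\n        /* ignore everything else */ |
| %% |
| int main( void ) |
| { |
|     yyscan_t scanner; |
|     YYSTYPE value; |
| |
|     if ( yylex_init( &scanner ) != 0 ) |
|         return 1; |
| |
|     /* With bison-bridge, yylex takes a pointer to the semantic value. */ |
|     while ( yylex( &value, scanner ) == NUMBER ) |
|         printf( "number %d\n", value.ival ); |
| |
|     yylex_destroy( scanner ); |
|     return 0; |
| } |
| @end verbatim |
| @end example |
| |
| Note that @code{yylval} is used as a pointer inside the actions. |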
| |
| @node Lex and Posix, Memory Management, Reentrant, Top |
| @chapter Incompatibilities with Lex and Posix |
| |
| @cindex POSIX and lex |
| @cindex lex (traditional) and POSIX |
| |
| @code{flex} is a rewrite of the AT&T Unix @emph{lex} tool (the two |
| implementations do not share any code, though), with some extensions and |
| incompatibilities, both of which are of concern to those who wish to |
| write scanners acceptable to both implementations. @code{flex} is fully |
| compliant with the POSIX @code{lex} specification, except that when |
| using @code{%pointer} (the default), a call to @code{unput()} destroys |
| the contents of @code{yytext}, which is counter to the POSIX |
| specification. In this section we discuss all of the known areas of |
| incompatibility between @code{flex}, AT&T @code{lex}, and the POSIX |
| specification. @code{flex}'s @samp{-l} option turns on maximum |
| compatibility with the original AT&T @code{lex} implementation, at the |
| cost of a major loss in the generated scanner's performance. We note |
| below which incompatibilities can be overcome using the @samp{-l} |
| option. @code{flex} is fully compatible with @code{lex} with the |
| following exceptions: |
| |
| @itemize |
| @item |
| The undocumented @code{lex} scanner internal variable @code{yylineno} is |
| not supported unless @samp{-l} or @code{%option yylineno} is used. |
| |
| @item |
| @code{yylineno} should be maintained on a per-buffer basis, rather than |
| a per-scanner (single global variable) basis. |
| |
| @item |
| @code{yylineno} is not part of the POSIX specification. |
| |
| @item |
| The @code{input()} routine is not redefinable, though it may be called |
| to read characters following whatever has been matched by a rule. If |
| @code{input()} encounters an end-of-file the normal @code{yywrap()} |
| processing is done. A ``real'' end-of-file is returned by |
| @code{input()} as @code{EOF}. |
| |
| @item |
| Input is instead controlled by defining the @code{YY_INPUT()} macro. |
| |
| @item |
| The @code{flex} restriction that @code{input()} cannot be redefined is |
| in accordance with the POSIX specification, which simply does not |
| specify any way of controlling the scanner's input other than by making |
| an initial assignment to @file{yyin}. |
| |
| @item |
| The @code{unput()} routine is not redefinable. This restriction is in |
| accordance with POSIX. |
| |
| @item |
| @code{flex} scanners are not as reentrant as @code{lex} scanners. In |
| particular, if you have an interactive scanner and an interrupt handler |
| which long-jumps out of the scanner, and the scanner is subsequently |
| called again, you may get the following message: |
| |
| @cindex error messages, end of buffer missed |
| @example |
| @verbatim |
| fatal flex scanner internal error--end of buffer missed |
| @end verbatim |
| @end example |
| |
| To reenter the scanner, first use: |
| |
| @cindex restarting the scanner |
| @example |
| @verbatim |
| yyrestart( yyin ); |
| @end verbatim |
| @end example |
| |
| Note that this call will throw away any buffered input; usually this |
| isn't a problem with an interactive scanner. @xref{Reentrant}, for |
| @code{flex}'s reentrant API. |
| |
| @item |
| Also note that @code{flex} C++ scanner classes |
| @emph{are} |
| reentrant, so if using C++ is an option for you, you should use |
| them instead. @xref{Cxx}, and @ref{Reentrant} for details. |
| |
| @item |
| @code{output()} is not supported. Output from the @b{ECHO} macro is |
| done to the file-pointer @code{yyout} (default @file{stdout}). |
| |
| @item |
| @code{output()} is not part of the POSIX specification. |
| |
| @item |
| @code{lex} does not support exclusive start conditions (%x), though they |
| are in the POSIX specification. |
| |
| @item |
| When definitions are expanded, @code{flex} encloses them in parentheses. |
| With @code{lex}, the following: |
| |
| @cindex name definitions, not POSIX |
| @example |
| @verbatim |
| NAME [A-Z][A-Z0-9]* |
| %% |
| foo{NAME}? printf( "Found it\n" ); |
| %% |
| @end verbatim |
| @end example |
| |
| will not match the string @samp{foo} because when the macro is expanded |
| the rule is equivalent to @samp{foo[A-Z][A-Z0-9]*?} and the precedence |
| is such that the @samp{?} is associated with @samp{[A-Z0-9]*}. With |
| @code{flex}, the rule will be expanded to @samp{foo([A-Z][A-Z0-9]*)?} |
| and so the string @samp{foo} will match. |
| |
| @item |
| Note that if the definition begins with @samp{^} or ends with @samp{$} |
| then it is @emph{not} expanded with parentheses, to allow these |
| operators to appear in definitions without losing their special |
| meanings. But the @samp{<s>}, @samp{/}, and @code{<<EOF>>} operators |
| cannot be used in a @code{flex} definition. |
| |
| @item |
| Using @samp{-l} results in the @code{lex} behavior of no parentheses |
| around the definition. |
| |
| @item |
| The POSIX specification is that the definition be enclosed in parentheses. |
| |
| @item |
| Some implementations of @code{lex} allow a rule's action to begin on a |
| separate line, if the rule's pattern has trailing whitespace: |
| |
| @cindex patterns and actions on different lines |
| @example |
| @verbatim |
| %% |
| foo|bar<space here> |
| { foobar_action();} |
| @end verbatim |
| @end example |
| |
| @code{flex} does not support this feature. |
| |
| @item |
| The @code{lex} @code{%r} (generate a Ratfor scanner) option is not |
| supported. It is not part of the POSIX specification. |
| |
| @item |
| After a call to @code{unput()}, @emph{yytext} is undefined until the |
| next token is matched, unless the scanner was built using @code{%array}. |
| This is not the case with @code{lex} or the POSIX specification. The |
| @samp{-l} option does away with this incompatibility. |
| |
| @item |
| The precedence of the @samp{@{,@}} (numeric range) operator is |
| different. The AT&T and POSIX specifications of @code{lex} |
| interpret @samp{abc@{1,3@}} as ``match one, two, |
| or three occurrences of @samp{abc}'', whereas @code{flex} interprets it |
| as ``match @samp{ab} followed by one, two, or three occurrences of |
| @samp{c}''. The @samp{-l} and @samp{--posix} options do away with this |
| incompatibility. |
| |
| @item |
| The precedence of the @samp{^} operator is different. @code{lex} |
| interprets @samp{^foo|bar} as ``match either @samp{foo} at the beginning of a |
| line, or @samp{bar} anywhere'', whereas @code{flex} interprets it as ``match |
| either @samp{foo} or @samp{bar} if they come at the beginning of a |
| line''. The latter is in agreement with the POSIX specification. |
| |
| @item |
| The special table-size declarations such as @code{%a} supported by |
| @code{lex} are not required by @code{flex} scanners. @code{flex} |
| ignores them. |
| |
| @item |
| The name @code{FLEX_SCANNER} is @code{#define}'d so scanners may be |
| written for use with either @code{flex} or @code{lex}. Scanners also |
| include @code{YY_FLEX_MAJOR_VERSION}, @code{YY_FLEX_MINOR_VERSION} |
| and @code{YY_FLEX_SUBMINOR_VERSION} |
| indicating which version of @code{flex} generated the scanner. For |
| example, for the 2.5.22 release, these defines would be 2, 5 and 22 |
| respectively. If the version of @code{flex} being used is a beta |
| version, then the symbol @code{FLEX_BETA} is defined. |
| |
| @item |
| The symbols @samp{[[} and @samp{]]} in the code sections of the input |
| may conflict with the m4 delimiters. @xref{M4 Dependency}. |
| |
| |
| @end itemize |
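| |
| Several of the incompatibilities listed above can be sidestepped by |
| writing patterns so that both interpretations agree. The sketch below |
| is one way to do that, not a guarantee of portability to every |
| @code{lex}: the definition is parenthesized explicitly, the repeat |
| operator is applied to an explicit group, the @samp{^} anchor is |
| written once per alternative using the @samp{|} action, and |
| @code{flex}-only code is guarded with @code{FLEX_SCANNER}. |
| |
| @example |
| @verbatim |
| %{ |
| #include <stdio.h> |
| %} |
| NAME    ([A-Z][A-Z0-9]*) |
| %% |
| foo{NAME}?      printf( "Found it\n" ); |
| (abc){1,3}      printf( "one to three abc\n" ); |
| ^bar            | |
| ^baz            printf( "bar or baz at start of line\n" ); |
| %% |
| int yywrap( void ) { return 1; } |
| |
| int main( void ) |
| { |
| #ifdef FLEX_SCANNER |
|     printf( "generated by flex %d.%d\n", |
|             YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION ); |
| #endif |
|     yylex(); |
|     return 0; |
| } |
| @end verbatim |
| @end example |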
| |
| @cindex POSIX compliance |
| @cindex non-POSIX features of flex |
| The following @code{flex} features are not included in @code{lex} or the |
| POSIX specification: |
| |
| @itemize |
| @item |
| C++ scanners |
| @item |
| %option |
| @item |
| start condition scopes |
| @item |
| start condition stacks |
| @item |
| interactive/non-interactive scanners |
| @item |
| yy_scan_string() and friends |
| @item |
| yyterminate() |
| @item |
| yy_set_interactive() |
| @item |
| yy_set_bol() |
| @item |
| YY_AT_BOL() |
| @item |
| <<EOF>> |
| @item |
| <*> |
| @item |
| YY_DECL |
| @item |
| YY_START |
| @item |
| YY_USER_ACTION |
| @item |
| YY_USER_INIT |
| @item |
| #line directives |
| @item |
| %@{@}'s around actions |
| @item |
| reentrant C API |
| @item |
| multiple actions on a line |
| @item |
| almost all of the @code{flex} command-line options |
| @end itemize |
| |
| The feature ``multiple actions on a line'' |
| refers to the fact that with @code{flex} you can put multiple actions on |
| the same line, separated with semicolons, while with @code{lex}, the |
| following: |
| |
| @example |
| @verbatim |
| foo handle_foo(); ++num_foos_seen; |
| @end verbatim |
| @end example |
| |
| is (rather surprisingly) truncated to |
| |
| @example |
| @verbatim |
| foo handle_foo(); |
| @end verbatim |
| @end example |
| |
| @code{flex} does not truncate the action. Actions that are not enclosed |
| in braces are simply terminated at the end of the line. |
| |
| @node Memory Management, Serialized Tables, Lex and Posix, Top |
| @chapter Memory Management |
| |
| @cindex memory management |
| @anchor{memory-management} |
| This chapter describes how flex handles dynamic memory, and how you can |
| override the default behavior. |
| |
| @menu |
| * The Default Memory Management:: |
| * Overriding The Default Memory Management:: |
| * A Note About yytext And Memory:: |
| @end menu |
| |
| @node The Default Memory Management, Overriding The Default Memory Management, Memory Management, Memory Management |
| @section The Default Memory Management |
| |
| Flex allocates dynamic memory during initialization, and once in a while from |
| within a call to yylex(). Initialization takes place during the first call to |
| yylex(). Thereafter, flex may reallocate more memory if it needs to enlarge a |
| buffer. As of version 2.5.9, flex will clean up all memory when you call |
| @code{yylex_destroy}. @xref{faq-memory-leak}. |
| |
| Flex allocates dynamic memory for five purposes, listed below.@footnote{The |
| quantities given here are approximate, and may vary due to host architecture, |
| compiler configuration, or due to future enhancements to flex.} |
| |
| @table @asis |
| |
| @item 16kB for the input buffer. |
| Flex allocates memory for the character buffer used to perform pattern |
| matching. Flex must read ahead from the input stream and store it in a large |
| character buffer. This buffer is typically the largest chunk of dynamic memory |
| flex consumes. This buffer will grow if necessary, doubling the size each time. |
| Flex frees this memory when you call yylex_destroy(). The default size of this |
| buffer (16384 bytes) is almost always too large. The ideal size for this |
| buffer is the length of the longest token expected, in bytes, plus a little more. Flex will allocate a few |
| extra bytes for housekeeping. Currently, to override the size of the input buffer |
| you must @code{#define YY_BUF_SIZE} to whatever number of bytes you want. We don't plan |
| to change this in the near future, but we reserve the right to do so if we ever add a more robust memory management |
| API. |
| |
| @item 64kB for the REJECT state. |
| This will only be allocated if you use REJECT. |
| The size is large enough to hold the same number of states as characters in the input buffer. If you override the size of the |
| input buffer (via @code{YY_BUF_SIZE}), then you automatically override the size of this buffer as well. |
| |
| @item 100 bytes for the start condition stack. |
| Flex allocates memory for the start condition stack. This is the stack used |
| for pushing start states, i.e., with yy_push_state(). It will grow if |
| necessary. Since the states are simply integers, this stack doesn't consume |
| much memory. This stack is not present if @code{%option stack} is not |
| specified. You will rarely need to tune this buffer. The ideal size for this |
| stack is the maximum depth expected. The memory for this stack is |
| automatically destroyed when you call yylex_destroy(). @xref{option-stack}. |
| |
| @item 40 bytes for each YY_BUFFER_STATE. |
| Flex allocates memory for each YY_BUFFER_STATE. The buffer state itself |
| is about 40 bytes, plus an additional large character buffer (described above). |
| The initial buffer state is created during initialization, and with each call |
| to yy_create_buffer(). You can't tune the size of this, but you can tune the |
| character buffer as described above. Any buffer state that you explicitly |
| create by calling yy_create_buffer() is @emph{NOT} destroyed automatically. You |
| must call yy_delete_buffer() to free the memory. The exception to this rule is |
| that flex will delete the current buffer automatically when you call |
| yylex_destroy(). If you delete the current buffer, be sure to set it to NULL. |
| That way, flex will not try to delete the buffer a second time (possibly |
| crashing your program!) At the time of this writing, flex does not provide a |
| growable stack for the buffer states. You have to manage that yourself. |
| @xref{Multiple Input Buffers}. |
| |
| @item 84 bytes for the reentrant scanner guts |
| Flex allocates about 84 bytes for the reentrant scanner structure when |
| you call yylex_init(). It is destroyed when the user calls yylex_destroy(). |
| |
| @end table |
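| |
| As a rough sketch of tuning the input buffer and releasing memory, the |
| scanner below sets @code{YY_BUF_SIZE} before the generated code sees |
| its default, and calls @code{yylex_destroy} when scanning is done. The |
| value 4096 is only an assumption about the longest expected token; the |
| @code{%top} block is used so that the definition appears at the top of |
| the generated file. |
| |
| @example |
| @verbatim |
| %top{ |
| /* Assumption: no token is anywhere near 4096 bytes long. */ |
| #define YY_BUF_SIZE 4096 |
| } |
| %option noyywrap |
| %% |
| [a-z]+   ; |
| .|\n     ; |
| %% |
| int main( void ) |
| { |
|     yylex(); |
|     yylex_destroy();  /* frees the input buffer and current buffer state */ |
|     return 0; |
| } |
| @end verbatim |
| @end example |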
| |
| |
| @node Overriding The Default Memory Management, A Note About yytext And Memory, The Default Memory Management, Memory Management |
| @section Overriding The Default Memory Management |
| |
| @cindex yyalloc, overriding |
| @cindex yyrealloc, overriding |
| @cindex yyfree, overriding |
| |
| Flex calls the functions @code{yyalloc}, @code{yyrealloc}, and @code{yyfree} |
| when it needs to allocate or free memory. By default, these functions are |
| wrappers around the standard C functions, @code{malloc}, @code{realloc}, and |
| @code{free}, respectively. You can override the default implementations by telling |
| flex that you will provide your own implementations. |
| |
| To override the default implementations, you must do two things: |
| |
| @enumerate |
| |
| @item Suppress the default implementations by specifying one or more of the |
| following options: |
| |
| @itemize |
| @opindex noyyalloc |
| @item @code{%option noyyalloc} |
| @item @code{%option noyyrealloc} |
| @item @code{%option noyyfree} |
| @end itemize |
| |
| @item Provide your own implementation of the following functions: @footnote{It |
| is not necessary to override all (or any) of the memory management routines. |
| You may, for example, override @code{yyrealloc}, but not @code{yyfree} or |
| @code{yyalloc}.} |
| |
| @example |
| @verbatim |
| // For a non-reentrant scanner |
| void * yyalloc (size_t bytes); |
| void * yyrealloc (void * ptr, size_t bytes); |
| void yyfree (void * ptr); |
| |
| // For a reentrant scanner |
| void * yyalloc (size_t bytes, void * yyscanner); |
| void * yyrealloc (void * ptr, size_t bytes, void * yyscanner); |
| void yyfree (void * ptr, void * yyscanner); |
| @end verbatim |
| @end example |
| |
| @end enumerate |
| |
| In the following example, we will override all three memory routines. We assume |
| that there is a custom allocator with garbage collection. In order to make this |
| example interesting, we will use a reentrant scanner, passing a pointer to the |
| custom allocator through @code{yyextra}. |
| |
| @cindex overriding the memory routines |
| @example |
| @verbatim |
| %{ |
| #include "some_allocator.h" |
| %} |
| |
| /* Suppress the default implementations. */ |
| %option noyyalloc noyyrealloc noyyfree |
| %option reentrant |
| |
| /* Initialize the allocator. */ |
| #define YY_EXTRA_TYPE struct allocator* |
| #define YY_USER_INIT yyextra = allocator_create(); |
| |
| %% |
| .|\n ; |
| %% |
| |
| /* Provide our own implementations. */ |
| void * yyalloc (size_t bytes, void* yyscanner) { |
| return allocator_alloc (yyextra, bytes); |
| } |
| |
| void * yyrealloc (void * ptr, size_t bytes, void* yyscanner) { |
| return allocator_realloc (yyextra, ptr, bytes); |
| } |
| |
| void yyfree (void * ptr, void * yyscanner) { |
| /* Do nothing -- we leave it to the garbage collector. */ |
| } |
| |
| @end verbatim |
| @end example |
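| |
| For comparison, here is a smaller sketch for a non-reentrant scanner |
| that needs no external allocator: the three routines simply wrap the |
| standard C functions and count live blocks. The counter |
| @code{live_blocks} is our own illustration, not part of flex. |
| |
| @example |
| @verbatim |
| %option noyyalloc noyyrealloc noyyfree noyywrap |
| %{ |
| #include <stdio.h> |
| #include <stdlib.h> |
| static long live_blocks = 0;   /* illustrative bookkeeping only */ |
| %} |
| %% |
| .|\n    ; |
| %% |
| void * yyalloc (size_t bytes) { |
|     void * p = malloc (bytes); |
|     if (p) live_blocks++; |
|     return p; |
| } |
| |
| void * yyrealloc (void * ptr, size_t bytes) { |
|     void * p = realloc (ptr, bytes); |
|     if (p && !ptr) live_blocks++;  /* realloc(NULL, n) acts as malloc */ |
|     return p; |
| } |
| |
| void yyfree (void * ptr) { |
|     if (ptr) { live_blocks--; free (ptr); } |
| } |
| |
| int main (void) { |
|     yylex (); |
|     yylex_destroy (); |
|     fprintf (stderr, "blocks still allocated: %ld\n", live_blocks); |
|     return 0; |
| } |
| @end verbatim |
| @end example |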
| |
| |
| @node A Note About yytext And Memory, , Overriding The Default Memory Management, Memory Management |
| @section A Note About yytext And Memory |
| |
| @cindex yytext, memory considerations |
| |
| When flex finds a match, @code{yytext} points to the first character of the |
| match in the input buffer. The string itself is part of the input buffer, and |
| is @emph{NOT} allocated separately. The value of yytext will be overwritten the next |
| time yylex() is called. In short, the value of yytext is only valid from within |
| the matched rule's action. |
| |
| Often, you want the value of yytext to persist for later processing, e.g., by a |
| parser with non-zero lookahead. In order to preserve yytext, you will have to |
| copy it with strdup() or a similar function. But this introduces some headache |
| because your parser is now responsible for freeing the copy of yytext. If you |
| use a yacc or bison parser, (commonly used with flex), you will discover that |
| the error recovery mechanisms can cause memory to be leaked. |
| |
| To prevent memory leaks from strdup'd yytext, you will have to track the memory |
| somehow. Our experience has shown that a garbage collection mechanism or a |
| pooled memory mechanism will save you a lot of grief when writing parsers. |
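| |
| A very small pooled approach might look like the sketch below. The |
| names @code{save_text}, @code{saved}, and @code{MAX_SAVED} are |
| hypothetical, and a real parser would need a growable pool. The point |
| is only that each copy is recorded in one place, so everything can be |
| freed together no matter which path the parser took. |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %{ |
| #include <stdio.h> |
| #include <stdlib.h> |
| #include <string.h> |
| |
| #define MAX_SAVED 1024          /* hypothetical fixed pool size */ |
| static char *saved[MAX_SAVED]; |
| static int   nsaved = 0; |
| |
| /* Copy len bytes of text into the pool; NULL when out of room. */ |
| static char *save_text( const char *text, size_t len ) |
| { |
|     char *copy; |
|     if ( nsaved == MAX_SAVED || (copy = malloc( len + 1 )) == NULL ) |
|         return NULL; |
|     memcpy( copy, text, len ); |
|     copy[len] = '\0'; |
|     saved[nsaved++] = copy; |
|     return copy; |
| } |
| %} |
| %% |
| [A-Za-z]+   { save_text( yytext, (size_t) yyleng ); } |
| .|\n        ; |
| %% |
| int main( void ) |
| { |
|     int i; |
|     yylex(); |
|     for ( i = 0; i < nsaved; i++ ) { |
|         puts( saved[i] );   /* still valid: a copy, not yytext */ |
|         free( saved[i] ); |
|     } |
|     yylex_destroy(); |
|     return 0; |
| } |
| @end verbatim |
| @end example |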
| |
| @node Serialized Tables, Diagnostics, Memory Management, Top |
| @chapter Serialized Tables |
| @cindex serialization |
| @cindex memory, serialized tables |
| |
| @anchor{serialization} |
| A @code{flex} scanner has the ability to save the DFA tables to a file, and |
| load them at runtime when needed. The motivation for this feature is to reduce |
| the runtime memory footprint. Traditionally, these tables have been compiled into |
| the scanner as C arrays, and are sometimes quite large. Since the tables are |
| compiled into the scanner, the memory used by the tables can never be freed. |
| This is a waste of memory, especially if an application uses several scanners, |
| but none of them at the same time. |
| |
| The serialization feature allows the tables to be loaded at runtime, before |
| scanning begins. The tables may be discarded when scanning is finished. |
| |
| @menu |
| * Creating Serialized Tables:: |
| * Loading and Unloading Serialized Tables:: |
| * Tables File Format:: |
| @end menu |
| |
| @node Creating Serialized Tables, Loading and Unloading Serialized Tables, Serialized Tables, Serialized Tables |
| @section Creating Serialized Tables |
| @cindex tables, creating serialized |
| @cindex serialization of tables |
| |
| You may create a scanner with serialized tables by specifying: |
| |
| @example |
| @verbatim |
| %option tables-file=FILE |
| or |
| --tables-file=FILE |
| @end verbatim |
| @end example |
| |
| These options instruct flex to save the DFA tables to the file @var{FILE}. The tables |
| will @emph{not} be embedded in the generated scanner. The scanner will not |
| function on its own. The scanner will be dependent upon the serialized tables. You must |
| load the tables from this file at runtime before you can scan anything. |
| |
| If you do not specify a filename to @code{--tables-file}, the tables will be |
| saved to @file{lex.yy.tables}, where @samp{yy} is the appropriate prefix. |
| |
| If your project uses several different scanners, you can concatenate the |
| serialized tables into one file, and flex will find the correct set of tables, |
| using the scanner prefix as part of the lookup key. An example follows: |
| |
| @cindex serialized tables, multiple scanners |
| @example |
| @verbatim |
| $ flex --tables-file --prefix=cpp cpp.l |
| $ flex --tables-file --prefix=c c.l |
| $ cat lex.cpp.tables lex.c.tables > all.tables |
| @end verbatim |
| @end example |
| |
| The above example created two scanners, @samp{cpp} and @samp{c}. Since we did |
| not specify a filename, the tables were serialized to @file{lex.cpp.tables} and |
| @file{lex.c.tables}, respectively. Then, we concatenated the two files |
| together into @file{all.tables}, which we will distribute with our project. At |
| runtime, we will open the file and tell flex to load the tables from it. Flex |
| will find the correct tables automatically. (See next section). |
| |
| @node Loading and Unloading Serialized Tables, Tables File Format, Creating Serialized Tables, Serialized Tables |
| @section Loading and Unloading Serialized Tables |
| @cindex tables, loading and unloading |
| @cindex loading tables at runtime |
| @cindex tables, freeing |
| @cindex freeing tables |
| @cindex memory, serialized tables |
| |
| If you've built your scanner with @code{%option tables-file}, then you must |
| load the scanner tables at runtime. This can be accomplished with the following |
| function: |
| |
| @deftypefun int yytables_fload (FILE* @var{fp} [, yyscan_t @var{scanner}]) |
| Locates scanner tables in the stream pointed to by @var{fp} and loads them. |
| Memory for the tables is allocated via @code{yyalloc}. You must call this |
| function before the first call to @code{yylex}. The argument @var{scanner} |
| only appears in the reentrant scanner. |
| This function returns @samp{0} (zero) on success, or non-zero on error. |
| @end deftypefun |
| |
| The loaded tables are @strong{not} automatically destroyed (unloaded) when you |
| call @code{yylex_destroy}. The reason is that you may create several scanners |
| of the same type (in a reentrant scanner), each of which needs access to these |
| tables. To avoid a nasty memory leak, you must call the following function: |
| |
| @deftypefun int yytables_destroy ([yyscan_t @var{scanner}]) |
| Unloads the scanner tables. The tables must be loaded again before you can scan |
| any more data. The argument @var{scanner} only appears in the reentrant |
| scanner. This function returns @samp{0} (zero) on success, or non-zero on |
| error. |
| @end deftypefun |
| |
| @strong{The functions @code{yytables_fload} and @code{yytables_destroy} are not |
| thread-safe.} You must ensure that these functions are called exactly once (for |
| each scanner type) in a threaded program, before any thread calls @code{yylex}. |
| After the tables are loaded, they are never written to, and no thread |
| protection is required thereafter -- until you destroy them. |
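| |
| Putting the two functions together, a non-reentrant scanner might load |
| and unload its tables as sketched below. The file name |
| @file{scan.tables} is an assumption; it stands for whatever name was |
| given when the scanner was generated, e.g., with |
| @samp{flex --tables-file=scan.tables scan.l}. |
| |
| @example |
| @verbatim |
| %option noyywrap |
| %% |
| [a-z]+  printf( "word\n" ); |
| .|\n    ; |
| %% |
| int main( void ) |
| { |
|     FILE *fp = fopen( "scan.tables", "rb" ); |
| |
|     if ( fp == NULL ) { |
|         perror( "scan.tables" ); |
|         return 1; |
|     } |
|     if ( yytables_fload( fp ) != 0 ) {  /* must precede yylex() */ |
|         fprintf( stderr, "cannot load scanner tables\n" ); |
|         fclose( fp ); |
|         return 1; |
|     } |
|     fclose( fp ); |
| |
|     yylex(); |
| |
|     yylex_destroy();      /* does not unload the tables */ |
|     yytables_destroy();   /* so unload them explicitly */ |
|     return 0; |
| } |
| @end verbatim |
| @end example |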
| |
| @node Tables File Format, , Loading and Unloading Serialized Tables, Serialized Tables |
| @section Tables File Format |
| @cindex tables, file format |
| @cindex file format, serialized tables |
| |
| This section defines the file format of serialized @code{flex} tables. |
| |
| The tables format allows for one or more sets of tables to be |
| specified, where each set corresponds to a given scanner. Scanners are |
| indexed by name, as described below. The file format is as follows: |
| |
| @example |
| @verbatim |
| TABLE SET 1 |
| +-------------------------------+ |
| Header | uint32 th_magic; | |
| | uint32 th_hsize; | |
| | uint32 th_ssize; | |
| | uint16 th_flags; | |
| | char th_version[]; | |
| | char th_name[]; | |
| | uint8 th_pad64[]; | |
| +-------------------------------+ |
| Table 1 | uint16 td_id; | |
| | uint16 td_flags; | |
| | uint32 td_hilen; | |
| | uint32 td_lolen; | |
| | void td_data[]; | |
| | uint8 td_pad64[]; | |
| +-------------------------------+ |
| Table 2 | | |
| . . . |
| . . . |
| . . . |
| . . . |
| Table n | | |
| +-------------------------------+ |
| TABLE SET 2 |
| . |
| . |
| . |
| TABLE SET N |
| @end verbatim |
| @end example |
| |
| The above diagram shows that a complete set of tables consists of a header |
| followed by multiple individual tables. Furthermore, multiple complete sets may |
| be present in the same file, each set with its own header and tables. The sets |
| are contiguous in the file. The only way to know if another set follows is to |
| check the next four bytes for the magic number (or check for EOF). The header |
| and tables sections are padded to 64-bit boundaries. Below we describe each |
| field in detail. This format does not specify how the scanner will expand the |
| given data; for example, data may be serialized as int8 but expanded to an int32 |
| array at runtime. This is to reduce the size of the serialized data where |
| possible. Remember, @emph{all integer values are in network byte order}. |
| |
| @noindent |
| Fields of a table header: |
| |
| @table @code |
| @item th_magic |
| Magic number, always 0xF13C57B1. |
| |
| @item th_hsize |
| Size of this entire header, in bytes, including all fields plus any padding. |
| |
| @item th_ssize |
| Size of this entire set, in bytes, including the header, all tables, plus |
| any padding. |
| |
| @item th_flags |
| Bit flags for this table set. Currently unused. |
| |
| @item th_version[] |
| Flex version in NULL-terminated string format, e.g., @samp{2.5.13a}. This is |
| the version of flex that was used to create the serialized tables. |
| |
| @item th_name[] |
| Contains the name of this table set. The default is @samp{yytables}, |
| and is prefixed accordingly, e.g., @samp{footables}. Must be NULL-terminated. |
| |
| @item th_pad64[] |
| Zero or more NULL bytes, padding the entire header to the next 64-bit boundary |
| as calculated from the beginning of the header. |
| @end table |
| |
| @noindent |
| Fields of a table: |
| |
| @table @code |
| @item td_id |
| Specifies the table identifier. Possible values are: |
| @table @code |
| @item YYTD_ID_ACCEPT (0x01) |
| @code{yy_accept} |
| @item YYTD_ID_BASE (0x02) |
| @code{yy_base} |
| @item YYTD_ID_CHK (0x03) |
| @code{yy_chk} |
| @item YYTD_ID_DEF (0x04) |
| @code{yy_def} |
| @item YYTD_ID_EC (0x05) |
| @code{yy_ec} |
| @item YYTD_ID_META (0x06) |
| @code{yy_meta} |
| @item YYTD_ID_NUL_TRANS (0x07) |
| @code{yy_NUL_trans} |
| @item YYTD_ID_NXT (0x08) |
| @code{yy_nxt}. This array may be two dimensional. See the @code{td_hilen} |
| field below. |
| @item YYTD_ID_RULE_CAN_MATCH_EOL (0x09) |
| @code{yy_rule_can_match_eol} |
| @item YYTD_ID_START_STATE_LIST (0x0A) |
| @code{yy_start_state_list}. This array is handled specially because it is an |
| array of pointers to structs. See the @code{td_flags} field below. |
| @item YYTD_ID_TRANSITION (0x0B) |
| @code{yy_transition}. This array is handled specially because it is an array of |
| structs. See the @code{td_lolen} field below. |
| @item YYTD_ID_ACCLIST (0x0C) |
| @code{yy_acclist} |
| @end table |
| |
| @item td_flags |
| Bit flags describing how to interpret the data in @code{td_data}. |
| The data arrays are one-dimensional by default, but may be |
| two dimensional as specified in the @code{td_hilen} field. |
| |
| @table @code |
| @item YYTD_DATA8 (0x01) |
| The data is serialized as an array of type int8. |
| @item YYTD_DATA16 (0x02) |
| The data is serialized as an array of type int16. |
| @item YYTD_DATA32 (0x04) |
| The data is serialized as an array of type int32. |
| @item YYTD_PTRANS (0x08) |
| The data is a list of indexes of entries in the expanded @code{yy_transition} |
| array. Each index should be expanded to a pointer to the corresponding entry |
| in the @code{yy_transition} array. We count on the fact that the |
| @code{yy_transition} array has already been seen. |
| @item YYTD_STRUCT (0x10) |
| The data is a list of yy_trans_info structs, each of which consists of |
| two integers. There is no padding between struct elements or between structs. |
| The type of each member is determined by the @code{YYTD_DATA*} bits. |
| @end table |
| |
| @item td_hilen |
| If @code{td_hilen} is non-zero, then the data is a two-dimensional array. |
| Otherwise, the data is a one-dimensional array. @code{td_hilen} contains the |
| number of elements in the higher dimensional array, and @code{td_lolen} contains |
| the number of elements in the lowest dimension. |
| |
| Conceptually, @code{td_data} is either @code{sometype td_data[td_lolen]}, or |
| @code{sometype td_data[td_hilen][td_lolen]}, where @code{sometype} is specified |
| by the @code{td_flags} field. It is possible for both @code{td_lolen} and |
| @code{td_hilen} to be zero, in which case @code{td_data} is a zero length |
| array, and no data is loaded, i.e., this table is simply skipped. Flex does not |
| currently generate tables of zero length. |
| |
| @item td_lolen |
| Specifies the number of elements in the lowest dimension array. If this is |
| a one-dimensional array, then it is simply the number of elements in this array. |
| The element size is determined by the @code{td_flags} field. |
| |
| @item td_data[] |
| The table data. This array may be a one- or two-dimensional array, of type |
| @code{int8}, @code{int16}, @code{int32}, @code{struct yy_trans_info}, or |
| @code{struct yy_trans_info*}, depending upon the values in the |
| @code{td_flags}, @code{td_hilen}, and @code{td_lolen} fields. |
| |
| @item td_pad64[] |
| Zero or more NULL bytes, padding the entire table to the next 64-bit boundary as |
| calculated from the beginning of this table. |
| @end table |
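| |
| To make the byte-order and padding rules concrete, here is a small C |
| sketch that decodes the fixed-size leading fields of a table set header |
| from a byte buffer. It is not the loader that flex generates; the names |
| @code{th_fixed}, @code{be16}, @code{be32}, @code{parse_th_fixed}, and |
| @code{pad64} are our own, and the variable-length @code{th_version} and |
| @code{th_name} strings are left to the caller. |
| |
| @example |
| @verbatim |
| #include <stddef.h> |
| #include <stdint.h> |
| |
| #define TH_MAGIC      0xF13C57B1u |
| #define TH_FIXED_SIZE 14   /* 4 + 4 + 4 + 2 bytes */ |
| |
| struct th_fixed { |
|     uint32_t th_magic; |
|     uint32_t th_hsize; |
|     uint32_t th_ssize; |
|     uint16_t th_flags; |
| }; |
| |
| /* All integers in the file are in network (big-endian) byte order. */ |
| static uint16_t be16( const unsigned char *p ) |
| { |
|     return (uint16_t) ( ( p[0] << 8 ) | p[1] ); |
| } |
| |
| static uint32_t be32( const unsigned char *p ) |
| { |
|     return ( (uint32_t) p[0] << 24 ) | ( (uint32_t) p[1] << 16 ) |
|          | ( (uint32_t) p[2] << 8 )  |   (uint32_t) p[3]; |
| } |
| |
| /* Returns 0 on success, -1 if buf is too short or the magic is wrong. */ |
| static int parse_th_fixed( const unsigned char *buf, size_t len, |
|                            struct th_fixed *out ) |
| { |
|     if ( len < TH_FIXED_SIZE ) |
|         return -1; |
|     out->th_magic = be32( buf ); |
|     out->th_hsize = be32( buf + 4 ); |
|     out->th_ssize = be32( buf + 8 ); |
|     out->th_flags = be16( buf + 12 ); |
|     return out->th_magic == TH_MAGIC ? 0 : -1; |
| } |
| |
| /* Pad bytes needed to bring n up to the next 64-bit boundary. */ |
| static size_t pad64( size_t n ) |
| { |
|     return ( 8 - n % 8 ) % 8; |
| } |
| @end verbatim |
| @end example |
| |
| A reader would call @code{parse_th_fixed} on the first 14 bytes of a |
| set, then use @code{th_hsize} to skip to the first table and |
| @code{th_ssize} to skip to the next set. |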
| |
| @node Diagnostics, Limitations, Serialized Tables, Top |
| @chapter Diagnostics |
| |
| @cindex error reporting, diagnostic messages |
| @cindex warnings, diagnostic messages |
| |
| The following is a list of @code{flex} diagnostic messages: |
| |
| @itemize |
| @item |
| @samp{warning, rule cannot be matched} indicates that the given rule |
| cannot be matched because it follows other rules that will always match |
| the same text as it. For example, in the following @samp{foo} cannot be |
| matched because it comes after an identifier ``catch-all'' rule: |
| |
| @cindex warning, rule cannot be matched |
| @example |
| @verbatim |
| [a-z]+ got_identifier(); |
| foo got_foo(); |
| @end verbatim |
| @end example |
| |
| Using @code{REJECT} in a scanner suppresses this warning. |
| |
| @item |
| @samp{warning, -s option given but default rule can be matched} means |
| that it is possible (perhaps only in a particular start condition) that |
| the default rule (match any single character) is the only one that will |
| match a particular input. Since @samp{-s} was given, presumably this is |
| not intended. |
| |
| @item |
| @code{reject_used_but_not_detected undefined} or |
| @code{yymore_used_but_not_detected undefined}. These errors can occur |
| at compile time. They indicate that the scanner uses @code{REJECT} or |
| @code{yymore()} but that @code{flex} failed to notice the fact, meaning |
| that @code{flex} scanned the first two sections looking for occurrences |
| of these actions and failed to find any, but somehow you snuck some in |
| (via a #include file, for example). Use @code{%option reject} or |
| @code{%option yymore} to indicate to @code{flex} that you really do use |
| these features. |
| |
| @item |
| @samp{flex scanner jammed}. A scanner compiled with |
| @samp{-s} has encountered an input string which wasn't matched by any of |
| its rules. This error can also occur due to internal problems. |
| |
| @item |
| @samp{token too large, exceeds YYLMAX}. Your scanner uses @code{%array} |
| and one of its rules matched a string longer than the @code{YYLMAX} |
| constant (8K bytes by default). You can increase the value by |
| #define'ing @code{YYLMAX} in the definitions section of your @code{flex} |
| input. |
| |
| @item |
| @samp{scanner requires -8 flag to use the character 'x'}. Your scanner |
| specification includes recognizing the 8-bit character @samp{'x'} and |
| you did not specify the -8 flag, and your scanner defaulted to 7-bit |
| because you used the @samp{-Cf} or @samp{-CF} table compression options. |
| See the discussion of the @samp{-7} flag, @ref{Scanner Options}, for |
| details. |
| |
| @item |
| @samp{flex scanner push-back overflow}. You used @code{unput()} to push |
| back so much text that the scanner's buffer could not hold both the |
| pushed-back text and the current token in @code{yytext}. Ideally the |
| scanner should dynamically resize the buffer in this case, but at |
| present it does not. |
| |
| @item |
| @samp{input buffer overflow, can't enlarge buffer because scanner uses |
| REJECT}. The scanner was working on matching an extremely large token |
| and needed to expand the input buffer. This doesn't work with scanners |
| that use @code{REJECT}. |
| |
| @item |
| @samp{fatal flex scanner internal error--end of buffer missed}. This can |
| occur in a scanner which is reentered after a long-jump has jumped out |
| (or over) the scanner's activation frame. Before reentering the |
| scanner, use: |
| @example |
| @verbatim |
| yyrestart( yyin ); |
| @end verbatim |
| @end example |
| or, as noted above, switch to using the C++ scanner class. |
| |
| @item |
| @samp{too many start conditions in <> construct!}. You listed more start |
| conditions in a @samp{<>} construct than exist (so you must have listed at |
| least one of them twice). |
| @end itemize |
| |
| @node Limitations, Bibliography, Diagnostics, Top |
| @chapter Limitations |
| |
| @cindex limitations of flex |
| |
| Some trailing context patterns cannot be properly matched and generate |
| warning messages (@samp{dangerous trailing context}). These are |
| patterns where the ending of the first part of the rule matches the |
| beginning of the second part, such as @samp{zx*/xy*}, where the @samp{x*} |
| matches the @samp{x} at the beginning of the trailing context. (Note that |
| the POSIX draft states that the text matched by such patterns is |
| undefined.) For some trailing context rules, parts which are actually |
| fixed-length are not recognized as such, leading to the abovementioned |
| performance loss. In particular, parts using @samp{|} or @samp{@{n@}} |
| (such as @samp{foo@{3@}}) are always considered variable-length. |
| Combining trailing context with the special @samp{|} action can result |
| in @emph{fixed} trailing context being turned into the more expensive |
| @emph{variable} trailing context. This happens, for example, in the following: |
| |
| @cindex warning, dangerous trailing context |
| @example |
| @verbatim |
| %% |
| abc | |
| xyz/def |
| @end verbatim |
| @end example |
| |
| The following limitations also apply: |
| |
| @itemize |
| @item |
| Use of @code{unput()} invalidates @code{yytext} and @code{yyleng}, unless the |
| @code{%array} directive or the @samp{-l} option has been used. |
| @item |
| Pattern-matching of @code{NUL}s is substantially slower than matching |
| other characters. |
| @item |
| Dynamic resizing of the input buffer is slow, as it entails rescanning |
| all the text matched so far by the current (generally huge) token. |
| @item |
| Due to both buffering of input and read-ahead, you cannot intermix calls |
| to @file{<stdio.h>} routines, such as @code{getchar()}, with @code{flex} |
| rules and expect it to work. Call @code{input()} instead. |
| @item |
| The total number of table entries listed by the @samp{-v} flag excludes |
| the number of table entries needed to determine what rule has been |
| matched. That number is equal to the number of DFA states if the scanner |
| does not use @code{REJECT}, and somewhat greater than the number of |
| states if it does. |
| @item |
| @code{REJECT} cannot be used with the @samp{-f} or @samp{-F} options. |
| @end itemize |
| |
| The @code{flex} internal algorithms need documentation. |
| |
| @node Bibliography, FAQ, Limitations, Top |
| @chapter Additional Reading |
| |
| You may wish to read more about the following programs: |
| @itemize |
| @item lex |
| @item yacc |
| @item sed |
| @item awk |
| @end itemize |
| |
| The following books may contain material of interest: |
| |
| John Levine, Tony Mason, and Doug Brown, |
| @emph{Lex & Yacc}, |
| O'Reilly and Associates. Be sure to get the 2nd edition. |
| |
| M. E. Lesk and E. Schmidt, |
| @emph{LEX -- Lexical Analyzer Generator} |
| |
| Alfred Aho, Ravi Sethi and Jeffrey Ullman, @emph{Compilers: Principles, |
| Techniques and Tools}, Addison-Wesley (1986). Describes the |
| pattern-matching techniques used by @code{flex} (deterministic finite |
| automata). |
| |
| @node FAQ, Appendices, Bibliography, Top |
| @unnumbered FAQ |
| |
| From time to time, the @code{flex} maintainer receives certain |
| questions. Rather than repeat answers to well-understood problems, we |
| publish them here. |
| |
| @menu |
| * When was flex born?:: |
| * How do I expand backslash-escape sequences in C-style quoted strings?:: |
| * Why do flex scanners call fileno if it is not ANSI compatible?:: |
| * Does flex support recursive pattern definitions?:: |
| * How do I skip huge chunks of input (tens of megabytes) while using flex?:: |
| * Flex is not matching my patterns in the same order that I defined them.:: |
| * My actions are executing out of order or sometimes not at all.:: |
| * How can I have multiple input sources feed into the same scanner at the same time?:: |
| * Can I build nested parsers that work with the same input file?:: |
| * How can I match text only at the end of a file?:: |
| * How can I make REJECT cascade across start condition boundaries?:: |
| * Why cant I use fast or full tables with interactive mode?:: |
| * How much faster is -F or -f than -C?:: |
| * If I have a simple grammar cant I just parse it with flex?:: |
| * Why doesn't yyrestart() set the start state back to INITIAL?:: |
| * How can I match C-style comments?:: |
| * The period isn't working the way I expected.:: |
| * Can I get the flex manual in another format?:: |
| * Does there exist a "faster" NDFA->DFA algorithm?:: |
| * How does flex compile the DFA so quickly?:: |
| * How can I use more than 8192 rules?:: |
| * How do I abandon a file in the middle of a scan and switch to a new file?:: |
| * How do I execute code only during initialization (only before the first scan)?:: |
| * How do I execute code at termination?:: |
| * Where else can I find help?:: |
| * Can I include comments in the "rules" section of the file?:: |
| * I get an error about undefined yywrap().:: |
| * How can I change the matching pattern at run time?:: |
| * How can I expand macros in the input?:: |
| * How can I build a two-pass scanner?:: |
| * How do I match any string not matched in the preceding rules?:: |
| * I am trying to port code from AT&T lex that uses yysptr and yysbuf.:: |
| * Is there a way to make flex treat NULL like a regular character?:: |
| * Whenever flex can not match the input it says "flex scanner jammed".:: |
| * Why doesn't flex have non-greedy operators like perl does?:: |
| * Memory leak - 16386 bytes allocated by malloc.:: |
| * How do I track the byte offset for lseek()?:: |
| * How do I use my own I/O classes in a C++ scanner?:: |
| * How do I skip as many chars as possible?:: |
| * deleteme00:: |
| * Are certain equivalent patterns faster than others?:: |
| * Is backing up a big deal?:: |
| * Can I fake multi-byte character support?:: |
| * deleteme01:: |
| * Can you discuss some flex internals?:: |
| * unput() messes up yy_at_bol:: |
| * The | operator is not doing what I want:: |
| * Why can't flex understand this variable trailing context pattern?:: |
| * The ^ operator isn't working:: |
| * Trailing context is getting confused with trailing optional patterns:: |
| * Is flex GNU or not?:: |
| * ERASEME53:: |
| * I need to scan if-then-else blocks and while loops:: |
| * ERASEME55:: |
| * ERASEME56:: |
| * ERASEME57:: |
| * Is there a repository for flex scanners?:: |
| * How can I conditionally compile or preprocess my flex input file?:: |
| * Where can I find grammars for lex and yacc?:: |
| * I get an end-of-buffer message for each character scanned.:: |
| * unnamed-faq-62:: |
| * unnamed-faq-63:: |
| * unnamed-faq-64:: |
| * unnamed-faq-65:: |
| * unnamed-faq-66:: |
| * unnamed-faq-67:: |
| * unnamed-faq-68:: |
| * unnamed-faq-69:: |
| * unnamed-faq-70:: |
| * unnamed-faq-71:: |
| * unnamed-faq-72:: |
| * unnamed-faq-73:: |
| * unnamed-faq-74:: |
| * unnamed-faq-75:: |
| * unnamed-faq-76:: |
| * unnamed-faq-77:: |
| * unnamed-faq-78:: |
| * unnamed-faq-79:: |
| * unnamed-faq-80:: |
| * unnamed-faq-81:: |
| * unnamed-faq-82:: |
| * unnamed-faq-83:: |
| * unnamed-faq-84:: |
| * unnamed-faq-85:: |
| * unnamed-faq-86:: |
| * unnamed-faq-87:: |
| * unnamed-faq-88:: |
| * unnamed-faq-90:: |
| * unnamed-faq-91:: |
| * unnamed-faq-92:: |
| * unnamed-faq-93:: |
| * unnamed-faq-94:: |
| * unnamed-faq-95:: |
| * unnamed-faq-96:: |
| * unnamed-faq-97:: |
| * unnamed-faq-98:: |
| * unnamed-faq-99:: |
| * unnamed-faq-100:: |
| * unnamed-faq-101:: |
| * What is the difference between YYLEX_PARAM and YY_DECL?:: |
| * Why do I get "conflicting types for yylex" error?:: |
| * How do I access the values set in a Flex action from within a Bison action?:: |
| @end menu |
| |
| @node When was flex born? |
| @unnumberedsec When was flex born? |
| |
| Vern Paxson took over |
| the @cite{Software Tools} lex project from Jef Poskanzer in 1982. At that point it |
| was written in Ratfor. Around 1987 or so, Paxson translated it into C, and |
| a legend was born :-). |
| |
| @node How do I expand backslash-escape sequences in C-style quoted strings? |
| @unnumberedsec How do I expand backslash-escape sequences in C-style quoted strings? |
| |
| A key point when scanning quoted strings is that you cannot (easily) write |
| a single rule that will precisely match the string if you allow things |
| like embedded escape sequences and newlines. If you try to match strings |
| with a single rule then you'll wind up having to rescan the string anyway |
| to find any escape sequences. |
| |
| Instead you can use exclusive start conditions and a set of rules, one for |
| matching non-escaped text, one for matching a single escape, one for |
| matching an embedded newline, and one for recognizing the end of the |
| string. Each of these rules is then faced with the question of where to |
| put its intermediary results. The best solution is for the rules to |
| append their local value of @code{yytext} to the end of a ``string literal'' |
| buffer. A rule like the escape-matcher will append to the buffer the |
| meaning of the escape sequence rather than the literal text in @code{yytext}. |
| In this way, @code{yytext} does not need to be modified at all. |
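| |
| A minimal C sketch of such a ``string literal'' buffer follows. The names |
| @code{strbuf}, @code{strbuf_append()} and @code{strbuf_append_escape()}, the |
| fixed 1024-byte capacity, and the small set of escapes handled are all |
| assumptions made for illustration; none of them is part of @code{flex}. |

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size buffer for the string literal being assembled. */
#define STRBUF_MAX 1024
static char strbuf[STRBUF_MAX];
static size_t strbuf_len = 0;

/* Start a new literal (e.g. from the rule that matches the opening quote). */
static void strbuf_reset(void)
{
    strbuf_len = 0;
    strbuf[0] = '\0';
}

/* Append n bytes of text; return 0 on success, -1 if they do not fit. */
static int strbuf_append(const char *text, size_t n)
{
    assert(text != NULL);
    if (n >= STRBUF_MAX - strbuf_len)   /* keep room for the final NUL */
        return -1;
    memcpy(strbuf + strbuf_len, text, n);
    strbuf_len += n;
    strbuf[strbuf_len] = '\0';
    return 0;
}

/* Append the meaning of a two-character escape such as "\n". */
static int strbuf_append_escape(const char *escape)
{
    char c;

    assert(escape != NULL && escape[0] == '\\' && escape[1] != '\0');
    switch (escape[1]) {
    case 'n': c = '\n'; break;
    case 't': c = '\t'; break;
    case 'r': c = '\r'; break;
    default:  c = escape[1]; break;     /* \\, \" and unknown escapes */
    }
    return strbuf_append(&c, 1);
}
```

| A rule for non-escaped text could then call |
| @code{strbuf_append(yytext, yyleng)}, and a rule for a single escape could |
| call @code{strbuf_append_escape(yytext)}; neither needs to modify |
| @code{yytext}. |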
| |
| @node Why do flex scanners call fileno if it is not ANSI compatible? |
| @unnumberedsec Why do flex scanners call fileno if it is not ANSI compatible? |
| |
| Flex scanners call @code{fileno()} in order to get the file descriptor |
| corresponding to @code{yyin}. The file descriptor may be passed to |
| @code{isatty()} or @code{read()}, depending upon which @code{%options} you specified. |
| If your system does not have @code{fileno()} support, to get rid of the |
| @code{read()} call, do not specify @code{%option read}. To get rid of the @code{isatty()} |
| call, you must specify one of @code{%option always-interactive} or |
| @code{%option never-interactive}. |
| |
| @node Does flex support recursive pattern definitions? |
| @unnumberedsec Does flex support recursive pattern definitions? |
| |
| e.g., |
| |
| @example |
| @verbatim |
| %% |
| block "{"({block}|{statement})*"}" |
| @end verbatim |
| @end example |
| |
| No. You cannot have recursive definitions. The pattern-matching power of |
| regular expressions in general (and therefore flex scanners, too) is |
| limited. In particular, regular expressions cannot ``balance'' parentheses |
| to an arbitrary degree. For example, it's impossible to write a regular |
| expression that matches all strings containing the same number of '@{'s |
| as '@}'s. For more powerful pattern matching, you need a parser, such |
| as @cite{GNU bison}. |
| |
| @node How do I skip huge chunks of input (tens of megabytes) while using flex? |
| @unnumberedsec How do I skip huge chunks of input (tens of megabytes) while using flex? |
| |
| Use @code{fseek()} (or @code{lseek()}) to position @code{yyin}, then call @code{yyrestart()}. |
| |
| @node Flex is not matching my patterns in the same order that I defined them. |
| @unnumberedsec Flex is not matching my patterns in the same order that I defined them. |
| |
| @code{flex} picks the |
| rule that matches the most text (i.e., the longest possible input string). |
| This is because @code{flex} uses an entirely different matching technique |
| (``deterministic finite automata'') that actually does all of the matching |
| simultaneously, in parallel. (Seems impossible, but it's actually a fairly |
| simple technique once you understand the principles.) |
| |
| A side-effect of this parallel matching is that when the input matches more |
| than one rule, @code{flex} scanners pick the rule that matched the @emph{most} text. This |
| is explained further in the manual; see @ref{Matching}. |
| |
| If you want @code{flex} to choose a shorter match, then you can work around this |
| behavior by expanding your short |
| rule to match more text, then put back the extra: |
| |
| @example |
| @verbatim |
| data_.* yyless( 5 ); BEGIN BLOCKIDSTATE; |
| @end verbatim |
| @end example |
| |
| Another fix would be to make the second rule active only during the |
| @code{<BLOCKIDSTATE>} start condition, and make that start condition exclusive |
| by declaring it with @code{%x} instead of @code{%s}. |
| |
| A final fix is to change the input language so that the ambiguity for |
| @samp{data_} is removed, by adding characters to it that don't match the |
| identifier rule, or by removing characters (such as @samp{_}) from the |
| identifier rule so it no longer matches @samp{data_}. (Of course, you might |
| also not have the option of changing the input language.) |
| |
| @node My actions are executing out of order or sometimes not at all. |
| @unnumberedsec My actions are executing out of order or sometimes not at all. |
| |
| Most likely, you have (in error) placed the opening @samp{@{} of the action |
| block on a different line than the rule, e.g., |
| |
| @example |
| @verbatim |
| ^(foo|bar) |
| { <<<--- WRONG! |
| |
| } |
| @end verbatim |
| @end example |
| |
| @code{flex} requires that the opening @samp{@{} of an action associated with a rule |
| begin on the same line as does the rule. You need instead to write your rules |
| as follows: |
| |
| @example |
| @verbatim |
| ^(foo|bar) { // CORRECT! |
| |
| } |
| @end verbatim |
| @end example |
| |
| @node How can I have multiple input sources feed into the same scanner at the same time? |
| @unnumberedsec How can I have multiple input sources feed into the same scanner at the same time? |
| |
| If @dots{} |
| @itemize |
| @item |
| your scanner is free of backtracking (verified using @code{flex}'s @samp{-b} flag), |
| @item |
| AND you run your scanner interactively (@samp{-I} option; default unless using special table |
| compression options), |
| @item |
| AND you feed it one character at a time by redefining @code{YY_INPUT} to do so, |
| @end itemize |
| |
| then every time it matches a token, it will have exhausted its input |
| buffer (because the scanner is free of backtracking). This means you |
| can safely use @code{select()} at that point and only call @code{yylex()} for another |
| token if @code{select()} indicates there's data available. |
| |
| That is, move the @code{select()} out from the input function to a point where |
| it determines whether @code{yylex()} gets called for the next token. |
| |
| With this approach, you will still have problems if your input can arrive |
| piecemeal; @code{select()} could inform you that the beginning of a token is |
| available, you call @code{yylex()} to get it, but it winds up blocking waiting |
| for the later characters in the token. |
| |
| Here's another way: Move your input multiplexing inside of @code{YY_INPUT}. That |
| is, whenever @code{YY_INPUT} is called, it @code{select()}'s to see where input is |
| available. If input is available for the scanner, it reads and returns the |
| next byte. If input is available from another source, it calls whatever |
| function is responsible for reading from that source. (If no input is |
| available, it blocks until some input is available.) I've used this technique in an |
| interpreter I wrote that both reads keyboard input using a @code{flex} scanner and |
| IPC traffic from sockets, and it works fine. |
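| |
| The multiplexing step might be sketched in C as follows. This is only an |
| illustration under stated assumptions: the names @code{read_scanner_byte()} |
| and @code{consume_other_byte()} are hypothetical, only two descriptors are |
| watched, and interrupted calls and end-of-file on the second source are not |
| handled. |

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Example handler (hypothetical): consume one byte from the other source. */
static int other_bytes_seen = 0;
static void consume_other_byte(int fd)
{
    char c;
    if (read(fd, &c, 1) == 1)
        other_bytes_seen++;
}

/*
 * Block until scanner_fd has input for the scanner, servicing other_fd
 * through handle_other in the meantime.  Returns 1 with the byte stored
 * in *out, 0 at end of input, or -1 on error.
 */
static int read_scanner_byte(int scanner_fd, int other_fd,
                             void (*handle_other)(int fd), char *out)
{
    assert(out != NULL && handle_other != NULL);
    for (;;) {
        fd_set readable;
        int maxfd = scanner_fd > other_fd ? scanner_fd : other_fd;

        FD_ZERO(&readable);
        FD_SET(scanner_fd, &readable);
        FD_SET(other_fd, &readable);

        /* No timeout: sleep until at least one source is readable. */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            return -1;
        if (FD_ISSET(other_fd, &readable))
            handle_other(other_fd);
        if (FD_ISSET(scanner_fd, &readable)) {
            ssize_t n = read(scanner_fd, out, 1);
            return n < 0 ? -1 : (int)n;
        }
    }
}
```

| A @code{YY_INPUT} redefinition could call such a function with the |
| descriptor behind @code{yyin} and store the return value as its result. |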
| |
| @node Can I build nested parsers that work with the same input file? |
| @unnumberedsec Can I build nested parsers that work with the same input file? |
| |
| This is not going to work without some additional effort. The reason is |
| that @code{flex} block-buffers the input it reads from @code{yyin}. This means that the |
| ``outermost'' @code{yylex()}, when called, will automatically slurp up the first 8K |
| of input available on @code{yyin}, and subsequent calls to other @code{yylex()}'s won't |
| see that input. You might be tempted to work around this problem by |
| redefining @code{YY_INPUT} to only return a small amount of text, but it turns out |
| that that approach is quite difficult. Instead, the best solution is to |
| combine all of your scanners into one large scanner, using a different |
| exclusive start condition for each. |
| |
| @node How can I match text only at the end of a file? |
| @unnumberedsec How can I match text only at the end of a file? |
| |
| There is no way to write a rule which is ``match this text, but only if |
| it comes at the end of the file''. You can fake it, though, if you happen |
| to have a character lying around that you don't allow in your input. |
| Then you redefine @code{YY_INPUT} to call your own routine which, if it sees |
| an @samp{EOF}, returns the magic character first (and remembers to return a |
| real @code{EOF} next time it's called). Then you could write: |
| |
| @example |
| @verbatim |
| <COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */ |
| @end verbatim |
| @end example |
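| |
| The input routine described above might look roughly like the following C |
| sketch. The function name @code{read_with_eof_marker()}, the flag |
| @code{eof_marker_sent}, and the choice of byte 1 for @code{EOF_CHAR} are |
| assumptions for illustration; pick a byte that cannot occur in your input. |

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sentinel; it must never occur in valid input. */
#define EOF_CHAR '\001'

/* Set once the sentinel has been handed out for the current input. */
static int eof_marker_sent = 0;

/*
 * Read up to max_size bytes from fp into buf and return the count.
 * At end of file (or on a read error, which this sketch treats the
 * same way), return the sentinel once; on the call after that,
 * return 0 to signal the real end of input.
 */
static int read_with_eof_marker(FILE *fp, char *buf, int max_size)
{
    size_t n;

    assert(fp != NULL && buf != NULL && max_size > 0);
    n = fread(buf, 1, (size_t)max_size, fp);
    if (n > 0)
        return (int)n;
    if (!eof_marker_sent) {
        eof_marker_sent = 1;
        buf[0] = EOF_CHAR;
        return 1;
    }
    return 0;
}
```

| One possible hookup, assuming the usual three-argument form of |
| @code{YY_INPUT}, is to have the macro assign the value returned by |
| @code{read_with_eof_marker(yyin, buf, max_size)} to its result argument. |
| Remember to clear @code{eof_marker_sent} before scanning another file. |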
| |
| @node How can I make REJECT cascade across start condition boundaries? |
| @unnumberedsec How can I make REJECT cascade across start condition boundaries? |
| |
| You can do this as follows. Suppose you have a start condition @samp{A}, and |
| after exhausting all of the possible matches in @samp{<A>}, you want to try |
| matches in @samp{<INITIAL>}. Then you could use the following: |
| |
| @example |
| @verbatim |
| %x A |
| %% |
| <A>rule_that_is_long ...; REJECT; |
| <A>rule ...; REJECT; /* shorter rule */ |
| <A>etc. |
| ... |
| <A>.|\n { |
| /* Shortest and last rule in <A>, so |
| * cascaded REJECTs will eventually |
| * wind up matching this rule. We want |
| * to now switch to the initial state |
| * and try matching from there instead. |
| */ |
| yyless(0); /* put back matched text */ |
| BEGIN(INITIAL); |
| } |
| @end verbatim |
| @end example |
| |
| @node Why cant I use fast or full tables with interactive mode? |
| @unnumberedsec Why can't I use fast or full tables with interactive mode? |
| |
| One of the assumptions |
| flex makes is that interactive applications are inherently slow (they're |
| waiting on a human after all). |
| The restriction has to do with how the scanner detects that it must be finished scanning |
| a token. For interactive scanners, after scanning each character the current |
| state is looked up in a table (essentially) to see whether there's a chance |
| of another input character possibly extending the length of the match. If |
| not, the scanner halts. For non-interactive scanners, the end-of-token test |
| is much simpler, basically a compare with 0, so no memory bus cycles. Since |
| the test occurs in the innermost scanning loop, one would like to make it go |
| as fast as possible. |
| |
| Still, it seems reasonable to allow the user to choose to trade off a bit |
| of performance in this area to gain the corresponding flexibility. There |
| might be another reason, though, why fast scanners don't support the |
| interactive option. |
| |
| @node How much faster is -F or -f than -C? |
| @unnumberedsec How much faster is -F or -f than -C? |
| |
| Much faster (factor of 2-3). |
| |
| @node If I have a simple grammar cant I just parse it with flex? |
| @unnumberedsec If I have a simple grammar can't I just parse it with flex? |
| |
| Is your grammar recursive? That's almost always a sign that you're |
| better off using a parser/scanner rather than just trying to use a scanner |
| alone. |
| |
| @node Why doesn't yyrestart() set the start state back to INITIAL? |
| @unnumberedsec Why doesn't yyrestart() set the start state back to INITIAL? |
| |
| There are two reasons. The first is that there might |
| be programs that rely on the start state not changing across file changes. |
| The second is that beginning with @code{flex} version 2.4, use of @code{yyrestart()} is no longer required, |
| so fixing the problem there doesn't solve the more general problem. |
| |
| @node How can I match C-style comments? |
| @unnumberedsec How can I match C-style comments? |
| |
| You might be tempted to try something like this: |
| |
| @example |
| @verbatim |
| "/*".*"*/" // WRONG! |
| @end verbatim |
| @end example |
| |
| or, worse, this: |
| |
| @example |
| @verbatim |
| "/*"(.|\n)*"*/" // WRONG! |
| @end verbatim |
| @end example |
| |
| The above rules will eat too much input, and blow up on things like: |
| |
| @example |
| @verbatim |
| /* a comment */ do_my_thing( "oops */" ); |
| @end verbatim |
| @end example |
| |
| Here is one way which allows you to track line information: |
| |
| @example |
| @verbatim |
| %x IN_COMMENT |
| %% |
| <INITIAL>{ |
| "/*" BEGIN(IN_COMMENT); |
| } |
| <IN_COMMENT>{ |
| "*/" BEGIN(INITIAL); |
| [^*\n]+ // eat comment in chunks |
| "*" // eat the lone star |
| \n yylineno++; |
| } |
| @end verbatim |
| @end example |
| |
| @node The period isn't working the way I expected. |
| @unnumberedsec The '.' isn't working the way I expected. |
| |
| Here are some tips for using @samp{.}: |
| |
| @itemize |
| @item |
| A common mistake is to place the grouping parenthesis AFTER an operator, when |
| you really meant to place the parenthesis BEFORE the operator, e.g., you |
| probably want this @code{(foo|bar)+} and NOT this @code{(foo|bar+)}. |
| |
| The first pattern matches the words @samp{foo} or @samp{bar} any number of |
| times, e.g., it matches the text @samp{barfoofoobarfoo}. The |
| second pattern matches a single instance of @code{foo} or the text |
| @samp{ba} followed by one or more @samp{r}s, e.g., it matches the text @code{barrrr}. |
| @item |
| A @samp{.} inside @samp{[]}'s just means a literal @samp{.} (period), |
| and NOT ``any character except newline''. |
| @item |
| Remember that @samp{.} matches any character EXCEPT @samp{\n} (and @samp{EOF}). |
| If you really want to match ANY character, including newlines, then use @code{(.|\n)}. |
| Beware that the regex @code{(.|\n)+} will match your entire input! |
| @item |
| Finally, if you want to match a literal @samp{.} (a period), then use @samp{[.]} or @samp{"."}. |
| @end itemize |
| |
| @node Can I get the flex manual in another format? |
| @unnumberedsec Can I get the flex manual in another format? |
| |
| The @code{flex} source distribution includes a texinfo manual. You are |
| free to convert that texinfo into whatever format you desire. The |
| @code{texinfo} package includes tools for conversion to a number of formats. |
| |
| @node Does there exist a "faster" NDFA->DFA algorithm? |
| @unnumberedsec Does there exist a "faster" NDFA->DFA algorithm? |
| |
| There's no way around the potential exponential running time - it |
| can take you exponential time just to enumerate all of the DFA states. |
| In practice, though, the running time is closer to linear, or sometimes |
| quadratic. |
| |
| @node How does flex compile the DFA so quickly? |
| @unnumberedsec How does flex compile the DFA so quickly? |
| |
| There are two big speed wins that @code{flex} uses: |
| |
| @enumerate |
| @item |
| It analyzes the input rules to construct equivalence classes for those |
| characters that always make the same transitions. It then rewrites the NFA |
| using equivalence classes for transitions instead of characters. This cuts |
| down the NFA->DFA computation time dramatically, to the point where, for |
| uncompressed DFA tables, the DFA generation is often I/O bound in writing out |
| the tables. |
| @item |
| It maintains hash values for previously computed DFA states, so testing |
| whether a newly constructed DFA state is equivalent to a previously constructed |
| state can be done very quickly, by first comparing hash values. |
| @end enumerate |
| |
| @node How can I use more than 8192 rules? |
| @unnumberedsec How can I use more than 8192 rules? |
| |
| @code{Flex} is compiled with an upper limit of 8192 rules per scanner. |
| If you need more than 8192 rules in your scanner, you'll have to recompile @code{flex} |
| with the following changes in @file{flexdef.h}: |
| |
| @example |
| @verbatim |
| < #define YY_TRAILING_MASK 0x2000 |
| < #define YY_TRAILING_HEAD_MASK 0x4000 |
| -- |
| > #define YY_TRAILING_MASK 0x20000000 |
| > #define YY_TRAILING_HEAD_MASK 0x40000000 |
| @end verbatim |
| @end example |
| |
| This should work okay as long as your C compiler uses 32 bit integers. |
| But you might want to think about whether using such a huge number of rules |
| is the best way to solve your problem. |
| |
| The following may also be relevant: |
| |
| With luck, you should be able to increase the following definitions in @file{flexdef.h}: |
| |
| @example |
| @verbatim |
| #define JAMSTATE -32766 /* marks a reference to the state that always jams */ |
| #define MAXIMUM_MNS 31999 |
| #define BAD_SUBSCRIPT -32767 |
| @end verbatim |
| @end example |
| |
| recompile everything, and it'll all work. Flex only has these 16-bit-like |
| values built into it because a long time ago it was developed on a machine |
| with 16-bit ints. I've given this advice to others in the past but haven't |
| heard back from them whether it worked okay or not... |
| |
| @node How do I abandon a file in the middle of a scan and switch to a new file? |
| @unnumberedsec How do I abandon a file in the middle of a scan and switch to a new file? |
| |
| Just call @code{yyrestart(newfile)}. Be sure to reset the start state if you want a |
| ``fresh start'', since @code{yyrestart} does NOT reset the start state back to @code{INITIAL}. |
| |
| @node How do I execute code only during initialization (only before the first scan)? |
| @unnumberedsec How do I execute code only during initialization (only before the first scan)? |
| |
| You can specify an initial action by defining the macro @code{YY_USER_INIT} (though |
| note that @code{yyout} may not be available at the time this macro is executed). Or you |
| can add to the beginning of your rules section: |
| |
| @example |
| @verbatim |
| %% |
| /* Must be indented! */ |
| static int did_init = 0; |
| |
| if ( ! did_init ){ |
| do_my_init(); |
| did_init = 1; |
| } |
| @end verbatim |
| @end example |
| |
| @node How do I execute code at termination? |
| @unnumberedsec How do I execute code at termination? |
| |
| You can specify an action for the @code{<<EOF>>} rule. |
| |
| @node Where else can I find help? |
| @unnumberedsec Where else can I find help? |
| |
| You can find the flex homepage on the web at |
| @uref{http://flex.sourceforge.net/}. See that page for details about flex |
| mailing lists as well. |
| |
| @node Can I include comments in the "rules" section of the file? |
| @unnumberedsec Can I include comments in the "rules" section of the file? |
| |
| Yes, just about anywhere you want to. See the manual for the specific syntax. |
| |
| @node I get an error about undefined yywrap(). |
| @unnumberedsec I get an error about undefined yywrap(). |
| |
| You must supply a @code{yywrap()} function of your own, or link to @file{libfl.a} |
| (which provides one), or use |
| |
| @example |
| @verbatim |
| %option noyywrap |
| @end verbatim |
| @end example |
| |
| in your source to say you don't want a @code{yywrap()} function. |
| |
| @node How can I change the matching pattern at run time? |
| @unnumberedsec How can I change the matching pattern at run time? |
| |
| You can't; the pattern is compiled into a static table when @code{flex} builds the scanner. |
| |
| @node How can I expand macros in the input? |
| @unnumberedsec How can I expand macros in the input? |
| |
| The best way to approach this problem is at a higher level, e.g., in the parser. |
| |
| However, you can do this using multiple input buffers. |
| |
| @example |
| @verbatim |
| %% |
| macro/[a-z]+ { |
| /* Saw the macro "macro" followed by extra stuff. */ |
| main_buffer = YY_CURRENT_BUFFER; |
| expansion_buffer = yy_scan_string(expand(yytext)); |
| yy_switch_to_buffer(expansion_buffer); |
| } |
| |
| <<EOF>> { |
| if ( expansion_buffer ) |
| { |
| // We were doing an expansion, return to where |
| // we were. |
| yy_switch_to_buffer(main_buffer); |
| yy_delete_buffer(expansion_buffer); |
| expansion_buffer = 0; |
| } |
| else |
| yyterminate(); |
| } |
| @end verbatim |
| @end example |
| |
| You probably will want a stack of expansion buffers to allow nested macros. |
| Hopefully, though, the idea is clear from the above. |
| |
| @node How can I build a two-pass scanner? |
| @unnumberedsec How can I build a two-pass scanner? |
| |
| One way to do it is to filter the first pass to a temporary file, |
| then process the temporary file on the second pass. You will probably see a |
| performance hit, due to all the disk I/O. |
| |
| When you need to look ahead far forward like this, it almost always means |
| that the right solution is to build a parse tree of the entire input, then |
| walk it after the parse in order to generate the output. In a sense, this |
| is a two-pass approach, once through the text and once through the parse |
| tree, but the performance hit for the latter is usually an order of magnitude |
| smaller, since everything is already classified, in binary format, and |
| residing in memory. |
| |
| @node How do I match any string not matched in the preceding rules? |
| @unnumberedsec How do I match any string not matched in the preceding rules? |
| |
| One way to assign precedence is to place the more specific rules first. If |
| two rules would match the same input (same sequence of characters) then the |
| first rule listed in the @code{flex} input wins, e.g., |
| |
| @example |
| @verbatim |
| %% |
| foo[a-zA-Z_]+ return FOO_ID; |
| bar[a-zA-Z_]+ return BAR_ID; |
| [a-zA-Z_]+ return GENERIC_ID; |
| @end verbatim |
| @end example |
| |
| Note that the rule @code{[a-zA-Z_]+} must come @emph{after} the others. It will match the |
| same amount of text as the more specific rules, and in that case the |
| @code{flex} scanner will pick the first rule listed in your scanner as the |
| one to match. |
| |
| @node I am trying to port code from AT&T lex that uses yysptr and yysbuf. |
| @unnumberedsec I am trying to port code from AT&T lex that uses yysptr and yysbuf. |
| |
| Those are internal variables pointing into the AT&T scanner's input buffer. I |
| imagine they're being manipulated in user versions of the @code{input()} and @code{unput()} |
| functions. If so, what you need to do is analyze those functions to figure out |
| what they're doing, and then replace @code{input()} with an appropriate definition of |
| @code{YY_INPUT}. You shouldn't need to (and must not) replace |
| @code{flex}'s @code{unput()} function. |
| |
| @node Is there a way to make flex treat NULL like a regular character? |
| @unnumberedsec Is there a way to make flex treat NULL like a regular character? |
| |
| Yes, @samp{\0} and @samp{\x00} should both do the trick. Perhaps you have an ancient |
| version of @code{flex}. The latest release is version @value{VERSION}. |
| |
| @node Whenever flex can not match the input it says "flex scanner jammed". |
| @unnumberedsec Whenever flex can not match the input it says "flex scanner jammed". |
| |
| You need to add a rule that matches the otherwise-unmatched text, |
| e.g., |
| |
| @example |
| @verbatim |
| %option yylineno |
| %% |
| [[a bunch of rules here]] |
| |
| . printf("bad input character '%s' at line %d\n", yytext, yylineno); |
| @end verbatim |
| @end example |
| |
| See @code{%option default} for more information. |
| |
| @node Why doesn't flex have non-greedy operators like perl does? |
| @unnumberedsec Why doesn't flex have non-greedy operators like perl does? |
| |
| A DFA can do a non-greedy match by stopping |
| the first time it enters an accepting state, instead of consuming input until |
| it determines that no further matching is possible (a ``jam'' state). This |
| is actually easier to implement than longest leftmost match (which flex does). |
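| |
| The difference can be sketched with a tiny hand-written DFA in C. Everything |
| here (the pattern @samp{ab*}, @code{step()}, @code{match_len()}) is made up |
| for illustration and is not how @code{flex} lays out its tables. |
| |
```c
/* Hand-built DFA for the pattern ab*:
   state 0 is the start state, state 1 is accepting, -1 is the jam state. */
static int step(int state, char c)
{
    if (state == 0 && c == 'a') return 1;
    if (state == 1 && c == 'b') return 1;
    return -1;
}

/* Return the length of the match at the start of s (0 if none).
   greedy != 0: keep going until the DFA jams, remember the last accept.
   greedy == 0: stop at the first accepting state. */
static int match_len(const char *s, int greedy)
{
    int state = 0;
    int last_accept = 0;

    for (int i = 0; s[i] != '\0'; i++) {
        state = step(state, s[i]);
        if (state < 0)
            break;                 /* jammed: no further match possible */
        if (state == 1) {
            last_accept = i + 1;   /* accepting state reached */
            if (!greedy)
                break;             /* non-greedy: first accept wins */
        }
    }
    return last_accept;
}
```
| |
| With the input @samp{abbbc}, the greedy call returns 4 and the non-greedy |
| call returns 1. |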
| |
| But non-greedy matching is also much less useful than longest leftmost match. In general, |
| when you find yourself wishing for non-greedy matching, that's usually a |
| sign that you're trying to make the scanner do some parsing. That's |
| generally the wrong approach, since a scanner lacks the power to do a decent job. |
| It is better either to introduce a separate parser, or to split the scanner |
| into multiple scanners using (exclusive) start conditions. |
| |
| For example, to scan a chunk delimited by @samp{BEGIN} and @samp{END}, you might |
| enter a separate start state once you've seen the @samp{BEGIN}. In that state, you |
| might then have a regex that will match @samp{END} (to kick you out of the |
| state), and perhaps @samp{(.|\n)} to get a single character within the chunk. |
| |
| This approach also has much better error-reporting properties. |
| |
| @node Memory leak - 16386 bytes allocated by malloc. |
| @unnumberedsec Memory leak - 16386 bytes allocated by malloc. |
| @anchor{faq-memory-leak} |
| |
| UPDATED 2002-07-10: As of @code{flex} version 2.5.9, this leak means that you did not |
| call @code{yylex_destroy()}. If you are using an earlier version of @code{flex}, then read |
| on. |
| |
| The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the read-buffer, and |
| about 40 for @code{struct yy_buffer_state} (depending upon alignment). The leak is in |
| the non-reentrant C scanner only (NOT in the reentrant scanner, NOT in the C++ |
| scanner). Since @code{flex} doesn't know when you are done, the buffer is never freed. |
| |
| However, the leak won't multiply since the buffer is reused no matter how many |
| times you call @code{yylex()}. |
| |
| If you want to reclaim the memory when you are completely done scanning, then |
| you might try this: |
| |
| @example |
| @verbatim |
| /* For non-reentrant C scanner only. */ |
| yy_delete_buffer(YY_CURRENT_BUFFER); |
| yy_init = 1; |
| @end verbatim |
| @end example |
| |
| Note: @code{yy_init} is an ``internal variable'', and hasn't been tested in this |
| situation. It is possible that some other globals may need resetting as well. |
| |
| @node How do I track the byte offset for lseek()? |
| @unnumberedsec How do I track the byte offset for lseek()? |
| |
| @example |
| @verbatim |
| > We thought that it would be possible to have this number through the |
| > evaluation of the following expression: |
| > |
| > seek_position = (no_buffers)*YY_READ_BUF_SIZE + yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf |
| @end verbatim |
| @end example |
| |
| While this is the right idea, it has two problems. The first is that |
| it's possible that @code{flex} will request less than @code{YY_READ_BUF_SIZE} during |
| an invocation of @code{YY_INPUT} (or that your input source will return less |
| even though @code{YY_READ_BUF_SIZE} bytes were requested). The second problem |
| is that when refilling its internal buffer, @code{flex} keeps some characters |
| from the previous buffer (because usually it's in the middle of a match, |
| and needs those characters to construct @code{yytext} for the match once it's |
| done). Because of this, @code{yy_c_buf_p - YY_CURRENT_BUFFER->yy_ch_buf} won't |
| be exactly the number of characters already read from the current buffer. |
| |
| An alternative solution is to count the number of characters you've matched |
| since starting to scan. This can be done by using @code{YY_USER_ACTION}. For |
| example, |
| |
| @example |
| @verbatim |
| #define YY_USER_ACTION num_chars += yyleng; |
| @end verbatim |
| @end example |
| |
| (You need to be careful to update your bookkeeping if you use @code{yymore()}, |
| @code{yyless()}, @code{unput()}, or @code{input()}.) |
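| |
| One possible shape for that bookkeeping is sketched below in C. The helper |
| names are hypothetical (none of them are part of @code{flex}); the idea is to |
| call each one next to the corresponding scanner call. The sketch assumes that |
| characters passed to @code{unput()} are ones previously read from the input, |
| and that @code{input()} is not counted when it reports end-of-file. |
| |
```c
/* Running count of input bytes consumed so far. */
static long num_chars = 0;

/* What YY_USER_ACTION does: add the length of the text just matched. */
static void count_match(int yyleng)
{
    num_chars += yyleng;
}

/* Next to yyless(n): the token shrinks from yyleng to n characters,
   so the remaining yyleng - n characters go back to the input. */
static void count_yyless(int yyleng, int n)
{
    num_chars -= yyleng - n;
}

/* Next to unput(c): one character is pushed back onto the input. */
static void count_unput(void)
{
    num_chars -= 1;
}

/* Next to a successful input(): one more character was consumed. */
static void count_input(void)
{
    num_chars += 1;
}

/* Next to yymore(): the next match's yyleng will include the current
   text again, so take it out now to avoid counting it twice. */
static void count_yymore(int yyleng)
{
    num_chars -= yyleng;
}
```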
| |
| @node How do I use my own I/O classes in a C++ scanner? |
| @unnumberedsec How do I use my own I/O classes in a C++ scanner? |
| |
| When the flex C++ scanning class rewrite finally happens, then this sort of thing should become much easier. |
| |
| @cindex LexerOutput, overriding |
| @cindex LexerInput, overriding |
| @cindex overriding LexerOutput |
| @cindex overriding LexerInput |
| @cindex customizing I/O in C++ scanners |
| @cindex C++ I/O, customizing |
| You can do this by passing NULL @code{iostream*}'s to the various functions |
| (such as @code{LexerInput()} and @code{LexerOutput()}), and then |
| dealing with your own I/O classes surreptitiously (i.e., stashing them in |
| special member variables). This works because the only assumption the lexer |
| makes about the iostreams is that they're ultimately passed to |
| @code{LexerInput()} and @code{LexerOutput()}, which then do whatever |
| is necessary with them. |
| |
| @c faq edit stopped here |
| @node How do I skip as many chars as possible? |
| @unnumberedsec How do I skip as many chars as possible? |
| |
| How do I skip as many chars as possible -- without interfering with the other |
| patterns? |
| |
| In the example below, we want to skip over characters until we see the phrase |
| "endskip". The following will @emph{NOT} work correctly (do you see why not?) |
| |
| @example |
| @verbatim |
| /* INCORRECT SCANNER */ |
| %x SKIP |
| %% |
| <INITIAL>startskip BEGIN(SKIP); |
| ... |
| <SKIP>"endskip" BEGIN(INITIAL); |
| <SKIP>.* ; |
| @end verbatim |
| @end example |
| |
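| The problem is that the pattern @samp{.*} matches as much of the line as it can, |
| so it will eat up the word @samp{endskip} whenever other text precedes or |
| follows it on the line. The simplest (but slow) fix is: |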
| |
| @example |
| @verbatim |
| <SKIP>"endskip" BEGIN(INITIAL); |
| <SKIP>. ; |
| @end verbatim |
| @end example |
| |
| A faster fix involves making the second rule match more, without |
| making it match @samp{endskip} plus something else. So for example: |
| |
| @example |
| @verbatim |
| <SKIP>"endskip" BEGIN(INITIAL); |
| <SKIP>[^e]+ ; |
| <SKIP>. ;/* so you eat up e's, too */ |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node deleteme00 |
| @unnumberedsec deleteme00 |
| @example |
| @verbatim |
| QUESTION: |
| When was flex born? |
| |
| Vern Paxson took over |
| the Software Tools lex project from Jef Poskanzer in 1982. At that point it |
| was written in Ratfor. Around 1987 or so, Paxson translated it into C, and |
| a legend was born :-). |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Are certain equivalent patterns faster than others? |
| @unnumberedsec Are certain equivalent patterns faster than others? |
| @example |
| @verbatim |
| To: Adoram Rogel <adoram@orna.hybridge.com> |
| Subject: Re: Flex 2.5.2 performance questions |
| In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT. |
| Date: Wed, 18 Sep 96 10:51:02 PDT |
| From: Vern Paxson <vern> |
| |
| [Note, the most recent flex release is 2.5.4, which you can get from |
| ftp.ee.lbl.gov. It has bug fixes over 2.5.2 and 2.5.3.] |
| |
| > 1. Using the pattern |
| > ([Ff](oot)?)?[Nn](ote)?(\.)? |
| > instead of |
| > (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.))) |
| > (in a very complicated flex program) caused the program to slow from |
| > 300K+/min to 100K/min (no other changes were done). |
| |
| These two are not equivalent. For example, the first can match "footnote." |
| but the second can only match "footnote". This is almost certainly the |
| cause of the discrepancy - the slower scanner run is matching more tokens, |
| and/or having to do more backing up. |
| |
| > 2. Which of these two are better: [Ff]oot or (F|f)oot ? |
| |
| From a performance point of view, they're equivalent (modulo presumably |
| minor effects such as memory cache hit rates; and the presence of trailing |
| context, see below). From a space point of view, the first is slightly |
| preferable. |
| |
| > 3. I have a pattern that look like this: |
| > pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd) |
| > |
| > running yet another complicated program that includes the following rule: |
| > <snext>{and}/{no4}{bb}{pats} |
| > |
| > gets me to "too complicated - over 32,000 states"... |
| |
| I can't tell from this example whether the trailing context is variable-length |
| or fixed-length (it could be the latter if {and} is fixed-length). If it's |
| variable length, which flex -p will tell you, then this reflects a basic |
| performance problem, and if you can eliminate it by restructuring your |
| scanner, you will see significant improvement. |
| |
| > so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about |
| > 10 patterns and changed the rule to be 5 rules. |
| > This did compile, but what is the rule of thumb here ? |
| |
| The rule is to avoid trailing context other than fixed-length, in which for |
| a/b, either the 'a' pattern or the 'b' pattern has a fixed length. Use |
| of the '|' operator automatically makes the pattern variable length, so in |
| this case '[Ff]oot' is preferred to '(F|f)oot'. |
| |
| > 4. I changed a rule that looked like this: |
| > <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN... |
| > |
| > to the next 2 rules: |
| > <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;} |
| > <snext8>{and}{bb}/{ROMAN} { BEGIN... |
| > |
| > Again, I understand the using [^...] will cause a great performance loss |
| |
| Actually, it doesn't cause any sort of performance loss. It's a surprising |
| fact about regular expressions that they always match in linear time |
| regardless of how complex they are. |
| |
| > but are there any specific rules about it ? |
| |
| See the "Performance Considerations" section of the man page, and also |
| the example in MISC/fastwc/. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Is backing up a big deal? |
| @unnumberedsec Is backing up a big deal? |
| @example |
| @verbatim |
| To: Adoram Rogel <adoram@hybridge.com> |
| Subject: Re: Flex 2.5.2 performance questions |
| In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT. |
| Date: Thu, 19 Sep 96 09:58:00 PDT |
| From: Vern Paxson <vern> |
| |
| > a lot about the backing up problem. |
| > I believe that there lies my biggest problem, and I'll try to improve |
| > it. |
| |
| Since you have variable trailing context, this is a bigger performance |
| problem. Fixing it is usually easier than fixing backing up, which in a |
| complicated scanner (yours seems to fit the bill) can be extremely |
| difficult to do correctly. |
| |
| You also don't mention what flags you are using for your scanner. |
| -f makes a large speed difference, and -Cfe buys you nearly as much |
| speed but the resulting scanner is considerably smaller. |
| |
| > I have an | operator in {and} and in {pats} so both of them are variable |
| > length. |
| |
| -p should have reported this. |
| |
| > Is changing one of them to fixed-length is enough ? |
| |
| Yes. |
| |
| > Is it possible to change the 32,000 states limit ? |
| |
| Yes. I've appended instructions on how. Before you make this change, |
| though, you should think about whether there are ways to fundamentally |
| simplify your scanner - those are certainly preferable! |
| |
| Vern |
| |
| To increase the 32K limit (on a machine with 32 bit integers), you increase |
| the magnitude of the following in flexdef.h: |
| |
| #define JAMSTATE -32766 /* marks a reference to the state that always jams */ |
| #define MAXIMUM_MNS 31999 |
| #define BAD_SUBSCRIPT -32767 |
| #define MAX_SHORT 32700 |
| |
| Adding a 0 or two after each should do the trick. |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Can I fake multi-byte character support? |
| @unnumberedsec Can I fake multi-byte character support? |
| @example |
| @verbatim |
| To: Heeman_Lee@hp.com |
| Subject: Re: flex - multi-byte support? |
| In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT. |
| Date: Fri, 04 Oct 1996 11:42:18 PDT |
| From: Vern Paxson <vern> |
| |
| > I assume as long as my *.l file defines the |
| > range of expected character code values (in octal format), flex will |
| > scan the file and read multi-byte characters correctly. But I have no |
| > confidence in this assumption. |
| |
| Your lack of confidence is justified - this won't work. |
| |
| Flex has in it a widespread assumption that the input is processed |
| one byte at a time. Fixing this is on the to-do list, but is involved, |
| so it won't happen any time soon. In the interim, the best I can suggest |
| (unless you want to try fixing it yourself) is to write your rules in |
| terms of pairs of bytes, using definitions in the first section: |
| |
| X \xfe\xc2 |
| ... |
| %% |
| foo{X}bar found_foo_fe_c2_bar(); |
| |
| etc. Definitely a pain - sorry about that. |
| |
| By the way, the email address you used for me is ancient, indicating you |
| have a very old version of flex. You can get the most recent, 2.5.4, from |
| ftp.ee.lbl.gov. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node deleteme01 |
| @unnumberedsec deleteme01 |
| @example |
| @verbatim |
| To: moleary@primus.com |
| Subject: Re: Flex / Unicode compatibility question |
| In-reply-to: Your message of Tue, 22 Oct 1996 10:15:42 PDT. |
| Date: Tue, 22 Oct 1996 11:06:13 PDT |
| From: Vern Paxson <vern> |
| |
| Unfortunately flex at the moment has a widespread assumption within it |
| that characters are processed 8 bits at a time. I don't see any easy |
| fix for this (other than writing your rules in terms of double characters - |
| a pain). I also don't know of a wider lex, though you might try surfing |
| the Plan 9 stuff because I know it's a Unicode system, and also the PCCT |
| toolkit (try searching say Alta Vista for "Purdue Compiler Construction |
| Toolkit"). |
| |
| Fixing flex to handle wider characters is on the long-term to-do list. |
| But since flex is a strictly spare-time project these days, this probably |
| won't happen for quite a while, unless someone else does it first. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Can you discuss some flex internals? |
| @unnumberedsec Can you discuss some flex internals? |
| @example |
| @verbatim |
| To: Johan Linde <jl@theophys.kth.se> |
| Subject: Re: translation of flex |
| In-reply-to: Your message of Sun, 10 Nov 1996 09:16:36 PST. |
| Date: Mon, 11 Nov 1996 10:33:50 PST |
| From: Vern Paxson <vern> |
| |
| > I'm working for the Swedish team translating GNU program, and I'm currently |
| > working with flex. I have a few questions about some of the messages which |
| > I hope you can answer. |
| |
| All of the things you're wondering about, by the way, concerning flex |
| internals - probably the only person who understands what they mean in |
| English is me! So I wouldn't worry too much about getting them right. |
| That said ... |
| |
| > #: main.c:545 |
| > msgid " %d protos created\n" |
| > |
| > Does proto mean prototype? |
| |
| Yes - prototypes of state compression tables. |
| |
| > #: main.c:539 |
| > msgid " %d/%d (peak %d) template nxt-chk entries created\n" |
| > |
| > Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?) |
| > However, 'template next-check entries' doesn't make much sense to me. To be |
| > able to find a good translation I need to know a little bit more about it. |
| |
| There is a scheme in the Aho/Sethi/Ullman compiler book for compressing |
| scanner tables. It involves creating two pairs of tables. The first has |
| "base" and "default" entries, the second has "next" and "check" entries. |
| The "base" entry is indexed by the current state and yields an index into |
| the next/check table. The "default" entry gives what to do if the state |
| transition isn't found in next/check. The "next" entry gives the next |
| state to enter, but only if the "check" entry verifies that this entry is |
| correct for the current state. Flex creates templates of series of |
| next/check entries and then encodes differences from these templates as a |
| way to compress the tables. |
| |
| > #: main.c:533 |
| > msgid " %d/%d base-def entries created\n" |
| > |
| > The same problem here for 'base-def'. |
| |
| See above. |
| |
| Vern |
| @end verbatim |
| @end example |
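| |
| The base/default/next/check lookup described in the message above can be |
| sketched in C as follows. The three-state, two-symbol automaton and the array |
| names (@code{tbl_base} and so on) are invented for illustration; a generated |
| scanner uses its own, larger tables. State 1 shares its transition on symbol |
| 0 with state 0, so it stores only its transition on symbol 1 and names state 0 |
| as its default. |
| |
```c
/* Full transition function being compressed (states 0..2, symbols 0..1):
     state 0: 0 -> 1, 1 -> 0
     state 1: 0 -> 1, 1 -> 2   (same as state 0 on symbol 0)
     state 2: 0 -> 2, 1 -> 2                                      */

/* Indexed by state: where the state's entries start in next/check. */
static const int tbl_base[3] = { 0, 3, 2 };

/* Indexed by state: the state to consult when an entry is not found
   (-1 means the state stores all of its own transitions). */
static const int tbl_def[3] = { -1, 0, -1 };

/* Indexed by base[state] + symbol. */
static const int tbl_next[5]  = { 1, 0, 2, 2, 2 };
static const int tbl_check[5] = { 0, 0, 2, 2, 1 };

/* Look up the transition from state s on symbol c. */
static int next_state(int s, int c)
{
    /* The "check" entry says which state owns this slot. If it is not
       ours, fall back to the default state and try again. */
    while (tbl_check[tbl_base[s] + c] != s)
        s = tbl_def[s];
    return tbl_next[tbl_base[s] + c];
}
```
| |
| Here five next/check slots stand in for the six entries of the full table; |
| the saving grows when many states resemble one another. |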
| |
| @c TODO: Evaluate this faq. |
| @node unput() messes up yy_at_bol |
| @unnumberedsec unput() messes up yy_at_bol |
| @example |
| @verbatim |
| To: Xinying Li <xli@npac.syr.edu> |
| Subject: Re: FLEX ? |
| In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST. |
| Date: Wed, 13 Nov 1996 19:51:54 PST |
| From: Vern Paxson <vern> |
| |
| > "unput()" them to input flow, question occurs. If I do this after I scan |
| > a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That |
| > means the carriage flag has gone. |
| |
| You can control this by calling yy_set_bol(). It's described in the manual. |
| |
| > And if in pre-reading it goes to the end of file, is anything done |
| > to control the end of curren buffer and end of file? |
| |
| No, there's no way to put back an end-of-file. |
| |
| > By the way I am using flex 2.5.2 and using the "-l". |
| |
| The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and |
| 2.5.3. You can get it from ftp.ee.lbl.gov. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node The | operator is not doing what I want |
| @unnumberedsec The | operator is not doing what I want |
| @example |
| @verbatim |
| To: Alain.ISSARD@st.com |
| Subject: Re: Start condition with FLEX |
| In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST. |
| Date: Mon, 18 Nov 1996 10:41:34 PST |
| From: Vern Paxson <vern> |
| |
| > I am not able to use the start condition scope and to use the | (OR) with |
| > rules having start conditions. |
| |
| The problem is that if you use '|' as a regular expression operator, for |
| example "a|b" meaning "match either 'a' or 'b'", then it must *not* have |
| any blanks around it. If you instead want the special '|' *action* (which |
| from your scanner appears to be the case), which is a way of giving two |
| different rules the same action: |
| |
| foo | |
| bar matched_foo_or_bar(); |
| |
| then '|' *must* be separated from the first rule by whitespace and *must* |
| be followed by a new line. You *cannot* write it as: |
| |
| foo | bar matched_foo_or_bar(); |
| |
| even though you might think you could because yacc supports this syntax. |
| The reason for this unfortunate incompatibility is historical, but it's |
| unlikely to be changed. |
| |
| Your problems with start condition scope are simply due to syntax errors |
| from your use of '|' later confusing flex. |
| |
| Let me know if you still have problems. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Why can't flex understand this variable trailing context pattern? |
| @unnumberedsec Why can't flex understand this variable trailing context pattern? |
| @example |
| @verbatim |
| To: Gregory Margo <gmargo@newton.vip.best.com> |
| Subject: Re: flex-2.5.3 bug report |
| In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST. |
| Date: Sat, 23 Nov 1996 17:07:32 PST |
| From: Vern Paxson <vern> |
| |
| > Enclosed is a lex file that "real" lex will process, but I cannot get |
| > flex to process it. Could you try it and maybe point me in the right direction? |
| |
| Your problem is that some of the definitions in the scanner use the '/' |
| trailing context operator, and have it enclosed in ()'s. Flex does not |
| allow this operator to be enclosed in ()'s because doing so allows undefined |
| regular expressions such as "(a/b)+". So the solution is to remove the |
| parentheses. Note that you must also be building the scanner with the -l |
| option for AT&T lex compatibility. Without this option, flex automatically |
| encloses the definitions in parentheses. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node The ^ operator isn't working |
| @unnumberedsec The ^ operator isn't working |
| @example |
| @verbatim |
| To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de> |
| Subject: Re: Flex Bug ? |
| In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST. |
| Date: Tue, 26 Nov 1996 11:15:05 PST |
| From: Vern Paxson <vern> |
| |
| > In my lexer code, i have the line : |
| > ^\*.* { } |
| > |
| > Thus all lines starting with an astrix (*) are comment lines. |
| > This does not work ! |
| |
| I can't get this problem to reproduce - it works fine for me. Note |
| though that if what you have is slightly different: |
| |
| COMMENT ^\*.* |
| %% |
| {COMMENT} { } |
| |
| then it won't work, because flex pushes back macro definitions enclosed |
| in ()'s, so the rule becomes |
| |
| (^\*.*) { } |
| |
| and now that the '^' operator is not at the immediate beginning of the |
| line, it's interpreted as just a regular character. You can avoid this |
| behavior by using the "-l" lex-compatibility flag, or "%option lex-compat". |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Trailing context is getting confused with trailing optional patterns |
| @unnumberedsec Trailing context is getting confused with trailing optional patterns |
| @example |
| @verbatim |
| To: Adoram Rogel <adoram@hybridge.com> |
| Subject: Re: Flex 2.5.4 BOF ??? |
| In-reply-to: Your message of Tue, 26 Nov 1996 16:10:41 PST. |
| Date: Wed, 27 Nov 1996 10:56:25 PST |
| From: Vern Paxson <vern> |
| |
| > Organization(s)?/[a-z] |
| > |
| > This matched "Organizations" (looking in debug mode, the trailing s |
| > was matched with trailing context instead of the optional (s) in the |
| > end of the word. |
| |
| That should only happen with lex. Flex can properly match this pattern. |
| (That might be what you're saying, I'm just not sure.) |
| |
| > Is there a way to avoid this dangerous trailing context problem ? |
| |
| Unfortunately, there's no easy way. On the other hand, I don't see why |
| it should be a problem. Lex's matching is clearly wrong, and I'd hope |
| that usually the intent remains the same as expressed with the pattern, |
| so flex's matching will be correct. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Is flex GNU or not? |
| @unnumberedsec Is flex GNU or not? |
| @example |
| @verbatim |
| To: Cameron MacKinnon <mackin@interlog.com> |
| Subject: Re: Flex documentation bug |
| In-reply-to: Your message of Mon, 02 Dec 1996 00:07:08 PST. |
| Date: Sun, 01 Dec 1996 22:29:39 PST |
| From: Vern Paxson <vern> |
| |
| > I'm not sure how or where to submit bug reports (documentation or |
| > otherwise) for the GNU project stuff ... |
| |
| Well, strictly speaking flex isn't part of the GNU project. They just |
| distribute it because no one's written a decent GPL'd lex replacement. |
| So you should send bugs directly to me. Those sent to the GNU folks |
| sometimes find their way to me, but some may drop between the cracks. |
| |
| > In GNU Info, under the section 'Start Conditions', and also in the man |
| > page (mine's dated April '95) is a nice little snippet showing how to |
| > parse C quoted strings into a buffer, defined to be MAX_STR_CONST in |
| > size. Unfortunately, no overflow checking is ever done ... |
| |
| This is already mentioned in the manual: |
| |
| Finally, here's an example of how to match C-style quoted |
| strings using exclusive start conditions, including expanded |
| escape sequences (but not including checking for a string |
| that's too long): |
| |
| The reason for not doing the overflow checking is that it will needlessly |
| clutter up an example whose main purpose is just to demonstrate how to |
| use flex. |
| |
| The latest release is 2.5.4, by the way, available from ftp.ee.lbl.gov. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node ERASEME53 |
| @unnumberedsec ERASEME53 |
| @example |
| @verbatim |
| To: tsv@cs.UManitoba.CA |
| Subject: Re: Flex (reg).. |
| In-reply-to: Your message of Thu, 06 Mar 1997 23:50:16 PST. |
| Date: Thu, 06 Mar 1997 15:54:19 PST |
| From: Vern Paxson <vern> |
| |
| > [:alpha:] ([:alnum:] | \\_)* |
| |
| If your rule really has embedded blanks as shown above, then it won't |
| work, as the first blank delimits the rule from the action. (It wouldn't |
| even compile ...) You need instead: |
| |
| [:alpha:]([:alnum:]|\\_)* |
| |
| and that should work fine - there's no restriction on what can go inside |
| of ()'s except for the trailing context operator, '/'. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node I need to scan if-then-else blocks and while loops |
| @unnumberedsec I need to scan if-then-else blocks and while loops |
| @example |
| @verbatim |
| To: "Mike Stolnicki" <mstolnic@ford.com> |
| Subject: Re: FLEX help |
| In-reply-to: Your message of Fri, 30 May 1997 13:33:27 PDT. |
| Date: Fri, 30 May 1997 10:46:35 PDT |
| From: Vern Paxson <vern> |
| |
| > We'd like to add "if-then-else", "while", and "for" statements to our |
| > language ... |
| > We've investigated many possible solutions. The one solution that seems |
| > the most reasonable involves knowing the position of a TOKEN in yyin. |
| |
| I strongly advise you to instead build a parse tree (abstract syntax tree) |
| and loop over that instead. You'll find this has major benefits in keeping |
| your interpreter simple and extensible. |
| |
| That said, the functionality you mention for get_position and set_position |
| have been on the to-do list for a while. As flex is a purely spare-time |
| project for me, no guarantees when this will be added (in particular, it |
| for sure won't be for many months to come). |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node ERASEME55 |
| @unnumberedsec ERASEME55 |
| @example |
| @verbatim |
| To: Colin Paul Adams <colin@colina.demon.co.uk> |
| Subject: Re: Flex C++ classes and Bison |
| In-reply-to: Your message of 09 Aug 1997 17:11:41 PDT. |
| Date: Fri, 15 Aug 1997 10:48:19 PDT |
| From: Vern Paxson <vern> |
| |
| > #define YY_DECL int yylex (YYSTYPE *lvalp, struct parser_control |
| > *parm) |
| > |
| > I have been trying to get this to work as a C++ scanner, but it does |
| > not appear to be possible (warning that it matches no declarations in |
| > yyFlexLexer, or something like that). |
| > |
| > Is this supposed to be possible, or is it being worked on (I DID |
| > notice the comment that scanner classes are still experimental, so I'm |
| > not too hopeful)? |
| |
| What you need to do is derive a subclass from yyFlexLexer that provides |
| the above yylex() method, squirrels away lvalp and parm into member |
| variables, and then invokes yyFlexLexer::yylex() to do the regular scanning. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node ERASEME56 |
| @unnumberedsec ERASEME56 |
| @example |
| @verbatim |
| To: Mikael.Latvala@lmf.ericsson.se |
| Subject: Re: Possible mistake in Flex v2.5 document |
| In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT. |
| Date: Fri, 05 Sep 1997 10:01:54 PDT |
| From: Vern Paxson <vern> |
| |
| > In that example you show how to count comment lines when using |
| > C style /* ... */ comments. My question is, shouldn't you take into |
| > account a scenario where end of a comment marker occurs inside |
| > character or string literals? |
| |
| The scanner certainly needs to also scan character and string literals. |
| However it does that (there's an example in the man page for strings), the |
| lexer will recognize the beginning of the literal before it runs across the |
| embedded "/*". Consequently, it will finish scanning the literal before it |
| even considers the possibility of matching "/*". |
| |
| Example: |
| |
| '([^']*|{ESCAPE_SEQUENCE})' |
| |
| will match all the text between the ''s (inclusive). So the lexer |
| considers this as a token beginning at the first ', and doesn't even |
| attempt to match other tokens inside it. |
| |
| I think this subtlety is not worth putting in the manual, as I suspect |
| it would confuse more people than it would enlighten. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node ERASEME57 |
| @unnumberedsec ERASEME57 |
| @example |
| @verbatim |
| To: "Marty Leisner" <leisner@sdsp.mc.xerox.com> |
| Subject: Re: flex limitations |
| In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT. |
| Date: Mon, 08 Sep 1997 11:38:08 PDT |
| From: Vern Paxson <vern> |
| |
| > %% |
| > [a-zA-Z]+ /* skip a line */ |
| > { printf("got %s\n", yytext); } |
| > %% |
| |
| What version of flex are you using? If I feed this to 2.5.4, it complains: |
| |
| "bug.l", line 5: EOF encountered inside an action |
| "bug.l", line 5: unrecognized rule |
| "bug.l", line 5: fatal parse error |
| |
| Not the world's greatest error message, but it manages to flag the problem. |
| |
| (With the introduction of start condition scopes, flex can't accommodate |
| an action on a separate line, since it's ambiguous with an indented rule.) |
| |
| You can get 2.5.4 from ftp.ee.lbl.gov. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node Is there a repository for flex scanners? |
| @unnumberedsec Is there a repository for flex scanners? |
| |
| Not that we know of. You might try asking on comp.compilers. |
| |
| @c TODO: Evaluate this faq. |
| @node How can I conditionally compile or preprocess my flex input file? |
| @unnumberedsec How can I conditionally compile or preprocess my flex input file? |
| |
| Flex doesn't have a preprocessor like C does. You might try using @code{m4}, or the C |
| preprocessor plus a @code{sed} script to clean up the result. |
| |
| @c TODO: Evaluate this faq. |
| @node Where can I find grammars for lex and yacc? |
| @unnumberedsec Where can I find grammars for lex and yacc? |
| |
| In the sources for flex and bison. |
| |
| @c TODO: Evaluate this faq. |
| @node I get an end-of-buffer message for each character scanned. |
| @unnumberedsec I get an end-of-buffer message for each character scanned. |
| |
| This will happen if your @code{LexerInput()} function returns only one character |
| at a time, which can happen either if your scanner is ``interactive'', or |
| if the streams library on your platform always returns 1 for @code{yyin->gcount()}. |
| |
| Solution: override @code{LexerInput()} with a version that returns whole buffers. |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-62 |
| @unnumberedsec unnamed-faq-62 |
| @example |
| @verbatim |
| To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE |
| Subject: Re: Flex maximums |
| In-reply-to: Your message of Mon, 17 Nov 1997 17:16:06 PST. |
| Date: Mon, 17 Nov 1997 17:16:15 PST |
| From: Vern Paxson <vern> |
| |
| > I took a quick look into the flex-sources and altered some #defines in |
| > flexdefs.h: |
| > |
| > #define INITIAL_MNS 64000 |
| > #define MNS_INCREMENT 1024000 |
| > #define MAXIMUM_MNS 64000 |
| |
| The things to fix are to add a couple of zeroes to: |
| |
| #define JAMSTATE -32766 /* marks a reference to the state that always jams */ |
| #define MAXIMUM_MNS 31999 |
| #define BAD_SUBSCRIPT -32767 |
| #define MAX_SHORT 32700 |
| |
| and, if you get complaints about too many rules, make the following change too: |
| |
| #define YY_TRAILING_MASK 0x200000 |
| #define YY_TRAILING_HEAD_MASK 0x400000 |
| |
| - Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-63 |
| @unnumberedsec unnamed-faq-63 |
| @example |
| @verbatim |
| To: jimmey@lexis-nexis.com (Jimmey Todd) |
| Subject: Re: FLEX question regarding istream vs ifstream |
| In-reply-to: Your message of Mon, 08 Dec 1997 15:54:15 PST. |
| Date: Mon, 15 Dec 1997 13:21:35 PST |
| From: Vern Paxson <vern> |
| |
| > stdin_handle = YY_CURRENT_BUFFER; |
| > ifstream fin( "aFile" ); |
| > yy_switch_to_buffer( yy_create_buffer( fin, YY_BUF_SIZE ) ); |
| > |
| > What I'm wanting to do, is pass the contents of a file thru one set |
| > of rules and then pass stdin thru another set... It works great if, I |
| > don't use the C++ classes. But since everything else that I'm doing is |
| > in C++, I thought I'd be consistent. |
| > |
| > The problem is that 'yy_create_buffer' is expecting an istream* as it's |
| > first argument (as stated in the man page). However, fin is a ifstream |
| > object. Any ideas on what I might be doing wrong? Any help would be |
| > appreciated. Thanks!! |
| |
| You need to pass &fin, to turn it into an ifstream* instead of an ifstream. |
| Then its type will be compatible with the expected istream*, because ifstream |
| is derived from istream. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-64 |
| @unnumberedsec unnamed-faq-64 |
| @example |
| @verbatim |
| To: Enda Fadian <fadiane@piercom.ie> |
| Subject: Re: Question related to Flex man page? |
| In-reply-to: Your message of Tue, 16 Dec 1997 15:17:34 PST. |
| Date: Tue, 16 Dec 1997 14:17:09 PST |
| From: Vern Paxson <vern> |
| |
| > Can you explain to me what is ment by a long-jump in relation to flex? |
| |
| Using the longjmp() function while inside yylex() or a routine called by it. |
| |
| > what is the flex activation frame. |
| |
| Just yylex()'s stack frame. |
| |
| > As far as I can see yyrestart will bring me back to the sart of the input |
| > file and using flex++ isnot really an option! |
| |
| No, yyrestart() doesn't imply a rewind, even though its name might sound |
| like it does. It tells the scanner to flush its internal buffers and |
| start reading from the given file at its present location. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-65 |
| @unnumberedsec unnamed-faq-65 |
| @example |
| @verbatim |
| To: hassan@larc.info.uqam.ca (Hassan Alaoui) |
| Subject: Re: Need urgent Help |
| In-reply-to: Your message of Sat, 20 Dec 1997 19:38:19 PST. |
| Date: Sun, 21 Dec 1997 21:30:46 PST |
| From: Vern Paxson <vern> |
| |
| > /usr/lib/yaccpar: In function `int yyparse()': |
| > /usr/lib/yaccpar:184: warning: implicit declaration of function `int yylex(...)' |
| > |
| > ld: Undefined symbol |
| > _yylex |
| > _yyparse |
| > _yyin |
| |
| This is a known problem with Solaris C++ (and/or Solaris yacc). I believe |
| the fix is to explicitly insert some 'extern "C"' statements for the |
| corresponding routines/symbols. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-66 |
| @unnumberedsec unnamed-faq-66 |
| @example |
| @verbatim |
| To: mc0307@mclink.it |
| Cc: gnu@prep.ai.mit.edu |
| Subject: Re: [mc0307@mclink.it: Help request] |
| In-reply-to: Your message of Fri, 12 Dec 1997 17:57:29 PST. |
| Date: Sun, 21 Dec 1997 22:33:37 PST |
| From: Vern Paxson <vern> |
| |
| > This is my definition for float and integer types: |
| > . . . |
| > NZD [1-9] |
| > ... |
| > I've tested my program on other lex version (on UNIX Sun Solaris an HP |
| > UNIX) and it work well, so I think that my definitions are correct. |
| > There are any differences between Lex and Flex? |
| |
| There are indeed differences, as discussed in the man page. The one |
| you are probably running into is that when flex expands a name definition, |
| it puts parentheses around the expansion, while lex does not. There's |
| an example in the man page of how this can lead to different matching. |
| Flex's behavior complies with the POSIX standard (or at least with the |
| last POSIX draft I saw). |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-67 |
| @unnumberedsec unnamed-faq-67 |
| @example |
| @verbatim |
| To: hassan@larc.info.uqam.ca (Hassan Alaoui) |
| Subject: Re: Thanks |
| In-reply-to: Your message of Mon, 22 Dec 1997 16:06:35 PST. |
| Date: Mon, 22 Dec 1997 14:35:05 PST |
| From: Vern Paxson <vern> |
| |
| > Thank you very much for your help. I compile and link well with C++ while |
| > declaring 'yylex ...' extern, But a little problem remains. I get a |
| > segmentation default when executing ( I linked with lfl library) while it |
| > works well when using LEX instead of flex. Do you have some ideas about the |
| > reason for this ? |
| |
| The one possible reason for this that comes to mind is if you've defined |
| yytext as "extern char yytext[]" (which is what lex uses) instead of |
| "extern char *yytext" (which is what flex uses). If it's not that, then |
| I'm afraid I don't know what the problem might be. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-68 |
| @unnumberedsec unnamed-faq-68 |
| @example |
| @verbatim |
| To: "Bart Niswonger" <NISWONGR@almaden.ibm.com> |
| Subject: Re: flex 2.5: c++ scanners & start conditions |
| In-reply-to: Your message of Tue, 06 Jan 1998 10:34:21 PST. |
| Date: Tue, 06 Jan 1998 19:19:30 PST |
| From: Vern Paxson <vern> |
| |
| > The problem is that when I do this (using %option c++) start |
| > conditions seem to not apply. |
| |
| The BEGIN macro modifies the yy_start variable. For C scanners, this |
| is a static with scope visible through the whole file. For C++ scanners, |
| it's a member variable, so it only has visible scope within a member |
| function. Your lexbegin() routine is not a member function when you |
| build a C++ scanner, so it's not modifying the correct yy_start. The |
| diagnostic that indicates this is that you found you needed to add |
| a declaration of yy_start in order to get your scanner to compile when |
| using C++; instead, the correct fix is to make lexbegin() a member |
| function (by deriving from yyFlexLexer). |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-69 |
| @unnumberedsec unnamed-faq-69 |
| @example |
| @verbatim |
| To: "Boris Zinin" <boris@ippe.rssi.ru> |
| Subject: Re: current position in flex buffer |
| In-reply-to: Your message of Mon, 12 Jan 1998 18:58:23 PST. |
| Date: Mon, 12 Jan 1998 12:03:15 PST |
| From: Vern Paxson <vern> |
| |
| > The problem is how to determine the current position in flex active |
| > buffer when a rule is matched.... |
| |
| You will need to keep track of this explicitly, such as by redefining |
| YY_USER_ACTION to count the number of characters matched. |
| |
| The latest flex release, by the way, is 2.5.4, available from ftp.ee.lbl.gov. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-70 |
| @unnumberedsec unnamed-faq-70 |
| @example |
| @verbatim |
| To: Bik.Dhaliwal@bis.org |
| Subject: Re: Flex question |
| In-reply-to: Your message of Mon, 26 Jan 1998 13:05:35 PST. |
| Date: Tue, 27 Jan 1998 22:41:52 PST |
| From: Vern Paxson <vern> |
| |
| > That requirement involves knowing |
| > the character position at which a particular token was matched |
| > in the lexer. |
| |
| The way you have to do this is by explicitly keeping track of where |
| you are in the file, by counting the number of characters scanned |
| for each token (available in yyleng). It may prove convenient to |
| do this by redefining YY_USER_ACTION, as described in the manual. |
| |
| Vern |
| @end verbatim |
| @end example |
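
The bookkeeping described in the reply above can be sketched in plain C.
This is an illustration only: @code{struct scan_pos} and
@code{track_token} are hypothetical names, while @code{YY_USER_ACTION},
@code{yytext} and @code{yyleng} are the flex names mentioned in the reply.

@example
@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Running position of the scanner within its input. */
struct scan_pos {
    size_t offset;  /* characters consumed before the current token */
    size_t line;    /* 1-based line number at the start of the token */
};

/* Hypothetical helper: return the position at which the token starts,
 * then advance *pos past the token's text. */
struct scan_pos track_token(struct scan_pos *pos, const char *text, size_t len)
{
    struct scan_pos start;
    size_t i;

    assert(pos != NULL && text != NULL);
    start = *pos;
    for (i = 0; i < len; i++) {
        if (text[i] == '\n')
            pos->line++;
    }
    pos->offset += len;
    return start;
}

/* In the definitions section of a scanner, the hook might look like:
 *
 *   static struct scan_pos cur_pos = { 0, 1 };
 *   static struct scan_pos tok_pos;
 *   #define YY_USER_ACTION \
 *       tok_pos = track_token(&cur_pos, yytext, (size_t) yyleng);
 */
```
@end verbatim
@end example

Actions that call @code{yyless()}, @code{yymore()}, @code{unput()} or
@code{input()} change how much text a rule really consumes, so a counter
like this one would need adjusting in those actions.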
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-71 |
| @unnumberedsec unnamed-faq-71 |
| @example |
| @verbatim |
| To: Vladimir Alexiev <vladimir@cs.ualberta.ca> |
| Subject: Re: flex: how to control start condition from parser? |
| In-reply-to: Your message of Mon, 26 Jan 1998 05:50:16 PST. |
| Date: Tue, 27 Jan 1998 22:45:37 PST |
| From: Vern Paxson <vern> |
| |
| > It seems useful for the parser to be able to tell the lexer about such |
| > context dependencies, because then they don't have to be limited to |
| > local or sequential context. |
| |
| One way to do this is to have the parser call a stub routine that's |
| included in the scanner's .l file, and consequently that has access to |
| BEGIN. The only ugliness is that the parser can't pass in the state |
| it wants, because those aren't visible - but if you don't have many |
| such states, then using a different set of names doesn't seem like |
| too much of a burden. |
| |
| While generating a .h file like you suggest is certainly cleaner, |
| flex development has come to a virtual stand-still :-(, so a workaround |
| like the above is much more pragmatic than waiting for a new feature. |
| |
| Vern |
| @end verbatim |
| @end example |
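
The stub routine described in the reply above might look like the
following sketch. All names other than @code{BEGIN} and @code{INITIAL}
are hypothetical, and the block defines stand-ins for the flex-provided
names so that it compiles outside a generated scanner.

@example
@verbatim
```c
#include <assert.h>

/* Names the parser sees; in a real project they would live in a small
 * header shared by the parser and the scanner (hypothetical names). */
enum lexer_mode { LEXER_MODE_NORMAL, LEXER_MODE_STRING };

/* Stand-ins so the sketch compiles outside a flex-generated file.  In
 * a real .l file, INITIAL and STRING_SC would come from flex (the
 * latter from a "%x STRING_SC" declaration) and BEGIN would be flex's
 * own macro. */
#ifndef BEGIN
enum { INITIAL = 0, STRING_SC = 1 };
int demo_start = INITIAL;
#define BEGIN(sc) (demo_start = (sc))
#endif

/* Stub placed in the user-code section of the .l file and called by
 * the parser to switch start conditions. */
void lexer_set_mode(enum lexer_mode mode)
{
    assert(mode == LEXER_MODE_NORMAL || mode == LEXER_MODE_STRING);
    switch (mode) {
    case LEXER_MODE_NORMAL:
        BEGIN(INITIAL);
        break;
    case LEXER_MODE_STRING:
        BEGIN(STRING_SC);
        break;
    }
}
```
@end verbatim
@end example

A parser that reads one token ahead may already have scanned the next
token before it calls the stub, so the switch takes effect only for
tokens scanned afterwards.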
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-72 |
| @unnumberedsec unnamed-faq-72 |
| @example |
| @verbatim |
| To: Barbara Denny <denny@3com.com> |
| Subject: Re: freebsd flex bug? |
| In-reply-to: Your message of Fri, 30 Jan 1998 12:00:43 PST. |
| Date: Fri, 30 Jan 1998 12:42:32 PST |
| From: Vern Paxson <vern> |
| |
| > lex.yy.c:1996: parse error before `=' |
| |
| This is the key, identifying this error. (It may help to pinpoint |
| it by using flex -L, so it doesn't generate #line directives in its |
| output.) I will bet you heavy money that you have a start condition |
| name that is also a variable name, or something like that; flex spits |
| out #define's for each start condition name, mapping them to a number, |
| so you can wind up with: |
| |
| %x foo |
| %% |
| ... |
| %% |
| void bar() |
| { |
| int foo = 3; |
| } |
| |
| and the penultimate will turn into "int 1 = 3" after C preprocessing, |
| since flex will put "#define foo 1" in the generated scanner. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-73 |
| @unnumberedsec unnamed-faq-73 |
| @example |
| @verbatim |
| To: Maurice Petrie <mpetrie@infoscigroup.com> |
| Subject: Re: Lost flex .l file |
| In-reply-to: Your message of Mon, 02 Feb 1998 14:10:01 PST. |
| Date: Mon, 02 Feb 1998 11:15:12 PST |
| From: Vern Paxson <vern> |
| |
| > I am curious as to |
| > whether there is a simple way to backtrack from the generated source to |
| > reproduce the lost list of tokens we are searching on. |
| |
| In theory, it's straight-forward to go from the DFA representation |
| back to a regular-expression representation - the two are isomorphic. |
| In practice, a huge headache, because you have to unpack all the tables |
| back into a single DFA representation, and then write a program to munch |
| on that and translate it into an RE. |
| |
| Sorry for the less-than-happy news ... |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-74 |
| @unnumberedsec unnamed-faq-74 |
| @example |
| @verbatim |
| To: jimmey@lexis-nexis.com (Jimmey Todd) |
| Subject: Re: Flex performance question |
| In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. |
| Date: Thu, 19 Feb 1998 08:48:51 PST |
| From: Vern Paxson <vern> |
| |
| > What I have found, is that the smaller the data chunk, the faster the |
| > program executes. This is the opposite of what I expected. Should this be |
| > happening this way? |
| |
| This is exactly what will happen if your input file has embedded NULs. |
| From the man page: |
| |
| A final note: flex is slow when matching NUL's, particularly |
| when a token contains multiple NUL's. It's best to write |
| rules which match short amounts of text if it's anticipated |
| that the text will often include NUL's. |
| |
| So that's the first thing to look for. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-75 |
| @unnumberedsec unnamed-faq-75 |
| @example |
| @verbatim |
| To: jimmey@lexis-nexis.com (Jimmey Todd) |
| Subject: Re: Flex performance question |
| In-reply-to: Your message of Thu, 19 Feb 1998 11:01:17 PST. |
| Date: Thu, 19 Feb 1998 15:42:25 PST |
| From: Vern Paxson <vern> |
| |
| So there are several problems. |
| |
| First, to go fast, you want to match as much text as possible, which |
| your scanners don't in the case that what they're scanning is *not* |
| a <RN> tag. So you want a rule like: |
| |
| [^<]+ |
| |
| Second, C++ scanners are particularly slow if they're interactive, |
| which they are by default. Using -B speeds it up by a factor of 3-4 |
| on my workstation. |
| |
| Third, C++ scanners that use the istream interface are slow, because |
| of how poorly implemented istream's are. I built two versions of |
| the following scanner: |
| |
| %% |
| .*\n |
| .* |
| %% |
| |
| and the C version inhales a 2.5MB file on my workstation in 0.8 seconds. |
| The C++ istream version, using -B, takes 3.8 seconds. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-76 |
| @unnumberedsec unnamed-faq-76 |
| @example |
| @verbatim |
| To: "Frescatore, David (CRD, TAD)" <frescatore@exc01crdge.crd.ge.com> |
| Subject: Re: FLEX 2.5 & THE YEAR 2000 |
| In-reply-to: Your message of Wed, 03 Jun 1998 11:26:22 PDT. |
| Date: Wed, 03 Jun 1998 10:22:26 PDT |
| From: Vern Paxson <vern> |
| |
| > I am researching the Y2K problem with General Electric R&D |
| > and need to know if there are any known issues concerning |
| > the above mentioned software and Y2K regardless of version. |
| |
| There shouldn't be, all it ever does with the date is ask the system |
| for it and then print it out. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-77 |
| @unnumberedsec unnamed-faq-77 |
| @example |
| @verbatim |
| To: "Hans Dermot Doran" <htd@ibhdoran.com> |
| Subject: Re: flex problem |
| In-reply-to: Your message of Wed, 15 Jul 1998 21:30:13 PDT. |
| Date: Tue, 21 Jul 1998 14:23:34 PDT |
| From: Vern Paxson <vern> |
| |
| > To overcome this, I gets() the stdin into a string and lex the string. The |
| > string is lexed OK except that the end of string isn't lexed properly |
| > (yy_scan_string()), that is the lexer dosn't recognise the end of string. |
| |
| Flex doesn't contain mechanisms for recognizing buffer endpoints. But if |
| you use fgets instead (which you should anyway, to protect against buffer |
| overflows), then the final \n will be preserved in the string, and you can |
| scan that in order to find the end of the string. |
| |
| Vern |
| @end verbatim |
| @end example |
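
The @code{fgets} approach suggested in the reply above can be sketched as
follows. @code{read_line} is a hypothetical helper; the only library
calls it relies on are @code{fgets} and @code{strlen}.

@example
@verbatim
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into buf with fgets().
 * Returns  1 if a complete line (ending in '\n') was read,
 *          0 if text was read without a trailing newline (the line was
 *            longer than the buffer, or the input ended without one),
 *         -1 on end of file or error with nothing read. */
int read_line(FILE *in, char *buf, size_t size)
{
    size_t len;

    assert(in != NULL && buf != NULL && size > 1);
    if (fgets(buf, (int) size, in) == NULL)
        return -1;
    len = strlen(buf);
    return (len > 0 && buf[len - 1] == '\n') ? 1 : 0;
}
```
@end verbatim
@end example

A scanner could then pass the buffer to @code{yy_scan_string()} and use a
rule matching the newline to notice the end of the string.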
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-78 |
| @unnumberedsec unnamed-faq-78 |
| @example |
| @verbatim |
| To: soumen@almaden.ibm.com |
| Subject: Re: Flex++ 2.5.3 instance member vs. static member |
| In-reply-to: Your message of Mon, 27 Jul 1998 02:10:04 PDT. |
| Date: Tue, 28 Jul 1998 01:10:34 PDT |
| From: Vern Paxson <vern> |
| |
| > %{ |
| > int mylineno = 0; |
| > %} |
| > ws [ \t]+ |
| > alpha [A-Za-z] |
| > dig [0-9] |
| > %% |
| > |
| > Now you'd expect mylineno to be a member of each instance of class |
| > yyFlexLexer, but is this the case? A look at the lex.yy.cc file seems to |
| > indicate otherwise; unless I am missing something the declaration of |
| > mylineno seems to be outside any class scope. |
| > |
| > How will this work if I want to run a multi-threaded application with each |
| > thread creating a FlexLexer instance? |
| |
| Derive your own subclass and make mylineno a member variable of it. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-79 |
| @unnumberedsec unnamed-faq-79 |
| @example |
| @verbatim |
| To: Adoram Rogel <adoram@hybridge.com> |
| Subject: Re: More than 32K states change hangs |
| In-reply-to: Your message of Tue, 04 Aug 1998 16:55:39 PDT. |
| Date: Tue, 04 Aug 1998 22:28:45 PDT |
| From: Vern Paxson <vern> |
| |
| > Vern Paxson, |
| > |
| > I followed your advice, posted on Usenet bu you, and emailed to me |
| > personally by you, on how to overcome the 32K states limit. I'm running |
| > on Linux machines. |
| > I took the full source of version 2.5.4 and did the following changes in |
| > flexdef.h: |
| > #define JAMSTATE -327660 |
| > #define MAXIMUM_MNS 319990 |
| > #define BAD_SUBSCRIPT -327670 |
| > #define MAX_SHORT 327000 |
| > |
| > and compiled. |
| > All looked fine, including check and bigcheck, so I installed. |
| |
| Hmmm, you shouldn't increase MAX_SHORT, though looking through my email |
| archives I see that I did indeed recommend doing so. Try setting it back |
| to 32700; that should suffice that you no longer need -Ca. If it still |
| hangs, then the interesting question is - where? |
| |
| > Compiling the same hanged program with a out-of-the-box (RedHat 4.2 |
| > distribution of Linux) |
| > flex 2.5.4 binary works. |
| |
| Since Linux comes with source code, you should diff it against what |
| you have to see what problems they missed. |
| |
| > Should I always compile with the -Ca option now ? even short and simple |
| > filters ? |
| |
| No, definitely not. It's meant to be for those situations where you |
| absolutely must squeeze every last cycle out of your scanner. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-80 |
| @unnumberedsec unnamed-faq-80 |
| @example |
| @verbatim |
| To: "Schmackpfeffer, Craig" <Craig.Schmackpfeffer@usa.xerox.com> |
| Subject: Re: flex output for static code portion |
| In-reply-to: Your message of Tue, 11 Aug 1998 11:55:30 PDT. |
| Date: Mon, 17 Aug 1998 23:57:42 PDT |
| From: Vern Paxson <vern> |
| |
| > I would like to use flex under the hood to generate a binary file |
| > containing the data structures that control the parse. |
| |
| This has been on the wish-list for a long time. In principle it's |
| straight-forward - you redirect mkdata() et al's I/O to another file, |
| and modify the skeleton to have a start-up function that slurps these |
| into dynamic arrays. The concerns are (1) the scanner generation code |
| is hairy and full of corner cases, so it's easy to get surprised when |
| going down this path :-( ; and (2) being careful about buffering so |
| that when the tables change you make sure the scanner starts in the |
| correct state and reading at the right point in the input file. |
| |
| > I was wondering if you know of anyone who has used flex in this way. |
| |
| I don't - but it seems like a reasonable project to undertake (unlike |
| numerous other flex tweaks :-). |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-81 |
| @unnumberedsec unnamed-faq-81 |
| @example |
| @verbatim |
| From: Georg Rehm <georg@hal.cl-ki.uni-osnabrueck.de> |
| Message-Id: <199808200747.JAA34834@hal.cl-ki.uni-osnabrueck.de> |
| Subject: "flex scanner push-back overflow" |
| To: vern@ee.lbl.gov |
| Date: Thu, 20 Aug 1998 09:47:54 +0200 (MEST) |
| Reply-To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE |
| X-NoJunk: Do NOT send commercial mail, spam or ads to this address! |
| X-URL: http://www.cl-ki.uni-osnabrueck.de/~georg/ |
| X-Mailer: ELM [version 2.4ME+ PL28 (25)] |
| MIME-Version: 1.0 |
| Content-Type: text/plain; charset=US-ASCII |
| Content-Transfer-Encoding: 7bit |
| |
| Hi Vern, |
| |
| Yesterday, I encountered a strange problem: I use the macro processor m4 |
| to include some lengthy lists into a .l file. Following is a flex macro |
| definition that causes some serious pain in my neck: |
| |
| AUTHOR ("A. Boucard / L. Boucard"|"A. Dastarac / M. Levent"|"A.Boucaud / L.Boucaud"|"Abderrahim Lamchichi"|"Achmat Dangor"|"Adeline Toullier"|"Adewale Maja-Pearce"|"Ahmed Ziri"|"Akram Ellyas"|"Alain Bihr"|"Alain Gresh"|"Alain Guillemoles"|"Alain Joxe"|"Alain Morice"|"Alain Renon"|"Alain Zecchini"|"Albert Memmi"|"Alberto Manguel"|"Alex De Waal"|"Alfonso Artico"| [...]) |
| |
| The complete list contains about 10kB. When I try to "flex" this file |
| (on a Solaris 2.6 machine, using a modified flex 2.5.4 (I only increased |
| some of the predefined values in flexdefs.h) I get the error: |
| |
| myflex/flex -8 sentag.tmp.l |
| flex scanner push-back overflow |
| |
| When I remove the slashes in the macro definition everything works fine. |
| As I understand it, the double quotes escape the slash-character so it |
| really means "/" and not "trailing context". Furthermore, I tried to |
| escape the slashes with backslashes, but with no use, the same error message |
| appeared when flexing the code. |
| |
| Do you have an idea what's going on here? |
| |
| Greetings from Germany, |
| Georg |
| -- |
| Georg Rehm georg@cl-ki.uni-osnabrueck.de |
| Institute for Semantic Information Processing, University of Osnabrueck, FRG |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-82 |
| @unnumberedsec unnamed-faq-82 |
| @example |
| @verbatim |
| To: Georg.Rehm@CL-KI.Uni-Osnabrueck.DE |
| Subject: Re: "flex scanner push-back overflow" |
| In-reply-to: Your message of Thu, 20 Aug 1998 09:47:54 PDT. |
| Date: Thu, 20 Aug 1998 07:05:35 PDT |
| From: Vern Paxson <vern> |
| |
| > myflex/flex -8 sentag.tmp.l |
| > flex scanner push-back overflow |
| |
| Flex itself uses a flex scanner. That scanner is running out of buffer |
| space when it tries to unput() the humongous macro you've defined. When |
| you remove the '/'s, you make it small enough so that it fits in the buffer; |
| removing spaces would do the same thing. |
| |
| The fix is to either rethink how come you're using such a big macro and |
| perhaps there's another/better way to do it; or to rebuild flex's own |
| scan.c with a larger value for |
| |
| #define YY_BUF_SIZE 16384 |
| |
| - Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-83 |
| @unnumberedsec unnamed-faq-83 |
| @example |
| @verbatim |
| To: Jan Kort <jan@research.techforce.nl> |
| Subject: Re: Flex |
| In-reply-to: Your message of Fri, 04 Sep 1998 12:18:43 +0200. |
| Date: Sat, 05 Sep 1998 00:59:49 PDT |
| From: Vern Paxson <vern> |
| |
| > %% |
| > |
| > "TEST1\n" { fprintf(stderr, "TEST1\n"); yyless(5); } |
| > ^\n { fprintf(stderr, "empty line\n"); } |
| > . { } |
| > \n { fprintf(stderr, "new line\n"); } |
| > |
| > %% |
| > -- input --------------------------------------- |
| > TEST1 |
| > -- output -------------------------------------- |
| > TEST1 |
| > empty line |
| > ------------------------------------------------ |
| |
| IMHO, it's not clear whether or not this is in fact a bug. It depends |
| on whether you view yyless() as backing up in the input stream, or as |
| pushing new characters onto the beginning of the input stream. Flex |
| interprets it as the latter (for implementation convenience, I'll admit), |
| and so considers the newline as in fact matching at the beginning of a |
| line, as after all the last token scanned an entire line and so the |
| scanner is now at the beginning of a new line. |
| |
| I agree that this is counter-intuitive for yyless(), given its |
| functional description (it's less so for unput(), depending on whether |
| you're unput()'ing new text or scanned text). But I don't plan to |
| change it any time soon, as it's a pain to do so. Consequently, |
| you do indeed need to use yy_set_bol() and YY_AT_BOL() to tweak |
| your scanner into the behavior you desire. |
| |
| Sorry for the less-than-completely-satisfactory answer. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-84 |
| @unnumberedsec unnamed-faq-84 |
| @example |
| @verbatim |
| To: Patrick Krusenotto <krusenot@mac-info-link.de> |
| Subject: Re: Problems with restarting flex-2.5.2-generated scanner |
| In-reply-to: Your message of Thu, 24 Sep 1998 10:14:07 PDT. |
| Date: Thu, 24 Sep 1998 23:28:43 PDT |
| From: Vern Paxson <vern> |
| |
| > I am using flex-2.5.2 and bison 1.25 for Solaris and I am desperately |
| > trying to make my scanner restart with a new file after my parser stops |
| > with a parse error. When my compiler restarts, the parser always |
| > receives the token after the token (in the old file!) that caused the |
| > parser error. |
| |
| I suspect the problem is that your parser has read ahead in order |
| to attempt to resolve an ambiguity, and when it's restarted it picks |
| up with that token rather than reading a fresh one. If you're using |
| yacc, then the special "error" production can sometimes be used to |
| consume tokens in an attempt to get the parser into a consistent state. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-85 |
| @unnumberedsec unnamed-faq-85 |
| @example |
| @verbatim |
| To: Henric Jungheim <junghelh@pe-nelson.com> |
| Subject: Re: flex 2.5.4a |
| In-reply-to: Your message of Tue, 27 Oct 1998 16:41:42 PST. |
| Date: Tue, 27 Oct 1998 16:50:14 PST |
| From: Vern Paxson <vern> |
| |
| > This brings up a feature request: How about a command line |
| > option to specify the filename when reading from stdin? That way one |
| > doesn't need to create a temporary file in order to get the "#line" |
| > directives to make sense. |
| |
| Use -o combined with -t (per the man page description of -o). |
| |
| > P.S., Is there any simple way to use non-blocking IO to parse multiple |
| > streams? |
| |
| Simple, no. |
| |
| One approach might be to return a magic character on EWOULDBLOCK and |
| have a rule |
| |
| .*<magic-character> // put back .*, eat magic character |
| |
| This is off the top of my head, not sure it'll work. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-86 |
| @unnumberedsec unnamed-faq-86 |
| @example |
| @verbatim |
| To: "Repko, Billy D" <billy.d.repko@intel.com> |
| Subject: Re: Compiling scanners |
| In-reply-to: Your message of Wed, 13 Jan 1999 10:52:47 PST. |
| Date: Thu, 14 Jan 1999 00:25:30 PST |
| From: Vern Paxson <vern> |
| |
| > It appears that maybe it cannot find the lfl library. |
| |
| The Makefile in the distribution builds it, so you should have it. |
| It's exceedingly trivial, just a main() that calls yylex() and |
| a yywrap() that always returns 1. |
| |
| > %% |
| > \n ++num_lines; ++num_chars; |
| > . ++num_chars; |
| |
| You can't indent your rules like this - that's where the errors are coming |
| from. Flex copies indented text to the output file, it's how you do things |
| like |
| |
| int num_lines_seen = 0; |
| |
| to declare local variables. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-87 |
| @unnumberedsec unnamed-faq-87 |
| @example |
| @verbatim |
| To: Erick Branderhorst <Erick.Branderhorst@asml.nl> |
| Subject: Re: flex input buffer |
| In-reply-to: Your message of Tue, 09 Feb 1999 13:53:46 PST. |
| Date: Tue, 09 Feb 1999 21:03:37 PST |
| From: Vern Paxson <vern> |
| |
| > In the flex.skl file the size of the default input buffers is set. Can you |
| > explain why this size is set and why it is such a high number. |
| |
| It's large to optimize performance when scanning large files. You can |
| safely make it a lot lower if needed. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-88 |
| @unnumberedsec unnamed-faq-88 |
| @example |
| @verbatim |
| To: "Guido Minnen" <guidomi@cogs.susx.ac.uk> |
| Subject: Re: Flex error message |
| In-reply-to: Your message of Wed, 24 Feb 1999 15:31:46 PST. |
| Date: Thu, 25 Feb 1999 00:11:31 PST |
| From: Vern Paxson <vern> |
| |
| > I'm extending a larger scanner written in Flex and I keep running into |
| > problems. More specifically, I get the error message: |
| > "flex: input rules are too complicated (>= 32000 NFA states)" |
| |
| Increase the definitions in flexdef.h for: |
| |
| #define JAMSTATE -32766 /* marks a reference to the state that always jams */ |
| #define MAXIMUM_MNS 31999 |
| #define BAD_SUBSCRIPT -32767 |
| |
| recompile everything, and it should all work. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-90 |
| @unnumberedsec unnamed-faq-90 |
| @example |
| @verbatim |
| To: "Dmitriy Goldobin" <gold@ems.chel.su> |
| Subject: Re: FLEX trouble |
| In-reply-to: Your message of Mon, 31 May 1999 18:44:49 PDT. |
| Date: Tue, 01 Jun 1999 00:15:07 PDT |
| From: Vern Paxson <vern> |
| |
| > I have a trouble with FLEX. Why rule "/*".*"*/" work properly, |
| > but rule "/*"(.|\n)*"*/" don't work ? |
| |
| The second of these will have to scan the entire input stream (because |
| "(.|\n)*" matches an arbitrary amount of any text) in order to see if |
| it ends with "*/", terminating the comment. That potentially will overflow |
| the input buffer. |
| |
| > More complex rule "/*"([^*]|(\*/[^/]))*"*/ give an error |
| > 'unrecognized rule'. |
| |
| You can't use the '/' operator inside parentheses. It's not clear |
| what "(a/b)*" actually means. |
| |
| > I now use workaround with state <comment>, but single-rule is |
| > better, i think. |
| |
| Single-rule is nice but will always have the problem of either setting |
| restrictions on comments (like not allowing multi-line comments) and/or |
| running the risk of consuming the entire input stream, as noted above. |
| |
| Vern |
| @end verbatim |
| @end example |
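
A start-condition scanner avoids the problem described above because it
consumes a comment in small pieces instead of as one enormous token. The
following C sketch mimics that state-based technique by hand. It is an
illustration of the idea, not code that flex generates, and
@code{strip_comments} is a hypothetical name.

@example
@verbatim
```c
#include <assert.h>
#include <stddef.h>

enum scan_state { IN_CODE, IN_COMMENT };

/* Copy src to dst, dropping C-style comments.  dst must have room for
 * at least strlen(src) + 1 bytes.  Returns the number of characters
 * written, not counting the terminating NUL.  (String and character
 * literals are not treated specially in this sketch.) */
size_t strip_comments(const char *src, char *dst)
{
    enum scan_state state = IN_CODE;
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (state == IN_CODE) {
            if (src[0] == '/' && src[1] == '*') {
                state = IN_COMMENT;   /* like BEGIN(comment) */
                src += 2;
            } else {
                dst[out++] = *src++;
            }
        } else {
            if (src[0] == '*' && src[1] == '/') {
                state = IN_CODE;      /* like BEGIN(INITIAL) */
                src += 2;
            } else {
                src++;                /* discard comment text */
            }
        }
    }
    dst[out] = '\0';
    return out;
}
```
@end verbatim
@end example

Because each step looks at no more than two characters, the length of a
comment has no effect on how much input must be held at once.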
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-91 |
| @unnumberedsec unnamed-faq-91 |
| @example |
| @verbatim |
| To: vern@ee.lbl.gov |
| Date: Tue, 15 Jun 1999 08:55:43 -0700 |
| From: "Aki Niimura" <neko@my-deja.com> |
| Message-ID: <KNONDOHDOBGAEAAA@my-deja.com> |
| Mime-Version: 1.0 |
| Cc: |
| X-Sent-Mail: on |
| Reply-To: |
| X-Mailer: MailCity Service |
| Subject: A question on flex C++ scanner |
| X-Sender-Ip: 12.72.207.61 |
| Organization: My Deja Email (http://www.my-deja.com:80) |
| Content-Type: text/plain; charset=us-ascii |
| Content-Transfer-Encoding: 7bit |
| |
| Dear Dr. Paxon, |
| |
| I have been using flex for years. |
| It works very well on many projects. |
| Most case, I used it to generate a scanner on C language. |
| However, one project I needed to generate a scanner |
| on C++ lanuage. Thanks to your enhancement, flex did |
| the job. |
| |
| Currently, I'm working on enhancing my previous project. |
| I need to deal with multiple input streams (recursive |
| inclusion) in this scanner (C++). |
| I did similar thing for another scanner (C) as you |
| explained in your documentation. |
| |
| The generated scanner (C++) has necessary methods: |
| - switch_to_buffer(struct yy_buffer_state *b) |
| - yy_create_buffer(istream *is, int sz) |
| - yy_delete_buffer(struct yy_buffer_state *b) |
| |
| However, I couldn't figure out how to access current |
| buffer (yy_current_buffer). |
| |
| yy_current_buffer is a protected member of yyFlexLexer. |
| I can't access it directly. |
| Then, I thought yy_create_buffer() with is = 0 might |
| return current stream buffer. But it seems not as far |
| as I checked the source. (flex 2.5.4) |
| |
| I went through the Web in addition to Flex documentation. |
| However, it hasn't been successful, so far. |
| |
| It is not my intention to bother you, but, can you |
| comment about how to obtain the current stream buffer? |
| |
| Your response would be highly appreciated. |
| |
| Best regards, |
| Aki Niimura |
| |
| --== Sent via Deja.com http://www.deja.com/ ==-- |
| Share what you know. Learn what you don't. |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-92 |
| @unnumberedsec unnamed-faq-92 |
| @example |
| @verbatim |
| To: neko@my-deja.com |
| Subject: Re: A question on flex C++ scanner |
| In-reply-to: Your message of Tue, 15 Jun 1999 08:55:43 PDT. |
| Date: Tue, 15 Jun 1999 09:04:24 PDT |
| From: Vern Paxson <vern> |
| |
| > However, I couldn't figure out how to access current |
| > buffer (yy_current_buffer). |
| |
| Derive your own subclass from yyFlexLexer. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-93 |
| @unnumberedsec unnamed-faq-93 |
| @example |
| @verbatim |
| To: "Stones, Darren" <Darren.Stones@nectech.co.uk> |
| Subject: Re: You're the man to see? |
| In-reply-to: Your message of Wed, 23 Jun 1999 11:10:29 PDT. |
| Date: Wed, 23 Jun 1999 09:01:40 PDT |
| From: Vern Paxson <vern> |
| |
| > I hope you can help me. I am using Flex and Bison to produce an interpreted |
| > language. However all goes well until I try to implement an IF statement or |
| > a WHILE. I cannot get this to work as the parser parses all the conditions |
| > eg. the TRUE and FALSE conditons to check for a rule match. So I cannot |
| > make a decision!! |
| |
| You need to use the parser to build a parse tree (= abstract syntax tree), |
| and when that's all done you recursively evaluate the tree, binding variables |
| to values at that time. |
| |
| Vern |
| @end verbatim |
| @end example |
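| |
| The approach described in the reply can be sketched in plain C. This is only |
| an illustration and is not code from flex or bison: the node type and the |
| names @code{mk}, @code{eval} and @code{free_tree} are hypothetical. The parser's |
| actions would only build nodes; nothing runs until @code{eval} walks the |
| finished tree, so only the selected branch of an IF is evaluated. |
| |
| @example |
| @verbatim |
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical AST node kinds for a tiny interpreted language. */
enum kind { N_NUM, N_ADD, N_LESS, N_IF };

struct node {
    enum kind kind;
    int value;              /* used by N_NUM only */
    struct node *a, *b, *c; /* operands; N_IF uses a=cond, b=then, c=else */
};

/* Allocate a node. A parser action would call this to record the
   construct instead of computing its value on the spot. */
struct node *mk(enum kind kind, int value,
                struct node *a, struct node *b, struct node *c)
{
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = kind;
    n->value = value;
    n->a = a;
    n->b = b;
    n->c = c;
    return n;
}

/* Recursively evaluate the finished tree. Only one branch of an
   N_IF node is visited, chosen by the value of its condition. */
int eval(const struct node *n)
{
    switch (n->kind) {
    case N_NUM:  return n->value;
    case N_ADD:  return eval(n->a) + eval(n->b);
    case N_LESS: return eval(n->a) < eval(n->b);
    case N_IF:   return eval(n->a) ? eval(n->b) : eval(n->c);
    }
    return 0; /* not reached for trees built from the kinds above */
}

/* Release a tree built with mk(). */
void free_tree(struct node *n)
{
    if (n == NULL)
        return;
    free_tree(n->a);
    free_tree(n->b);
    free_tree(n->c);
    free(n);
}
```
| @end verbatim |
| @end example |
| |
| A WHILE loop fits the same scheme: its node would re-evaluate the condition |
| subtree before each pass over the body subtree. |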
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-94 |
| @unnumberedsec unnamed-faq-94 |
| @example |
| @verbatim |
| To: Petr Danecek <petr@ics.cas.cz> |
| Subject: Re: flex - question |
| In-reply-to: Your message of Mon, 28 Jun 1999 19:21:41 PDT. |
| Date: Fri, 02 Jul 1999 16:52:13 PDT |
| From: Vern Paxson <vern> |
| |
| > file, it takes an enormous amount of time. It is funny, because the |
| > source code has only 12 rules!!! I think it looks like an exponencial |
| > growth. |
| |
| Right, that's the problem - some patterns (those with a lot of |
| ambiguity, which yours has because at any given time the scanner can |
| be in the middle of all sorts of combinations of the different |
| rules) blow up exponentially. |
| |
| For your rules, there is an easy fix. Change the ".*" that comes after |
| the directory name to "[^ ]*". With that in place, the rules are no |
| longer nearly so ambiguous, because then once one of the directories |
| has been matched, no other can be matched (since they all require a |
| leading blank). |
| |
| If that's not an acceptable solution, then you can enter a start state |
| to pick up the .*\n after each directory is matched. |
| |
| Also note that for speed, you'll want to add a ".*" rule at the end, |
| otherwise input that doesn't match any of the patterns will be matched |
| very slowly, a character at a time. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-95 |
| @unnumberedsec unnamed-faq-95 |
| @example |
| @verbatim |
| To: Tielman Koekemoer <tielman@spi.co.za> |
| Subject: Re: Please help. |
| In-reply-to: Your message of Thu, 08 Jul 1999 13:20:37 PDT. |
| Date: Thu, 08 Jul 1999 08:20:39 PDT |
| From: Vern Paxson <vern> |
| |
| > I was hoping you could help me with my problem. |
| > |
| > I tried compiling (gnu)flex on a Solaris 2.4 machine |
| > but when I ran make (after configure) I got an error. |
| > |
| > -------------------------------------------------------------- |
| > gcc -c -I. -I. -g -O parse.c |
| > ./flex -t -p ./scan.l >scan.c |
| > sh: ./flex: not found |
| > *** Error code 1 |
| > make: Fatal error: Command failed for target `scan.c' |
| > ------------------------------------------------------------- |
| > |
| > What's strange to me is that I'm only |
| > trying to install flex now. I then edited the Makefile to |
| > and changed where it says "FLEX = flex" to "FLEX = lex" |
| > ( lex: the native Solaris one ) but then it complains about |
| > the "-p" option. Is there any way I can compile flex without |
| > using flex or lex? |
| > |
| > Thanks so much for your time. |
| |
| You managed to step on the bootstrap sequence, which first copies |
| initscan.c to scan.c in order to build flex. Try fetching a fresh |
| distribution from ftp.ee.lbl.gov. (Or you can first try removing |
| ".bootstrap" and doing a make again.) |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-96 |
| @unnumberedsec unnamed-faq-96 |
| @example |
| @verbatim |
| To: Tielman Koekemoer <tielman@spi.co.za> |
| Subject: Re: Please help. |
| In-reply-to: Your message of Fri, 09 Jul 1999 09:16:14 PDT. |
| Date: Fri, 09 Jul 1999 00:27:20 PDT |
| From: Vern Paxson <vern> |
| |
| > First I removed .bootstrap (and ran make) - no luck. I downloaded the |
| > software but I still have the same problem. Is there anything else I |
| > could try. |
| |
| Try: |
| |
| cp initscan.c scan.c |
| touch scan.c |
| make scan.o |
| |
| If this last tries to first build scan.c from scan.l using ./flex, then |
| your "make" is broken, in which case compile scan.c to scan.o by hand. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-97 |
| @unnumberedsec unnamed-faq-97 |
| @example |
| @verbatim |
| To: Sumanth Kamenani <skamenan@crl.nmsu.edu> |
| Subject: Re: Error |
| In-reply-to: Your message of Mon, 19 Jul 1999 23:08:41 PDT. |
| Date: Tue, 20 Jul 1999 00:18:26 PDT |
| From: Vern Paxson <vern> |
| |
| > I am getting a compilation error. The error is given as "unknown symbol- yylex". |
| |
| The parser relies on calling yylex(), but you're instead using the C++ scanning |
| class, so you need to supply a yylex() "glue" function that calls yylex() on an |
| instance of the scanner (e.g., "scanner->yylex()"). |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-98 |
| @unnumberedsec unnamed-faq-98 |
| @example |
| @verbatim |
| To: daniel@synchrods.synchrods.COM (Daniel Senderowicz) |
| Subject: Re: lex |
| In-reply-to: Your message of Mon, 22 Nov 1999 11:19:04 PST. |
| Date: Tue, 23 Nov 1999 15:54:30 PST |
| From: Vern Paxson <vern> |
| |
| Well, your problem is the |
| |
| switch (yybgin-yysvec-1) { /* witchcraft */ |
| |
| at the beginning of lex rules. "witchcraft" == "non-portable". It's |
| assuming knowledge of the AT&T lex's internal variables. |
| |
| For flex, you can probably do the equivalent using a switch on YYSTATE. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-99 |
| @unnumberedsec unnamed-faq-99 |
| @example |
| @verbatim |
| To: archow@hss.hns.com |
| Subject: Re: Regarding distribution of flex and yacc based grammars |
| In-reply-to: Your message of Sun, 19 Dec 1999 17:50:24 +0530. |
| Date: Wed, 22 Dec 1999 01:56:24 PST |
| From: Vern Paxson <vern> |
| |
| > When we provide the customer with an object code distribution, is it |
| > necessary for us to provide source |
| > for the generated C files from flex and bison since they are generated by |
| > flex and bison ? |
| |
| For flex, no. I don't know what the current state of this is for bison. |
| |
| > Also, is there any requrirement for us to neccessarily provide source for |
| > the grammar files which are fed into flex and bison ? |
| |
| Again, for flex, no. |
| |
| See the file "COPYING" in the flex distribution for the legalese. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-100 |
| @unnumberedsec unnamed-faq-100 |
| @example |
| @verbatim |
| To: Martin Gallwey <gallweym@hyperion.moe.ul.ie> |
| Subject: Re: Flex, and self referencing rules |
| In-reply-to: Your message of Sun, 20 Feb 2000 01:01:21 PST. |
| Date: Sat, 19 Feb 2000 18:33:16 PST |
| From: Vern Paxson <vern> |
| |
| > However, I do not use unput anywhere. I do use self-referencing |
| > rules like this: |
| > |
| > UnaryExpr ({UnionExpr})|("-"{UnaryExpr}) |
| |
| You can't do this - flex is *not* a parser like yacc (which does indeed |
| allow recursion), it is a scanner that's confined to regular expressions. |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @c TODO: Evaluate this faq. |
| @node unnamed-faq-101 |
| @unnumberedsec unnamed-faq-101 |
| @example |
| @verbatim |
| To: slg3@lehigh.edu (SAMUEL L. GULDEN) |
| Subject: Re: Flex problem |
| In-reply-to: Your message of Thu, 02 Mar 2000 12:29:04 PST. |
| Date: Thu, 02 Mar 2000 23:00:46 PST |
| From: Vern Paxson <vern> |
| |
| If this is exactly your program: |
| |
| > digit [0-9] |
| > digits {digit}+ |
| > whitespace [ \t\n]+ |
| > |
| > %% |
| > "[" { printf("open_brac\n");} |
| > "]" { printf("close_brac\n");} |
| > "+" { printf("addop\n");} |
| > "*" { printf("multop\n");} |
| > {digits} { printf("NUMBER = %s\n", yytext);} |
| > whitespace ; |
| |
| then the problem is that the last rule needs to be "{whitespace}" ! |
| |
| Vern |
| @end verbatim |
| @end example |
| |
| @node What is the difference between YYLEX_PARAM and YY_DECL? |
| @unnumberedsec What is the difference between YYLEX_PARAM and YY_DECL? |
| |
| @code{YYLEX_PARAM} is not a flex symbol. It belongs to Bison, and tells Bison to pass extra |
| parameters when it calls @code{yylex()} from the parser. |
| |
| @code{YY_DECL} is the macro flex uses to declare @code{yylex}. The default is similar to this: |
| |
| @example |
| @verbatim |
| #define YY_DECL int yylex () |
| @end verbatim |
| @end example |
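| |
| Because @code{YY_DECL} is an ordinary preprocessor macro, the same text can |
| serve both as the prototype the parser sees and as the header of the function |
| definition. The following plain C sketch imitates that arrangement without |
| running flex or Bison. The extra parameter @code{lvalp}, the token code and the |
| function body are assumptions made for illustration; in a real scanner the body |
| is generated by flex. |
| |
| @example |
| @verbatim |
```c
#include <assert.h>
#include <stddef.h>

/* Override the default declaration so that yylex takes one extra
   argument. In a real project this #define would sit where both the
   scanner and the parser can see it. */
#define YY_DECL int yylex(int *lvalp)

/* What the parser needs: a prototype that matches YY_DECL exactly. */
YY_DECL;

/* Stand-in for the generated scanner, which defines the function as
   "YY_DECL { ... }". This hypothetical body yields one token, then 0. */
YY_DECL
{
    static int calls = 0;

    assert(lvalp != NULL);
    if (calls++ > 0)
        return 0;      /* 0 tells the parser that the input has ended */
    *lvalp = 42;       /* semantic value of the token */
    return 258;        /* arbitrary token code chosen for this sketch */
}
```
| @end verbatim |
| @end example |
| |
| If the prototype is written by hand and differs from @code{YY_DECL}, the |
| compiler reports conflicting declarations for @code{yylex}. |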
| |
| |
| @node Why do I get "conflicting types for yylex" error? |
| @unnumberedsec Why do I get "conflicting types for yylex" error? |
| |
| This is a compiler error regarding a generated Bison parser, not a Flex scanner. |
| It means you need a prototype of @code{yylex()} at the top of the Bison file. |
| Be sure the prototype matches @code{YY_DECL}. |
| |
| @node How do I access the values set in a Flex action from within a Bison action? |
| @unnumberedsec How do I access the values set in a Flex action from within a Bison action? |
| |
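| Use @code{$1}, @code{$2}, @code{$3}, etc. These are called ``semantic values'' in the Bison manual. |
| The scanner supplies each value by assigning to @code{yylval} before it returns the token. |
| See @ref{Top, , , bison, the GNU Bison Manual}. |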
| |
| @node Appendices, Indices, FAQ, Top |
| @appendix Appendices |
| |
| @menu |
| * Makefiles and Flex:: |
| * Bison Bridge:: |
| * M4 Dependency:: |
| * Common Patterns:: |
| @end menu |
| |
| @node Makefiles and Flex, Bison Bridge, Appendices, Appendices |
| @appendixsec Makefiles and Flex |
| |
| @cindex Makefile, syntax |
| |
| In this appendix, we provide tips for writing Makefiles to build your scanners. |
| |
| In a traditional build environment, we say that the @file{.c} files are the |
| sources, and the @file{.o} files are the intermediate files. When using |
| @code{flex}, however, the @file{.l} files are the sources, and the generated |
| @file{.c} files (along with the @file{.o} files) are the intermediate files. |
| This requires you to carefully plan your Makefile. |
| |
| Modern @command{make} programs understand that @file{foo.l} is intended to |
| generate @file{lex.yy.c} or @file{foo.c}, and will behave |
| accordingly@footnote{GNU @command{make} and GNU @command{automake} are two such |
| programs that provide implicit rules for flex-generated scanners.}@footnote{GNU @command{automake} |
| may generate code to execute flex in lex-compatible mode, or to stdout. If this is not what you want, |
| then you should provide an explicit rule in your Makefile.am}. The |
| following Makefile does not explicitly instruct @command{make} how to build |
| @file{scan.c} from @file{scan.l}. Instead, it relies on the implicit rules of the |
| @command{make} program to build the intermediate file, @file{scan.c}: |
| |
| @cindex Makefile, example of implicit rules |
| @example |
| @verbatim |
| # Basic Makefile -- relies on implicit rules |
| # Creates "myprogram" from "scan.l" and "myprogram.c" |
| # |
| LEX=flex |
| myprogram: scan.o myprogram.o |
| scan.o: scan.l |
| |
| @end verbatim |
| @end example |
| |
| |
| For simple cases, the above may be sufficient. For other cases, |
| you may have to explicitly instruct @command{make} how to build your scanner. |
| The following is an example of a Makefile containing explicit rules: |
| |
| @cindex Makefile, explicit example |
| @example |
| @verbatim |
| # Basic Makefile -- provides explicit rules |
| # Creates "myprogram" from "scan.l" and "myprogram.c" |
| # |
| LEX=flex |
| myprogram: scan.o myprogram.o |
| $(CC) -o $@ $(LDFLAGS) $^ |
| |
| myprogram.o: myprogram.c |
| $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ |
| |
| scan.o: scan.c |
| $(CC) $(CPPFLAGS) $(CFLAGS) -o $@ -c $^ |
| |
| scan.c: scan.l |
| $(LEX) $(LFLAGS) -o $@ $^ |
| |
| clean: |
| $(RM) *.o scan.c |
| |
| @end verbatim |
| @end example |
| |
| Notice in the above example that @file{scan.c} is in the @code{clean} target. |
| This is because we consider the file @file{scan.c} to be an intermediate file. |
| |
| Finally, we provide a realistic example of a @code{flex} scanner used with a |
| @code{bison} parser@footnote{This example also applies to yacc parsers.}. |
| There is a tricky problem we have to deal with. Since a @code{flex} scanner |
| will typically include a header file (e.g., @file{y.tab.h}) generated by the |
| parser, we need to be sure that the header file is generated BEFORE the scanner |
| is compiled. We handle this case in the following example: |
| |
| @example |
| @verbatim |
| # Makefile example -- scanner and parser. |
| # Creates "myprogram" from "scan.l", "parse.y", and "myprogram.c" |
| # |
| LEX = flex |
| YACC = bison -y |
| YFLAGS = -d |
| objects = scan.o parse.o myprogram.o |
| |
| myprogram: $(objects) |
| scan.o: scan.l parse.c |
| parse.o: parse.y |
| myprogram.o: myprogram.c |
| |
| @end verbatim |
| @end example |
| |
| In the above example, notice the line |
| |
| @example |
| @verbatim |
| scan.o: scan.l parse.c |
| @end verbatim |
| @end example |
| |
| @noindent |
| which lists the file @file{parse.c} (the generated parser) as a dependency of |
| @file{scan.o}. This forces @command{make} to run the parser generator before it |
| compiles the scanner, and since @code{YFLAGS} contains @samp{-d}, the header |
| file is written in that same run. Implicit rules differ between implementations |
| of @command{make}, so check that yours behaves this way. |
| |
| |
| For more details on writing Makefiles, see @ref{Top, , , make, The |
| GNU Make Manual}. |
| |
| @node Bison Bridge, M4 Dependency, Makefiles and Flex, Appendices |
| @section C Scanners with Bison Parsers |
| |
| @cindex bison, bridging with flex |
| @vindex yylval |
| @vindex yylloc |
| @tindex YYLTYPE |
| @tindex YYSTYPE |
| |
| This section describes the @code{flex} features useful when integrating |
| @code{flex} with @code{GNU bison}@footnote{The features described here are |
| purely optional, and are by no means the only way to use flex with bison. |
| We merely provide some glue to ease development of your parser-scanner pair.}. |
| Skip this section if you are not using |
| @code{bison} with your scanner. Here we discuss only the @code{flex} |
| half of the @code{flex} and @code{bison} pair. We do not discuss |
| @code{bison} in any detail. For more information about generating |
| @code{bison} parsers, see @ref{Top, , , bison, the GNU Bison Manual}. |
| |
| A @code{bison}-compatible scanner is generated by declaring @samp{%option |
| bison-bridge} or by supplying @samp{--bison-bridge} when invoking @code{flex} |
| from the command line. This instructs @code{flex} that the macro |
| @code{yylval} may be used. The data type for |
| @code{yylval}, @code{YYSTYPE}, |
| is typically defined in a header file, included in section 1 of the |
| @code{flex} input file. For a list of the functions and macros |
| available, see @ref{bison-functions}. |
| |
| The declaration of yylex becomes, |
| |
| @findex yylex (reentrant version) |
| @example |
| @verbatim |
| int yylex ( YYSTYPE * lvalp, yyscan_t scanner ); |
| @end verbatim |
| @end example |
| |
| If @code{%option bison-locations} is specified, then the declaration |
| becomes, |
| |
| @findex yylex (reentrant version) |
| @example |
| @verbatim |
| int yylex ( YYSTYPE * lvalp, YYLTYPE * llocp, yyscan_t scanner ); |
| @end verbatim |
| @end example |
| |
| Note that the macros @code{yylval} and @code{yylloc} evaluate to pointers. |
| Support for @code{yylloc} is optional in @code{bison}, so it is optional in |
| @code{flex} as well. The following is an example of a @code{flex} scanner that |
| is compatible with @code{bison}. |
| |
| @cindex bison, scanner to be called from bison |
| @example |
| @verbatim |
| /* Scanner for "C" assignment statements... sort of. */ |
| %{ |
| #include <stdlib.h> /* atoi */ |
| #include <string.h> /* strdup */ |
| #include "y.tab.h" /* Generated by bison. */ |
| %} |
| |
| %option bison-bridge bison-locations |
| %% |
| |
| [[:digit:]]+ { yylval->num = atoi(yytext); return NUMBER;} |
| [[:alnum:]]+ { yylval->str = strdup(yytext); return STRING;} |
| "="|";" { return yytext[0];} |
| . {} |
| %% |
| @end verbatim |
| @end example |
| |
| As you can see, there really is no magic here. We just use |
| @code{yylval} as we would any other variable. The data type of |
| @code{yylval} is generated by @code{bison}, and included in the file |
| @file{y.tab.h}. Here is the corresponding @code{bison} parser: |
| |
| @cindex bison, parser |
| @example |
| @verbatim |
| /* Parser to convert "C" assignments to lisp. */ |
| %{ |
| /* Pass the argument to yyparse through to yylex. */ |
| #define YYPARSE_PARAM scanner |
| #define YYLEX_PARAM scanner |
| %} |
| %locations |
| %pure_parser |
| %union { |
| int num; |
| char* str; |
| } |
| %token <str> STRING |
| %token <num> NUMBER |
| %% |
| assignment: |
| STRING '=' NUMBER ';' { |
| printf( "(setf %s %d)", $1, $3 ); |
| } |
| ; |
| @end verbatim |
| @end example |
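| |
| The pointer convention can be tried out without running either tool. In the |
| sketch below, a hand-written function with the same shape as the bridge |
| declaration of @code{yylex} stands in for the generated scanner. The name |
| @code{fake_yylex}, the token codes, the union and the @code{input} parameter are |
| assumptions made for this illustration; real code takes @code{YYSTYPE} and the |
| token codes from @file{y.tab.h} and uses @code{yylval} inside actions. |
| |
| @example |
| @verbatim |
```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical stand-ins for what y.tab.h would provide. */
enum { NUMBER = 258, OTHER = 259 };
typedef union { int num; char *str; } YYSTYPE;

/* Same idea as the bridge prototype: the semantic value is written
   through a pointer supplied by the caller. The yyscan_t argument is
   replaced here by a cursor into a string, which is advanced past the
   token that was recognized. */
int fake_yylex(YYSTYPE *lvalp, const char **input)
{
    const char *p = *input;

    assert(lvalp != NULL);
    while (*p == ' ')
        p++;                                   /* skip blanks */
    if (*p == '\0') {
        *input = p;
        return 0;                              /* end of input */
    }
    if (isdigit((unsigned char)*p)) {
        char *end;
        lvalp->num = (int)strtol(p, &end, 10); /* like yylval->num = ... */
        *input = end;
        return NUMBER;
    }
    *input = p + 1;                            /* any other single character */
    return OTHER;
}
```
| @end verbatim |
| @end example |
| |
| The caller owns the @code{YYSTYPE} object and passes its address, which is why |
| actions in a bridge scanner write @code{yylval->num} rather than |
| @code{yylval.num}. |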
| |
| @node M4 Dependency, Common Patterns, Bison Bridge, Appendices |
| @section M4 Dependency |
| @cindex m4 |
| The macro processor @code{m4}@footnote{The use of m4 is subject to change in |
| future revisions of flex. It is not part of the public API of flex. Do not depend on it.} |
| must be installed wherever flex is installed. |
| @code{flex} invokes @samp{m4}, found by searching the directories in the |
| @code{PATH} environment variable. Any code you place in section 1 or in the |
| actions will be sent through m4. Please follow these rules to protect your |
| code from unwanted @code{m4} processing. |
| |
| @itemize |
| |
| @item Do not use symbols that begin with @samp{m4_}, such as @samp{m4_define} |
| or @samp{m4_include}, since those are reserved for @code{m4} macro names. If for |
| some reason you need @samp{m4_} as a prefix, use a preprocessor @code{#define} to get your |
| symbol past @code{m4} unmangled. |
| |
| @item Do not use the strings @samp{[[} or @samp{]]} anywhere in your code. The |
| former is not valid in C, except within comments and strings, but the latter is valid in |
| code such as @code{x[y[z]]}. The solution is simple. To get the literal string |
| @code{"]]"}, use @code{"]""]"}. To get the array notation @code{x[y[z]]}, |
| use @code{x[y[z] ]}. Flex will attempt to detect these sequences in user code, and |
| escape them. However, it's best to avoid this complexity where possible, by |
| removing such sequences from your code. |
| |
| @end itemize |
| |
| @code{m4} is only required at the time you run @code{flex}. The generated |
| scanner is ordinary C or C++, and does @emph{not} require @code{m4}. |
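| |
| The two workarounds for @samp{]]} change only the spelling of the source, not |
| its meaning. The following C fragment is a small sketch of both; the function |
| names are made up for the example. |
| |
| @example |
| @verbatim |
```c
#include <assert.h>
#include <string.h>

/* Returns the two-character string "]" "]" without the source text
   ever containing two adjacent closing brackets: the C compiler
   concatenates adjacent string literals. */
const char *closing_brackets(void)
{
    return "]""]";
}

/* Nested indexing written with a space between the closing brackets.
   White space between tokens is not significant in C, so this is the
   same expression as the unspaced form. */
int nested_lookup(const int *x, const int *y, int z)
{
    assert(x != NULL && y != NULL);
    return x[y[z] ];
}
```
| @end verbatim |
| @end example |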
| |
| @node Common Patterns, ,M4 Dependency, Appendices |
| @section Common Patterns |
| @cindex patterns, common |
| |
| This appendix provides examples of common regular expressions you might use |
| in your scanner. |
| |
| @menu |
| * Numbers:: |
| * Identifiers:: |
| * Quoted Constructs:: |
| * Addresses:: |
| @end menu |
| |
| |
| @node Numbers, Identifiers, ,Common Patterns |
| @subsection Numbers |
| |
| @table @asis |
| |
| @item C99 decimal constant |
| @code{([[:digit:]]@{-@}[0])[[:digit:]]*} |
| |
| @item C99 hexadecimal constant |
| @code{0[xX][[:xdigit:]]+} |
| |
| @item C99 octal constant |
| @code{0[01234567]*} |
| |
| @item C99 floating point constant |
| @verbatim |
| dseq ([[:digit:]]+) |
| dseq_opt ([[:digit:]]*) |
| frac (({dseq_opt}"."{dseq})|{dseq}".") |
| exp ([eE][+-]?{dseq}) |
| exp_opt ({exp}?) |
| fsuff [flFL] |
| fsuff_opt ({fsuff}?) |
| hpref (0[xX]) |
| hdseq ([[:xdigit:]]+) |
| hdseq_opt ([[:xdigit:]]*) |
| hfrac (({hdseq_opt}"."{hdseq})|({hdseq}".")) |
| bexp ([pP][+-]?{dseq}) |
| dfc (({frac}{exp_opt}{fsuff_opt})|({dseq}{exp}{fsuff_opt})) |
| hfc (({hpref}{hfrac}{bexp}{fsuff_opt})|({hpref}{hdseq}{bexp}{fsuff_opt})) |
| |
| c99_floating_point_constant ({dfc}|{hfc}) |
| @end verbatim |
| |
| See C99 section 6.4.4.2 for the gory details. |
| |
| @end table |
| |
| @node Identifiers, Quoted Constructs, Numbers, Common Patterns |
| @subsection Identifiers |
| |
| @table @asis |
| |
| @item C99 Identifier |
| @verbatim |
| ucn ((\\u([[:xdigit:]]{4}))|(\\U([[:xdigit:]]{8}))) |
| nondigit [_[:alpha:]] |
| c99_id ([_[:alpha:]]|{ucn})([_[:alnum:]]|{ucn})* |
| @end verbatim |
| |
| Technically, the above pattern does not encompass all possible C99 identifiers, since C99 allows for |
| "implementation-defined" characters. In practice, C compilers follow the above pattern, with the |
| addition of the @samp{$} character. |
| |
| @item UTF-8 Encoded Unicode Code Point |
| @verbatim |
| [\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF]([\x80-\xBF]{2})|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF]([\x80-\xBF]{2})|[\xF1-\xF3]([\x80-\xBF]{3})|\xF4[\x80-\x8F]([\x80-\xBF]{2}) |
| @end verbatim |
| |
| @end table |
| |
| @node Quoted Constructs, Addresses, Identifiers, Common Patterns |
| @subsection Quoted Constructs |
| |
| @table @asis |
| @item C99 String Literal |
| @code{L?\"([^\"\\\n]|(\\['\"?\\abfnrtv])|(\\([01234567]@{1,3@}))|(\\x[[:xdigit:]]+)|(\\u([[:xdigit:]]@{4@}))|(\\U([[:xdigit:]]@{8@})))*\"} |
| |
| @item C99 Comment |
| @code{("/*"([^*]|"*"+[^*/])*"*"+"/")|("/"(\\\n)*"/"[^\n]*)} |
| |
| Note that in C99, a @samp{//}-style comment may be split across lines, and, contrary to popular belief, |
| does not include the trailing @samp{\n} character. |
| |
| A better way to scan @samp{/* */} comments is by line, rather than matching |
| possibly huge comments all at once. This will allow you to scan comments of |
| unlimited length, as long as line breaks appear at sane intervals. This is also |
| more efficient when used with automatic line number processing. @xref{option-yylineno}. |
| |
| @verbatim |
| <INITIAL>{ |
| "/*" BEGIN(COMMENT); |
| } |
| <COMMENT>{ |
| "*/" BEGIN(INITIAL); |
| [^*\n]+ ; |
| "*" ; |
| \n ; |
| } |
| @end verbatim |
| |
| @end table |
| |
| @node Addresses, ,Quoted Constructs, Common Patterns |
| @subsection Addresses |
| |
| @table @asis |
| |
| @item IPv4 Address |
| @verbatim |
| dec-octet [0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5] |
| IPv4address {dec-octet}\.{dec-octet}\.{dec-octet}\.{dec-octet} |
| @end verbatim |
| |
| @item IPv6 Address |
| @verbatim |
| h16 [0-9A-Fa-f]{1,4} |
| ls32 {h16}:{h16}|{IPv4address} |
| IPv6address ({h16}:){6}{ls32}| |
| ::({h16}:){5}{ls32}| |
| ({h16})?::({h16}:){4}{ls32}| |
| (({h16}:){0,1}{h16})?::({h16}:){3}{ls32}| |
| (({h16}:){0,2}{h16})?::({h16}:){2}{ls32}| |
| (({h16}:){0,3}{h16})?::{h16}:{ls32}| |
| (({h16}:){0,4}{h16})?::{ls32}| |
| (({h16}:){0,5}{h16})?::{h16}| |
| (({h16}:){0,6}{h16})?:: |
| @end verbatim |
| |
| See @uref{http://www.ietf.org/rfc/rfc2373.txt, RFC 2373} for details. |
| Note that you have to fold the definition of @code{IPv6address} into one |
| line and that it also matches the ``unspecified address'' ``::''. |
| |
| @item URI |
| @code{(([^:/?#]+):)?("//"([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?} |
| |
| This pattern is nearly useless, since it allows just about any character |
| to appear in a URI, including spaces and control characters. See |
| @uref{http://www.ietf.org/rfc/rfc2396.txt, RFC 2396} for details. |
| |
| @end table |
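| |
| Patterns like these are easy to get subtly wrong, so it helps to try them |
| against sample input. Running them through @code{flex} is the real test, but the |
| IPv4 definitions above use only syntax that POSIX extended regular expressions |
| share with @code{flex}, so they can also be exercised with @code{regcomp}. In |
| this sketch the definition is expanded by hand, the anchors @samp{^} and |
| @samp{$} are added so that the whole string must match, and the helper name |
| @code{is_ipv4} is made up. |
| |
| @example |
| @verbatim |
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* dec-octet from the definitions above, wrapped in parentheses so that
   the alternation stays grouped when it is pasted into a larger pattern. */
#define DEC_OCTET "([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"

/* Return 1 if s is, in its entirety, a dotted-quad IPv4 address. */
int is_ipv4(const char *s)
{
    static const char pattern[] =
        "^" DEC_OCTET "\\." DEC_OCTET "\\." DEC_OCTET "\\." DEC_OCTET "$";
    regex_t re;
    int rc;

    assert(s != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);                  /* the pattern is a fixed, valid ERE */
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```
| @end verbatim |
| @end example |
| |
| Unlike this helper, a @code{flex} rule is not anchored: it matches the longest |
| prefix of the remaining input, so surrounding context has to be handled by |
| other rules. |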
| |
| |
| @node Indices, , Appendices, Top |
| @unnumbered Indices |
| |
| @menu |
| * Concept Index:: |
| * Index of Functions and Macros:: |
| * Index of Variables:: |
| * Index of Data Types:: |
| * Index of Hooks:: |
| * Index of Scanner Options:: |
| @end menu |
| |
| @node Concept Index, Index of Functions and Macros, Indices, Indices |
| @unnumberedsec Concept Index |
| |
| @printindex cp |
| |
| @node Index of Functions and Macros, Index of Variables, Concept Index, Indices |
| @unnumberedsec Index of Functions and Macros |
| |
| This is an index of functions and preprocessor macros that look like functions. |
| For macros that expand to variables or constants, see @ref{Index of Variables}. |
| |
| @printindex fn |
| |
| @node Index of Variables, Index of Data Types, Index of Functions and Macros, Indices |
| @unnumberedsec Index of Variables |
| |
| This is an index of variables, constants, and preprocessor macros |
| that expand to variables or constants. |
| |
| @printindex vr |
| |
| @node Index of Data Types, Index of Hooks, Index of Variables, Indices |
| @unnumberedsec Index of Data Types |
| @printindex tp |
| |
| @node Index of Hooks, Index of Scanner Options, Index of Data Types, Indices |
| @unnumberedsec Index of Hooks |
| |
| This is an index of ``hooks'' that the user may define. These hooks typically correspond |
| to specific locations in the generated scanner, and may be used to insert arbitrary code. |
| |
| @printindex hk |
| |
| @node Index of Scanner Options, , Index of Hooks, Indices |
| @unnumberedsec Index of Scanner Options |
| |
| @printindex op |
| |
| @c A vim script to name the faq entries. delete this when faqs are no longer |
| @c named "unnamed-faq-XXX". |
| @c |
| @c fu! Faq2 () range abort |
| @c let @r=input("Rename to: ") |
| @c exe "%s/" . @w . "/" . @r . "/g" |
| @c normal 'f |
| @c endf |
| @c nnoremap <F5> 1G/@node\s\+unnamed-faq-\d\+<cr>mfww"wy5ezt:call Faq2()<cr> |
| |
| @bye |