| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
| |
| <!--Converted with LaTeX2HTML 99.2beta8 (1.43) |
| original version by: Nikos Drakos, CBLU, University of Leeds |
| * revised and updated by: Marcus Hennecke, Ross Moore, Herb Swan |
| * with significant contributions from: |
| Jens Lippmann, Marek Rouchal, Martin Wilck and others --> |
| <HTML> |
| <HEAD> |
| <TITLE>JFlex User's Manual</TITLE> |
| <META NAME="description" CONTENT="JFlex User's Manual"> |
| <META NAME="keywords" CONTENT="manual"> |
| <META NAME="resource-type" CONTENT="document"> |
| <META NAME="distribution" CONTENT="global"> |
| |
| <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
| <META NAME="Generator" CONTENT="LaTeX2HTML v99.2beta8"> |
| <META HTTP-EQUIV="Content-Style-Type" CONTENT="text/css"> |
| |
| <LINK REL="STYLESHEET" HREF="manual.css"> |
| |
| </HEAD> |
| |
| <BODY > |
| |
| <P> |
| |
| <CENTER> |
| <A NAME="TOP"></a> |
| <A HREF="http://www.jflex.de"><IMG SRC="logo.gif" BORDER=0 HEIGHT=223 WIDTH=577></a> |
| </CENTER> |
| |
| <P> |
| <DIV ALIGN="CENTER"> |
| <I><FONT SIZE="+2">The Fast Lexical Analyser Generator</FONT> |
| <BR></I></DIV> |
| <P></P> |
| <DIV ALIGN="CENTER"></DIV> |
| <P></P> |
| <DIV ALIGN="CENTER"><I>Copyright ©1998-2004 by <A NAME="tex2html1" |
| HREF="http://www.doclsf.de">Gerwin Klein</A> |
| <BR></I></DIV> |
| <P><P><BR> |
| <DIV ALIGN="CENTER"><I><FONT SIZE="+4"><I><B>JFlex User's Manual</B></I></FONT> |
| <BR></I></DIV> |
| <P><P><BR> |
| <DIV ALIGN="CENTER"><I>Version 1.4, April 12, 2004 |
| |
| </I></DIV> |
| |
| <P> |
| <BR> |
| |
| <H2><A NAME="SECTION00010000000000000000"> |
| Contents</A> |
| </H2> |
| <!--Table of Contents--> |
| |
| <UL> |
| <LI><A NAME="tex2html80" |
| HREF="manual.html">Contents</A> |
| <LI><A NAME="tex2html81" |
| HREF="manual.html#SECTION00020000000000000000">Introduction</A> |
| <UL> |
| <LI><A NAME="tex2html82" |
| HREF="manual.html#SECTION00021000000000000000">Design goals</A> |
| <LI><A NAME="tex2html83" |
| HREF="manual.html#SECTION00022000000000000000">About this manual</A> |
| </UL> |
| <LI><A NAME="tex2html84" |
| HREF="manual.html#SECTION00030000000000000000">Installing and Running JFlex</A> |
| <UL> |
| <LI><A NAME="tex2html85" |
| HREF="manual.html#SECTION00031000000000000000">Installing JFlex</A> |
| <LI><A NAME="tex2html86" |
| HREF="manual.html#SECTION00032000000000000000">Running JFlex</A> |
| </UL> |
| <LI><A NAME="tex2html87" |
| HREF="manual.html#SECTION00040000000000000000">A simple Example: How to work with JFlex</A> |
| <UL> |
| <LI><A NAME="tex2html88" |
| HREF="manual.html#SECTION00041000000000000000">Code to include</A> |
| <LI><A NAME="tex2html89" |
| HREF="manual.html#SECTION00042000000000000000">Options and Macros</A> |
| <LI><A NAME="tex2html90" |
| HREF="manual.html#SECTION00043000000000000000">Rules and Actions</A> |
| <LI><A NAME="tex2html91" |
| HREF="manual.html#SECTION00044000000000000000">How to get it going</A> |
| </UL> |
| <LI><A NAME="tex2html92" |
| HREF="manual.html#SECTION00050000000000000000">Lexical Specifications</A> |
| <UL> |
| <LI><A NAME="tex2html93" |
| HREF="manual.html#SECTION00051000000000000000">User code</A> |
| <LI><A NAME="tex2html94" |
| HREF="manual.html#SECTION00052000000000000000">Options and declarations</A> |
| <LI><A NAME="tex2html95" |
| HREF="manual.html#SECTION00053000000000000000">Lexical rules</A> |
| </UL> |
| <LI><A NAME="tex2html96" |
| HREF="manual.html#SECTION00060000000000000000">Encodings, Platforms, and Unicode</A> |
| <UL> |
| <LI><A NAME="tex2html97" |
| HREF="manual.html#SECTION00061000000000000000">The Problem</A> |
| <LI><A NAME="tex2html98" |
| HREF="manual.html#SECTION00062000000000000000">Scanning text files</A> |
| <LI><A NAME="tex2html99" |
| HREF="manual.html#SECTION00063000000000000000">Scanning binaries</A> |
| </UL> |
| <LI><A NAME="tex2html100" |
| HREF="manual.html#SECTION00070000000000000000">A few words on performance</A> |
| <UL> |
| <LI><A NAME="tex2html101" |
| HREF="manual.html#SECTION00071000000000000000">Comparison of JLex and JFlex</A> |
| <LI><A NAME="tex2html102" |
| HREF="manual.html#SECTION00072000000000000000">How to write a faster specification</A> |
| </UL> |
| <LI><A NAME="tex2html103" |
| HREF="manual.html#SECTION00080000000000000000">Porting Issues</A> |
| <UL> |
| <LI><A NAME="tex2html104" |
| HREF="manual.html#SECTION00081000000000000000">Porting from JLex</A> |
| <LI><A NAME="tex2html105" |
| HREF="manual.html#SECTION00082000000000000000">Porting from lex/flex</A> |
| </UL> |
| <LI><A NAME="tex2html106" |
| HREF="manual.html#SECTION00090000000000000000">Working together</A> |
| <UL> |
| <LI><A NAME="tex2html107" |
| HREF="manual.html#SECTION00091000000000000000">JFlex and CUP</A> |
| <LI><A NAME="tex2html108" |
| HREF="manual.html#SECTION00092000000000000000">JFlex and BYacc/J</A> |
| </UL> |
| <LI><A NAME="tex2html109" |
| HREF="manual.html#SECTION000100000000000000000">Bugs and Deficiencies</A> |
| <UL> |
| <LI><A NAME="tex2html110" |
| HREF="manual.html#SECTION000101000000000000000">Deficiencies</A> |
| <LI><A NAME="tex2html111" |
| HREF="manual.html#SECTION000102000000000000000">Bugs</A> |
| </UL> |
| <LI><A NAME="tex2html112" |
| HREF="manual.html#SECTION000110000000000000000">Copying and License</A> |
| <LI><A NAME="tex2html113" |
| HREF="manual.html#SECTION000120000000000000000">Bibliography</A> |
| </UL> |
| <!--End of Table of Contents--> |
| |
| <H1><A NAME="SECTION00020000000000000000"></A><A NAME="Intro"></A><BR> |
| Introduction |
| </H1> |
| JFlex is a lexical analyzer generator for Java<A NAME="tex2html2" |
| HREF="#foot32"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A> written in Java. It is also a rewrite of the very useful tool JLex [<A |
| HREF="manual.html#JLex">3</A>], which |
| was developed by Elliot Berk at Princeton University. As Vern Paxson states |
| for his C/C++ tool flex [<A |
| HREF="manual.html#flex">11</A>]: they do not share any code, though. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00021000000000000000"> |
| Design goals</A> |
| </H2> |
| The main design goals of JFlex are: |
| |
| <UL> |
| <LI><B>Full unicode support</B> |
| </LI> |
| <LI><B>Fast generated scanners </B> |
| </LI> |
| <LI><B>Fast scanner generation</B> |
| </LI> |
| <LI><B>Convenient specification syntax</B> |
| </LI> |
| <LI><B>Platform independence</B> |
| </LI> |
| <LI><B>JLex compatibility</B> |
| </LI> |
| </UL> |
| |
| <P> |
| |
| <H2><A NAME="SECTION00022000000000000000"> |
| About this manual</A> |
| </H2> |
| This manual gives a brief but complete description of the tool JFlex. It |
| assumes that you are familiar with the issue of lexical analysis. The references [<A |
| HREF="manual.html#Aho">1</A>], |
| [<A |
| HREF="manual.html#Appel">2</A>], and [<A |
| HREF="manual.html#Maurer">13</A>] provide a good introduction to this topic. |
| |
| <P> |
| The next section of this manual describes <A HREF="manual.html#Installing"><I>installation procedures</I></A> |
| for JFlex. If you have never worked with JLex or |
| just want to compare a JLex and a JFlex scanner specification, you |
| should also read <A HREF="manual.html#Example"><I>Working with JFlex - an example</I></A> |
| (section <A HREF="manual.html#Example"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). All options and the complete |
| specification syntax are presented in |
| <A HREF="manual.html#Specifications"><I>Lexical specifications</I></A> (section <A HREF="manual.html#Specifications"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>); |
| <A HREF="manual.html#sec:encodings"><I>Encodings, Platforms, and Unicode</I></A> (section <A HREF="manual.html#sec:encodings"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>) |
| provides information about scanning text vs. binary files. |
| If you are interested in performance |
| considerations and comparing JLex with JFlex speed, |
| <A HREF="manual.html#performance"><I>a few words on performance</I></A> (section <A HREF="manual.html#performance"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>) |
| might be just right for you. Those who want to |
| use their old JLex specifications may want to check out section <A HREF="manual.html#Porting"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| <A HREF="manual.html#Porting"><I>Porting from JLex</I></A> to avoid possible problems |
| with non-portable or non-standard JLex behavior that has been fixed in |
| JFlex. Section <A HREF="manual.html#lexport"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> talks about porting scanners from the |
| Unix tools lex and flex. Interfacing JFlex scanners with the LALR |
| parser generators CUP and BYacc/J is explained in <A HREF="manual.html#WorkingTog"><I>working |
| together</I></A> (section <A HREF="manual.html#WorkingTog"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). Section <A HREF="manual.html#Bugs"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| <A HREF="manual.html#Bugs"><I>Bugs</I></A> gives a list of currently known active bugs. |
| The manual concludes with notes about |
| <A HREF="manual.html#Copyright"><I>Copying and License</I></A> (section <A HREF="manual.html#Copyright"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>) and |
| <A HREF="manual.html#References">references</A>. |
| |
| <P> |
| |
| <H1><A NAME="SECTION00030000000000000000"></A><A NAME="Installing"></A><BR> |
| Installing and Running JFlex |
| </H1> |
| |
| <P> |
| |
| <H2><A NAME="SECTION00031000000000000000"> |
| Installing JFlex</A> |
| </H2> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00031100000000000000"></A><A NAME="install:windows"></A><BR> |
| Windows |
| </H3> |
| To install JFlex on Windows 95/98/NT/XP, follow these three steps: |
| |
| <OL> |
| <LI>Unzip the file you downloaded into the directory you want JFlex in (using |
| something like |
| <A NAME="tex2html3" |
| HREF="http://www.winzip.com">WinZip</A>). |
| If you unzipped it to say <code>C:\</code>, the following directory structure |
| should be generated: |
| |
| <PRE> |
| C:\JFlex\ |
| +--bin\ (start scripts) |
| +--doc\ (FAQ and manual) |
| +--examples\ |
| +--binary\ (scanning binary files) |
| +--byaccj\ (calculator example for BYacc/J) |
| +--cup\ (calculator example for cup) |
| +--interpreter\ (interpreter example for cup) |
| +--java\ (Java lexer specification) |
| +--simple\ (example scanner) |
| +--standalone\ (a simple standalone scanner) |
| +--lib\ (the precompiled classes) |
| +--src\ |
| +--JFlex\ (source code of JFlex) |
| +--JFlex\gui (source code of JFlex UI classes) |
| +--java_cup\runtime\ (source code of cup runtime classes) |
| </PRE> |
| |
| <P> |
| </LI> |
| <LI>Edit the file <B><code>bin\jflex.bat</code></B> |
| (in the example it's <code>C:\JFlex\bin\jflex.bat</code>) |
| such that |
| |
| <P> |
| |
| <UL> |
| <LI><B><TT>JAVA_HOME</TT></B> contains the directory where your Java JDK is installed |
| (for instance <code>C:\java</code>) and |
| </LI> |
| <LI><B><TT>JFLEX_HOME</TT></B> the directory that contains JFlex (in the example: |
| <code>C:\JFlex</code>) |
| </LI> |
| </UL> |
| |
| <P> |
| </LI> |
| <LI>Include the <code>bin\</code> directory of JFlex in your path. |
| (the one that contains the start script, in the example: <code>C:\JFlex\bin</code>). |
| </LI> |
| </OL> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00031200000000000000"> |
| Unix with tar archive</A> |
| </H3> |
| |
| <P> |
| To install JFlex on a Unix system, follow these two steps: |
| |
| <UL> |
| <LI>Uncompress the archive into a directory of your choice |
| with GNU tar, for instance to <TT>/usr/share</TT>: |
| |
| <P> |
| <TT>tar -C /usr/share -xvzf jflex-1.4.tar.gz</TT> |
| |
| <P> |
| (The example is for a site-wide installation, for which you need to |
| be root. A user installation works exactly the |
| same way: just choose a directory where you have write |
| permission.) |
| |
| <P> |
| </LI> |
| <LI>Make a symbolic link from somewhere in your binary |
| path to <TT>bin/jflex</TT>, for instance: |
| |
| <P> |
| <TT>ln -s /usr/share/JFlex/bin/jflex /usr/bin/jflex</TT> |
| |
| <P> |
| If the java interpreter is not in your binary path, you |
| need to supply its location in the script <TT>bin/jflex</TT>. |
| </LI> |
| </UL> |
| |
| <P> |
| You can verify the integrity of the downloaded file with |
| the MD5 checksum available on the <A NAME="tex2html4" |
| HREF="http://www.jflex.de/download.html">JFlex download page</A>. |
| If you put the checksum file in the same directory |
| as the archive, you can run: |
| |
| <P> |
| <code>md5sum --check </code><TT>jflex-1.4.tar.gz.md5</TT> |
| |
| <P> |
| It should tell you |
| |
| <P> |
| <TT>jflex-1.4.tar.gz: OK</TT> |
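| <P> |
| On systems where <TT>md5sum</TT> is not available, the same digest can be computed with the JDK alone. |
| The following is a minimal sketch, not part of the JFlex distribution: the class name <TT>Md5Check</TT> and |
| the method <TT>md5Hex</TT> are made up for this illustration, and the result still has to be compared |
| by hand with the digest in the checksum file. |

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {

    /** Reads the stream to its end and returns its MD5 digest as lowercase hex. */
    public static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.out.println("usage: java Md5Check <archive>");
            return;
        }
        // Print the digest so it can be compared with the published checksum.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(md5Hex(in) + "  " + args[0]);
        }
    }
}
```

| <P> |
| It would be run as <TT>java Md5Check jflex-1.4.tar.gz</TT> in the directory that contains the archive. |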
| |
| <P> |
| |
| <H3><A NAME="SECTION00031300000000000000"> |
| Linux with RPM</A> |
| </H3> |
| |
| <P> |
| |
| <UL> |
| <LI>become root |
| </LI> |
| <LI>issue |
| <BR> <TT>rpm -U jflex-1.4-0.rpm</TT> |
| </LI> |
| </UL> |
| |
| <P> |
| You can verify the integrity of the downloaded <TT>rpm</TT> file with |
| |
| <P> |
| <code>rpm --checksig </code><TT>jflex-1.4-0.rpm</TT> |
| |
| <P> |
| This requires my pgp public key. If you don't have it, you can use |
| |
| <P> |
| <code>rpm --checksig --nopgp </code><TT>jflex-1.4-0.rpm</TT> |
| |
| <P> |
| or you can get it from <A NAME="tex2html5" |
| HREF="http://www.jflex.de/public-key.asc"><TT>http://www.jflex.de/public-key.asc</TT></A>. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00032000000000000000"> |
| Running JFlex</A> |
| </H2> |
| You run JFlex with: |
| |
| <P> |
| <TT>jflex <options> <inputfiles></TT> |
| |
| <P> |
| It is also possible to skip the start script in <code>bin\</code> |
| and include the file <code>lib\JFlex.jar</code> |
| in your <TT>CLASSPATH</TT> environment variable instead. |
| |
| <P> |
| Then you run JFlex with: |
| |
| <P> |
| <TT>java JFlex.Main <options> <inputfiles></TT> |
| |
| <P> |
| The input files and options are in both cases optional. If you don't provide a file name on |
| the command line, JFlex will pop up a window to ask you for one. |
| |
| <P> |
| JFlex knows about the following options: |
| |
| <P> |
| <DL> |
| <DT></DT> |
| <DD><code>-d <directory></code> |
| <BR> writes the generated file to the directory <code><directory></code> |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--skel <file></code> |
| <BR> uses external skeleton <code><file></code>. This is mainly for JFlex |
| maintenance and special low level customizations. Use only when you |
| know what you are doing! JFlex comes with a skeleton file in the |
| <TT>src</TT> directory that reflects exactly the internal, precompiled |
| skeleton and can be used with the <TT>--skel</TT> option. |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--nomin</code> |
| <BR> skip the DFA minimization step during scanner generation. |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--jlex</code> |
| <BR> tries even harder to comply with the JLex interpretation of specs. |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--dot</code> |
| <BR> generate graphviz dot files for the NFA, DFA and minimized |
| DFA. This feature is still in alpha status, and not |
| fully implemented yet. |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--dump</code> |
| <BR> display transition tables of NFA, initial DFA, and minimized DFA |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--verbose</code> or <TT>-v</TT> |
| <BR> display generation progress messages (enabled by default) |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--quiet</code> or <TT>-q</TT> |
| <BR> display error messages only (no chatter about what JFlex is |
| currently doing) |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--time</code> |
| <BR> display time statistics about the code generation process |
| (not very accurate) |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--version</code> |
| <BR> print version number |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--info</code> |
| <BR> print system and JDK information (useful if you'd like |
| to report a problem) |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--pack</code> |
| <BR> use the %pack code generation method by default |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--table</code> |
| <BR> use the %table code generation method by default |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--switch</code> |
| <BR> use the %switch code generation method by default |
| |
| <P> |
| </DD> |
| <DT></DT> |
| <DD><code>--help</code> or <TT>-h</TT> |
| <BR> print a help message explaining options and usage of JFlex. |
| </DD> |
| </DL> |
| |
| <P> |
| |
| <H1><A NAME="SECTION00040000000000000000"></A><A NAME="Example"></A><BR> |
| A simple Example: How to work with JFlex |
| </H1> |
| To demonstrate what a lexical specification with JFlex looks like, this |
| section presents a part of the specification for the Java language. |
| The example does not describe the whole lexical structure of Java programs, |
| but only a small and simplified part of it (some keywords, some operators, |
| comments and only two kinds of literals). It also shows how to interface |
| with the LALR parser generator CUP [<A |
| HREF="manual.html#CUP">8</A>] and therefore |
| uses a class <TT>sym</TT> (generated by CUP), where integer constants for |
| the terminal tokens of the CUP grammar are declared. JFlex comes with a |
| directory <TT>examples</TT>, where you can find a small standalone scanner |
| that doesn't need other tools like CUP to give you a running example. |
| The "<TT>examples</TT>" directory also contains a <EM>complete</EM> JFlex |
| specification of the lexical structure of Java programs together with the |
| CUP parser specification for Java by |
| <A NAME="tex2html6" |
| HREF="mailto:cananian@alumni.princeton.edu">C. Scott Ananian</A>, obtained |
| from the CUP [<A |
| HREF="manual.html#CUP">8</A>] website (it was modified to interface with the JFlex scanner). |
| Both specifications adhere to the Java Language Specification [<A |
| HREF="manual.html#LangSpec">7</A>]. |
| |
| <P> |
| <FONT SIZE="-1"><A NAME="CodeTop"></A></FONT><PRE> |
| /* JFlex example: part of Java language lexer specification */ |
| import java_cup.runtime.*; |
| |
| /** |
| * This class is a simple example lexer. |
| */ |
| %% |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeOptions"></A></FONT><PRE> |
| %class Lexer |
| %unicode |
| %cup |
| %line |
| %column |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeScannerCode"></A></FONT><PRE> |
| %{ |
| StringBuffer string = new StringBuffer(); |
| |
| private Symbol symbol(int type) { |
| return new Symbol(type, yyline, yycolumn); |
| } |
| private Symbol symbol(int type, Object value) { |
| return new Symbol(type, yyline, yycolumn, value); |
| } |
| %} |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeMacros"></A></FONT><PRE> |
| LineTerminator = \r|\n|\r\n |
| InputCharacter = [^\r\n] |
| WhiteSpace = {LineTerminator} | [ \t\f] |
| |
| /* comments */ |
| Comment = {TraditionalComment} | {EndOfLineComment} | {DocumentationComment} |
| |
| TraditionalComment = "/*" [^*] ~"*/" | "/*" "*"+ "/" |
| EndOfLineComment = "//" {InputCharacter}* {LineTerminator} |
| DocumentationComment = "/**" {CommentContent} "*"+ "/" |
| CommentContent = ( [^*] | \*+ [^/*] )* |
| |
| Identifier = [:jletter:] [:jletterdigit:]* |
| |
| DecIntegerLiteral = 0 | [1-9][0-9]* |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeStateDecl"></A></FONT><PRE> |
| %state STRING |
| |
| %% |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeRulesYYINITIAL"></A></FONT><PRE> |
| /* keywords */ |
| <YYINITIAL> "abstract" { return symbol(sym.ABSTRACT); } |
| <YYINITIAL> "boolean" { return symbol(sym.BOOLEAN); } |
| <YYINITIAL> "break" { return symbol(sym.BREAK); } |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeRulesBunch"></A></FONT><PRE> |
| <YYINITIAL> { |
| /* identifiers */ |
| {Identifier} { return symbol(sym.IDENTIFIER); } |
| |
| /* literals */ |
| {DecIntegerLiteral} { return symbol(sym.INTEGER_LITERAL); } |
| \" { string.setLength(0); yybegin(STRING); } |
| |
| /* operators */ |
| "=" { return symbol(sym.EQ); } |
| "==" { return symbol(sym.EQEQ); } |
| "+" { return symbol(sym.PLUS); } |
| |
| /* comments */ |
| {Comment} { /* ignore */ } |
| |
| /* whitespace */ |
| {WhiteSpace} { /* ignore */ } |
| } |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeRulesYYtext"></A></FONT><PRE> |
| <STRING> { |
| \" { yybegin(YYINITIAL); |
| return symbol(sym.STRING_LITERAL, |
| string.toString()); } |
| [^\n\r\"\\]+ { string.append( yytext() ); } |
| \\t { string.append('\t'); } |
| \\n { string.append('\n'); } |
| |
| \\r { string.append('\r'); } |
| \\\" { string.append('\"'); } |
| \\ { string.append('\\'); } |
| } |
| </PRE><FONT SIZE="-1"> |
| <A NAME="CodeRulesAllStates"></A></FONT><PRE> |
| /* error fallback */ |
| .|\n { throw new Error("Illegal character <"+ |
| yytext()+">"); } |
| </PRE> |
| |
| <P> |
| From this specification JFlex generates a <TT>.java</TT> file with one |
| class that contains code for the scanner. The class will have a |
| constructor taking a <TT>java.io.Reader</TT> from which the input is |
| read. The class will also have a function <TT>yylex()</TT> that runs the |
| scanner and that can be used to get the next token from the input (in this |
| example the function actually has the name <TT>next_token()</TT> because |
| the specification uses the <TT>%cup</TT> switch). |
| |
| <P> |
| As with JLex, the specification consists of three parts, divided by <TT>%%</TT>: |
| |
| <UL> |
| <LI><A HREF="manual.html#ExampleUserCode">usercode</A>, |
| </LI> |
| <LI><A HREF="manual.html#ExampleOptions">options and declarations</A> and |
| </LI> |
| <LI><A HREF="manual.html#ExampleLexRules">lexical rules</A>. |
| </LI> |
| </UL> |
| |
| <P> |
| |
| <H2><A NAME="SECTION00041000000000000000"></A><A NAME="ExampleUserCode"></A><BR> |
| Code to include |
| </H2> |
| Let's take a look at the first section, ``user code'': The text up to the |
| first line starting with <TT>%%</TT> is copied verbatim to the top |
| of the generated lexer class (before the actual class declaration). |
| Besides <TT>package</TT> and <TT>import</TT> statements there is usually not much |
| to do here. If the code ends with a javadoc class comment, the generated class |
| will get this comment; if not, JFlex will generate one automatically. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00042000000000000000"></A><A NAME="ExampleOptions"></A><BR> |
| Options and Macros |
| </H2> |
| The second section ``options and declarations'' is more interesting. It consists |
| of a set of options, code that is included inside the generated scanner |
| class, lexical states and macro declarations. Each JFlex option must begin |
| a line of the specification and starts with a <TT>%</TT>. In our example |
| the following options are used: |
| |
| <P> |
| |
| <UL> |
| <LI><TT><A HREF="manual.html#CodeOptions">%class Lexer</A></TT> tells JFlex to give the |
| generated class the name ``Lexer'' and to write the code to a file ``<TT>Lexer.java</TT>''. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeOptions">%unicode</A></TT> defines the set of characters the scanner will |
| work on. For scanning text files, <TT>%unicode</TT> should always be used. See also |
| section <A HREF="manual.html#sec:encodings"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> for more information on character sets, encodings, and |
| scanning text vs. binary files. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeOptions">%cup</A></TT> switches to CUP compatibility |
| mode to interface with a CUP generated parser. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeOptions">%line</A></TT> switches line counting on (the |
| current line number can be accessed via the variable <TT>yyline</TT>) |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeOptions">%column</A></TT> switches column counting on |
| (current column is accessed via <TT>yycolumn</TT>) |
| |
| <P> |
| </LI> |
| </UL> |
| <A NAME="ExampleScannerCode"></A> |
| <P> |
| The code included in <TT><A HREF="manual.html#CodeScannerCode">%{...%}</A></TT> |
| is copied verbatim into the generated lexer class source. |
| Here you can declare member variables and functions that are used |
| inside scanner actions. In our example we declare a <TT>StringBuffer</TT> ``<TT>string</TT>'' |
| in which we will store parts of string literals and two helper functions |
| ``<TT>symbol</TT>'' that create <TT>java_cup.runtime.Symbol</TT> objects |
| with position information of the current token (see section <A HREF="manual.html#CUPWork"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| <A HREF="manual.html#CUPWork"><I>JFlex and CUP</I></A> |
| for how to interface with the parser generator CUP). Like JFlex options, both |
| <code>%{</code> and <code>%}</code> must begin a line. |
| <A NAME="ExampleMacros"></A> |
| <P> |
| The specification continues with macro declarations. Macros are |
| abbreviations for regular expressions, used to make lexical specifications |
| easier to read and understand. A macro declaration |
| consists of a macro identifier followed by <TT>=</TT>, then followed by |
| the regular expression it represents. This regular expression may |
| itself contain macro usages. Although this allows a grammar-like specification |
| style, macros are still just abbreviations and not nonterminals: they |
| cannot be recursive or mutually recursive. Cycles in macro definitions |
| are detected and reported at generation time by JFlex. |
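| <P> |
| To illustrate what "just abbreviations" means, the following sketch expands macro usages textually |
| and rejects cyclic definitions. It is only an illustration under stated assumptions and not JFlex's |
| actual implementation: the class <TT>MacroExpansionSketch</TT>, the method <TT>expand</TT>, the map-based |
| macro table, and the parentheses put around each expansion are all made up for this example. |

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MacroExpansionSketch {

    // A macro usage is written {Name} inside a definition.
    private static final Pattern USAGE = Pattern.compile("\\{([A-Za-z_][A-Za-z_0-9]*)\\}");

    /** Returns the definition of name with all macro usages replaced by their expansions. */
    public static String expand(String name, Map<String, String> macros) {
        return expand(name, macros, new ArrayDeque<>());
    }

    private static String expand(String name, Map<String, String> macros,
                                 Deque<String> inProgress) {
        String definition = macros.get(name);
        if (definition == null) {
            throw new IllegalArgumentException("Undefined macro: " + name);
        }
        // A macro that is reached again while it is still being expanded is part of a cycle.
        if (inProgress.contains(name)) {
            throw new IllegalStateException("Cycle in macro definition: " + name);
        }
        inProgress.push(name);
        Matcher m = USAGE.matcher(definition);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(definition, last, m.start());
            // Parentheses keep the precedence of the expanded expression intact.
            out.append('(').append(expand(m.group(1), macros, inProgress)).append(')');
            last = m.end();
        }
        out.append(definition.substring(last));
        inProgress.pop();
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> macros = new HashMap<>();
        macros.put("LineTerminator", "\\r|\\n|\\r\\n");
        macros.put("WhiteSpace", "{LineTerminator} | [ \\t\\f]");
        System.out.println(expand("WhiteSpace", macros)); // (\r|\n|\r\n) | [ \t\f]

        macros.put("A", "{B}");
        macros.put("B", "{A}");
        try {
            expand("A", macros);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // Cycle in macro definition: A
        }
    }
}
```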
| |
| <P> |
| Here are some of the example macros in more detail: |
| |
| <UL> |
| <LI><TT><A HREF="manual.html#CodeMacros">LineTerminator</A></TT> stands for the regular |
| expression that matches an ASCII CR, an ASCII LF, or a CR followed by an LF. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeMacros">InputCharacter</A></TT> stands for all characters |
| that are not a CR or LF. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeMacros">TraditionalComment</A></TT> is the expression |
| that matches the string <TT>"/*"</TT> followed by a character that is not |
| a <TT>*</TT>, followed by any text up to and including the next occurrence of |
| <TT>"*/"</TT>. Because this alone would not match comments such as <TT>/****/</TT>, the second |
| alternative matches <TT>"/*"</TT> followed by one or more <TT>*</TT> followed by <TT>/</TT>. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeMacros">CommentContent</A></TT> matches zero or more |
| occurrences of either a character that is not a <TT>*</TT>, or one or more |
| <TT>*</TT> followed by a character that is neither a <TT>/</TT> nor a <TT>*</TT>. |
| |
| <P> |
| </LI> |
| <LI><TT><A HREF="manual.html#CodeMacros">Identifier</A></TT> matches each string that |
| starts with a character of class <TT>jletter</TT> followed by zero or more characters |
| of class <TT>jletterdigit</TT>. <TT>jletter</TT> and <TT>jletterdigit</TT> |
| are predefined character classes. <TT>jletter</TT> includes all characters for which |
| the Java function <TT>Character.isJavaIdentifierStart</TT> returns <TT>true</TT> and |
| <TT>jletterdigit</TT> all characters for which <TT>Character.isJavaIdentifierPart</TT> |
| returns <TT>true</TT>. |
| </LI> |
| </UL> |
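| <P> |
| The two predefined classes map directly onto methods of <TT>java.lang.Character</TT>, so the shape of the |
| <TT>Identifier</TT> macro can be checked in plain Java. This is a small sketch: the class |
| <TT>JavaIdentifierCheck</TT> and the method <TT>isIdentifier</TT> are names chosen for this illustration, |
| and the check works on single <TT>char</TT> values. |

```java
public class JavaIdentifierCheck {

    /** True if s has the shape [:jletter:] [:jletterdigit:]* of the Identifier macro. */
    public static boolean isIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("breaker")); // true
        System.out.println(isIdentifier("_tmp1"));   // true
        System.out.println(isIdentifier("1abc"));    // false
    }
}
```

| <P> |
| Note that a keyword such as <TT>break</TT> also has this shape; the macro alone does not exclude keywords. |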
| <A NAME="ExampleStateDecl"></A> |
| <P> |
| The last part of the second section in our |
| lexical specification is a lexical state declaration: |
| <TT><A HREF="manual.html#CodeStateDecl">%state STRING</A></TT> |
| declares a lexical state <TT>STRING</TT> that can be |
| used in the ``lexical rules'' part of the specification. A state declaration |
| is a line starting with <TT>%state</TT> followed by a space- or comma-separated |
| list of state identifiers. There can be more than one line starting |
| with <TT>%state</TT>. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00043000000000000000"></A><A NAME="ExampleLexRules"></A><BR> |
| Rules and Actions |
| </H2> |
| The "lexical rules" section of a JFlex specification contains regular expressions |
| and actions (Java code) that are executed when the scanner matches the |
| associated regular expression. As the scanner reads its input, it keeps |
| track of all regular expressions and activates the action of the expression |
| that has the longest match. With the input "<TT>breaker</TT>", for instance, our specification above |
| would match the regular expression for <TT><A HREF="manual.html#CodeMacros">Identifier</A></TT> |
| and not the keyword "<TT><A HREF="manual.html#CodeRulesYYINITIAL">break</A></TT>" |
| followed by the Identifier "<TT>er</TT>", because rule <code>{Identifier}</code> |
| matches more of this input at once (i.e. it matches all of it) |
| than any other rule in the specification. If two regular expressions both |
| have the longest match for a certain input, the scanner chooses the action |
| of the expression that appears first in the specification. In that way, we |
| get for input "<TT>break</TT>" the keyword "<TT>break</TT>" and not an |
| Identifier "<TT>break</TT>". |
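| <P> |
| The two disambiguation rules (longest match first, then the rule that appears first in the specification) |
| can be illustrated outside JFlex with <TT>java.util.regex</TT>. The following is only a sketch and not how |
| a generated scanner works internally: the class <TT>LongestMatchDemo</TT>, the method <TT>winningRule</TT>, |
| and the ASCII-only pattern standing in for <TT>[:jletter:] [:jletterdigit:]*</TT> are assumptions made for |
| this illustration. |

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {

    // Rules in specification order; the LinkedHashMap preserves that order.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("BREAK", Pattern.compile("break"));
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_][A-Za-z_0-9]*"));
    }

    /** Returns the name of the rule that wins at the start of input, or null if none matches. */
    public static String winningRule(String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input only.
            // The strict comparison means a later rule wins only with a longer match,
            // so ties go to the rule that appears first.
            if (m.lookingAt() && m.end() > bestLength) {
                best = rule.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(winningRule("breaker")); // IDENTIFIER
        System.out.println(winningRule("break"));   // BREAK
    }
}
```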
| |
| <P> |
| In addition to regular expression matches, one can use lexical states to |
| refine a specification. A lexical state acts like a start condition. |
| If the scanner is in lexical state <TT>STRING</TT>, only expressions that |
| are preceded by the start condition <TT><STRING></TT> can be matched. |
| A start condition of a regular expression can contain more than one lexical |
| state. It is then matched when the lexer is in any of these lexical states. |
| The lexical state <TT>YYINITIAL</TT> is predefined and is also the state |
| in which the lexer begins scanning. If a regular expression has no start |
| conditions it is matched in <EM>all</EM> lexical states. |
| <A NAME="ExampleRulesStateBunch"></A> |
| <P> |
| Since you often have a bunch of expressions with the same start conditions, |
| JFlex allows the same abbreviation as the Unix tool <TT>flex</TT>: |
| <PRE> |
| <STRING> { |
| expr1 { action1 } |
| expr2 { action2 } |
| } |
| </PRE> |
| means that both <TT>expr1</TT> and <TT>expr2</TT> have start condition <TT><STRING></TT>. |
| <A NAME="ExampleRulesYYINITIAL"></A> |
| <P> |
| The first three rules in our example demonstrate the syntax of a regular |
| expression preceded by the start condition <TT><YYINITIAL></TT>. |
| |
| <P> |
| <TT><A HREF="manual.html#CodeRulesYYINITIAL"><YYINITIAL> "abstract"</A><code> {</code> return symbol(sym.ABSTRACT); <code>}</code></TT> |
| |
| <P> |
| matches the input "<TT>abstract</TT>" only if the scanner is in its |
| start state "<TT>YYINITIAL</TT>". When the string "<TT>abstract</TT>" is |
| matched, the scanner function returns the CUP symbol <TT>sym.ABSTRACT</TT>. |
| If an action does not return a value, the scanning process is resumed immediately |
| after executing the action. |
| <A NAME="ExampleRulesBunch"></A> |
| <P> |
| The rules enclosed in |
| |
| <P> |
| <TT><A HREF="manual.html#CodeRulesBunch"><YYINITIAL> { |
| <BR> ... |
| <BR>}</A></TT> |
| |
| <P> |
| demonstrate the abbreviated syntax and are also only matched in state <TT>YYINITIAL</TT>. |
| <A NAME="ExampleRulesYYbegin"></A> |
| <P> |
| Of these rules, one may be of special interest: |
| |
| <P> |
| <code>\" { </code> <TT><A HREF="manual.html#CodeRulesBunch">string.setLength(0); yybegin(STRING);</A></TT><code> }</code> |
| |
| <P> |
| If the scanner matches a double quote in state <TT>YYINITIAL</TT> we |
| have recognized the start of a string literal. Therefore we clear our <TT>StringBuffer</TT> |
| that will hold the content of this string literal and tell the scanner |
| with <TT>yybegin(STRING)</TT> to switch into the lexical state <TT>STRING</TT>. |
| Because we do not yet return a value to the parser, our scanner proceeds |
| immediately. |
| <A NAME="ExampleRulesYYtext"></A> |
| <P> |
| In lexical state <TT>STRING</TT> another |
| rule demonstrates how to refer to the input that has been matched: |
| |
| <P> |
| <code>[^\n\r\"\\]+ { </code> <TT><A HREF="manual.html#CodeRulesYYtext">string.append( yytext() );</A></TT><code> }</code> |
| |
| <P> |
| The expression <code>[^\n\r\"\\]+</code> matches |
| all characters in the input up to the next backslash (indicating an |
| escape sequence such as <code>\n</code>), double quote (indicating the end |
| of the string), or line terminator (which must not occur in a string literal). |
| The matched region of the input is referred to with <TT><A HREF="manual.html#CodeRulesYYtext">yytext()</A></TT> |
| and appended to the content of the string literal parsed so far. |
| <A NAME="ExampleRuleLast"></A> |
| <P> |
| The last lexical rule in the example specification |
| is used as an error fallback. It matches any character in any state that |
| has not been matched by another rule. It doesn't conflict with any other |
| rule because it has the least priority (because it's the last rule) and |
| because it matches only one character (so it can't have longest match |
| precedence over any other rule). |
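| |
| <P> |
| As a minimal sketch of such a fallback rule (the exception type and the message text are assumptions chosen for illustration, not necessarily those of the example specification), the last rule of a specification might look like this: |
| |
| <P> |
| <PRE> |
| /* error fallback: matches any single character not matched by a rule above */ |
| .|\n    { throw new Error("Illegal character &lt;" + yytext() + "&gt;"); } |
| </PRE> |
| |
| <P> |
| Because the expression matches exactly one character, any longer match of another rule wins, and because it is listed last, any other rule matching the same single character takes priority. |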
| |
| <P> |
| |
| <H2><A NAME="SECTION00044000000000000000"> |
| How to get it going</A> |
| </H2> |
| |
| <UL> |
| <LI>Install JFlex (see section <A HREF="manual.html#Installing"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#Installing"><I>Installing JFlex</I></A>) |
| |
| <P> |
| </LI> |
| <LI>If you have written your specification file (or chosen one from the <TT>examples</TT> |
| directory), save it (say under the name <TT>java-lang.flex</TT>). |
| |
| <P> |
| </LI> |
| <LI>Run JFlex with |
| |
| <P> |
| <TT>jflex java-lang.flex</TT> |
| |
| <P> |
| </LI> |
| <LI>JFlex should then report some progress messages about generating the scanner |
| and write the generated code to the directory of your specification file. |
| |
| <P> |
| </LI> |
| <LI>Compile the generated <TT>.java</TT> file and your own classes. (If you |
| use CUP, generate your parser classes first.) |
| |
| <P> |
| </LI> |
| <LI>That's it. |
| </LI> |
| </UL> |
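| |
| <P> |
| The steps above can be sketched as a short command sequence. The class name <TT>Lexer</TT> is an assumption: it stands for whatever name the <TT>%class</TT> directive of your specification sets (<TT>Yylex</TT> if there is none). |
| |
| <P> |
| <PRE> |
| jflex java-lang.flex     # writes Lexer.java to the directory of java-lang.flex |
| javac Lexer.java         # compile the generated scanner together with your own classes |
| </PRE> |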
| |
| <P> |
| |
| <H1><A NAME="SECTION00050000000000000000"></A><A NAME="Specifications"></A><BR> |
| Lexical Specifications |
| </H1> |
| As shown above, a lexical specification file for JFlex consists of three |
| parts divided by a single line starting with <TT>%%</TT>: |
| |
| <P> |
| <TT><A HREF="manual.html#SpecUsercode">UserCode</A></TT> |
| <BR><TT>%%</TT> |
| <BR><TT><A HREF="manual.html#SpecOptions">Options and declarations</A></TT> |
| <BR><TT>%%</TT> |
| <BR><TT><A HREF="manual.html#LexRules">Lexical rules</A></TT> |
| |
| <P> |
| In all parts of the specification comments of the form |
| <TT>/* comment text */</TT> and the Java style end of line comments starting with <TT>//</TT> |
| are permitted. JFlex comments do nest - so the number of <TT>/*</TT> and <TT>*/</TT> |
| should be balanced. |
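| |
| <P> |
| As an illustration of this layout, a very small specification could look like the following sketch. The package name <TT>example</TT>, the class name <TT>Lexer</TT> and the two rules are assumptions made up for this example; <TT>%int</TT> is used only so that no separate token class is needed. |
| |
| <P> |
| <PRE> |
| /* user code: copied to the top of the generated file */ |
| package example; |
| |
| %% |
| |
| /* options and declarations */ |
| %class Lexer |
| %int |
| %unicode |
| |
| %% |
| |
| /* lexical rules */ |
| [a-z]+    { System.out.println("word: " + yytext()); } |
| .|\n      { /* ignore everything else */ } |
| </PRE> |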
| |
| <P> |
| |
| <H2><A NAME="SECTION00051000000000000000"></A><A NAME="SpecUsercode"></A><BR> |
| User code |
| </H2> |
| The first part contains user code that is copied verbatim into the beginning |
| of the source file of the generated lexer before the scanner class is declared. |
| As shown in the example above, this is the place to put <TT>package</TT> |
| declarations and <TT>import</TT> |
| statements. It is possible, but not considered good Java programming |
| style, to put your own helper classes (such as token classes) in this section. |
| They should get their own <TT>.java</TT> file instead. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00052000000000000000"></A><A NAME="SpecOptions"></A><BR> |
| Options and declarations |
| </H2> |
| The second part of the lexical specification contains <A HREF="manual.html#SpecOptDirectives">options</A> |
| to customize your generated lexer (JFlex directives and Java code to include in |
| different parts of the lexer), declarations of <A HREF="manual.html#StateDecl">lexical states</A> and |
| <A HREF="manual.html#MacroDefs">macro definitions</A> for use in the third section |
| <A HREF="manual.html#LexRules">``Lexical rules''</A> of the lexical specification file. |
| <A NAME="SpecOptDirectives"></A> |
| <P> |
| Each JFlex directive must be situated at the beginning of a line |
| and starts with the <TT>%</TT> character. Directives that have one or |
| more parameters are described as follows: |
| |
| <P> |
| <TT>%class "classname"</TT> |
| |
| <P> |
| means that you start a line with <TT>%class</TT> followed by a space followed |
| by the name of the class for the generated scanner (the double quotes are |
| <I>not</I> to be entered, see the <A HREF="manual.html#CodeOptions">example specification</A> in |
| section <A HREF="manual.html#CodeOptions"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). |
| |
| <P> |
| |
| <H3><A NAME="SECTION00052100000000000000"></A><A NAME="ClassOptions"></A><BR> |
| Class options and user class code |
| </H3> |
| These options concern the name, constructor, API, and related parts of the |
| generated scanner class. |
| |
| <UL> |
| <LI><B><TT>%class "classname"</TT></B> |
| |
| <P> |
| Tells JFlex to give the generated class the name "<TT>classname</TT>" and to |
| write the generated code to a file "<TT>classname.java</TT>". If the |
| <TT>-d &lt;directory&gt;</TT> command line option is not used, the code |
| will be written to the directory where the specification file resides. If |
| no <TT>%class</TT> directive is present in the specification, the generated |
| class will get the name "<TT>Yylex</TT>" and will be written to a file |
| "<TT>Yylex.java</TT>". There should be only one <TT>%class</TT> directive |
| in a specification. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%implements "interface 1"[, "interface 2", ..]</TT></B> |
| |
| <P> |
| Makes the generated class implement the specified interfaces. If more than |
| one <TT>%implements</TT> directive is present, all the specified interfaces |
| will be implemented. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%extends "classname"</TT></B> |
| |
| <P> |
| Makes the generated class a subclass of the class ``<TT>classname</TT>''. |
| There should be only one <TT>%extends</TT> directive in a specification. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%public</TT></B> |
| |
| <P> |
| Makes the generated class public (the class is only accessible in its |
| own package by default). |
| |
| <P> |
| </LI> |
| <LI><B><TT>%final</TT></B> |
| |
| <P> |
| Makes the generated class final. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%abstract</TT></B> |
| |
| <P> |
| Makes the generated class abstract. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%apiprivate</TT></B> |
| |
| <P> |
| Makes all generated methods and fields of the class |
| private. Exceptions are the constructor, user code in the |
| specification, and, if <code>%cup</code> is present, the method |
| <TT>next_token</TT>. All occurrences of |
| <TT>" public "</TT> (one space character before and after <TT>public</TT>) |
| in the skeleton file are replaced by |
| <TT>" private "</TT> (even if a user-specified skeleton is used). |
| Access to the generated class is expected to be mediated by user class |
| code (see next switch). |
| |
| <P> |
| </LI> |
| <LI><B><code>%{</code></B> |
| <BR><B><TT>...</TT></B> |
| <BR><B><code>%}</code></B> |
| |
| <P> |
| The code enclosed in <code>%{</code> and <code>%}</code> is copied verbatim |
| into the generated class. Here you can define your own member variables |
| and functions in the generated scanner. Like all options, both <code>%{</code> |
| and <code>%}</code> must start a line in the specification. If more than one |
| class code directive <code>%{...%}</code> is present, the code is concatenated |
| in order of appearance in the specification. |
| |
| <P> |
| </LI> |
| <LI><B><code>%init{</code></B> |
| <BR><B><TT>...</TT></B> |
| <BR><B><code>%init}</code></B> |
| |
| <P> |
| The code enclosed in <code>%init{</code> and <code>%init}</code> is copied |
| verbatim into the constructor of the generated class. Here, member |
| variables declared in the <code>%{...%}</code> directive can be initialized. |
| If more than one initializer option is present, the code is concatenated |
| in order of appearance in the specification. |
| |
| <P> |
| </LI> |
| <LI><B><code>%initthrow{</code></B> |
| <BR><B><TT>"exception1"[, "exception2", ...]</TT></B> |
| <BR><B><code>%initthrow}</code></B> |
| |
| <P> |
| or (on a single line) just |
| |
| <P> |
| <B><TT>%initthrow "exception1" [, "exception2", ...]</TT></B> |
| |
| <P> |
| Causes the specified exceptions to be declared in the <TT>throws</TT> |
| clause of the constructor. If more than one <code>%initthrow{</code> <TT>...</TT> <code>%initthrow}</code> |
| directive is present in the specification, all specified exceptions will |
| be declared. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%scanerror "exception"</TT></B> |
| |
| <P> |
| Causes the generated scanner to throw an instance of the specified |
| exception in case of an internal error (default is |
| <TT>java.lang.Error</TT>). Note that this exception is only for |
| internal scanner errors. With usual specifications it should never |
| occur (i.e. if there is an error fallback rule in the specification |
| and only the documented scanner API is used). |
| |
| <P> |
| </LI> |
| <LI><B><TT>%buffer "size"</TT></B> |
| |
| <P> |
| Set the initial size of the scan buffer to the specified value |
| (decimal, in bytes). The default value is 16384. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%include "filename"</TT></B> |
| |
| <P> |
| Replaces the <TT>%include</TT> verbatim by the specified file. This |
| feature is still experimental. It works, but error reporting can be |
| strange if a syntax error occurs on the last token in the included |
| file. |
| |
| <P> |
| </LI> |
| </UL> |
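| |
| <P> |
| A few of these options combined might look like the following sketch of an options section. The class name <TT>Lexer</TT> and the member <TT>string</TT> are assumptions chosen for illustration. |
| |
| <P> |
| <PRE> |
| %class Lexer |
| %public |
| %final |
| |
| %{ |
|   /* copied verbatim into the generated class */ |
|   private StringBuffer string; |
| %} |
| |
| %init{ |
|   /* copied verbatim into the constructor of the generated class */ |
|   string = new StringBuffer(); |
| %init} |
| </PRE> |
| |
| <P> |
| With these directives JFlex would write a public final class <TT>Lexer</TT> to the file <TT>Lexer.java</TT>. |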
| |
| <P> |
| |
| <H3><A NAME="SECTION00052200000000000000"></A><A NAME="ScanningMethod"></A><BR> |
| Scanning method |
| </H3> |
| This section shows how the scanning method can be customized. You can redefine |
| the name and return type of the method and it is possible to declare |
| exceptions that may be thrown in one of the actions of the specification. |
| If no return type is specified, the scanning method will be declared as |
| returning values of class <TT>Yytoken</TT>. |
| |
| <UL> |
| <LI><B><TT>%function "name"</TT></B> |
| |
| <P> |
| Causes the scanning method to get the specified name. If no <TT>%function</TT> |
| directive is present in the specification, the scanning method gets the |
| name ``<TT>yylex</TT>''. This directive overrides settings of the |
| <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch. Please note that the default name |
| of the scanning method with the <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch is |
| <TT>next_token</TT>. Overriding this name might lead to the generated scanner |
| being implicitly declared as <TT>abstract</TT>, because it does not provide |
| the method <TT>next_token</TT> of the interface <TT>java_cup.runtime.Scanner</TT>. |
| It is of course possible to provide a dummy implementation of that method |
| in the class code section, if you still want to override the function name. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%integer</TT></B> |
| <BR><B><TT>%int</TT></B> |
| |
| <P> |
| Both cause the scanning method to be declared as of Java type <TT>int</TT>. |
| Actions in the specification can then return <TT>int</TT> values as tokens. |
| The default end of file value under this setting is <TT>YYEOF</TT>, which is a <TT>public |
| static final int</TT> member of the generated class. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%intwrap</TT></B> |
| |
| <P> |
| Causes the scanning method to be declared as of the Java wrapper type |
| <TT>Integer</TT>. Actions in the specification can then return <TT>Integer</TT> |
| values as tokens. The default end of file value under this setting is <TT>null</TT>. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%type "typename"</TT></B> |
| |
| <P> |
| Causes the scanning method to be declared as returning values of the specified type. |
| Actions in the specification can then return values of <TT>typename</TT> |
| as tokens. The default end of file value under this setting is <TT>null</TT>. |
| If <TT>typename</TT> is not a subclass of <TT>java.lang.Object</TT>, |
| you should specify another end of file value using the |
| <A HREF="manual.html#eofval"><TT>%eofval{</TT> <TT>...</TT> <TT>%eofval}</TT></A> |
| directive or the <A HREF="manual.html#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A>. |
| The <TT>%type</TT> directive overrides settings of the |
| <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch. |
| |
| <P> |
| </LI> |
| <LI><B><code>%yylexthrow{</code></B> |
| <BR><B><TT>"exception1"[, "exception2", ... ]</TT></B> |
| <BR><B><code>%yylexthrow}</code></B> |
| |
| <P> |
| or (on a single line) just |
| |
| <P> |
| <B><TT>%yylexthrow "exception1" [, "exception2", ...]</TT></B> |
| |
| <P> |
| The exceptions listed inside <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code> |
| will be declared in the throws clause of the scanning method. If there is |
| more than one <code>%yylexthrow{</code> <TT>...</TT> <code>%yylexthrow}</code> clause in |
| the specification, all specified exceptions will be declared. |
| </LI> |
| </UL> |
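| |
| <P> |
| As a sketch, the following directives rename the scanning method and change its return type. The method name <TT>nextToken</TT> and the class <TT>Token</TT> are hypothetical; <TT>Token</TT> stands for a token class you would provide yourself. |
| |
| <P> |
| <PRE> |
| %function nextToken |
| %type Token |
| %yylexthrow java.text.ParseException |
| </PRE> |
| |
| <P> |
| The scanning method would then be declared roughly as <TT>public Token nextToken() throws java.io.IOException, java.text.ParseException</TT>, and because <TT>Token</TT> is a class, the default end of file value is <TT>null</TT>. |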
| |
| <P> |
| |
| <H3><A NAME="SECTION00052300000000000000"></A><A NAME="EOF"></A><BR> |
| The end of file |
| </H3> |
| There is always a default value that the scanning method will return when |
| the end of file has been reached. You may however define a specific value |
| to return and a specific piece of code that should be executed when the |
| end of file is reached. |
| |
| <P> |
| The default end of file value depends on the return type of the scanning method: |
| |
| <UL> |
| <LI>For <B><TT>%integer</TT></B>, the scanning method will return the value |
| <B><TT>YYEOF</TT></B>, which is a <TT>public static final int</TT> member |
| of the generated class. |
| |
| <P> |
| </LI> |
| <LI>For <B><TT>%intwrap</TT></B>, no specified type at all, or a |
| user defined type declared using <B><TT>%type</TT></B>, the value is <B><TT>null</TT></B>. |
| |
| <P> |
| </LI> |
| <LI>In CUP compatibility mode, using <B><TT>%cup</TT></B>, the value is |
| |
| <P> |
| <B><TT>new java_cup.runtime.Symbol(sym.EOF)</TT></B> |
| </LI> |
| </UL> |
| |
| <P> |
| User values and code to be executed at the end of file can be defined using these directives: |
| |
| <A NAME="eofval"></A><UL> |
| <LI><B><code>%eofval{</code></B> |
| <BR><B><TT>...</TT></B> |
| <BR><B><code>%eofval}</code></B> |
| |
| <P> |
| The code included in <code>%eofval{</code> <TT>...</TT> <code>%eofval}</code> will |
| be copied verbatim into the scanning method and will be executed <EM>each time</EM> |
| when the end of file is reached (this is possible when |
| the scanning method is called again after the end of file has been |
| reached). The code should return the value that indicates the end of |
| file to the parser. There should be only one <code>%eofval{</code> |
| <TT>...</TT> <code>%eofval}</code> clause in the specification. |
| The <code>%eofval{ ... %eofval}</code> directive overrides settings of the |
| <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch and <TT><A HREF="manual.html#YaccMode">%byaccj</A></TT> switch. |
| As of version 1.2 JFlex provides |
| a more readable way to specify the end of file value using the |
| <A HREF="manual.html#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT> rule</A> (see also section <A HREF="manual.html#EOFRule"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). |
| |
| <P> |
| </LI> |
| <LI><A NAME="eof"></A> <B><code>%eof{</code></B> |
| <BR> <B><TT>...</TT></B> |
| <BR> <B><code>%eof}</code></B> |
| |
| <P> |
| The code included in <code>%eof{ ... %eof}</code> will be executed |
| exactly once, when the end of file is reached. The code is included |
| inside a method <TT>void yy_do_eof()</TT> and should not return any |
| value (use <code>%eofval{...%eofval}</code> or |
| <A HREF="manual.html#EOFRule"><TT>&lt;&lt;EOF&gt;&gt;</TT></A> for this purpose). If more than one |
| end of file code directive is present, the code will be concatenated |
| in order of appearance in the specification. |
| |
| <P> |
| </LI> |
| <LI><B><code>%eofthrow{</code></B> |
| <BR> <B><TT>"exception1"[,"exception2", ... ]</TT></B> |
| <BR> <B><code>%eofthrow}</code></B> |
| |
| <P> |
| or (on a single line) just |
| |
| <P> |
| <B><TT>%eofthrow "exception1" [, "exception2", ...]</TT></B> |
| |
| <P> |
| The exceptions listed inside <code>%eofthrow{...%eofthrow}</code> will |
| be declared in the throws clause of the method <TT>yy_do_eof()</TT> |
| (see <A HREF="manual.html#eof"><TT>%eof</TT></A> for more on that method). |
| If there is more than one <code>%eofthrow{...%eofthrow}</code> clause |
| in the specification, all specified exceptions will be declared. |
| |
| <P> |
| <A NAME="eofclose"></A></LI> |
| <LI><B><TT>%eofclose</TT></B> |
| |
| <P> |
| Causes JFlex to close the input stream at the end of file. The code |
| <TT>yyclose()</TT> is appended to the method <TT>yy_do_eof()</TT> |
| (together with the code specified in <code>%eof{...%eof}</code>) and |
| the exception <TT>java.io.IOException</TT> is declared in the throws |
| clause of this method (together with those of |
| <code>%eofthrow{...%eofthrow}</code>). |
| |
| <P> |
| </LI> |
| <LI><B><TT>%eofclose false</TT></B> |
| |
| <P> |
| Turns the effect of <TT>%eofclose</TT> off again (e.g. in case closing of |
| input stream is not wanted after <TT>%cup</TT>). |
| |
| <P> |
| </LI> |
| </UL> |
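| |
| <P> |
| The directives above can be combined as in the following sketch. The class <TT>Token</TT> and its constant <TT>Token.EOF</TT> are hypothetical names standing for a user defined token type. |
| |
| <P> |
| <PRE> |
| %type Token |
| |
| %eofval{ |
|   /* executed each time the scanning method is called at end of file */ |
|   return Token.EOF; |
| %eofval} |
| |
| %eof{ |
|   /* executed exactly once, when end of file is first reached */ |
|   System.out.println("end of input reached"); |
| %eof} |
| |
| %eofclose |
| </PRE> |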
| |
| <P> |
| |
| <H3><A NAME="SECTION00052400000000000000"></A><A NAME="Standalone"></A><BR> |
| Standalone scanners |
| </H3> |
| |
| <UL> |
| <LI><B><TT>%debug</TT></B> |
| |
| <P> |
| Creates a main function in the generated class that expects the name |
| of an input file on the command line and then runs the scanner on this |
| input file by printing information about each returned token to the Java |
| console until the end of file is reached. The information includes: |
| line number (if line counting is enabled), column (if column counting is enabled), |
| the matched text, and the executed action (with line number in the specification). |
| |
| <P> |
| </LI> |
| <LI><B><TT>%standalone</TT></B> |
| |
| <P> |
| Creates a main function in the generated class that expects the name |
| of an input file on the command line and then runs the scanner on this |
| input file. The values returned by the scanner are ignored, but any unmatched |
| text is printed to the Java console instead (as the C/C++ tool flex does, if |
| run as standalone program). To avoid having to use an extra token class, the |
| scanning method will be declared as having default type <TT>int</TT>, not <TT>Yytoken</TT> |
| (if there isn't any other type explicitly specified). |
| This is in most cases irrelevant, but could be useful to know when making |
| another scanner standalone for some purpose. You should also consider using |
| the <TT>%debug</TT> directive, if you just want to be able to run the scanner |
| without a parser attached for testing etc. |
| |
| <P> |
| </LI> |
| </UL> |
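| |
| <P> |
| A complete standalone specification can be very short, as in the following sketch. The class name <TT>Words</TT> and the two rules are assumptions made up for this example. |
| |
| <P> |
| <PRE> |
| %% |
| |
| %standalone |
| %class Words |
| %unicode |
| |
| %% |
| |
| [a-zA-Z]+    { System.out.println("word: " + yytext()); } |
| [ \t\r\n]+   { /* skip whitespace */ } |
| </PRE> |
| |
| <P> |
| After generating and compiling the scanner, <TT>java Words input.txt</TT> would run it on the file <TT>input.txt</TT>. Text matched by neither rule (digits or punctuation, for instance) is printed to the Java console unchanged. |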
| |
| <P> |
| |
| <H3><A NAME="SECTION00052500000000000000"></A><A NAME="CupMode"></A><BR> |
| CUP compatibility |
| </H3> |
| You may also want to read section <A HREF="manual.html#CUPWork"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#CUPWork"><I>JFlex and CUP</I></A> |
| if you are interested in how to interface your generated |
| scanner with CUP. |
| |
| <UL> |
| <LI><B><TT>%cup</TT></B> |
| |
| <P> |
| The <TT>%cup</TT> directive enables the CUP compatibility mode and is equivalent |
| to the following set of directives: |
| |
| <P> |
| <PRE> |
| %implements java_cup.runtime.Scanner |
| %function next_token |
| %type java_cup.runtime.Symbol |
| %eofval{ |
| return new java_cup.runtime.Symbol(&lt;CUPSYM&gt;.EOF); |
| %eofval} |
| %eofclose |
| </PRE> |
| |
| <P> |
| The value of <TT>&lt;CUPSYM&gt;</TT> defaults to <TT>sym</TT> and can be |
| changed with the <TT>%cupsym</TT> directive. In JLex compatibility |
| mode (<TT>-jlex</TT> switch on the command line), <TT>%eofclose</TT> |
| will not be turned on. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%cupsym "classname"</TT></B> |
| |
| <P> |
| Customizes the name of the CUP generated class/interface |
| containing the names of terminal tokens. Default is <TT>sym</TT>. |
| The directive should be placed before <TT>%cup</TT>, not after it. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%cupdebug</TT></B> |
| |
| <P> |
| Creates a main function in the generated class that expects the name |
| of an input file on the command line and then runs the scanner on this |
| input file. Prints line, column, matched text, and CUP symbol name for |
| each returned token to standard out. |
| |
| <P> |
| </LI> |
| </UL> |
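| |
| <P> |
| The following sketch shows how these directives might be combined. The class name <TT>Terminals</TT> is an assumption: it stands for a symbol class that CUP is assumed to have generated under that name instead of the default <TT>sym</TT>. The helper method <TT>symbol</TT> is likewise only illustrative. |
| |
| <P> |
| <PRE> |
| %cupsym Terminals |
| %cup |
| %line |
| %column |
| |
| %{ |
|   /* wraps a terminal code and the current position into a CUP symbol */ |
|   private java_cup.runtime.Symbol symbol(int type) { |
|     return new java_cup.runtime.Symbol(type, yyline, yycolumn); |
|   } |
| %} |
| </PRE> |
| |
| <P> |
| Note that <TT>%cupsym</TT> comes before <TT>%cup</TT>, so that the generated end of file value refers to <TT>Terminals.EOF</TT>. |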
| |
| <P> |
| |
| <H3><A NAME="SECTION00052600000000000000"></A><A NAME="YaccMode"></A><BR> |
| BYacc/J compatibility |
| </H3> |
| You may also want to read section <A HREF="manual.html#YaccWork"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#YaccWork"><I>JFlex and BYacc/J</I></A> |
| if you are interested in how to interface your generated |
| scanner with BYacc/J. |
| |
| <UL> |
| <LI><B><TT>%byaccj</TT></B> |
| |
| <P> |
| The <TT>%byaccj</TT> directive enables the BYacc/J compatibility mode and is equivalent |
| to the following set of directives: |
| |
| <P> |
| <PRE> |
| %integer |
| %eofval{ |
| return 0; |
| %eofval} |
| %eofclose |
| </PRE> |
| |
| <P> |
| </LI> |
| </UL> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00052700000000000000"></A><A NAME="CodeGeneration"></A><BR> |
| Code generation |
| </H3> |
| The following options define what kind of lexical analyzer code JFlex |
| will produce. <TT>%pack</TT> is the default setting and will be used, |
| when no code generation method is specified. |
| |
| <P> |
| |
| <UL> |
| <LI><B><TT>%switch</TT></B> |
| |
| <P> |
| With <TT>%switch</TT> JFlex will generate a scanner that has |
| the DFA hard coded into a nested switch statement. This method gives |
| a good deal of compression in terms of the size of the compiled |
| <TT>.class</TT> file while still providing very good performance. If your |
| scanner gets too big though (say more than about 200 states) |
| performance may degrade considerably and you should consider using one |
| of the <TT>%table</TT> or <TT>%pack</TT> directives. If your scanner |
| gets even bigger (about 300 states), the Java compiler <TT>javac</TT> |
| could produce corrupted code that will crash when executed or will |
| give you a <TT>java.lang.VerifyError</TT> when checked by the virtual |
| machine. This is due to the size limitation of 64 KB of Java |
| methods as described in the Java Virtual Machine Specification |
| [<A |
| HREF="manual.html#MachineSpec">10</A>]. In this case you will be forced to use the |
| <TT>%pack</TT> directive, since <TT>%switch</TT> |
| usually provides more compression of the DFA table than the |
| <TT>%table</TT> directive. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%table</TT></B> |
| |
| <P> |
| The <TT>%table</TT> directive causes JFlex to produce a classical |
| table driven scanner that encodes its DFA table in an array. In |
| this mode, JFlex only does a small amount of table compression (see |
| [<A |
| HREF="manual.html#ParseTable">6</A>], [<A |
| HREF="manual.html#SparseTable">12</A>], [<A |
| HREF="manual.html#Aho">1</A>] and [<A |
| HREF="manual.html#Maurer">13</A>] |
| for more details on the matter of table compression) and uses the |
| same method that JLex did up to version 1.2.1. See section <A HREF="manual.html#performance"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| <A HREF="manual.html#performance">performance</A> of this manual to compare |
| these methods. The same reason as above (64 KB size limitation of |
| methods) causes the same problem, when the scanner gets too big. |
| This is because the virtual machine treats static initializers of |
| arrays as normal methods. You will in this case again be forced to |
| use the <TT>%pack</TT> directive to avoid the problem. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%pack</TT></B> |
| |
| <P> |
| <TT>%pack</TT> causes JFlex to compress the generated DFA table and to |
| store it in one or more string literals. JFlex takes care that the |
| strings are not longer than permitted by the class file format. |
| The strings have to be unpacked when |
| the first scanner object is created and initialized. |
| After unpacking the internal access to the DFA table is exactly the |
| same as with option <TT>%table</TT> -- the only extra work to be done |
| at runtime is the unpacking process which is quite fast (not noticeable |
| in normal cases). It is in time complexity proportional to the |
| size of the expanded DFA table, and it is static, |
| i.e. it is done only once for a certain scanner class -- no matter |
| how often it is instantiated. Again, see section |
| <A HREF="manual.html#performance"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#performance">performance</A> |
| on the performance of these scanners. |
| With <TT>%pack</TT>, there should be practically no |
| limitation to the size of the scanner. <TT>%pack</TT> is the default |
| setting and will be used when no code generation method is specified. |
| </LI> |
| </UL> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00052800000000000000"></A><A NAME="CharacterSets"></A><BR> |
| Character sets |
| </H3> |
| |
| <UL> |
| <LI><B><TT>%7bit</TT></B> |
| |
| <P> |
| Causes the generated scanner to use a 7 bit input character set (character |
| codes 0-127). Because this is the default value in JLex, JFlex also defaults |
| to 7 bit scanners. If an input character with a code greater than 127 is |
| encountered in an input at runtime, the scanner will throw an <TT>ArrayIndexOutOfBoundsException</TT>. |
| Not only because of this, you should consider using the <TT>%unicode</TT> directive. |
| See also section <A HREF="manual.html#sec:encodings"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> for information about character encodings. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%full</TT></B> |
| <BR><B><TT>%8bit</TT></B> |
| |
| <P> |
| Both options cause the generated scanner to use an 8 bit input character |
| set (character codes 0-255). If an input character with a code greater |
| than 255 is encountered in an input at runtime, the scanner will throw |
| an <TT>ArrayIndexOutOfBoundsException</TT>. Note that even if your platform |
| uses only one byte per character, the Unicode value of a character may |
| still be greater than 255. If you are scanning text files, you should |
| consider using the <TT>%unicode</TT> directive. See also section <A HREF="manual.html#sec:encodings"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| for more information about character encodings. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%unicode</TT></B> |
| <BR><B><TT>%16bit</TT></B> |
| |
| <P> |
| Both options cause the generated scanner to use the full 16 bit Unicode input |
| character set (character codes 0-65535). There will be no runtime overflow when |
| using this set of input characters. <TT>%unicode</TT> does not mean that the |
| scanner will read two bytes at a time. What is read and what constitutes a |
| character depends on the runtime platform. See also section <A HREF="manual.html#sec:encodings"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| for more information about character encodings. |
| |
| <P> |
| <A NAME="caseless"></A></LI> |
| <LI><B><TT>%caseless</TT></B> |
| <BR><B><TT>%ignorecase</TT></B> |
| |
| <P> |
| This option causes JFlex to handle all characters and strings in the |
| specification as if they were specified in both uppercase and lowercase form. |
| This enables an easy way to specify a scanner for a language with |
| case insensitive keywords. The string "<TT>break</TT>" in a specification is for instance |
| handled like the expression <TT>([bB][rR][eE][aA][kK])</TT>. The <TT>%caseless</TT> |
| option does not change the matched text and does not affect character classes. So |
| <TT>[a]</TT> still only matches the character <TT>a</TT> and not <TT>A</TT>. |
| Which letters are uppercase and which are lowercase is defined by the Unicode standard |
| and determined by JFlex with the Java methods <TT>Character.toUpperCase</TT> and |
| <TT>Character.toLowerCase</TT>. In JLex compatibility |
| mode (<TT>-jlex</TT> switch on the command line), <TT>%caseless</TT> |
| and <TT>%ignorecase</TT> also affect character classes. |
| |
| <P> |
| </LI> |
| </UL> |
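| |
| <P> |
| The effect of <TT>%caseless</TT> can be sketched with the following fragment of an options section and a rules section. The two rules are made up for illustration, and their comments restate the behaviour described above. |
| |
| <P> |
| <PRE> |
| %unicode |
| %caseless |
| |
| %% |
| |
| "break"    { /* matches break, BREAK, Break, ... ; yytext() holds the text as it was read */ } |
| [a]        { /* character class: still matches only a, not A */ } |
| </PRE> |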
| <H3><A NAME="SECTION00052900000000000000"></A><A NAME="Counting"></A><BR> |
| Line, character and column counting |
| </H3> |
| |
| <UL> |
| <LI><B><TT>%char</TT></B> |
| |
| <P> |
| Turns character counting on. The <TT>int</TT> member variable <TT>yychar</TT> |
| contains the number of characters (starting with 0) from the beginning |
| of input to the beginning of the current token. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%line</TT></B> |
| |
| <P> |
| Turns line counting on. The <TT>int</TT> member variable <TT>yyline</TT> |
| contains the number of lines (starting with 0) from the beginning of input |
| to the beginning of the current token. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%column</TT></B> |
| |
| <P> |
| Turns column counting on. The <TT>int</TT> member variable <TT>yycolumn</TT> |
| contains the number of characters (starting with 0) from the beginning |
| of the current line to the beginning of the current token. |
| |
| <P> |
| </LI> |
| </UL> |
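| |
| <P> |
| Since all three counters start at 0, code that reports positions to a user usually adds 1. A minimal sketch, in which the helper method <TT>position</TT> is an assumption introduced for this example: |
| |
| <P> |
| <PRE> |
| %line |
| %column |
| |
| %{ |
|   /* yyline and yycolumn are zero-based; add 1 for human-readable positions */ |
|   private String position() { |
|     return "line " + (yyline + 1) + ", column " + (yycolumn + 1); |
|   } |
| %} |
| </PRE> |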
| |
| <P> |
| |
| <H3><A NAME="SECTION000521000000000000000"></A><A NAME="Obsolete"></A><BR> |
| Obsolete JLex options |
| </H3> |
| |
| <UL> |
| <LI><B><TT>%notunix</TT></B> |
| |
| <P> |
| This JLex option is obsolete in JFlex but still recognized as a valid directive. |
| It used to switch between Windows and Unix kind of line terminators (<code>\r\n</code> |
| and <code>\n</code>) for the <TT>$</TT> operator in regular expressions. JFlex |
| always recognizes both styles of platform dependent line terminators. |
| |
| <P> |
| </LI> |
| <LI><B><TT>%yyeof</TT></B> |
| |
| <P> |
| This JLex option is obsolete in JFlex but still recognized as a valid directive. |
| In JLex it declares a public member constant <TT>YYEOF</TT>. JFlex declares it in any case. |
| </LI> |
| </UL> |
| |
| <P> |
| |
| <H3><A NAME="SECTION000521100000000000000"></A><A NAME="StateDecl"></A><BR> |
| State declarations |
| </H3> |
| State declarations have the following form: |
| |
| <P> |
| <TT>%s[tate] "state identifier" [, "state identifier", ... ]</TT> for inclusive or |
| <BR><TT>%x[state] "state identifier" [, "state identifier", ... ]</TT> for exclusive states |
| |
| <P> |
| There may be more than one line of state declarations, each starting with |
| <TT>%state</TT> or <TT>%xstate</TT> (the first character is sufficient, |
| <TT>%s</TT> and <TT>%x</TT> work, too). State identifiers are letters followed |
| by a sequence of letters, digits or underscores. State identifiers can be separated |
| by whitespace or comma. |
| |
| <P> |
| The sequence |
| |
| <P> |
| <TT>%state STATE1</TT> |
| <BR><TT>%xstate STATE3, XYZ, STATE_10</TT> |
| <BR><TT>%state ABC STATE5</TT> |
| |
| <P> |
| declares the set of identifiers <TT>STATE1, STATE3, XYZ, |
| STATE_10, ABC, STATE5</TT> as lexical states, <TT>STATE1</TT>, <TT>ABC</TT>, <TT>STATE5</TT> |
| as inclusive, and <TT>STATE3</TT>, <TT>XYZ</TT>, <TT>STATE_10</TT> as exclusive. |
| See also section |
| <A HREF="manual.html#HowMatched"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> on the way lexical states influence how the input is |
| matched. |
| |
| <P> |
| |
| <H3><A NAME="SECTION000521200000000000000"></A><A NAME="MacroDefs"></A><BR> |
| Macro definitions |
| </H3> |
| A macro definition has the form |
| |
| <P> |
| <TT>macroidentifier = regular expression</TT> |
| |
| <P> |
| That means, a macro definition is a macro identifier (letter followed |
| by a sequence of letters, digits or underscores), that can later be |
| used to reference the macro, followed by optional whitespace, followed |
| by an "<TT>=</TT>", followed by optional whitespace, followed by a |
| regular expression (see section <A HREF="manual.html#LexRules"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#LexRules"><I>lexical |
| rules</I></A> for more information about regular expressions). |
| |
| <P> |
| The regular expression on the right hand side must be well formed and |
| must not contain the <code>^</code>, <TT>/</TT> or <TT>$</TT> operators. <B>In contrast |
| to JLex, macros are not just pieces of text that are expanded by copying</B> |
| - they are parsed and must be well formed. |
| |
| <P> |
| <B>This is a feature.</B> It eliminates some very hard to find bugs in |
| lexical specifications (such as not having parentheses around more |
| complicated macros - which is not necessary with JFlex). See section |
| <A HREF="manual.html#Porting"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#Porting"><I>Porting from JLex</I></A> for more |
| details on the problems of JLex style macros. |
| |
| <P> |
| Since it is allowed to have macro usages in macro definitions, it is |
| possible to use a grammar like notation to specify the desired lexical |
| structure. Macros however remain just abbreviations of the regular expressions |
| they represent. They are not non terminals of a grammar and cannot be used |
| recursively in any way. JFlex detects cycles in macro definitions and reports |
| them at generation time. JFlex also warns you about macros that have been |
| defined but never used in the ``lexical rules'' section of the specification. |
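| |
| <P> |
| As a minimal sketch of this grammar-like style (the macro names below are |
| illustrative and not taken from any particular specification), later macros |
| can be built from earlier ones: |
| <PRE> |
| Digit     = [0-9] |
| Integer   = {Digit}+ |
| Exponent  = [eE] [+\-]? {Integer} |
| Float     = {Integer} "." {Digit}* {Exponent}? |
| </PRE> |
| Because each right hand side is parsed as a complete expression, <TT>{Integer}</TT> |
| inside <TT>Float</TT> behaves as if it were parenthesized. A definition that refers |
| to itself, such as <TT>A = "x" {A}</TT>, would be reported as a cycle. |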
| |
| <P> |
| |
| <H2><A NAME="SECTION00053000000000000000"></A><A NAME="LexRules"></A><BR> |
| Lexical rules |
| </H2> |
| The ``lexical rules'' section of a JFlex specification contains a set of |
| regular expressions and actions (Java code) that are executed when the |
| scanner matches the associated regular expression. |
| |
| <P> |
| |
| <H3><A NAME="SECTION00053100000000000000"></A><A NAME="Grammar"></A><BR> |
| Syntax |
| </H3> |
| The syntax of the "lexical rules" section is described by the following |
| BNF grammar (terminal symbols are enclosed in 'quotes'): |
| |
| <P> |
| <PRE> |
| LexicalRules ::= Rule+ |
| Rule ::= [StateList] ['^'] RegExp [LookAhead] Action |
| | [StateList] '<<EOF>>' Action |
| | StateGroup |
| StateGroup ::= StateList '{' Rule+ '}' |
| StateList ::= '<' Identifier (',' Identifier)* '>' |
| LookAhead ::= '$' | '/' RegExp |
| Action ::= '{' JavaCode '}' | '|' |
| |
| RegExp ::= RegExp '|' RegExp |
| | RegExp RegExp |
| | '(' RegExp ')' |
| | ('!'|'~') RegExp |
| | RegExp ('*'|'+'|'?') |
| | RegExp "{" Number ["," Number] "}" |
| | '[' ['^'] (Character|Character'-'Character)* ']' |
| | PredefinedClass |
| | '{' Identifier '}' |
| | '"' StringCharacter+ '"' |
| | Character |
| |
| PredefinedClass ::= '[:jletter:]' |
| | '[:jletterdigit:]' |
| | '[:letter:]' |
| | '[:digit:]' |
| | '[:uppercase:]' |
| | '[:lowercase:]' |
| | '.' |
| </PRE> |
| |
| <P> |
| <A NAME="Terminals"></A>The grammar uses the following terminal symbols: |
| |
| <UL> |
| <LI><TT>JavaCode</TT> |
| <BR> a sequence of <EM><TT>BlockStatements</TT></EM> as described in the Java |
| Language Specification [<A |
| HREF="manual.html#LangSpec">7</A>], section 14.2. |
| |
| <P> |
| </LI> |
| <LI><TT>Number</TT> |
| <BR> a non negative decimal integer. |
| |
| <P> |
| </LI> |
| <LI><TT>Identifier</TT> |
| <BR> a letter <code>[a-zA-Z]</code> followed by a sequence of zero or more |
| letters, digits or underscores <code>[a-zA-Z0-9_]</code> |
| |
| <P> |
| </LI> |
| <LI><TT>Character</TT> |
| <BR> an escape sequence or any unicode character that is not one of these |
| meta characters: |
| <code> | ( ) { } [ ] < > \ . * + ? ^ $ / " ~ !</code> |
| |
| <P> |
| </LI> |
| <LI><TT>StringCharacter</TT> |
| <BR> an escape sequence or any unicode character that is not one of these |
| meta characters: |
| <code> \ "</code> |
| |
| <P> |
| </LI> |
| <LI>An escape sequence |
| |
| <P> |
| |
| <UL> |
| <LI><code>\n</code> <code>\r</code> <code>\t</code> <code>\f</code> <code>\b</code> |
| </LI> |
| <LI>a <code>\x</code> followed by two hexadecimal digits <TT>[a-fA-F0-9]</TT> (denoting |
| a standard ASCII escape sequence), |
| |
| <P> |
| </LI> |
| <LI>a <code>\u</code> followed by four hexadecimal digits <TT>[a-fA-F0-9]</TT> |
| (denoting a Unicode escape sequence), |
| |
| <P> |
| </LI> |
| <LI>a backslash followed by a three digit octal number from 000 to 377 (denoting |
| a standard ASCII escape sequence), or |
| |
| <P> |
| </LI> |
| <LI>a backslash followed by any other unicode character that stands for this |
| character. |
| |
| <P> |
| </LI> |
| </UL> |
| |
| <P> |
| </LI> |
| </UL> |
| |
| <P> |
| Please note that the <code>\n</code> escape sequence stands for the ASCII |
| LF character - not for the end of line. If you would like to match the |
| line terminator, you should use the expression <code>\r|\n|\r\n</code> if you want |
| the Java conventions, or <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code> |
| if you want to be fully Unicode compliant (see also [<A |
| HREF="manual.html#unicode_rep">5</A>]). |
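| |
| <P> |
| One way to keep these two alternatives readable is to give them names in the |
| macro section. This is only a sketch: the macro names are chosen here for |
| illustration, and the second macro is meant for specifications that use the |
| <TT>%unicode</TT> directive, since it mentions characters above 255. |
| <PRE> |
| JavaLineTerminator    = \r|\n|\r\n |
| UnicodeLineTerminator = \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085 |
| </PRE> |
| When such a macro is the whole expression of a rule, the longest match rule |
| makes the scanner take <code>\r\n</code> as one terminator rather than two. |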
| |
| <P> |
| As of version 1.1 of JFlex the whitespace characters <TT>" "</TT> |
| (space) and <code>"\t"</code> (tab) can be used to improve the readability of |
| regular expressions. They will be ignored by JFlex. In character |
| classes and strings however, whitespace characters keep standing for |
| themselves (so the string <TT>" "</TT> still matches exactly one space |
| character and <code>[ \n]</code> still matches an ASCII LF or a space |
| character). |
| |
| <P> |
| JFlex applies the following standard operator precedences in regular |
| expression (from highest to lowest): |
| |
| <P> |
| |
| <UL> |
| <LI>unary postfix operators (<code>'*', '+', '?', {n}, {n,m}</code>) |
| |
| <P> |
| </LI> |
| <LI>unary prefix operators (<code>'!', '~'</code>) |
| |
| <P> |
| </LI> |
| <LI>concatenation (<TT>RegExp::= RegExp Regexp</TT>) |
| |
| <P> |
| </LI> |
| <LI>union (<code>RegExp::= RegExp '|' RegExp</code>) |
| </LI> |
| </UL> |
| |
| <P> |
| So the expression <code>a | abc | !cd*</code> for instance is parsed as |
| <code>(a|(abc)) | ((!c)(d*))</code>. |
| |
| <P> |
| |
| <H3><A NAME="SECTION00053200000000000000"></A><A NAME="Semantics"></A><BR> |
| Semantics |
| </H3> |
| This section gives an informal description of which text is matched by |
| a regular expression (i.e. an expression described by the <TT>RegExp</TT> |
| production of the grammar presented <A HREF="manual.html#Grammar">above</A>). |
| |
| <P> |
| A regular expression that consists solely of |
| |
| <UL> |
| <LI>a <TT>Character</TT> matches this character. |
| |
| <P> |
| </LI> |
| <LI>a character class <code>'[' (Character|Character'-'Character)* ']'</code> matches |
| any character in that class. A <TT>Character</TT> is to be considered an |
| element of a class, if it is listed in the class or if its code lies within |
| a listed character range <TT>Character'-'Character</TT>. So <code>[a0-3\n]</code> |
| for instance matches the characters |
| |
| <P> |
| <code>a 0 1 2 3 \n</code> |
| |
| <P> |
| If the list of characters is empty (i.e. just <code>[]</code>), the expression |
| matches nothing at all (the empty set), not even the empty string. This |
| may be useful in combination with the negation operator <code>'!'</code>. |
| |
| <P> |
| </LI> |
| <LI>a negated character class <code>'[^' (Character|Character'-'Character)* ']'</code> |
| matches all characters not listed in the class. If the list of characters |
| is empty (i.e. <code>[^]</code>), the expression matches any character of the |
| input character set. |
| |
| <P> |
| </LI> |
| <LI>a string <code>'"' StringCharacter+ '"'</code> matches the exact |
| text enclosed in double quotes. All meta characters but <code>\</code> and |
| <TT>"</TT> lose their special meaning inside a string. See also the |
| <A HREF="manual.html#caseless"><TT>%ignorecase</TT></A> switch. |
| |
| <P> |
| </LI> |
| <LI>a macro usage <code>'{' Identifier '}'</code> matches the input that is matched |
| by the right hand side of the macro with name "<TT>Identifier</TT>". |
| |
| <P> |
| <A NAME="predefCharCl"></A></LI> |
| <LI>a predefined character class matches any of |
| the characters in that class. There are the following predefined character |
| classes: |
| |
| <P> |
| <TT>.</TT> contains all characters but <code>\n</code>. |
| |
| <P> |
| All other predefined character classes are defined in the Unicode |
| specification or the Java Language Specification and determined by |
| Java functions of class |
| <TT>java</TT>.<TT>lang</TT>.<TT>Character</TT>. |
| |
| <P> |
| <PRE> |
| [:jletter:] isJavaIdentifierStart() |
| [:jletterdigit:] isJavaIdentifierPart() |
| [:letter:] isLetter() |
| [:digit:] isDigit() |
| [:uppercase:] isUpperCase() |
| [:lowercase:] isLowerCase() |
| </PRE> |
| |
| <P> |
| They are especially useful when working with the unicode character set. |
| |
| <P> |
| </LI> |
| </UL> |
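| |
| <P> |
| For instance, a Java-style identifier could be sketched with the predefined |
| classes as follows (the macro name <TT>Identifier</TT> and the action body are |
| placeholders): |
| <PRE> |
| Identifier = [:jletter:] [:jletterdigit:]* |
| %% |
| {Identifier}    { /* the identifier text is available as yytext() */ } |
| </PRE> |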
| |
| <P> |
| If <TT>a</TT> and <TT>b</TT> are regular expressions, then |
| |
| <P> |
| <DL COMPACT> |
| <DT><TT>a | b</TT></DT> |
| <DD>(union) |
| |
| <P> |
| is the regular expression, that matches |
| all input that is matched by <TT>a</TT> or by <TT>b</TT>. |
| |
| <P> |
| </DD> |
| <DT><TT>a b</TT></DT> |
| <DD>(concatenation) |
| |
| <P> |
| is the regular expression, |
| that matches the input matched by <TT>a</TT> followed by the |
| input matched by <TT>b</TT>. |
| |
| <P> |
| </DD> |
| <DT><TT>a*</TT></DT> |
| <DD>(kleene closure) |
| |
| <P> |
| matches zero or more repetitions |
| of the input matched by <TT>a</TT> |
| |
| <P> |
| </DD> |
| <DT><TT>a+</TT></DT> |
| <DD>(iteration) |
| |
| <P> |
| is equivalent to <TT>aa*</TT> |
| |
| <P> |
| </DD> |
| <DT><TT>a?</TT></DT> |
| <DD>(option) |
| |
| <P> |
| matches the empty input or the input matched |
| by <TT>a</TT> |
| |
| <P> |
| </DD> |
| <DT><TT>!a</TT></DT> |
| <DD>(negation) |
| |
| <P> |
| matches everything but the strings matched by <TT>a</TT>. |
| Use with care: the construction of <code>!a</code> involves |
| an additional, possibly exponential NFA to DFA transformation |
| on the NFA for <TT>a</TT>. Note that |
| with negation and union you also have (by applying DeMorgan) |
| intersection and set difference: the intersection of |
| <TT>a</TT> and <TT>b</TT> is <code>!(!a|!b)</code>, the expression |
| that matches everything of <TT>a</TT> not matched by <TT>b</TT> is |
| <code>!(!a|b)</code> |
| |
| <P> |
| </DD> |
| <DT><TT>~a</TT></DT> |
| <DD>(upto) |
| |
| <P> |
| matches everything up to (and including) the first occurrence of a text |
| matched by <TT>a</TT>. The expression <code>~a</code> is equivalent |
| to <code>!([^]* a [^]* | "") a</code>. A traditional C-style comment |
| is matched by <code>"/*" ~"*/"</code> |
| |
| <P> |
| </DD> |
| <DT><TT>a{n}</TT></DT> |
| <DD>(repeat) |
| |
| <P> |
| is equivalent to <TT>n</TT> times the concatenation of <TT>a</TT>. |
| So <code>a{4}</code> for instance is equivalent to the expression <TT>a a a a</TT>. |
| The decimal integer <TT>n</TT> must be positive. |
| |
| <P> |
| </DD> |
| <DT><TT>a{n,m}</TT></DT> |
| <DD>is equivalent to at least <TT>n</TT> times and at most <TT>m</TT> times the |
| concatenation of <TT>a</TT>. So <code>a{2,4}</code> for instance is equivalent |
| to the expression <code>a a a? a?</code>. Both <TT>n</TT> and <TT>m</TT> are non |
| negative decimal integers and <TT>m</TT> must not be smaller than <TT>n</TT>. |
| |
| <P> |
| </DD> |
| <DT><TT>( a )</TT></DT> |
| <DD>matches the same input as <TT>a</TT>. |
| |
| <P> |
| </DD> |
| </DL> |
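| |
| <P> |
| The following macro definitions sketch how some of these operators combine. |
| All macro names are illustrative. The last definition uses the set difference |
| construction <code>!(!a|b)</code> described above and therefore pays the |
| possibly expensive negation twice. |
| <PRE> |
| HexDigit      = [0-9a-fA-F] |
| UnicodeEscape = "\\u" {HexDigit}{4} |
| Comment       = "/*" ~"*/" |
| Identifier    = [a-zA-Z_] [a-zA-Z0-9_]* |
| NonKeyword    = !( !{Identifier} | "if" | "else" ) |
| </PRE> |
| <TT>NonKeyword</TT> matches every text matched by <TT>Identifier</TT> except |
| the two strings <TT>if</TT> and <TT>else</TT>. |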
| |
| <P> |
| In a lexical rule, a regular expression <TT>r</TT> may be preceded by a |
| '<code>^</code>' (the beginning of line operator). <TT>r</TT> is then |
| only matched at the beginning of a line in the input. A line begins |
| after each occurrence of <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code> |
| (see also [<A |
| HREF="manual.html#unicode_rep">5</A>]) and at the beginning of input. |
| The preceding line terminator in the input is not consumed and can |
| be matched by another rule. |
| |
| <P> |
| In a lexical rule, a regular expression <TT>r</TT> may be followed by a |
| lookahead expression. A lookahead expression is either a '<TT>$</TT>' |
| (the end of line operator) or a <code>'/'</code> followed by an arbitrary |
| regular expression. In both cases the lookahead is not consumed and |
| not included in the matched text region, but it <EM>is</EM> considered |
| while determining which rule has the longest match (see also |
| <A HREF="manual.html#HowMatched"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> <A HREF="manual.html#HowMatched"><I>How the input is matched</I></A>). |
| |
| <P> |
| In the '<TT>$</TT>' case <TT>r</TT> is only matched at the end of a line in |
| the input. The end of a line is denoted by the regular expression |
| <code>\r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>. |
| So <code>a$</code> is equivalent to <code>a / \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085</code>. This is a bit different from the situation described in [<A |
| HREF="manual.html#unicode_rep">5</A>]: |
| since in JFlex <code>$</code> is a true trailing context, the end of file |
| does <B>not</B> count as end of line. |
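| |
| <P> |
| A sketch of both line operators in rules (the action bodies are placeholders): |
| <PRE> |
| ^"#" [^\r\n]*    { /* a comment that starts with # at the beginning of a line */ } |
| [ \t]+$          { /* blanks directly in front of a line terminator */ } |
| </PRE> |
| Following the description above, the second rule does not apply to blanks at |
| the very end of the input if no line terminator follows them. |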
| |
| <P> |
| <A NAME="trailingContext"></A> |
| <P> |
| For arbitrary lookahead (also called <EM>trailing context</EM>) the |
| expression is matched only when followed by input that matches the |
| trailing context. Unfortunately the lookahead expression is not |
| really arbitrary: In a rule <TT>r1 / r2</TT>, either the text matched |
| by <TT>r1</TT> must have a fixed length (e.g. if <TT>r1</TT> is a string) |
| or the beginning of the trailing context <TT>r2</TT> must not match the |
| end of <TT>r1</TT>. So for example <code>"abc" / "a"|"b"</code> is ok because |
| <TT>"abc"</TT> has a fixed length, <code>"a"|"ab" / "x"*</code> is ok because |
| no prefix of <TT>"x"*</TT> matches a postfix of <code>"a"|"ab"</code>, but |
| <code>"x"|"xy" / "yx"</code> is <EM>not</EM> possible, because the postfix <TT>"y"</TT> |
| of <TT>"x"|"xy"</TT> is also a prefix of <TT>"yx"</TT>. JFlex will report |
| such cases at generation time. The algorithm JFlex currently uses for matching |
| trailing context expressions is the one described in [<A |
| HREF="manual.html#Aho">1</A>] (leading |
| to the deficiencies mentioned above). |
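| |
| <P> |
| Two rules that stay within these restrictions, given as a sketch with |
| placeholder actions: |
| <PRE> |
| "if" / [ \t]* "("    { /* the text if, but only when a parenthesis follows */ } |
| [a-z]+ / "="         { /* a lower case name directly in front of an equals sign */ } |
| </PRE> |
| The first rule is acceptable because <TT>"if"</TT> has a fixed length. The |
| second is acceptable because <TT>"="</TT> cannot match the end of a text |
| matched by <code>[a-z]+</code>. |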
| |
| <P> |
| <A NAME="EOFRule"></A>As of version 1.2, JFlex allows lex/flex style <TT>&lt;&lt;EOF&gt;&gt;</TT> rules in |
| lexical specifications. A rule |
| <PRE> |
| [StateList] <<EOF>> { some action code } |
| </PRE> |
| is very similar to the <A HREF="manual.html#eofval"><TT>%eofval</TT> directive</A> (section <A HREF="manual.html#eofval"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). |
| The difference lies in the optional <TT>StateList</TT> that may precede the <TT>&lt;&lt;EOF&gt;&gt;</TT> rule. The |
| action code will only be executed when the end of file is read and the |
| scanner is currently in one of the lexical states listed in <TT>StateList</TT>. |
| The same <TT>StateGroup</TT> (see section <A HREF="manual.html#HowMatched"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A> |
| <A HREF="manual.html#HowMatched"><I>How the input is matched</I></A>) and precedence |
| rules as in the ``normal'' rule case apply |
| (i.e. if there is more than one <TT>&lt;&lt;EOF&gt;&gt;</TT> |
| rule for a certain lexical state, the action of the one appearing |
| earlier in the specification will be executed). <TT>&lt;&lt;EOF&gt;&gt;</TT> rules |
| override settings of the <TT>%cup</TT> and <TT>%byaccj</TT> options and |
| should not be mixed with the <TT>%eofval</TT> directive. |
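| |
| <P> |
| A common use is to detect input that ends inside an unfinished construct. The |
| sketch below assumes a hypothetical exclusive state <TT>COMMENT</TT>; the |
| choice of throwing an <TT>Error</TT> is only one possible reaction. |
| <PRE> |
| %xstate COMMENT |
| %% |
| &lt;YYINITIAL&gt; "/*"    { yybegin(COMMENT); } |
| &lt;COMMENT&gt; { |
|   "*/"              { yybegin(YYINITIAL); } |
|   [^]               { /* skip the comment text */ } |
|   &lt;&lt;EOF&gt;&gt;           { throw new Error("unterminated comment"); } |
| } |
| </PRE> |
| The end of file in state <TT>YYINITIAL</TT> is not covered by this rule and is |
| handled in the default way. |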
| |
| <P> |
| An <TT>Action</TT> consists either of a piece of Java code enclosed in |
| curly braces or is the special <code>|</code> action. The <code>|</code> action is |
| an abbreviation for the action of the following expression. |
| |
| <P> |
| Example: |
| <PRE> |
| expression1 | |
| expression2 | |
| expression3 { some action } |
| </PRE> |
| is equivalent to the expanded form |
| <PRE> |
| expression1 { some action } |
| expression2 { some action } |
| expression3 { some action } |
| </PRE> |
| |
| <P> |
| The <code>|</code> action is useful when you work with trailing context expressions. The |
| expression <TT>a | (c / d) | b</TT> is not syntactically legal, but can |
| easily be expressed using the <code>|</code> action: |
| <PRE> |
| a | |
| c / d | |
| b { some action } |
| </PRE> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00053300000000000000"></A><A NAME="HowMatched"></A><BR> |
| How the input is matched |
| </H3> |
| When consuming its input, the scanner determines the regular expression |
| that matches the longest portion of the input (longest match rule). If |
| there is more than one regular expression that matches the longest portion |
| of input (i.e. they all match the same input), the generated scanner chooses |
| the expression that appears first in the specification. After determining |
| the active regular expression, the associated action is executed. If there |
| is no matching regular expression, the scanner terminates the program with |
| an error message (if the <TT>%standalone</TT> directive has been used, the |
| scanner prints the unmatched input to <TT>java.lang.System.out</TT> instead |
| and resumes scanning). |
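| |
| <P> |
| A small sketch of how these two rules interact (the actions are placeholders): |
| <PRE> |
| "if"      { /* keyword */ } |
| [a-z]+    { /* identifier */ } |
| </PRE> |
| On the input <TT>if</TT> both expressions match two characters, so the first |
| rule is chosen. On the input <TT>iffy</TT> the second expression matches four |
| characters, which is longer than the two characters of the first, so the |
| second rule is chosen. |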
| |
| <P> |
| Lexical states can be used to further restrict the set of regular expressions |
| that match the current input. |
| |
| <P> |
| |
| <UL> |
| <LI>A regular expression can only be matched when its associated set of lexical |
| states includes the currently active lexical state of the scanner or if |
| the set of associated lexical states is empty and the currently active lexical |
| state is inclusive. Exclusive and inclusive states only differ at this point: |
| rules with an empty set of associated states. |
| |
| <P> |
| </LI> |
| <LI>The currently active lexical state of the scanner can be changed from within |
| an action of a regular expression using the method <TT>yybegin()</TT>. |
| |
| <P> |
| </LI> |
| <LI>The scanner starts in the inclusive lexical state |
| <TT>YYINITIAL</TT>, which is always declared by default. |
| |
| <P> |
| </LI> |
| <LI>The set of lexical states associated with a regular expression is |
| the <TT>StateList</TT> that precedes the expression. If a rule is |
| contained in one or more <TT>StateGroups</TT>, then the states of |
| these are also associated with the rule, i.e. they accumulate over |
| <TT>StateGroups</TT>. |
| |
| <P> |
| Example: |
| <PRE> |
| %states A, B |
| %xstates C |
| %% |
| expr1 { yybegin(A); action } |
| <YYINITIAL, A> expr2 { action } |
| <A> { |
| expr3 { action } |
| <B,C> expr4 { action } |
| } |
| </PRE> |
| The first line declares two (inclusive) lexical states <TT>A</TT> and <TT>B</TT>, |
| the second line an exclusive lexical state <TT>C</TT>. |
| The default (inclusive) state <TT>YYINITIAL</TT> is always implicitly there and |
| doesn't need to be declared. The rule with <TT>expr1</TT> has no |
| states listed, and is thus matched in all states but the exclusive |
| ones, i.e. <TT>A</TT>, <TT>B</TT>, and <TT>YYINITIAL</TT>. In its |
| action, the scanner is switched to state <TT>A</TT>. The second rule |
| <TT>expr2</TT> can only match when the scanner is in state |
| <TT>YYINITIAL</TT> or <TT>A</TT>. The rule <TT>expr3</TT> can only be |
| matched in state <TT>A</TT> and <TT>expr4</TT> in states <TT>A</TT>, <TT>B</TT>, |
| and <TT>C</TT>. |
| |
| <P> |
| </LI> |
| <LI>Lexical states are declared and used as Java <TT>int</TT> constants in |
| the generated class under the same name as they are used in the specification. |
| There is no guarantee that the values of these integer constants are |
| distinct. They are pointers into the generated DFA table, and if JFlex |
| recognizes two states as lexically equivalent (if they are used with the |
| exact same set of regular expressions), then the two constants will get |
| the same value. |
| |
| <P> |
| </LI> |
| </UL> |
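| |
| <P> |
| As a sketch of a typical use of an exclusive state, the following fragment |
| collects the content of a string literal. The state name <TT>STRING</TT> and |
| the field <TT>string</TT> are assumptions made for this example, and the |
| fragment is deliberately incomplete (it knows only one escape sequence). |
| <PRE> |
| %{ |
|   StringBuffer string = new StringBuffer(); |
| %} |
| %xstate STRING |
| %% |
| &lt;YYINITIAL&gt; \"     { string.setLength(0); yybegin(STRING); } |
| &lt;STRING&gt; { |
|   \"               { yybegin(YYINITIAL); /* string now holds the literal */ } |
|   [^\n\r\"\\]+     { string.append(yytext()); } |
|   \\n              { string.append('\n'); } |
| } |
| </PRE> |
| Because <TT>STRING</TT> is exclusive, rules without a state list are not |
| considered while the scanner is inside the literal. |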
| |
| <P> |
| |
| <H3><A NAME="SECTION00053400000000000000"> |
| The generated class</A> |
| </H3> |
| JFlex generates exactly one file containing one class from the specification |
| (unless you have declared another class in the first specification section). |
| |
| <P> |
| The generated class contains (among other things) the DFA tables, an input buffer, |
| the lexical states of the specification, a constructor, and the scanning method |
| with the user supplied actions. |
| |
| <P> |
| The name of the class is <TT>Yylex</TT> by default; it is customizable |
| with the <TT>%class</TT> directive (see also section |
| <A HREF="manual.html#ClassOptions"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). The input buffer of the lexer is connected with an |
| input stream over the <TT>java.io.Reader</TT> object which is passed |
| to the lexer in the generated constructor. If you want to provide your |
| own constructor for the lexer, you should always call the generated |
| one in it to initialize the input buffer. The input buffer should not |
| be accessed directly, but only over the advertised API (see also |
| section <A HREF="manual.html#ScannerMethods"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). Its internal implementation may change |
| between releases or skeleton files without notice. |
| |
| <P> |
| The main interface to the outside world is the generated scanning |
| method (default name <TT>yylex</TT>, default return type |
| <TT>Yytoken</TT>). Most of its aspects are customizable (name, return |
| type, declared exceptions etc., see also section |
| <A HREF="manual.html#ScanningMethod"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). If it is called, it will consume input until |
| one of the expressions in the specification is matched or an error |
| occurs. If an expression is matched, the corresponding action is |
| executed. It may return a value of the specified return type (in which |
| case the scanning method returns with this value), or if it doesn't |
| return a value, the scanner resumes consuming input until the next |
| expression is matched. If the end of file is reached, the scanner |
| executes the EOF action, and (also upon each further call to the scanning |
| method) returns the specified EOF value (see also section <A HREF="manual.html#EOF"><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="crossref.png"></A>). |
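| |
| <P> |
| The following complete specification sketches how the generated class might be |
| driven from user code. The class name <TT>Lexer</TT> and the token codes 1 and |
| 2 are arbitrary choices for this example. It relies on the <TT>%int</TT> |
| directive, so the scanning method returns <TT>int</TT> and signals the end of |
| file with <TT>YYEOF</TT>. |
| <PRE> |
| import java.io.*; |
| %% |
| %class Lexer |
| %int |
| %unicode |
| %{ |
|   public static void main(String[] argv) throws IOException { |
|     Lexer lexer = new Lexer(new FileReader(argv[0])); |
|     int token; |
|     while ((token = lexer.yylex()) != YYEOF) { |
|       System.out.println("token " + token + ": " + lexer.yytext()); |
|     } |
|   } |
| %} |
| %% |
| [a-zA-Z]+    { return 1; } |
| [0-9]+       { return 2; } |
| [^]          { /* no return: the scanner continues with the next match */ } |
| </PRE> |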
| |
| <P> |
| |
| <H3><A NAME="SECTION00053500000000000000"></A><A NAME="ScannerMethods"></A><BR> |
| Scanner methods and fields accessible in actions (API) |
| </H3> |
| Generated methods and member fields in JFlex scanners are prefixed |
| with <TT>yy</TT> to indicate that they are generated and to avoid name |
| conflicts with user code copied into the class. Since user code is |
| part of the same class, JFlex has no language means like the |
| <TT>private</TT> modifier to indicate which members and methods are |
| internal and which ones belong to the API. Instead, JFlex follows a |
| naming convention: everything starting with a <TT>zz</TT> prefix like |
| <TT>zzStartRead</TT> is to be considered internal and subject to |
| change without notice between JFlex releases. Methods and members of |
| the generated class that do not have a <TT>zz</TT> prefix like |
| <TT>yycharat</TT> belong to the API that the scanner class provides to |
| users in action code of the specification. They will remain stable |
| and supported between JFlex releases as long as possible. |
| |
| <P> |
| Currently, the API consists of the following methods and member fields: |
| |
| <UL> |
| <LI><TT>String yytext()</TT> |
| <BR> returns the matched input text region |
| |
| <P> |
| </LI> |
| <LI><TT>int yylength()</TT> |
| <BR> returns the length of the matched input text region (does not require |
| a <TT>String</TT> object to be created) |
| |
| <P> |
| </LI> |
| <LI><TT>char yycharat(int pos)</TT> |
| <BR> returns the character at position <TT>pos</TT> from the matched text. |
| It is equivalent to <TT>yytext().charAt(pos)</TT>, but faster. <TT> pos</TT> must be a value from <TT>0</TT> to <TT>yylength()-1</TT>. |
| |
| <P> |
| </LI> |
| <LI><TT>void yyclose()</TT> |
| <BR> closes the input stream. All subsequent calls to the scanning method will |
| return the end of file value |
| |
| <P> |
| </LI> |
| <LI><TT>void yyreset(java.io.Reader reader)</TT> |
| <BR> closes the current input stream, and resets the scanner to read from |
| a new input stream. All internal variables are reset, the old input |
| stream <EM>cannot</EM> be reused (content of the internal buffer is |
| discarded and lost). The lexical state is set to <TT>YYINITIAL</TT>. |
| |
| <P> |
| </LI> |
| <LI><TT>void yypushStream(java.io.Reader reader)</TT> |
| <BR> Stores the current input stream on a stack, and |
| reads from a new stream. Lexical state, line, |
| char, and column counting remain untouched. |
| The current input stream can be restored with |
| <TT>yypopStream</TT> (usually in an <TT>&lt;&lt;EOF&gt;&gt;</TT> action). |
| |
| <P> |
| A typical use for this is include files in the |
| style of the C preprocessor. The corresponding |
| JFlex specification could look somewhat like this: |
| <PRE> |
| "#include" {FILE} { yypushStream(new FileReader(getFile(yytext()))); } |
| .. |
| <<EOF>> { if (yymoreStreams()) yypopStream(); else return EOF; } |
| </PRE> |
| |
| <P> |
| This method is only available in the skeleton file |
| <TT>skeleton.nested</TT>. You can find it in the |
| <TT>src</TT> directory of the JFlex distribution. |
| |
| <P> |
| </LI> |
| <LI><TT>void yypopStream()</TT> |
| <BR> Closes the current input stream and continues to |
| read from the one on top of the stream stack. |
| |
| <P> |
| This method is only available in the skeleton file |
| <TT>skeleton.nested</TT>. You can find it in the |
| <TT>src</TT> directory of the JFlex distribution. |
| |
| <P> |
| </LI> |
| <LI><TT>boolean yymoreStreams()</TT> |
| <BR> Returns true iff there are still streams for <TT>yypopStream</TT> |
| left to read from on the stream stack. |
| |
| <P> |
| This method is only available in the skeleton file |
| <TT>skeleton.nested</TT>. You can find it in the |
| <TT>src</TT> directory of the JFlex distribution. |
| |
| <P> |
| </LI> |
| <LI><TT>int yystate()</TT> |
| <BR> returns the current lexical state of the scanner. |
| |
| <P> |
| </LI> |
| <LI><TT>void yybegin(int lexicalState)</TT> |
| <BR> enters the lexical state <TT>lexicalState</TT> |
| |
| <P> |
| </LI> |
| <LI><TT>void yypushback(int number)</TT> |
| <BR> pushes <TT>number</TT> characters of the matched text back into the input stream. |
| They will be read again in the next call of the scanning method. |
| The number of characters to be read again must not be greater than the length |
| of the matched text. After the call of <TT>yypushback</TT>, the pushed back |
| characters are no longer included in <TT>yylength()</TT> and <TT>yytext()</TT>. |
| Please note that Java strings are immutable, i.e. action code like |
| <PRE> |
| String matched = yytext(); |
| yypushback(1); |
| return matched; |
| </PRE> |
| will return the whole matched text, while |
| <PRE> |
| yypushback(1); |
| return yytext(); |
| </PRE> |
| will return the matched text minus the last character. |
| |
| <P> |
| </LI> |
| <LI><TT>int yyline</TT> |
| <BR> contains the current line of input (starting with 0, only active with |
| the <TT><A HREF="manual.html#Counting">%line</A></TT> directive) |
| |
| <P> |
| </LI> |
| <LI><TT>int yychar</TT> |
| <BR> contains the current character count in the input (starting with 0, |
| only active with the <TT><A HREF="manual.html#Counting">%char</A></TT> directive) |
| |
| <P> |
| </LI> |
| <LI><TT>int yycolumn</TT> |
| <BR> contains the current column of the current line (starting with 0, only |
| active with the <TT><A HREF="manual.html#Counting">%column</A></TT> directive) |
| |
| <P> |
| </LI> |
| </UL> |
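| |
| <P> |
| A short sketch of several of these members used together in actions. The |
| rules and messages are made up for illustration; <TT>yyline</TT> and |
| <TT>yycolumn</TT> are only maintained because <TT>%line</TT> and |
| <TT>%column</TT> are given. |
| <PRE> |
| %line |
| %column |
| %% |
| [0-9]+ "."    { yypushback(1);   /* the dot is scanned again; yytext() now holds only the digits */ |
|                 System.out.println("number " + yytext()); } |
| [a-z]+        { if (yycharat(0) == 'x') |
|                   System.out.println("x-word of length " + yylength() |
|                       + " at line " + (yyline+1) + ", column " + (yycolumn+1)); } |
| [^]           { /* ignore everything else */ } |
| </PRE> |
| Since both counters start with 0, the example adds 1 before printing them. |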
| |
| <P> |
| |
| <H1><A NAME="SECTION00060000000000000000"></A><A NAME="sec:encodings"></A><BR> |
| Encodings, Platforms, and Unicode |
| </H1> |
| |
| <P> |
| This section tries to shed some light on the issues of Unicode and |
| encodings, cross platform scanning, and how to deal with binary data. |
| My thanks go to Stephen Ostermiller for his input on this topic. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00061000000000000000"></A><A NAME="sec:howtoencoding"></A><BR> |
| The Problem |
| </H2> |
| |
| <P> |
| Before we dive straight into details, let's take a look at what the |
| problem is. The problem is Java's platform independence when you want |
| to use it. For scanners the interesting part about platform |
| independence is character encodings and how they are handled. |
| |
| <P> |
| If a program reads a file from disk, it gets a stream of bytes. In |
| earlier times, when the grass was green, and the world was much |
| simpler, everybody knew that the byte value 65 is, of course, an A. |
| It was no problem to see which bytes meant which characters (actually |
| these times never existed, but anyway). The normal Latin alphabet |
| only has 26 characters, so 7 bits or 128 distinct values should surely |
| be enough to map them, even if you allow yourself the luxury of upper |
| and lower case. Nowadays, things are different. The world suddenly |
| grew much larger, and all kinds of people wanted all kinds of special |
| characters, just because they use them in their language and writing. |
| This is where the mess starts. Since the 128 distinct values were |
| already filled up with other stuff, people began to use all 8 bits of |
| the byte, and extended the byte/character mappings to fit their need, |
| and of course everybody did it differently. Some people for instance |
| may have said ``let's use the value 213 for the German character ä''. Others |
| may have found that 213 should much rather mean é, because they didn't need |
| German and wrote French instead. As long as you use your program and |
| data files only on one platform, this is no problem, as all know what |
| means what, and everything gets used consistently. |
| |
| <P> |
| Now Java comes into play, and wants to run everywhere (once written, |
| that is) and now there suddenly is a problem: how do I get the same |
| program to say ä to a certain byte when it runs in Germany and maybe é |
| when it runs in France? And also the other way around: when I want to |
| say é on the screen, which byte value should I send to the operating |
| system? |
| |
| <P> |
| Java's solution to this is to use Unicode internally. Unicode aims to |
| be a superset of all known character sets and is therefore a perfect base |
| for encoding things that might get used all over the world. To make |
| things work correctly, you still have to know where you are and how to |
| map byte values to Unicode characters and vice versa, but the |
| important thing is, that this mapping is at least possible (you can |
| map Kanji characters to Unicode, but you cannot map them to ASCII or |
| iso-latin-1). |
| |
| <P> |
| |
| <H2><A NAME="SECTION00062000000000000000"></A><A NAME="sec:howtotext"></A><BR> |
| Scanning text files |
| </H2> |
| |
| <P> |
| Scanning text files is the standard application for scanners like |
| JFlex. Therefore it should also be the most convenient one. Most times |
| it is. |
| |
| <P> |
| The following scenario works like a breeze: |
| You work on a platform X, write your lexer specification there, can |
| use any obscure Unicode character in it as you like, and compile the |
| program. Your users work on any platform Y (possibly but not |
| necessarily something different from X), they write their input files |
| on Y and they run your program on Y. No problems. |
| |
| <P> |
| Java does this as follows: |
| If you want to read anything in Java that is supposed to contain text, |
| you use a <TT>FileReader</TT> or some <TT>InputStream</TT> together with |
| an <TT>InputStreamReader</TT>. <TT>InputStreams</TT> return the raw bytes, the |
| <TT>InputStreamReader</TT> converts the bytes into Unicode characters with |
| the platform's default encoding. If a text file is produced on the |
| same platform, the platform's default encoding should do the mapping |
| correctly. Since JFlex also uses readers and Unicode internally, this |
| mechanism also works for the scanner specifications. If you write an |
| <TT>A</TT> in your text editor and the editor uses the platform's encoding (say <TT>A</TT> is 65), |
| then Java translates this into the logical Unicode <TT>A</TT> internally. |
| If a user writes an <TT>A</TT> on a completely different platform (say <TT>A</TT> is 237 there), |
| then Java also translates this into the logical Unicode <TT>A</TT> internally. Scanning |
| is performed after that translation and both match. |
| |
| <P> |
| Note that because of this mapping from bytes to characters, you should always |
| use the <TT>%unicode</TT> switch in your lexer specification if you want to scan |
| text files. <TT>%8bit</TT> may not be enough, even if |
| you know that your platform only uses one byte per character. The encoding |
| Cp1252 used on many Windows machines for instance knows at most 256 characters, but |
| the right single quotation mark (&#8217;) with Cp1252 code <code>\x92</code> has the Unicode value <code>\u2019</code>, which |
| is larger than 255 and which would make your scanner throw an |
| <TT>ArrayIndexOutOfBoundsException</TT> if it is encountered. |
| |
| <P> |
| So for the usual case you don't have to do anything but use the |
| <TT>%unicode</TT> switch in your lexer specification. |
| |
| <P> |
| Things may break when you produce a text file on platform X and |
| consume it on a different platform Y. Let's say you have a file |
| written on a Windows PC using the encoding Cp1252. Then you move |
| this file to a Linux PC with encoding ISO 8859-1 and there you want |
| to run your scanner on it. Java now thinks the file is encoded |
| in ISO 8859-1 (the platform's default encoding) while it really is |
| encoded in Cp1252. For most characters |
| Cp1252 and ISO 8859-1 are the same, but for the byte values <code>\x80</code> |
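| to <code>\x9f</code> they disagree: Cp1252 assigns printable characters to most |
| of them, while ISO 8859-1 defines no printable characters there (Java decodes |
| these bytes as control characters). You can fix |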
| the problem by telling Java explicitly which encoding to use. When |
| constructing the <TT>InputStreamReader</TT>, you can give the encoding |
| as argument. The line |
| <DIV ALIGN="CENTER"> |
| <TT>Reader r = new InputStreamReader(input, "Cp1252"); </TT> |
| |
| </DIV> |
| will do the trick. |
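| |
| <P> |
| The effect can be sketched with a single byte. The example below decodes |
| the byte <code>\x92</code> once as Cp1252 and once as ISO 8859-1, using an |
| in-memory stream in place of a real file. The class and method names are |
| hypothetical and chosen only for this illustration. |

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of JFlex.
final class ExplicitEncodingDemo {
    /** Decodes the first character of the given bytes using the named encoding. */
    static int firstChar(byte[] bytes, String encoding) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            return r.read();
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for unknown encoding names.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0x92 };
        // Same byte, two encodings, two different Unicode characters.
        System.out.printf("Cp1252:     U+%04X%n", firstChar(data, "Cp1252"));
        System.out.printf("ISO-8859-1: U+%04X%n", firstChar(data, "ISO-8859-1"));
    }
}
```

| With Cp1252 the byte becomes <code>\u2019</code>, which lies above 255 and is |
| therefore out of reach for a <TT>%8bit</TT> scanner; with ISO 8859-1 it becomes |
| the control character <code>\u0092</code>. |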
| |
| <P> |
| Of course the encoding to use can also come from the data itself: |
| for instance, when you scan an HTML page, it may have embedded |
| information about its character encoding in the headers. |
| |
| <P> |
| More information about encodings, which ones are supported, what |
| they are called, and how to set them may be found in the |
| official Java documentation in the chapter about |
| internationalization. |
| The link |
| <A NAME="tex2html7" |
| HREF="http://java.sun.com/j2se/1.3/docs/guide/intl/"><TT>http://java.sun.com/j2se/1.3/docs/guide/intl/</TT></A> |
| leads to an online version of this for Sun's JDK 1.3. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00063000000000000000"></A><A NAME="sec:howtobinary"></A><BR> |
| Scanning binaries |
| </H2> |
| |
| <P> |
| Scanning binaries is both easier and more difficult |
| than scanning text files. It's easier because you want |
| the raw bytes and not their meaning, i.e. you don't want |
| any translation. |
| It's more difficult because it's not so easy to get |
| ``no translation'' when you use Java readers. |
| |
| <P> |
| The problem (for binaries) is that JFlex scanners are |
| designed to work on text. Therefore the interface is |
| the <TT>Reader</TT> class (there is a constructor |
| for <TT>InputStream</TT> instances, but it's just there |
| for convenience and wraps an <TT>InputStreamReader</TT> |
| around it to get characters, not bytes). |
| You can still get a binary scanner when you write |
| your own custom <TT>InputStreamReader</TT> class that |
| explicitly does no translation, but just copies |
| byte values to character codes instead. It sounds |
| quite easy, and actually it is no big deal, but there |
| are a few little pitfalls on the way. In the scanner |
| specification you can only enter positive character |
| codes (for bytes that is <code>\x00</code> |
| to <code>\xFF</code>). Java's <TT>byte</TT> type on the other hand |
| is a signed 8 bit integer (-128 to 127), so you have to convert |
| them properly in your custom <TT>Reader</TT>. Also, you should |
| take care when you write your lexer spec: if you |
| use text in there, it gets interpreted by an encoding |
| first, and what scanner you get as result might depend |
| on which platform you run JFlex on when you generate |
| the scanner (this is what you want for text, but for binaries it |
| gets in the way). If you are not sure, or if the development |
| platform might change, it's probably best to use character |
| code escapes in all places, since they don't change their |
| meaning. |
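| |
| <P> |
| A reader of this kind could look roughly like the following sketch. The class |
| name <TT>RawByteReader</TT> and its helper method <TT>codes</TT> are assumptions made |
| for this illustration; the reader shipped in <TT>examples/binary</TT> may be |
| written differently. The important detail is the mask <code>&amp; 0xFF</code>, which turns |
| Java's signed bytes into character codes between 0 and 255. |

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Hypothetical reader that maps each byte 1:1 to a char in the range 0..255. */
final class RawByteReader extends Reader {
    private final InputStream in;

    RawByteReader(InputStream in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        byte[] buf = new byte[len];
        int n = in.read(buf, 0, len);
        if (n < 0) {
            return -1; // end of stream
        }
        for (int i = 0; i < n; i++) {
            // Mask off the sign extension: (byte) 0xCA is -54, but we want 202.
            cbuf[off + i] = (char) (buf[i] & 0xFF);
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Demo helper: returns the character codes the reader produces for the bytes. */
    static int[] codes(byte[] bytes) {
        try (Reader r = new RawByteReader(new ByteArrayInputStream(bytes))) {
            int[] result = new int[bytes.length];
            for (int i = 0; i < result.length; i++) {
                result[i] = r.read();
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

| A scanner constructed with such a reader sees exactly the values |
| <code>\x00</code> to <code>\xFF</code>, independent of the platform's default encoding. |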
| |
| <P> |
| To illustrate these points, the example in <TT>examples/binary</TT> |
| contains a very small binary scanner that tries to |
| detect if a file is a Java <TT>class</TT> file. For that |
| purpose it checks whether the file begins with the magic number <code>0xCAFEBABE</code>, |
| that is, with the bytes <code>\xCA\xFE\xBA\xBE</code>. |
| |
| <P> |
| |
| <H1><A NAME="SECTION00070000000000000000"></A><A NAME="performance"></A><BR> |
| A few words on performance |
| </H1> |
| This section gives some empirical results about the speed of JFlex generated |
| scanners in comparison to those generated by JLex, |
| compares a JFlex scanner with a <A HREF="manual.html#PerformanceHandwritten">handwritten</A> |
| one, and presents some <A HREF="manual.html#PerformanceTips">tips</A> on how to make |
| your specification produce a faster scanner. |
| |
| <P> |
| |
| <H2><A NAME="SECTION00071000000000000000"></A><A NAME="PerformanceJLex"></A><BR> |
| Comparison of JLex and JFlex |
| </H2> |
| Scanners generated by the tool JLex are quite fast. It was however |
| possible to further improve the performance of generated scanners |
| using JFlex. The following table shows the results that were produced |
| by the scanner specification of a small toy programming language (in |
| fact the example from the JLex website). The scanner was generated |
| using JLex and all three different JFlex code generation methods. Then |
| it was run on a W98 system using Sun's JDK 1.3 with different sample inputs |
| of that toy programming language. All test runs were made under the |
| same conditions on an otherwise idle machine. |
| |
| <P> |
| The values presented in the table denote the time from the first call |
| to the scanning method to returning the EOF value and the speedup in |
| percent. The tests were run both in the mixed (HotSpot) JVM mode and |
| the pure interpreted mode. The mixed mode JVM brings |
| about a factor of 10 performance improvement, while the difference between |
| JLex and JFlex decreases only slightly. |
| |
| <P> |
| <TABLE CELLPADDING=3 BORDER="1" WIDTH="100%"> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD> |
| <TD ALIGN="CENTER">JVM</TD> |
| <TD ALIGN="RIGHT">JLex</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">325 ms</TD> |
| <TD ALIGN="RIGHT">261 ms</TD> |
| <TD ALIGN="RIGHT">24.5 %</TD> |
| <TD ALIGN="RIGHT">261 ms</TD> |
| <TD ALIGN="RIGHT">24.5 %</TD> |
| <TD ALIGN="RIGHT">261 ms</TD> |
| <TD ALIGN="RIGHT">24.5 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">127 ms</TD> |
| <TD ALIGN="RIGHT">98 ms</TD> |
| <TD ALIGN="RIGHT">29.6 %</TD> |
| <TD ALIGN="RIGHT">94 ms</TD> |
| <TD ALIGN="RIGHT">35.1 %</TD> |
| <TD ALIGN="RIGHT">96 ms</TD> |
| <TD ALIGN="RIGHT">32.3 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">66 ms</TD> |
| <TD ALIGN="RIGHT">50 ms</TD> |
| <TD ALIGN="RIGHT">32.0 %</TD> |
| <TD ALIGN="RIGHT">50 ms</TD> |
| <TD ALIGN="RIGHT">32.0 %</TD> |
| <TD ALIGN="RIGHT">48 ms</TD> |
| <TD ALIGN="RIGHT">37.5 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">4009 ms</TD> |
| <TD ALIGN="RIGHT">3025 ms</TD> |
| <TD ALIGN="RIGHT">32.5 %</TD> |
| <TD ALIGN="RIGHT">3258 ms</TD> |
| <TD ALIGN="RIGHT">23.1 %</TD> |
| <TD ALIGN="RIGHT">3231 ms</TD> |
| <TD ALIGN="RIGHT">24.1 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">1641 ms</TD> |
| <TD ALIGN="RIGHT">1155 ms</TD> |
| <TD ALIGN="RIGHT">42.1 %</TD> |
| <TD ALIGN="RIGHT">1245 ms</TD> |
| <TD ALIGN="RIGHT">31.8 %</TD> |
| <TD ALIGN="RIGHT">1234 ms</TD> |
| <TD ALIGN="RIGHT">33.0 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">817 ms</TD> |
| <TD ALIGN="RIGHT">573 ms</TD> |
| <TD ALIGN="RIGHT">42.6 %</TD> |
| <TD ALIGN="RIGHT">617 ms</TD> |
| <TD ALIGN="RIGHT">32.4 %</TD> |
| <TD ALIGN="RIGHT">613 ms</TD> |
| <TD ALIGN="RIGHT">33.3 %</TD> |
| </TR> |
| </TABLE> |
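| |
| <P> |
| The speedup columns are consistent with the formula |
| <code>(t_JLex / t_JFlex - 1) * 100</code>, rounded to one decimal place. This formula |
| is inferred from the numbers in the tables and is not stated explicitly here, |
| so the following sketch (with a hypothetical class name) should be read as an |
| assumption: |

```java
// Hypothetical helper; the formula is inferred from the published numbers.
final class Speedup {
    /** Speedup in percent of the improved time over the baseline time, rounded to one decimal. */
    static double percent(double baselineMs, double improvedMs) {
        double raw = (baselineMs / improvedMs - 1.0) * 100.0;
        return Math.round(raw * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // First row of the table above: JLex 325 ms, JFlex 261 ms.
        System.out.println(percent(325, 261) + " %");
        // Second row, %table column: JLex 127 ms, JFlex 94 ms.
        System.out.println(percent(127, 94) + " %");
    }
}
```

| A speedup of 100 % in this sense means that the JFlex scanner needs half |
| the time of the JLex scanner. |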
| |
| <P><BR> |
| |
| <P> |
| Since the scanning time of the lexical analyzer examined in the table |
| above includes lexical actions that often need to create new object instances, |
| another table shows the execution time for the same specification with empty |
| lexical actions to compare the pure scanning engines. |
| |
| <P> |
| <TABLE CELLPADDING=3 BORDER="1" WIDTH="100%"> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD> |
| <TD ALIGN="CENTER">JVM</TD> |
| <TD ALIGN="RIGHT">JLex</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">204 ms</TD> |
| <TD ALIGN="RIGHT">140 ms</TD> |
| <TD ALIGN="RIGHT">45.7 %</TD> |
| <TD ALIGN="RIGHT">138 ms</TD> |
| <TD ALIGN="RIGHT">47.8 %</TD> |
| <TD ALIGN="RIGHT">140 ms</TD> |
| <TD ALIGN="RIGHT">45.7 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">83 ms</TD> |
| <TD ALIGN="RIGHT">55 ms</TD> |
| <TD ALIGN="RIGHT">50.9 %</TD> |
| <TD ALIGN="RIGHT">52 ms</TD> |
| <TD ALIGN="RIGHT">59.6 %</TD> |
| <TD ALIGN="RIGHT">52 ms</TD> |
| <TD ALIGN="RIGHT">59.6 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">41 ms</TD> |
| <TD ALIGN="RIGHT">28 ms</TD> |
| <TD ALIGN="RIGHT">46.4 %</TD> |
| <TD ALIGN="RIGHT">26 ms</TD> |
| <TD ALIGN="RIGHT">57.7 %</TD> |
| <TD ALIGN="RIGHT">26 ms</TD> |
| <TD ALIGN="RIGHT">57.7 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">2983 ms</TD> |
| <TD ALIGN="RIGHT">2036 ms</TD> |
| <TD ALIGN="RIGHT">46.5 %</TD> |
| <TD ALIGN="RIGHT">2230 ms</TD> |
| <TD ALIGN="RIGHT">33.8 %</TD> |
| <TD ALIGN="RIGHT">2232 ms</TD> |
| <TD ALIGN="RIGHT">33.6 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">1260 ms</TD> |
| <TD ALIGN="RIGHT">793 ms</TD> |
| <TD ALIGN="RIGHT">58.9 %</TD> |
| <TD ALIGN="RIGHT">865 ms</TD> |
| <TD ALIGN="RIGHT">45.7 %</TD> |
| <TD ALIGN="RIGHT">867 ms</TD> |
| <TD ALIGN="RIGHT">45.3 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">628 ms</TD> |
| <TD ALIGN="RIGHT">395 ms</TD> |
| <TD ALIGN="RIGHT">59.0 %</TD> |
| <TD ALIGN="RIGHT">432 ms</TD> |
| <TD ALIGN="RIGHT">45.4 %</TD> |
| <TD ALIGN="RIGHT">432 ms</TD> |
| <TD ALIGN="RIGHT">45.4 %</TD> |
| </TR> |
| </TABLE> |
| |
| <P><BR> |
| |
| <P> |
| Execution time of single instructions depends on the platform and |
| the implementation of the Java Virtual Machine the program is executed |
| on. Therefore the tables above cannot be used as a reference for which |
| code generation method of JFlex is the right one to choose in general. |
| The following table was produced by the same lexical specification and |
| the same input on a Linux system also using Sun's JDK 1.3. |
| |
| <P> |
| With actions: |
| |
| <P> |
| <TABLE CELLPADDING=3 BORDER="1" WIDTH="100%"> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD> |
| <TD ALIGN="CENTER">JVM</TD> |
| <TD ALIGN="RIGHT">JLex</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">246 ms</TD> |
| <TD ALIGN="RIGHT">203 ms</TD> |
| <TD ALIGN="RIGHT">21.2 %</TD> |
| <TD ALIGN="RIGHT">193 ms</TD> |
| <TD ALIGN="RIGHT">27.5 %</TD> |
| <TD ALIGN="RIGHT">190 ms</TD> |
| <TD ALIGN="RIGHT">29.5 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">99 ms</TD> |
| <TD ALIGN="RIGHT">76 ms</TD> |
| <TD ALIGN="RIGHT">30.3 %</TD> |
| <TD ALIGN="RIGHT">69 ms</TD> |
| <TD ALIGN="RIGHT">43.5 %</TD> |
| <TD ALIGN="RIGHT">70 ms</TD> |
| <TD ALIGN="RIGHT">41.4 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">48 ms</TD> |
| <TD ALIGN="RIGHT">36 ms</TD> |
| <TD ALIGN="RIGHT">33.3 %</TD> |
| <TD ALIGN="RIGHT">34 ms</TD> |
| <TD ALIGN="RIGHT">41.2 %</TD> |
| <TD ALIGN="RIGHT">35 ms</TD> |
| <TD ALIGN="RIGHT">37.1 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">3251 ms</TD> |
| <TD ALIGN="RIGHT">2247 ms</TD> |
| <TD ALIGN="RIGHT">44.7 %</TD> |
| <TD ALIGN="RIGHT">2430 ms</TD> |
| <TD ALIGN="RIGHT">33.8 %</TD> |
| <TD ALIGN="RIGHT">2444 ms</TD> |
| <TD ALIGN="RIGHT">33.0 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">1320 ms</TD> |
| <TD ALIGN="RIGHT">848 ms</TD> |
| <TD ALIGN="RIGHT">55.7 %</TD> |
| <TD ALIGN="RIGHT">958 ms</TD> |
| <TD ALIGN="RIGHT">37.8 %</TD> |
| <TD ALIGN="RIGHT">920 ms</TD> |
| <TD ALIGN="RIGHT">43.5 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">658 ms</TD> |
| <TD ALIGN="RIGHT">423 ms</TD> |
| <TD ALIGN="RIGHT">55.6 %</TD> |
| <TD ALIGN="RIGHT">456 ms</TD> |
| <TD ALIGN="RIGHT">44.3 %</TD> |
| <TD ALIGN="RIGHT">452 ms</TD> |
| <TD ALIGN="RIGHT">45.6 %</TD> |
| </TR> |
| </TABLE> |
| |
| <P><BR> |
| |
| <P> |
| Without actions: |
| |
| <P> |
| <TABLE CELLPADDING=3 BORDER="1" WIDTH="100%"> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">KB</TD> |
| <TD ALIGN="CENTER">JVM</TD> |
| <TD ALIGN="RIGHT">JLex</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%switch</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%table</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| <TD ALIGN="RIGHT"><FONT SIZE="-1"><TT>%pack</TT></FONT></TD> |
| <TD ALIGN="RIGHT">speedup</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">136 ms</TD> |
| <TD ALIGN="RIGHT">78 ms</TD> |
| <TD ALIGN="RIGHT">74.4 %</TD> |
| <TD ALIGN="RIGHT">76 ms</TD> |
| <TD ALIGN="RIGHT">78.9 %</TD> |
| <TD ALIGN="RIGHT">77 ms</TD> |
| <TD ALIGN="RIGHT">76.6 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">59 ms</TD> |
| <TD ALIGN="RIGHT">31 ms</TD> |
| <TD ALIGN="RIGHT">90.3 %</TD> |
| <TD ALIGN="RIGHT">48 ms</TD> |
| <TD ALIGN="RIGHT">22.9 %</TD> |
| <TD ALIGN="RIGHT">32 ms</TD> |
| <TD ALIGN="RIGHT">84.4 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">28 ms</TD> |
| <TD ALIGN="RIGHT">15 ms</TD> |
| <TD ALIGN="RIGHT">86.7 %</TD> |
| <TD ALIGN="RIGHT">15 ms</TD> |
| <TD ALIGN="RIGHT">86.7 %</TD> |
| <TD ALIGN="RIGHT">15 ms</TD> |
| <TD ALIGN="RIGHT">86.7 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">1992 ms</TD> |
| <TD ALIGN="RIGHT">1047 ms</TD> |
| <TD ALIGN="RIGHT">90.3 %</TD> |
| <TD ALIGN="RIGHT">1246 ms</TD> |
| <TD ALIGN="RIGHT">59.9 %</TD> |
| <TD ALIGN="RIGHT">1215 ms</TD> |
| <TD ALIGN="RIGHT">64.0 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">187</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">859 ms</TD> |
| <TD ALIGN="RIGHT">408 ms</TD> |
| <TD ALIGN="RIGHT">110.5 %</TD> |
| <TD ALIGN="RIGHT">479 ms</TD> |
| <TD ALIGN="RIGHT">79.3 %</TD> |
| <TD ALIGN="RIGHT">487 ms</TD> |
| <TD ALIGN="RIGHT">76.4 %</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">93</TD> |
| <TD ALIGN="CENTER">interpr.</TD> |
| <TD ALIGN="RIGHT">435 ms</TD> |
| <TD ALIGN="RIGHT">200 ms</TD> |
| <TD ALIGN="RIGHT">117.5 %</TD> |
| <TD ALIGN="RIGHT">237 ms</TD> |
| <TD ALIGN="RIGHT">83.5 %</TD> |
| <TD ALIGN="RIGHT">242 ms</TD> |
| <TD ALIGN="RIGHT">79.8 %</TD> |
| </TR> |
| </TABLE> |
| |
| <P><BR> |
| |
| <P> |
| Although all JFlex scanners were faster than those generated by JLex, |
| slight differences between JFlex code generation methods show up when compared |
| to the run on the W98 system. |
| <A NAME="PerformanceHandwritten"></A> |
| <P> |
| The following table compares a handwritten scanner for the Java language |
| obtained from the website of CUP with the JFlex generated scanner for Java |
| that comes with JFlex in the <TT>examples</TT> directory. They were tested |
| on different <TT>.java</TT> files on a Linux machine with Sun's JDK 1.3. |
| |
| <P> |
| <TABLE CELLPADDING=3 BORDER="1" WIDTH="100%"> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">lines</TD> |
| <TD ALIGN="RIGHT">KB</TD> |
| <TD ALIGN="CENTER">JVM</TD> |
| <TD ALIGN="RIGHT">handwritten scanner</TD> |
| <TD ALIGN="CENTER" COLSPAN=2>JFlex generated scanner</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">19050</TD> |
| <TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">824 ms</TD> |
| <TD ALIGN="RIGHT">248 ms</TD> |
| <TD ALIGN="RIGHT">235 % faster</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">6350</TD> |
| <TD ALIGN="RIGHT">165</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">272 ms</TD> |
| <TD ALIGN="RIGHT">84 ms</TD> |
| <TD ALIGN="RIGHT">232 % faster</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">1270</TD> |
| <TD ALIGN="RIGHT">33</TD> |
| <TD ALIGN="CENTER">hotspot</TD> |
| <TD ALIGN="RIGHT">53 ms</TD> |
| <TD ALIGN="RIGHT">18 ms</TD> |
| <TD ALIGN="RIGHT">194 % faster</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">19050</TD> |
| <TD ALIGN="RIGHT">496</TD> |
| <TD ALIGN="CENTER">interpreted</TD> |
| <TD ALIGN="RIGHT">5.83 s</TD> |
| <TD ALIGN="RIGHT">3.85 s</TD> |
| <TD ALIGN="RIGHT">51 % faster</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">6350</TD> |
| <TD ALIGN="RIGHT">165</TD> |
| <TD ALIGN="CENTER">interpreted</TD> |
| <TD ALIGN="RIGHT">1.95 s</TD> |
| <TD ALIGN="RIGHT">1.29 s</TD> |
| <TD ALIGN="RIGHT">51 % faster</TD> |
| </TR> |
| <TR><TD ALIGN="LEFT"> </TD><TD ALIGN="RIGHT">1270</TD> |
| <TD ALIGN="RIGHT">33</TD> |
| <TD ALIGN="CENTER">interpreted</TD> |
| <TD ALIGN="RIGHT">0.38 s</TD> |
| <TD ALIGN="RIGHT">0.25 s</TD> |
| <TD ALIGN="RIGHT">52 % faster</TD> |
| </TR> |
| </TABLE> |
| |
| <P><BR> |
| |
| <P> |
| Although JDK 1.3 seems to speed up the handwritten scanner more than the |
| generated one when compared to JDK 1.1 or 1.2, the generated scanner is |
| still up to 3.3 times as fast as the handwritten one. One example of |
| a handwritten scanner that is |
| considerably slower than the equivalent generated one is surely no |
| proof that all generated scanners are faster than handwritten ones. It is |
| clearly impossible to prove something like that, since you could |
| always write the generated scanner by hand. From a software |
| engineering point of view however, there is no excuse for writing a |
| scanner by hand since this task takes more time, is more difficult and |
| therefore more error prone than writing a compact, readable and easy |
| to change lexical specification. (I'd like to add that I do <EM>not</EM> |
| think that the handwritten scanner from the CUP website used here in |
| the test is stupid or badly written or anything like that. I actually |
| think Scott did a great job with it, and that for learning about |
| lexers it is quite valuable to study it or even to write a similar one |
| for oneself.) |
| |
| <P> |
| |
| <H2><A NAME="SECTION00072000000000000000"></A><A NAME="PerformanceTips"></A><BR> |
| How to write a faster specification |
| </H2> |
| Although JFlex generated scanners show good performance without |
| special optimizations, there are some heuristics that can make a |
| lexical specification produce an even faster scanner. Those are |
| (roughly in order of performance gain): |
| |
| <P> |
| |
| <UL> |
| <LI>Avoid rules that require backtracking |
| |
| <P> |
| From the C/C++ flex [<A |
| HREF="manual.html#flex">11</A>] manpage: <EM>``Getting rid |
| of backtracking is messy and often may be an enormous amount of work for |
| a complicated scanner.''</EM> Backtracking is introduced by the longest match |
| rule and occurs for instance on this set of expressions: |
| |
| <P> |
| <TT> "averylongkeyword"</TT> |
| <BR><TT> .</TT> |
| |
| <P> |
| With input <TT>"averylongjoke"</TT> the scanner has to read all characters |
| up to <TT>'j' </TT>to decide that rule <TT>.</TT> should be matched. All |
| characters of <TT>"verylong"</TT> have to be read again for the next |
| matching process. Backtracking can be avoided in general by adding |
| error rules that match those error conditions: |
| |
| <P> |
| <code> "av"|"ave"|"avery"|"averyl"|..</code> |
| |
| <P> |
| While this is impractical in most scanners, there is still the |
| possibility to add a ``catch all'' rule for a lengthy list of keywords: |
| <PRE> |
| "keyword1" { return symbol(KEYWORD1); } |
| .. |
| "keywordn" { return symbol(KEYWORDn); } |
| [a-z]+ { error("not a keyword"); } |
| </PRE> |
| Most programming language scanners already have a rule like this for |
| some kind of variable length identifiers. |
| |
| <P> |
| </LI> |
| <LI>Avoid line and column counting |
| |
| <P> |
| It costs multiple additional comparisons per input character and the |
| matched text has to be rescanned for counting. In most scanners it |
| is possible to do the line counting in the specification by |
| incrementing <TT>yyline</TT> each time a line terminator has been |
| matched. Column counting could also be included in actions. This |
| will be faster, but can in some cases become quite messy. |
| |
| <P> |
| </LI> |
| <LI>Avoid lookahead expressions and the end of line operator '$' |
| |
| <P> |
| The trailing context will first have to be read and then (because |
| it is not to be consumed) read again. |
| |
| <P> |
| </LI> |
| <LI>Avoid the beginning of line operator '<code>^</code>' |
| |
| <P> |
| It costs multiple additional comparisons per match. In some |
| cases one extra lookahead character is needed (when the last character read is |
| <code>\r</code> the scanner has to read one character ahead to check if |
| the next one is an <code>\n</code> or not). |
| |
| <P> |
| </LI> |
| <LI>Match as much text as possible in a rule. |
| |
| <P> |
| One rule is matched in the innermost loop of the scanner. After |
| each action some overhead for setting up the internal state of the |
| scanner is necessary. |
| </LI> |
| </UL> |
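| |
| <P> |
| The cost of backtracking from the first tip can be made visible with a small |
| model. The sketch below is a simplified simulation of longest match scanning |
| and is <EM>not</EM> the code JFlex generates; the class <TT>BacktrackDemo</TT> and its |
| method are hypothetical. It counts how many characters are read for the rules |
| <TT>"averylongkeyword"</TT> and <TT>.</TT>, with and without an additional catch |
| all rule <TT>[a-z]+</TT>. |

```java
// Hypothetical, simplified model of longest-match scanning; not JFlex's generated code.
final class BacktrackDemo {
    static final String KEYWORD = "averylongkeyword";

    /**
     * Counts the characters a longest-match scanner reads for the rules
     * KEYWORD and "." (any single character), optionally with a catch-all
     * rule [a-z]+ added.
     */
    static int charsRead(String input, boolean withCatchAll) {
        int reads = 0;
        int pos = 0;
        while (pos < input.length()) {
            int matchEnd = pos;               // end of the longest match found so far
            boolean keywordAlive = true;      // KEYWORD can still match
            boolean wordAlive = withCatchAll; // [a-z]+ can still match
            int i = pos;
            do {
                char c = input.charAt(i);
                reads++;
                int offset = i - pos;
                keywordAlive = keywordAlive && offset < KEYWORD.length()
                        && KEYWORD.charAt(offset) == c;
                wordAlive = wordAlive && c >= 'a' && c <= 'z';
                i++;
                boolean accepts = offset == 0     // "." accepts any first character
                        || wordAlive              // [a-z]+ accepts the text so far
                        || (keywordAlive && offset + 1 == KEYWORD.length());
                if (accepts) {
                    matchEnd = i;
                }
            } while (i < input.length() && (keywordAlive || wordAlive));
            pos = matchEnd; // everything read beyond matchEnd is read again
        }
        return reads;
    }

    public static void main(String[] args) {
        String input = "averylongjoke";
        System.out.println("without catch-all: " + charsRead(input, false));
        System.out.println("with catch-all:    " + charsRead(input, true));
    }
}
```

| In this model the 13 input characters cost 22 reads without the catch all |
| rule and 13 reads with it, because the text after the first <TT>a</TT> no longer |
| has to be scanned twice. |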
| |
| <P> |
| Note that writing more rules in a specification does not make the generated |
| scanner slower (except when you have to switch to another code generation |
| method because of the larger size). |
| |
| <P> |
| The two main rules of optimization also apply to lexical specifications: |
| |
| <OL> |
| <LI><B>don't do it</B> |
| </LI> |
| <LI><B>(for experts only) don't do it yet</B> |
| </LI> |
| </OL> |
| |
| <P> |
| Some of the performance tips above contradict a readable and compact |
| specification style. When in doubt, or when requirements are not yet |
| fixed, don't use them: the specification can always be optimized |
| at a later stage of the development process. |
| |
| <P> |
| |
| <H1><A NAME="SECTION00080000000000000000"> |
| Porting Issues</A> |
| </H1> |
| |
| <P> |
| |
| <H2><A NAME="SECTION00081000000000000000"></A><A NAME="Porting"></A><BR> |
| Porting from JLex |
| </H2> |
| JFlex was designed to read old JLex specifications unchanged and to |
| generate a scanner which behaves exactly the same as the one generated |
| by JLex with the only difference of being faster. |
| |
| <P> |
| This works as expected on all well formed JLex specifications. |
| |
| <P> |
| Since the statement above is somewhat absolute, let's take a look at |
| what ``well formed'' means here. A JLex specification is well formed when |
| it |
| |
| <UL> |
| <LI>generates a working scanner with JLex |
| |
| <P> |
| </LI> |
| <LI>doesn't contain the unescaped characters <TT>!</TT> and <TT>~</TT> |
| |
| <P> |
| They are operators in JFlex while JLex treats them as normal |
| input characters. You can easily port such a JLex specification |
| to JFlex by replacing every <TT>!</TT> with <code>\!</code> and every |
| <code>~</code> with <code>\~</code> in all regular expressions. |
| |
| <P> |
| </LI> |
| <LI>has only complete regular expressions surrounded by parentheses in |
| macro definitions |
| |
| <P> |
| This may sound a bit harsh, but could otherwise be a major problem |
| - it can also help you find some disgusting bugs in your |
| specification that didn't show up in the first place. In JLex, the |
| right hand side of a macro is just a piece of text that is copied |
| to the point where the macro is used. With this, some weird |
| stuff like |
| <PRE> |
| macro1 = ("hello" |
| macro2 = {macro1})* |
| </PRE> |
| was possible (with <TT>macro2</TT> expanding to <code>("hello")*</code>). This |
| is not allowed in JFlex and you will have to transform such |
| definitions. There are however some more subtle kinds of errors that |
| can be introduced by JLex macros. Let's consider a definition like |
| <code>macro = a|b</code> and a usage like <code>{macro}*</code>. |
| This expands in JLex to <code>a|b*</code> and not to the probably intended |
| <code>(a|b)*</code>. |
| |
| <P> |
| JFlex always uses the second form of expansion, since this is the natural |
| form of thinking about abbreviations for regular expressions. |
| |
| <P> |
| Most specifications shouldn't suffer from this problem, because |
| macros often only contain (harmless) character classes like |
| <TT>alpha = [a-zA-Z]</TT> and more dangerous definitions like |
| |
| <P> |
| <code> ident = {alpha}({alpha}|{digit})*</code> |
| |
| <P> |
| are only used to write rules like |
| |
| <P> |
| <code> {ident} { .. action .. }</code> |
| |
| <P> |
| and not more complex expressions like |
| |
| <P> |
| <code> {ident}* { .. action .. }</code> |
| |
| <P> |
| where the kind of error presented above would show up. |
| </LI> |
| </UL> |
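| |
| <P> |
| The difference between the two forms of macro expansion can be tried out with |
| Java's own <TT>java.util.regex</TT> package. Its syntax is not the same as that of |
| JLex or JFlex, but for the simple expressions <code>a|b*</code> and <code>(a|b)*</code> the |
| meaning agrees. The class and method names below are hypothetical and serve |
| only as an illustration. |

```java
import java.util.regex.Pattern;

// Hypothetical demo class; uses java.util.regex, not the JLex/JFlex regex engines.
final class MacroExpansionDemo {
    /** JLex-style expansion: the macro body is pasted as plain text. */
    static String expandTextually(String macroBody, String suffix) {
        return macroBody + suffix;
    }

    /** JFlex-style expansion: the macro body is treated as one complete expression. */
    static String expandAsUnit(String macroBody, String suffix) {
        return "(" + macroBody + ")" + suffix;
    }

    public static void main(String[] args) {
        String textual = expandTextually("a|b", "*"); // a|b*
        String unit = expandAsUnit("a|b", "*");       // (a|b)*
        System.out.println(textual + " matches abab: " + Pattern.matches(textual, "abab"));
        System.out.println(unit + " matches abab: " + Pattern.matches(unit, "abab"));
    }
}
```

| The textual form accepts only a single <TT>a</TT> or a run of <TT>b</TT>s and therefore |
| rejects <TT>abab</TT>, while the parenthesized form accepts it. |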
| |
| <P> |
| |
| <H2><A NAME="SECTION00082000000000000000"></A><A NAME="lexport"></A><BR> |
| Porting from lex/flex |
| </H2> |
| This section tries to give an overview of activities and possible |
| problems when porting a lexical specification from the C/C++ tools lex |
| and flex [<A |
| HREF="manual.html#flex">11</A>] available on most Unix systems to JFlex. |
| |
| <P> |
| Most of the C/C++ specific features are naturally not present in JFlex, |
| but most ``clean'' lex/flex lexical specifications can be ported to |
| JFlex without very much work. |
| |
| <P> |
| This section is far from complete and is based mainly on a survey of |
| the flex man page and very little personal experience. If you do |
| engage in any porting activity from lex/flex to JFlex and encounter |
| problems, have better solutions for points presented here or have just |
| some tips you would like to share, please do <A NAME="tex2html8" |
| HREF="mailto:lsf@jflex.de">contact me</A>. I will |
| incorporate your experiences in this manual (with all due credit to you, |
| of course). |
| |
| <P> |
| |
| <H3><A NAME="SECTION00082100000000000000"> |
| Basic structure</A> |
| </H3> |
| A lexical specification for flex has the following basic structure: |
| <PRE> |
| definitions |
| %% |
| rules |
| %% |
| user code |
| </PRE> |
| |
| <P> |
| The <TT>user code</TT> section usually contains some C code that is used |
| in actions of the <TT>rules</TT> part of the specification. For JFlex most |
| of this code will have to be included in the class code <code>%{..%}</code> |
| directive in the <TT>options and declarations</TT> section (after |
| translating the C code to Java, of course). |
| |
| <P> |
| |
| <H3><A NAME="SECTION00082200000000000000"> |
| Macros and Regular Expression Syntax</A> |
| </H3> |
| The <TT>definitions</TT> section of a flex specification is quite similar |
| to the <TT>options and declarations</TT> part of JFlex specs. |
| |
| <P> |
| Macro definitions in flex have the form: |
| <PRE> |
| &lt;identifier&gt; &lt;expression&gt; |
| </PRE> |
| To port them to JFlex macros, just insert a <TT>=</TT> between <TT>&lt;identifier&gt;</TT> |
| and <TT>&lt;expression&gt;</TT>. |
| |
| <P> |
| The syntax and semantics of regular expressions in flex are pretty much the |
| same as in JFlex. A little attention is needed for some escape sequences |
| present in flex (such as <code>\a</code>) that are not supported in JFlex. These |
| escape sequences should be transformed into their octal or hexadecimal |
| equivalent. |
| |
| <P> |
| Another point is predefined character classes. Flex offers the ones directly |
| supported by C, JFlex offers the ones supported by Java. These classes will |
| sometimes have to be listed manually (if there is need for this feature, it |
| may be implemented in a future JFlex version). |
| |
| <P> |
| |
| <H3><A NAME="SECTION00082300000000000000"> |
| Lexical Rules</A> |
| </H3> |
| Since flex is mostly Unix based, the '<code>^</code>' (beginning of line) and |
| '<code>$</code>' (end of line) operators consider the <code>\n</code> character as the only line terminator. |
| This usually should not cause many problems, but you |
| should be prepared for occurrences of <code>\r</code> or <code>\r\n</code> or one of |
| the characters <code>\u2028</code>, <code>\u2029</code>, <code>\u000B</code>, <code>\u000C</code>, |
| or <code>\u0085</code>. They are considered to be line terminators in Unicode and |
| therefore may not be consumed when |
| <code>^</code> or <code>$</code> is present in a rule. |
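| <P> |
| To find out whether a given input is affected at all, one can look for these characters |
| before scanning. The Java sketch below is a hypothetical helper and not part of JFlex; |
| it counts the line terminators listed above other than <code>\n</code>, that is, the |
| characters that a ported flex specification did not have to deal with. |
| <PRE> |
| /* Hypothetical helper, not part of JFlex. */ |
| class LineTerminatorScan { |
|   /* true for the line terminators named in the text, except the plain newline */ |
|   static boolean isOtherLineTerminator(char c) { |
|     switch (c) { |
|       case '\r': |
|       case 0x000B: |
|       case 0x000C: |
|       case 0x0085: |
|       case 0x2028: |
|       case 0x2029: |
|         return true; |
|       default: |
|         return false; |
|     } |
|   } |
| |
|   /* counts the characters that flex would not have treated as line ends */ |
|   static int countOtherLineTerminators(String input) { |
|     int count = 0; |
|     for (char c : input.toCharArray()) { |
|       if (isOtherLineTerminator(c)) { |
|         count++; |
|       } |
|     } |
|     return count; |
|   } |
| |
|   public static void main(String[] args) { |
|     /* one carriage return and one Unicode line separator */ |
|     String sample = "a\r\nb\nc" + (char) 0x2028 + "d"; |
|     System.out.println(countOtherLineTerminators(sample)); |
|   } |
| } |
| </PRE> |
| A result of 0 means that the input only uses <code>\n</code>, so '<code>^</code>' and |
| '<code>$</code>' see the same line ends as they did in flex. |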
| <P> |
| The trailing context algorithm of flex is better than the one used in |
| JFlex, so lookahead expressions can cause major headaches when porting. JFlex |
| will issue an error message at generation time if it cannot generate |
| a scanner for a certain lookahead expression. (Sorry, I have no more tips |
| on that yet. If anyone knows how the flex lookahead algorithm, or any better one, works |
| and how it can be implemented efficiently, again: please <A NAME="tex2html9" |
| HREF="mailto:lsf@jflex.de">contact me</A>.) |
| |
| <P> |
| |
| <H1><A NAME="SECTION00090000000000000000"></A><A NAME="WorkingTog"></A><BR> |
| Working together |
| </H1> |
| |
| <P> |
| |
| <H2><A NAME="SECTION00091000000000000000"></A><A NAME="CUPWork"></A><BR> |
| JFlex and CUP |
| </H2> |
| One of the main design goals of JFlex was to make interfacing with the free |
| Java parser generator CUP [<A |
| HREF="manual.html#CUP">8</A>] as easy as possible. |
| This has been done by giving |
| the <TT><A HREF="manual.html#CupMode">%cup</A></TT> directive a special meaning. An |
| interface, however, always has two sides. This section concentrates on the |
| CUP side of the story. |
| |
| <P> |
| |
| <H3><A NAME="SECTION00091100000000000000"> |
| CUP version 0.10j</A> |
| </H3> |
| Since CUP version 0.10j, interfacing has been simplified greatly by the new |
| CUP scanner interface <TT>java_cup.runtime.Scanner</TT>. JFlex lexers now implement |
| this interface automatically when the <TT><A HREF="manual.html#CupMode">%cup</A></TT> |
| switch is used. You no longer have to provide any special <TT>parser code</TT>, <TT>init |
| code</TT> or <TT>scan with</TT> options |
| in your CUP parser specification. You can just concentrate on your grammar. |
| |
| <P> |
| If your generated lexer has the class name <TT>Scanner</TT>, the parser |
| is started from a main program like this: |
| |
| <P> |
| <PRE> |
| ... |
| try { |
|   parser p = new parser(new Scanner(new FileReader(fileName))); |
|   Object result = p.parse().value; |
| } |
| catch (Exception e) { |
| ... |
| </PRE> |
| |
| <P> |
| |
| <H3><A NAME="SECTION00091200000000000000"> |
| Using existing JFlex/CUP specifications with CUP 0.10j</A> |
| </H3> |
| If you already have an existing specification and you would like to upgrade |
| both JFlex and CUP to their newest version, you will probably have to adjust |
| your specification. |
| |
| <P> |
| The main difference between the <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch in |
| JFlex 1.2.1 and lower and in the current JFlex version is that JFlex scanners |
| now automatically implement the <TT>java_cup.runtime.Scanner</TT> interface. |
| This means that the scanning function changes its name from <TT>yylex()</TT> |
| to <TT>next_token()</TT>. |
| |
| <P> |
| The main difference from older CUP versions to 0.10j is that the generated parser now |
| has a constructor that accepts a <TT>java_cup.runtime.Scanner</TT> |
| as argument and uses this scanner by |
| default (so no <TT>scan with</TT> code is necessary any more). |
| |
| <P> |
| If you have an existing CUP specification, it will probably look somewhat like this: |
| <PRE> |
| parser code {: |
|   Lexer lexer; |
| |
|   public parser (java.io.Reader input) { |
|     lexer = new Lexer(input); |
|   } |
| :}; |
| |
| scan with {: return lexer.yylex(); :}; |
| </PRE> |
| |
| <P> |
| To upgrade to CUP 0.10j, you could change it to look like this: |
| <PRE> |
| parser code {: |
|   public parser (java.io.Reader input) { |
|     super(new Lexer(input)); |
|   } |
| :}; |
| </PRE> |
| |
| <P> |
| If you do not mind changing the method that calls the parser, |
| you could remove the constructor entirely (and if there is nothing else |
| in it, the whole <TT>parser code</TT> section as well, of course). The calling |
| main procedure would then construct the parser as shown in the section above. |
| |
| <P> |
| The JFlex specification does not need to be changed. |
| |
| <P> |
| |
| <H3><A NAME="SECTION00091300000000000000"> |
| Using older versions of CUP</A> |
| </H3> |
| For people who like or have to use older versions of CUP, the following section |
| explains "the old way". Please note that the standard name of the scanning |
| function with the <TT><A HREF="manual.html#CupMode">%cup</A></TT> switch is not |
| <TT>yylex()</TT>, but <TT>next_token()</TT>. |
| |
| <P> |
| If you have a scanner specification that begins like this: |
| |
| <P> |
| <PRE> |
| package PACKAGE; |
| import java_cup.runtime.*; /* this is convenience, but not necessary */ |
| |
| %% |
| |
| %class Lexer |
| %cup |
| .. |
| </PRE> |
| |
| <P> |
| then it matches a CUP specification starting like |
| |
| <P> |
| <PRE> |
| package PACKAGE; |
| |
| parser code {: |
|   Lexer lexer; |
| |
|   public parser (java.io.Reader input) { |
|     lexer = new Lexer(input); |
|   } |
| :}; |
| |
| scan with {: return lexer.next_token(); :}; |
| |
| .. |
| </PRE> |
| |
| <P> |
| This assumes that the generated parser will get the name <TT>parser</TT>. |
| If it doesn't, you have to adjust the constructor name. |
| |
| <P> |
| The parser can then be started in a main routine like this: |
| |
| <P> |
| <PRE> |
| .. |
| try { |
|   parser p = new parser(new FileReader(fileName)); |
|   Object result = p.parse().value; |
| } |
| catch (Exception e) { |
| .. |
| </PRE> |
| |
| <P> |
| If you want the parser specification to be independent of the name of the generated |
| scanner, you can instead write an interface <TT>Lexer</TT>, |
| |
| <P> |
| <PRE> |
| public interface Lexer { |
|   public java_cup.runtime.Symbol next_token() throws java.io.IOException; |
| } |
| </PRE> |
| |
| <P> |
| change the parser code to: |
| |
| <P> |
| <PRE> |
| package PACKAGE; |
| |
| parser code {: |
|   Lexer lexer; |
| |
|   public parser (Lexer lexer) { |
|     this.lexer = lexer; |
|   } |
| :}; |
| |
| scan with {: return lexer.next_token(); :}; |
| |
| .. |
| </PRE> |
| |
| <P> |
| tell JFlex about the Lexer |
| interface using the <TT>%implements</TT> |
| directive: |
| |
| <P> |
| <PRE> |
| .. |
| %class Scanner /* not Lexer now since that is our interface! */ |
| %implements Lexer |
| %cup |
| .. |
| </PRE> |
| |
| <P> |
| and finally change the main routine to look like |
| |
| <P> |
| <PRE> |
| ... |
| try { |
|   parser p = new parser(new Scanner(new FileReader(fileName))); |
|   Object result = p.parse().value; |
| } |
| catch (Exception e) { |
| ... |
| </PRE> |
| |
| <P> |
| If you want to improve the error messages that CUP-generated parsers |
| produce, you can also override the methods <TT>report_error</TT> and <TT>report_fatal_error</TT> |
| in the <TT>parser code</TT> section of the CUP specification. The new methods |
| could, for instance, use <TT>yyline</TT> and <TT>yycolumn</TT> (stored in |
| the <TT>left</TT> and <TT>right</TT> members of class <TT>java_cup.runtime.Symbol</TT>) |
| to report error positions more conveniently for the user. The lexer and |
| parser for the Java language in the <TT>examples/java</TT> directory of the |
| JFlex distribution use this style of error reporting. These specifications |
| also demonstrate the techniques above in action. |
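| |
| <P> |
| The message formatting itself could look roughly like the following Java sketch. |
| It is not taken from the example specifications: <TT>PositionedSymbol</TT> is a stand-in |
| for <TT>java_cup.runtime.Symbol</TT> so that the sketch compiles without CUP, the helper |
| <TT>ErrorPosition.describe</TT> is a hypothetical name, and the sketch assumes that the |
| lexer stored <TT>yyline</TT> and <TT>yycolumn</TT>, both counted from 0, in <TT>left</TT> |
| and <TT>right</TT>. |
| <PRE> |
| /* Stand-in for java_cup.runtime.Symbol, reduced to the two fields used here. */ |
| class PositionedSymbol { |
|   final int left;   /* assumed to hold yyline, counted from 0 */ |
|   final int right;  /* assumed to hold yycolumn, counted from 0 */ |
| |
|   PositionedSymbol(int left, int right) { |
|     this.left = left; |
|     this.right = right; |
|   } |
| } |
| |
| /* Hypothetical helper that an overridden report_error could delegate to. */ |
| class ErrorPosition { |
|   static String describe(String message, Object info) { |
|     if (info instanceof PositionedSymbol) { |
|       PositionedSymbol s = (PositionedSymbol) info; |
|       /* users usually expect line and column numbers that start at 1 */ |
|       return message + " at line " + (s.left + 1) + ", column " + (s.right + 1); |
|     } |
|     /* no position available, keep the plain message */ |
|     return message; |
|   } |
| |
|   public static void main(String[] args) { |
|     System.out.println(describe("Syntax error", new PositionedSymbol(4, 9))); |
|   } |
| } |
| </PRE> |
| In a real specification the check would be made against <TT>java_cup.runtime.Symbol</TT> |
| instead of the stand-in class. |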
| |
| <P> |
| |
| <H2><A NAME="SECTION00092000000000000000"></A><A NAME="YaccWork"></A><BR> |
| JFlex and BYacc/J |
| </H2> |
| |
| <P> |
| JFlex has built-in support for |
| <A NAME="tex2html10" |
| HREF="http://troi.lincom-asg.com/~rjamison/byacc/">BYacc/J</A> |
| [<A |
| HREF="manual.html#BYaccJ">9</A>] by Bob Jamison, the Java extension |
| of the classical Berkeley Yacc parser generator. |
| This section describes how to interface BYacc/J with JFlex. It |
| builds on many helpful suggestions and comments from Larry Bell. |
| |
| <P> |
| Since Yacc's architecture is a bit different from CUP's, the |
| interface setup also works in a slightly different manner. |
| BYacc/J expects a function <TT>int yylex()</TT> in the parser |
| class that returns the next token on each call. Semantic values are expected |
| in a field <TT>yylval</TT> of type <TT>parserval</TT>, where "<TT>parser</TT>" |
| is the name of the generated parser class. |
| |
| <P> |
| For a small calculator example, one could use a setup like the |
| following on the JFlex side: |
| |
| <P> |
| <PRE> |
| %% |
| |
| %byaccj |
| |
| %{ |
|   /* store a reference to the parser object */ |
|   private parser yyparser; |
| |
|   /* constructor taking an additional parser object */ |
|   public Yylex(java.io.Reader r, parser yyparser) { |
|     this(r); |
|     this.yyparser = yyparser; |
|   } |
| %} |
| |
| NUM = [0-9]+ ("." [0-9]+)? |
| NL = \n | \r | \r\n |
| |
| %% |
| |
| /* operators */ |
| "+" | |
| .. |
| "(" | |
| ")" { return (int) yycharat(0); } |
| |
| /* newline */ |
| {NL} { return parser.NL; } |
| |
| /* float */ |
| {NUM} { yyparser.yylval = new parserval(Double.parseDouble(yytext())); |
|         return parser.NUM; } |
| </PRE> |
| |
| <P> |
| The lexer expects a reference to the parser in its constructor. |
| Since Yacc allows direct use of terminal characters like <TT>'+'</TT> |
| in its specifications, we just return the character code for |
| single char matches (e.g. the operators in the example). Symbolic |
| token names are stored as <TT>public static int</TT> constants in |
| the generated parser class. They are used as in the <TT>NL</TT> token |
| above. Finally, for some tokens, a semantic value may have to be |
| communicated to the parser. The <TT>NUM</TT> rule demonstrates that |
| bit. |
| |
| <P> |
| A matching BYacc/J parser specification could look like this: |
| <PRE> |
| %{ |
| import java.io.*; |
| %} |
| |
| %token NL /* newline */ |
| %token &lt;dval&gt; NUM /* a number */ |
| |
| %type &lt;dval&gt; exp |
| |
| %left '-' '+' |
| .. |
| %right '^' /* exponentiation */ |
| |
| %% |
| |
| .. |
| |
| exp:     NUM          { $$ = $1; } |
|        | exp '+' exp  { $$ = $1 + $3; } |
|        .. |
|        | exp '^' exp  { $$ = Math.pow($1, $3); } |
|        | '(' exp ')'  { $$ = $2; } |
|        ; |
| |
| %% |
|   /* a reference to the lexer object */ |
|   private Yylex lexer; |
| |
|   /* interface to the lexer */ |
|   private int yylex () { |
|     int yyl_return = -1; |
|     try { |
|       yyl_return = lexer.yylex(); |
|     } |
|     catch (IOException e) { |
|       System.err.println("IO error :"+e); |
|     } |
|     return yyl_return; |
|   } |
| |
|   /* error reporting */ |
|   public void yyerror (String error) { |
|     System.err.println ("Error: " + error); |
|   } |
| |
|   /* lexer is created in the constructor */ |
|   public parser(Reader r) { |
|     lexer = new Yylex(r, this); |
|   } |
| |
|   /* that's how you use the parser */ |
|   public static void main(String args[]) throws IOException { |
|     parser yyparser = new parser(new FileReader(args[0])); |
|     yyparser.yyparse(); |
|   } |
| </PRE> |
| |
| <P> |
| Here, the customized part is mostly in the user code section. |
| We create the lexer in the constructor of the parser and store |
| a reference to it for later use in the parser's <TT>int yylex()</TT> |
| method. This <TT>yylex</TT> in the parser only calls <TT>int yylex()</TT> |
| of the generated lexer and passes the result on. If the lexer throws an |
| <TT>IOException</TT>, it prints a message and returns -1 to indicate an error. |
| |
| <P> |
| Runnable versions of the specifications above |
| are located in the <TT>examples/byaccj</TT> directory of the JFlex |
| distribution. |
| |
| <P> |
| |
| <H1><A NAME="SECTION000100000000000000000"></A><A NAME="Bugs"></A><BR> |
| Bugs and Deficiencies |
| </H1> |
| |
| <P> |
| |
| <H2><A NAME="SECTION000101000000000000000"> |
| Deficiencies</A> |
| </H2> |
| The trailing context algorithm described in [<A |
| HREF="manual.html#Aho">1</A>] and used in |
| JFlex is incorrect. It does not work when a postfix of the regular |
| expression matches a prefix of the trailing context and the |
| text matched by the expression does not have a fixed length. |
| JFlex will report these cases as errors at generation time. |
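| |
| <P> |
| The overlap condition can be explored by hand with a small brute-force search. The Java |
| sketch below is not the check that JFlex performs. It is a hypothetical helper that uses |
| <TT>java.util.regex</TT> syntax instead of JFlex syntax and only tries strings up to a given |
| length over a given alphabet, so a negative answer is a hint, not a proof. It also ignores |
| whether the expression matches text of fixed length. |
| <PRE> |
| import java.util.regex.Pattern; |
| |
| /* Hypothetical bounded search, not the check that JFlex performs. */ |
| class TrailingContextCheck { |
|   /* appends every string over alphabet with at most 'remaining' more characters */ |
|   static void collect(String prefix, String alphabet, int remaining, StringBuilder out) { |
|     out.append(prefix).append('\n'); |
|     if (remaining == 0) { |
|       return; |
|     } |
|     for (char c : alphabet.toCharArray()) { |
|       collect(prefix + c, alphabet, remaining - 1, out); |
|     } |
|   } |
| |
|   /* true if a non-empty postfix of a match of r1 is a prefix of a match of r2 */ |
|   static boolean overlaps(String r1, String r2, String alphabet, int maxLen) { |
|     StringBuilder out = new StringBuilder(); |
|     collect("", alphabet, maxLen, out); |
|     /* the alphabet is assumed not to contain the newline used as separator */ |
|     String[] candidates = out.toString().split("\n"); |
|     Pattern p1 = Pattern.compile(r1); |
|     Pattern p2 = Pattern.compile(r2); |
|     for (String s : candidates) { |
|       if (!p1.matcher(s).matches()) { |
|         continue; |
|       } |
|       for (String t : candidates) { |
|         if (!p2.matcher(t).matches()) { |
|           continue; |
|         } |
|         /* s.substring(i) runs through all non-empty postfixes of s */ |
|         for (int i = 0; i < s.length(); i++) { |
|           if (t.startsWith(s.substring(i))) { |
|             return true; |
|           } |
|         } |
|       } |
|     } |
|     return false; |
|   } |
| |
|   public static void main(String[] args) { |
|     /* a+ followed by ab: the postfix a of a match of a+ also starts ab */ |
|     System.out.println(overlaps("a+", "ab", "ab", 4)); |
|     /* a+ followed by b: no postfix of a match of a+ starts with b */ |
|     System.out.println(overlaps("a+", "b", "ab", 4)); |
|   } |
| } |
| </PRE> |
| With these inputs the first call reports an overlap and the second does not. |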
| |
| <P> |
| |
| <H2><A NAME="SECTION000102000000000000000"> |
| Bugs</A> |
| </H2> |
| |
| <P> |
| As of April 12, 2004 the following bugs are known in JFlex: |
| |
| <UL> |
| <LI>The check whether a lookahead expression is legal fails on some expressions. |
| The lookahead algorithm itself works as advertised, but JFlex does not |
| report at generation time every lookahead expression that the algorithm cannot handle. |
| Some cases are caught by the check, but not all. |
| |
| <P> |
| <B>Workaround:</B> Check lookahead expressions manually. A lookahead expression |
| <TT>r1/r2</TT> is OK if no postfix of <TT>r1</TT> can match a prefix of <TT>r2</TT>. |
| </LI> |
| </UL> |
| |
| <P> |
| If you find new ones, please use the bugs section of the |
| <A NAME="tex2html11" |
| HREF="http://www.jflex.de/">JFlex website</A> |
| to report them. |
| |
| <P> |
| |
| <H1><A NAME="SECTION000110000000000000000"></A><A NAME="Copyright"></A><BR> |
| Copying and License |
| </H1> |
| JFlex is free software, published under the terms of the |
| <A NAME="tex2html12" |
| HREF="http://www.fsf.org/copyleft/gpl.html">GNU General Public License</A>. |
| |
| <P> |
| There is absolutely NO WARRANTY for JFlex, its code and its documentation. |
| |
| <P> |
| The code generated by JFlex inherits the copyright of the specification it |
| was produced from. If it was your specification, you may use the generated |
| code without restriction. |
| |
| <P> |
| See the file <A NAME="tex2html13" |
| HREF="COPYRIGHT"><TT>COPYRIGHT</TT></A> |
| for more information. |
| |
| <P> |
| |
| <H2><A NAME="SECTION000120000000000000000"></A><A NAME="References"></A><BR> |
| Bibliography |
| </H2><DL COMPACT><DD> |
| |
| <P> |
| <P></P><DT><A NAME="Aho">1</A> |
| <DD> |
| A. Aho, R. Sethi, J. Ullman, <EM>Compilers: Principles, Techniques, and Tools</EM>, 1986 |
| |
| <P> |
| <P></P><DT><A NAME="Appel">2</A> |
| <DD> |
| A. W. Appel, <EM>Modern Compiler Implementation in Java: basic techniques</EM>, 1997 |
| |
| <P> |
| <P></P><DT><A NAME="JLex">3</A> |
| <DD> |
| E. Berk, <EM>JLex: A lexical analyser generator for Java</EM>, |
| <BR> <A NAME="tex2html14" |
| HREF="http://www.cs.princeton.edu/~appel/modern/java/JLex/"><TT>http://www.cs.princeton.edu/~appel/modern/java/JLex/</TT></A> |
| <P> |
| <P></P><DT><A NAME="fast">4</A> |
| <DD> |
| K. Brouwer, W. Gellerich, E. Ploedereder, |
| <EM>Myths and Facts about the Efficient Implementation of Finite Automata and Lexical Analysis</EM>, |
| in: Proceedings of the 7th International Conference on Compiler Construction (CC '98), 1998 |
| |
| <P> |
| <P></P><DT><A NAME="unicode_rep">5</A> |
| <DD> |
| M. Davis, <EM>Unicode Regular Expression Guidelines</EM>, Unicode Technical Report #18, 2000 |
| <BR> <A NAME="tex2html15" |
| HREF="http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html"><TT>http://www.unicode.org/unicode/reports/tr18/tr18-5.1.html</TT></A> |
| <P> |
| <P></P><DT><A NAME="ParseTable">6</A> |
| <DD> |
| P. Dencker, K. Dürre, J. Henft, <EM>Optimization of Parser Tables for portable Compilers</EM>, |
| in: ACM Transactions on Programming Languages and Systems 6(4), 1984 |
| |
| <P> |
| <P></P><DT><A NAME="LangSpec">7</A> |
| <DD> |
| J. Gosling, B. Joy, G. Steele, <EM>The Java Language Specification</EM>, 1996, |
| <BR> <A NAME="tex2html16" |
| HREF="http://www.javasoft.com/docs/books/jls/"><TT>http://www.javasoft.com/docs/books/jls/</TT></A> |
| <P> |
| <P></P><DT><A NAME="CUP">8</A> |
| <DD> |
| S. E. Hudson, <EM>CUP LALR Parser Generator for Java</EM>, |
| <BR> <A NAME="tex2html17" |
| HREF="http://www.cs.princeton.edu/~appel/modern/java/CUP/"><TT>http://www.cs.princeton.edu/~appel/modern/java/CUP/</TT></A> |
| <P> |
| <P></P><DT><A NAME="BYaccJ">9</A> |
| <DD> |
| B. Jamison, <EM>BYacc/J</EM>, |
| <BR> <A NAME="tex2html18" |
| HREF="http://troi.lincom-asg.com/~rjamison/byacc/"><TT>http://troi.lincom-asg.com/~rjamison/byacc/</TT></A> |
| <P> |
| <P></P><DT><A NAME="MachineSpec">10</A> |
| <DD> |
| T. Lindholm, F. Yellin, <EM>The Java Virtual Machine Specification</EM>, 1996, |
| <BR> <A NAME="tex2html19" |
| HREF="http://www.javasoft.com/docs/books/vmspec/"><TT>http://www.javasoft.com/docs/books/vmspec/</TT></A> |
| <P> |
| <P></P><DT><A NAME="flex">11</A> |
| <DD> |
| V. Paxson, <EM>flex - The fast lexical analyzer generator</EM>, 1995 |
| |
| <P> |
| <P></P><DT><A NAME="SparseTable">12</A> |
| <DD> |
| R. E. Tarjan, A. Yao, <EM>Storing a Sparse Table</EM>, in: Communications of the ACM 22(11), 1979 |
| |
| <P> |
| <P></P><DT><A NAME="Maurer">13</A> |
| <DD> |
| R. Wilhelm, D. Maurer, <EM>Übersetzerbau</EM>, Berlin 1997<SUP>2</SUP> |
| |
| <P> |
| </DL> |
| |
| <P> |
| <BR><HR><H4>Footnotes</H4> |
| <DL> |
| <DT><A NAME="foot32">... Java</A><A NAME="foot32" |
| HREF="manual.html#tex2html2"><SUP><IMG ALIGN="BOTTOM" BORDER="1" ALT="[*]" SRC="footnote.png"></SUP></A> |
| <DD>Java is a trademark of |
| Sun Microsystems, Inc., and refers to Sun's Java programming language. |
| JFlex is not sponsored by or affiliated with Sun Microsystems, Inc. |
| |
| </DL><BR><HR> |
| <ADDRESS> |
| Mon Apr 12 20:58:12 EST 2004, <a href="http://www.doclsf.de">Gerwin Klein</a> |
| </ADDRESS> |
| </BODY> |
| </HTML> |