| * Bison 3.6 |
| ** doc |
| I feel its ugly to use the GNU style to declare functions in the doc. It |
| generates tons of white space in the page, and may contribute to bad page |
| breaks. |
| |
| Also, we seem to teach YYPRINT very early on, although it should be |
| considered deprecated: %printer is superior. |
| |
| ** improve syntax errors (UTF-8, internationalization) |
| Bison depends on the current locale. For instance: |
| |
| %define parse.error verbose |
| %code top { |
| #include <stdio.h> |
| #include <stdlib.h> |
| void yyerror(const char* msg) { fprintf(stderr, "%s\n", msg); } |
| int yylex() { return 0; } |
| } |
| %% |
| exp: "↦" | "🎅🐃" | '\n' |
| %% |
| int main() { return yyparse(); } |
| |
| gives different results with/without LC_ALL=C. |
| |
| $ LC_ALL=C /opt/local/bin/bison /tmp/mangle.y -o ascii.c |
| $ /opt/local/bin/bison /tmp/mangle.y -o utf8.c |
| $ diff -u ascii.c utf8.c -I#line |
| --- ascii.c 2019-01-12 08:15:35.878010093 +0100 |
| +++ utf8.c 2019-01-12 08:15:38.856495929 +0100 |
| @@ -415,9 +415,8 @@ |
| First, the terminals, then, starting at YYNTOKENS, nonterminals. */ |
| static const char *const yytname[] = |
| { |
| - "$end", "error", "$undefined", "\"\\342\\206\\246\"", |
| - "\"\\360\\237\\216\\205\\360\\237\\220\\203\"", "'\\n'", "$accept", |
| - "exp", YY_NULLPTR |
| + "$end", "error", "$undefined", "\"↦\"", "\"🎅🐃\"", "'\\n'", |
| + "$accept", "exp", YY_NULLPTR |
| }; |
| #endif |
| |
| $ gcc ascii.c -o ascii && ./ascii |
| syntax error, unexpected $end, expecting "\342\206\246" or "\360\237\216\205\360\237\220\203" or '\n' |
| $ gcc utf8.c -o utf8 && ./utf8 |
| syntax error, unexpected $end, expecting ↦ or 🎅🐃 or '\n' |
| |
| |
| While at it, we should stop using "$end" by default, in favor of "end of |
| file", or "end of input", whatever. See how lalr1.java does that. |
| |
| ** consistency |
| token vs terminal, variable vs non terminal. |
| |
| ** Stop indentation in diagnostics |
| Before Bison 2.7, we printed "flatly" the dependencies in long diagnostics: |
| |
| input.y:2.7-12: %type redeclaration for exp |
| input.y:1.7-12: previous declaration |
| |
| In Bison 2.7, we indented them |
| |
| input.y:2.7-12: error: %type redeclaration for exp |
| input.y:1.7-12: previous declaration |
| |
| Later we quoted the source in the diagnostics, and today we have: |
| |
| /tmp/foo.y:1.12-14: warning: symbol FOO redeclared [-Wother] |
| 1 | %token FOO FOO |
| | ^~~ |
| /tmp/foo.y:1.8-10: previous declaration |
| 1 | %token FOO FOO |
| | ^~~ |
| |
| The indentation is no longer helping. We should probably get rid of it, or |
| maybe keep it only when -fno-caret. GCC displays this as a "note": |
| |
| $ g++-mp-9 -Wall /tmp/foo.c -c |
| /tmp/foo.c:1:10: error: redefinition of 'int foo' |
| 1 | int foo, foo; |
| | ^~~ |
| /tmp/foo.c:1:5: note: 'int foo' previously declared here |
| 1 | int foo, foo; |
| | ^~~ |
| |
| Likewise for Clang, contrary to what I believed (because "note:" is written |
| in black, so it doesn't show in my terminal :-) |
| |
| $ clang++-mp-8.0 -Wall /tmp/foo.c -c |
| clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated] |
| /tmp/foo.c:1:10: error: redefinition of 'foo' |
| int foo, foo; |
| ^ |
| /tmp/foo.c:1:5: note: previous definition is here |
| int foo, foo; |
| ^ |
| 1 error generated. |
| |
| See also the item "Complaint submessage indentation" below. |
| |
| ** api.token.raw |
| Maybe we should exhibit the YYUNDEFTOK token. It could also be assigned a |
| semantic value so that yyerror could be used to report invalid lexemes. |
| See also the item "$undefined" below. |
| |
| ** C++ |
| Move to int everywhere instead of unsigned? stack_size, etc. The parser |
| itself uses int (for yylen for instance), yet stack is based on size_t. |
| |
| Maybe locations should also move to ints. |
| |
| Paul Eggert already covered most of this. But before publishing these |
| changes, we need to ask our C++ users if they agree with that change, or if |
| we need some migration path. Could be a %define variable, or simply |
| %require "3.5". |
| |
| * Bison 3.7 |
| ** Unit rules / Injection rules (Akim Demaille) |
| Maybe we could expand unit rules (or "injections", see |
| https://homepages.cwi.nl/~daybuild/daily-books/syntax/2-sdf/sdf.html), i.e., |
| transform |
| |
| exp: arith | bool; |
| arith: exp '+' exp; |
| bool: exp '&' exp; |
| |
| into |
| |
| exp: exp '+' exp | exp '&' exp; |
| |
| when there are no actions. This can significantly speed up some grammars. |
| I can't find the papers. In particular the book 'LR parsing: Theory and |
| Practice' is impossible to find, but according to 'Parsing Techniques: a |
| Practical Guide', it includes information about this issue. Does anybody |
| have it? |
| |
| ** clean up (Akim Demaille) |
| Do not work on these items now, as I (Akim) have branches with a lot of |
| changes in this area (hitting several files), and no desire to have to fix |
| conflicts. Addressing these items will happen after my branches have been |
| merged. |
| |
| *** lalr.c |
| Introduce a goto struct, and use it in place of from_state/to_state. |
| Rename states1 as path, length as pathlen. |
| Introduce inline functions for things such as nullable[*rp - ntokens] |
| where we need to map from symbol number to nterm number. |
| |
| There are probably a significant part of the relations management that |
| should be migrated on top of a bitsetv. |
| |
| *** closure |
| It should probably take a "state*" instead of two arguments. |
| |
| *** traces |
| The "automaton" and "set" categories are not so useful. We should probably |
| introduce lr(0) and lalr, just the way we have ielr categories. The |
| "closure" function is too verbose, it should probably have its own category. |
| |
| "set" can still be used for summariring the important sets. That would make |
| tests easy to maintain. |
| |
| *** complain.* |
| Rename these guys as "diagnostics.*" (or "diagnose.*"), since that's the |
| name they have in gcc, clang, etc. Likewise for the complain_* series of |
| functions. |
| |
| *** ritem |
| states/nstates, rules/nrules, ..., ritem/nritems |
| Fix the latter. |
| |
| * Modernization |
| Fix data/skeletons/yacc.c so that it defines YYPTRDIFF_T properly for modern |
| and older C++ compilers. Currently the code defaults to defining it to |
| 'long' for non-GCC compilers, but it should use the proper C++ magic to |
| define it to the same type as the C ptrdiff_t type. |
| |
| * Completion |
| Several features are not available in all the backends. |
| |
| - push parsers: glr.cc, lalr1.cc |
| - ielr: C++ and Java |
| - glr: Java |
| - token constructors: Java and C |
| |
| * Bugs |
| ** Autotest has quotation issues |
| tests/input.at:1730:AT_SETUP([%define errors]) |
| |
| -> |
| |
| $ ./tests/testsuite -l | grep errors | sed q |
| 38: input.at:1730 errors |
| |
| * Short term |
| ** Get rid of YYPRINT and b4_toknum |
| Besides yytoknum is wrong when api.token.raw is defined. |
| |
| ** Better design for diagnostics |
| The current implementation of diagnostics is adhoc, it grew organically. It |
| works as a series of calls to several functions, with dependency of the |
| latter calls on the former. For instance: |
| |
| complain (&sym->location, |
| sym->content->status == needed ? complaint : Wother, |
| _("symbol %s is used, but is not defined as a token" |
| " and has no rules; did you mean %s?"), |
| quote_n (0, sym->tag), |
| quote_n (1, best->tag)); |
| if (feature_flag & feature_caret) |
| location_caret_suggestion (sym->location, best->tag, stderr); |
| |
| We should rewrite this in a more FP way: |
| |
| 1. build a rich structure that denotes the (complete) diagnostic. |
| "Complete" in the sense that it also contains the suggestions, the list |
| of possible matches, etc. |
| |
| 2. send this to the pretty-printing routine. The diagnostic structure |
| should be sufficient so that we can generate all the 'format' of |
| diagnostics, including the fixits. |
| |
| If properly done, this diagnostic module can be detached from Bison and be |
| put in gnulib. It could be used, for instance, for errors caught by |
| xgettext. |
| |
| There's certainly already something alike in GCC. At least that's the |
| impression I get from reading the "-fdiagnostics-format=FORMAT" part of this |
| page: |
| |
| https://gcc.gnu.org/onlinedocs/gcc/Diagnostic-Message-Formatting-Options.html |
| |
| ** Graphviz display code thoughts |
| The code for the --graph option is over two files: print_graph, and |
| graphviz. This is because Bison used to also produce VCG graphs, but since |
| this is no longer true, maybe we could consider these files for fusion. |
| |
| An other consideration worth noting is that print_graph.c (correct me if I |
| am wrong) should contain generic functions, whereas graphviz.c and other |
| potential files should contain just the specific code for that output |
| format. It will probably prove difficult to tell if the implementation is |
| actually generic whilst only having support for a single format, but it |
| would be nice to keep stuff a bit tidier: right now, the construction of the |
| bitset used to show reductions is in the graphviz-specific code, and on the |
| opposite side we have some use of \l, which is graphviz-specific, in what |
| should be generic code. |
| |
| Little effort seems to have been given to factoring these files and their |
| rint{,-xml} counterpart. We would very much like to re-use the pretty format |
| of states from .output for the graphs, etc. |
| |
| Since graphviz dies on medium-to-big grammars, maybe consider an other tool? |
| |
| ** push-parser |
| Check it too when checking the different kinds of parsers. And be |
| sure to check that the initial-action is performed once per parsing. |
| |
| ** m4 names |
| b4_shared_declarations is no longer what it is. Make it |
| b4_parser_declaration for instance. |
| |
| ** yychar in lalr1.cc |
| There is a large difference bw maint and master on the handling of |
| yychar (which was removed in lalr1.cc). See what needs to be |
| back-ported. |
| |
| |
| /* User semantic actions sometimes alter yychar, and that requires |
| that yytoken be updated with the new translation. We take the |
| approach of translating immediately before every use of yytoken. |
| One alternative is translating here after every semantic action, |
| but that translation would be missed if the semantic action |
| invokes YYABORT, YYACCEPT, or YYERROR immediately after altering |
| yychar. In the case of YYABORT or YYACCEPT, an incorrect |
| destructor might then be invoked immediately. In the case of |
| YYERROR, subsequent parser actions might lead to an incorrect |
| destructor call or verbose syntax error message before the |
| lookahead is translated. */ |
| |
| /* Make sure we have latest lookahead translation. See comments at |
| user semantic actions for why this is necessary. */ |
| yytoken = yytranslate_ (yychar); |
| |
| |
| ** Get rid of fake #lines [Bison: ...] |
| Possibly as simple as checking whether the column number is nonnegative. |
| |
| I have seen messages like the following from GCC. |
| |
| <built-in>:0: fatal error: opening dependency file .deps/libltdl/argz.Tpo: No such file or directory |
| |
| |
| ** Discuss about %printer/%destroy in the case of C++. |
| It would be very nice to provide the symbol classes with an operator<< |
| and a destructor. Unfortunately the syntax we have chosen for |
| %destroy and %printer make them hard to reuse. For instance, the user |
| is invited to write something like |
| |
| %printer { debug_stream() << $$; } <my_type>; |
| |
| which is hard to reuse elsewhere since it wants to use |
| "debug_stream()" to find the stream to use. The same applies to |
| %destroy: we told the user she could use the members of the Parser |
| class in the printers/destructors, which is not good for an operator<< |
| since it is no longer bound to a particular parser, it's just a |
| (standalone symbol). |
| |
| * Various |
| ** Rewrite glr.cc in C++ (Valentin Tolmer) |
| As a matter of fact, it would be very interesting to see how much we can |
| share between lalr1.cc and glr.cc. Most of the skeletons should be common. |
| It would be a very nice source of inspiration for the other languages. |
| |
| Valentin Tolmer is working on this. |
| |
| ** YYERRCODE |
| Defined to 256, but not used, not documented. Probably the token |
| number for the error token, which POSIX wants to be 256, but which |
| Bison might renumber if the user used number 256. Keep fix and doc? |
| Throw away? |
| |
| Also, why don't we output the token name of the error token in the |
| output? It is explicitly skipped: |
| |
| /* Skip error token and tokens without identifier. */ |
| if (sym != errtoken && id) |
| |
| Of course there are issues with name spaces, but if we disable we have |
| something which seems to be more simpler and more consistent instead |
| of the special case YYERRCODE. |
| |
| enum yytokentype { |
| error = 256, |
| // ... |
| }; |
| |
| |
| We could (should?) also treat the case of the undef_token, which is |
| numbered 257 for yylex, and 2 internal. Both appear for instance in |
| toknum: |
| |
| const unsigned short |
| parser::yytoken_number_[] = |
| { |
| 0, 256, 257, 258, 259, 260, 261, 262, 263, 264, |
| |
| while here |
| |
| enum yytokentype { |
| TOK_EOF = 0, |
| TOK_EQ = 258, |
| |
| so both 256 and 257 are "mysterious". |
| |
| const char* |
| const parser::yytname_[] = |
| { |
| "\"end of command\"", "error", "$undefined", "\"=\"", "\"break\"", |
| |
| |
| ** yychar == yyempty_ |
| The code in yyerrlab reads: |
| |
| if (yychar <= YYEOF) |
| { |
| /* Return failure if at end of input. */ |
| if (yychar == YYEOF) |
| YYABORT; |
| } |
| |
| There are only two yychar that can be <= YYEOF: YYEMPTY and YYEOF. |
| But I can't produce the situation where yychar is YYEMPTY here, is it |
| really possible? The test suite does not exercise this case. |
| |
| This shows that it would be interesting to manage to install skeleton |
| coverage analysis to the test suite. |
| |
| * From lalr1.cc to yacc.c |
| ** Single stack |
| Merging the three stacks in lalr1.cc simplified the code, prompted for |
| other improvements and also made it faster (probably because memory |
| management is performed once instead of three times). I suggest that |
| we do the same in yacc.c. |
| |
| (Some time later): it's also very nice to have three stacks: it's more dense |
| as we don't lose bits to padding. For instance the typical stack for states |
| will use 8 bits, while it is likely to consume 32 bits in a struct. |
| |
| We need trustworthy benchmarks for Bison, for all our backends. Akim has a |
| few things scattered around; we need to put them in the repo, and make them |
| more useful. |
| |
| ** yysyntax_error |
| The code bw glr.c and yacc.c is really alike, we can certainly factor |
| some parts. |
| |
| This should be worked on when we also address the expected improvements for |
| error generation (e.g., i18n). |
| |
| |
| * Report |
| |
| ** Figures |
| Some statistics about the grammar and the parser would be useful, |
| especially when asking the user to send some information about the |
| grammars she is working on. We should probably also include some |
| information about the variables (I'm not sure for instance we even |
| specify what LR variant was used). |
| |
| ** GLR |
| How would Paul like to display the conflicted actions? In particular, |
| what when two reductions are possible on a given lookahead token, but one is |
| part of $default. Should we make the two reductions explicit, or just |
| keep $default? See the following point. |
| |
| ** Disabled Reductions |
| See 'tests/conflicts.at (Defaulted Conflicted Reduction)', and decide |
| what we want to do. |
| |
| ** Documentation |
| Extend with error productions. The hard part will probably be finding |
| the right rule so that a single state does not exhibit too many yet |
| undocumented ''features''. Maybe an empty action ought to be |
| presented too. Shall we try to make a single grammar with all these |
| features, or should we have several very small grammars? |
| |
| ** --report=conflict-path |
| Provide better assistance for understanding the conflicts by providing |
| a sample text exhibiting the (LALR) ambiguity. See the paper from |
| DeRemer and Penello: they already provide the algorithm. |
| |
| ** Statically check for potential ambiguities in GLR grammars |
| See <http://www.lsv.fr/~schmitz/pub/expamb.pdf> for an approach. |
| An Experimental Ambiguity Detection Tool ∗ Sylvain Schmitz |
| LORIA, INRIA Nancy - Grand Est, Nancy, France |
| |
| * Extensions |
| ** Multiple start symbols |
| Would be very useful when parsing closely related languages. The idea is to |
| declare several start symbols, for instance |
| |
| %start stmt expr |
| %% |
| stmt: ... |
| expr: ... |
| |
| and to generate parse(), parse_stmt() and parse_expr(). Technically, the |
| above grammar would be transformed into |
| |
| %start yy_start |
| %token YY_START_STMT YY_START_EXPR |
| %% |
| yy_start: YY_START_STMT stmt | YY_START_EXPR expr |
| |
| so that there are no new conflicts in the grammar (as would undoubtedly |
| happen with yy_start: stmt | expr). Then adjust the skeletons so that this |
| initial token (YY_START_STMT, YY_START_EXPR) be shifted first in the |
| corresponding parse function. |
| |
| ** Better error messages |
| The users are not provided with enough tools to forge their error messages. |
| See for instance "Is there an option to change the message produced by |
| YYERROR_VERBOSE?" by Simon Sobisch, on bison-help. |
| |
| See also |
| https://www.cs.tufts.edu/~nr/cs257/archive/clinton-jefferey/lr-error-messages.pdf |
| and https://research.swtch.com/yyerror. |
| |
| ** %include |
| This is a popular demand. We already made many changes in the parser that |
| should make this reasonably easy to implement. |
| |
| Bruce Mardle <marblypup@yahoo.co.uk> |
| https://lists.gnu.org/archive/html/bison-patches/2015-09/msg00000.html |
| |
| However, there are many other things to do before having such a feature, |
| because I don't want a % equivalent to #include (which we all learned to |
| hate). I want something that builds "modules" of grammars, and assembles |
| them together, paying attention to keep separate bits separated, in pseudo |
| name spaces. |
| |
| ** Push parsers |
| There is demand for push parsers in Java and C++. And GLR I guess. |
| |
| ** Generate code instead of tables |
| This is certainly quite a lot of work. See |
| http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.4539. |
| |
| ** $-1 |
| We should find a means to provide an access to values deep in the |
| stack. For instance, instead of |
| |
| baz: qux { $$ = $<foo>-1 + $<bar>0 + $1; } |
| |
| we should be able to have: |
| |
| foo($foo) bar($bar) baz($bar): qux($qux) { $baz = $foo + $bar + $qux; } |
| |
| Or something like this. |
| |
| ** %if and the like |
| It should be possible to have %if/%else/%endif. The implementation is |
| not clear: should it be lexical or syntactic. Vadim Maslow thinks it |
| must be in the scanner: we must not parse what is in a switched off |
| part of %if. Akim Demaille thinks it should be in the parser, so as |
| to avoid falling into another CPP mistake. |
| |
| (Later): I'm sure there's actually good case for this. People who need that |
| feature can use m4/cpp on top of Bison. I don't think it is worth the |
| trouble in Bison itself. |
| |
| ** XML Output |
| There are couple of available extensions of Bison targeting some XML |
| output. Some day we should consider including them. One issue is |
| that they seem to be quite orthogonal to the parsing technique, and |
| seem to depend mostly on the possibility to have some code triggered |
| for each reduction. As a matter of fact, such hooks could also be |
| used to generate the yydebug traces. Some generic scheme probably |
| exists in there. |
| |
| XML output for GNU Bison and gcc |
| http://www.cs.may.ie/~jpower/Research/bisonXML/ |
| |
| XML output for GNU Bison |
| http://yaxx.sourceforge.net/ |
| |
| ** Counterexample generation |
| https://lists.gnu.org/archive/html/bug-bison/2016-06/msg00000.html |
| http://www.cs.cornell.edu/andru/papers/cupex/ |
| |
| Andrew Myers and Vincent Imbimbo are working on this item, see |
| https://github.com/akimd/bison/issues/12 |
| |
| * Coding system independence |
| Paul notes: |
| |
| Currently Bison assumes 8-bit bytes (i.e. that UCHAR_MAX is |
| 255). It also assumes that the 8-bit character encoding is |
| the same for the invocation of 'bison' as it is for the |
| invocation of 'cc', but this is not necessarily true when |
| people run bison on an ASCII host and then use cc on an EBCDIC |
| host. I don't think these topics are worth our time |
| addressing (unless we find a gung-ho volunteer for EBCDIC or |
| PDP-10 ports :-) but they should probably be documented |
| somewhere. |
| |
| More importantly, Bison does not currently allow NUL bytes in |
| tokens, either via escapes (e.g., "x\0y") or via a NUL byte in |
| the source code. This should get fixed. |
| |
| * Broken options? |
| ** %token-table |
| ** Skeleton strategy |
| Must we keep %token-table? |
| |
| * Precedence |
| |
| ** Partial order |
| It is unfortunate that there is a total order for precedence. It |
| makes it impossible to have modular precedence information. We should |
| move to partial orders (sounds like series/parallel orders to me). |
| |
| This is a prerequisite for modules. |
| |
| * $undefined |
| From Hans: |
| - If the Bison generated parser experiences an undefined number in the |
| character range, that character is written out in diagnostic messages, an |
| addition to the $undefined value. |
| |
| Suggest: Change the name $undefined to undefined; looks better in outputs. |
| |
| |
| * Pre and post actions. |
| From: Florian Krohm <florian@edamail.fishkill.ibm.com> |
| Subject: YYACT_EPILOGUE |
| To: bug-bison@gnu.org |
| X-Sent: 1 week, 4 days, 14 hours, 38 minutes, 11 seconds ago |
| |
| The other day I had the need for explicitly building the parse tree. I |
| used %locations for that and defined YYLLOC_DEFAULT to call a function |
| that returns the tree node for the production. Easy. But I also needed |
| to assign the S-attribute to the tree node. That cannot be done in |
| YYLLOC_DEFAULT, because it is invoked before the action is executed. |
| The way I solved this was to define a macro YYACT_EPILOGUE that would |
| be invoked after the action. For reasons of symmetry I also added |
| YYACT_PROLOGUE. Although I had no use for that I can envision how it |
| might come in handy for debugging purposes. |
| All is needed is to add |
| |
| #if YYLSP_NEEDED |
| YYACT_EPILOGUE (yyval, (yyvsp - yylen), yylen, yyloc, (yylsp - yylen)); |
| #else |
| YYACT_EPILOGUE (yyval, (yyvsp - yylen), yylen); |
| #endif |
| |
| at the proper place to bison.simple. Ditto for YYACT_PROLOGUE. |
| |
| I was wondering what you think about adding YYACT_PROLOGUE/EPILOGUE |
| to bison. If you're interested, I'll work on a patch. |
| |
| * Better graphics |
| Equip the parser with a means to create the (visual) parse tree. |
| |
| |
| Local Variables: |
| mode: outline |
| coding: utf-8 |
| fill-column: 76 |
| End: |
| |
| ----- |
| |
| Copyright (C) 2001-2004, 2006, 2008-2015, 2018-2019 Free Software |
| Foundation, Inc. |
| |
| This file is part of Bison, the GNU Compiler Compiler. |
| |
| This program is free software: you can redistribute it and/or modify |
| it under the terms of the GNU General Public License as published by |
| the Free Software Foundation, either version 3 of the License, or |
| (at your option) any later version. |
| |
| This program is distributed in the hope that it will be useful, |
| but WITHOUT ANY WARRANTY; without even the implied warranty of |
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| GNU General Public License for more details. |
| |
| You should have received a copy of the GNU General Public License |
| along with this program. If not, see <http://www.gnu.org/licenses/>. |