| TagSoup - Just Keep On Truckin' |
| |
| Introduction |
| |
| This is the home page of TagSoup, a SAX-compliant parser written in |
| Java that, instead of parsing well-formed or valid XML, parses HTML as |
| it is found in the wild: [1]poor, nasty and brutish, though quite often |
| far from short. TagSoup is designed for people who have to process this |
| stuff using some semblance of a rational application design. By |
| providing a SAX interface, it allows standard XML tools to be applied |
| to even the worst HTML. TagSoup also includes a command-line processor |
| that reads HTML files and can generate either clean HTML or well-formed |
| XML that is a close approximation to XHTML. |
| |
| This is also the README file packaged with TagSoup. |
| |
| TagSoup is free and Open Source software. As of version 1.2, it is |
| licensed under the [2]Apache License, Version 2.0, which allows |
| proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later |
| projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only |
| project, feel free to ask.) |
| |
| Warning: TagSoup will not build on stock Java 5.x or 6.x! |
| |
| Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x, |
| TagSoup will not build out of the box. You need to retrieve [3]Saxon |
| 6.5.5, which does not have the bug. Unpack the zipfile in an empty |
| directory and copy the saxon.jar and saxon-xml-apis.jar files to |
| $ANT_HOME/lib. The Ant build process for TagSoup will then notice that |
| Saxon is available and use it instead. |
| |
| TagSoup 1.2 released |
| |
| There are a great many changes, most of them fixes for long-standing |
| bugs, in this release. Only the most important are listed here; for the |
| rest, see the CHANGES file in the source distribution. Very special |
| thanks to Jojo Dijamco, whose intensive efforts at debugging made this |
| release a usable upgrade rather than a useless mass of undetected bugs. |
| * As noted above, I have changed the license to Apache 2.0. |
| * The default content model for bogons (unknown elements) is now ANY |
| rather than EMPTY. This is a breaking change, which I have made |
| only because there was so much demand for it. It can be undone on |
| the command line with the --emptybogons switch, or programmatically |
| with parser.setFeature(Parser.emptyBogonsFeature, true). |
| * The processing of entity references in attribute values has finally |
| been fixed to do what browsers do. That is, a reference is only |
| recognized if it is properly terminated by a semicolon; otherwise |
| it is treated as plain text. This means that URIs like |
| foo?cdown=32&cup=42 are no longer seen as containing an instance of |
| the ∪ character (whose name happens to be cup). |
| * Several new switches have been added: |
| + --doctype-system and --doctype-public force a DOCTYPE |
| declaration to be output and allow setting the system and |
| public identifiers. |
| + --standalone and --version allow control of the XML |
| declaration that is output. (Note that TagSoup's XML output is |
| always version 1.0, even if you use --version=1.1.) |
| + --norootbogons causes unknown elements not to be allowed as |
| the document root element. Instead, they are made children of |
| the default root element (the html element for HTML). |
| * The TagSoup core now supports character entities with values above |
| U+FFFF. As a consequence, the HTML schema now supports all 2,210 |
| standard character entities from the [4]2007-12-14 draft of XML |
| Entity Definitions for Characters, except the 94 which require more |
| than one Unicode character to represent. |
| * The SAX events startPrefixMapping and endPrefixMapping are now |
| being reported for all cases of foreign elements and attributes. |
| * All bugs around newline processing on Windows should now be gone. |
| * A number of content models have been loosened to allow elements to |
| appear in new and non-standard (but commonly found) places. In |
| particular, tables are now allowed inside paragraphs, against the |
| letter of the W3C specification. |
| * Since the span element is intended for fine control of appearance |
| using CSS, it should never have been a restartable element. This |
| very long-standing bug has now been fixed. |
| * The following non-standard elements are now at least partly |
| supported: bgsound, blink, canvas, comment, listing, marquee, nobr, |
| rbc, rb, rp, rtc, rt, ruby, wbr, xmp. |
| * In HTML output mode, boolean attributes like checked are now output |
| as such, rather than in XML style as checked="checked". |
| * Runs of < characters such as << and <<< are now handled correctly |
| in text rather than being transformed into extremely bogus |
| start-tags. |
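| |
| The entity-reference rule described in the list above can be sketched |
| in a few lines of Java. This is only an illustration of the rule, not |
| TagSoup's own code: the class name, the method name and the tiny |
| entity table are all made up for the example, and numeric references |
| are not handled. |

```java
import java.util.HashMap;
import java.util.Map;

public class AttributeEntities {
    // Tiny illustrative table (an assumption for this sketch);
    // the real HTML schema knows far more entity names.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("cup", "\u222A"); // the union sign
    }

    // Expands a named reference only when it is terminated by a semicolon;
    // anything else is copied through as plain text.
    public static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                boolean terminated = j < value.length() && value.charAt(j) == ';';
                String replacement = terminated ? ENTITIES.get(value.substring(i + 1, j)) : null;
                if (replacement != null) {
                    out.append(replacement);
                    i = j + 1; // skip past the semicolon
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

| With this rule, expand("foo?cdown=32&cup=42") returns its input |
| unchanged, while expand("a &cup; b") replaces the reference. |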
| |
| [5]Download the TagSoup 1.2 jar file here. It's about 87K long. |
| [6]Download the full TagSoup 1.2 source here. If you don't have zip, |
| you can use jar to unpack it. |
| [7]Download the current CHANGES file here. |
| |
| TagSoup 1.1 released |
| |
| TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use |
| TagSoup within the JAXP framework (which is not something I necessarily |
| recommend, but it is part of the Java XML platform), you can create a |
| SAXParser by calling |
| org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also |
| set the system property javax.xml.parsers.SAXParserFactory to |
| org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing |
| this will cause all JAXP-based XML parsing to go through TagSoup, which |
| is a Bad Thing if your application also reads XML documents. |
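| |
| If the global system property is too blunt, standard JAXP (Java 6 and |
| later) can also be asked for one specific factory by class name, which |
| leaves every other parser in the application alone. The sketch below |
| uses only the standard javax.xml.parsers API plus the factory class |
| name given above; the wrapper class and method names are invented for |
| the example, and the TagSoup jar is assumed to be on the classpath at |
| run time. |

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class TagSoupJaxp {
    // Factory class name taken from the TagSoup 1.1 notes above.
    public static final String FACTORY = "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    // Requests the TagSoup factory explicitly instead of changing the
    // javax.xml.parsers.SAXParserFactory system property for the whole JVM.
    // Throws FactoryConfigurationError if the class cannot be loaded.
    public static SAXParser newTagSoupParser() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance(FACTORY, null);
        return factory.newSAXParser();
    }
}
```

| The returned SAXParser is then used like any other, for example |
| parser.parse(inputStream, handler) with a DefaultHandler. |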
| |
| What TagSoup does |
| |
| TagSoup is designed as a parser, not a whole application; it isn't |
| intended to permanently clean up bad HTML, as [8]HTML Tidy does, only |
| to parse it on the fly. Therefore, it does not convert presentation |
| HTML to CSS or anything similar. It does guarantee well-structured |
| results: tags will wind up properly nested, default attributes will |
| appear appropriately, and so on. |
| |
| The semantics of TagSoup are, as far as practical, those of actual HTML |
| browsers. In particular, never, never will it throw any sort of syntax |
| error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's |
| much, much more. For example, if the first tag is LI, it will supply |
| the application with enclosing HTML, BODY, and UL tags. Why UL? Because |
| that's what browsers assume in this situation. For the same reason, |
| overlapping tags are correctly restarted whenever possible. Text like: |
| This is <B>bold, <I>bold italic, </b>italic, </i>normal text |
| |
| gets correctly rewritten as: |
| This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text. |
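| |
| The restart behaviour shown above can be imitated with a stack of open |
| elements. The following is a toy sketch, not TagSoup's algorithm: it |
| treats every element as restartable, knows nothing about content |
| models or attributes, reopens interrupted elements immediately, and |
| silently drops stray < characters. All names in it are invented for |
| the example. |

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestartSketch {
    // Matches an attribute-free start-tag or end-tag, or a run of text.
    private static final Pattern TOKEN = Pattern.compile("<(/?)([A-Za-z]+)>|[^<]+");

    public static String repair(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element on top
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(2) == null) {            // plain text
                out.append(m.group());
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) {          // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
                continue;
            }
            if (!open.contains(name)) {          // end-tag with nothing to close
                continue;
            }
            // Close everything down to the named element, remembering
            // the elements that were interrupted on the way.
            List<String> interrupted = new ArrayList<>();
            while (true) {
                String top = open.pop();
                out.append("</").append(top).append('>');
                if (top.equals(name)) {
                    break;
                }
                interrupted.add(top);
            }
            // Restart the interrupted elements, outermost first.
            for (int i = interrupted.size() - 1; i >= 0; i--) {
                String again = interrupted.get(i);
                open.push(again);
                out.append('<').append(again).append('>');
            }
        }
        while (!open.isEmpty()) {                // close whatever is left
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

| Feeding it the overlapping example above yields the rewritten form |
| shown there (without the sentence-ending period). |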
| |
| By intention, TagSoup is small and fast. It does not depend on the |
| existence of any framework other than SAX, and should be able to work |
| with any framework that can accept SAX parsers. In particular, [10]XOM |
| is known to work. |
| |
| You can replace the low-level HTML scanner with one based on Sean |
| McGrath's [11]PYX format (very close to James Clark's ESIS format). You |
| can also supply an AutoDetector that peeks at the incoming byte stream |
| and guesses a character encoding for it. Otherwise, the platform |
| default is used. If you need an autodetector of character sets, |
| consider trying to adapt the [12]Mozilla one; if you succeed, let me |
| know. |
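| |
| As a rough idea of what "peeking at the incoming byte stream" can |
| mean, here is a minimal sketch that looks only for a Unicode byte |
| order mark and otherwise falls back to a caller-supplied charset. It |
| is far weaker than a real detector such as the Mozilla one, and it |
| deliberately uses only the standard library: the class and method |
| names are invented, and wiring it into TagSoup's AutoDetector hook is |
| left out. |

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns a Reader over the stream, choosing the charset from a
    // leading byte order mark if there is one, else using the fallback.
    public static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = readUpTo(pin, head);
        Charset charset = fallback;
        int skip = 0; // number of BOM bytes to discard
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            skip = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            skip = 2;
        }
        if (n > skip) {
            pin.unread(head, skip, n - skip); // give back the non-BOM bytes
        }
        return new InputStreamReader(pin, charset);
    }

    // Reads until the buffer is full or the stream ends; returns the count.
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int r = in.read(buf, total, buf.length - total);
            if (r < 0) {
                break;
            }
            total += r;
        }
        return total;
    }
}
```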
| |
| Note: TagSoup in Java 1.1 |
| |
| If you go through the TagSoup source and replace all references to |
| HashMap with Hashtable and recompile, TagSoup will work fine in Java |
| 1.1 VMs. Thanks to Thorbjørn Vinne for this discovery. |
| |
| The TSaxon XSLT-for-HTML processor |
| |
| [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5 |
| of Michael Kay's Saxon XSLT version 1.0 implementation that includes |
| TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to |
| process either HTML or XML documents with XSLT stylesheets. |
| |
| TagSoup as a stand-alone program |
| |
| It is possible to run TagSoup as a program by saying java -jar |
| tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command |
| line will be parsed individually. If no files are specified, the |
| standard input is read. |
| |
| The following options are understood: |
| |
| --files |
| Output into individual files, with html extensions changed to |
| xhtml. Otherwise, all output is sent to the standard output. |
| |
| --html |
| Output is in clean HTML: the XML declaration is suppressed, as |
| are end-tags for the known empty elements. |
| |
| --omit-xml-declaration |
| The XML declaration is suppressed. |
| |
| --method=html |
| End-tags for the known empty HTML elements are suppressed. |
| |
| --doctype-system=systemid |
| Forces the output of a DOCTYPE declaration with the specified |
| systemid. |
| |
| --doctype-public=publicid |
| Forces the output of a DOCTYPE declaration with the specified |
| publicid. |
| |
| --version=version |
| Sets the version string in the XML declaration. |
| |
| --standalone=[yes|no] |
| Sets the standalone declaration to yes or no. |
| |
| --pyx |
| Output is in PYX format. |
| |
| --pyxin |
| Input is in PYXoid format (need not be well-formed). |
| |
| --nons |
| Namespaces are suppressed. Normally, all elements are in the |
| XHTML 1.x namespace, and all attributes are in no namespace. |
| |
| --nobogons |
| Bogons (unknown elements) are suppressed. |
| |
| --nodefaults |
| Default attribute values are suppressed. |
| |
| --nocolons |
| Explicit colons in element and attribute names are changed to |
| underscores. |
| |
| --norestart |
| Normally restartable elements are not restarted. |
| |
| --ignorable |
| Whitespace in elements with element-only content is output. |
| |
| --emptybogons |
| Bogons are given a content model of EMPTY rather than ANY. |
| |
| --any |
| Bogons are given a content model of ANY rather than EMPTY |
| (default). |
| |
| --norootbogons |
| Don't allow bogons to be root elements; make them subordinate to |
| the root. |
| |
| --lexical |
| Pass through HTML comments and DOCTYPE declarations. Has no |
| effect when output is in PYX format. |
| |
| --reuse |
| Reuse a single instance of TagSoup parser throughout. Normally, |
| a new one is instantiated for each input file. |
| |
| --nocdata |
| Change the content models of the script and style elements to |
| treat them as ordinary #PCDATA (text-only) elements, as in |
| XHTML, rather than with the special CDATA content model. |
| |
| --encoding=encoding |
| Specify the input encoding. The default is the Java platform |
| default. |
| |
| --output-encoding=encoding |
| Specify the output encoding. The default is the Java platform |
| default. |
| |
| --help |
| Print help. |
| |
| --version |
| Print the version number. |
| |
| SAX features and properties |
| |
| TagSoup supports the following SAX features in addition to the standard |
| ones: |
| |
| http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons |
| A value of "true" indicates that the parser will ignore unknown |
| elements. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/bogons-empty |
| A value of "true" indicates that the parser will give unknown |
| elements a content model of EMPTY; a value of "false", a content |
| model of ANY. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/root-bogons |
| A value of "true" indicates that the parser will allow unknown |
| elements to be the root of the output document. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/default-attributes |
| A value of "true" indicates that the parser will return default |
| attribute values for missing attributes that have default |
| values. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/translate-colons |
| A value of "true" indicates that the parser will translate |
| colons into underscores in names. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/restart-elements |
| A value of "true" indicates that the parser will attempt to |
| restart the restartable elements. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace |
| A value of "true" indicates that the parser will transmit |
| whitespace in element-only content via the SAX |
| ignorableWhitespace callback. Normally this is not done, because |
| HTML is an SGML application and SGML suppresses such whitespace. |
| |
| http://www.ccil.org/~cowan/tagsoup/features/cdata-elements |
| A value of "true" indicates that the parser will process the |
| script and style elements (or any elements with type='cdata' in |
| the TSSL schema) as SGML CDATA elements (that is, no markup is |
| recognized except the matching end-tag). |
| |
| TagSoup supports the following SAX properties in addition to the |
| standard ones: |
| |
| http://www.ccil.org/~cowan/tagsoup/properties/scanner |
| Specifies the Scanner object this parser uses. |
| |
| http://www.ccil.org/~cowan/tagsoup/properties/schema |
| Specifies the Schema object this parser uses. |
| |
| http://www.ccil.org/~cowan/tagsoup/properties/auto-detector |
| Specifies the AutoDetector (for encoding detection) this parser |
| uses. |
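| |
| Since these are ordinary SAX features, they can be set through the |
| standard XMLReader.setFeature call. The sketch below assumes only the |
| feature URIs listed above; the helper class and the particular choice |
| of settings are made up for illustration. It takes any XMLReader, so |
| obtaining the TagSoup parser itself (presumably an instance of the |
| Parser class mentioned earlier, with the TagSoup jar on the classpath) |
| is left to the caller. |

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class TagSoupFeatures {
    // Common prefix of the feature URIs listed above.
    public static final String PREFIX = "http://www.ccil.org/~cowan/tagsoup/features/";

    // Builds a full feature URI from its last path segment.
    public static String featureUri(String shortName) {
        return PREFIX + shortName;
    }

    // Example configuration (an arbitrary choice for this sketch):
    // unknown elements become EMPTY, default attribute values are not
    // reported, and whitespace in element-only content is passed on.
    public static void configure(XMLReader reader) throws SAXException {
        reader.setFeature(featureUri("bogons-empty"), true);
        reader.setFeature(featureUri("default-attributes"), false);
        reader.setFeature(featureUri("ignorable-whitespace"), true);
    }
}
```

| A reader that does not recognize these URIs, such as an ordinary XML |
| parser, will normally throw SAXNotRecognizedException from setFeature. |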
| |
| More information |
| |
| I gave a presentation (a nocturne, so it's not on the schedule) at |
| [15]Extreme Markup Languages 2004 about TagSoup, updated from the one |
| presented in 2002 at the New York City XML SIG and at XML 2002. This is |
| the main high-level documentation about how TagSoup works. Formats: |
| [16]OpenDocument [17]Powerpoint [18]PDF. |
| |
| I also had people add [19]"evil" HTML to a large poster so that I could |
| [20]clean it up; View Source is probably more useful than ordinary |
| browsing. The original instructions were: |
| |
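| SOUPE DE BALISES (BE EVIL)! |
| Ecritez une balise ouvrante (sans attributs) |
| ou fermante HTML ici, s.v.p. |
| |
| (In English: "Tag soup (be evil)! Please write an HTML start-tag |
| (without attributes) or end-tag here.") |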
| |
| There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups. |
| You can [23]join via the Web, or by sending a blank email to |
| [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are |
| open to all. |
| |
| Online TagSoup processing for publicly accessible HTML documents is now |
| [26]available courtesy of Leigh Dodds. |
| |
| References |
| |
| 1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html |
| 2. http://opensource.org/licenses/apache2.0.php |
| 3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip |
| 4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214 |
| 5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar |
| 6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip |
| 7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES |
| 8. http://tidy.sf.net/ |
| 9. http://www.crumbmuseum.com/truckin.html |
| 10. http://www.cafeconleche.org/XOM |
| 11. http://gnosis.cx/publish/programming/xml_matters_17.html |
| 12. http://jchardet.sourceforge.net/ |
| 13. http://www.ccil.org/~cowan |
| 14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon |
| 15. http://www.extrememarkup.com/extreme/2004 |
| 16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp |
| 17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt |
| 18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf |
| 19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html |
| 20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml |
| 21. http://groups.yahoo.com/group/tagsoup-friends |
| 22. http://groups.yahoo.com/ |
| 23. http://groups.yahoo.com/group/tagsoup-friends/join |
| 24. mailto:tagsoup-friends-subscribe@yahoogroups.com |
| 25. http://groups.yahoo.com/group/tagsoup-friends/messages |
| 26. http://xmlarmyknife.org/docs/xhtml/tagsoup/ |