                         TagSoup - Just Keep On Truckin'

   Introduction

    This is the home page of TagSoup, a SAX-compliant parser written in
    Java that, instead of parsing well-formed or valid XML, parses HTML as
    it is found in the wild: [1]poor, nasty and brutish, though quite often
    far from short. TagSoup is designed for people who have to process this
    stuff using some semblance of a rational application design. By
    providing a SAX interface, it allows standard XML tools to be applied
    to even the worst HTML. TagSoup also includes a command-line processor
    that reads HTML files and can generate either clean HTML or well-formed
    XML that is a close approximation to XHTML.

    This is also the README file packaged with TagSoup.

    TagSoup is free and Open Source software. As of version 1.2, it is
    licensed under the [2]Apache License, Version 2.0, which allows
    proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
    projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
    project, feel free to ask.)

   Warning: TagSoup will not build on stock Java 5.x or 6.x!

    Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
    TagSoup will not build out of the box. You need to retrieve [3]Saxon
    6.5.5, which does not have the bug. Unpack the zipfile in an empty
    directory and copy the saxon.jar and saxon-xml-apis.jar files to
    $ANT_HOME/lib. The Ant build process for TagSoup will then notice that
    Saxon is available and use it instead.

   TagSoup 1.2 released

    There are a great many changes, most of them fixes for long-standing
    bugs, in this release. Only the most important are listed here; for the
    rest, see the CHANGES file in the source distribution. Very special
    thanks to Jojo Dijamco, whose intensive efforts at debugging made this
    release a usable upgrade rather than a useless mass of undetected bugs.
      * As noted above, I have changed the license to Apache 2.0.
      * The default content model for bogons (unknown elements) is now ANY
        rather than EMPTY. This is a breaking change, which I have done
        only because there was so much demand for it. It can be undone on
        the command line with the --emptybogons switch, or programmatically
        with parser.setFeature(Parser.emptyBogonsFeature, true).
      * The processing of entity references in attribute values has finally
        been fixed to do what browsers do. That is, a reference is only
        recognized if it is properly terminated by a semicolon; otherwise
        it is treated as plain text. This means that URIs like
        foo?cdown=32&cup=42 are no longer seen as containing an instance of
        the ∪ character (whose name happens to be cup).
      * Several new switches have been added:
           + --doctype-system and --doctype-public force a DOCTYPE
             declaration to be output and allow setting the system and
             public identifiers.
           + --standalone and --version allow control of the XML
             declaration that is output. (Note that TagSoup's XML output is
             always version 1.0, even if you use --version=1.1.)
           + --norootbogons causes unknown elements not to be allowed as
             the document root element. Instead, they are made children of
             the default root element (the html element for HTML).
      * The TagSoup core now supports character entities with values above
        U+FFFF. As a consequence, the HTML schema now supports all 2,210
        standard character entities from the [4]2007-12-14 draft of XML
        Entity Definitions for Characters, except the 94 which require more
        than one Unicode character to represent.
      * The SAX events startPrefixMapping and endPrefixMapping are now
        being reported for all cases of foreign elements and attributes.
      * All bugs around newline processing on Windows should now be gone.
      * A number of content models have been loosened to allow elements to
        appear in new and non-standard (but commonly found) places. In
        particular, tables are now allowed inside paragraphs, against the
        letter of the W3C specification.
      * Since the span element is intended for fine control of appearance
        using CSS, it should never have been a restartable element. This
        very long-standing bug has now been fixed.
      * The following non-standard elements are now at least partly
        supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
        rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
      * In HTML output mode, boolean attributes like checked are now output
        as such, rather than in XML style as checked="checked".
      * Runs of < characters such as << and <<< are now handled correctly
        in text rather than being transformed into extremely bogus
        start-tags.
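
    The semicolon rule for entity references in attribute values can be
    illustrated with a small sketch. This is not TagSoup's actual code:
    the class name, the method name and the three-entry entity table are
    assumptions made for the example, and numeric references are left
    out. It needs Java 9 or later.

```java
import java.util.Map;

/** Illustrative only: expands &name; references, leaving unterminated ones as text. */
final class AttributeEntities {
    // A tiny hypothetical entity table; a real HTML schema is far larger.
    private static final Map<String, String> ENTITIES =
            Map.of("amp", "&", "cup", "\u222A", "lt", "<");

    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                // Scan the candidate name: letters and digits after the ampersand.
                int j = i + 1;
                while (j < value.length() && Character.isLetterOrDigit(value.charAt(j))) {
                    j++;
                }
                // Only a name followed directly by ';' counts as a reference.
                boolean terminated = j < value.length() && value.charAt(j) == ';';
                String replacement =
                        terminated ? ENTITIES.get(value.substring(i + 1, j)) : null;
                if (replacement != null) {
                    out.append(replacement);
                    i = j + 1; // skip past the semicolon
                    continue;
                }
            }
            // Ordinary character, unterminated reference, or unknown name: keep as text.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

    With this sketch, foo?cdown=32&cup=42 passes through unchanged, while
    a&cup;b has its reference replaced by the ∪ character.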

    [5]Download the TagSoup 1.2 jar file here. It's about 87K long.
    [6]Download the full TagSoup 1.2 source here. If you don't have zip,
    you can use jar to unpack it.
    [7]Download the current CHANGES file here.

   TagSoup 1.1 released

    TagSoup 1.1 adds Tatu Saloranta's JAXP support for TagSoup. To use
    TagSoup within the JAXP framework (which is not something I necessarily
    recommend, but it is part of the Java XML platform), you can create a
    SAXParser by calling
    org.ccil.cowan.tagsoup.jaxp.SAXParserImpl.newInstance(). You can also
    set the system property javax.xml.parsers.SAXParserFactory to
    org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl, but be aware that doing
    this will cause all JAXP-based XML parsing to go through TagSoup, which
    is a Bad Thing if your application also reads XML documents.
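
    One way to avoid routing all JAXP parsing through TagSoup is to name
    the factory class explicitly instead of setting the system property.
    The following is a minimal sketch under stated assumptions: it relies
    on the JDK's two-argument SAXParserFactory.newInstance (Java 6 and
    later), the class and method names TagSoupJaxp, newHtmlParser and
    newXmlParser are hypothetical, and it has not been checked against
    every TagSoup release.

```java
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

final class TagSoupJaxp {
    // Factory class name as given in the text above.
    static final String FACTORY = "org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl";

    /** Returns a TagSoup-backed SAXParser, or null if the factory cannot be loaded. */
    static SAXParser newHtmlParser() {
        try {
            // Naming the factory here leaves the JVM-wide default untouched.
            return SAXParserFactory.newInstance(FACTORY, null).newSAXParser();
        } catch (FactoryConfigurationError e) {
            return null; // most likely the TagSoup jar is not on the classpath
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Ordinary XML parsing still goes through the platform's default factory. */
    static SAXParser newXmlParser() {
        try {
            return SAXParserFactory.newInstance().newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    An application that reads both HTML and XML could then call
    newHtmlParser() for the former and newXmlParser() for the latter.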

   What TagSoup does

    TagSoup is designed as a parser, not a whole application; it isn't
    intended to permanently clean up bad HTML, as [8]HTML Tidy does, only
    to parse it on the fly. Therefore, it does not convert presentation
    HTML to CSS or anything similar. It does guarantee well-structured
    results: tags will wind up properly nested, default attributes will
    appear appropriately, and so on.

    The semantics of TagSoup are as far as practical those of actual HTML
    browsers. In particular, never, never will it throw any sort of syntax
    error: the TagSoup motto is [9]"Just Keep On Truckin'". But there's
    much, much more. For example, if the first tag is LI, it will supply
    the application with enclosing HTML, BODY, and UL tags. Why UL? Because
    that's what browsers assume in this situation. For the same reason,
    overlapping tags are correctly restarted whenever possible: text like:
 This is <B>bold, <I>bold italic, </b>italic, </i>normal text

    gets correctly rewritten as:
 This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
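
    The restart behavior can be sketched with a stack of open elements.
    This is a simplified illustration, not TagSoup's implementation: it
    handles only attribute-free tags, reopens interrupted elements
    immediately, and the class name RestartSketch and its set of
    restartable elements are assumptions. It needs Java 9 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RestartSketch {
    // Hypothetical set of restartable elements, chosen for the example.
    private static final Set<String> RESTARTABLE = Set.of("b", "i", "em", "strong");
    // Matches an attribute-free start-tag or end-tag, or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)");

    static String rectify(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost element first
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) { // plain text
                out.append(m.group(3));
                continue;
            }
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (m.group(1).isEmpty()) { // start-tag
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) { // end-tag for an open element
                List<String> toRestart = new ArrayList<>();
                String top;
                // Close everything nested inside the element being ended.
                while (!(top = open.pop()).equals(name)) {
                    out.append("</").append(top).append('>');
                    if (RESTARTABLE.contains(top)) {
                        toRestart.add(0, top); // keep outermost first
                    }
                }
                out.append("</").append(name).append('>');
                // Reopen the interrupted elements in their original nesting order.
                for (String r : toRestart) {
                    open.push(r);
                    out.append('<').append(r).append('>');
                }
            }
            // An end-tag with no matching open element is silently dropped.
        }
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

    Applied to the input shown above, rectify returns the rewritten text
    shown above, with the i element closed before </b> and reopened after
    it.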

    By intention, TagSoup is small and fast. It does not depend on the
    existence of any framework other than SAX, and should be able to work
    with any framework that can accept SAX parsers. In particular, [10]XOM
    is known to work.

    You can replace the low-level HTML scanner with one based on Sean
    McGrath's [11]PYX format (very close to James Clark's ESIS format). You
    can also supply an AutoDetector that peeks at the incoming byte stream
    and guesses a character encoding for it. Otherwise, the platform
    default is used. If you need an autodetector of character sets,
    consider trying to adapt the [12]Mozilla one; if you succeed, let me
    know.
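
    As an illustration of the kind of logic an autodetector might
    contain, here is a minimal sketch that looks only for a byte-order
    mark and otherwise falls back to a caller-supplied charset. It is
    far cruder than the Mozilla detector, the names BomSniffer,
    autoDetectingReader and decode are assumptions, and wiring it into
    TagSoup's AutoDetector interface is not shown. It needs Java 10 or
    later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    /** Peeks at up to three bytes and picks a charset from a leading BOM, if any. */
    static Reader autoDetectingReader(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.readNBytes(head, 0, 3);
        Charset cs = fallback;
        int skip = 0; // number of BOM bytes to discard
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            skip = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE; // UTF-32 is not considered in this sketch
            skip = 2;
        }
        // Push back everything after the BOM so the reader sees only content.
        if (n > skip) {
            pb.unread(head, skip, n - skip);
        }
        return new InputStreamReader(pb, cs);
    }

    /** Convenience wrapper for demonstration: decodes a whole byte array. */
    static String decode(byte[] data, Charset fallback) {
        try (Reader r = autoDetectingReader(new ByteArrayInputStream(data), fallback)) {
            StringWriter w = new StringWriter();
            r.transferTo(w);
            return w.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```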

   Note: TagSoup in Java 1.1

    If you go through the TagSoup source and replace all references to
    HashMap with Hashtable and recompile, TagSoup will work fine in Java
    1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.

   The TSaxon XSLT-for-HTML processor

    [13]I am also distributing [14]TSaxon, a repackaging of version 6.5.5
    of Michael Kay's Saxon XSLT version 1.0 implementation that includes
    TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
    process either HTML or XML documents with XSLT stylesheets.

   TagSoup as a stand-alone program

    It is possible to run TagSoup as a program by saying java -jar
    tagsoup-1.2.jar [option ...] [file ...]. Files mentioned on the command
    line will be parsed individually. If no files are specified, the
    standard input is read.

    The following options are understood:

    --files
           Output into individual files, with html extensions changed to
           xhtml. Otherwise, all output is sent to the standard output.

    --html
           Output is in clean HTML: the XML declaration is suppressed, as
           are end-tags for the known empty elements.

    --omit-xml-declaration
           The XML declaration is suppressed.

    --method=html
           End-tags for the known empty HTML elements are suppressed.

    --doctype-system=systemid
           Forces the output of a DOCTYPE declaration with the specified
           systemid.

    --doctype-public=publicid
           Forces the output of a DOCTYPE declaration with the specified
           publicid.

    --version=version
           Sets the version string in the XML declaration.

    --standalone=[yes|no]
           Sets the standalone declaration to yes or no.

    --pyx
           Output is in PYX format.

    --pyxin
           Input is in PYXoid format (need not be well-formed).

    --nons
           Namespaces are suppressed. Normally, all elements are in the
           XHTML 1.x namespace, and all attributes are in no namespace.

    --nobogons
           Bogons (unknown elements) are suppressed.

    --nodefaults
           Default attribute values are suppressed.

    --nocolons
           Explicit colons in element and attribute names are changed to
           underscores.

    --norestart
           Normally restartable elements are not restarted.

    --ignorable
           Whitespace in elements with element-only content is output.

    --emptybogons
           Bogons are given a content model of EMPTY rather than ANY.

    --any
           Bogons are given a content model of ANY rather than EMPTY
           (default).

    --norootbogons
           Don't allow bogons to be root elements; make them subordinate to
           the root.

    --lexical
           Pass through HTML comments and DOCTYPE declarations. Has no
           effect when output is in PYX format.

    --reuse
           Reuse a single instance of TagSoup parser throughout. Normally,
           a new one is instantiated for each input file.

    --nocdata
           Change the content models of the script and style elements to
           treat them as ordinary #PCDATA (text-only) elements, as in
           XHTML, rather than with the special CDATA content model.

    --encoding=encoding
           Specify the input encoding. The default is the Java platform
           default.

    --output-encoding=encoding
           Specify the output encoding. The default is the Java platform
           default.

    --help
           Print help.

    --version
           Print the version number.

   SAX features and properties

    TagSoup supports the following SAX features in addition to the standard
    ones:

    http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
           A value of "true" indicates that the parser will ignore unknown
           elements.

    http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
           A value of "true" indicates that the parser will give unknown
           elements a content model of EMPTY; a value of "false", a content
           model of ANY.

    http://www.ccil.org/~cowan/tagsoup/features/root-bogons
           A value of "true" indicates that the parser will allow unknown
           elements to be the root of the output document.

    http://www.ccil.org/~cowan/tagsoup/features/default-attributes
           A value of "true" indicates that the parser will return default
           attribute values for missing attributes that have default
           values.

    http://www.ccil.org/~cowan/tagsoup/features/translate-colons
           A value of "true" indicates that the parser will translate
           colons into underscores in names.

    http://www.ccil.org/~cowan/tagsoup/features/restart-elements
           A value of "true" indicates that the parser will attempt to
           restart the restartable elements.

    http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
           A value of "true" indicates that the parser will transmit
           whitespace in element-only content via the SAX
           ignorableWhitespace callback. Normally this is not done, because
           HTML is an SGML application and SGML suppresses such whitespace.

    http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
           A value of "true" indicates that the parser will process the
           script and style elements (or any elements with type='cdata' in
           the TSSL schema) as SGML CDATA elements (that is, no markup is
           recognized except the matching end-tag).
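
    Because these features are set through the standard SAX setFeature
    call, they can be applied to any XMLReader. The helper below is a
    sketch, not part of TagSoup: the class name TagSoupFeatures and the
    method apply are assumptions, and only the URI prefix comes from the
    list above. It reports which short names the reader rejected rather
    than failing on the first one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

final class TagSoupFeatures {
    // Common prefix of the feature URIs listed above.
    static final String PREFIX = "http://www.ccil.org/~cowan/tagsoup/features/";

    /**
     * Applies short feature names such as "bogons-empty" to a reader.
     * Returns the names the reader did not recognize or support.
     */
    static List<String> apply(XMLReader reader, Map<String, Boolean> settings) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : settings.entrySet()) {
            try {
                reader.setFeature(PREFIX + e.getKey(), e.getValue());
            } catch (SAXException ex) {
                // Covers both SAXNotRecognizedException and SAXNotSupportedException.
                rejected.add(e.getKey());
            }
        }
        return rejected;
    }
}
```

    With the TagSoup jar on the classpath, a call might look like
    TagSoupFeatures.apply(new org.ccil.cowan.tagsoup.Parser(),
    Map.of("bogons-empty", true)); an ordinary XML parser is expected to
    reject these names.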

    TagSoup supports the following SAX properties in addition to the
    standard ones:

    http://www.ccil.org/~cowan/tagsoup/properties/scanner
           Specifies the Scanner object this parser uses.

    http://www.ccil.org/~cowan/tagsoup/properties/schema
           Specifies the Schema object this parser uses.

    http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
           Specifies the AutoDetector (for encoding detection) this parser
           uses.

   More information

    I gave a presentation (a nocturne, so it's not on the schedule) at
    [15]Extreme Markup Languages 2004 about TagSoup, updated from the one
    presented in 2002 at the New York City XML SIG and at XML 2002. This is
    the main high-level documentation about how TagSoup works. Formats:
    [16]OpenDocument [17]Powerpoint [18]PDF.

    I also had people add [19]"evil" HTML to a large poster so that I could
    [20]clean it up; View Source is probably more useful than ordinary
    browsing. The original instructions were:

                          SOUPE DE BALISES (BE EVIL)!
    Ecritez une balise ouvrante (sans attributs)
    ou fermante HTML ici, s.v.p.

    (In English: "Tag soup (be evil)! Please write an HTML start-tag
    (without attributes) or end-tag here.")

    There is a [21]tagsoup-friends mailing list hosted at [22]Yahoo Groups.
    You can [23]join via the Web, or by sending a blank email to
    [24]tagsoup-friends-subscribe@yahoogroups.com. The [25]archives are
    open to all.

    Online TagSoup processing for publicly accessible HTML documents is now
    [26]available courtesy of Leigh Dodds.

 References

    1. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
    2. http://opensource.org/licenses/apache2.0.php
    3. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
    4. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
    5. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2.jar
    6. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup-1.2-src.zip
    7. http://home.ccil.org/~cowan/XML/tagsoup/CHANGES
    8. http://tidy.sf.net/
    9. http://www.crumbmuseum.com/truckin.html
   10. http://www.cafeconleche.org/XOM
   11. http://gnosis.cx/publish/programming/xml_matters_17.html
   12. http://jchardet.sourceforge.net/
   13. http://www.ccil.org/~cowan
   14. http://home.ccil.org/~cowan/XML/tagsoup/tsaxon
   15. http://www.extrememarkup.com/extreme/2004
   16. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.odp
   17. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.ppt
   18. http://home.ccil.org/~cowan/XML/tagsoup/tagsoup.pdf
   19. http://home.ccil.org/~cowan/XML/tagsoup/extreme.html
   20. http://home.ccil.org/~cowan/XML/tagsoup/extreme.xhtml
   21. http://groups.yahoo.com/group/tagsoup-friends
   22. http://groups.yahoo.com/
   23. http://groups.yahoo.com/group/tagsoup-friends/join
   24. mailto:tagsoup-friends-subscribe@yahoogroups.com
   25. http://groups.yahoo.com/group/tagsoup-friends/messages
   26. http://xmlarmyknife.org/docs/xhtml/tagsoup/