| Beautiful Soup Documentation |
| ============================ |
| |
| .. image:: 6.1.jpg |
| :align: right |
| :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself." |
| |
| `Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_ is a |
| Python library for pulling data out of HTML and XML files. It works |
| with your favorite parser to provide idiomatic ways of navigating, |
| searching, and modifying the parse tree. It commonly saves programmers |
| hours or days of work. |
| |
| These instructions illustrate all major features of Beautiful Soup 4, |
| with examples. I show you what the library is good for, how it works, |
| how to use it, how to make it do what you want, and what to do when it |
| violates your expectations. |
| |
| The examples in this documentation should work the same way in Python |
| 2.7 and Python 3.2. |
| |
| You might be looking for the documentation for `Beautiful Soup 3 |
| <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. |
| If so, you should know that Beautiful Soup 3 is no longer being |
| developed, and that Beautiful Soup 4 is recommended for all new |
| projects. If you want to learn about the differences between Beautiful |
| Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_. |
| |
| This documentation has been translated into other languages by its users. |
| |
* This document is also available in a Korean translation. (`external link <http://coreapython.hosting.paran.com/etc/beautifulsoup4.html>`_)
| |
| Getting help |
| ------------ |
| |
| If you have questions about Beautiful Soup, or run into problems, |
| `send mail to the discussion group |
| <https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup>`_. If |
| your problem involves parsing an HTML document, be sure to mention |
| :ref:`what the diagnose() function says <diagnose>` about |
| that document. |
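For reference, here's a minimal sketch of running ``diagnose()`` yourself (assuming Beautiful Soup 4 is installed; the function lives in the ``bs4.diagnose`` module, and the markup string here is made up):

```python
from bs4.diagnose import diagnose

# diagnose() prints a report showing how each installed parser
# handles the markup you pass in; paste that report into your
# question to the discussion group.
diagnose("<p>Some markup that's giving you trouble</p>")
```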
| |
| Quick Start |
| =========== |
| |
| Here's an HTML document I'll be using as an example throughout this |
| document. It's part of a story from `Alice in Wonderland`:: |
| |
| html_doc = """ |
| <html><head><title>The Dormouse's story</title></head> |
| <body> |
| <p class="title"><b>The Dormouse's story</b></p> |
| |
| <p class="story">Once upon a time there were three little sisters; and their names were |
| <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, |
| <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and |
| <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; |
| and they lived at the bottom of a well.</p> |
| |
| <p class="story">...</p> |
| """ |
| |
| Running the "three sisters" document through Beautiful Soup gives us a |
| ``BeautifulSoup`` object, which represents the document as a nested |
| data structure:: |
| |
| from bs4 import BeautifulSoup |
| soup = BeautifulSoup(html_doc) |
| |
| print(soup.prettify()) |
| # <html> |
| # <head> |
| # <title> |
| # The Dormouse's story |
| # </title> |
| # </head> |
| # <body> |
| # <p class="title"> |
| # <b> |
| # The Dormouse's story |
| # </b> |
| # </p> |
| # <p class="story"> |
| # Once upon a time there were three little sisters; and their names were |
| # <a class="sister" href="http://example.com/elsie" id="link1"> |
| # Elsie |
| # </a> |
| # , |
| # <a class="sister" href="http://example.com/lacie" id="link2"> |
| # Lacie |
| # </a> |
| # and |
# <a class="sister" href="http://example.com/tillie" id="link3">
| # Tillie |
| # </a> |
| # ; and they lived at the bottom of a well. |
| # </p> |
| # <p class="story"> |
| # ... |
| # </p> |
| # </body> |
| # </html> |
| |
| Here are some simple ways to navigate that data structure:: |
| |
| soup.title |
| # <title>The Dormouse's story</title> |
| |
| soup.title.name |
| # u'title' |
| |
| soup.title.string |
| # u'The Dormouse's story' |
| |
| soup.title.parent.name |
| # u'head' |
| |
| soup.p |
| # <p class="title"><b>The Dormouse's story</b></p> |
| |
| soup.p['class'] |
# [u'title']
| |
| soup.a |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| soup.find_all('a') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.find(id="link3") |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> |
| |
| One common task is extracting all the URLs found within a page's <a> tags:: |
| |
| for link in soup.find_all('a'): |
| print(link.get('href')) |
| # http://example.com/elsie |
| # http://example.com/lacie |
| # http://example.com/tillie |
| |
| Another common task is extracting all the text from a page:: |
| |
| print(soup.get_text()) |
| # The Dormouse's story |
| # |
| # The Dormouse's story |
| # |
| # Once upon a time there were three little sisters; and their names were |
| # Elsie, |
| # Lacie and |
| # Tillie; |
| # and they lived at the bottom of a well. |
| # |
| # ... |
| |
| Does this look like what you need? If so, read on. |
| |
| Installing Beautiful Soup |
| ========================= |
| |
| If you're using a recent version of Debian or Ubuntu Linux, you can |
| install Beautiful Soup with the system package manager: |
| |
| :kbd:`$ apt-get install python-bs4` |
| |
Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system package manager, you can install it with ``easy_install`` or
| ``pip``. The package name is ``beautifulsoup4``, and the same package |
| works on Python 2 and Python 3. |
| |
| :kbd:`$ easy_install beautifulsoup4` |
| |
| :kbd:`$ pip install beautifulsoup4` |
| |
| (The ``BeautifulSoup`` package is probably `not` what you want. That's |
| the previous major release, `Beautiful Soup 3`_. Lots of software uses |
| BS3, so it's still available, but if you're writing new code you |
| should install ``beautifulsoup4``.) |
| |
| If you don't have ``easy_install`` or ``pip`` installed, you can |
| `download the Beautiful Soup 4 source tarball |
| <http://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and |
| install it with ``setup.py``. |
| |
| :kbd:`$ python setup.py install` |
| |
| If all else fails, the license for Beautiful Soup allows you to |
| package the entire library with your application. You can download the |
| tarball, copy its ``bs4`` directory into your application's codebase, |
| and use Beautiful Soup without installing it at all. |
| |
| I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it |
| should work with other recent versions. |
| |
| Problems after installation |
| --------------------------- |
| |
| Beautiful Soup is packaged as Python 2 code. When you install it for |
| use with Python 3, it's automatically converted to Python 3 code. If |
| you don't install the package, the code won't be converted. There have |
| also been reports on Windows machines of the wrong version being |
| installed. |
| |
| If you get the ``ImportError`` "No module named HTMLParser", your |
| problem is that you're running the Python 2 version of the code under |
| Python 3. |
| |
| If you get the ``ImportError`` "No module named html.parser", your |
| problem is that you're running the Python 3 version of the code under |
| Python 2. |
| |
| In both cases, your best bet is to completely remove the Beautiful |
| Soup installation from your system (including any directory created |
| when you unzipped the tarball) and try the installation again. |
| |
| If you get the ``SyntaxError`` "Invalid syntax" on the line |
| ``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2 |
| code to Python 3. You can do this either by installing the package: |
| |
| :kbd:`$ python3 setup.py install` |
| |
| or by manually running Python's ``2to3`` conversion script on the |
| ``bs4`` directory: |
| |
| :kbd:`$ 2to3-3.2 -w bs4` |
| |
| .. _parser-installation: |
| |
| |
| Installing a parser |
| ------------------- |
| |
| Beautiful Soup supports the HTML parser included in Python's standard |
| library, but it also supports a number of third-party Python parsers. |
| One is the `lxml parser <http://lxml.de/>`_. Depending on your setup, |
| you might install lxml with one of these commands: |
| |
| :kbd:`$ apt-get install python-lxml` |
| |
| :kbd:`$ easy_install lxml` |
| |
| :kbd:`$ pip install lxml` |
| |
| Another alternative is the pure-Python `html5lib parser |
| <http://code.google.com/p/html5lib/>`_, which parses HTML the way a |
| web browser does. Depending on your setup, you might install html5lib |
| with one of these commands: |
| |
| :kbd:`$ apt-get install python-html5lib` |
| |
| :kbd:`$ easy_install html5lib` |
| |
| :kbd:`$ pip install html5lib` |
| |
| This table summarizes the advantages and disadvantages of each parser library: |
| |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| | Parser | Typical usage | Advantages | Disadvantages | |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| | Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | * Batteries included | * Not very lenient | |
| | | | * Decent speed | (before Python 2.7.3 | |
| | | | * Lenient (as of Python 2.7.3 | or 3.2.2) | |
| | | | and 3.2.) | | |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| | lxml's HTML parser | ``BeautifulSoup(markup, "lxml")`` | * Very fast | * External C dependency | |
| | | | * Lenient | | |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| | lxml's XML parser | ``BeautifulSoup(markup, ["lxml", "xml"])`` | * Very fast | * External C dependency | |
| | | ``BeautifulSoup(markup, "xml")`` | * The only currently supported | | |
| | | | XML parser | | |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| | html5lib | ``BeautifulSoup(markup, "html5lib")`` | * Extremely lenient | * Very slow | |
| | | | * Parses pages the same way a | * External Python | |
| | | | web browser does | dependency | |
| | | | * Creates valid HTML5 | | |
| +----------------------+--------------------------------------------+--------------------------------+--------------------------+ |
| |
| If you can, I recommend you install and use lxml for speed. If you're |
| using a version of Python 2 earlier than 2.7.3, or a version of Python |
| 3 earlier than 3.2.2, it's `essential` that you install lxml or |
| html5lib--Python's built-in HTML parser is just not very good in older |
| versions. |
| |
| Note that if a document is invalid, different parsers will generate |
| different Beautiful Soup trees for it. See `Differences |
| between parsers`_ for details. |
| |
| Making the soup |
| =============== |
| |
| To parse a document, pass it into the ``BeautifulSoup`` |
| constructor. You can pass in a string or an open filehandle:: |
| |
| from bs4 import BeautifulSoup |
| |
| soup = BeautifulSoup(open("index.html")) |
| |
| soup = BeautifulSoup("<html>data</html>") |
| |
| First, the document is converted to Unicode, and HTML entities are |
| converted to Unicode characters:: |
| |
BeautifulSoup("Sacré bleu!")
# <html><head></head><body>Sacré bleu!</body></html>
| |
| Beautiful Soup then parses the document using the best available |
| parser. It will use an HTML parser unless you specifically tell it to |
| use an XML parser. (See `Parsing XML`_.) |
| |
| Kinds of objects |
| ================ |
| |
| Beautiful Soup transforms a complex HTML document into a complex tree |
| of Python objects. But you'll only ever have to deal with about four |
| `kinds` of objects. |
| |
| .. _Tag: |
| |
| ``Tag`` |
| ------- |
| |
| A ``Tag`` object corresponds to an XML or HTML tag in the original document:: |
| |
| soup = BeautifulSoup('<b class="boldest">Extremely bold</b>') |
| tag = soup.b |
| type(tag) |
| # <class 'bs4.element.Tag'> |
| |
| Tags have a lot of attributes and methods, and I'll cover most of them |
| in `Navigating the tree`_ and `Searching the tree`_. For now, the most |
| important features of a tag are its name and attributes. |
| |
| Name |
| ^^^^ |
| |
| Every tag has a name, accessible as ``.name``:: |
| |
| tag.name |
| # u'b' |
| |
| If you change a tag's name, the change will be reflected in any HTML |
| markup generated by Beautiful Soup:: |
| |
| tag.name = "blockquote" |
| tag |
| # <blockquote class="boldest">Extremely bold</blockquote> |
| |
| Attributes |
| ^^^^^^^^^^ |
| |
| A tag may have any number of attributes. The tag ``<b |
| class="boldest">`` has an attribute "class" whose value is |
| "boldest". You can access a tag's attributes by treating the tag like |
| a dictionary:: |
| |
| tag['class'] |
# [u'boldest']
| |
| You can access that dictionary directly as ``.attrs``:: |
| |
| tag.attrs |
# {u'class': [u'boldest']}
| |
| You can add, remove, and modify a tag's attributes. Again, this is |
| done by treating the tag as a dictionary:: |
| |
| tag['class'] = 'verybold' |
| tag['id'] = 1 |
| tag |
| # <blockquote class="verybold" id="1">Extremely bold</blockquote> |
| |
| del tag['class'] |
| del tag['id'] |
| tag |
| # <blockquote>Extremely bold</blockquote> |
| |
| tag['class'] |
| # KeyError: 'class' |
| print(tag.get('class')) |
| # None |
| |
| .. _multivalue: |
| |
| Multi-valued attributes |
| &&&&&&&&&&&&&&&&&&&&&&& |
| |
| HTML 4 defines a few attributes that can have multiple values. HTML 5 |
| removes a couple of them, but defines a few more. The most common |
| multi-valued attribute is ``class`` (that is, a tag can have more than |
| one CSS class). Others include ``rel``, ``rev``, ``accept-charset``, |
| ``headers``, and ``accesskey``. Beautiful Soup presents the value(s) |
| of a multi-valued attribute as a list:: |
| |
| css_soup = BeautifulSoup('<p class="body strikeout"></p>') |
| css_soup.p['class'] |
| # ["body", "strikeout"] |
| |
| css_soup = BeautifulSoup('<p class="body"></p>') |
| css_soup.p['class'] |
| # ["body"] |
| |
| If an attribute `looks` like it has more than one value, but it's not |
| a multi-valued attribute as defined by any version of the HTML |
| standard, Beautiful Soup will leave the attribute alone:: |
| |
| id_soup = BeautifulSoup('<p id="my id"></p>') |
| id_soup.p['id'] |
| # 'my id' |
| |
| When you turn a tag back into a string, multiple attribute values are |
| consolidated:: |
| |
| rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>') |
| rel_soup.a['rel'] |
| # ['index'] |
| rel_soup.a['rel'] = ['index', 'contents'] |
| print(rel_soup.p) |
| # <p>Back to the <a rel="index contents">homepage</a></p> |
| |
| If you parse a document as XML, there are no multi-valued attributes:: |
| |
| xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml') |
| xml_soup.p['class'] |
| # u'body strikeout' |
| |
| |
| |
| ``NavigableString`` |
| ------------------- |
| |
| A string corresponds to a bit of text within a tag. Beautiful Soup |
| uses the ``NavigableString`` class to contain these bits of text:: |
| |
| tag.string |
| # u'Extremely bold' |
| type(tag.string) |
| # <class 'bs4.element.NavigableString'> |
| |
| A ``NavigableString`` is just like a Python Unicode string, except |
| that it also supports some of the features described in `Navigating |
| the tree`_ and `Searching the tree`_. You can convert a |
| ``NavigableString`` to a Unicode string with ``unicode()``:: |
| |
| unicode_string = unicode(tag.string) |
| unicode_string |
| # u'Extremely bold' |
| type(unicode_string) |
| # <type 'unicode'> |
| |
| You can't edit a string in place, but you can replace one string with |
| another, using :ref:`replace_with`:: |
| |
| tag.string.replace_with("No longer bold") |
| tag |
| # <blockquote>No longer bold</blockquote> |
| |
| ``NavigableString`` supports most of the features described in |
| `Navigating the tree`_ and `Searching the tree`_, but not all of |
| them. In particular, since a string can't contain anything (the way a |
| tag may contain a string or another tag), strings don't support the |
| ``.contents`` or ``.string`` attributes, or the ``find()`` method. |
| |
| If you want to use a ``NavigableString`` outside of Beautiful Soup, |
| you should call ``unicode()`` on it to turn it into a normal Python |
| Unicode string. If you don't, your string will carry around a |
| reference to the entire Beautiful Soup parse tree, even when you're |
| done using Beautiful Soup. This is a big waste of memory. |
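Here's what that conversion looks like (a minimal sketch; it uses ``str()``, the Python 3 equivalent of ``unicode()``, and passes ``"html.parser"`` explicitly so the example doesn't depend on which parsers you have installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav_string = soup.b.string

# Converting the NavigableString to a plain string drops its reference
# to the parse tree, so the tree can be garbage-collected.
plain = str(nav_string)  # on Python 2, use unicode(nav_string)
print(plain)
# Extremely bold
```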
| |
| ``BeautifulSoup`` |
| ----------------- |
| |
| The ``BeautifulSoup`` object itself represents the document as a |
| whole. For most purposes, you can treat it as a :ref:`Tag` |
| object. This means it supports most of the methods described in |
| `Navigating the tree`_ and `Searching the tree`_. |
| |
| Since the ``BeautifulSoup`` object doesn't correspond to an actual |
| HTML or XML tag, it has no name and no attributes. But sometimes it's |
| useful to look at its ``.name``, so it's been given the special |
| ``.name`` "[document]":: |
| |
| soup.name |
| # u'[document]' |
| |
| Comments and other special strings |
| ---------------------------------- |
| |
| ``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost |
| everything you'll see in an HTML or XML file, but there are a few |
| leftover bits. The only one you'll probably ever need to worry about |
| is the comment:: |
| |
| markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>" |
| soup = BeautifulSoup(markup) |
| comment = soup.b.string |
| type(comment) |
| # <class 'bs4.element.Comment'> |
| |
| The ``Comment`` object is just a special type of ``NavigableString``:: |
| |
| comment |
# u'Hey, buddy. Want to buy a used parser?'
| |
| But when it appears as part of an HTML document, a ``Comment`` is |
| displayed with special formatting:: |
| |
| print(soup.b.prettify()) |
| # <b> |
| # <!--Hey, buddy. Want to buy a used parser?--> |
| # </b> |
| |
| Beautiful Soup defines classes for anything else that might show up in |
| an XML document: ``CData``, ``ProcessingInstruction``, |
| ``Declaration``, and ``Doctype``. Just like ``Comment``, these classes |
| are subclasses of ``NavigableString`` that add something extra to the |
| string. Here's an example that replaces the comment with a CDATA |
| block:: |
| |
| from bs4 import CData |
| cdata = CData("A CDATA block") |
| comment.replace_with(cdata) |
| |
| print(soup.b.prettify()) |
| # <b> |
| # <![CDATA[A CDATA block]]> |
| # </b> |
| |
| |
| Navigating the tree |
| =================== |
| |
| Here's the "Three sisters" HTML document again:: |
| |
| html_doc = """ |
| <html><head><title>The Dormouse's story</title></head> |
| |
| <p class="title"><b>The Dormouse's story</b></p> |
| |
| <p class="story">Once upon a time there were three little sisters; and their names were |
| <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, |
| <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and |
| <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; |
| and they lived at the bottom of a well.</p> |
| |
| <p class="story">...</p> |
| """ |
| |
| from bs4 import BeautifulSoup |
| soup = BeautifulSoup(html_doc) |
| |
| I'll use this as an example to show you how to move from one part of |
| a document to another. |
| |
| Going down |
| ---------- |
| |
| Tags may contain strings and other tags. These elements are the tag's |
| `children`. Beautiful Soup provides a lot of different attributes for |
| navigating and iterating over a tag's children. |
| |
| Note that Beautiful Soup strings don't support any of these |
| attributes, because a string can't have children. |
| |
| Navigating using tag names |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The simplest way to navigate the parse tree is to say the name of the |
| tag you want. If you want the <head> tag, just say ``soup.head``:: |
| |
| soup.head |
| # <head><title>The Dormouse's story</title></head> |
| |
| soup.title |
| # <title>The Dormouse's story</title> |
| |
You can use this trick again and again to zoom in on a certain part
| of the parse tree. This code gets the first <b> tag beneath the <body> tag:: |
| |
| soup.body.b |
| # <b>The Dormouse's story</b> |
| |
| Using a tag name as an attribute will give you only the `first` tag by that |
| name:: |
| |
| soup.a |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| If you need to get `all` the <a> tags, or anything more complicated |
| than the first tag with a certain name, you'll need to use one of the |
| methods described in `Searching the tree`_, such as `find_all()`:: |
| |
| soup.find_all('a') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| ``.contents`` and ``.children`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| A tag's children are available in a list called ``.contents``:: |
| |
| head_tag = soup.head |
| head_tag |
| # <head><title>The Dormouse's story</title></head> |
| |
| head_tag.contents |
# [<title>The Dormouse's story</title>]
| |
| title_tag = head_tag.contents[0] |
| title_tag |
| # <title>The Dormouse's story</title> |
| title_tag.contents |
| # [u'The Dormouse's story'] |
| |
The ``BeautifulSoup`` object itself has children. In this case, the
<html> tag is the child of the ``BeautifulSoup`` object::
| |
| len(soup.contents) |
| # 1 |
| soup.contents[0].name |
| # u'html' |
| |
| A string does not have ``.contents``, because it can't contain |
| anything:: |
| |
| text = title_tag.contents[0] |
| text.contents |
| # AttributeError: 'NavigableString' object has no attribute 'contents' |
| |
| Instead of getting them as a list, you can iterate over a tag's |
| children using the ``.children`` generator:: |
| |
| for child in title_tag.children: |
| print(child) |
| # The Dormouse's story |
| |
| ``.descendants`` |
| ^^^^^^^^^^^^^^^^ |
| |
| The ``.contents`` and ``.children`` attributes only consider a tag's |
| `direct` children. For instance, the <head> tag has a single direct |
| child--the <title> tag:: |
| |
| head_tag.contents |
| # [<title>The Dormouse's story</title>] |
| |
| But the <title> tag itself has a child: the string "The Dormouse's |
| story". There's a sense in which that string is also a child of the |
| <head> tag. The ``.descendants`` attribute lets you iterate over `all` |
| of a tag's children, recursively: its direct children, the children of |
| its direct children, and so on:: |
| |
| for child in head_tag.descendants: |
| print(child) |
| # <title>The Dormouse's story</title> |
| # The Dormouse's story |
| |
| The <head> tag has only one child, but it has two descendants: the |
| <title> tag and the <title> tag's child. The ``BeautifulSoup`` object |
| only has one direct child (the <html> tag), but it has a whole lot of |
| descendants:: |
| |
| len(list(soup.children)) |
| # 1 |
| len(list(soup.descendants)) |
| # 25 |
| |
| .. _.string: |
| |
| ``.string`` |
| ^^^^^^^^^^^ |
| |
| If a tag has only one child, and that child is a ``NavigableString``, |
| the child is made available as ``.string``:: |
| |
| title_tag.string |
| # u'The Dormouse's story' |
| |
| If a tag's only child is another tag, and `that` tag has a |
| ``.string``, then the parent tag is considered to have the same |
| ``.string`` as its child:: |
| |
| head_tag.contents |
| # [<title>The Dormouse's story</title>] |
| |
| head_tag.string |
| # u'The Dormouse's story' |
| |
| If a tag contains more than one thing, then it's not clear what |
| ``.string`` should refer to, so ``.string`` is defined to be |
| ``None``:: |
| |
| print(soup.html.string) |
| # None |
| |
| .. _string-generators: |
| |
``.strings`` and ``.stripped_strings``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| |
| If there's more than one thing inside a tag, you can still look at |
| just the strings. Use the ``.strings`` generator:: |
| |
| for string in soup.strings: |
| print(repr(string)) |
| # u"The Dormouse's story" |
| # u'\n\n' |
| # u"The Dormouse's story" |
| # u'\n\n' |
| # u'Once upon a time there were three little sisters; and their names were\n' |
| # u'Elsie' |
| # u',\n' |
| # u'Lacie' |
| # u' and\n' |
| # u'Tillie' |
| # u';\nand they lived at the bottom of a well.' |
| # u'\n\n' |
| # u'...' |
| # u'\n' |
| |
| These strings tend to have a lot of extra whitespace, which you can |
| remove by using the ``.stripped_strings`` generator instead:: |
| |
| for string in soup.stripped_strings: |
| print(repr(string)) |
| # u"The Dormouse's story" |
| # u"The Dormouse's story" |
| # u'Once upon a time there were three little sisters; and their names were' |
| # u'Elsie' |
| # u',' |
| # u'Lacie' |
| # u'and' |
| # u'Tillie' |
| # u';\nand they lived at the bottom of a well.' |
| # u'...' |
| |
| Here, strings consisting entirely of whitespace are ignored, and |
| whitespace at the beginning and end of strings is removed. |
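A handy pattern built on this (sketched here with a small made-up document) is joining the stripped strings back together to get whitespace-normalized text:

```python
from bs4 import BeautifulSoup

minimal_soup = BeautifulSoup("<p>Once upon <b> a time </b></p>", "html.parser")

# Each stripped string has its leading and trailing whitespace removed,
# so joining them with a single space normalizes the text.
print(" ".join(minimal_soup.stripped_strings))
# Once upon a time
```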
| |
| Going up |
| -------- |
| |
| Continuing the "family tree" analogy, every tag and every string has a |
| `parent`: the tag that contains it. |
| |
| .. _.parent: |
| |
| ``.parent`` |
| ^^^^^^^^^^^ |
| |
| You can access an element's parent with the ``.parent`` attribute. In |
| the example "three sisters" document, the <head> tag is the parent |
| of the <title> tag:: |
| |
| title_tag = soup.title |
| title_tag |
| # <title>The Dormouse's story</title> |
| title_tag.parent |
| # <head><title>The Dormouse's story</title></head> |
| |
| The title string itself has a parent: the <title> tag that contains |
| it:: |
| |
| title_tag.string.parent |
| # <title>The Dormouse's story</title> |
| |
| The parent of a top-level tag like <html> is the ``BeautifulSoup`` object |
| itself:: |
| |
| html_tag = soup.html |
| type(html_tag.parent) |
| # <class 'bs4.BeautifulSoup'> |
| |
| And the ``.parent`` of a ``BeautifulSoup`` object is defined as None:: |
| |
| print(soup.parent) |
| # None |
| |
| .. _.parents: |
| |
| ``.parents`` |
| ^^^^^^^^^^^^ |
| |
| You can iterate over all of an element's parents with |
| ``.parents``. This example uses ``.parents`` to travel from an <a> tag |
| buried deep within the document, to the very top of the document:: |
| |
| link = soup.a |
| link |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| for parent in link.parents: |
| if parent is None: |
| print(parent) |
| else: |
| print(parent.name) |
| # p |
| # body |
| # html |
| # [document] |
| # None |
| |
| Going sideways |
| -------------- |
| |
| Consider a simple document like this:: |
| |
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
| print(sibling_soup.prettify()) |
| # <html> |
| # <body> |
| # <a> |
| # <b> |
| # text1 |
| # </b> |
| # <c> |
| # text2 |
| # </c> |
| # </a> |
| # </body> |
| # </html> |
| |
| The <b> tag and the <c> tag are at the same level: they're both direct |
| children of the same tag. We call them `siblings`. When a document is |
| pretty-printed, siblings show up at the same indentation level. You |
| can also use this relationship in the code you write. |
| |
| ``.next_sibling`` and ``.previous_sibling`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| You can use ``.next_sibling`` and ``.previous_sibling`` to navigate |
| between page elements that are on the same level of the parse tree:: |
| |
| sibling_soup.b.next_sibling |
| # <c>text2</c> |
| |
| sibling_soup.c.previous_sibling |
| # <b>text1</b> |
| |
| The <b> tag has a ``.next_sibling``, but no ``.previous_sibling``, |
| because there's nothing before the <b> tag `on the same level of the |
| tree`. For the same reason, the <c> tag has a ``.previous_sibling`` |
| but no ``.next_sibling``:: |
| |
| print(sibling_soup.b.previous_sibling) |
| # None |
| print(sibling_soup.c.next_sibling) |
| # None |
| |
| The strings "text1" and "text2" are `not` siblings, because they don't |
| have the same parent:: |
| |
| sibling_soup.b.string |
| # u'text1' |
| |
| print(sibling_soup.b.string.next_sibling) |
| # None |
| |
| In real documents, the ``.next_sibling`` or ``.previous_sibling`` of a |
| tag will usually be a string containing whitespace. Going back to the |
| "three sisters" document:: |
| |
| <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> |
| <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> |
| <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> |
| |
| You might think that the ``.next_sibling`` of the first <a> tag would |
| be the second <a> tag. But actually, it's a string: the comma and |
| newline that separate the first <a> tag from the second:: |
| |
| link = soup.a |
| link |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| link.next_sibling |
| # u',\n' |
| |
| The second <a> tag is actually the ``.next_sibling`` of the comma:: |
| |
| link.next_sibling.next_sibling |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> |
| |
| .. _sibling-generators: |
| |
| ``.next_siblings`` and ``.previous_siblings`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| You can iterate over a tag's siblings with ``.next_siblings`` or |
| ``.previous_siblings``:: |
| |
| for sibling in soup.a.next_siblings: |
| print(repr(sibling)) |
| # u',\n' |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> |
| # u' and\n' |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> |
# u';\nand they lived at the bottom of a well.'
| # None |
| |
| for sibling in soup.find(id="link3").previous_siblings: |
| print(repr(sibling)) |
# u' and\n'
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> |
| # u',\n' |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| # u'Once upon a time there were three little sisters; and their names were\n' |
| # None |
| |
| Going back and forth |
| -------------------- |
| |
| Take a look at the beginning of the "three sisters" document:: |
| |
| <html><head><title>The Dormouse's story</title></head> |
| <p class="title"><b>The Dormouse's story</b></p> |
| |
| An HTML parser takes this string of characters and turns it into a |
| series of events: "open an <html> tag", "open a <head> tag", "open a |
| <title> tag", "add a string", "close the <title> tag", "open a <p> |
| tag", and so on. Beautiful Soup offers tools for reconstructing the |
| initial parse of the document. |
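You can watch that event stream directly using the ``HTMLParser`` class from Python's standard library (the module is ``html.parser`` in Python 3 and ``HTMLParser`` in Python 2). This sketch is plain standard-library code, not part of Beautiful Soup; it just records the events described above:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record each parse event as an (event, data) tuple."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))
    def handle_endtag(self, tag):
        self.events.append(("close", tag))
    def handle_data(self, data):
        self.events.append(("string", data))

logger = EventLogger()
logger.feed("<html><head><title>The Dormouse's story</title></head>")
print(logger.events)
# [('open', 'html'), ('open', 'head'), ('open', 'title'),
#  ('string', "The Dormouse's story"), ('close', 'title'), ('close', 'head')]
```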
| |
| .. _element-generators: |
| |
| ``.next_element`` and ``.previous_element`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| The ``.next_element`` attribute of a string or tag points to whatever |
| was parsed immediately afterwards. It might be the same as |
| ``.next_sibling``, but it's usually drastically different. |
| |
Here's the final <a> tag in the "three sisters" document. Its
``.next_sibling`` is a string: the conclusion of the sentence that was
interrupted by the start of the <a> tag::
| |
| last_a_tag = soup.find("a", id="link3") |
| last_a_tag |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> |
| |
| last_a_tag.next_sibling |
# u';\nand they lived at the bottom of a well.'
| |
| But the ``.next_element`` of that <a> tag, the thing that was parsed |
| immediately after the <a> tag, is `not` the rest of that sentence: |
| it's the word "Tillie":: |
| |
| last_a_tag.next_element |
| # u'Tillie' |
| |
| That's because in the original markup, the word "Tillie" appeared |
| before that semicolon. The parser encountered an <a> tag, then the |
| word "Tillie", then the closing </a> tag, then the semicolon and rest of |
| the sentence. The semicolon is on the same level as the <a> tag, but the |
| word "Tillie" was encountered first. |
| |
| The ``.previous_element`` attribute is the exact opposite of |
| ``.next_element``. It points to whatever element was parsed |
| immediately before this one:: |
| |
| last_a_tag.previous_element |
| # u' and\n' |
| last_a_tag.previous_element.next_element |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> |
| |
| ``.next_elements`` and ``.previous_elements`` |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| You should get the idea by now. You can use these iterators to move |
| forward or backward in the document as it was parsed:: |
| |
| for element in last_a_tag.next_elements: |
| print(repr(element)) |
| # u'Tillie' |
| # u';\nand they lived at the bottom of a well.' |
| # u'\n\n' |
| # <p class="story">...</p> |
| # u'...' |
| # u'\n' |
| # None |
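``.previous_elements`` works the same way in reverse. Here's a minimal, self-contained sketch (using a short "Dormouse" fragment and the ``html.parser`` builder; the exact whitespace strings you see depend on the parser):

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Walk backward from the <b> tag through everything parsed before it:
# its parent <p>, earlier strings, the <title>, and finally <html>.
for element in soup.b.previous_elements:
    print(repr(element))
```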
| |
| Searching the tree |
| ================== |
| |
| Beautiful Soup defines a lot of methods for searching the parse tree, |
| but they're all very similar. I'm going to spend a lot of time explaining |
| the two most popular methods: ``find()`` and ``find_all()``. The other |
| methods take almost exactly the same arguments, so I'll just cover |
| them briefly. |
| |
| Once again, I'll be using the "three sisters" document as an example:: |
| |
| html_doc = """ |
| <html><head><title>The Dormouse's story</title></head> |
| |
| <p class="title"><b>The Dormouse's story</b></p> |
| |
| <p class="story">Once upon a time there were three little sisters; and their names were |
| <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, |
| <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and |
| <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; |
| and they lived at the bottom of a well.</p> |
| |
| <p class="story">...</p> |
| """ |
| |
| from bs4 import BeautifulSoup |
| soup = BeautifulSoup(html_doc) |
| |
| By passing in a filter to a method like ``find_all()``, you can |
| zoom in on the parts of the document you're interested in. |
| |
| Kinds of filters |
| ---------------- |
| |
| Before talking in detail about ``find_all()`` and similar methods, I |
| want to show examples of different filters you can pass into these |
| methods. These filters show up again and again, throughout the |
| search API. You can use them to filter based on a tag's name, |
| on its attributes, on the text of a string, or on some combination of |
| these. |
| |
| .. _a string: |
| |
| A string |
| ^^^^^^^^ |
| |
| The simplest filter is a string. Pass a string to a search method and |
| Beautiful Soup will perform a match against that exact string. This |
| code finds all the <b> tags in the document:: |
| |
| soup.find_all('b') |
| # [<b>The Dormouse's story</b>] |
| |
| If you pass in a byte string, Beautiful Soup will assume the string is |
| encoded as UTF-8. You can avoid this by passing in a Unicode string instead. |
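As a quick illustration (a sketch; under Python 3, where plain strings are already Unicode), a byte string and its Unicode equivalent should find the same tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>bold</b><i>italic</i>", "html.parser")

# The byte string is assumed to be UTF-8 and decoded before matching...
by_bytes = soup.find_all(b"b")
# ...so it behaves just like the equivalent Unicode string.
by_unicode = soup.find_all(u"b")
print(by_bytes == by_unicode)
```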
| |
| .. _a regular expression: |
| |
| A regular expression |
| ^^^^^^^^^^^^^^^^^^^^ |
| |
| If you pass in a regular expression object, Beautiful Soup will filter |
| against that regular expression using its ``search()`` method. This code |
| finds all the tags whose names start with the letter "b"; in this |
| case, the <body> tag and the <b> tag:: |
| |
| import re |
| for tag in soup.find_all(re.compile("^b")): |
| print(tag.name) |
| # body |
| # b |
| |
| This code finds all the tags whose names contain the letter 't':: |
| |
| for tag in soup.find_all(re.compile("t")): |
| print(tag.name) |
| # html |
| # title |
| |
| .. _a list: |
| |
| A list |
| ^^^^^^ |
| |
| If you pass in a list, Beautiful Soup will allow a string match |
| against `any` item in that list. This code finds all the <a> tags |
| `and` all the <b> tags:: |
| |
| soup.find_all(["a", "b"]) |
| # [<b>The Dormouse's story</b>, |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| .. _the value True: |
| |
| ``True`` |
| ^^^^^^^^ |
| |
| The value ``True`` matches everything it can. This code finds `all` |
| the tags in the document, but none of the text strings:: |
| |
| for tag in soup.find_all(True): |
| print(tag.name) |
| # html |
| # head |
| # title |
| # body |
| # p |
| # b |
| # p |
| # a |
| # a |
| # a |
| # p |
| |
| .. _a function: |
| |
| A function |
| ^^^^^^^^^^ |
| |
| If none of the other matches work for you, define a function that |
| takes an element as its only argument. The function should return |
| ``True`` if the argument matches, and ``False`` otherwise. |
| |
| Here's a function that returns ``True`` if a tag defines the "class" |
| attribute but doesn't define the "id" attribute:: |
| |
| def has_class_but_no_id(tag): |
| return tag.has_attr('class') and not tag.has_attr('id') |
| |
| Pass this function into ``find_all()`` and you'll pick up all the <p> |
| tags:: |
| |
| soup.find_all(has_class_but_no_id) |
| # [<p class="title"><b>The Dormouse's story</b></p>, |
| # <p class="story">Once upon a time there were...</p>, |
| # <p class="story">...</p>] |
| |
| This function only picks up the <p> tags. It doesn't pick up the <a> |
| tags, because those tags define both "class" and "id". It doesn't pick |
| up tags like <html> and <title>, because those tags don't define |
| "class". |
| |
| Here's a function that returns ``True`` if a tag is surrounded by |
| string objects:: |
| |
| from bs4 import NavigableString |
| def surrounded_by_strings(tag): |
| return (isinstance(tag.next_element, NavigableString) |
| and isinstance(tag.previous_element, NavigableString)) |
| |
| for tag in soup.find_all(surrounded_by_strings): |
| print(tag.name) |
| # p |
| # a |
| # a |
| # a |
| # p |
| |
| Now we're ready to look at the search methods in detail. |
| |
| ``find_all()`` |
| -------------- |
| |
| Signature: find_all(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive |
| <recursive>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| The ``find_all()`` method looks through a tag's descendants and |
| retrieves `all` descendants that match your filters. I gave several |
| examples in `Kinds of filters`_, but here are a few more:: |
| |
| soup.find_all("title") |
| # [<title>The Dormouse's story</title>] |
| |
| soup.find_all("p", "title") |
| # [<p class="title"><b>The Dormouse's story</b></p>] |
| |
| soup.find_all("a") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.find_all(id="link2") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| import re |
| soup.find(text=re.compile("sisters")) |
| # u'Once upon a time there were three little sisters; and their names were\n' |
| |
| Some of these should look familiar, but others are new. What does it |
| mean to pass in a value for ``text``, or ``id``? Why does |
| ``find_all("p", "title")`` find a <p> tag with the CSS class "title"? |
| Let's look at the arguments to ``find_all()``. |
| |
| .. _name: |
| |
| The ``name`` argument |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| Pass in a value for ``name`` and you'll tell Beautiful Soup to only |
| consider tags with certain names. Text strings will be ignored, as |
| will tags whose names don't match. |
| |
| This is the simplest usage:: |
| |
| soup.find_all("title") |
| # [<title>The Dormouse's story</title>] |
| |
| Recall from `Kinds of filters`_ that the value to ``name`` can be `a |
| string`_, `a regular expression`_, `a list`_, `a function`_, or `the value |
| True`_. |
| |
| .. _kwargs: |
| |
| The keyword arguments |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| Any argument that's not recognized will be turned into a filter on one |
| of a tag's attributes. If you pass in a value for an argument called ``id``, |
| Beautiful Soup will filter against each tag's 'id' attribute:: |
| |
| soup.find_all(id='link2') |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| If you pass in a value for ``href``, Beautiful Soup will filter |
| against each tag's 'href' attribute:: |
| |
| soup.find_all(href=re.compile("elsie")) |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| You can filter an attribute based on `a string`_, `a regular |
| expression`_, `a list`_, `a function`_, or `the value True`_. |
| |
| This code finds all tags whose ``id`` attribute has a value, |
| regardless of what the value is:: |
| |
| soup.find_all(id=True) |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| You can filter multiple attributes at once by passing in more than one |
| keyword argument:: |
| |
| soup.find_all(href=re.compile("elsie"), id='link1') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| Some attributes, like the data-* attributes in HTML 5, have names that |
| can't be used as the names of keyword arguments:: |
| |
| data_soup = BeautifulSoup('<div data-foo="value">foo!</div>') |
| data_soup.find_all(data-foo="value") |
| # SyntaxError: keyword can't be an expression |
| |
| You can use these attributes in searches by putting them into a |
| dictionary and passing the dictionary into ``find_all()`` as the |
| ``attrs`` argument:: |
| |
| data_soup.find_all(attrs={"data-foo": "value"}) |
| # [<div data-foo="value">foo!</div>] |
| |
| .. _attrs: |
| |
| Searching by CSS class |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| It's very useful to search for a tag that has a certain CSS class, but |
| the name of the CSS attribute, "class", is a reserved word in |
| Python. Using ``class`` as a keyword argument will give you a syntax |
| error. As of Beautiful Soup 4.1.2, you can search by CSS class using |
| the keyword argument ``class_``:: |
| |
| soup.find_all("a", class_="sister") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| As with any keyword argument, you can pass ``class_`` a string, a regular |
| expression, a function, or ``True``:: |
| |
| soup.find_all(class_=re.compile("itl")) |
| # [<p class="title"><b>The Dormouse's story</b></p>] |
| |
| def has_six_characters(css_class): |
| return css_class is not None and len(css_class) == 6 |
| |
| soup.find_all(class_=has_six_characters) |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| :ref:`Remember <multivalue>` that a single tag can have multiple |
| values for its "class" attribute. When you search for a tag that |
| matches a certain CSS class, you're matching against `any` of its CSS |
| classes:: |
| |
| css_soup = BeautifulSoup('<p class="body strikeout"></p>') |
| css_soup.find_all("p", class_="strikeout") |
| # [<p class="body strikeout"></p>] |
| |
| css_soup.find_all("p", class_="body") |
| # [<p class="body strikeout"></p>] |
| |
| You can also search for the exact string value of the ``class`` attribute:: |
| |
| css_soup.find_all("p", class_="body strikeout") |
| # [<p class="body strikeout"></p>] |
| |
| But searching for variants of the string value won't work:: |
| |
| css_soup.find_all("p", class_="strikeout body") |
| # [] |
| |
| If you want to search for tags that match two or more CSS classes, you |
| should use a CSS selector:: |
| |
| css_soup.select("p.strikeout.body") |
| # [<p class="body strikeout"></p>] |
| |
| In older versions of Beautiful Soup, which don't have the ``class_`` |
| shortcut, you can use the ``attrs`` trick mentioned above. Create a |
| dictionary whose value for "class" is the string (or regular |
| expression, or whatever) you want to search for:: |
| |
| soup.find_all("a", attrs={"class": "sister"}) |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| .. _text: |
| |
| The ``text`` argument |
| ^^^^^^^^^^^^^^^^^^^^^ |
| |
| With ``text`` you can search for strings instead of tags. As with |
| ``name`` and the keyword arguments, you can pass in `a string`_, `a |
| regular expression`_, `a list`_, `a function`_, or `the value True`_. |
| Here are some examples:: |
| |
| soup.find_all(text="Elsie") |
| # [u'Elsie'] |
| |
| soup.find_all(text=["Tillie", "Elsie", "Lacie"]) |
| # [u'Elsie', u'Lacie', u'Tillie'] |
| |
| soup.find_all(text=re.compile("Dormouse")) |
| # [u"The Dormouse's story", u"The Dormouse's story"] |
| |
| def is_the_only_string_within_a_tag(s): |
| """Return True if this string is the only child of its parent tag.""" |
| return (s == s.parent.string) |
| |
| soup.find_all(text=is_the_only_string_within_a_tag) |
| # [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...'] |
| |
| Although ``text`` is for finding strings, you can combine it with |
| arguments that find tags: Beautiful Soup will find all tags whose |
| ``.string`` matches your value for ``text``. This code finds the <a> |
| tags whose ``.string`` is "Elsie":: |
| |
| soup.find_all("a", text="Elsie") |
| # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>] |
| |
| .. _limit: |
| |
| The ``limit`` argument |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| ``find_all()`` returns all the tags and strings that match your |
| filters. This can take a while if the document is large. If you don't |
| need `all` the results, you can pass in a number for ``limit``. This |
| works just like the LIMIT keyword in SQL. It tells Beautiful Soup to |
| stop gathering results after it's found a certain number. |
| |
| There are three links in the "three sisters" document, but this code |
| only finds the first two:: |
| |
| soup.find_all("a", limit=2) |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| .. _recursive: |
| |
| The ``recursive`` argument |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| If you call ``mytag.find_all()``, Beautiful Soup will examine all the |
| descendants of ``mytag``: its children, its children's children, and |
| so on. If you only want Beautiful Soup to consider direct children, |
| you can pass in ``recursive=False``. See the difference here:: |
| |
| soup.html.find_all("title") |
| # [<title>The Dormouse's story</title>] |
| |
| soup.html.find_all("title", recursive=False) |
| # [] |
| |
| Here's that part of the document:: |
| |
| <html> |
| <head> |
| <title> |
| The Dormouse's story |
| </title> |
| </head> |
| ... |
| |
| The <title> tag is beneath the <html> tag, but it's not `directly` |
| beneath the <html> tag: the <head> tag is in the way. Beautiful Soup |
| finds the <title> tag when it's allowed to look at all descendants of |
| the <html> tag, but when ``recursive=False`` restricts it to the |
| <html> tag's immediate children, it finds nothing. |
| |
| Beautiful Soup offers a lot of tree-searching methods (covered below), |
| and they mostly take the same arguments as ``find_all()``: ``name``, |
| ``attrs``, ``text``, ``limit``, and the keyword arguments. But the |
| ``recursive`` argument is different: ``find_all()`` and ``find()`` are |
| the only methods that support it. Passing ``recursive=False`` into a |
| method like ``find_parents()`` wouldn't be very useful. |
| |
| Calling a tag is like calling ``find_all()`` |
| -------------------------------------------- |
| |
| Because ``find_all()`` is the most popular method in the Beautiful |
| Soup search API, you can use a shortcut for it. If you treat the |
| ``BeautifulSoup`` object or a ``Tag`` object as though it were a |
| function, then it's the same as calling ``find_all()`` on that |
| object. These two lines of code are equivalent:: |
| |
| soup.find_all("a") |
| soup("a") |
| |
| These two lines are also equivalent:: |
| |
| soup.title.find_all(text=True) |
| soup.title(text=True) |
| |
| ``find()`` |
| ---------- |
| |
| Signature: find(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`recursive |
| <recursive>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| The ``find_all()`` method scans the entire document looking for |
| results, but sometimes you only want to find one result. If you know a |
| document only has one <body> tag, it's a waste of time to scan the |
| entire document looking for more. Rather than passing in ``limit=1`` |
| every time you call ``find_all``, you can use the ``find()`` |
| method. These two lines of code are `nearly` equivalent:: |
| |
| soup.find_all('title', limit=1) |
| # [<title>The Dormouse's story</title>] |
| |
| soup.find('title') |
| # <title>The Dormouse's story</title> |
| |
| The only difference is that ``find_all()`` returns a list containing |
| the single result, and ``find()`` just returns the result. |
| |
| If ``find_all()`` can't find anything, it returns an empty list. If |
| ``find()`` can't find anything, it returns ``None``:: |
| |
| print(soup.find("nosuchtag")) |
| # None |
| |
| Remember the ``soup.head.title`` trick from `Navigating using tag |
| names`_? That trick works by repeatedly calling ``find()``:: |
| |
| soup.head.title |
| # <title>The Dormouse's story</title> |
| |
| soup.find("head").find("title") |
| # <title>The Dormouse's story</title> |
| |
| ``find_parents()`` and ``find_parent()`` |
| ---------------------------------------- |
| |
| Signature: find_parents(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| Signature: find_parent(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| I spent a lot of time above covering ``find_all()`` and |
| ``find()``. The Beautiful Soup API defines ten other methods for |
| searching the tree, but don't be afraid. Five of these methods are |
| basically the same as ``find_all()``, and the other five are basically |
| the same as ``find()``. The only differences are in what parts of the |
| tree they search. |
| |
| First let's consider ``find_parents()`` and |
| ``find_parent()``. Remember that ``find_all()`` and ``find()`` work |
| their way down the tree, looking at tag's descendants. These methods |
| do the opposite: they work their way `up` the tree, looking at a tag's |
| (or a string's) parents. Let's try them out, starting from a string |
| buried deep in the "three sisters" document:: |
| |
| a_string = soup.find(text="Lacie") |
| a_string |
| # u'Lacie' |
| |
| a_string.find_parents("a") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| a_string.find_parent("p") |
| # <p class="story">Once upon a time there were three little sisters; and their names were |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; |
| # and they lived at the bottom of a well.</p> |
| |
| a_string.find_parents("p", class_="title") |
| # [] |
| |
| One of the three <a> tags is the direct parent of the string in |
| question, so our search finds it. One of the three <p> tags is an |
| indirect parent of the string, and our search finds that as |
| well. There's a <p> tag with the CSS class "title" `somewhere` in the |
| document, but it's not one of this string's parents, so we can't find |
| it with ``find_parents()``. |
| |
| You may have made the connection between ``find_parent()`` and |
| ``find_parents()``, and the `.parent`_ and `.parents`_ attributes |
| mentioned earlier. The connection is very strong. These search methods |
| actually use ``.parents`` to iterate over all the parents, and check |
| each one against the provided filter to see if it matches. |
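To make that connection concrete, here's a hand-rolled sketch of the same mechanism (an illustration only, not the library's actual code), matching on nothing but the tag name:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story"><a id="link2">Lacie</a></p>', "html.parser")

def my_find_parent(element, name):
    # Iterate over .parents and return the first matching tag,
    # just as find_parent() checks each parent against its filters.
    for parent in element.parents:
        if parent.name == name:
            return parent
    return None

a_string = soup.find(text="Lacie")
print(my_find_parent(a_string, "p"))
```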
| |
| ``find_next_siblings()`` and ``find_next_sibling()`` |
| ---------------------------------------------------- |
| |
| Signature: find_next_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| Signature: find_next_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| These methods use :ref:`.next_siblings <sibling-generators>` to |
| iterate over the rest of an element's siblings in the tree. The |
| ``find_next_siblings()`` method returns all the siblings that match, |
| and ``find_next_sibling()`` only returns the first one:: |
| |
| first_link = soup.a |
| first_link |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| first_link.find_next_siblings("a") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| first_story_paragraph = soup.find("p", "story") |
| first_story_paragraph.find_next_sibling("p") |
| # <p class="story">...</p> |
| |
| ``find_previous_siblings()`` and ``find_previous_sibling()`` |
| ------------------------------------------------------------ |
| |
| Signature: find_previous_siblings(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| Signature: find_previous_sibling(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| These methods use :ref:`.previous_siblings <sibling-generators>` to iterate over an element's |
| siblings that precede it in the tree. The ``find_previous_siblings()`` |
| method returns all the siblings that match, and |
| ``find_previous_sibling()`` only returns the first one:: |
| |
| last_link = soup.find("a", id="link3") |
| last_link |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> |
| |
| last_link.find_previous_siblings("a") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| first_story_paragraph = soup.find("p", "story") |
| first_story_paragraph.find_previous_sibling("p") |
| # <p class="title"><b>The Dormouse's story</b></p> |
| |
| |
| ``find_all_next()`` and ``find_next()`` |
| --------------------------------------- |
| |
| Signature: find_all_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| Signature: find_next(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| These methods use :ref:`.next_elements <element-generators>` to |
| iterate over whatever tags and strings come after an element in the |
| document. The ``find_all_next()`` method returns all matches, and |
| ``find_next()`` only returns the first match:: |
| |
| first_link = soup.a |
| first_link |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| first_link.find_all_next(text=True) |
| # [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', |
| # u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n'] |
| |
| first_link.find_next("p") |
| # <p class="story">...</p> |
| |
| In the first example, the string "Elsie" showed up, even though it was |
| contained within the <a> tag we started from. In the second example, |
| the last <p> tag in the document showed up, even though it's not in |
| the same part of the tree as the <a> tag we started from. For these |
| methods, all that matters is that an element match the filter, and |
| show up later in the document than the starting element. |
| |
| ``find_all_previous()`` and ``find_previous()`` |
| ----------------------------------------------- |
| |
| Signature: find_all_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`limit <limit>`, :ref:`**kwargs <kwargs>`) |
| |
| Signature: find_previous(:ref:`name <name>`, :ref:`attrs <attrs>`, :ref:`text <text>`, :ref:`**kwargs <kwargs>`) |
| |
| These methods use :ref:`.previous_elements <element-generators>` to |
| iterate over the tags and strings that come before an element in the |
| document. The ``find_all_previous()`` method returns all matches, and |
| ``find_previous()`` only returns the first match:: |
| |
| first_link = soup.a |
| first_link |
| # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> |
| |
| first_link.find_all_previous("p") |
| # [<p class="story">Once upon a time there were three little sisters; ...</p>, |
| # <p class="title"><b>The Dormouse's story</b></p>] |
| |
| first_link.find_previous("title") |
| # <title>The Dormouse's story</title> |
| |
| The call to ``find_all_previous("p")`` found the first paragraph in |
| the document (the one with class="title"), but it also found the |
| second paragraph, the <p> tag that contains the <a> tag we started |
| with. This shouldn't be too surprising: we're looking at all the tags |
| that show up earlier in the document than the one we started with. A |
| <p> tag that contains an <a> tag must have shown up before the <a> |
| tag it contains. |
| |
| CSS selectors |
| ------------- |
| |
| Beautiful Soup supports the most commonly-used `CSS selectors |
| <http://www.w3.org/TR/CSS2/selector.html>`_. Just pass a string into |
| the ``.select()`` method of a ``Tag`` object or the ``BeautifulSoup`` |
| object itself. |
| |
| You can find tags:: |
| |
| soup.select("title") |
| # [<title>The Dormouse's story</title>] |
| |
| soup.select("p:nth-of-type(3)") |
| # [<p class="story">...</p>] |
| |
| Find tags beneath other tags:: |
| |
| soup.select("body a") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select("html head title") |
| # [<title>The Dormouse's story</title>] |
| |
| Find tags `directly` beneath other tags:: |
| |
| soup.select("head > title") |
| # [<title>The Dormouse's story</title>] |
| |
| soup.select("p > a") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select("p > a:nth-of-type(2)") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| soup.select("p > #link1") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| soup.select("body > a") |
| # [] |
| |
| Find the siblings of tags:: |
| |
| soup.select("#link1 ~ .sister") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select("#link1 + .sister") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| Find tags by CSS class:: |
| |
| soup.select(".sister") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select("[class~=sister]") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| Find tags by ID:: |
| |
| soup.select("#link1") |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| soup.select("a#link2") |
| # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>] |
| |
| Test for the existence of an attribute:: |
| |
| soup.select('a[href]') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| Find tags by attribute value:: |
| |
| soup.select('a[href="http://example.com/elsie"]') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| soup.select('a[href^="http://example.com/"]') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, |
| # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, |
| # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select('a[href$="tillie"]') |
| # [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] |
| |
| soup.select('a[href*=".com/el"]') |
| # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>] |
| |
| Match language codes:: |
| |
| multilingual_markup = """ |
| <p lang="en">Hello</p> |
| <p lang="en-us">Howdy, y'all</p> |
| <p lang="en-gb">Pip-pip, old fruit</p> |
| <p lang="fr">Bonjour mes amis</p> |
| """ |
| multilingual_soup = BeautifulSoup(multilingual_markup) |
| multilingual_soup.select('p[lang|=en]') |
| # [<p lang="en">Hello</p>, |
| # <p lang="en-us">Howdy, y'all</p>, |
| # <p lang="en-gb">Pip-pip, old fruit</p>] |
| |
| This is a convenience for users who know the CSS selector syntax. You |
| can do all this stuff with the Beautiful Soup API. And if CSS |
| selectors are all you need, you might as well use lxml directly, |
| because it's faster. But this lets you `combine` simple CSS selectors |
| with the Beautiful Soup API. |
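For example, these two queries should return the same tags (a sketch using a small stand-in for the "three sisters" markup):

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

# The same search, written once as a CSS selector...
via_select = soup.select("a.sister")
# ...and once through the regular Beautiful Soup API.
via_find_all = soup.find_all("a", class_="sister")
print(via_select == via_find_all)
```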
| |
| |
| Modifying the tree |
| ================== |
| |
| Beautiful Soup's main strength is in searching the parse tree, but you |
| can also modify the tree and write your changes as a new HTML or XML |
| document. |
| |
| Changing tag names and attributes |
| --------------------------------- |
| |
| I covered this earlier, in `Attributes`_, but it bears repeating. You |
| can rename a tag, change the values of its attributes, add new |
| attributes, and delete attributes:: |
| |
| soup = BeautifulSoup('<b class="boldest">Extremely bold</b>') |
| tag = soup.b |
| |
| tag.name = "blockquote" |
| tag['class'] = 'verybold' |
| tag['id'] = 1 |
| tag |
| # <blockquote class="verybold" id="1">Extremely bold</blockquote> |
| |
| del tag['class'] |
| del tag['id'] |
| tag |
| # <blockquote>Extremely bold</blockquote> |
| |
| |
| Modifying ``.string`` |
| --------------------- |
| |
| If you set a tag's ``.string`` attribute, the tag's contents are |
| replaced with the string you give:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| |
| tag = soup.a |
| tag.string = "New link text." |
| tag |
| # <a href="http://example.com/">New link text.</a> |
| |
| Be careful: if the tag contained other tags, they and all their |
| contents will be destroyed. |
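For instance (with made-up markup), setting ``.string`` on a tag that has child tags wipes them out:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div>Keep <i>me</i>? <b>No.</b></div>', "html.parser")

tag = soup.div
tag.string = "Everything inside is gone."
print(tag)
# The <i> and <b> tags were destroyed along with their contents.
```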
| |
| ``append()`` |
| ------------ |
| |
| You can add to a tag's contents with ``Tag.append()``. It works just |
| like calling ``.append()`` on a Python list:: |
| |
| soup = BeautifulSoup("<a>Foo</a>") |
| soup.a.append("Bar") |
| |
| soup |
| # <html><head></head><body><a>FooBar</a></body></html> |
| soup.a.contents |
| # [u'Foo', u'Bar'] |
| |
| ``BeautifulSoup.new_string()`` and ``.new_tag()`` |
| ------------------------------------------------- |
| |
| If you need to add a string to a document, no problem--you can pass a |
| Python string in to ``append()``, or you can call the factory method |
| ``BeautifulSoup.new_string()``:: |
| |
| soup = BeautifulSoup("<b></b>") |
| tag = soup.b |
| tag.append("Hello") |
| new_string = soup.new_string(" there") |
| tag.append(new_string) |
| tag |
| # <b>Hello there</b> |
| tag.contents |
| # [u'Hello', u' there'] |
| |
| If you want to create a comment or some other subclass of |
| ``NavigableString``, pass that class as the second argument to |
| ``new_string()``:: |
| |
| from bs4 import Comment |
| new_comment = soup.new_string("Nice to see you.", Comment) |
| tag.append(new_comment) |
| tag |
| # <b>Hello there<!--Nice to see you.--></b> |
| tag.contents |
| # [u'Hello', u' there', u'Nice to see you.'] |
| |
| (This is a new feature in Beautiful Soup 4.2.1.) |
| |
| What if you need to create a whole new tag? The best solution is to |
| call the factory method ``BeautifulSoup.new_tag()``:: |
| |
| soup = BeautifulSoup("<b></b>") |
| original_tag = soup.b |
| |
| new_tag = soup.new_tag("a", href="http://www.example.com") |
| original_tag.append(new_tag) |
| original_tag |
| # <b><a href="http://www.example.com"></a></b> |
| |
| new_tag.string = "Link text." |
| original_tag |
| # <b><a href="http://www.example.com">Link text.</a></b> |
| |
| Only the first argument, the tag name, is required. |
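Here's a short sketch of both ways of setting attributes on a new tag (the ``external`` class value is made up for illustration; Python's built-in parser is assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b></b>", "html.parser")

# Keyword arguments to new_tag() become attributes...
new_tag = soup.new_tag("a", href="http://www.example.com")

# ...and attribute names that clash with Python syntax (like "class")
# can be set afterwards, dictionary-style.
new_tag["class"] = "external"
```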
| |
| ``insert()`` |
| ------------ |
| |
| ``Tag.insert()`` is just like ``Tag.append()``, except the new element |
| doesn't necessarily go at the end of its parent's |
| ``.contents``. It'll be inserted at whatever numeric position you |
| say. It works just like ``.insert()`` on a Python list:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| tag = soup.a |
| |
| tag.insert(1, "but did not endorse ") |
| tag |
| # <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a> |
| tag.contents |
| # [u'I linked to ', u'but did not endorse ', <i>example.com</i>] |
| |
| ``insert_before()`` and ``insert_after()`` |
| ------------------------------------------ |
| |
| The ``insert_before()`` method inserts a tag or string immediately |
| before something else in the parse tree:: |
| |
| soup = BeautifulSoup("<b>stop</b>") |
| tag = soup.new_tag("i") |
| tag.string = "Don't" |
| soup.b.string.insert_before(tag) |
| soup.b |
| # <b><i>Don't</i>stop</b> |
| |
| The ``insert_after()`` method moves a tag or string so that it |
| immediately follows something else in the parse tree:: |
| |
| soup.b.i.insert_after(soup.new_string(" ever ")) |
| soup.b |
| # <b><i>Don't</i> ever stop</b> |
| soup.b.contents |
| # [<i>Don't</i>, u' ever ', u'stop'] |
| |
| ``clear()`` |
| ----------- |
| |
| ``Tag.clear()`` removes the contents of a tag:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| tag = soup.a |
| |
| tag.clear() |
| tag |
| # <a href="http://example.com/"></a> |
| |
| ``extract()`` |
| ------------- |
| |
| ``PageElement.extract()`` removes a tag or string from the tree. It |
| returns the tag or string that was extracted:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| a_tag = soup.a |
| |
| i_tag = soup.i.extract() |
| |
| a_tag |
| # <a href="http://example.com/">I linked to</a> |
| |
| i_tag |
| # <i>example.com</i> |
| |
| print(i_tag.parent) |
| # None |
| |
| At this point you effectively have two parse trees: one rooted at the |
| ``BeautifulSoup`` object you used to parse the document, and one rooted |
| at the tag that was extracted. You can go on to call ``extract`` on |
| a child of the element you extracted:: |
| |
| my_string = i_tag.string.extract() |
| my_string |
| # u'example.com' |
| |
| print(my_string.parent) |
| # None |
| i_tag |
| # <i></i> |
| |
| |
| ``decompose()`` |
| --------------- |
| |
| ``Tag.decompose()`` removes a tag from the tree, then `completely |
| destroys it and its contents`:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| a_tag = soup.a |
| |
| soup.i.decompose() |
| |
| a_tag |
| # <a href="http://example.com/">I linked to</a> |
| |
| |
| .. _replace_with: |
| |
| ``replace_with()`` |
| ------------------ |
| |
| ``PageElement.replace_with()`` removes a tag or string from the tree, |
| and replaces it with the tag or string of your choice:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| a_tag = soup.a |
| |
| new_tag = soup.new_tag("b") |
| new_tag.string = "example.net" |
| a_tag.i.replace_with(new_tag) |
| |
| a_tag |
| # <a href="http://example.com/">I linked to <b>example.net</b></a> |
| |
| ``replace_with()`` returns the tag or string that was replaced, so |
| that you can examine it or add it back to another part of the tree. |
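A minimal sketch of that round trip (the ``example.net`` tag and the " (formerly " text are invented for illustration, and Python's built-in parser is assumed):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# replace_with() hands back the element it removed...
old_tag = soup.a.i.replace_with(new_tag)

# ...so you can re-insert it somewhere else in the tree.
soup.a.append(" (formerly ")
soup.a.append(old_tag)
soup.a.append(")")
```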
| |
| ``wrap()`` |
| ---------- |
| |
| ``PageElement.wrap()`` wraps an element in the tag you specify. It |
| returns the new wrapper:: |
| |
| soup = BeautifulSoup("<p>I wish I was bold.</p>") |
| soup.p.string.wrap(soup.new_tag("b")) |
| # <b>I wish I was bold.</b> |
| |
| soup.p.wrap(soup.new_tag("div")) |
| # <div><p><b>I wish I was bold.</b></p></div> |
| |
| This method is new in Beautiful Soup 4.0.5. |
| |
| ``unwrap()`` |
| --------------------------- |
| |
| ``Tag.unwrap()`` is the opposite of ``wrap()``. It replaces a tag with |
| whatever's inside that tag. It's good for stripping out markup:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| a_tag = soup.a |
| |
| a_tag.i.unwrap() |
| a_tag |
| # <a href="http://example.com/">I linked to example.com</a> |
| |
| Like ``replace_with()``, ``unwrap()`` returns the tag |
| that was replaced. |
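As a quick sketch, the return value is the now-empty tag, detached from the tree (built-in parser assumed):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# unwrap() moves the <i> tag's contents up into the <a> tag and
# returns the emptied <i> tag itself.
stripped = soup.a.i.unwrap()
```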
| |
| Output |
| ====== |
| |
| .. _.prettyprinting: |
| |
| Pretty-printing |
| --------------- |
| |
| The ``prettify()`` method will turn a Beautiful Soup parse tree into a |
| nicely formatted Unicode string, with each HTML/XML tag on its own line:: |
| |
| markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| soup = BeautifulSoup(markup) |
| soup.prettify() |
| # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...' |
| |
| print(soup.prettify()) |
| # <html> |
| #  <head> |
| #  </head> |
| #  <body> |
| #   <a href="http://example.com/"> |
| #    I linked to |
| #    <i> |
| #     example.com |
| #    </i> |
| #   </a> |
| #  </body> |
| # </html> |
| |
| You can call ``prettify()`` on the top-level ``BeautifulSoup`` object, |
| or on any of its ``Tag`` objects:: |
| |
| print(soup.a.prettify()) |
| # <a href="http://example.com/"> |
| #  I linked to |
| #  <i> |
| #   example.com |
| #  </i> |
| # </a> |
| |
| Non-pretty printing |
| ------------------- |
| |
| If you just want a string, with no fancy formatting, you can call |
| ``unicode()`` or ``str()`` on a ``BeautifulSoup`` object, or a ``Tag`` |
| within it:: |
| |
| str(soup) |
| # '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>' |
| |
| unicode(soup.a) |
| # u'<a href="http://example.com/">I linked to <i>example.com</i></a>' |
| |
| The ``str()`` function returns a string encoded in UTF-8. See |
| `Encodings`_ for other options. |
| |
| You can also call ``encode()`` to get a bytestring, and ``decode()`` |
| to get Unicode. |
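For example (a sketch assuming the built-in parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Sacr\xe9 bleu!</b>", "html.parser")

# encode() gives a bytestring in the encoding you ask for
# (UTF-8 by default).
as_bytes = soup.b.encode("utf-8")

# decode() gives a Unicode string.
as_text = soup.b.decode()
```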
| |
| .. _output_formatters: |
| |
| Output formatters |
| ----------------- |
| |
| If you give Beautiful Soup a document that contains HTML entities like |
| "&ldquo;", they'll be converted to Unicode characters:: |
| |
| soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.") |
| unicode(soup) |
| # u'<html><head></head><body>\u201cDammit!\u201d he said.</body></html>' |
| |
| If you then convert the document to a string, the Unicode characters |
| will be encoded as UTF-8. You won't get the HTML entities back:: |
| |
| str(soup) |
| # '<html><head></head><body>\xe2\x80\x9cDammit!\xe2\x80\x9d he said.</body></html>' |
| |
| By default, the only characters that are escaped upon output are bare |
| ampersands and angle brackets. These get turned into "&amp;", "&lt;", |
| and "&gt;", so that Beautiful Soup doesn't inadvertently generate |
| invalid HTML or XML:: |
| |
| soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, & Howe</p>") |
| soup.p |
| # <p>The law firm of Dewey, Cheatem, &amp; Howe</p> |
| |
| soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>') |
| soup.a |
| # <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a> |
| |
| You can change this behavior by providing a value for the |
| ``formatter`` argument to ``prettify()``, ``encode()``, or |
| ``decode()``. Beautiful Soup recognizes four possible values for |
| ``formatter``. |
| |
| The default is ``formatter="minimal"``. Strings will only be processed |
| enough to ensure that Beautiful Soup generates valid HTML/XML:: |
| |
| french = "<p>Il a dit <<Sacré bleu!>></p>" |
| soup = BeautifulSoup(french) |
| print(soup.prettify(formatter="minimal")) |
| # <html> |
| #  <body> |
| #   <p> |
| #    Il a dit &lt;&lt;Sacré bleu!&gt;&gt; |
| #   </p> |
| #  </body> |
| # </html> |
| |
| If you pass in ``formatter="html"``, Beautiful Soup will convert |
| Unicode characters to HTML entities whenever possible:: |
| |
| print(soup.prettify(formatter="html")) |
| # <html> |
| #  <body> |
| #   <p> |
| #    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt; |
| #   </p> |
| #  </body> |
| # </html> |
| |
| If you pass in ``formatter=None``, Beautiful Soup will not modify |
| strings at all on output. This is the fastest option, but it may lead |
| to Beautiful Soup generating invalid HTML/XML, as in these examples:: |
| |
| print(soup.prettify(formatter=None)) |
| # <html> |
| #  <body> |
| #   <p> |
| #    Il a dit <<Sacré bleu!>> |
| #   </p> |
| #  </body> |
| # </html> |
| |
| link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>') |
| print(link_soup.a.encode(formatter=None)) |
| # <a href="http://example.com/?foo=val1&bar=val2">A link</a> |
| |
| Finally, if you pass in a function for ``formatter``, Beautiful Soup |
| will call that function once for every string and attribute value in |
| the document. You can do whatever you want in this function. Here's a |
| formatter that converts strings to uppercase and does absolutely |
| nothing else:: |
| |
| def uppercase(str): |
| return str.upper() |
| |
| print(soup.prettify(formatter=uppercase)) |
| # <html> |
| #  <body> |
| #   <p> |
| #    IL A DIT <<SACRÉ BLEU!>> |
| #   </p> |
| #  </body> |
| # </html> |
| |
| print(link_soup.a.prettify(formatter=uppercase)) |
| # <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2"> |
| #  A LINK |
| # </a> |
| |
| If you're writing your own function, you should know about the |
| ``EntitySubstitution`` class in the ``bs4.dammit`` module. This class |
| implements Beautiful Soup's standard formatters as class methods: the |
| "html" formatter is ``EntitySubstitution.substitute_html``, and the |
| "minimal" formatter is ``EntitySubstitution.substitute_xml``. You can |
| use these functions to simulate ``formatter="html"`` or |
| ``formatter="minimal"``, but then do something extra. |
| |
| Here's an example that replaces Unicode characters with HTML entities |
| whenever possible, but `also` converts all strings to uppercase:: |
| |
| from bs4.dammit import EntitySubstitution |
| def uppercase_and_substitute_html_entities(str): |
| return EntitySubstitution.substitute_html(str.upper()) |
| |
| print(soup.prettify(formatter=uppercase_and_substitute_html_entities)) |
| # <html> |
| #  <body> |
| #   <p> |
| #    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt; |
| #   </p> |
| #  </body> |
| # </html> |
| |
| One last caveat: if you create a ``CData`` object, the text inside |
| that object is always presented `exactly as it appears, with no |
| formatting`. Beautiful Soup will call the formatter method, just in |
| case you've written a custom method that counts all the strings in the |
| document or something, but it will ignore the return value:: |
| |
| from bs4.element import CData |
| soup = BeautifulSoup("<a></a>") |
| soup.a.string = CData("one < three") |
| print(soup.a.prettify(formatter="xml")) |
| # <a> |
| #  <![CDATA[one < three]]> |
| # </a> |
| |
| |
| ``get_text()`` |
| -------------- |
| |
| If you only want the text part of a document or tag, you can use the |
| ``get_text()`` method. It returns all the text in a document or |
| beneath a tag, as a single Unicode string:: |
| |
| markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>' |
| soup = BeautifulSoup(markup) |
| |
| soup.get_text() |
| # u'\nI linked to example.com\n' |
| soup.i.get_text() |
| # u'example.com' |
| |
| You can specify a string to be used to join the bits of text |
| together:: |
| |
| soup.get_text("|") |
| # u'\nI linked to |example.com|\n' |
| |
| You can tell Beautiful Soup to strip whitespace from the beginning and |
| end of each bit of text:: |
| |
| soup.get_text("|", strip=True) |
| # u'I linked to|example.com' |
| |
| But at that point you might want to use the :ref:`.stripped_strings <string-generators>` |
| generator instead, and process the text yourself:: |
| |
| [text for text in soup.stripped_strings] |
| # [u'I linked to', u'example.com'] |
| |
| Specifying the parser to use |
| ============================ |
| |
| If you just need to parse some HTML, you can dump the markup into the |
| ``BeautifulSoup`` constructor, and it'll probably be fine. Beautiful |
| Soup will pick a parser for you and parse the data. But there are a |
| few additional arguments you can pass in to the constructor to change |
| which parser is used. |
| |
| The first argument to the ``BeautifulSoup`` constructor is a string or |
| an open filehandle--the markup you want parsed. The second argument is |
| `how` you'd like the markup parsed. |
| |
| If you don't specify anything, you'll get the best HTML parser that's |
| installed. Beautiful Soup ranks lxml's parser as being the best, then |
| html5lib's, then Python's built-in parser. You can override this by |
| specifying one of the following: |
| |
| * What type of markup you want to parse. Currently supported are |
| "html", "xml", and "html5". |
| |
| * The name of the parser library you want to use. Currently supported |
| options are "lxml", "html5lib", and "html.parser" (Python's |
| built-in HTML parser). |
| |
| The section `Installing a parser`_ contrasts the supported parsers. |
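Both options can be sketched like this (only the built-in parser is assumed to be installed, so the lxml-backed "xml" request is wrapped in a try block):

```python
from bs4 import BeautifulSoup

markup = "<a><b /></a>"

# Name a parser library explicitly; "html.parser" is always available
# because it ships with Python.
soup = BeautifulSoup(markup, "html.parser")

# Or ask for a markup type; "xml" only works if lxml is installed.
try:
    xml_soup = BeautifulSoup(markup, "xml")
except Exception:
    xml_soup = None  # lxml not installed
```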
| |
| If you don't have an appropriate parser installed, Beautiful Soup will |
| ignore your request and pick a different parser. Right now, the only |
| supported XML parser is lxml. If you don't have lxml installed, asking |
| for an XML parser won't give you one, and asking for "lxml" won't work |
| either. |
| |
| Differences between parsers |
| --------------------------- |
| |
| Beautiful Soup presents the same interface to a number of different |
| parsers, but each parser is different. Different parsers will create |
| different parse trees from the same document. The biggest differences |
| are between the HTML parsers and the XML parsers. Here's a short |
| document, parsed as HTML:: |
| |
| BeautifulSoup("<a><b /></a>") |
| # <html><head></head><body><a><b></b></a></body></html> |
| |
| Since an empty <b /> tag is not valid HTML, the parser turns it into a |
| <b></b> tag pair. |
| |
| Here's the same document parsed as XML (running this requires that you |
| have lxml installed). Note that the empty <b /> tag is left alone, and |
| that the document is given an XML declaration instead of being put |
| into an <html> tag:: |
| |
| BeautifulSoup("<a><b /></a>", "xml") |
| # <?xml version="1.0" encoding="utf-8"?> |
| # <a><b/></a> |
| |
| There are also differences between HTML parsers. If you give Beautiful |
| Soup a perfectly-formed HTML document, these differences won't |
| matter. One parser will be faster than another, but they'll all give |
| you a data structure that looks exactly like the original HTML |
| document. |
| |
| But if the document is not perfectly-formed, different parsers will |
| give different results. Here's a short, invalid document parsed using |
| lxml's HTML parser. Note that the dangling </p> tag is simply |
| ignored:: |
| |
| BeautifulSoup("<a></p>", "lxml") |
| # <html><body><a></a></body></html> |
| |
| Here's the same document parsed using html5lib:: |
| |
| BeautifulSoup("<a></p>", "html5lib") |
| # <html><head></head><body><a><p></p></a></body></html> |
| |
| Instead of ignoring the dangling </p> tag, html5lib pairs it with an |
| opening <p> tag. This parser also adds an empty <head> tag to the |
| document. |
| |
| Here's the same document parsed with Python's built-in HTML |
| parser:: |
| |
| BeautifulSoup("<a></p>", "html.parser") |
| # <a></a> |
| |
| Like lxml, this parser ignores the dangling </p> tag. Unlike |
| html5lib, it makes no attempt to create a well-formed HTML document |
| by adding a <body> tag. Unlike lxml, it doesn't even bother to add |
| an <html> tag. |
| |
| Since the document "<a></p>" is invalid, none of these techniques is |
| the "correct" way to handle it. The html5lib parser uses techniques |
| that are part of the HTML5 standard, so it has the best claim on being |
| the "correct" way, but all three techniques are legitimate. |
| |
| Differences between parsers can affect your script. If you're planning |
| on distributing your script to other people, or running it on multiple |
| machines, you should specify a parser in the ``BeautifulSoup`` |
| constructor. That will reduce the chances that your users parse a |
| document differently from the way you parse it. |
| |
| Encodings |
| ========= |
| |
| Any HTML or XML document is written in a specific encoding like ASCII |
| or UTF-8. But when you load that document into Beautiful Soup, you'll |
| discover it's been converted to Unicode:: |
| |
| markup = "<h1>Sacr\xc3\xa9 bleu!</h1>" |
| soup = BeautifulSoup(markup) |
| soup.h1 |
| # <h1>Sacré bleu!</h1> |
| soup.h1.string |
| # u'Sacr\xe9 bleu!' |
| |
| It's not magic. (That sure would be nice.) Beautiful Soup uses a |
| sub-library called `Unicode, Dammit`_ to detect a document's encoding |
| and convert it to Unicode. The autodetected encoding is available as |
| the ``.original_encoding`` attribute of the ``BeautifulSoup`` object:: |
| |
| soup.original_encoding |
| # 'utf-8' |
| |
| Unicode, Dammit guesses correctly most of the time, but sometimes it |
| makes mistakes. Sometimes it guesses correctly, but only after a |
| byte-by-byte search of the document that takes a very long time. If |
| you happen to know a document's encoding ahead of time, you can avoid |
| mistakes and delays by passing it to the ``BeautifulSoup`` constructor |
| as ``from_encoding``. |
| |
| Here's a document written in ISO-8859-8. The document is so short that |
| Unicode, Dammit can't get a good lock on it, and misidentifies it as |
| ISO-8859-7:: |
| |
| markup = b"<h1>\xed\xe5\xec\xf9</h1>" |
| soup = BeautifulSoup(markup) |
| soup.h1 |
| # <h1>νεμω</h1> |
| soup.original_encoding |
| # 'ISO-8859-7' |
| |
| We can fix this by passing in the correct ``from_encoding``:: |
| |
| soup = BeautifulSoup(markup, from_encoding="iso-8859-8") |
| soup.h1 |
| # <h1>םולש</h1> |
| soup.original_encoding |
| # 'iso8859-8' |
| |
| In rare cases (usually when a UTF-8 document contains text written in |
| a completely different encoding), the only way to get Unicode may be |
| to replace some characters with the special Unicode character |
| "REPLACEMENT CHARACTER" (U+FFFD, �). If Unicode, Dammit needs to do |
| this, it will set the ``.contains_replacement_characters`` attribute |
| to ``True`` on the ``UnicodeDammit`` or ``BeautifulSoup`` object. This |
| lets you know that the Unicode representation is not an exact |
| representation of the original--some data was lost. If a document |
| contains �, but ``.contains_replacement_characters`` is ``False``, |
| you'll know that the � was there originally (as it is in this |
| paragraph) and doesn't stand in for missing data. |
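Here's a minimal sketch of checking that attribute (the input is clean UTF-8, so nothing needs to be replaced and the flag stays ``False``):

```python
from bs4 import UnicodeDammit

# Every byte of this input is valid UTF-8.
dammit = UnicodeDammit(b"Sacr\xc3\xa9 bleu!", ["utf-8"])

# False: the whole document converted cleanly, so no U+FFFD
# replacement characters were substituted and no data was lost.
print(dammit.contains_replacement_characters)
```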
| |
| Output encoding |
| --------------- |
| |
| When you write out a document from Beautiful Soup, you get a UTF-8 |
| document, even if the document wasn't in UTF-8 to begin with. Here's a |
| document written in the Latin-1 encoding:: |
| |
| markup = b''' |
| <html> |
| <head> |
| <meta content="text/html; charset=ISO-Latin-1" http-equiv="Content-type" /> |
| </head> |
| <body> |
| <p>Sacr\xe9 bleu!</p> |
| </body> |
| </html> |
| ''' |
| |
| soup = BeautifulSoup(markup) |
| print(soup.prettify()) |
| # <html> |
| # <head> |
| # <meta content="text/html; charset=utf-8" http-equiv="Content-type" /> |
| # </head> |
| # <body> |
| # <p> |
| # Sacré bleu! |
| # </p> |
| # </body> |
| # </html> |
| |
| Note that the <meta> tag has been rewritten to reflect the fact that |
| the document is now in UTF-8. |
| |
| If you don't want UTF-8, you can pass an encoding into ``prettify()``:: |
| |
| print(soup.prettify("latin-1")) |
| # <html> |
| # <head> |
| # <meta content="text/html; charset=latin-1" http-equiv="Content-type" /> |
| # ... |
| |
| You can also call encode() on the ``BeautifulSoup`` object, or any |
| element in the soup, just as if it were a Python string:: |
| |
| soup.p.encode("latin-1") |
| # '<p>Sacr\xe9 bleu!</p>' |
| |
| soup.p.encode("utf-8") |
| # '<p>Sacr\xc3\xa9 bleu!</p>' |
| |
| Any characters that can't be represented in your chosen encoding will |
| be converted into numeric XML entity references. Here's a document |
| that includes the Unicode character SNOWMAN:: |
| |
| markup = u"<b>\N{SNOWMAN}</b>" |
| snowman_soup = BeautifulSoup(markup) |
| tag = snowman_soup.b |
| |
| The SNOWMAN character can be part of a UTF-8 document (it looks like |
| ☃), but there's no representation for that character in ISO-Latin-1 or |
| ASCII, so it's converted into "&#9731;" for those encodings:: |
| |
| print(tag.encode("utf-8")) |
| # <b>☃</b> |
| |
| print(tag.encode("latin-1")) |
| # <b>&#9731;</b> |
| |
| print(tag.encode("ascii")) |
| # <b>&#9731;</b> |
| |
| Unicode, Dammit |
| --------------- |
| |
| You can use Unicode, Dammit without using Beautiful Soup. It's useful |
| whenever you have data in an unknown encoding and you just want it to |
| become Unicode:: |
| |
| from bs4 import UnicodeDammit |
| dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!") |
| print(dammit.unicode_markup) |
| # Sacré bleu! |
| dammit.original_encoding |
| # 'utf-8' |
| |
| Unicode, Dammit's guesses will get a lot more accurate if you install |
| the ``chardet`` or ``cchardet`` Python libraries. The more data you |
| give Unicode, Dammit, the more accurately it will guess. If you have |
| your own suspicions as to what the encoding might be, you can pass |
| them in as a list:: |
| |
| dammit = UnicodeDammit("Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"]) |
| print(dammit.unicode_markup) |
| # Sacré bleu! |
| dammit.original_encoding |
| # 'latin-1' |
| |
| Unicode, Dammit has two special features that Beautiful Soup doesn't |
| use. |
| |
| Smart quotes |
| ^^^^^^^^^^^^ |
| |
| You can use Unicode, Dammit to convert Microsoft smart quotes to HTML or XML |
| entities:: |
| |
| markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>" |
| |
| UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="html").unicode_markup |
| # u'<p>I just &ldquo;love&rdquo; Microsoft Word&rsquo;s smart quotes</p>' |
| |
| UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="xml").unicode_markup |
| # u'<p>I just &#x201C;love&#x201D; Microsoft Word&#x2019;s smart quotes</p>' |
| |
| You can also convert Microsoft smart quotes to ASCII quotes:: |
| |
| UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii").unicode_markup |
| # u'<p>I just "love" Microsoft Word\'s smart quotes</p>' |
| |
| Hopefully you'll find this feature useful, but Beautiful Soup doesn't |
| use it. Beautiful Soup prefers the default behavior, which is to |
| convert Microsoft smart quotes to Unicode characters along with |
| everything else:: |
| |
| UnicodeDammit(markup, ["windows-1252"]).unicode_markup |
| # u'<p>I just \u201clove\u201d Microsoft Word\u2019s smart quotes</p>' |
| |
| Inconsistent encodings |
| ^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Sometimes a document is mostly in UTF-8, but contains Windows-1252 |
| characters such as (again) Microsoft smart quotes. This can happen |
| when a website includes data from multiple sources. You can use |
| ``UnicodeDammit.detwingle()`` to turn such a document into pure |
| UTF-8. Here's a simple example:: |
| |
| snowmen = (u"\N{SNOWMAN}" * 3) |
| quote = (u"\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}") |
| doc = snowmen.encode("utf8") + quote.encode("windows_1252") |
| |
| This document is a mess. The snowmen are in UTF-8 and the quotes are |
| in Windows-1252. You can display the snowmen or the quotes, but not |
| both:: |
| |
| print(doc) |
| # ☃☃☃�I like snowmen!� |
| |
| print(doc.decode("windows-1252")) |
| # â˜ƒâ˜ƒâ˜ƒ“I like snowmen!” |
| |
| Decoding the document as UTF-8 raises a ``UnicodeDecodeError``, and |
| decoding it as Windows-1252 gives you gibberish. Fortunately, |
| ``UnicodeDammit.detwingle()`` will convert the string to pure UTF-8, |
| allowing you to decode it to Unicode and display the snowmen and quote |
| marks simultaneously:: |
| |
| new_doc = UnicodeDammit.detwingle(doc) |
| print(new_doc.decode("utf8")) |
| # ☃☃☃“I like snowmen!” |
| |
| ``UnicodeDammit.detwingle()`` only knows how to handle Windows-1252 |
| embedded in UTF-8 (or vice versa, I suppose), but this is the most |
| common case. |
| |
| Note that you must know to call ``UnicodeDammit.detwingle()`` on your |
| data before passing it into ``BeautifulSoup`` or the ``UnicodeDammit`` |
| constructor. Beautiful Soup assumes that a document has a single |
| encoding, whatever it might be. If you pass it a document that |
| contains both UTF-8 and Windows-1252, it's likely to think the whole |
| document is Windows-1252, and the document will come out looking like |
| ``â˜ƒâ˜ƒâ˜ƒ“I like snowmen!”``. |
| |
| ``UnicodeDammit.detwingle()`` is new in Beautiful Soup 4.1.0. |
| |
| Parsing only part of a document |
| =============================== |
| |
| Let's say you want to use Beautiful Soup to look at a document's <a> |
| tags. It's a waste of time and memory to parse the entire document and |
| then go over it again looking for <a> tags. It would be much faster to |
| ignore everything that wasn't an <a> tag in the first place. The |
| ``SoupStrainer`` class allows you to choose which parts of an incoming |
| document are parsed. You just create a ``SoupStrainer`` and pass it in |
| to the ``BeautifulSoup`` constructor as the ``parse_only`` argument. |
| |
| (Note that *this feature won't work if you're using the html5lib parser*. |
| If you use html5lib, the whole document will be parsed, no |
| matter what. This is because html5lib constantly rearranges the parse |
| tree as it works, and if some part of the document didn't actually |
| make it into the parse tree, it'll crash. To avoid confusion, in the |
| examples below I'll be forcing Beautiful Soup to use Python's |
| built-in parser.) |
| |
| ``SoupStrainer`` |
| ---------------- |
| |
| The ``SoupStrainer`` class takes the same arguments as a typical |
| method from `Searching the tree`_: :ref:`name <name>`, :ref:`attrs |
| <attrs>`, :ref:`text <text>`, and :ref:`**kwargs <kwargs>`. Here are |
| three ``SoupStrainer`` objects:: |
| |
| from bs4 import SoupStrainer |
| |
| only_a_tags = SoupStrainer("a") |
| |
| only_tags_with_id_link2 = SoupStrainer(id="link2") |
| |
| def is_short_string(string): |
| return len(string) < 10 |
| |
| only_short_strings = SoupStrainer(text=is_short_string) |
| |
| I'm going to bring back the "three sisters" document one more time, |
| and we'll see what the document looks like when it's parsed with these |
| three ``SoupStrainer`` objects:: |
| |
| html_doc = """ |
| <html><head><title>The Dormouse's story</title></head> |
| |
| <p class="title"><b>The Dormouse's story</b></p> |
| |
| <p class="story">Once upon a time there were three little sisters; and their names were |
| <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, |
| <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and |
| <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; |
| and they lived at the bottom of a well.</p> |
| |
| <p class="story">...</p> |
| """ |
| |
| print(BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags).prettify()) |
| # <a class="sister" href="http://example.com/elsie" id="link1"> |
| #  Elsie |
| # </a> |
| # <a class="sister" href="http://example.com/lacie" id="link2"> |
| #  Lacie |
| # </a> |
| # <a class="sister" href="http://example.com/tillie" id="link3"> |
| #  Tillie |
| # </a> |
| |
| print(BeautifulSoup(html_doc, "html.parser", parse_only=only_tags_with_id_link2).prettify()) |
| # <a class="sister" href="http://example.com/lacie" id="link2"> |
| #  Lacie |
| # </a> |
| |
| print(BeautifulSoup(html_doc, "html.parser", parse_only=only_short_strings).prettify()) |
| # Elsie |
| # , |
| # Lacie |
| # and |
| # Tillie |
| # ... |
| # |
| |
| You can also pass a ``SoupStrainer`` into any of the methods covered |
| in `Searching the tree`_. This probably isn't terribly useful, but I |
| thought I'd mention it:: |
| |
| soup = BeautifulSoup(html_doc) |
| soup.find_all(only_short_strings) |
| # [u'\n\n', u'\n\n', u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', |
| # u'\n\n', u'...', u'\n'] |
| |
| Troubleshooting |
| =============== |
| |
| .. _diagnose: |
| |
| ``diagnose()`` |
| -------------- |
| |
| If you're having trouble understanding what Beautiful Soup does to a |
| document, pass the document into the ``diagnose()`` function. (New in |
| Beautiful Soup 4.2.0.) Beautiful Soup will print out a report showing |
| you how different parsers handle the document, and tell you if you're |
| missing a parser that Beautiful Soup could be using:: |
| |
| from bs4.diagnose import diagnose |
| data = open("bad.html").read() |
| diagnose(data) |
| |
| # Diagnostic running on Beautiful Soup 4.2.0 |
| # Python version 2.7.3 (default, Aug 1 2012, 05:16:07) |
| # I noticed that html5lib is not installed. Installing it may help. |
| # Found lxml version 2.3.2.0 |
| # |
| # Trying to parse your data with html.parser |
| # Here's what html.parser did with the document: |
| # ... |
| |
| Just looking at the output of ``diagnose()`` may show you how to solve |
| the problem. Even if not, you can paste the output of ``diagnose()`` |
| when asking for help. |
| |
| Errors when parsing a document |
| ------------------------------ |
| |
| There are two different kinds of parse errors. There are crashes, |
| where you feed a document to Beautiful Soup and it raises an |
| exception, usually an ``HTMLParser.HTMLParseError``. And there is |
| unexpected behavior, where a Beautiful Soup parse tree looks a lot |
| different than the document used to create it. |
| |
| Almost none of these problems turn out to be problems with Beautiful |
| Soup. This is not because Beautiful Soup is an amazingly well-written |
| piece of software. It's because Beautiful Soup doesn't include any |
| parsing code. Instead, it relies on external parsers. If one parser |
| isn't working on a certain document, the best solution is to try a |
| different parser. See `Installing a parser`_ for details and a parser |
| comparison. |
| |
| The most common parse errors are ``HTMLParser.HTMLParseError: |
| malformed start tag`` and ``HTMLParser.HTMLParseError: bad end |
| tag``. These are both generated by Python's built-in HTML parser |
| library, and the solution is to :ref:`install lxml or |
| html5lib. <parser-installation>` |
| |
| The most common type of unexpected behavior is that you can't find a |
| tag that you know is in the document. You saw it going in, but |
| ``find_all()`` returns ``[]`` or ``find()`` returns ``None``. This is |
| another common problem with Python's built-in HTML parser, which |
| sometimes skips tags it doesn't understand. Again, the solution is to |
| :ref:`install lxml or html5lib. <parser-installation>` |
| |
| Version mismatch problems |
| ------------------------- |
| |
| * ``SyntaxError: Invalid syntax`` (on the line ``ROOT_TAG_NAME = |
| u'[document]'``): Caused by running the Python 2 version of |
| Beautiful Soup under Python 3, without converting the code. |
| |
| * ``ImportError: No module named HTMLParser`` - Caused by running the |
| Python 2 version of Beautiful Soup under Python 3. |
| |
| * ``ImportError: No module named html.parser`` - Caused by running the |
| Python 3 version of Beautiful Soup under Python 2. |
| |
| * ``ImportError: No module named BeautifulSoup`` - Caused by running |
| Beautiful Soup 3 code on a system that doesn't have BS3 |
| installed. Or, by writing Beautiful Soup 4 code without knowing that |
| the package name has changed to ``bs4``. |
| |
| * ``ImportError: No module named bs4`` - Caused by running Beautiful |
| Soup 4 code on a system that doesn't have BS4 installed. |
| |
| .. _parsing-xml: |
| |
| Parsing XML |
| ----------- |
| |
| By default, Beautiful Soup parses documents as HTML. To parse a |
| document as XML, pass in "xml" as the second argument to the |
| ``BeautifulSoup`` constructor:: |
| |
| soup = BeautifulSoup(markup, "xml") |
| |
| You'll need to :ref:`have lxml installed <parser-installation>`. |
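| For example, a minimal sketch (it assumes lxml is installed, since
| Beautiful Soup's XML support requires it):

```python
from bs4 import BeautifulSoup

markup = "<document><title>Sample</title></document>"
soup = BeautifulSoup(markup, "xml")

# With XML rules, no implied <html> or <body> tags are added
# around the markup.
print(soup.title.string)  # Sample
```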
| |
| Other parser problems |
| --------------------- |
| |
| * If your script works on one computer but not another, it's probably |
| because the two computers have different parser libraries |
| available. For example, you may have developed the script on a |
| computer that has lxml installed, and then tried to run it on a |
| computer that only has html5lib installed. See `Differences between |
| parsers`_ for why this matters, and fix the problem by mentioning a |
| specific parser library in the ``BeautifulSoup`` constructor. |
| |
| * Because `HTML tags and attributes are case-insensitive |
| <http://www.w3.org/TR/html5/syntax.html#syntax>`_, all three HTML |
| parsers convert tag and attribute names to lowercase. That is, the |
| markup <TAG></TAG> is converted to <tag></tag>. If you want to |
| preserve mixed-case or uppercase tags and attributes, you'll need to |
| :ref:`parse the document as XML. <parsing-xml>` |
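| A quick sketch of the difference; the XML half assumes lxml is
| installed:

```python
from bs4 import BeautifulSoup

markup = "<TAG>text</TAG>"

# The HTML parsers lowercase the tag name...
html_soup = BeautifulSoup(markup, "html.parser")
print(html_soup.tag.name)  # tag

# ...but the XML parser preserves the original case.
xml_soup = BeautifulSoup(markup, "xml")
print(xml_soup.TAG.name)   # TAG
```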
| |
| .. _misc: |
| |
| Miscellaneous |
| ------------- |
| |
| * ``UnicodeEncodeError: 'charmap' codec can't encode character |
| u'\xfoo' in position bar`` (or just about any other |
| ``UnicodeEncodeError``) - This is not a problem with Beautiful Soup. |
| This problem shows up in two main situations. First, when you try to |
| print a Unicode character that your console doesn't know how to |
| display. (See `this page on the Python wiki |
| <http://wiki.python.org/moin/PrintFails>`_ for help.) Second, when |
| you're writing to a file and you pass in a Unicode character that's |
| not supported by your default encoding. In this case, the simplest |
| solution is to explicitly encode the Unicode string into UTF-8 with |
| ``u.encode("utf8")``. |
| |
| * ``KeyError: [attr]`` - Caused by accessing ``tag['attr']`` when the |
| tag in question doesn't define the ``attr`` attribute. The most |
| common errors are ``KeyError: 'href'`` and ``KeyError: |
| 'class'``. Use ``tag.get('attr')`` if you're not sure ``attr`` is |
| defined, just as you would with a Python dictionary. |
| |
| * ``AttributeError: 'ResultSet' object has no attribute 'foo'`` - This
| usually happens because you expected ``find_all()`` to return a
| single tag or string. But ``find_all()`` returns a *list* of tags
| and strings--a ``ResultSet`` object. You need to iterate over the
| list and look at the ``.foo`` of each one. Or, if you really only
| want one result, you need to use ``find()`` instead of
| ``find_all()``.
| |
| * ``AttributeError: 'NoneType' object has no attribute 'foo'`` - This
| usually happens because you called ``find()`` and then tried to
| access the ``.foo`` attribute of the result. But in your case,
| ``find()`` didn't find anything, so it returned ``None``, instead of
| returning a tag or a string. You need to figure out why your
| ``find()`` call isn't returning anything.
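| The last three errors can be sketched in a few lines; ``html.parser``
| is used here so the example doesn't assume any third-party parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://example.com/">link</a>', "html.parser")
tag = soup.a

# KeyError: use .get() when the attribute might be missing.
print(tag.get("class"))        # None, instead of raising KeyError

# ResultSet: find_all() returns a list, so iterate over it.
for a in soup.find_all("a"):
    print(a["href"])           # http://example.com/

# NoneType: find() returns None when nothing matches.
print(soup.find("table"))      # None
```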
| |
| Improving Performance |
| --------------------- |
| |
| Beautiful Soup will never be as fast as the parsers it sits on top |
| of. If response time is critical, if you're paying for computer time |
| by the hour, or if there's any other reason why computer time is more |
| valuable than programmer time, you should forget about Beautiful Soup |
| and work directly atop `lxml <http://lxml.de/>`_. |
| |
| That said, there are things you can do to speed up Beautiful Soup. If |
| you're not using lxml as the underlying parser, my advice is to |
| :ref:`start <parser-installation>`. Beautiful Soup parses documents |
| significantly faster using lxml than using html.parser or html5lib. |
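| For example, naming the parser explicitly (this assumes lxml is
| installed):

```python
from bs4 import BeautifulSoup

markup = "<p>Some <b>bold</b> text.</p>"

# Asking for "lxml" by name guarantees the fast lxml HTML parser
# is used, rather than whatever default this system falls back to.
soup = BeautifulSoup(markup, "lxml")
print(soup.b.string)  # bold
```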
| |
| You can speed up encoding detection significantly by installing the |
| `cchardet <http://pypi.python.org/pypi/cchardet/>`_ library. |
| |
| `Parsing only part of a document`_ won't save you much time parsing
| the document, but it can save a lot of memory, and it'll make
| searching the document much faster.
| |
| Beautiful Soup 3 |
| ================ |
| |
| Beautiful Soup 3 is the previous release series, and is no longer |
| being actively developed. It's currently packaged with all major Linux |
| distributions: |
| |
| :kbd:`$ apt-get install python-beautifulsoup` |
| |
| It's also published through PyPI as ``BeautifulSoup``:
| |
| :kbd:`$ easy_install BeautifulSoup` |
| |
| :kbd:`$ pip install BeautifulSoup` |
| |
| You can also `download a tarball of Beautiful Soup 3.2.0 |
| <http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.0.tar.gz>`_. |
| |
| If you ran ``easy_install beautifulsoup`` or ``easy_install |
| BeautifulSoup``, but your code doesn't work, you installed Beautiful |
| Soup 3 by mistake. You need to run ``easy_install beautifulsoup4``. |
| |
| `The documentation for Beautiful Soup 3 is archived online |
| <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html>`_. If |
| your first language is Chinese, it might be easier for you to read |
| `the Chinese translation of the Beautiful Soup 3 documentation |
| <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html>`_, |
| then read this document to find out about the changes made in |
| Beautiful Soup 4. |
| |
| Porting code to BS4 |
| ------------------- |
| |
| Most code written against Beautiful Soup 3 will work against Beautiful |
| Soup 4 with one simple change. All you should have to do is change the |
| package name from ``BeautifulSoup`` to ``bs4``. So this:: |
| |
| from BeautifulSoup import BeautifulSoup |
| |
| becomes this:: |
| |
| from bs4 import BeautifulSoup |
| |
| * If you get the ``ImportError`` "No module named BeautifulSoup", your |
| problem is that you're trying to run Beautiful Soup 3 code, but you |
| only have Beautiful Soup 4 installed. |
| |
| * If you get the ``ImportError`` "No module named bs4", your problem |
| is that you're trying to run Beautiful Soup 4 code, but you only |
| have Beautiful Soup 3 installed. |
| |
| Although BS4 is mostly backwards-compatible with BS3, most of its |
| methods have been deprecated and given new names for `PEP 8 compliance |
| <http://www.python.org/dev/peps/pep-0008/>`_. There are numerous other |
| renames and changes, and a few of them break backwards compatibility. |
| |
| Here's what you'll need to know to convert your BS3 code and habits to BS4: |
| |
| You need a parser |
| ^^^^^^^^^^^^^^^^^ |
| |
| Beautiful Soup 3 used Python's ``SGMLParser``, a module that was |
| deprecated and removed in Python 3.0. Beautiful Soup 4 uses |
| ``html.parser`` by default, but you can plug in lxml or html5lib and |
| use that instead. See `Installing a parser`_ for a comparison. |
| |
| Since ``html.parser`` is not the same parser as ``SGMLParser``, it |
| will treat invalid markup differently. Usually the "difference" is |
| that ``html.parser`` crashes. In that case, you'll need to install |
| another parser. But sometimes ``html.parser`` just creates a different |
| parse tree than ``SGMLParser`` would. If this happens, you may need to |
| update your BS3 scraping code to deal with the new tree. |
| |
| Method names |
| ^^^^^^^^^^^^ |
| |
| * ``renderContents`` -> ``encode_contents`` |
| * ``replaceWith`` -> ``replace_with`` |
| * ``replaceWithChildren`` -> ``unwrap`` |
| * ``findAll`` -> ``find_all`` |
| * ``findAllNext`` -> ``find_all_next`` |
| * ``findAllPrevious`` -> ``find_all_previous`` |
| * ``findNext`` -> ``find_next`` |
| * ``findNextSibling`` -> ``find_next_sibling`` |
| * ``findNextSiblings`` -> ``find_next_siblings`` |
| * ``findParent`` -> ``find_parent`` |
| * ``findParents`` -> ``find_parents`` |
| * ``findPrevious`` -> ``find_previous`` |
| * ``findPreviousSibling`` -> ``find_previous_sibling`` |
| * ``findPreviousSiblings`` -> ``find_previous_siblings`` |
| * ``nextSibling`` -> ``next_sibling`` |
| * ``previousSibling`` -> ``previous_sibling`` |
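| For example, the most common rename in practice; the old name still
| works in BS4, for compatibility, but the new one is preferred:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

# BS3 style, deprecated in BS4:
#   soup.findAll("p")
# BS4 style:
paragraphs = soup.find_all("p")
print([p.string for p in paragraphs])  # ['one', 'two']
```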
| |
| Some arguments to the Beautiful Soup constructor were renamed for the |
| same reasons: |
| |
| * ``BeautifulSoup(parseOnlyThese=...)`` -> ``BeautifulSoup(parse_only=...)`` |
| * ``BeautifulSoup(fromEncoding=...)`` -> ``BeautifulSoup(from_encoding=...)`` |
| |
| I renamed one method for compatibility with Python 3: |
| |
| * ``Tag.has_key()`` -> ``Tag.has_attr()`` |
| |
| I renamed one attribute to use more accurate terminology: |
| |
| * ``Tag.isSelfClosing`` -> ``Tag.is_empty_element`` |
| |
| I renamed three attributes to avoid using words that have special |
| meaning to Python. Unlike the others, these changes are *not backwards |
| compatible.* If you used these attributes in BS3, your code will break |
| on BS4 until you change them. |
| |
| * ``UnicodeDammit.unicode`` -> ``UnicodeDammit.unicode_markup`` |
| * ``Tag.next`` -> ``Tag.next_element`` |
| * ``Tag.previous`` -> ``Tag.previous_element`` |
| |
| Generators |
| ^^^^^^^^^^ |
| |
| I gave the generators PEP 8-compliant names, and transformed them into |
| properties: |
| |
| * ``childGenerator()`` -> ``children`` |
| * ``nextGenerator()`` -> ``next_elements`` |
| * ``nextSiblingGenerator()`` -> ``next_siblings`` |
| * ``previousGenerator()`` -> ``previous_elements`` |
| * ``previousSiblingGenerator()`` -> ``previous_siblings`` |
| * ``recursiveChildGenerator()`` -> ``descendants`` |
| * ``parentGenerator()`` -> ``parents`` |
| |
| So instead of this:: |
| |
| for parent in tag.parentGenerator(): |
| ... |
| |
| You can write this:: |
| |
| for parent in tag.parents: |
| ... |
| |
| (But the old code will still work.) |
| |
| Some of the generators used to yield ``None`` after they were done, and |
| then stop. That was a bug. Now the generators just stop. |
| |
| There are two new generators, :ref:`.strings and |
| .stripped_strings <string-generators>`. ``.strings`` yields |
| NavigableString objects, and ``.stripped_strings`` yields Python |
| strings that have had whitespace stripped. |
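| For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> One </p><p> Two </p>", "html.parser")

# .strings yields every NavigableString, whitespace included.
print(list(soup.strings))           # [' One ', ' Two ']

# .stripped_strings skips whitespace-only strings and strips the rest.
print(list(soup.stripped_strings))  # ['One', 'Two']
```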
| |
| XML |
| ^^^ |
| |
| There is no longer a ``BeautifulStoneSoup`` class for parsing XML. To |
| parse XML you pass in "xml" as the second argument to the |
| ``BeautifulSoup`` constructor. For the same reason, the |
| ``BeautifulSoup`` constructor no longer recognizes the ``isHTML`` |
| argument. |
| |
| Beautiful Soup's handling of empty-element XML tags has been |
| improved. Previously when you parsed XML you had to explicitly say |
| which tags were considered empty-element tags. The ``selfClosingTags`` |
| argument to the constructor is no longer recognized. Instead, |
| Beautiful Soup considers any empty tag to be an empty-element tag. If |
| you add a child to an empty-element tag, it stops being an |
| empty-element tag. |
| |
| Entities |
| ^^^^^^^^ |
| |
| An incoming HTML or XML entity is always converted into the |
| corresponding Unicode character. Beautiful Soup 3 had a number of |
| overlapping ways of dealing with entities, which have been |
| removed. The ``BeautifulSoup`` constructor no longer recognizes the |
| ``smartQuotesTo`` or ``convertEntities`` arguments. (`Unicode, |
| Dammit`_ still has ``smart_quotes_to``, but its default is now to turn |
| smart quotes into Unicode.) The constants ``HTML_ENTITIES``, |
| ``XML_ENTITIES``, and ``XHTML_ENTITIES`` have been removed, since they |
| configure a feature (transforming some but not all entities into |
| Unicode characters) that no longer exists. |
| |
| If you want to turn Unicode characters back into HTML entities on |
| output, rather than turning them into UTF-8 characters, you need to |
| use an :ref:`output formatter <output_formatters>`. |
| |
| Miscellaneous |
| ^^^^^^^^^^^^^ |
| |
| :ref:`Tag.string <.string>` now operates recursively. If tag A |
| contains a single tag B and nothing else, then A.string is the same as |
| B.string. (Previously, it was None.) |
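| For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text</b></a>", "html.parser")

# <a> contains only <b>, so a.string recurses down to b.string.
print(soup.a.string)  # text
print(soup.b.string)  # text
```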
| |
| `Multi-valued attributes`_ like ``class`` have lists of strings as |
| their values, not strings. This may affect the way you search by CSS |
| class. |
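| For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# In BS3 this was the single string "body strikeout";
# in BS4 it's a list of the individual classes.
print(soup.p["class"])  # ['body', 'strikeout']

# Searching by one of the classes still matches the tag.
print(len(soup.find_all("p", class_="strikeout")))  # 1
```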
| |
| If you pass one of the ``find*`` methods both :ref:`text <text>` `and` |
| a tag-specific argument like :ref:`name <name>`, Beautiful Soup will |
| search for tags that match your tag-specific criteria and whose |
| :ref:`Tag.string <.string>` matches your value for :ref:`text |
| <text>`. It will `not` find the strings themselves. Previously, |
| Beautiful Soup ignored the tag-specific arguments and looked for |
| strings. |
| |
| The ``BeautifulSoup`` constructor no longer recognizes the |
| `markupMassage` argument. It's now the parser's responsibility to |
| handle markup correctly. |
| |
| The rarely-used alternate parser classes like |
| ``ICantBelieveItsBeautifulSoup`` and ``BeautifulSOAP`` have been |
| removed. It's now the parser's decision how to handle ambiguous |
| markup. |
| |
| The ``prettify()`` method now returns a Unicode string, not a bytestring. |