Event Based HTML Parsing with Python
18 Apr 2013
HTML parsing is a useful technique for crawlers. The three major techniques for parsing HTML are regular expressions, tree-model based parsing, and event based parsing. Regular expressions can be applied to almost any parsing task, but they are tricky to write and hard for other people to understand or maintain.
Tree-model based parsing is powerful and popular: many HTML/XML parsing libraries construct an in-memory tree to represent the structure of the parsed HTML. The first drawback of this kind of parsing is obvious. Constructing the tree requires memory for the entire HTML document, even if we only need a small part of it. The second drawback is the cost in CPU time. If a big HTML file is parsed and the in-memory model is huge, a lot of CPU time is spent building and traversing the model.
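For contrast, here is a minimal sketch of the tree-model approach. It assumes a small, well-formed snippet and uses the standard library's xml.etree.ElementTree, which is an XML parser rather than a dedicated HTML library; the markup and file names are made up for illustration. Note that the whole tree is built before we can ask for the one attribute we care about.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup; real-world HTML is often not valid XML.
DOC = "<ul><li><img src='a.jpg'/></li><li><img src='b.jpg'/></li></ul>"

# The entire document is parsed into an in-memory tree first...
root = ET.fromstring(DOC)

# ...and only then do we walk the tree to pick out the <img> sources.
srcs = [img.get("src") for img in root.iter("img")]
print(srcs)
```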
Event based parsing is simple but efficient. The parser defines certain events, and users implement event handlers to retrieve information during parsing. When the parsing engine encounters a start tag, an end tag, or text within tags, the corresponding event handler is executed. Inside the handler, context information such as the tag name, tag attributes, and tag text is accessible. However, information beyond the current tag, such as the parent tag or the child tags, is not accessible.
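The event flow described above can be sketched with a tiny handler class. This assumes Python 3, where the parser lives in html.parser (in Python 2 the module was named HTMLParser); the EventLogger class and the sample markup are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every parsing event in the order the parser emits it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
logger.feed('<p class="intro">Hello <b>world</b></p>')
logger.close()
for event in logger.events:
    print(event)
```

Each handler only sees the tag or text it was called for. Anything else, such as which element encloses the current one, has to be tracked by the handler itself.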
The advantages of event based parsing are straightforward: only the information for the current tag resides in memory, and there is no need to construct a model.
Python HTML Parser Performance provides detailed benchmarks of the parsing speed and memory usage of different Python HTML parsers. The results show clearly that HTMLParser (an event-based parsing library for Python) has the smallest memory usage and is also the second fastest parser.
I wrote a simple parser for Jiandan OOXX, a photo gallery contributed by users, to demonstrate how to parse HTML with an event based parser. If I want to download all the photos on one of the pages, I need to parse the HTML source and extract information about each photo, such as the image URL and its votes. By digging into the HTML structure, we find that each photo corresponds to a <li> element with id="comment-*". The image’s URL is the src attribute of one of the <img> elements inside this <li>, and the votes are the text of the <span> elements with id="cos_[un]support-*".
We only need to implement three event handlers to retrieve the photos’ information: handle_starttag, handle_endtag and handle_data. We also define two state variables: withinPostLi, which indicates whether the parser is currently inside a <li> that we are interested in, and fetchData, which indicates whether the tag text is a support vote or an unsupport vote.
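A minimal sketch of such a parser might look like the following. It is not the code from the gist: it assumes Python 3's html.parser, takes the first <img> inside each matching <li> as the photo (the real page may need a different choice), assumes the matching <li> elements are not nested, and runs on made-up sample markup with example.com URLs. The names PhotoParser, current and SAMPLE are hypothetical.

```python
from html.parser import HTMLParser


class PhotoParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.photos = []
        self.withinPostLi = False  # inside a <li id="comment-*">?
        self.fetchData = None      # "support", "unsupport", or None
        self.current = None        # photo record being filled in

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        elem_id = attrs.get("id") or ""
        if tag == "li" and elem_id.startswith("comment-"):
            self.withinPostLi = True
            self.current = {"url": None, "support": None, "unsupport": None}
        elif self.withinPostLi:
            if tag == "img" and self.current["url"] is None:
                # Assumption: the first <img> in the <li> is the photo.
                self.current["url"] = attrs.get("src")
            elif tag == "span" and elem_id.startswith("cos_support-"):
                self.fetchData = "support"
            elif tag == "span" and elem_id.startswith("cos_unsupport-"):
                self.fetchData = "unsupport"

    def handle_data(self, data):
        text = data.strip()
        if self.withinPostLi and self.fetchData and text.isdigit():
            self.current[self.fetchData] = int(text)

    def handle_endtag(self, tag):
        if tag == "span":
            self.fetchData = None  # stop collecting vote text
        elif tag == "li" and self.withinPostLi:
            self.photos.append(self.current)
            self.withinPostLi = False
            self.current = None


# Hypothetical markup that mimics the structure described above.
SAMPLE = """
<ol>
  <li id="comment-101">
    <img src="http://example.com/a.jpg">
    <span id="cos_support-101">12</span>
    <span id="cos_unsupport-101">3</span>
  </li>
  <li id="other">ignored</li>
  <li id="comment-102">
    <img src="http://example.com/b.jpg">
    <span id="cos_support-102">7</span>
    <span id="cos_unsupport-102">0</span>
  </li>
</ol>
"""

parser = PhotoParser()
parser.feed(SAMPLE)
parser.close()
for photo in parser.photos:
    print(photo)
```

The two state variables are what make up for the missing context: because a handler cannot look at its parent, withinPostLi remembers that a matching <li> was opened, and fetchData remembers which <span> the next piece of text belongs to.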
The parser takes the HTML as input and outputs the photos’ information. The full parser example can be found at this gist. An example run is: