Beginning Python - From Novice to Professional
CHAPTER 15 ■ PYTHON AND THE WEB

■Note The package containing mxTidy will, in fact, give you more tools than just Tidy, but that shouldn’t be a problem.

Using Command-Line Tidy in Python

You don’t have to install either of the libraries, though. If you’re running a UNIX or Linux machine of some sort, it’s quite possible that you have the command-line version of Tidy available. And no matter what OS you’re using, you can probably get an executable binary from the TidyLib Web site (http://tidy.sf.net). Once you’ve got that, you can use the subprocess module (or some of the popen functions) to run the Tidy program.

Assume, for example, that you have a messy HTML file called messy.html; then the following program will run Tidy on it and print the result. Instead of printing the result, you would, most likely, extract some useful information from it, as demonstrated in the following sections.

from subprocess import Popen, PIPE

text = open('messy.html').read()
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)
tidy.stdin.write(text)
tidy.stdin.close()
print tidy.stdout.read()

But Why XHTML?

The main difference between XHTML and older forms of HTML (at least for our current purposes) is that XHTML is quite strict about closing all elements explicitly. So where in HTML you might end one paragraph simply by beginning another (with a <p> tag), in XHTML you first have to close the paragraph explicitly (with a </p> tag). This makes XHTML much easier to parse, because you can tell directly when you enter or leave the various elements. Another advantage of XHTML (which I won’t really capitalize on in this chapter) is that it is an XML dialect, so you can use all kinds of nifty XML tools on it, such as XPath. For example, the links to the firms extracted by the program in Listing 15-1 could also be extracted by the XPath expression //h4/a/@href. (For more about XML, see Chapter 22; for more about the uses of XPath, see, for example, http://www.w3schools.com/xpath.)
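The XPath idea can be tried out with the standard library’s xml.etree.ElementTree module, which supports only a limited XPath subset: the element part of //h4/a works directly, while the @href attribute step has to be done in Python. A minimal sketch in Python 3 (the XHTML fragment and the example.com URLs are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed XHTML fragment standing in for Tidy's output;
# the URLs are made up for the example.
xhtml = """
<html><body>
<h4><a href="http://example.com/job1">First firm</a></h4>
<h4><a href="http://example.com/job2">Second firm</a></h4>
</body></html>
"""

tree = ET.fromstring(xhtml)
# ElementTree's XPath subset handles .//h4/a; the @href step is plain Python:
links = [a.get('href') for a in tree.findall('.//h4/a')]
print(links)  # ['http://example.com/job1', 'http://example.com/job2']
```

For full XPath support, including attribute selection like @href, a library such as lxml would be needed; the standard module is enough for simple element paths like this one.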
A very simple way of parsing the kind of well-behaved XHTML we get from Tidy is using the standard library module (and class) HTMLParser (not to be confused with the class HTMLParser from the htmllib module, which you can also use, of course, if you’re so inclined; it’s more liberal in accepting ill-formed input).
Using HTMLParser

Using HTMLParser simply means subclassing it, and overriding various event-handling methods such as handle_starttag or handle_data. Table 15-1 summarizes the relevant methods, and when they’re called (automatically) by the parser.

Table 15-1. The HTMLParser Callback Methods

Callback Method                  When Is It Called?
handle_starttag(tag, attrs)      When a start tag is found. attrs is a sequence of (name, value) pairs.
handle_startendtag(tag, attrs)   For empty tags; the default handles start and end separately.
handle_endtag(tag)               When an end tag is found.
handle_data(data)                For textual data.
handle_charref(ref)              For character references of the form &#ref;.
handle_entityref(name)           For entity references of the form &name;.
handle_comment(data)             For comments; only called with the comment contents.
handle_decl(decl)                For declarations of the form <!...>.
handle_pi(data)                  For processing instructions.

For screen scraping purposes, you usually won’t need to implement all the parser callbacks (the event handlers), and you probably won’t have to construct some abstract representation of the entire document (such as a document tree) to find what you need. If you just keep track of the minimum of information needed to find what you’re looking for, you’re in business. (See Chapter 22 for more about this topic, in the context of XML parsing with SAX.) Listing 15-2 shows a program that solves the same problem as Listing 15-1, but this time using HTMLParser.

Listing 15-2. A Screen Scraping Program Using the HTMLParser Module

from urllib import urlopen
from HTMLParser import HTMLParser

class Scraper(HTMLParser):

    in_h4 = False
    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h4':
            self.in_h4 = True
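The listing is cut off at the page break here. As a self-contained sketch of the same subclassing pattern, here is a version written for Python 3, where the book’s HTMLParser module lives under the name html.parser; the HTML snippet fed to it and the example.com URLs are invented stand-ins for the live jobs page the book scrapes:

```python
from html.parser import HTMLParser  # Python 3 home of the HTMLParser class

class LinkScraper(HTMLParser):
    """Collects the href of every link that appears inside an h4 element."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'h4':
            self.in_h4 = True
        elif tag == 'a' and self.in_h4 and 'href' in attrs:
            self.links.append(attrs['href'])

    def handle_endtag(self, tag):
        if tag == 'h4':
            self.in_h4 = False

scraper = LinkScraper()
scraper.feed('<h4><a href="http://example.com/a">Inside h4</a></h4>'
             '<p><a href="http://example.com/b">Outside h4</a></p>')
scraper.close()
print(scraper.links)  # only the link inside the h4: ['http://example.com/a']
```

Note how little state the scraper keeps: a single boolean tracking whether the parser is currently inside an h4 element, which is exactly the “minimum of information” strategy the text describes.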
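Two of the callbacks from Table 15-1, handle_entityref and handle_charref, are easy to see in action. A small sketch, again assuming Python 3’s html.parser (the convert_charrefs=False flag is a Python 3 addition needed to receive the separate reference events; the sample string is invented):

```python
from html.entities import name2codepoint
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text, decoding entity and character references by hand."""

    def __init__(self):
        # convert_charrefs=False so the parser reports references separately
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Called for references like &amp; -- name is 'amp'
        self.parts.append(chr(name2codepoint[name]))

    def handle_charref(self, ref):
        # Called for references like &#163; or &#xA3; -- ref is '163' or 'xA3'
        number = int(ref[1:], 16) if ref.startswith(('x', 'X')) else int(ref)
        self.parts.append(chr(number))

extractor = TextExtractor()
extractor.feed('Fish &amp; chips: &#163;5')
extractor.close()
print(''.join(extractor.parts))  # Fish & chips: £5
```

With the default convert_charrefs=True, the parser would instead decode both kinds of references itself and deliver everything through handle_data, which is usually what you want when you only care about the text.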