Beginning Python - From Novice to Professional
Beginning Python - From Novice to Professional Beginning Python - From Novice to Professional
CHAPTER 15 ■ PYTHON AND THE WEB 315 meaningless HTML as best they can. If this happens to look acceptable to the page author, he or she may be satisfied. This does make the job of screen scraping quite a bit harder, though. The general approach for parsing HTML in the standard library is event-based; you write event handlers that are called as the parser moves along the data. The standard library modules sgmllib and htmllib will let you parse really sloppy HTML in this manner, but if you want to extract data based on document structure (such as the first item after the second level-two heading), you’ll have to do some heavy guessing if there are missing tags, for example. You are certainly welcome to do this, if you like, but there is another way: Tidy. What’s Tidy? Tidy (http://tidy.sf.net) is a tool for fixing ill-formed and sloppy HTML. It can fix a range of common errors in a rather intelligent manner, doing lots of work that you’d probably rather not do yourself. It’s also quite configurable, letting you turn various corrections on or off. Here is an example of an HTML file filled with errors: some of them just Old Skool HTML, and some of them plain wrong (can you spot all the problems?): Pet Shop Complaints There is no way at all we can accept returned parrots. Dead Pets Our pets may tend to rest at times, but rarely die within the warranty period. News We have just received a really nice parrot. It's really nice. The Norwegian Blue Plumage and pining behavior More information Features: Beautiful plumage Here is the version that is fixed by Tidy:
316 CHAPTER 15 ■ PYTHON AND THE WEB Pet Shop Complaints There is no way at all we can accept returned parrots. Dead Pets Our pets may tend to rest at times, but rarely die within the warranty period. News We have just received a really nice parrot. It's really nice. The Norwegian Blue Plumage and pining behavior More information Features: Beautiful plumage Of course, Tidy can’t fix all problems with an HTML file, but it does make sure it’s wellformed (that is, all elements nest properly), which makes it much easier for you to parse it. Getting a Tidy Library You can get Tidy and the library version of Tidy, Tidylib, from http://tidy.sf.net. There are wrappers for this library for several languages; however, at the time of writing, there is none for Python at this Web site. You can get Python wrappers for Tidy from other places, though; there is μTidyLib at http://utidylib.berlios.de and mxTidy at http://egenix.com/files/python/ mxTidy.html. At the time of writing, μTidyLib seems to be the most up to date of the two, but mxTidy is a bit easier to install. In Windows, simply download the installer for mxTidy, run it, and you have the module mx.Tidy at your fingertips. There are also RPM packages available. If you want to install the source package (presumably in a UNIX or Linux environment), you can simply run the distutils script, using python setup.py install.
- Page 295 and 296: 264 CHAPTER 11 ■ FILES AND STUFF
- Page 297 and 298: 266 CHAPTER 11 ■ FILES AND STUFF
- Page 299 and 300: 268 CHAPTER 11 ■ FILES AND STUFF
- Page 301 and 302: 270 CHAPTER 12 ■ GRAPHICAL USER I
- Page 303 and 304: 272 CHAPTER 12 ■ GRAPHICAL USER I
- Page 305 and 306: 274 CHAPTER 12 ■ GRAPHICAL USER I
- Page 307 and 308: 276 CHAPTER 12 ■ GRAPHICAL USER I
- Page 309 and 310: 278 CHAPTER 12 ■ GRAPHICAL USER I
- Page 311 and 312: 280 CHAPTER 12 ■ GRAPHICAL USER I
- Page 313 and 314: 282 CHAPTER 12 ■ GRAPHICAL USER I
- Page 316 and 317: CHAPTER 13 ■ ■ ■ Database Sup
- Page 318 and 319: CHAPTER 13 ■ DATABASE SUPPORT 287
- Page 320 and 321: CHAPTER 13 ■ DATABASE SUPPORT 289
- Page 322 and 323: CHAPTER 13 ■ DATABASE SUPPORT 291
- Page 324 and 325: CHAPTER 13 ■ DATABASE SUPPORT 293
- Page 326: CHAPTER 13 ■ DATABASE SUPPORT 295
- Page 329 and 330: 298 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 331 and 332: 300 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 333 and 334: 302 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 335 and 336: 304 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 337 and 338: 306 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 339 and 340: 308 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 341 and 342: 310 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 343 and 344: 312 CHAPTER 14 ■ NETWORK PROGRAMM
- Page 345: 314 CHAPTER 15 ■ PYTHON AND THE W
- Page 349 and 350: 318 CHAPTER 15 ■ PYTHON AND THE W
- Page 351 and 352: 320 CHAPTER 15 ■ PYTHON AND THE W
- Page 353 and 354: 322 CHAPTER 15 ■ PYTHON AND THE W
- Page 355 and 356: 324 CHAPTER 15 ■ PYTHON AND THE W
- Page 357 and 358: 326 CHAPTER 15 ■ PYTHON AND THE W
- Page 359 and 360: 328 CHAPTER 15 ■ PYTHON AND THE W
- Page 361 and 362: 330 CHAPTER 15 ■ PYTHON AND THE W
- Page 363 and 364: 332 CHAPTER 15 ■ PYTHON AND THE W
- Page 365 and 366: 334 CHAPTER 15 ■ PYTHON AND THE W
- Page 367 and 368: 336 CHAPTER 15 ■ PYTHON AND THE W
- Page 369 and 370: 338 CHAPTER 15 ■ PYTHON AND THE W
- Page 372 and 373: CHAPTER 16 ■ ■ ■ Testing, 1-2
- Page 374 and 375: CHAPTER 16 ■ TESTING, 1-2-3 343 W
- Page 376 and 377: CHAPTER 16 ■ TESTING, 1-2-3 345 d
- Page 378 and 379: CHAPTER 16 ■ TESTING, 1-2-3 347 u
- Page 380 and 381: CHAPTER 16 ■ TESTING, 1-2-3 349 F
- Page 382 and 383: CHAPTER 16 ■ TESTING, 1-2-3 351 P
- Page 384 and 385: CHAPTER 16 ■ TESTING, 1-2-3 353 "
- Page 386: CHAPTER 16 ■ TESTING, 1-2-3 355 N
- Page 389 and 390: 358 CHAPTER 17 ■ EXTENDING PYTHON
- Page 391 and 392: 360 CHAPTER 17 ■ EXTENDING PYTHON
- Page 393 and 394: 362 CHAPTER 17 ■ EXTENDING PYTHON
- Page 395 and 396: 364 CHAPTER 17 ■ EXTENDING PYTHON
CHAPTER 15 ■ PYTHON AND THE WEB 315<br />
meaningless HTML as best they can. If this happens <strong>to</strong> look acceptable <strong>to</strong> the page author, he<br />
or she may be satisfied. This does make the job of screen scraping quite a bit harder, though.<br />
The general approach for parsing HTML in the standard library is event-based; you write<br />
event handlers that are called as the parser moves along the data. The standard library modules<br />
sgmllib and htmllib will let you parse really sloppy HTML in this manner, but if you want <strong>to</strong><br />
extract data based on document structure (such as the first item after the second level-two<br />
heading), you’ll have <strong>to</strong> do some heavy guessing if there are missing tags, for example. You are<br />
certainly welcome <strong>to</strong> do this, if you like, but there is another way: Tidy.<br />
What’s Tidy?<br />
Tidy (http://tidy.sf.net) is a <strong>to</strong>ol for fixing ill-formed and sloppy HTML. It can fix a range of<br />
common errors in a rather intelligent manner, doing lots of work that you’d probably rather<br />
not do yourself. It’s also quite configurable, letting you turn various corrections on or off.<br />
Here is an example of an HTML file filled with errors: some of them just Old Skool HTML,<br />
and some of them plain wrong (can you spot all the problems?):<br />
Pet Shop<br />
Complaints<br />
There is no way at all we can accept returned<br />
parrots.<br />
Dead Pets<br />
Our pets may tend <strong>to</strong> rest at times, but rarely die within the<br />
warranty period.<br />
News<br />
We have just received a really nice parrot.<br />
It's really nice.<br />
The Norwegian Blue<br />
Plumage and pining behavior<br />
More information<br />
Features:<br />
<br />
Beautiful plumage<br />
Here is the version that is fixed by Tidy: