Beginning Python - From Novice to Professional
CHAPTER 15 ■ PYTHON AND THE WEB

Web Services: Scraping Done Right

Web services are a bit like computer-friendly Web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program, the client or service requester, asking for some information or service, and the other program, the server or service provider, providing this information or service. Yeah, I know. Glaringly obvious stuff. And it also seems very similar to the network programming discussed in Chapter 14. There are differences, though . . .

Web services often work on a rather high level of abstraction. They use HTTP (the “Web protocol”) as the underlying protocol; on top of this, they use more content-oriented protocols, for example, using some XML format to encode requests and responses. This means that a Web server can be the platform for Web services. As the title of this section indicates, it’s Web scraping taken to another level; one could see the Web service as a dynamic Web page designed for a computerized client, rather than for human consumption.

There are standards for Web services that go really, really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I discuss two simple Web service protocols: I start with the simplest, RSS, which you could even argue is so simple that it isn’t really a Web service protocol at all. Then I show you how to use XML-RPC from the client side; the server side is dealt with in more detail in Chapter 27. There are several other standards you might want to check out. For example, SOAP is, in some sense, XML-RPC on steroids, and WSDL is a format for describing Web services formally. A good Web search engine will, as always, be your friend here.
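To make “some XML format to encode requests and responses” a bit more concrete, here is a small sketch of how an XML-RPC call is encoded. It assumes Python 3, where the XML-RPC client module is named xmlrpc.client (older versions call it xmlrpclib). The method name add and its arguments are made up for the illustration, and nothing is sent over the network.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote procedure add(2, 3) as XML.
request = xmlrpc.client.dumps((2, 3), methodname="add")
print(request)

# Decoding the XML gives back the parameters and the method name.
params, method = xmlrpc.client.loads(request)
print(method, params)
```

In a real exchange, the XML in request would travel as the body of an HTTP POST, and the server would answer with a similar XML document holding the result.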
RSS

RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a Web log or the like.

There are plenty of RSS readers out there, and because the RSS format is so easy to deal with, it’s easy to find new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed, and will then give you a dynamic bookmark submenu with the individual news items as menu items. Some people are even using RSS feeds to “broadcast” sound or video files (called podcasting).

One slightly confusing part of the RSS picture is that versions 0.9x and 2.0.x, now mainly called Really Simple Syndication (with 0.9x originally called Rich Site Summary), are sort of compatible with each other, but completely incompatible with RSS 1.0. There are also other formats for this sort of news feeds and site syndication, such as the more recent Atom (see, for example, http://ietf.org/html.charters/atompub-charter.html). The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats; you may even have to parse HTML fragments found in the messages themselves.

In this section, I’ll use a tiny subset of RSS 2.0. Listing 15-10 shows an example RSS file; for full specifications of recent RSS 2.0 versions, see http://blogs.law.harvard.edu/tech/rss. For a specification of RSS 1.0, see http://web.resource.org/rss/1.0.
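A client that has to cope with several feed formats needs some way of telling them apart. The following is only a rough sketch of one possible approach: look at the root element of the document. The helper name detect_feed_format is hypothetical, and the check assumes well-formed XML and covers only the rss root element, the RDF namespace used by RDF-based RSS, and the Atom 1.0 namespace.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"

def detect_feed_format(xml_text):
    """Guess the feed family from the root element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # Feeds with an <rss> root state their version in an attribute.
        return "RSS " + root.get("version", "(unknown version)")
    if root.tag == "{%s}RDF" % RDF_NS:
        # RDF-based feeds, such as RSS 1.0, use an <rdf:RDF> root.
        return "RDF-based RSS"
    if root.tag == "{%s}feed" % ATOM_NS:
        return "Atom 1.0"
    return "unknown"

print(detect_feed_format('<rss version="2.0"><channel/></rss>'))
```

A real feed reader would then hand the document to format-specific code; this sketch only makes the first decision.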
Listing 15-10. A Simple RSS 2.0 File

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>
      Example News is a top notch provider of meaningless news items.
    </description>
    <item>
      <title>Interesting stuff</title>
      <description>Something really interesting happened today</description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
    <item>
      <title>More interesting stuff</title>
      <description>Then something even more interesting happened</description>
      <link>http://www.example.com/newsitem2.html</link>
    </item>
  </channel>
</rss>

The RSS 2.0 standard specifies a few mandatory elements, and many optional ones. You can count on an RSS 2.0 channel element having a title, link, and description. It can contain (among other things) zero or more item elements, each of which, at the very least, has either a title or a description. If you’re writing a program to deal with a specific feed, a good idea might be to simply find out which elements it provides.

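As a rough sketch of what relying on these elements can look like, the following reads a small, well-formed RSS 2.0 document with the standard library module xml.etree.ElementTree (assuming Python 3). This is a stand-in chosen for the illustration, not the Beautiful Soup approach used elsewhere in this chapter; the function name summarize is made up, and the feed text is embedded in the code so nothing is downloaded.

```python
import xml.etree.ElementTree as ET

FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>
      Example News is a top notch provider of meaningless news items.
    </description>
    <item>
      <title>Interesting stuff</title>
      <description>Something really interesting happened today</description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
    <item>
      <title>More interesting stuff</title>
      <description>Then something even more interesting happened</description>
      <link>http://www.example.com/newsitem2.html</link>
    </item>
  </channel>
</rss>
"""

def summarize(feed_text):
    """Return (channel_title, [(item_label, item_link), ...])."""
    channel = ET.fromstring(feed_text).find("channel")
    items = []
    for item in channel.findall("item"):
        # An item is only sure to have a title or a description,
        # so fall back from one to the other.
        label = item.findtext("title") or item.findtext("description") or ""
        items.append((label.strip(), item.findtext("link", default="")))
    return channel.findtext("title").strip(), items

title, items = summarize(FEED)
print(title)
for label, link in items:
    print("-", label, link)
```

Note that this strict parser will refuse anything that isn’t well-formed XML, which matters for real-world feeds.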
Another thing making the parsing a bit challenging is the sad fact that even though RSS is supposed to be valid XML, and therefore easy to parse, chances are you will come across ill-formed RSS feeds. If nothing else, the news messages themselves may contain such illegalities as unescaped ampersands (&) or the like.

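To see the problem in action, here is a small sketch (assuming Python 3 and the standard library only) in which a strict XML parser rejects a fragment containing an unescaped ampersand. The fallback shown, escaping bare ampersands with a regular expression and trying again, is just one possible workaround and handles only this single kind of sloppiness. The names BROKEN, BARE_AMP, and parse_leniently are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

BROKEN = "<item><title>Cats & dogs</title></item>"

# An ampersand that does not start an entity or character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def parse_leniently(xml_text):
    """Try a strict parse first; on failure, escape bare ampersands and retry."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return ET.fromstring(BARE_AMP.sub("&amp;", xml_text))

item = parse_leniently(BROKEN)
print(item.findtext("title"))
```

Other kinds of breakage (unclosed tags, HTML-only entities, and so on) would still make the second parse fail, which is why a more forgiving parser is attractive.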
There aren’t really (at the time of writing) any obvious standard RSS modules for Python that will handle these difficulties, so you’re more or less back to screen scraping (for now, at least). Luckily, the handy Beautiful Soup parser can deal with XML as well as HTML, and it won’t complain about a bit of sloppiness on the part of the RSS feed. To round off this little introduction to RSS, Listing 15-11 is an example program that will get the top stories from Wired News (http://wired.com). Note that it uses the class BeautifulStoneSoup, rather than BeautifulSoup, to parse the RSS feed; this class can deal with XML in general, while BeautifulSoup is targeted specifically at HTML. (In order to use the BeautifulStoneSoup class, you will, of course, need to download BeautifulSoup, as discussed earlier in this chapter.) The program also demonstrates how you can use the wrap function from the standard Python module textwrap to make text fit nicely on the screen.
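The wrap function can be tried out on its own, without any feed at all. The following sketch is not Listing 15-11; it only shows wrap in isolation, assuming Python 3. The helper name format_item, the width of 40 characters, and the two-space indent are arbitrary choices for the illustration.

```python
from textwrap import wrap

def format_item(title, description, width=40):
    """Return the title followed by the description, wrapped and indented."""
    lines = [title]
    # wrap returns a list of lines, none longer than width (indent included).
    lines.extend(wrap(description, width=width,
                      initial_indent="  ", subsequent_indent="  "))
    return lines

text = ("Example News is a top notch provider of meaningless news items, "
        "delivered regularly (or irregularly).")
lines = format_item("Example Top Stories", text)
print("\n".join(lines))
```

Because wrap returns a list rather than a single string, it is easy to add a prefix to each line or to count the lines before printing them.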