Beginning Python - From Novice to Professional


16.01.2014 Views

CHAPTER 15 ■ PYTHON AND THE WEB

Web Services: Scraping Done Right

Web services are a bit like computer-friendly Web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program, the client or service requester, asking for some information or service, and the other program, the server or service provider, providing this information or service. Yeah, I know. Glaringly obvious stuff. And it also seems very similar to the network programming discussed in Chapter 14. There are differences, though . . .

Web services often work on a rather high level of abstraction. They use HTTP (the “Web protocol”) as the underlying protocol; on top of this, they use more content-oriented protocols, for example, using some XML format to encode requests and responses. This means that a Web server can be the platform for Web services. As the title of this section indicates, it’s Web scraping taken to another level; one could see the Web service as a dynamic Web page designed for a computerized client, rather than for human consumption.

There are standards for Web services that go really, really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I discuss two simple Web service protocols: I start with the simplest, RSS, which you could even argue is so simple that it isn’t really a Web service protocol at all. Then I show you how to use XML-RPC from the client side; the server side is dealt with in more detail in Chapter 27. There are several other standards you might want to check out. For example, SOAP is, in some sense, XML-RPC on steroids, and WSDL is a format for describing Web services formally. A good Web search engine will, as always, be your friend here.
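
To make the idea of "some XML format to encode requests and responses" a little more concrete, here is a minimal sketch that encodes an XML-RPC request and decodes it again without touching the network. It assumes a current Python 3, where the standard library module is called xmlrpc.client; the method name add and its arguments are hypothetical and exist only for illustration.

```python
import xmlrpc.client

# Hypothetical method name and arguments, chosen only for illustration.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")

# The request is plain XML text that a client would send in an HTTP POST body.
print(request_xml)

# Decoding the same text recovers the arguments and the method name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

A real client would normally not call dumps and loads directly; they are used here only to show what travels over HTTP.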
RSS

RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a Web log or the like. There are plenty of RSS readers out there, and because the RSS format is so easy to deal with, it’s easy to find new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed, and will then give you a dynamic bookmark submenu with the individual news items as menu items. Some people are even using RSS feeds to “broadcast” sound or video files (called podcasting).

One slightly confusing part of the RSS picture is that versions 0.9x and 2.0.x, now mainly called Really Simple Syndication (with 0.9x originally called Rich Site Summary), are sort of compatible with each other, but completely incompatible with RSS 1.0. There are also other formats for this sort of news feed and site syndication, such as the more recent Atom (see, for example, http://ietf.org/html.charters/atompub-charter.html). The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats; you may even have to parse HTML fragments found in the messages themselves. In this section, I’ll use a tiny subset of RSS 2.0. Listing 15-10 shows an example RSS file; for full specifications of recent RSS 2.0 versions, see http://blogs.law.harvard.edu/tech/rss. For a specification of RSS 1.0, see http://web.resource.org/rss/1.0.
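
If a client does have to cope with several feed formats, one possible first step is to look at the root element of the document. The following is only a sketch under some assumptions: it uses the strict standard library parser xml.etree.ElementTree (so it will fail on ill-formed feeds), the function name detect_feed_format is made up for this example, and it treats any RDF root element as RSS 1.0.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by RDF-based RSS and by Atom.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_format(xml_text):
    """Guess the feed format from the root element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        # RSS 0.9x and 2.0.x state their version in an attribute.
        return "RSS " + root.get("version", "unknown")
    if root.tag == "{%s}RDF" % RDF_NS:
        # Assumption: an RDF root element is taken to mean RSS 1.0.
        return "RSS 1.0"
    if root.tag == "{%s}feed" % ATOM_NS:
        return "Atom"
    return "unknown"


print(detect_feed_format('<rss version="2.0"><channel/></rss>'))
print(detect_feed_format('<feed xmlns="%s"/>' % ATOM_NS))
```

Knowing the format only tells you which element names to look for; the actual extraction of titles and links still has to be written for each format.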

Listing 15-10. A Simple RSS 2.0 File

    <?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Example Top Stories</title>
        <link>http://www.example.com</link>
        <description>
          Example News is a top notch provider of meaningless news items.
        </description>
        <item>
          <title>Interesting stuff</title>
          <description>
            Something really interesting happened today
          </description>
          <link>http://www.example.com/newsitem1.html</link>
        </item>
        <item>
          <title>More interesting stuff</title>
          <description>
            Then something even more interesting happened
          </description>
          <link>http://www.example.com/newsitem2.html</link>
        </item>
      </channel>
    </rss>

The RSS 2.0 standard specifies a few mandatory elements, and many optional ones. You can count on an RSS 2.0 channel element having a title, link, and description. A channel can contain (among other things) zero or more item elements, each of which, at the very least, has either a title or a description. If you’re writing a program to deal with a specific feed, a good idea might be to simply find out which elements it provides.

Another thing making the parsing a bit challenging is the sad fact that even though RSS is supposed to be valid XML, and therefore easy to parse, chances are you will come across ill-formed RSS feeds. If nothing else, the news messages themselves may contain such illegalities as unescaped ampersands (&) or the like. There aren’t really (at the time of writing) any obvious standard RSS modules for Python that will handle these difficulties, so you’re more or less back to screen scraping (for now, at least). Luckily, the handy Beautiful Soup parser can deal with XML as well as HTML, and it won’t complain about a bit of sloppiness on the part of the RSS feed.

To round off this little introduction to RSS, Listing 15-11 is an example program that will get the top stories from Wired News (http://wired.com). Note that it uses the class BeautifulStoneSoup, rather than BeautifulSoup, to parse the RSS feed; this class can deal with XML in general, while BeautifulSoup is targeted specifically at HTML. (In order to use the BeautifulStoneSoup class, you will, of course, need to download BeautifulSoup, as discussed earlier in this chapter.)
The program also demonstrates how you can use the wrap function from the standard Python module textwrap to make text fit nicely on the screen.
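
As a small offline illustration of the two ideas above (pulling titles and descriptions out of a feed, and wrapping text for the screen), here is a minimal sketch. It is not the Beautiful Soup program the text refers to: it assumes a well-formed feed and uses the strict standard library parser xml.etree.ElementTree, which would reject the sloppy feeds just described. The sample text follows the shape of Listing 15-10, and the function name summarize and the width of 40 are arbitrary choices made for this example.

```python
import textwrap
import xml.etree.ElementTree as ET

# A well-formed sample in the shape of Listing 15-10 (one item only).
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>
      Example News is a top notch provider of meaningless news items.
    </description>
    <item>
      <title>Interesting stuff</title>
      <description>
        Something really interesting happened today
      </description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
  </channel>
</rss>
"""


def summarize(rss_text, width=40):
    """Return the channel title and wrapped item texts as a list of lines."""
    channel = ET.fromstring(rss_text).find("channel")
    lines = [channel.findtext("title", default="").strip()]
    for item in channel.findall("item"):
        title = item.findtext("title", default="").strip()
        # Collapse the whitespace that the XML layout adds around the text.
        description = " ".join(item.findtext("description", default="").split())
        lines.append("* " + title)
        # textwrap.wrap returns a list of lines no longer than width.
        lines.extend(textwrap.wrap(description, width,
                                   initial_indent="  ",
                                   subsequent_indent="  "))
    return lines


for line in summarize(RSS_SAMPLE):
    print(line)
```

For real-world feeds, the strict parser is the weak point of this sketch; a forgiving parser such as the one the text recommends is the safer choice.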

