Beginning Python - From Novice to Professional

16.01.2014 Views
CHAPTER 15 ■ PYTHON AND THE WEB 335

Web Services: Scraping Done Right

Web services are a bit like computer-friendly Web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program, the client or service requester, asking for some information or service, and the other program, the server or service provider, providing this information or service. Yeah, I know. Glaringly obvious stuff. And it also seems very similar to the network programming discussed in Chapter 14. There are differences, though...

Web services often work on a rather high level of abstraction. They use HTTP (the “Web protocol”) as the underlying protocol; on top of this, they use more content-oriented protocols, for example, using some XML format to encode requests and responses. This means that a Web server can be the platform for Web services. As the title of this section indicates, it’s Web scraping taken to another level; one could see the Web service as a dynamic Web page designed for a computerized client, rather than for human consumption.

There are standards for Web services that go really, really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I discuss two simple Web service protocols: I start with the simplest, RSS, which you could even argue is so simple that it isn’t really a Web service protocol at all. Then I show you how to use XML-RPC from the client side; the server side is dealt with in more detail in Chapter 27.

There are several other standards you might want to check out. For example, SOAP is, in some sense, XML-RPC on steroids, and WSDL is a format for describing Web services formally. A good Web search engine will, as always, be your friend here.
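To make the idea of "some XML format to encode requests and responses" concrete, here is a minimal sketch that encodes and decodes an XML-RPC call without touching the network. It assumes Python 3, where the client module is called xmlrpc.client (it is xmlrpclib in Python 2); the remote procedure name add and its arguments are hypothetical, chosen only for illustration.

```python
import xmlrpc.client  # called xmlrpclib in Python 2

# Encode a call to a hypothetical remote procedure add(2, 3) as XML.
# This is the document a client would send in the body of an HTTP POST.
request_xml = xmlrpc.client.dumps((2, 3), methodname="add")
print(request_xml)

# Decode it again, as a server would: loads returns (params, methodname).
params, method = xmlrpc.client.loads(request_xml)

# Encode the reply the server would send back (a response carries exactly
# one value), then decode it the way the client would.
response_xml = xmlrpc.client.dumps((sum(params),), methodresponse=True)
(result,), _ = xmlrpc.client.loads(response_xml)

print(method, params, "->", result)
```

In ordinary use you would not call dumps and loads yourself; a ServerProxy object does the encoding, the HTTP request, and the decoding behind a plain method call.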
RSS

RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a Web log or the like.

There are plenty of RSS readers out there, and because the RSS format is so easy to deal with, it’s easy to find new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed, and will then give you a dynamic bookmark submenu with the individual news items as menu items. Some people are even using RSS feeds to “broadcast” sound or video files (called podcasting).

One slightly confusing part of the RSS picture is that versions 0.9x and 2.0.x, now mainly called Really Simple Syndication (with 0.9x originally called Rich Site Summary), are sort of compatible with each other, but completely incompatible with RSS 1.0. There are also other formats for this sort of news feed and site syndication, such as the more recent Atom (see, for example, http://ietf.org/html.charters/atompub-charter.html). The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats; you may even have to parse HTML fragments found in the messages themselves.

In this section, I’ll use a tiny subset of RSS 2.0. Listing 15-10 shows an example RSS file; for full specifications of recent RSS 2.0 versions, see http://blogs.law.harvard.edu/tech/rss. For a specification of RSS 1.0, see http://web.resource.org/rss/1.0.
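A client that handles feeds from several sites first has to work out which family of format it has been handed. One possible approach, sketched here under the assumption that the feed is well-formed XML, is to look at the root element: rss for the 0.91 and later 0.9x versions and 2.0, rdf:RDF for the RDF-based versions such as 1.0, and feed for Atom. The function name feed_format and the tiny sample documents are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by RDF-based RSS and by Atom 1.0.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ATOM_NS = "http://www.w3.org/2005/Atom"


def feed_format(xml_text):
    """Guess the feed family from the root element of a well-formed document."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "RSS " + root.get("version", "(unknown version)")
    if root.tag == "{%s}RDF" % RDF_NS:
        return "RDF-based RSS"
    if root.tag == "{%s}feed" % ATOM_NS:
        return "Atom"
    return "unknown"


# Hypothetical, nearly empty documents: just enough to show the root elements.
samples = [
    '<rss version="2.0"><channel/></rss>',
    '<rdf:RDF xmlns:rdf="%s"/>' % RDF_NS,
    '<feed xmlns="%s"/>' % ATOM_NS,
]

for sample in samples:
    print(feed_format(sample))
```

This only picks a parsing strategy; each family still needs its own code for pulling out titles, links, and descriptions.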

Listing 15-10. A Simple RSS 2.0 File

    <?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Example Top Stories</title>
        <link>http://www.example.com</link>
        <description>
          Example News is a top notch provider of meaningless news items.
        </description>
        <item>
          <title>Interesting stuff</title>
          <description>
            Something really interesting happened today
          </description>
          <link>http://www.example.com/newsitem1.html</link>
        </item>
        <item>
          <title>More interesting stuff</title>
          <description>
            Then something even more interesting happened
          </description>
          <link>http://www.example.com/newsitem2.html</link>
        </item>
      </channel>
    </rss>

The RSS 2.0 standard specifies a few mandatory elements, and many optional ones. You can count on an RSS 2.0 channel element having a title, link, and description. It can contain (among other things) zero or more item elements, which, at the very least, have either a title or a description. If you’re writing a program to deal with a specific feed, a good idea might be to simply find out which elements it provides.

Another thing making the parsing a bit challenging is the sad fact that even though RSS is supposed to be valid XML, and therefore easy to parse, chances are you will come across ill-formed RSS feeds. If nothing else, the news messages themselves may contain such illegalities as unescaped ampersands (&) or the like. There aren’t really (at the time of writing) any obvious standard RSS modules for Python that will handle these difficulties, so you’re more or less back to screen scraping (for now, at least). Luckily, the handy Beautiful Soup parser can deal with XML as well as HTML, and it won’t complain about a bit of sloppiness on the part of the RSS feed.

To round off this little introduction to RSS, Listing 15-11 is an example program that will get the top stories from Wired News (http://wired.com). Note that it uses the class BeautifulStoneSoup, rather than BeautifulSoup, to parse the RSS feed; this class can deal with XML in general, while BeautifulSoup is targeted specifically at HTML. (In order to use the BeautifulStoneSoup class, you will, of course, need to download BeautifulSoup, as discussed earlier in this chapter.)
The program also demonstrates how you can use the wrap function from the standard Python module textwrap to make text fit nicely on the screen.
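As a small offline illustration of the two ideas above (pulling items out of an RSS 2.0 channel and wrapping text for the screen), here is a sketch that is not the book's Listing 15-11. It swaps Beautiful Soup for the strict standard-library parser xml.etree.ElementTree, assumes Python 3, and parses an inline string instead of fetching a live feed. The feed text reuses the values from Listing 15-10 inside the channel/item layout described earlier; the helper names read_items and format_item are made up for this example.

```python
import textwrap
import xml.etree.ElementTree as ET

# Inline sample feed, so the example needs no network access.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>Example News is a top notch provider of meaningless news items.</description>
    <item>
      <title>Interesting stuff</title>
      <description>Something really interesting happened today</description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
    <item>
      <title>More interesting stuff</title>
      <description>Then something even more interesting happened</description>
      <link>http://www.example.com/newsitem2.html</link>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return (title, description, link) tuples; missing parts become ''."""
    channel = ET.fromstring(xml_text).find("channel")
    items = []
    for item in channel.findall("item"):
        fields = ("title", "description", "link")
        items.append(tuple((item.findtext(f) or "").strip() for f in fields))
    return items


def format_item(title, description, link, width=40):
    """Lay out one news item, wrapping the description to the given width."""
    lines = [title]
    lines += textwrap.wrap(description, width=width,
                           initial_indent="  ", subsequent_indent="  ")
    lines.append("  " + link)  # links are left unwrapped
    return "\n".join(lines)


items = read_items(FEED)
for item in items:
    print(format_item(*item))
    print()

# An unescaped ampersand makes a document ill-formed, and the strict parser
# refuses it; this is the kind of feed where a lenient parser earns its keep.
try:
    ET.fromstring("<rss><channel><title>News & views</title></channel></rss>")
except ET.ParseError as err:
    print("Rejected by the XML parser:", err)
```

The strict parser is fine for feeds you know to be well-formed. For feeds in the wild, the tolerant approach described above is the safer choice.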

