Beginning Python - From Novice to Professional
Beginning Python - From Novice to Professional Beginning Python - From Novice to Professional
CHAPTER 22 ■ PROJECT 3: XML FOR ALL OCCASIONS 425 WHAT ABOUT DOM? There are two common ways of dealing with XML in Python (and other programming languages, for that matter): SAX and DOM (the Document Object Model). A SAX parser reads through the XML file and tells you what it sees (text, tags, attributes), storing only small parts of the document at a time; this makes SAX simple, fast, and memory-efficient, which is why I have chosen to use it in this chapter. DOM takes another approach: It constructs a data structure (the document tree), which represents the entire document. This is slower and requires more memory, but can be useful if you want to manipulate the structure of your document, for example. For information about using DOM in Python, check out the Python Library Reference (http:// www.python.org/doc/lib/module-xml.dom.html). In addition to the standard DOM handling, the standard library contains two other modules, xml.dom.minidom (a simplified DOM) and xml.dom.pulldom (a cross between SAX and DOM, which reduces memory requirements). A very fast and simple XML parser (which doesn’t really use DOM, but which creates a complete document tree from your XML document) is PyRXP (http://www.reportlab.org/pyrxp.html). And then there is ElementTree (available from http://effbot.org/zone), which is flexible and easy to use. Creating a Simple Content Handler There are several event types available when parsing with SAX, but let’s restrict ourselves to three: the beginning of an element (the occurrence of an opening tag), the end of an element (the occurrence of a closing tag), and plain text (characters). To parse the XML file, let’s use the parse function from the xml.sax module. This function takes care of reading the file and generating the events—but as it generates these events, it needs some event handlers to call. These event handlers will be implemented as methods of a content handler object. You’ll subclass the ContentHandler class from xml.sax.handler because it implements all the necessary event handlers (as dummy operations that have no effect), and you can override only the ones you need. Let’s begin with a minimal XML parser (assuming that your XML file is called website.xml): from xml.sax.handler import ContentHandler from xml.sax import parse class TestHandler(ContentHandler): pass parse('website.xml', TestHandler()) If you execute this program, seemingly nothing happens, but you shouldn’t get any error messages either. Behind the scenes, the XML file is parsed, and the default event handlers are called—but because they don’t do anything, you won’t see any output. Let’s try a simple extension. Add the following method to the TestHandler class: def startElement(self, name, attrs): print name, attrs.keys()
426 CHAPTER 22 ■ PROJECT 3: XML FOR ALL OCCASIONS This overrides the default startElement event handler. The parameters are the relevant tag name and its attributes (kept in a dictionary-like object). If you run the program again (using website.xml from Listing 22-1), you see the following output: website [] page [u'name', u'title'] h1 [] p [] ul [] li [] a [u'href'] li [] a [u'href'] li [] a [u'href'] directory [u'name'] page [u'name', u'title'] h1 [] p [] page [u'name', u'title'] h1 [] p [] page [u'name', u'title'] h1 [] p [] How this works should be pretty clear. In addition to startElement, you’ll use endElement (which takes only a tag name as its argument) and characters (which takes a string as its argument). The following is an example that uses all these three methods to build a list of the headlines (the h1 elements) of the Web site file: from xml.sax.handler import ContentHandler from xml.sax import parse class HeadlineHandler(ContentHandler): in_headline = False def __init__(self, headlines): ContentHandler.__init__(self) self.headlines = headlines self.data = []
- Page 406 and 407: CHAPTER 18 ■ PACKAGING YOUR PROGR
- Page 408 and 409: CHAPTER 18 ■ PACKAGING YOUR PROGR
- Page 410 and 411: CHAPTER 18 ■ PACKAGING YOUR PROGR
- Page 412 and 413: CHAPTER 19 ■ ■ ■ Playful Prog
- Page 414 and 415: CHAPTER 19 ■ PLAYFUL PROGRAMMING
- Page 416 and 417: CHAPTER 19 ■ PLAYFUL PROGRAMMING
- Page 418 and 419: CHAPTER 19 ■ PLAYFUL PROGRAMMING
- Page 420: CHAPTER 19 ■ PLAYFUL PROGRAMMING
- Page 423 and 424: 392 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 425 and 426: 394 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 427 and 428: 396 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 429 and 430: 398 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 431 and 432: 400 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 433 and 434: 402 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 435 and 436: 404 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 437 and 438: 406 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 439 and 440: 408 CHAPTER 20 ■ PROJECT 1: INSTA
- Page 442 and 443: CHAPTER 21 ■ ■ ■ Project 2: P
- Page 444 and 445: CHAPTER 21 ■ PROJECT 2: PAINTING
- Page 446 and 447: CHAPTER 21 ■ PROJECT 2: PAINTING
- Page 448 and 449: CHAPTER 21 ■ PROJECT 2: PAINTING
- Page 450 and 451: CHAPTER 21 ■ PROJECT 2: PAINTING
- Page 452 and 453: CHAPTER 22 ■ ■ ■ Project 3: X
- Page 454 and 455: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 458 and 459: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 460 and 461: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 462 and 463: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 464 and 465: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 466 and 467: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 468: CHAPTER 22 ■ PROJECT 3: XML FOR A
- Page 471 and 472: 440 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 473 and 474: 442 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 475 and 476: 444 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 477 and 478: 446 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 479 and 480: 448 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 481 and 482: 450 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 483 and 484: 452 CHAPTER 23 ■ PROJECT 4: IN TH
- Page 486 and 487: CHAPTER 24 ■ ■ ■ Project 5: A
- Page 488 and 489: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 490 and 491: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 492 and 493: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 494 and 495: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 496 and 497: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 498 and 499: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 500 and 501: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 502 and 503: CHAPTER 24 ■ PROJECT 5: A VIRTUAL
- Page 504 and 505: CHAPTER 25 ■ ■ ■ Project 6: R
CHAPTER 22 ■ PROJECT 3: XML FOR ALL OCCASIONS 425<br />
WHAT ABOUT DOM?<br />
There are two common ways of dealing with XML in <strong>Python</strong> (and other programming languages, for that matter):<br />
SAX and DOM (the Document Object Model). A SAX parser reads through the XML file and tells you what it sees<br />
(text, tags, attributes), s<strong>to</strong>ring only small parts of the document at a time; this makes SAX simple, fast, and<br />
memory-efficient, which is why I have chosen <strong>to</strong> use it in this chapter. DOM takes another approach: It constructs<br />
a data structure (the document tree), which represents the entire document. This is slower and requires more<br />
memory, but can be useful if you want <strong>to</strong> manipulate the structure of your document, for example.<br />
For information about using DOM in <strong>Python</strong>, check out the <strong>Python</strong> Library Reference (http://<br />
www.python.org/doc/lib/module-xml.dom.html). In addition <strong>to</strong> the standard DOM handling, the standard<br />
library contains two other modules, xml.dom.minidom (a simplified DOM) and xml.dom.pulldom (a cross<br />
between SAX and DOM, which reduces memory requirements).<br />
A very fast and simple XML parser (which doesn’t really use DOM, but which creates a complete document<br />
tree from your XML document) is PyRXP (http://www.reportlab.org/pyrxp.html). And then there is<br />
ElementTree (available from http://effbot.org/zone), which is flexible and easy <strong>to</strong> use.<br />
Creating a Simple Content Handler<br />
There are several event types available when parsing with SAX, but let’s restrict ourselves <strong>to</strong><br />
three: the beginning of an element (the occurrence of an opening tag), the end of an element<br />
(the occurrence of a closing tag), and plain text (characters). To parse the XML file, let’s use the<br />
parse function from the xml.sax module. This function takes care of reading the file and generating<br />
the events—but as it generates these events, it needs some event handlers <strong>to</strong> call. These<br />
event handlers will be implemented as methods of a content handler object. You’ll subclass the<br />
ContentHandler class from xml.sax.handler because it implements all the necessary event handlers<br />
(as dummy operations that have no effect), and you can override only the ones you need.<br />
Let’s begin with a minimal XML parser (assuming that your XML file is called website.xml):<br />
from xml.sax.handler import ContentHandler<br />
from xml.sax import parse<br />
class TestHandler(ContentHandler): pass<br />
parse('website.xml', TestHandler())<br />
If you execute this program, seemingly nothing happens, but you shouldn’t get any error<br />
messages either. Behind the scenes, the XML file is parsed, and the default event handlers are<br />
called—but because they don’t do anything, you won’t see any output.<br />
Let’s try a simple extension. Add the following method <strong>to</strong> the TestHandler class:<br />
def startElement(self, name, attrs):<br />
print name, attrs.keys()