Beginning Python - From Novice to Professional
CHAPTER 20

Project 1: Instant Markup

In this project, you see how to use Python’s excellent text processing capabilities, including regular expressions, to change a plain text file into one marked up in a language such as HTML or XML. You need such skills if you want to use text written by people who don’t know these languages in a system that requires the contents to be marked up.

Let’s start by implementing a simple prototype that does the basic processing, and then extend that program to make the markup system more flexible.

ALPHABET SOUP

Don’t speak fluent XML? Don’t worry about that: if you have only a passing acquaintance with HTML, you’ll do fine in this chapter. If you need an introduction to HTML, I suggest you take a look at Dave Raggett’s excellent guide “Getting Started with HTML” at the World Wide Web Consortium’s Web site (http://www.w3.org/MarkUp/Guide). For more information about XML, see Chapter 22.

What’s the Problem?

You want to add some formatting to a plain text file. Let’s say you’ve been handed the file by someone who can’t be bothered with writing in HTML, and you need to use the document as a Web page. Instead of adding all the necessary tags manually, you want your program to do it automatically.

Your task is basically to classify various text elements, such as headlines and emphasized text, and then clearly mark them. In the specific problem addressed here, you add HTML markup to the text, so the resulting document can be displayed in a Web browser and used as a Web page. However, once you have built your basic engine, there is no reason why you can’t add other kinds of markup (such as various forms of XML or perhaps LaTeX codes). After analyzing a text file, you can even perform other tasks, such as extracting all the headlines to make a table of contents.

WHAT IS LaTeX?

You don’t have to worry about it while working on this project, but LaTeX is another markup system (based on the TeX typesetting program) for creating various types of technical documents. I mention it here only as an example of what other uses your program may be put to. If you want to know more, you can visit the TeX Users Group home page at http://www.tug.org.

The text you’re given may contain some clues (such as emphasized text being marked *like this*), but you’ll probably need some ingenuity in making your program guess how the document is structured.

Specific Goals

Before starting to write your prototype, let’s define some goals:

• The input shouldn’t be required to contain artificial codes or tags.
• You should be able to deal both with different blocks, such as headings, paragraphs, and list items, and with inline text, such as emphasized text or URLs.
• Although this implementation deals with HTML, it should be easy to extend it to other markup languages.

You may not be able to reach these goals fully in the first version of your program, but that’s what a prototype is for: you write it to find flaws in your original ideas and to learn more about how to write a program that solves your problem.

Useful Tools

Consider what tools might be needed in writing this program. You certainly need to read from and write to files (see Chapter 11), or at least read from standard input (sys.stdin) and output with print. You probably need to iterate over the lines of the input (see Chapter 11); you need a few string methods (see Chapter 3); perhaps you’ll use a generator or two (see Chapter 9); and you probably need the re module (see Chapter 10). If any of these concepts seem unfamiliar to you, you should perhaps take a moment to refresh your memory.

Preparations

Before you start coding, you need some way of assessing your progress; you need a test suite. In this project, a single test may suffice: a test document (in plain text). Listing 20-1 contains a sample text that you want to mark up automatically.
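
To get a feel for how the tools mentioned under Useful Tools might fit together, here is a minimal sketch (not the prototype developed in this chapter). It assumes that blocks are separated by blank lines and that emphasis is marked *like this*. The names blocks, emphasize, and markup, as well as the sample lines, are made up for illustration only.

```python
import re


def blocks(lines):
    """Generator: collect runs of non-blank lines into single-string blocks."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line.strip())
        elif block:
            # A blank line ends the current block.
            yield " ".join(block)
            block = []
    if block:
        # Emit the last block if the input doesn't end with a blank line.
        yield " ".join(block)


def emphasize(text):
    """Replace *starred* spans with <em> elements (non-greedy match)."""
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)


def markup(lines):
    """Treat every block as a paragraph; a real program would classify blocks."""
    for block in blocks(lines):
        yield "<p>" + emphasize(block) + "</p>"


# Hypothetical sample input; any iterable of lines works, including sys.stdin.
sample = [
    "A short title\n",
    "\n",
    "This paragraph has *emphasized* text\n",
    "spread over two lines.\n",
]

for paragraph in markup(sample):
    print(paragraph)
```

Because markup accepts any iterable of lines, you could pass it sys.stdin or an open file instead of the sample list. The sketch doesn’t escape characters such as < and &, and the simple pattern will misfire on text containing stray asterisks, so treat it as a starting point only.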