16.01.2014 Views

Beginning Python - From Novice to Professional

CHAPTER 10 ■ BATTERIES INCLUDED

GREEDY AND NONGREEDY PATTERNS

The repetition operators are by default greedy; that means that they will match as much as possible. For example, let’s say I rewrote the emphasis program to use the following pattern:

>>> import re
>>> emphasis_pattern = r'\*(.+)\*'

This matches an asterisk, followed by one or more characters (the dot wildcard matches any character except a newline), and then another asterisk. Sounds perfect, doesn’t it? But it isn’t:

>>> re.sub(emphasis_pattern, r'\1', '*This* is *it*!')
'This* is *it!'

As you can see, the pattern matched everything from the first asterisk to the last, including the two asterisks between! This is what it means to be greedy: Take everything you can.
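
To see exactly what the greedy group captured, it can help to inspect the match object directly rather than the substituted string. The following is a minimal sketch; the variable names text and greedy are made up for illustration and are not part of the original program.

```python
import re

# Same sample string as in the interactive session above.
text = '*This* is *it*!'

# Greedy: .+ grabs as much as it can and then backs off only far enough
# to let the final \* match, so the group spans both emphasized phrases.
greedy = re.search(r'\*(.+)\*', text)

print(greedy.group(0))  # whole match: *This* is *it*
print(greedy.group(1))  # captured group: This* is *it
```

Because the single match runs from the first asterisk to the last, re.sub has only one group to substitute, which is why the two inner asterisks survive in the result.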

In this case, you clearly don’t want this overly greedy behavior. The solution presented in the preceding text (using a character set matching anything except an asterisk) is fine when you know that one specific character is illegal. But let’s consider another scenario: What if you used the form '**something**' to signify emphasis? Now it shouldn’t be a problem to include single asterisks inside the emphasized phrase. But how do you avoid being too greedy?

Actually, it’s quite easy; you just use a nongreedy version of the repetition operator. All the repetition operators can be made nongreedy by putting a question mark after them:

>>> emphasis_pattern = r'\*\*(.+?)\*\*'
>>> re.sub(emphasis_pattern, r'\1', '**This** is **it**!')
'This is it!'

Here I’ve used the operator +? instead of +, which means that the pattern will match one or more occurrences of the wildcard, as before. However, it will match as few as it can, because it is now nongreedy; it will match only the minimum needed to reach the next occurrence of '\*\*', which is the end of the pattern. As you can see, it works nicely.
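
As a rough side-by-side comparison, the sketch below runs the greedy and nongreedy patterns over a phrase that contains single asterisks inside the emphasis, and then shows the question mark applied to two other repetition operators. The sample string and the names greedy_groups and nongreedy_groups are assumptions chosen for illustration only.

```python
import re

# Hypothetical sample: single asterisks appear inside the first phrase.
text = '**This *really* matters** and **so does this**'

# Greedy: the group runs on to the last '**' in the string.
greedy_groups = re.findall(r'\*\*(.+)\*\*', text)
print(greedy_groups)     # ['This *really* matters** and **so does this']

# Nongreedy: each group stops at the nearest closing '**'.
nongreedy_groups = re.findall(r'\*\*(.+?)\*\*', text)
print(nongreedy_groups)  # ['This *really* matters', 'so does this']

# The same question mark works on the other repetition operators.
print(repr(re.match(r'a*', 'aaaa').group()))       # 'aaaa'
print(repr(re.match(r'a*?', 'aaaa').group()))      # ''
print(repr(re.match(r'a{2,4}', 'aaaa').group()))   # 'aaaa'
print(repr(re.match(r'a{2,4}?', 'aaaa').group()))  # 'aa'
```

Note that the nongreedy version keeps the single asterisks around "really" intact, which a character set excluding asterisks could not do.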

Examples<br />

Finding out who an e-mail is from. Have you ever saved an e-mail as a text file? If you have, you may have seen that it contains a lot of essentially unreadable text at the top, similar to that shown in Listing 10-9.

Listing 10-9. A Set of (Fictitious) E-mail Headers<br />

From foo@bar.baz Thu Dec 20 01:22:50 2004
Return-Path:
Received: from xyzzy42.bar.com (xyzzy.bar.baz [123.456.789.42])
        by frozz.bozz.floop (8.9.3/8.9.3) with ESMTP id BAA25436
        for ; Thu, 20 Dec 2004 01:22:50 +0100 (MET)
Received: from [43.253.124.23] by bar.baz
        (InterMail vM.4.01.03.27 201-229-121-127-20010626) with ESMTP
        id ;
        Thu, 20 Dec 2004 00:22:42 +0000
