Beginning Python - From Novice to Professional
CHAPTER 10 ■ BATTERIES INCLUDED

User-Agent: Microsoft-Outlook-Express-Macintosh-Edition/5.02.2022
Date: Wed, 19 Dec 2004 17:22:42 -0700
Subject: Re: Spam
From: Foo Fie
To: Magnus Lie Hetland
CC:
Message-ID:
In-Reply-To:
Mime-version: 1.0
Content-type: text/plain; charset="US-ASCII"
Content-transfer-encoding: 7bit
Status: RO
Content-Length: 55
Lines: 6

So long, and thanks for all the spam!

Yours,

Foo Fie

Let’s try to find out who this e-mail is from. If you examine the text, I’m sure you can figure it out in this case (especially if you look at the signature at the bottom of the message). But can you see a general pattern? How do you extract the name of the sender, without the e-mail address? Or, how can you list all the e-mail addresses mentioned in the headers?

Let’s handle the first task first. The line containing the sender begins with the string 'From: ' and ends with an e-mail address enclosed in angle brackets (< and >). You want the text found between the two. If you use the fileinput module, this ought to be an easy task. A program solving the problem is shown in Listing 10-10.

■Note You could solve this problem without using regular expressions if you wanted. You could also use the email module.

Listing 10-10. A Program for Finding the Sender of an E-mail

# find_sender.py
import fileinput, re
pat = re.compile('From: (.*?) <.*>$')
for line in fileinput.input():
    m = pat.match(line)
    if m: print m.group(1)

You can then run the program like this (assuming that the e-mail message is in the text file message.eml):
$ python find_sender.py message.eml
Foo Fie

You should note the following about this program:

• I compile the regular expression to make the processing more efficient.
• I enclose the subpattern I want to extract in parentheses, making it a group.
• I use a nongreedy pattern to match the name because I want to stop matching when I reach the first left angle bracket (or, rather, the space preceding it).
• I use a dollar sign to indicate that I want the pattern to match the entire line, all the way to the end.
• I use an if statement to make sure that I did in fact match something before I try to extract the match of a specific group.

To list all the e-mail addresses mentioned in the headers, you need to construct a regular expression that matches an e-mail address but nothing else. You can then use the method findall to find all the occurrences in each line. To avoid duplicates, you keep the addresses in a set (described earlier in this chapter). Finally, you sort the addresses and print them out:

import fileinput, re
pat = re.compile(r'[a-z\-\.]+@[a-z\-\.]+', re.IGNORECASE)
addresses = set()
for line in fileinput.input():
    for address in pat.findall(line):
        addresses.add(address)
for address in sorted(addresses):
    print address

The resulting output when running this program (with the preceding e-mail message as input) is as follows:

Mr.Gumby@bar.baz
foo@bar.baz
foo@baz.com
magnus@bozz.floop

Note that when sorting, uppercase letters come before lowercase letters.

■Note I haven’t adhered strictly to the problem specification here. The problem was to find the addresses in the header, but in this case the program finds all the addresses in the entire file. To avoid that, you can call fileinput.close() when you find an empty line: the header can’t contain empty lines, so at that point you are finished. Alternatively, you can use fileinput.nextfile() to start processing the next file, if there is more than one.
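The header-only idea from the note can be sketched roughly as follows. This is only an illustration, not the book's listing: it is written for Python 3 (so print is a function), the helper name header_addresses and the sample message are made up for the example, and it reads from a plain list of lines rather than from fileinput so that it can run on its own. With fileinput, the break would correspond to calling fileinput.close() or fileinput.nextfile().

```python
import re

# The same address pattern as in the program above.
ADDRESS = re.compile(r'[a-z\-\.]+@[a-z\-\.]+', re.IGNORECASE)

def header_addresses(lines):
    """Return the sorted, unique addresses found before the first empty line.

    header_addresses is a hypothetical helper name used for illustration.
    """
    addresses = set()
    for line in lines:
        if not line.strip():
            # An empty line ends the header; everything after it is the body.
            break
        addresses.update(ADDRESS.findall(line))
    return sorted(addresses)

# A made-up sample message (the addresses are assumptions, not the original headers).
sample = [
    "From: Foo Fie <foo@bar.baz>\n",
    "To: Magnus Lie Hetland <magnus@bozz.floop>\n",
    "\n",
    "Write to someone@else.example instead.\n",
]

print(header_addresses(sample))
```

The address in the body is never examined, because the loop stops at the empty line that separates the header from the message text.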
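An earlier note mentions that the email module could be used instead of regular expressions. A minimal sketch of that alternative, assuming Python 3 and a made-up sample message (the addresses shown are assumptions, not the original headers), might look like this. It relies on email.message_from_string to parse the text and on parseaddr and getaddresses from email.utils to split each header into a name and an address.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# A made-up message for illustration; in practice you would read it from a file.
raw = (
    "From: Foo Fie <foo@bar.baz>\n"
    "To: Magnus Lie Hetland <magnus@bozz.floop>\n"
    "Subject: Re: Spam\n"
    "\n"
    "So long, and thanks for all the spam!\n"
)

msg = message_from_string(raw)

# parseaddr splits a single header value into (display name, address).
name, address = parseaddr(msg["From"])

# getaddresses does the same for a list of header values.
recipients = getaddresses(msg.get_all("To", []))

print(name)        # Foo Fie
print(address)     # foo@bar.baz
print(recipients)  # [('Magnus Lie Hetland', 'magnus@bozz.floop')]
```

Because the parser only looks at the header fields you ask for, this approach never picks up addresses from the message body.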