Beginning Python - From Novice to Professional
Beginning Python - From Novice to Professional Beginning Python - From Novice to Professional
CHAPTER 10 ■ BATTERIES INCLUDED 237 SPECIAL CHARACTERS IN CHARACTER SETS In general, special characters such as dots, asterisks, and question marks have to be escaped with a backslash if you want them to appear as literal characters in the pattern, rather than function as regexp operators. Inside character sets, escaping these characters is generally not necessary (although perfectly legal). You should, however, keep in mind the following rules: You do have to escape the caret (^) if it appears at the beginning of the character set unless you want it to function as a negation operator. (In other words, don’t place it at the beginning unless you mean it.) Similarly, the right bracket (]) and the dash (-) must be put either at the beginning of the character set or escaped with a backslash. (Actually, the dash may also be put at the end, if you wish.) Alternatives and Subpatterns Character sets are nice when you let each letter vary independently, but what if you want to match only the strings 'python' and 'perl'? You can’t specify such a specific pattern with character sets or wildcards. Instead, you use the special character for alternatives: the “pipe” character (|). So, your pattern would be 'python|perl'. However, sometimes you don’t want to use the choice operator on the entire pattern—just a part of it. To do that, you enclose the part, or subpattern, in parentheses. The previous example could be rewritten as 'p(ython|erl)'. (Note that the term subpattern can also be used about a single character.) Optional and Repeated Subpatterns By adding a question mark after a subpattern, you make it optional. It may appear in the matched string, but it isn’t strictly required. So, for example, the (slightly unreadable) pattern r'(http://)?(www\.)?python\.org' would match all of the following strings (and nothing else): 'http://www.python.org' 'http://python.org' 'www.python.org' 'python.org' A few things are worth noting here: • I’ve escaped the dots, to prevent them from functioning as wildcards. • I’ve used a raw string to reduce the number of backslashes needed. • Each optional subpattern is enclosed in parentheses. • The optional subpatterns may appear or not, independently of each other.
238 CHAPTER 10 ■ BATTERIES INCLUDED The question mark means that the subpattern can appear once or not at all. There are a few other operators that allow you to repeat a subpattern more than once: (pattern)* (pattern)+ (pattern){m,n} pattern is repeated zero or more times pattern is repeated one or more times pattern is repeated from m to n times So, for example, r'w*\.python\.org' matches 'www.python.org', but also '.python.org', 'ww.python.org', and 'wwwwwww.python.org'. Similarly, r'w+\.python\.org' matches 'w.python.org' but not '.python.org', and r'w{3,4}\.python\.org' matches only 'www.python.org' and 'wwww.python.org'. ■Note The term match is used loosely here to mean that the pattern matches the entire string. The match function, described in the text that follows, requires only that the pattern matches the beginning of the string. The Beginning and End of a String Until now, you’ve only been looking at a pattern matching an entire string, but you can also try to find a substring that matches the patterns, such as the substring 'www' of the string 'www.python.org' matching the pattern 'w+'. When you’re searching for substrings like this, it can sometimes be useful to anchor this substring either at the beginning or the end of the full string. For example, you might want to match 'ht+p' at the beginning of a string, but not anywhere else. Then you use a caret ('^') to mark the beginning: '^ht+p' would match 'http://python.org' (and 'htttttp://python.org', for that matter) but not 'www.http.org'. Similarly, the end of a string may be indicated by the dollar sign ('$'). ■Note For a complete listing of regexp operators, see the standard library reference, in the section “Regular Expression Syntax” (http://python.org/doc/lib/re-syntax.html). Contents of the re Module Knowing how to write regular expressions isn’t much good if you can’t use them for anything. The re module contains several useful functions for working with regular expressions. Some of the most important ones are described in Table 10-8.
- Page 218 and 219: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 220 and 221: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 222 and 223: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 224 and 225: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 226 and 227: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 228 and 229: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 230 and 231: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 232 and 233: CHAPTER 9 ■ MAGIC METHODS, PROPER
- Page 234 and 235: CHAPTER 10 ■ ■ ■ Batteries In
- Page 236 and 237: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 238 and 239: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 240 and 241: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 242 and 243: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 244 and 245: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 246 and 247: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 248 and 249: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 250 and 251: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 252 and 253: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 254 and 255: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 256 and 257: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 258 and 259: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 260 and 261: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 262 and 263: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 264 and 265: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 266 and 267: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 270 and 271: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 272 and 273: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 274 and 275: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 276 and 277: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 278 and 279: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 280 and 281: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 282 and 283: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 284: CHAPTER 10 ■ BATTERIES INCLUDED 2
- Page 287 and 288: 256 CHAPTER 11 ■ FILES AND STUFF
- Page 289 and 290: 258 CHAPTER 11 ■ FILES AND STUFF
- Page 291 and 292: 260 CHAPTER 11 ■ FILES AND STUFF
- Page 293 and 294: 262 CHAPTER 11 ■ FILES AND STUFF
- Page 295 and 296: 264 CHAPTER 11 ■ FILES AND STUFF
- Page 297 and 298: 266 CHAPTER 11 ■ FILES AND STUFF
- Page 299 and 300: 268 CHAPTER 11 ■ FILES AND STUFF
- Page 301 and 302: 270 CHAPTER 12 ■ GRAPHICAL USER I
- Page 303 and 304: 272 CHAPTER 12 ■ GRAPHICAL USER I
- Page 305 and 306: 274 CHAPTER 12 ■ GRAPHICAL USER I
- Page 307 and 308: 276 CHAPTER 12 ■ GRAPHICAL USER I
- Page 309 and 310: 278 CHAPTER 12 ■ GRAPHICAL USER I
- Page 311 and 312: 280 CHAPTER 12 ■ GRAPHICAL USER I
- Page 313 and 314: 282 CHAPTER 12 ■ GRAPHICAL USER I
- Page 316 and 317: CHAPTER 13 ■ ■ ■ Database Sup
238 CHAPTER 10 ■ BATTERIES INCLUDED<br />
The question mark means that the subpattern can appear once or not at all. There are a<br />
few other opera<strong>to</strong>rs that allow you <strong>to</strong> repeat a subpattern more than once:<br />
(pattern)*<br />
(pattern)+<br />
(pattern){m,n}<br />
pattern is repeated zero or more times<br />
pattern is repeated one or more times<br />
pattern is repeated from m <strong>to</strong> n times<br />
So, for example, r'w*\.python\.org' matches 'www.python.org', but also '.python.org',<br />
'ww.python.org', and 'wwwwwww.python.org'. Similarly, r'w+\.python\.org' matches<br />
'w.python.org' but not '.python.org', and r'w{3,4}\.python\.org' matches only<br />
'www.python.org' and 'wwww.python.org'.<br />
■Note The term match is used loosely here <strong>to</strong> mean that the pattern matches the entire string. The match<br />
function, described in the text that follows, requires only that the pattern matches the beginning of the string.<br />
The <strong>Beginning</strong> and End of a String<br />
Until now, you’ve only been looking at a pattern matching an entire string, but you can also<br />
try <strong>to</strong> find a substring that matches the patterns, such as the substring 'www' of the string<br />
'www.python.org' matching the pattern 'w+'. When you’re searching for substrings like this,<br />
it can sometimes be useful <strong>to</strong> anchor this substring either at the beginning or the end of the<br />
full string. For example, you might want <strong>to</strong> match 'ht+p' at the beginning of a string, but<br />
not anywhere else. Then you use a caret ('^') <strong>to</strong> mark the beginning: '^ht+p' would match<br />
'http://python.org' (and 'htttttp://python.org', for that matter) but not 'www.http.org'.<br />
Similarly, the end of a string may be indicated by the dollar sign ('$').<br />
■Note For a complete listing of regexp opera<strong>to</strong>rs, see the standard library reference, in the section “Regular<br />
Expression Syntax” (http://python.org/doc/lib/re-syntax.html).<br />
Contents of the re Module<br />
Knowing how <strong>to</strong> write regular expressions isn’t much good if you can’t use them for anything.<br />
The re module contains several useful functions for working with regular expressions. Some of<br />
the most important ones are described in Table 10-8.