Beginning Python - From Novice to Professional

236 CHAPTER 10 ■ BATTERIES INCLUDED

The Wildcard

A regexp can match more than one string, and you create such a pattern by using some special characters. For example, the period character (dot) matches any character (except a newline), so the regular expression '.ython' would match both the string 'python' and the string 'jython'. It would also match strings such as 'qython', '+ython', or ' ython' (in which the first character is a single space), but not strings such as 'cpython' or 'ython', because the period matches exactly one character, neither two nor zero.

Because it matches “anything” (any single character except a newline), the period is called a wildcard.
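As a quick check of the claims above, here is a minimal sketch. It assumes Python 3.4 or later, where re.fullmatch is available, and uses fullmatch because “match” in the text means that the whole string fits the pattern. The names candidates and wildcard_results are just illustrative.

```python
import re

# '.ython': the dot stands for exactly one character (anything but a newline).
pattern = re.compile('.ython')

candidates = ['python', 'jython', 'qython', '+ython', ' ython',
              'cpython', 'ython', '\nython']

# fullmatch succeeds only if the entire string fits the pattern.
wildcard_results = {s: pattern.fullmatch(s) is not None for s in candidates}

for s, matched in wildcard_results.items():
    print(repr(s), matched)
```

The first five strings should report True. 'cpython' and 'ython' report False because the dot must consume exactly one character, and '\nython' reports False because the dot does not match a newline by default.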

Escaping Special Characters

When you use special characters such as the period, it’s important to know that you may run into problems if you try to use them as normal characters. For example, imagine you want to match the string 'python.org'. Do you simply use the pattern 'python.org'? You could, but that would also match 'pythonzorg', for example, which you probably wouldn’t want. (The dot matches any character except a newline, remember?) To make a special character behave like a normal one, you escape it, just as I demonstrated how to escape quotes in strings in Chapter 1: You place a backslash in front of it. Thus, in this example, you would use 'python\\.org', which would match 'python.org', and nothing else.
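The difference between the two patterns can be sketched like this. Again, re.fullmatch (Python 3.4 or later) is an assumption made for the sake of the demonstration, and the names unescaped, escaped, and escape_results are only illustrative.

```python
import re

unescaped = re.compile('python.org')    # the dot is still a wildcard
escaped = re.compile('python\\.org')    # the dot is a literal period

# For each string, record (matches unescaped pattern, matches escaped pattern).
escape_results = {
    s: (unescaped.fullmatch(s) is not None, escaped.fullmatch(s) is not None)
    for s in ['python.org', 'pythonzorg']
}

for s, pair in escape_results.items():
    print(s, pair)
```

Both patterns accept 'python.org', but only the unescaped one also accepts 'pythonzorg'.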

■Note To get a single backslash, which is required here by the re module, you need to write two backslashes in the string, to escape it from the interpreter. Thus you have two levels of escaping here: (1) from the interpreter, and (2) from the re module. (Actually, in some cases you can get away with using a single backslash and have the interpreter escape it for you automatically, but don’t rely on it.) If you are tired of doubling up backslashes, use a raw string, such as r'python\.org'.
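To see the two levels of escaping at work, you can compare the doubled-backslash string with the raw string directly. This is a small sketch; the variable names are illustrative only.

```python
doubled = 'python\\.org'   # ordinary string: the interpreter turns \\ into one backslash
raw = r'python\.org'       # raw string: the backslash is left alone

# Both spellings produce the same eleven characters.
same_pattern = (doubled == raw)

print(doubled)        # the text that the re module actually receives
print(same_pattern)
```

Either way, the re module receives a single backslash followed by a period, which it then reads as a literal period.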

Character Sets<br />

Matching any character can be useful, but sometimes you want more control. You can create a so-called character set by enclosing a substring in brackets. Such a character set will match any one of the characters it contains, so '[pj]ython' would match both 'python' and 'jython', but nothing else. You can also use ranges, such as '[a-z]' to match any character from a to z (alphabetically), and you can combine such ranges by putting one after another, such as '[a-zA-Z0-9]' to match uppercase and lowercase letters and digits. (Note that the character set will match only one such character, though.)
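Here is a minimal sketch of both kinds of character set, once more assuming re.fullmatch (Python 3.4 or later) so that the whole string has to fit. The names pj, alnum, set_results, and one_char are illustrative.

```python
import re

pj = re.compile('[pj]ython')        # first character must be p or j
alnum = re.compile('[a-zA-Z0-9]')   # exactly one letter or digit

set_results = {s: pj.fullmatch(s) is not None
               for s in ['python', 'jython', 'cython', 'pjython']}

one_char = {s: alnum.fullmatch(s) is not None
            for s in ['a', 'Z', '7', '_', 'ab']}

print(set_results)
print(one_char)
```

'pjython' and 'ab' are rejected because a character set, like the wildcard, stands for a single character.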

To invert the character set, put the character ^ first, as in '[^abc]' to match any character except a, b, or c.
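An inverted set can be tried out the same way. This sketch assumes re.fullmatch (Python 3.4 or later), and the names not_abc and inverted are illustrative.

```python
import re

not_abc = re.compile('[^abc]')   # any single character other than a, b, or c

inverted = {s: not_abc.fullmatch(s) is not None
            for s in ['a', 'b', 'c', 'd', 'A', ' ', '\n']}

for s, matched in inverted.items():
    print(repr(s), matched)
```

Note that the set is case-sensitive, so 'A' is accepted, and that, unlike the dot, an inverted set also accepts a newline.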
