13.07.2015 Views

Linux System Administration Recipes A Problem-Solution Approach



CHAPTER 9 ■ WORKING WITH TEXT IN FILES

But if you’ve had to change something in one file on the web server, you’ll probably have to do it in more than one file. Happily, this also works with shell wildcards. So, to make the previous change to every file in a particular directory, use this:

perl -i.bak -pe 's/widget/wodget/g' *.html

Or to do this recursively through all the directories (this might be useful if you change the name of your main CSS file, for example), use this:

find . -type f -print0 | xargs -0 perl -p -i.old -e 's/old\.css/new.css/g'

The -type f argument to find limits this to files, so you won’t get errors from Perl trying to do this substitution in directories. The -print0 and -0 options pass the filenames separated by null characters, so names containing spaces survive the trip through xargs. The backslash in old\.css makes the dot match only a literal dot rather than any character.

You can also use Perl to split columned data, in much the same way as you can use awk. This is done with the -a option, which adds an implicit split (by default on whitespace) at the top of the while() loop that -n or -p provides and stores the resulting fields in the array @F. As with awk, you can change the character to split on with -F. So, to get the user and home directory information out of /etc/passwd, use this:

perl -an -F: -e 'print $F[0], "\n", $F[5], "\n";' /etc/passwd

9-5. When It’s Not ASCII: Dealing with UTF-8

Before talking about how Linux handles various sorts of text, let’s establish what text encodings are available:

• ASCII: This is the most basic encoding available. It uses 7 bits to encode its 128 characters, which means that the eighth bit of every byte is left unused. It’s limited, but you can rely on ASCII characters being viewable and available very nearly everywhere. (The exception is certain sorts of mainframe, but if you’re using these you’ll know about it!)

• ISO-8859-1: Given that ASCII has only 128 characters (including 33 mostly obsolete nonprinting characters), all the accented characters and some other nonletter standard characters (such as the U.K. pound sign) are missing from it. Thus, people started creating extensions to it. The first extension, aka Latin 1, was ISO-8859-1, which broadly speaking deals with Western Europe. There are lots more similarly numbered encodings, such as ISO-8859-15.

• UTF-8: The extensions system is cumbersome and requires multiple encodings (which may be incompatible) to be stored on your system. UTF-8 aims to fix this by encoding all available characters, using between one and four bytes each. Western-style characters all fit within the first two bytes. Everything else in the Basic Multilingual Plane, which covers almost all characters in common use in any language, fits within three bytes. The fourth byte is used for characters in the other character planes (various less-common characters from Asian scripts, historical languages, and noncharacter notations such as musical notation). UTF-8 is very space-efficient for Western-style languages and less so for other languages.
