Text Compression
(The Algorithm Design Manual, Springer-Verlag, 1998)

... have been corrupted. However, fidelity is not such an issue in image or video compression, where the presence of small artifacts will be imperceptible to the viewer. Significantly greater compression ratios can be obtained using lossy compression, which is why all image/video/audio compression algorithms take advantage of this freedom.

● Can I simplify my data before I compress it? - The most effective way to free up space on a disk is to delete files you don't need. Likewise, any preprocessing you can do to a file to reduce its information content before compression will pay off later in better performance. For example, is it possible to eliminate extra blank spaces or lines from the file? Can the document be converted entirely to uppercase characters or have formatting information removed?

● Does it matter whether the algorithm is patented? - One concern is that many data compression algorithms are patented, in particular the LZW variation of the Lempel-Ziv algorithm discussed below. Further, Unisys, the owner of the patent, makes periodic attempts to collect. My personal (although not legal) recommendation is to ignore them, unless you are in the business of selling text compression software. If this makes you uncomfortable, note that there are other variations on the Lempel-Ziv algorithm that are not under patent protection and perform about as well. See the notes and implementations below.

● How do I compress image data? - Run-length coding is the simplest lossless compression algorithm for image data, where we replace runs of identical pixel values with one instance of the pixel and an integer giving the length of the run. This works well on binary images with large regions of similar pixels (like scanned text) and terribly on images with many quantization levels and a little noise. It can also be applied to text with many fields that have been padded by blanks. Issues like how many bits to allocate to the count field and the traversal order converting the two-dimensional image to a stream of pixels can have a surprisingly large impact on the compression ratio. (A short run-length coding sketch follows this list.)

For serious image and video compression applications, I recommend that you use a lossy coding method and not fool around with implementing it yourself. JPEG is the standard high-performance image compression method, while MPEG is designed to exploit the frame-to-frame coherence of video. Encoders and decoders for both are provided in the implementation section.

● Must compression and decompression both run in real time? - For many applications, fast decompression is more important than fast compression, and algorithms such as JPEG exist to take advantage of this. When compressing video for a CD-ROM, the compression will be done only once, while decompression will be necessary any time anyone plays it. In contrast, operating systems that increase the effective capacity of disks by automatically compressing each file will need a symmetric algorithm with fast compression times as well.
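To make the run-length coding described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not one of the implementations referenced here: the image is assumed to be an already-flattened, row-major stream of pixel values, the names rle_encode and rle_decode are invented for this example, and the max_run cap stands in for a fixed-width count field.

    def rle_encode(pixels, max_run=255):
        """Run-length encode a flat sequence of pixel values into
        (value, run_length) pairs.  Capping runs at max_run models a
        fixed-width count field (here, one byte)."""
        encoded = []
        i = 0
        while i < len(pixels):
            value, run = pixels[i], 1
            while i + run < len(pixels) and pixels[i + run] == value and run < max_run:
                run += 1
            encoded.append((value, run))
            i += run
        return encoded

    def rle_decode(encoded):
        """Invert rle_encode by expanding each (value, run_length) pair."""
        pixels = []
        for value, run in encoded:
            pixels.extend([value] * run)
        return pixels

    # A binary scan line with long runs compresses well; a noisy line would not.
    row = [0] * 40 + [1] * 25 + [0] * 35
    assert rle_decode(rle_encode(row)) == row

As noted in the bullet above, the interesting design choices are how wide to make the count field and in what order to walk the two-dimensional image; this sketch simply fixes both.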
Although there are literally dozens of text compression algorithms available, they are characterized by two basic approaches. In static algorithms, such as Huffman codes, a single coding table is built by analyzing the entire document. In adaptive algorithms, such as Lempel-Ziv, a coding table is built on the fly and adapts to the local character distribution of the document. An adaptive algorithm will likely prove to be the correct answer:

● Huffman codes - Huffman codes work by replacing each alphabet symbol by a variable-length code string. ASCII uses eight bits per symbol in English text, which is wasteful, since certain characters (such as `e') occur far more often than others (such as `q'). Huffman codes compress text by assigning `e' a short code word and `q' a longer one. Optimal Huffman codes can be constructed using an efficient greedy algorithm. Sort the symbols in increasing order by frequency. We merge the two least frequently used symbols x and y into a new symbol m, whose frequency is the sum of the frequencies of its two child symbols. By replacing x and y by m, we now have a smaller set of symbols, and we can repeat this operation n-1 times until all symbols have been merged. Each merging operation defines a node in a binary tree, and the left or right choices on the path from root to leaf define the bits of the binary code word for each symbol. Maintaining the list of symbols sorted by frequency can be done using priority queues, which yields an O(n log n)-time Huffman code construction algorithm. (A construction sketch in this spirit appears at the end of this section.)

Although they are widely used, Huffman codes have three primary disadvantages. First, you must make two passes over the document on encoding, the first to gather statistics and build the coding table and the second to actually encode the document. Second, you must explicitly store the coding table with the document in order to reconstruct it, which eats into your space savings on short documents. Finally, Huffman codes exploit only nonuniformity in symbol distribution, while adaptive algorithms can recognize the higher-order redundancy in strings such as 0101010101....

● Lempel-Ziv algorithms - Lempel-Ziv algorithms, including the popular LZW variant, compress text by building the coding table on the fly as we read the document. The coding table available for compression changes at each position in the text. A clever protocol between the encoding program and the decoding program ensures that both sides of the channel are always working with the exact same code table, so no information can be lost. Lempel-Ziv algorithms build coding tables of recently used text strings, which can get arbitrarily long. Thus they can exploit frequently used syllables, words, and even phrases to build better encodings. Further, since the coding table alters with position, it adapts to local changes in the text distribution, which is important because most documents exhibit significant locality of reference. (An LZW-style sketch appears at the end of this section.)

The truly amazing thing about the Lempel-Ziv algorithm is how robust it is on different types of files. Even when you know that the text you are compressing comes from a special restricted vocabulary or is all lowercase, it is very difficult to beat Lempel-Ziv by using an application-specific algorithm. My recommendation is not to try. If there are obvious application-specific redundancies that can safely be eliminated with a simple preprocessing step, go ahead and do it. But don't waste much time fooling around. No matter how hard you work, you are unlikely to get significantly better text compression than with gzip or compress, and you might well do worse.

Implementations: A complete list of available compression programs is provided in the comp.compression FAQ (frequently asked questions) file, discussed below. This FAQ will likely point you to what you are looking for, if you don't find it in this section.
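To make the greedy Huffman construction described above concrete, here is a minimal Python sketch. It assumes the statistics-gathering first pass is a simple character count over the document, uses a heap as the priority queue, and the name huffman_codes is invented for this illustration rather than taken from any implementation cited here.

    import heapq
    from collections import Counter

    def huffman_codes(text):
        """Build a Huffman code table by repeatedly merging the two least
        frequent subtrees; root-to-leaf paths give the codewords.  With a
        heap this runs in O(n log n) time for n distinct symbols."""
        freq = Counter(text)
        # Heap entries are (frequency, tie_breaker, tree); a tree is either a
        # symbol (leaf) or a (left, right) pair (internal node).  The unique
        # tie_breaker keeps the heap from ever comparing two trees directly.
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie_breaker = len(heap)
        if len(heap) == 1:                    # degenerate single-symbol document
            return {heap[0][2]: "0"}
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, tie_breaker, (left, right)))
            tie_breaker += 1
        codes = {}
        def walk(node, prefix):               # left edges append '0', right edges '1'
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    table = huffman_codes("this is an example for huffman coding")
    # More frequent symbols (such as ' ') never receive longer codewords
    # than less frequent ones (such as 'x').
    assert len(table[" "]) <= len(table["x"])

Note that this sketch produces only the code table; the second pass that actually encodes the document, and the need to store the table alongside it, are exactly the costs listed among the disadvantages above.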

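In the same spirit, here is a hedged sketch of the adaptive table-building behind the Lempel-Ziv entry above, patterned on LZW compression. The name lzw_compress is invented for this example, and a complete codec would pair it with a decoder that reconstructs the identical table from the code stream, which is the encoder/decoder protocol the bullet alludes to.

    def lzw_compress(text):
        """Emit the integer code of the longest string already in the table,
        then add that string plus the next character as a new entry, so the
        table adapts to the document as it is read."""
        # Seed the table with single characters; a real codec fixes the
        # starting alphabet (say, all 256 byte values) so encoder and
        # decoder begin in the same state.
        table = {chr(c): c for c in range(256)}
        next_code = 256
        current = ""
        output = []
        for ch in text:
            candidate = current + ch
            if candidate in table:
                current = candidate            # keep extending the match
            else:
                output.append(table[current])
                table[candidate] = next_code   # learn the new phrase
                next_code += 1
                current = ch
        if current:
            output.append(table[current])
        return output

    data = "abracadabra abracadabra abracadabra"
    codes = lzw_compress(data)
    # Repetition lets later codes stand for ever-longer phrases, so the
    # output contains fewer codes than the input has characters.
    assert len(codes) < len(data)

Because the table grows with recently seen strings, repeated syllables, words, and phrases end up represented by single codes, which is exactly the locality of reference the text above describes.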
Text Compression<br />

have been corrupted. However, fidelity is not such an issue in image or video compression, where<br />

the presence of small artifacts will be imperceptible to the viewer. Significantly greater<br />

compression ratios can be obtained using lossy compression, which is why all image/video/audio<br />

compression algorithms take advantage of this freedom.<br />

● Can I simplify my data before I compress it? - <strong>The</strong> most effective way to free up space on a disk is<br />

to delete files you don't need. Likewise, any preprocessing you can do to a file to reduce its<br />

information content before compression will pay off later in better performance. For example, is it<br />

possible to eliminate extra blank spaces or lines from the file? Can the document be converted<br />

entirely to uppercase characters or have formatting information removed?<br />

● Does it matter whether the algorithm is patented? - One concern is that many data compression<br />

algorithms are patented, in particular the LZW variation of the Lempel-Ziv algorithm discussed<br />

below. Further, Unisys, the owner of the patent, makes periodic attempts to collect. My personal<br />

(although not legal) recommendation is to ignore them, unless you are in the business of selling<br />

text compression software. If this makes you uncomfortable, note that there are other variations<br />

on the Lempel-Ziv algorithm that are not under patent protection and perform about as well. See<br />

the notes and implementations below.<br />

● How do I compress image data - Run-length coding is the simplest lossless compression<br />

algorithm for image data, where we replace runs of identical pixel values with one instance of the<br />

pixel and an integer giving the length of the run. This works well on binary images with large<br />

regions of similar pixels (like scanned text) and terribly on images with many quantization levels<br />

and a little noise. It can also be applied to text with many fields that have been padded by blanks.<br />

Issues like how many bits to allocate to the count field and the traversal order converting the twodimensional<br />

image to a stream of pixels can have a surprisingly large impact on the compression<br />

ratio.<br />

For serious image and video compression applications, I recommend that you use a lossy coding<br />

method and not fool around with implementing it yourself. JPEG is the standard highperformance<br />

image compression method, while MPEG is designed to exploit the frame-to-frame<br />

coherence of video. Encoders and decoders for both are provided in the implementation section.<br />

● Must compression and decompression both run in real time? - For many applications, fast<br />

decompression is more important than fast compression, and algorithms such as JPEG exist to<br />

take advantage of this. While compressing video for a CD-ROM, the compression will be done<br />

only once, while decompression will be necessary anytime anyone plays it. In contrast, operating<br />

systems that increase the effective capacity of disks by automatically compressing each file will<br />

need a symmetric algorithm with fast compression times as well.<br />

Although there are literally dozens of text compression algorithms available, they are characterized by<br />

two basic approaches. In static algorithms, such as Huffman codes, a single coding table is built by<br />

analyzing the entire document. In adaptive algorithms, such as Lempel-Ziv, a coding table is built on the<br />

fly and adapts to the local character distribution of the document. An adaptive algorithm will likely prove<br />

to be the correct answer:<br />

● Huffman codes - Huffman codes work by replacing each alphabet symbol by a variable-length<br />

file:///E|/BOOK/BOOK5/NODE205.HTM (2 of 4) [19/1/2003 1:32:11]
