The.Algorithm.Design.Manual.Springer-Verlag.1998
Text Compression

... have been corrupted. However, fidelity is not such an issue in image or video compression, where the presence of small artifacts will be imperceptible to the viewer. Significantly greater compression ratios can be obtained using lossy compression, which is why all image/video/audio compression algorithms take advantage of this freedom.

● Can I simplify my data before I compress it? - The most effective way to free up space on a disk is to delete files you don't need. Likewise, any preprocessing you can do to a file to reduce its information content before compression will pay off later in better performance. For example, is it possible to eliminate extra blank spaces or lines from the file? Can the document be converted entirely to uppercase characters or have formatting information removed?

● Does it matter whether the algorithm is patented? - One concern is that many data compression algorithms are patented, in particular the LZW variation of the Lempel-Ziv algorithm discussed below. Further, Unisys, the owner of the patent, makes periodic attempts to collect. My personal (although not legal) recommendation is to ignore them, unless you are in the business of selling text compression software. If this makes you uncomfortable, note that there are other variations on the Lempel-Ziv algorithm that are not under patent protection and perform about as well. See the notes and implementations below.

● How do I compress image data? - Run-length coding is the simplest lossless compression algorithm for image data: we replace each run of identical pixel values with one instance of the pixel and an integer giving the length of the run. This works well on binary images with large regions of similar pixels (like scanned text) and terribly on images with many quantization levels and a little noise. It can also be applied to text with many fields that have been padded by blanks.
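As an illustrative sketch of the idea (not an implementation from the book), run-length coding of a pixel stream can be expressed as a pair of short routines:

```python
def rle_encode(pixels):
    """Replace each run of identical values with a (value, run_length) pair."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return [(value, length) for value, length in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original stream."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out
```

On a row of a binary scanned-text image such as `[0, 0, 0, 0, 1, 1, 0]` this yields `[(0, 4), (1, 2), (0, 1)]`; on noisy many-level data, most runs have length 1 and the "compressed" form is larger than the input, exactly as described above.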
Issues like how many bits to allocate to the count field and the traversal order converting the two-dimensional image to a stream of pixels can have a surprisingly large impact on the compression ratio.

For serious image and video compression applications, I recommend that you use a lossy coding method and not fool around with implementing it yourself. JPEG is the standard high-performance image compression method, while MPEG is designed to exploit the frame-to-frame coherence of video. Encoders and decoders for both are provided in the implementation section.

● Must compression and decompression both run in real time? - For many applications, fast decompression is more important than fast compression, and algorithms such as JPEG exist to take advantage of this. When compressing video for a CD-ROM, the compression is done only once, while decompression is necessary every time anyone plays it. In contrast, operating systems that increase the effective capacity of disks by automatically compressing each file will need a symmetric algorithm with fast compression times as well.

Although there are literally dozens of text compression algorithms available, they are characterized by two basic approaches. In static algorithms, such as Huffman codes, a single coding table is built by analyzing the entire document. In adaptive algorithms, such as Lempel-Ziv, a coding table is built on the fly and adapts to the local character distribution of the document. An adaptive algorithm will likely prove to be the correct answer:

● Huffman codes - Huffman codes work by replacing each alphabet symbol by a variable-length
code string. ASCII uses eight bits per symbol in English text, which is wasteful, since certain characters (such as `e') occur far more often than others (such as `q'). Huffman codes compress text by assigning `e' a short code word and `q' a longer one.

Optimal Huffman codes can be constructed using an efficient greedy algorithm. Sort the symbols in increasing order by frequency. We merge the two least frequently used symbols x and y into a new symbol m, whose frequency is the sum of the frequencies of its two child symbols. Replacing x and y by m leaves us a smaller set of symbols, and we can repeat this operation n-1 times until all symbols have been merged. Each merging operation defines a node in a binary tree, and the left or right choices on the path from root to leaf define the bits of the binary code word for each symbol. Maintaining the list of symbols sorted by frequency can be done using priority queues, which yields an O(n log n)-time Huffman code construction algorithm.

Although they are widely used, Huffman codes have three primary disadvantages. First, you must make two passes over the document on encoding, the first to gather statistics and build the coding table and the second to actually encode the document. Second, you must explicitly store the coding table with the document in order to reconstruct it, which eats into your space savings on short documents. Finally, Huffman codes exploit only nonuniformity in symbol distribution, while adaptive algorithms can recognize the higher-order redundancy in strings such as 0101010101....

● Lempel-Ziv algorithms - Lempel-Ziv algorithms, including the popular LZW variant, compress text by building the coding table on the fly as we read the document. The coding table available for compression changes at each position in the text.
A clever protocol between the encoding program and the decoding program ensures that both sides of the channel are always working with exactly the same coding table, so no information can be lost. Lempel-Ziv algorithms build coding tables of recently used text strings, which can get arbitrarily long. Thus they can exploit frequently used syllables, words, and even phrases to build better encodings. Further, since the coding table alters with position, it adapts to local changes in the text distribution, which is important because most documents exhibit significant locality of reference.

The truly amazing thing about the Lempel-Ziv algorithm is how robust it is on different types of files. Even when you know that the text you are compressing comes from a special restricted vocabulary or is all lowercase, it is very difficult to beat Lempel-Ziv by using an application-specific algorithm. My recommendation is not to try. If there are obvious application-specific redundancies that can safely be eliminated with a simple preprocessing step, go ahead and do it. But don't waste much time fooling around. No matter how hard you work, you are unlikely to get significantly better text compression than with gzip or compress, and you might well do worse.

Implementations: A complete list of available compression programs is provided in the comp.compression FAQ (frequently asked questions) file, discussed below. This FAQ will likely point you to what you are looking for, if you don't find it in this section.
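The greedy Huffman construction described above can be sketched as follows (a minimal illustration, not the book's implementation), using a heap as the priority queue so that the two least frequent symbols are always merged first:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code by repeatedly merging the two least frequent symbols."""
    freq = Counter(text)
    # Each heap entry: (frequency, tiebreaker, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate one-symbol document
        return {next(iter(freq)): "0"}
    tie = len(heap)
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)       # least frequent subtree
        fy, _, y = heapq.heappop(heap)       # second least frequent subtree
        # Merging defines an internal tree node: prepend 0 on one side, 1 on the other.
        merged = {s: "0" + c for s, c in x.items()}
        merged.update({s: "1" + c for s, c in y.items()})
        heapq.heappush(heap, (fx + fy, tie, merged))
        tie += 1
    return heap[0][2]
```

Since each of the n-1 merges costs O(log n) heap work, the whole construction runs in O(n log n) time, and frequent symbols end up near the root with short code words.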
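To make the adaptive behavior concrete, here is a toy LZW-style encoder (an illustrative sketch, not a production or patent-relevant implementation): the table starts with single characters and learns longer strings as it scans the document.

```python
def lzw_encode(text):
    """Emit the code of the longest dictionary match, then learn one new string."""
    table = {chr(i): i for i in range(256)}  # start with single-byte entries
    next_code = 256
    current, output = "", []
    for ch in text:
        if current + ch in table:
            current += ch                    # keep extending the current match
        else:
            output.append(table[current])    # emit longest known string
            table[current + ch] = next_code  # learn: match + next character
            next_code += 1
            current = ch
    if current:
        output.append(table[current])
    return output
```

On repetitive input like "ababab" the table quickly acquires entries for "ab", "ba", and "aba", so later repetitions are emitted as single codes; this is the mechanism that lets Lempel-Ziv exploit frequently used syllables, words, and phrases.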