Notes 05 - Tehnički fakultet u Rijeci
INFORMATION THEORY
Željko Jeričević, dr. sc.
Zavod za računarstvo, Tehnički fakultet &
Zavod za biologiju i medicinsku genetiku, Medicinski fakultet
51000 Rijeka, Croatia
Phone: (+385) 51-651 594
E-mail: zeljko.jericevic@riteh.hr
http://www.riteh.uniri.hr/~zeljkoj/Zeljko_Jericevic.html
Information theory
From the material covered so far we know that information must be
prepared before it is sent through a channel. This is done by converting
the information into a form whose entropy is close to the maximum, which
brings the transmission efficiency close to its maximum. This can be
achieved by lossless compression, e.g. Huffman coding.
A second conversion concerns the reliability of transmission: the
information is translated into a form in which certain types of errors
can be corrected automatically (e.g. with Hamming coding).
10 February 2012 zeljko.jericevic@riteh.hr 2
Compression
Samuel F.B. Morse (1791-1872)
The letter e is the most frequently used
Morse code
A dot is 1 bit and a dash is 3 bits =>
the gap within a letter is 1 bit,
the gap between letters is 3 bits,
and the gap between words is 7 bits.
The shortest character (the letter E) takes 1 bit,
the longest character (the digit 0) takes 19 bits,
and the gap between words takes 7 bits.
How does this compare with ASCII?
Every character takes 8 bits, and the gap
between words also takes 8 bits.
An exact comparison requires knowing
the character frequencies.
Task
Choose any English text from Project Gutenberg in
ASCII format (min. 100 KB).
Determine the frequency of occurrence of every ASCII character in the
text.
Calculate how many bits the Morse representation of the text
needs and compare that with the number of bits needed
for the ASCII representation.
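One possible way to approach the task is sketched below. It uses the bit costs from the Morse slide (dot 1, dash 3, gap inside a character 1, gap between letters 3, gap between words 7). The function names are illustrative, and dropping characters that have no Morse code is an assumption of this sketch, not part of the task statement.

```python
from collections import Counter

# International Morse code for letters and digits ('.' = dot, '-' = dash).
MORSE = {
    'a': '.-', 'b': '-...', 'c': '-.-.', 'd': '-..', 'e': '.', 'f': '..-.',
    'g': '--.', 'h': '....', 'i': '..', 'j': '.---', 'k': '-.-', 'l': '.-..',
    'm': '--', 'n': '-.', 'o': '---', 'p': '.--.', 'q': '--.-', 'r': '.-.',
    's': '...', 't': '-', 'u': '..-', 'v': '...-', 'w': '.--', 'x': '-..-',
    'y': '-.--', 'z': '--..',
    '0': '-----', '1': '.----', '2': '..---', '3': '...--', '4': '....-',
    '5': '.....', '6': '-....', '7': '--...', '8': '---..', '9': '----.',
}

def char_frequencies(text):
    """Frequency of every character in the text."""
    return Counter(text)

def symbol_bits(ch):
    """Bits for one character: dot = 1, dash = 3, gap inside the character = 1."""
    code = MORSE[ch]
    return sum(1 if s == '.' else 3 for s in code) + (len(code) - 1)

def morse_bits(text):
    """Total Morse bits. Characters without a Morse code are dropped (assumption)."""
    words = [[c for c in w if c in MORSE] for w in text.lower().split()]
    words = [w for w in words if w]
    total = sum(sum(symbol_bits(c) for c in w) + 3 * (len(w) - 1) for w in words)
    return total + 7 * max(len(words) - 1, 0)

def ascii_bits(text):
    """8 bits per character, spaces included."""
    return 8 * len(text)

# Usage with a hypothetical file name:
# text = open('book.txt', encoding='ascii', errors='ignore').read()
# print(morse_bits(text), ascii_bits(text))
```

Because the comparison depends on the character frequencies, the result will differ from text to text.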
David A. Huffman (1925-1999)
David Huffman is best known for his legendary
Huffman code, a compression scheme for lossless
variable length encoding. It was the result of a term
paper he wrote while a graduate student at the
Massachusetts Institute of Technology (MIT), where
he earned a D.Sc. degree on a thesis named The
Synthesis of Sequential Switching Circuits, advised by
Samuel H. Caldwell (1953).
"Huffman Codes" are used in nearly every application
that involves the compression and transmission of
digital data, such as fax machines, modems, computer
networks, and high-definition television (HDTV), to
name a few.
From Wikipedia
Shannon-Fano coding
Top-down coding (the precursor of Huffman coding)
“In Shannon–Fano coding, the symbols are arranged in
order from most probable to least probable, and then
divided into two sets whose total probabilities are as
close as possible to being equal. All symbols then have
the first digits of their codes assigned; symbols in the
first set receive "0" and symbols in the second set
receive "1". As long as any sets with more than one
member remain, the same process is repeated on those
sets, to determine successive digits of their codes.
When a set has been reduced to one symbol, of course,
this means the symbol's code is complete and will not
form the prefix of any other symbol's code.” From Wikipedia
Shannon-Fano coding
Top-down coding (the precursor of Huffman coding)
ABACADACABEDAADBECAEBADCAECAEBADBACABAD
Symbol       A           B           C           D           E
Frequency    15          7           6           6           5
Probability  0.38461538  0.17948718  0.15384615  0.15384615  0.12820513
Shannon-Fano coding
Top-down coding
Symbol       A     B     C     D     E
Frequency    15    7     6     6     5
Probability  0.38  0.18  0.15  0.15  0.13
Code         00    01    10    110   111
\[ \frac{2\,\text{bits}\cdot(15+7+6) + 3\,\text{bits}\cdot(6+5)}{39} \approx 2.28\ \text{bits/symbol} \]
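The top-down splitting described in the quoted definition can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `shannon_fano` and the tie-breaking (keep the first best split, keep input order among equal frequencies) are assumptions of this sketch.

```python
def shannon_fano(freqs):
    """Return {symbol: code} for a dict of symbol frequencies."""
    # Most frequent first; sorted() is stable, so ties keep their input order.
    symbols = sorted(freqs, key=lambda s: -freqs[s])
    codes = {s: '' for s in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(freqs[s] for s in group)
        running, best_i, best_diff = 0, 1, None
        # Find the cut that makes the two halves as equal in weight as possible.
        for i in range(1, len(group)):
            running += freqs[group[i - 1]]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_i, best_diff = i, diff
        left, right = group[:best_i], group[best_i:]
        for s in left:
            codes[s] += '0'
        for s in right:
            codes[s] += '1'
        split(left)
        split(right)

    split(symbols)
    return codes

freqs = {'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5}
codes = shannon_fano(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / sum(freqs.values())
```

With the frequencies from the slide, `codes` reproduces the table (A 00, B 01, C 10, D 110, E 111) and `average` is 89/39 bits per symbol.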
Huffman coding
Bottom-up:
“D & E have the lowest frequencies and so
are allocated 0 and 1 respectively and
grouped together with a combined
probability of 0.28205128. The lowest pair
now are B and C so they're allocated 0 and
1 and grouped together with a combined
probability of 0.33333333. This leaves BC
and DE now with the lowest probabilities
so 0 and 1 are prepended to their codes
and they are combined. This then leaves
just A and BCDE, which have 0 and 1
prepended respectively and are then
combined. This leaves us with a single node
and our algorithm is complete.” From
Wikipedia
Huffman coding
Bottom-up
Symbol       A     B     C     D     E
Frequency    15    7     6     6     5
Probability  0.38  0.18  0.15  0.15  0.13
Code         0     100   101   110   111
\[ \frac{1\,\text{bit}\cdot 15 + 3\,\text{bits}\cdot(7+6+6+5)}{39} \approx 2.23\ \text{bits/symbol} \]
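The bottom-up merging can be sketched with a priority queue. This is an illustrative sketch: the name `huffman` and the tie-breaking counter are assumptions. Ties between equal frequencies may be resolved differently than on the slide, so the individual bit patterns can differ from the table while the code lengths, and therefore the average, stay the same.

```python
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: code} built by repeatedly merging the two lightest nodes."""
    tie = count()  # unique tie-breaker so the heap never compares the dicts
    heap = [(f, next(tie), {s: ''}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)  # lightest node gets prefix 0
        f1, _, c1 = heapq.heappop(heap)  # second lightest gets prefix 1
        merged = {s: '0' + c for s, c in c0.items()}
        merged.update({s: '1' + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

freqs = {'A': 15, 'B': 7, 'C': 6, 'D': 6, 'E': 5}
codes = huffman(freqs)
lengths = {s: len(c) for s, c in codes.items()}
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
```

For the slide's frequencies this gives A a 1-bit code and B, C, D, E 3-bit codes, i.e. 87 bits for 39 symbols.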
Huffman coding
"this is an example of a huffman tree"
Character  Frequency  Code
space      7          111
a          4          010
e          4          000
f          3          1101
h          2          1010
i          2          1000
m          2          0111
n          2          0010
s          2          1011
t          2          0110
l          1          11001
o          1          00110
p          1          10011
r          1          11000
u          1          00111
x          1          10010
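To illustrate how such a table is used, the sketch below encodes the example sentence with the codes from the table and decodes it again. Since no code is a prefix of another, the decoder can emit a character as soon as the bits read so far match a code. The function names are illustrative.

```python
# Code table copied from the slide.
CODES = {
    ' ': '111', 'a': '010', 'e': '000', 'f': '1101', 'h': '1010', 'i': '1000',
    'm': '0111', 'n': '0010', 's': '1011', 't': '0110', 'l': '11001',
    'o': '00110', 'p': '10011', 'r': '11000', 'u': '00111', 'x': '10010',
}

def encode(text):
    """Concatenate the code of every character."""
    return ''.join(CODES[c] for c in text)

def decode(bits):
    """Read bits until they match a code, emit the character, start over."""
    inverse = {code: ch for ch, code in CODES.items()}
    out, current = [], ''
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ''
    return ''.join(out)

text = "this is an example of a huffman tree"
bits = encode(text)
```

The encoded sentence takes 135 bits, compared with 36 × 8 = 288 bits in 8-bit ASCII.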
Huffman coding
[Figures not reproduced; one of them is credited "From Malan".]
Huffman coding
Character  p     Huffman
A          0.2   ?
B          0.1   ?
C          0.1   ?
D          0.15  ?
E          0.45  ?
Huffman coding
Character  p     Huffman
A          0.2   01
B          0.1   0000
C          0.1   0001
D          0.15  001
E          0.45  1
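As a check on the table above (a worked calculation that is not part of the original slides), the average code length follows from the probabilities and the code lengths, and can be compared with the entropy of the source:

\[ \bar{L} = 0.2\cdot 2 + 0.1\cdot 4 + 0.1\cdot 4 + 0.15\cdot 3 + 0.45\cdot 1 = 2.1\ \text{bits/symbol} \]

\[ H = -\sum_i p_i \log_2 p_i \approx 2.06\ \text{bits/symbol} \]

The average length lies slightly above the entropy, as expected for a prefix code with whole-bit code words.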
Modified Huffman coding
Modified Huffman coding is used in fax machines
to encode black-on-white bitmaps.
It combines variable-length Huffman codes
with run-length coding.
Black-on-white images are coded with 1 bit per pixel
(white bits have the value 0, black bits have the
value 1). Runs of black and white pixels are
counted and sent as variable-length
Huffman codes.
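The counting step can be sketched as below. The function name `run_lengths` is illustrative. The convention that every line starts with a white run, so that a line beginning with a black pixel gets a leading white run of length 0, is taken from the later slides.

```python
from itertools import groupby

def run_lengths(pixels):
    """Alternating white/black run lengths of one scan line.

    pixels is a list of ints (0 = white, 1 = black). The result always
    starts with a white run, which has length 0 if the line starts black.
    """
    runs = [len(list(group)) for _, group in groupby(pixels)]
    if pixels and pixels[0] == 1:
        runs.insert(0, 0)
    return runs
```

For example, `run_lengths([1, 0, 0, 0, 0, 1, 1, 0, 1])` gives `[0, 1, 4, 2, 1, 1]`.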
Modified Huffman coding:
CCITT (Huffman) Encoding
“CCITT (International Telegraph and Telephone
Consultative Committee) is a standards organization that
has developed a series of communications protocols for the
facsimile transmission of black-and-white images over
telephone lines and data networks. These protocols are
known officially as the CCITT T.4 and T.6 standards but
are more commonly referred to as CCITT Group 3 and
Group 4 compression, respectively.”
CCITT Encodings
“Group 3 and Group 4 encodings are compression
algorithms that are specifically designed for
encoding 1-bit image data. Many document and
FAX file formats support Group 3 compression,
and several, including TIFF, also support Group 4.”
CCITT Encodings
“Group 3 encoding was designed specifically for bi-level,
black-and-white image data
telecommunications. All modern FAX machines
and FAX modems support Group 3 facsimile
transmissions. Group 3 encoding and decoding is
fast, maintains a good compression ratio for a wide
variety of document data, and contains information
that aids a Group 3 decoder in detecting and
correcting errors without special hardware.”
CCITT Encodings
“Group 4 is a more efficient form of bi-level
compression that has almost entirely replaced the
use of Group 3 in many conventional document
image storage systems. (An exception is facsimile
document storage systems where original Group 3
images are required to be stored in an unaltered
state.)”
CCITT Encodings
“Group 4 encoded data is approximately half the
size of 1-dimensional Group 3-encoded data.
Although Group 4 is fairly difficult to implement
efficiently, it encodes at least as fast as Group 3
and in some implementations decodes even faster.
Also, Group 4 was designed for use on data
networks, so it does not contain the
synchronization codes used for error detection
that Group 3 does, making it a poor choice for an
image transfer protocol.”
CCITT Encodings
“Group 3 normally achieves a compression ratio
of 5:1 to 8:1 on a standard 200-dpi (204x196 dpi),
A4-sized document. Group 4 results are roughly
twice as efficient as Group 3, achieving
compression ratios upwards of 15:1 with the same
document. Claims that the CCITT algorithms are
capable of far better compression on standard
business documents are exaggerated--largely by
hardware vendors.”
CCITT Encodings
“Because the CCITT algorithms have been optimized for
type and handwritten documents, it stands to reason that
images radically different in composition will not
compress very well. This is all too true. Bi-level bitmaps
that contain a high frequency of short runs, as typically
found in digitally half-toned continuous-tone images, do
not compress as well using the CCITT algorithms. Such
images will usually result in a compression ratio of 3:1 or
even lower, and many will actually compress to a size
larger than the original.”
CCITT Encodings
“The CCITT actually defines three algorithms for
the encoding of bi-level image data:
• Group 3 One-Dimensional (G31D)
• Group 3 Two-Dimensional (G32D)
• Group 4 Two-Dimensional (G42D)”
Modified Huffman coding
Each line is coded as alternating
runs of black and white bits. Runs of length
63 or less are coded with a so-called
terminating code. Runs of
length 64 or more have a make-up
code in front of the terminating code.
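The split of a run into make-up and terminating parts can be sketched as follows. The function name `split_run` is illustrative. Repeating the 2560 make-up code for very long runs follows the example later in these slides, where a run of 8800 black bits is sent as three 2560 codes, one 1088 code and a terminating code for 32.

```python
def split_run(length):
    """Return (make-up run lengths, terminating run length) for one run."""
    makeups = []
    # Very long runs: repeat the largest make-up code (2560).
    while length >= 2560:
        makeups.append(2560)
        length -= 2560
    # Make-up codes exist for multiples of 64.
    if length >= 64:
        makeups.append(length - length % 64)
    # A terminating code (0..63) is always sent last.
    return makeups, length % 64
```

For example, `split_run(100)` gives `([64], 36)` and `split_run(20)` gives `([], 20)`.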
Modified Huffman coding
The codes are fixed in advance, based on
representative statistics for printed
documents (85% white, 15% black; short
black runs are more likely than short white ones, long
white runs are more likely than long black ones;
every line of the document starts with a white
run).
Modified Huffman coding
terminating codes
Run length  White bits  Black bits
0           00110101    0000110111
1           000111      010
2           0111        11
3           1000        10
4           1011        011
5           1100        0011
6           1110        0010
7           1111        00011
8           10011       000101
Modified Huffman coding
terminating codes
Run length  White bits  Black bits
11          01000       0000101
.
17          101011      0000011000
.
28          0011000     000011001100
.
63          00110100    000001100111
Modified Huffman coding
make-up codes (white and black)
Run length  White bits   Black bits
64          11011        0000001111
128         10010        000011001000
192         010111       000011001001
256         0110111      000001011011
320         00110110     000000110011
.
1600        010011010    0000001011011
1664        011000       0000001100100
1728        010011011    0000001100101
1792        00000001000  00000001000
… 2560
Modified Huffman coding
special codes
“Several special code words are also defined in a
Group 3-encoded data stream. These codes are
used to provide synchronization in the event that a
phone transmission experiences a burst of noise.
By recognizing this special code, a CCITT decoder
may identify transmission errors and attempt to
apply a recovery algorithm that approximates the
lost data.”
Modified Huffman coding
special codes
“The EOL code is a 12-bit code word that begins each line
in a Group 3 transmission. This unique code word is used
to detect the start/end of a scan line during the image
transmission. If a burst of noise temporarily corrupts the
signal, a Group 3 decoder throws away the unrecognized
data it receives until it encounters an EOL code. The
decoder would then start receiving the transmission as
normal again, assuming that the data following the EOL
is the beginning of the next scan line. The decoder might
also replace the bad line with a predefined set of data,
such as a white scan line.”
Modified Huffman coding
special codes
A decoder also uses EOL codes for several purposes. It
uses them to keep track of the width of a decoded scan
line. (An incorrect scan-line width may be an error, or it
may be an indication to pad with white pixels to the
EOL.) In addition, it uses EOL codes to keep track of the
number of scan lines in an image, in order to detect a
short image. If it finds one, it pads the remaining length
with scan lines of all white pixels.
EOL is 000000000001
RTC (Return To Control) is 6 consecutive EOL codes and
signifies the end of message transmission.
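A rough sketch of how a decoder could use these markers is shown below: it cuts the stream at the RTC and then splits the remainder at each EOL. Working on a string of '0'/'1' characters and the name `split_lines` are simplifications of this sketch. It also assumes that the EOL pattern never appears inside valid line data.

```python
EOL = '000000000001'
RTC = EOL * 6  # six consecutive EOL codes end the transmission

def split_lines(bits):
    """Split a string of '0'/'1' characters into the coded data of each scan line."""
    end = bits.find(RTC)
    if end != -1:
        bits = bits[:end]  # ignore everything from the RTC onwards
    # Each EOL ends in the only '1' of the pattern, so zero bits that belong
    # to the preceding code word stay with that line.
    return [line for line in bits.split(EOL) if line]
```

A decoder that fails to decode one of the returned lines could replace it with a white line and continue with the next one, as described in the quoted text.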
Modified Huffman coding
examples
a) Only the terminating code for 20 black bits needs to be sent:
00001101000
b) We send the make-up code for 64 white bits (11011) and
the terminating code for 36 white bits: 00010101
c) We send 4 make-up codes: 3 for 2560 black bits
(000000011111) and one for 1088 black bits
(0000001110101), followed by the terminating code for 32 black bits:
000001101010
Modified Huffman coding
examples
The line starts with a black bit (unusual), so we insert
a white run of length 0, then the code for 1 black bit,
followed by the code for 4 white bits, the code for 2 black bits, the code for
1 white bit, the code for 1 black bit, the make-up code for 1216 white
bits and the terminating code for 50 white bits, then EOL.
Run         Code
0 white     00110101
1 black     010
4 white     1011
2 black     11
1 white     000111
1 black     010
1266 white  011011000 + 01010011
EOL         000000000001
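The example line can be reproduced with a small encoder. The sketch below contains only the table entries needed for this one line, so it is deliberately incomplete. The white run of 1 uses 000111, as listed in the terminating-code table. The function name `encode_line` is illustrative, and runs of 2560 or more are not handled.

```python
# Partial code tables: only the entries needed for the example line.
WHITE_TERM = {0: '00110101', 1: '000111', 4: '1011', 50: '01010011'}
BLACK_TERM = {1: '010', 2: '11'}
WHITE_MAKEUP = {1216: '011011000'}
BLACK_MAKEUP = {}
EOL = '000000000001'

def encode_line(runs):
    """Encode alternating run lengths (white first) and append EOL.

    Code words are separated by spaces for readability only.
    """
    out = []
    for i, run in enumerate(runs):
        term, makeup = ((WHITE_TERM, WHITE_MAKEUP) if i % 2 == 0
                        else (BLACK_TERM, BLACK_MAKEUP))
        big = run - run % 64          # part covered by a make-up code
        if big:
            out.append(makeup[big])
        out.append(term[run % 64])    # terminating code is always sent
    out.append(EOL)
    return ' '.join(out)

encoded = encode_line([0, 1, 4, 2, 1, 1, 1266])
```

Here `encoded` is `00110101 010 1011 11 000111 010 011011000 01010011 000000000001`.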
G32D CCITT Encoding
“With Group 3 Two-Dimensional (G32D)
encoding, the way a scan line is encoded may
depend on the immediately preceding scan-line
data. Many images have a high degree of vertical
coherence (redundancy). By describing the
differences between two scan lines, rather than
describing the scan line contents, 2D encoding
achieves better compression.”
G32D CCITT Encoding
“The first pixel of each run length is called a changing
element. Each changing element marks a color transition
within a scan line (the point where a run of one color ends
and a run of the next color begins).
The position of each changing element in a scan line is
described as being a certain number of pixels from a
changing element in the current, coding line (horizontal
coding is performed) or in the preceding, reference line
(vertical coding is performed). The output codes used to
describe the actual positional information are called
Relative Element Address Designate (READ) codes.”
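Finding the changing elements of a scan line can be sketched as below. The name `changing_elements` is illustrative. Treating the line as if an imaginary white pixel preceded it, so that a line starting with black has a changing element at position 0, is an assumption of this sketch.

```python
def changing_elements(line):
    """Positions whose color differs from the previous pixel (0 = white, 1 = black)."""
    previous = 0  # imaginary white pixel before the line (assumption)
    positions = []
    for i, pixel in enumerate(line):
        if pixel != previous:
            positions.append(i)
        previous = pixel
    return positions

coding = [0, 0, 1, 1, 1, 0, 1]
reference = [0, 1, 1, 1, 1, 0, 0]
```

For these two lines the first changing elements are at positions 2 and 1, a difference of one pixel, which is the kind of small offset that the vertical coding described above handles cheaply.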
G32D CCITT Encoding
“Shorter code words are used to describe the color
transitions that are less than four pixels away from each
other on the code line or the reference line. Longer code
words are used to describe color transitions lying a
greater distance from the current changing element.
2D encoding is more efficient than 1-dimensional because
the usual data that is compressed (typed or handwritten
documents) contains a high amount of 2D coherence.”
G32D CCITT Encoding
“Because a G32D-encoded scan line is dependent on the
correctness of the preceding scan line, an error, such as a
burst of line noise, can affect multiple, 2-dimensionally
encoded scan lines. If a transmission error corrupts a
segment of encoded scan line data, that line cannot be
decoded. But, worse still, all scan lines occurring after it
also decode improperly.”
G32D CCITT Encoding
“To minimize the damage created by noise, G32D uses a
variable called a K factor and 2-dimensionally encodes
K-1 lines following a 1-dimensionally encoded line. If
corruption of the data transmission occurs, only K-1 scan
lines of data will be lost. The decoder will be able to
resync the decoding at the next available EOL code.”
G32D CCITT Encoding
“The typical value for K is 2 or 4. G32D data that is
encoded with a K value of 4 appears as a single block of
data. Each block contains three lines of 2D scan-line data
followed by a scan line of 1-dimensionally encoded data.”
G32D CCITT Encoding
“The K variable is not normally used in decoding the
G32D data. Instead, the EOL code is modified to indicate
the algorithm used to encode the line following it. If a 1
bit is appended to the EOL code, the line following is
1-dimensionally encoded; if a 0 bit is appended, the line
following the EOL code is 2-dimensionally encoded. All
other transmission code word markers (FILL and RTC)
follow the same rule as in G31D encoding. K is only
needed in decoding if regeneration of the previous
1-dimensionally encoded scan line is necessary for error
recovery.”
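The interplay of the K factor and the tagged EOL can be sketched as follows. The names `line_modes` and `tagged_eol` are illustrative. The sketch assumes that each block starts with the 1-dimensionally encoded line, followed by K-1 lines encoded 2-dimensionally.

```python
EOL = '000000000001'

def line_modes(num_lines, k):
    """One 1D-coded line followed by K-1 2D-coded lines, repeated."""
    return ['1D' if i % k == 0 else '2D' for i in range(num_lines)]

def tagged_eol(mode):
    """EOL plus tag bit: 1 if the following line is 1D-coded, 0 if it is 2D-coded."""
    return EOL + ('1' if mode == '1D' else '0')

modes = line_modes(8, 4)
markers = [tagged_eol(m) for m in modes]
```

With K = 4 the modes come out as 1D, 2D, 2D, 2D, 1D, 2D, 2D, 2D, so a transmission error can corrupt at most the lines up to the next 1D-coded line.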
G42D CCITT Encoding
“Group 4 Two-Dimensional (G42D) encoding was
developed from the G32D algorithm as a better 2D
compression scheme--so much better, in fact, that Group
4 has almost completely replaced G32D in commercial
use.”
G42D CCITT Encoding
“Group 4 encoding is identical to G32D encoding except
for a few modifications. Group 4 is basically the G32D
algorithm with no EOL codes and a K variable set to
infinity. Group 4 was designed specifically to encode data
residing on disk drives and data networks. The built-in
transmission error detection/correction found in Group 3
is therefore not needed by Group 4 data.”
G42D CCITT Encoding
“The first reference line in Group 4 encoding is an
imaginary scan line containing all white pixels. In G32D
encoding, the first reference line is the first scan line of
the image. In Group 4 encoding, the RTC code word is
replaced by an end of facsimile block (EOFB) code, which
consists of two consecutive Group 3 EOL code words.
Like the Group 3 RTC, the EOFB is also part of the
transmission protocol and not actually part of the image
data. Also, Group 4-encoded image data may be padded
out with fill bits after the EOFB to end on a byte
boundary.”
G42D CCITT Encoding
“Group 4 encoding will usually result in an image
compressed twice as small as if it were done with G31D
encoding. The main tradeoff is that Group 4 encoding is
more complex and requires more time to perform. When
implemented in hardware, however, the difference in
execution speed between the Group 3 and Group 4
algorithms is not significant, which usually makes Group
4 a better choice in most imaging system
implementations.”
CCITT documents
• "Standardization of Group 3 Facsimile Apparatus for Document
Transmission," Recommendation T.4, Volume VII, Fascicle VII.3,
Terminal Equipment and Protocols for Telematic Services, The
International Telegraph and Telephone Consultative Committee
(CCITT), Geneva, Switzerland, 1985, pp. 16-31.
• "Facsimile Coding Schemes and Coding Control Functions for
Group 4 Facsimile Apparatus," Recommendation T.6, Volume
VII, Fascicle VII.3, Terminal Equipment and Protocols for
Telematic Services, The International Telegraph and Telephone
Consultative Committee (CCITT), Geneva, Switzerland, 1985, pp.
40-48.
Thank you for your attention