13.06.2014 Views

Notes 05 - Tehnički fakultet u Rijeci


INFORMATION THEORY

Željko Jeričević, dr. sc.
Zavod za računarstvo, Tehnički fakultet &
Zavod za biologiju i medicinsku genetiku, Medicinski fakultet
51000 Rijeka, Croatia
Phone: (+385) 51-651 594
E-mail: zeljko.jericevic@riteh.hr
http://www.riteh.uniri.hr/~zeljkoj/Zeljko_Jericevic.html


Information theory

From the material covered so far we know that information should be prepared before it is sent through a channel. This is done by converting the information into a form whose entropy is close to the maximum, so that the transmission efficiency approaches its maximum. This can be achieved with lossless compression, e.g. Huffman coding.

A second transformation concerns the reliability of the transmission: the information is converted into a form in which certain types of errors can be corrected automatically (e.g. with Hamming coding).

10 February 2012 zeljko.jericevic@riteh.hr 2


Compression


Samuel F.B. Morse (1791-1872)

The letter e is the most frequently used.


Morse code

A dot is 1 bit and a dash is 3 bits. The gap inside a character is 1 bit, the gap between characters is 3 bits, and the gap between words is 7 bits.

The shortest character (the letter E) takes 1 bit, the longest character (the digit 0) takes 19 bits, and the gap between words takes 7 bits.

How does this compare with ASCII? Every character takes 8 bits, and the gap between words (a space) also takes 8 bits. An exact comparison requires the character frequencies.
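The two lengths quoted above follow directly from the timing rule. A minimal sketch, assuming a Morse character is written as a string of `.` and `-` (the function name `symbol_bits` is made up for this illustration):

```python
def symbol_bits(pattern: str) -> int:
    """Length of one Morse character in bit units.

    A dot counts 1 bit, a dash 3 bits, and every gap between two
    neighbouring marks inside the character counts 1 bit.
    """
    marks = sum(1 if mark == "." else 3 for mark in pattern)
    gaps = len(pattern) - 1  # one gap between each pair of neighbouring marks
    return marks + gaps


print(symbol_bits("."))      # letter E -> 1
print(symbol_bits("-----"))  # digit 0  -> 19
```

The digit 0 consists of five dashes (15 bits) and four inner gaps (4 bits), which gives the 19 bits stated on the slide.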


Exercise

Choose any English text from Project Gutenberg in ASCII format (at least 100 KB).

Determine the frequency of occurrence of every ASCII character in the text.

Compute how many bits the Morse representation of the text requires and compare the result with the number of bits required by the ASCII representation.
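One possible starting point for the exercise is sketched below. It is not a reference solution: the helper names (`symbol_bits`, `morse_bits`, `ascii_bits`) are invented here, only letters and digits are given Morse patterns, and every other character (punctuation, line breaks) is simply skipped on the Morse side, which is an assumption you may want to change.

```python
from collections import Counter

# International Morse code for letters and digits.
MORSE = {
    "A": ".-", "B": "-...", "C": "-.-.", "D": "-..", "E": ".", "F": "..-.",
    "G": "--.", "H": "....", "I": "..", "J": ".---", "K": "-.-", "L": ".-..",
    "M": "--", "N": "-.", "O": "---", "P": ".--.", "Q": "--.-", "R": ".-.",
    "S": "...", "T": "-", "U": "..-", "V": "...-", "W": ".--", "X": "-..-",
    "Y": "-.--", "Z": "--..",
    "0": "-----", "1": ".----", "2": "..---", "3": "...--", "4": "....-",
    "5": ".....", "6": "-....", "7": "--...", "8": "---..", "9": "----.",
}


def symbol_bits(pattern):
    """Dot = 1 bit, dash = 3 bits, gap inside a character = 1 bit."""
    return sum(1 if m == "." else 3 for m in pattern) + len(pattern) - 1


def morse_bits(text):
    """Total Morse length: 3-bit gaps between characters, 7-bit gaps between words."""
    words = []
    for word in text.upper().split():
        patterns = [MORSE[ch] for ch in word if ch in MORSE]  # unknown characters are skipped
        if patterns:
            words.append(patterns)
    total = 7 * max(len(words) - 1, 0)
    for patterns in words:
        total += sum(symbol_bits(p) for p in patterns) + 3 * (len(patterns) - 1)
    return total


def ascii_bits(text):
    """Every character, including the space, takes 8 bits."""
    return 8 * len(text)


sample = "this is an example of a huffman tree"
print(Counter(sample).most_common(3))  # character frequencies
print(morse_bits(sample), ascii_bits(sample))
```

For the actual exercise, `sample` would be replaced by the contents of the downloaded text file.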


David A. Huffman (1925-1999)

David Huffman is best known for his legendary Huffman code, a compression scheme for lossless variable length encoding. It was the result of a term paper he wrote while a graduate student at the Massachusetts Institute of Technology (MIT), where he earned a D.Sc. degree on a thesis named The Synthesis of Sequential Switching Circuits, advised by Samuel H. Caldwell (1953).

"Huffman Codes" are used in nearly every application that involves the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV), to name a few.

From Wikipedia


Shannon-Fano coding

Top-down coding (a precursor of Huffman coding)

“In Shannon–Fano coding, the symbols are arranged in order from most probable to least probable, and then divided into two sets whose total probabilities are as close as possible to being equal. All symbols then have the first digits of their codes assigned; symbols in the first set receive "0" and symbols in the second set receive "1". As long as any sets with more than one member remain, the same process is repeated on those sets, to determine successive digits of their codes. When a set has been reduced to one symbol, of course, this means the symbol's code is complete and will not form the prefix of any other symbol's code.” from Wikipedia




Shannon-Fano coding

Top-down coding (a precursor of Huffman coding)

ABACADACABEDAADBECAEBADCAECAEBADBACABAD

Symbol        A           B           C           D           E
Frequency     15          7           6           6           5
Probability   0.38461538  0.17948718  0.15384615  0.15384615  0.12820513


Shannon-Fano coding

Top-down coding

Symbol        A     B     C     D     E
Frequency     15    7     6     6     5
Probability   0.39  0.18  0.15  0.15  0.13
Code          00    01    10    110   111

$$\frac{2\ \text{bits} \cdot (15 + 7 + 6) + 3\ \text{bits} \cdot (6 + 5)}{39} \approx 2.28\ \text{bits/symbol}$$
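The top-down procedure can be sketched in a few lines of Python. This is an illustrative implementation, not the one used to produce the slides; the function name `shannon_fano` is made up, and when several cut points are equally balanced it simply takes the first one.

```python
from collections import Counter


def shannon_fano(freqs):
    """Return {symbol: code} for a {symbol: count} mapping (top-down splitting)."""
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    codes = {symbol: "" for symbol, _ in ordered}

    def split(group):
        if len(group) < 2:
            return  # a single symbol: its code is complete
        total = sum(count for _, count in group)
        running, best_cut, best_diff = 0, 1, None
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)  # |left sum - right sum|
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        left, right = group[:best_cut], group[best_cut:]
        for symbol, _ in left:
            codes[symbol] += "0"
        for symbol, _ in right:
            codes[symbol] += "1"
        split(left)
        split(right)

    split(ordered)
    return codes


text = "ABACADACABEDAADBECAEBADCAECAEBADBACABAD"
freqs = Counter(text)
codes = shannon_fano(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / len(text)
print(codes)              # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
print(round(average, 2))  # 2.28
```

The first split separates {A, B} (22 occurrences) from {C, D, E} (17 occurrences), which is the most balanced cut of the sorted list.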


Huffman coding

Bottom-up:

“D & E have the lowest frequencies and so are allocated 0 and 1 respectively and grouped together with a combined probability of 0.28205128. The lowest pair now are B and C so they're allocated 0 and 1 and grouped together with a combined probability of 0.33333333. This leaves BC and DE now with the lowest probabilities so 0 and 1 are prepended to their codes and they are combined. This then leaves just A and BCDE, which have 0 and 1 prepended respectively and are then combined. This leaves us with a single node and our algorithm is complete.” From Wikipedia


Huffman coding

Bottom-up

Symbol        A     B     C     D     E
Frequency     15    7     6     6     5
Probability   0.39  0.18  0.15  0.15  0.13
Code          0     100   101   110   111

$$\frac{1\ \text{bit} \cdot 15 + 3\ \text{bits} \cdot (7 + 6 + 6 + 5)}{39} \approx 2.23\ \text{bits/symbol}$$
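The bottom-up construction can be sketched with a priority queue. This is a minimal illustration (the function name `huffman` is made up); which of two equally frequent symbols gets the 0 and which gets the 1 depends on tie-breaking, so the exact bit patterns may differ from the table above while the code lengths stay the same.

```python
import heapq
from collections import Counter
from itertools import count


def huffman(freqs):
    """Return {symbol: code} built by repeatedly merging the two lightest groups."""
    tie = count()  # unique sequence number so the heap never has to compare dicts
    heap = [(weight, next(tie), {symbol: ""}) for symbol, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, group0 = heapq.heappop(heap)
        w1, _, group1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in group0.items()}       # prepend 0 to one group
        merged.update({s: "1" + c for s, c in group1.items()})  # prepend 1 to the other
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]


text = "ABACADACABEDAADBECAEBADCAECAEBADBACABAD"
freqs = Counter(text)
codes = huffman(freqs)
average = sum(freqs[s] * len(codes[s]) for s in freqs) / len(text)
print({s: len(c) for s, c in sorted(codes.items())})  # code lengths per symbol
print(round(average, 2))                              # 2.23
```

For this input A receives a 1-bit code and B, C, D and E receive 3-bit codes, so the average is 87/39 bits per symbol, slightly better than the Shannon-Fano result.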


Huffman coding

"this is an example of a huffman tree"

Character   Frequency   Code
space       7           111
a           4           010
e           4           000
f           3           1101
h           2           1010
i           2           1000
m           2           0111
n           2           0010
s           2           1011
t           2           0110
l           1           11001
o           1           00110
p           1           10011
r           1           11000
u           1           00111
x           1           10010
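To see what the code table buys, the sentence can be encoded and decoded with it. The sketch below copies the codes from the table; the helper names `encode` and `decode` are invented for this illustration. Decoding works bit by bit because no code is a prefix of another.

```python
CODES = {
    " ": "111", "a": "010", "e": "000", "f": "1101", "h": "1010", "i": "1000",
    "m": "0111", "n": "0010", "s": "1011", "t": "0110", "l": "11001",
    "o": "00110", "p": "10011", "r": "11000", "u": "00111", "x": "10010",
}


def encode(text):
    """Concatenate the code of every character."""
    return "".join(CODES[ch] for ch in text)


def decode(bits):
    """Read bits until they match a code, emit the character, start over."""
    lookup = {code: ch for ch, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)


message = "this is an example of a huffman tree"
bits = encode(message)
print(len(bits), 8 * len(message))  # 135 288
print(decode(bits) == message)      # True
```

The 36 characters need 135 bits with this code, compared with 288 bits in 8-bit ASCII.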


Huffman coding

"this is an example of a huffman tree"

[Slides 15-27 contained figures that are not preserved in this copy; slide 20 is credited "From Malan".]


Huffman coding

Symbol   p      Huffman code
A        0.2    ?
B        0.1    ?
C        0.1    ?
D        0.15   ?
E        0.45   ?


Huffman coding

Symbol   p      Huffman code
A        0.2    01
B        0.1    0000
C        0.1    0001
D        0.15   001
E        0.45   1
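A quick check of this solution, as a sketch: the average code length is the probability-weighted sum of the code lengths, and a complete prefix code satisfies the Kraft equality (the sum of 2^-length over all codes equals 1). The variable names are made up; exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

p = {
    "A": Fraction(20, 100),
    "B": Fraction(10, 100),
    "C": Fraction(10, 100),
    "D": Fraction(15, 100),
    "E": Fraction(45, 100),
}
codes = {"A": "01", "B": "0000", "C": "0001", "D": "001", "E": "1"}

average = sum(p[s] * len(codes[s]) for s in p)             # expected bits per symbol
kraft = sum(Fraction(1, 2 ** len(c)) for c in codes.values())

print(float(average))  # 2.1
print(kraft)           # 1
```

A fixed-length code for five symbols would need 3 bits per symbol, so the variable-length code saves 0.9 bits per symbol on average for these probabilities.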


Modified Huffman coding

Modified Huffman coding is used in fax machines to encode black-on-white bitmaps. It combines variable-length Huffman codes with run-length coding.

Black-on-white images are coded with 1 bit per pixel (white bits have the value 0, black bits have the value 1). Runs of black and white pixels are counted and sent as variable-length Huffman codes.


Modified Huffman coding: CCITT (Huffman) Encoding

“CCITT (International Telegraph and Telephone Consultative Committee) is a standards organization that has developed a series of communications protocols for the facsimile transmission of black-and-white images over telephone lines and data networks. These protocols are known officially as the CCITT T.4 and T.6 standards but are more commonly referred to as CCITT Group 3 and Group 4 compression, respectively.”


CCITT Encodings

“Group 3 and Group 4 encodings are compression algorithms that are specifically designed for encoding 1-bit image data. Many document and FAX file formats support Group 3 compression, and several, including TIFF, also support Group 4.”


CCITT Encodings

“Group 3 encoding was designed specifically for bilevel, black-and-white image data telecommunications. All modern FAX machines and FAX modems support Group 3 facsimile transmissions. Group 3 encoding and decoding is fast, maintains a good compression ratio for a wide variety of document data, and contains information that aids a Group 3 decoder in detecting and correcting errors without special hardware.”


CCITT Encodings

“Group 4 is a more efficient form of bi-level compression that has almost entirely replaced the use of Group 3 in many conventional document image storage systems. (An exception is facsimile document storage systems where original Group 3 images are required to be stored in an unaltered state.)”


CCITT Encodings

“Group 4 encoded data is approximately half the size of 1-dimensional Group 3-encoded data. Although Group 4 is fairly difficult to implement efficiently, it encodes at least as fast as Group 3 and in some implementations decodes even faster. Also, Group 4 was designed for use on data networks, so it does not contain the synchronization codes used for error detection that Group 3 does, making it a poor choice for an image transfer protocol.”


CCITT Encodings

“Group 3 normally achieves a compression ratio of 5:1 to 8:1 on a standard 200-dpi (204x196 dpi), A4-sized document. Group 4 results are roughly twice as efficient as Group 3, achieving compression ratios upwards of 15:1 with the same document. Claims that the CCITT algorithms are capable of far better compression on standard business documents are exaggerated--largely by hardware vendors.”


CCITT Encodings

“Because the CCITT algorithms have been optimized for type and handwritten documents, it stands to reason that images radically different in composition will not compress very well. This is all too true. Bi-level bitmaps that contain a high frequency of short runs, as typically found in digitally half-toned continuous-tone images, do not compress as well using the CCITT algorithms. Such images will usually result in a compression ratio of 3:1 or even lower, and many will actually compress to a size larger than the original.”


CCITT Encodings

“The CCITT actually defines three algorithms for the encoding of bi-level image data:
• Group 3 One-Dimensional (G31D)
• Group 3 Two-Dimensional (G32D)
• Group 4 Two-Dimensional (G42D)”


Modified Huffman coding

Each line is coded as alternating runs of white and black bits. Runs of length 63 or less are coded with a so-called terminating code. Runs of length 64 or more have a make-up code in front of the terminating code.
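The rule for splitting a run into a make-up part and a terminating part can be sketched as follows. The function name `split_run` is made up. The treatment of runs longer than 2560 + 63 pixels (repeating the 2560 make-up code) is an assumption inferred from the worked example later in these notes that sends three 2560 codes, one 1088 code and a terminating code for 32.

```python
def split_run(length):
    """Split a run length into make-up values (multiples of 64) and a terminating value (0..63)."""
    makeups = []
    while length >= 2624:  # longer than 2560 + 63: emit another 2560 make-up code
        makeups.append(2560)
        length -= 2560
    if length >= 64:
        makeups.append(length // 64 * 64)  # largest multiple of 64 not exceeding the rest
    return makeups, length % 64


print(split_run(20))    # ([], 20)
print(split_run(100))   # ([64], 36)
print(split_run(1266))  # ([1216], 50)
print(split_run(8800))  # ([2560, 2560, 2560, 1088], 32)
```

A terminating code is always sent, even when the remainder is 0, which is why the terminating table contains an entry for run length 0.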


Modified Huffman coding

The codes are fixed in advance, based on representative statistics for printed documents (85% white, 15% black; short black runs are more likely than short white runs, long white runs are more likely than long black runs; every row of the document starts with a white run).
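Before any code table is consulted, a row of pixels has to be turned into run lengths. A minimal sketch, assuming a row is a list of 0 (white) and 1 (black) values; the name `run_lengths` is invented here. Because every row must start with a white run, a white run of length 0 is inserted when the first pixel is black.

```python
from itertools import groupby

WHITE, BLACK = 0, 1


def run_lengths(row):
    """Return [(colour, length), ...] for one row, always starting with a white run."""
    runs = [(colour, len(list(group))) for colour, group in groupby(row)]
    if runs and runs[0][0] == BLACK:
        runs.insert(0, (WHITE, 0))  # empty white run in front of a leading black run
    return runs


print(run_lengths([1, 0, 0, 0, 0, 1, 1, 0, 1]))
# [(0, 0), (1, 1), (0, 4), (1, 2), (0, 1), (1, 1)]
```

Since the colours alternate and the first run is white by convention, only the lengths need to be transmitted.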




Modified Huffman coding

Terminating codes

Run length   White bits   Black bits
0            00110101     0000110111
1            000111       010
2            0111         11
3            1000         10
4            1011         011
5            1100         0011
6            1110         0010
7            1111         00011
8            10011        000101


Modified Huffman coding

Terminating codes

Run length   White bits   Black bits
11           01000        0000101
...
17           101011       0000011000
...
28           0011000      000011001100
...
63           00110100     000001100111


Modified Huffman coding

Make-up codes (white and black)

Run length   White bits    Black bits
64           11011         0000001111
128          10010         000011001000
192          010111        000011001001
256          0110111       000001011011
320          00110110      000000110011
...
1600         010011010     0000001011011
1664         011000        0000001100100
1728         010011011     0000001100101
1792         00000001000   00000001000
… 2560


Modified Huffman coding

Special codes

“Several special code words are also defined in a Group 3-encoded data stream. These codes are used to provide synchronization in the event that a phone transmission experiences a burst of noise. By recognizing this special code, a CCITT decoder may identify transmission errors and attempt to apply a recovery algorithm that approximates the lost data.”


Modified Huffman coding

Special codes

“The EOL code is a 12-bit code word that begins each line in a Group 3 transmission. This unique code word is used to detect the start/end of a scan line during the image transmission. If a burst of noise temporarily corrupts the signal, a Group 3 decoder throws away the unrecognized data it receives until it encounters an EOL code. The decoder would then start receiving the transmission as normal again, assuming that the data following the EOL is the beginning of the next scan line. The decoder might also replace the bad line with a predefined set of data, such as a white scan line.”


Modified Huffman coding

Special codes

A decoder also uses EOL codes for several purposes. It uses them to keep track of the width of a decoded scan line. (An incorrect scan-line width may be an error, or it may be an indication to pad with white pixels to the EOL.) In addition, it uses EOL codes to keep track of the number of scan lines in an image, in order to detect a short image. If it finds one, it pads the remaining length with scan lines of all white pixels.

EOL is 000000000001

RTC (Return To Control) is 6 consecutive EOL codes and signifies the end of message transmission.
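The role of EOL and RTC can be illustrated on a bit stream held as a Python string of "0" and "1" characters. This is only a sketch under simplifying assumptions: the stream is error-free, every coded line is preceded by an EOL, no coded line is empty, and the name `split_scan_lines` is invented here.

```python
EOL = "000000000001"  # 11 zeros followed by a 1
RTC = EOL * 6         # six consecutive EOL codes end the message


def split_scan_lines(bits):
    """Return (coded_lines, saw_rtc) for a stream of EOL-delimited coded lines."""
    saw_rtc = bits.endswith(RTC)
    if saw_rtc:
        bits = bits[: -len(RTC)]  # drop the end-of-message marker
    lines = [chunk for chunk in bits.split(EOL) if chunk]
    return lines, saw_rtc


stream = EOL + "1011" + EOL + "11" + RTC
print(split_scan_lines(stream))  # (['1011', '11'], True)
```

A real decoder works on the bit level and resynchronizes by discarding data until the next EOL, as described in the quoted text; the string search above only shows how the EOL pattern delimits lines.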




Modified Huffman coding

Examples

a) Only the terminating code for 20 black bits needs to be sent: 00001101000


Modified Huffman coding

Examples

b) We send the make-up code for 64 white bits (11011) and the terminating code for 36 white bits: 00010101


Modified Huffman coding

Examples

c) We send 4 make-up codes: 3 for 2560 black bits (000000011111) and one for 1088 black bits (0000001110101), followed by the terminating code for 32 black bits: 000001101010


Modified Huffman coding

Examples

The row starts with a black bit (which is unusual), so we first insert a white run of length 0. Then comes the code for 1 black bit, followed by the code for 4 white bits, the code for 2 black bits, the code for 1 white bit, the code for 1 black bit, the make-up code for 1216 white bits, the terminating code for 50 white bits, and finally EOL.

Run          Code
0 white      00110101
1 black      010
4 white      1011
2 black      11
1 white      000111
1 black      010
1266 white   011011000 + 01010011
EOL          000000000001
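The example row can be reproduced with a small encoder. The sketch below is deliberately partial: its tables contain only the codes this one row needs (the code for a single white bit, 000111, is taken from the terminating-code table earlier in these notes), and the names `TERMINATING`, `MAKEUP` and `encode_row` are made up for the illustration.

```python
# Partial code tables: just the entries needed for the example row.
TERMINATING = {
    "W": {0: "00110101", 1: "000111", 4: "1011", 50: "01010011"},
    "B": {1: "010", 2: "11"},
}
MAKEUP = {
    "W": {1216: "011011000"},
    "B": {},
}
EOL = "000000000001"


def encode_row(runs):
    """Encode [(colour, length), ...] runs, colour 'W' or 'B', and append EOL."""
    out = []
    for colour, length in runs:
        makeup, rest = length // 64 * 64, length % 64
        if makeup:
            out.append(MAKEUP[colour][makeup])  # make-up code for the multiple of 64
        out.append(TERMINATING[colour][rest])   # terminating code for the remainder
    out.append(EOL)
    return "".join(out)


row = [("W", 0), ("B", 1), ("W", 4), ("B", 2), ("W", 1), ("B", 1), ("W", 1266)]
bits = encode_row(row)
print(bits)
print(len(bits))  # 55
```

The 1275 pixels of this row are sent in 55 bits, including the 12-bit EOL. A run length missing from the partial tables raises a `KeyError`, which makes the omission visible instead of silently producing wrong output.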




G32D CCITT Encoding

“With Group 3 Two-Dimensional (G32D) encoding, the way a scan line is encoded may depend on the immediately preceding scan-line data. Many images have a high degree of vertical coherence (redundancy). By describing the differences between two scan lines, rather than describing the scan line contents, 2D encoding achieves better compression.”


G32D CCITT Encoding

“The first pixel of each run length is called a changing element. Each changing element marks a color transition within a scan line (the point where a run of one color ends and a run of the next color begins).

The position of each changing element in a scan line is described as being a certain number of pixels from a changing element in the current, coding line (horizontal coding is performed) or in the preceding, reference line (vertical coding is performed). The output codes used to describe the actual positional information are called Relative Element Address Designate (READ) codes.”
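The notion of a changing element can be made concrete with a short sketch. It assumes a scan line is a list of 0 (white) and 1 (black) pixels and that an imaginary white pixel precedes the line, so a leading black pixel counts as a changing element; the function name `changing_elements` is invented here.

```python
def changing_elements(row, colour_before_line=0):
    """Return the indices at which the colour differs from the previous pixel."""
    positions = []
    previous = colour_before_line  # assumption: imaginary white pixel before the line
    for index, pixel in enumerate(row):
        if pixel != previous:
            positions.append(index)
        previous = pixel
    return positions


reference_line = [0, 0, 1, 1, 0, 1]
coding_line = [0, 1, 1, 1, 0, 1]
print(changing_elements(reference_line))  # [2, 4, 5]
print(changing_elements(coding_line))     # [1, 4, 5]
```

In this pair of lines the first transition moved by one pixel and the other two did not move at all, which is the kind of vertical coherence that 2D coding exploits.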


G32D CCITT Encoding

“Shorter code words are used to describe the color transitions that are less than four pixels away from each other on the code line or the reference line. Longer code words are used to describe color transitions lying a greater distance from the current changing element.

2D encoding is more efficient than 1-dimensional because the usual data that is compressed (typed or handwritten documents) contains a high amount of 2D coherence.”


G32D CCITT Encoding

“Because a G32D-encoded scan line is dependent on the correctness of the preceding scan line, an error, such as a burst of line noise, can affect multiple, 2-dimensionally encoded scan lines. If a transmission error corrupts a segment of encoded scan line data, that line cannot be decoded. But, worse still, all scan lines occurring after it also decode improperly.”


G32D CCITT Encoding

“To minimize the damage created by noise, G32D uses a variable called a K factor and 2-dimensionally encodes K-1 lines following a 1-dimensionally encoded line. If corruption of the data transmission occurs, only K-1 scan lines of data will be lost. The decoder will be able to resync the decoding at the next available EOL code.”
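The effect of the K factor on the sequence of line modes can be sketched as follows. The function name `line_modes` is made up, and the sketch assumes that the image starts with a 1-dimensionally coded line.

```python
def line_modes(n_lines, k):
    """One 1D-coded line followed by K-1 2D-coded lines, repeated."""
    return ["1D" if index % k == 0 else "2D" for index in range(n_lines)]


print(line_modes(8, 4))
# ['1D', '2D', '2D', '2D', '1D', '2D', '2D', '2D']
print(line_modes(4, 2))
# ['1D', '2D', '1D', '2D']
```

A larger K gives better compression, because more lines are coded relative to their predecessor, but an error can then damage more lines before the next 1D line restores a clean reference.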


G32D CCITT Encoding

“The typical value for K is 2 or 4. G32D data that is encoded with a K value of 4 appears as a single block of data. Each block contains three lines of 2D scan-line data followed by a scan line of 1-dimensionally encoded data.”


G32D CCITT Encoding

“The K variable is not normally used in decoding the G32D data. Instead, the EOL code is modified to indicate the algorithm used to encode the line following it. If a 1 bit is appended to the EOL code, the line following is 1-dimensionally encoded; if a 0 bit is appended, the line following the EOL code is 2-dimensionally encoded. All other transmission code word markers (FILL and RTC) follow the same rule as in G31D encoding. K is only needed in decoding if regeneration of the previous 1-dimensionally encoded scan line is necessary for error recovery.”
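The tag bit described in the quotation can be sketched in a few lines. Bit streams are again represented as strings of "0" and "1", and the function names `tagged_eol` and `mode_after_eol` are invented for this illustration.

```python
EOL = "000000000001"


def tagged_eol(mode):
    """EOL followed by 1 for a 1D-coded line, 0 for a 2D-coded line."""
    return EOL + ("1" if mode == "1D" else "0")


def mode_after_eol(bits):
    """Read the tag bit that follows an EOL at the start of `bits`."""
    if not bits.startswith(EOL) or len(bits) <= len(EOL):
        raise ValueError("expected EOL followed by a tag bit")
    return "1D" if bits[len(EOL)] == "1" else "2D"


print(tagged_eol("1D"))                  # 0000000000011
print(mode_after_eol(tagged_eol("2D")))  # 2D
```

Because the mode travels with every line, the decoder does not need to know K in normal operation, as the quoted text points out.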




G42D CCITT Encoding

“Group 4 Two-Dimensional (G42D) encoding was developed from the G32D algorithm as a better 2D compression scheme--so much better, in fact, that Group 4 has almost completely replaced G32D in commercial use.”


G42D CCITT Encoding

“Group 4 encoding is identical to G32D encoding except for a few modifications. Group 4 is basically the G32D algorithm with no EOL codes and a K variable set to infinity. Group 4 was designed specifically to encode data residing on disk drives and data networks. The built-in transmission error detection/correction found in Group 3 is therefore not needed by Group 4 data.”


G42D CCITT Encoding

“The first reference line in Group 4 encoding is an imaginary scan line containing all white pixels. In G32D encoding, the first reference line is the first scan line of the image. In Group 4 encoding, the RTC code word is replaced by an end of facsimile block (EOFB) code, which consists of two consecutive Group 3 EOL code words. Like the Group 3 RTC, the EOFB is also part of the transmission protocol and not actually part of the image data. Also, Group 4-encoded image data may be padded out with fill bits after the EOFB to end on a byte boundary.”


G42D CCITT Encoding

“Group 4 encoding will usually result in an image compressed twice as small as if it were done with G31D encoding. The main tradeoff is that Group 4 encoding is more complex and requires more time to perform. When implemented in hardware, however, the difference in execution speed between the Group 3 and Group 4 algorithms is not significant, which usually makes Group 4 a better choice in most imaging system implementations.”


CCITT documents

• "Standardization of Group 3 Facsimile Apparatus for Document Transmission," Recommendation T.4, Volume VII, Fascicle VII.3, Terminal Equipment and Protocols for Telematic Services, The International Telegraph and Telephone Consultative Committee (CCITT), Geneva, Switzerland, 1985, pp. 16-31.

• "Facsimile Coding Schemes and Coding Control Functions for Group 4 Facsimile Apparatus," Recommendation T.6, Volume VII, Fascicle VII.3, Terminal Equipment and Protocols for Telematic Services, The International Telegraph and Telephone Consultative Committee (CCITT), Geneva, Switzerland, 1985, pp. 40-48.


Thank you for your attention
