In the Beginning was Information

According to Shannon's definition, the information content of a single message (whether it is one symbol, one syllable, or one word) is a measure of the uncertainty of its reception. Probabilities can only have values ranging from 0 to 1 (0 ≤ p ≤ 1), and it thus follows from equation (4) that I(p) ≥ 0, meaning that the numerical value of information content is always positive. The information content of a number of messages (e.g. symbols) is then given by requirement (i) as the sum of the values for the single messages:

$$I_{tot} = \mathrm{lb}(1/p_1) + \mathrm{lb}(1/p_2) + \dots + \mathrm{lb}(1/p_n) = \sum_{i=1}^{n} \mathrm{lb}(1/p_i) \qquad (5)$$

As shown in [G7], equation (5) can be reduced to the following mathematically equivalent relationship:

$$I_{tot} = n \times \sum_{i=1}^{N} p(x_i) \times \mathrm{lb}(1/p(x_i)) = n \times H \qquad (6)$$

Note the difference between n and N used with the summation sign Σ. In equation (5) the summation is taken over all n members of the received sequence of signs, but in (6) it is summed over the number N of symbols in the set of available symbols.

Explanation of the variables used in the formulas:

n = the number of symbols in a given (long) sequence (e.g. the total number of letters in a book)

N = the number of different symbols available, e.g.:
N = 2 for the binary symbols 0 and 1, and for the Morse code symbols · and –
N = 26 for the Latin alphabet: A, B, C, ..., Z
N = 26 × 26 = 676 for bigrams using the Latin alphabet: AA, AB, AC, ..., ZZ
N = 4 for the genetic code: A, C, G, T

x_i, i = 1 to N: the sequence of the N different symbols

I_tot = the information content of an entire sequence of symbols

H = the average information content of one symbol (or of a bigram or trigram; see Table 4); the average value of the information content of one single symbol, taken over a long sequence or even over the entire language (counted for many books from various types of literature).

Shannon's equations (6) and (8), used to find the total (statistical!) information content of a sequence of symbols (e.g. a sentence, a chapter, or a book), consist of two essentially different parts:
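The equivalence of equations (5) and (6) can be checked numerically. The sketch below (an illustration, not part of the original text) estimates each probability p(x_i) as the relative frequency of the symbol in the sequence, in which case grouping the identical terms of the sum in (5) yields exactly n × H; "lb" is the binary logarithm, log₂. The function name `information_content` is chosen here for illustration.

```python
from collections import Counter
from math import log2

def information_content(sequence):
    """Compute I_tot by equation (5) and by equation (6), plus H.

    p(x) is estimated as the relative frequency of symbol x in the
    sequence; under this estimate the two equations coincide exactly.
    Returns (I_tot by eq. 5, I_tot by eq. 6, H) in bits.
    """
    n = len(sequence)
    # One probability per distinct symbol (the N available symbols).
    p = {x: c / n for x, c in Counter(sequence).items()}
    # Equation (5): one lb(1/p_i) term for each of the n received symbols.
    i_tot_5 = sum(log2(1 / p[s]) for s in sequence)
    # Equation (6): H sums over the N distinct symbols; I_tot = n * H.
    H = sum(px * log2(1 / px) for px in p.values())
    return i_tot_5, n * H, H

# Genetic-code example from the text (N = 4): with all four symbols
# equally frequent, H = lb 4 = 2 bits per symbol.
i5, i6, h = information_content("ACGTACGT")
```

For the sequence "ACGTACGT" both routes give I_tot = 16 bits (8 symbols × 2 bits/symbol), matching the maximum entropy lb N = 2 bits for an alphabet of N = 4 equiprobable symbols.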
