13.07.2015 Views

Context free languages A language recognizer is a device that ...


Context-free languages

A language recognizer is a device that accepts valid strings produced in a given language. Finite state automata are formalized types of language recognizers.

A language generator begins, when given a start signal, to construct a string. Its operation is not completely determined from the beginning but is limited by a set of rules.

Regular expressions can be viewed as language generators. For example, consider the regular expression a(a* ∪ b*)b. A verbal description of how to generate a string in accordance with this expression would be:

First output an a. Then do one of the following two things: either output a number of a’s or output a number of b’s. Finally output a b.

In this chapter, we will study certain more complex sorts of language generators, called context-free grammars, which are based on a more complete understanding of the structure of the strings belonging to the language.

To take the example of a(a* ∪ b*)b, we can describe the strings generated by this language as consisting of a leading a, followed by a middle part M, to be made explicit later on, followed by a trailing b. Let S be a new symbol interpreted as “a string in the language”. We can express this observation by writing:
S → aMb
where → is read “can be”. We call such an expression a rule.

What can M be? Either a string of a’s or a string of b’s. We express this by adding:
M → A
M → B
where A and B are new symbols that stand for strings of a’s and b’s respectively.


Now, let’s define a string of a’s. First, it can be empty, so we add the rule:
A → ε
or it may consist of a leading a followed by a string of a’s:
A → aA
Similarly for B:
B → ε and
B → bB

The language denoted by the regular expression a(a* ∪ b*)b can be defined by the following rules:
S → aMb
M → A
M → B
A → ε
A → aA
B → ε
B → bB

For example, to generate the string aaab:
1. We start with the symbol S, as specified.
2. We replace S by aMb according to the first rule, S → aMb.
3. To aMb we apply the rule M → A and obtain aAb.
4. We then twice apply the rule A → aA to get the string aaaAb.
5. Finally we apply the rule A → ε.
6. In the resulting string aaab we cannot identify any symbol that appears to the left of → in some rule, thus the operation of our generator has ended and aaab was produced.

A context-free grammar is a language generator that operates with a set of rules. Context-free grammars are a more powerful method of representing languages than regular expressions. Such grammars can describe features that have a recursive structure.

In a context-free grammar, some symbols appear to the left of → in rules (S, M, A and B in our example) and some do not (a and b). Symbols of the latter kind are called terminals, since the generated string is made up of these symbols.
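The six-step derivation of aaab can be replayed mechanically. The following Python sketch is only an illustration: the names RULES, apply_rule and derive are made up for this example, and the empty string "" is assumed to stand for ε. Each step replaces the leftmost occurrence of a rule's left side by its right side.

```python
# Rules of the example grammar; "" stands for the empty string (epsilon).
RULES = {
    "S": ["aMb"],
    "M": ["A", "B"],
    "A": ["", "aA"],
    "B": ["", "bB"],
}

def apply_rule(form, head, body):
    """Replace the leftmost occurrence of `head` in `form` by `body`."""
    if body not in RULES.get(head, []):
        raise ValueError(f"{head} -> {body!r} is not a rule of the grammar")
    if head not in form:
        raise ValueError(f"{head} does not occur in {form!r}")
    return form.replace(head, body, 1)

def derive(steps, start="S"):
    """Return every intermediate string obtained by applying `steps` in order."""
    forms = [start]
    for head, body in steps:
        forms.append(apply_rule(forms[-1], head, body))
    return forms

# Steps 2 to 5 of the worked example.
steps = [("S", "aMb"), ("M", "A"), ("A", "aA"), ("A", "aA"), ("A", "")]
print(" => ".join(derive(steps)))
# S => aMb => aAb => aaAb => aaaAb => aaab
```

The generator stops at aaab because no upper-case symbol, that is, no left side of a rule, remains in the string.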


The other symbols are called nonterminal symbols. A rule of the form X → Y is called a production rule or a substitution rule.

An important application of context-free grammars occurs in the specification and compilation of programming languages.

Definition of Context-Free Grammars:
There are four important components in a grammatical description of a language:
1. There is a finite set of symbols that form the strings of the language being defined. We call this set the terminals or terminal symbols.
2. There is a finite set of variables, also called nonterminals or syntactic categories. Each variable represents a language, i.e. a set of strings.
3. One of the variables represents the language being defined; it is called the start symbol. Other variables represent auxiliary classes of strings that are used to help define the language of the start symbol.
4. There is a finite set of productions or rules that represent the recursive definition of a language. Each production consists of:
a. A variable that is being (partially) defined by the production. This variable is often called the head of the production.
b. The production symbol →.
c. A string of zero or more terminals and variables. This string, called the body of the production, represents one way to form strings in the language of the variable of the head. In so doing, we leave terminals unchanged and substitute for each variable of the body any string that is known to be in the language of that variable.
The four components just described form a context-free grammar.

Formal Definition of a Context-Free Grammar:
A context-free grammar is a 4-tuple (V, Σ, R, S) where:
1. V is a finite set called the variables, or nonterminals,
2. Σ is a finite set, disjoint from V, called the terminals,
3. R is a finite set of rules, with each rule being a variable and a string of variables and terminals, and
4. S ∈ V is the start variable.
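As a rough illustration of the formal definition, the 4-tuple can be written down as a small Python record together with a check of the four conditions. The names CFG, is_well_formed and example are assumptions made for this sketch; a rule is stored as a (head, body) pair whose body is a tuple of symbols, with the empty tuple standing for ε.

```python
from typing import NamedTuple

class CFG(NamedTuple):
    variables: frozenset  # V
    terminals: frozenset  # Sigma
    rules: tuple          # R: (head, body) pairs, body is a tuple of symbols
    start: str            # S

def is_well_formed(g):
    """Check the conditions of the 4-tuple definition."""
    symbols = g.variables | g.terminals
    return (
        g.variables.isdisjoint(g.terminals)                              # Sigma disjoint from V
        and g.start in g.variables                                       # S is a variable
        and all(head in g.variables for head, _ in g.rules)              # heads are variables
        and all(sym in symbols for _, body in g.rules for sym in body)   # bodies over V and Sigma
    )

# The grammar for a(a* U b*)b from the beginning of the chapter.
example = CFG(
    variables=frozenset({"S", "M", "A", "B"}),
    terminals=frozenset({"a", "b"}),
    rules=(
        ("S", ("a", "M", "b")),
        ("M", ("A",)), ("M", ("B",)),
        ("A", ()), ("A", ("a", "A")),
        ("B", ()), ("B", ("b", "B")),
    ),
    start="S",
)
print(is_well_formed(example))  # True
```

Finiteness of V, Σ and R is automatic here because the sets are written out explicitly.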


If u, v and w are strings of variables and terminals, and A → w is a rule of the grammar, we say that uAv yields uwv, written uAv ⇒ uwv.
We say that u derives v, written u ⇒* v, if
1. u = v, or
2. a sequence u_1, u_2, …, u_k exists for k ≥ 0, such that u ⇒ u_1 ⇒ u_2 ⇒ … ⇒ u_k ⇒ v.

Consider grammar G4 = (V, Σ, R, ⟨EXPR⟩), where V = {⟨EXPR⟩, ⟨TERM⟩, ⟨FACTOR⟩} and Σ = {a, +, x, (, )}. The rules are:
⟨EXPR⟩ → ⟨EXPR⟩ + ⟨TERM⟩ | ⟨TERM⟩
⟨TERM⟩ → ⟨TERM⟩ x ⟨FACTOR⟩ | ⟨FACTOR⟩
⟨FACTOR⟩ → ( ⟨EXPR⟩ ) | a
The two strings a+a x a and (a+a) x a can be generated with grammar G4.

Parsing
A compiler translates code written in a programming language into another form, usually more suitable for execution. To do so, the compiler must recover the structure of the code; this process is called parsing.

Parse Trees
A parse tree for a grammar G is a tree where:
• the root is the start symbol for G
• the interior nodes are nonterminals of G
• the leaf nodes are terminal symbols of G (or ε)
• the children of a node T (from left to right) correspond to the symbols on the right-hand side of some production for T in G.
Every terminal string generated by a grammar has a corresponding parse tree; every valid parse tree represents a string generated by the grammar (called the yield of the parse tree).
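The parse tree conditions can be checked by a short program. The sketch below is an assumption-laden illustration, not part of the original notes: it reuses the grammar for a(a* ∪ b*)b from the start of the chapter, represents a leaf as a plain string and an interior node as a pair (variable, list of children), and represents an ε-production by an empty list of children. The names RULES, label, is_parse_tree and tree_yield are hypothetical.

```python
RULES = {
    "S": [["a", "M", "b"]],
    "M": [["A"], ["B"]],
    "A": [[], ["a", "A"]],
    "B": [[], ["b", "B"]],
}

def label(tree):
    """A leaf is a plain string; an interior node is (variable, children)."""
    return tree if isinstance(tree, str) else tree[0]

def is_parse_tree(tree):
    """Every interior node must match some production for its variable."""
    if isinstance(tree, str):
        return tree not in RULES                  # leaves are terminals
    head, children = tree
    body = [label(child) for child in children]   # labels read left to right
    return body in RULES.get(head, []) and all(is_parse_tree(c) for c in children)

def tree_yield(tree):
    """Concatenate the leaves from left to right."""
    if isinstance(tree, str):
        return tree
    return "".join(tree_yield(child) for child in tree[1])

# Parse tree for aaab: S -> aMb, M -> A, A -> aA, A -> aA, A -> epsilon.
tree = ("S", ["a", ("M", [("A", ["a", ("A", ["a", ("A", [])])])]), "b"])
print(label(tree) == "S", is_parse_tree(tree), tree_yield(tree))
# True True aaab
```

The yield of the tree is the generated string, which is the correspondence stated in the last sentence above.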


Example: Given the following grammar, find a parse tree for the string 1 + 2 * 3:
1. E --> D
2. E --> ( E )
3. E --> E + E
4. E --> E - E
5. E --> E * E
6. E --> E / E
7. D --> 0 | 1 | 2 | ... | 9

The parse tree is:
E --> E --> D --> 1
      +
      E --> E --> D --> 2
            *
            E --> D --> 3

Ambiguous Grammars
A grammar for which some terminal string has two different parse trees is said to be ambiguous.
The grammar for balanced parentheses given below is an example of an ambiguous grammar:
P --> ( P ) | P P | epsilon
We can prove this grammar is ambiguous by demonstrating two parse trees for the same terminal string.

Here are two parse trees for the empty string:
1)
P --> P --> epsilon
      P --> epsilon
2)
P --> epsilon


Here are two parse trees for ():
1)
P --> P --> (
            P --> epsilon
            )
      P --> epsilon
2)
P --> P --> epsilon
      P --> (
            P --> epsilon
            )

To prove that a grammar is ambiguous, it is sufficient to demonstrate that the grammar generates two distinct parse trees for the same terminal string.
An unambiguous grammar for the same language (that is, the set of strings consisting of balanced parentheses) is:
P --> ( P ) P | epsilon

The Problem of Ambiguous Grammars
A parse tree is supposed to display the structure used by a grammar to generate an input string. This structure is not unique if the grammar is ambiguous. A problem arises if we attempt to impart meaning to an input string using a parse tree; if the parse tree is not unique, then the string has multiple meanings.
We typically use a grammar to define the syntax of a programming language. The structure of the parse tree produced by the grammar imparts some meaning on the strings of the language.
If the grammar is ambiguous, the compiler has no way to determine which of two meanings to use. Thus, the code produced by the compiler is not fully determined by the program input to the compiler.
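One way to see why the grammar P --> ( P ) P | epsilon leaves no choice to the parser is to write a recognizer for it. In the sketch below (the function names parse_P and balanced are made up for this illustration), the next input symbol alone decides which of the two productions to use, so each balanced string is analysed in exactly one way.

```python
def parse_P(s, i=0):
    """Consume a string derivable from P starting at index i; return the index after it."""
    if i < len(s) and s[i] == "(":
        # Production P --> ( P ) P
        j = parse_P(s, i + 1)              # inner P
        if j < len(s) and s[j] == ")":
            return parse_P(s, j + 1)       # trailing P
        raise ValueError(f"expected ')' at position {j}")
    # Production P --> epsilon: consume nothing
    return i

def balanced(s):
    """True if the whole string s is generated by P."""
    try:
        return parse_P(s) == len(s)
    except ValueError:
        return False

for text in ["", "()", "(()())()", "(()", ")("]:
    print(repr(text), balanced(text))
```

With the ambiguous grammar P --> ( P ) | P P | epsilon no such single-symbol decision is possible, because P --> P P can be applied at any point.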


Ambiguous Precedence
Consider the grammar for expressions given earlier, with number in place of the digits:
1. E --> number
2. E --> ( E )
3. E --> E + E
4. E --> E - E
5. E --> E * E
6. E --> E / E
This grammar is ambiguous, as shown by the two parse trees for the input string “number + number * number”:
1)
E --> E --> number
      +
      E --> E --> number
            *
            E --> number
2)
E --> E --> E --> number
            +
            E --> number
      *
      E --> number
The first parse tree gives precedence to multiplication over addition; the second parse tree gives precedence to addition over multiplication. In most programming languages, only the former meaning is correct. As written, this grammar is ambiguous with respect to the precedence of the arithmetic operators.

Ambiguous Associativity
Consider again the same grammar for expressions:
1. E --> number
2. E --> ( E )
3. E --> E + E
4. E --> E - E
5. E --> E * E
6. E --> E / E


This grammar is ambiguous even if we only consider operators at the same precedence level, as in the input string “number - number + number”:
1)
E --> E --> number
      -
      E --> E --> number
            +
            E --> number
2)
E --> E --> E --> number
            -
            E --> number
      +
      E --> number
The first parse tree (incorrectly) gives precedence to the addition operator; the second parse tree gives precedence to the subtraction operator. Since we normally group operators left to right within a precedence level, only the latter interpretation is correct.

Definition:
If L is a context-free language for which there exists an unambiguous grammar, then L is said to be unambiguous. If every grammar that generates L is ambiguous, then the language L is called inherently ambiguous.

Consider the language:
L = {a^n b^n c^m} ∪ {a^n b^m c^m} (n, m ≥ 0)
L = L1 ∪ L2
L1 is generated by:
S1 → AC
A → aAb | epsilon
C → cC | epsilon


L2 is generated by:
S2 → BD
B → aB | epsilon
D → bDc | epsilon
Then L is generated by a combination of these two grammars with the additional production
S → S1 | S2
The grammar is ambiguous, since the string a^n b^n c^n has two derivations: one starting with S → S1 and one starting with S → S2. This shows only that this particular grammar is ambiguous; showing that every grammar for L is ambiguous takes more work, but L is in fact a standard example of an inherently ambiguous language.

An Unambiguous Grammar for Expressions
It is possible to write a grammar for arithmetic expressions that
• is unambiguous
• enforces the precedence of * and / over + and -
• enforces left associativity
Here is one such grammar:
1. E --> E + T | E - T | T
2. T --> T * F | T / F | F
3. F --> ( E ) | number
If we attempt to build a parse tree for number - number + number, we see there is only one such tree:
E --> E --> E --> T --> F --> number
            -
            T --> F --> number
      +
      T --> F --> number
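To see what the unique parse tree means in practice, the grammar can be turned into a small evaluator. This is a sketch under stated assumptions: the function names are invented for the example, numbers are taken to be non-negative integers, and the left-recursive rules E --> E + T and T --> T * F are implemented as loops that fold operands from left to right, which gives the same grouping as the left-recursive parse tree.

```python
import re

def tokenize(text):
    # Numbers and the symbols + - * / ( ); any other character is skipped.
    return re.findall(r"\d+|[-+*/()]", text)

def parse_expr(tokens, i):
    # E --> E + T | E - T | T, folded left to right
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] in ("+", "-"):
        op = tokens[i]
        rhs, i = parse_term(tokens, i + 1)
        value = value + rhs if op == "+" else value - rhs
    return value, i

def parse_term(tokens, i):
    # T --> T * F | T / F | F, folded left to right
    value, i = parse_factor(tokens, i)
    while i < len(tokens) and tokens[i] in ("*", "/"):
        op = tokens[i]
        rhs, i = parse_factor(tokens, i + 1)
        value = value * rhs if op == "*" else value / rhs
    return value, i

def parse_factor(tokens, i):
    # F --> ( E ) | number
    if i < len(tokens) and tokens[i] == "(":
        value, i = parse_expr(tokens, i + 1)
        if i >= len(tokens) or tokens[i] != ")":
            raise ValueError("expected ')'")
        return value, i + 1
    if i < len(tokens) and tokens[i].isdigit():
        return int(tokens[i]), i + 1
    raise ValueError("expected a number or '('")

def evaluate(text):
    tokens = tokenize(text)
    value, i = parse_expr(tokens, 0)
    if i != len(tokens):
        raise ValueError("unexpected trailing input")
    return value

print(evaluate("8 - 3 + 2"))    # 7, grouped as (8 - 3) + 2
print(evaluate("2 + 3 * 4"))    # 14, multiplication binds tighter
print(evaluate("8 - (3 + 2)"))  # 3, the grouping must be requested explicitly
```

The two values 7 and 3 for the same three numbers are exactly the two meanings that the ambiguous grammar could not tell apart.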


This parse tree correctly represents left associativity by using recursion on the left. If we rewrote the grammar to use recursion on the right, we would represent right associativity:
1. E --> T + E | T - E | T
2. T --> F * T | F / T | F
3. F --> ( E ) | number
Our grammar also correctly represents precedence levels by introducing a new nonterminal symbol for each precedence level. According to our grammar, an expression consists of the sum or difference of terms (or a single term), where a term consists of the product or quotient of factors (or a single factor), and a factor is a nested expression or a number.

Leftmost and Rightmost Derivations:
In order to restrict the choices we have in deriving a string, it is often useful to require that at each step we replace the leftmost variable by one of its production bodies. Such a derivation is called a leftmost derivation. Similarly, it is possible to require that at each step the rightmost variable is replaced by one of its bodies. We call such a derivation a rightmost derivation.

Derivation Trees:
A derivation tree is an ordered tree in which nodes are labeled with the left sides of productions and in which the children of a node represent its corresponding right sides.

Definition:
Let G = (V, T, S, P) be a context-free grammar, where V is the set of variables, T the set of terminals, S the start variable and P the set of productions. An ordered tree is a derivation tree for G if and only if it has the following properties:
1. The root is labeled S.
2. Every leaf has a label from T ∪ {ε}.
3. Every interior vertex (a vertex which is not a leaf) has a label from V.
4. If a vertex has label A ∈ V, and its children are labeled (from left to right) a_1, a_2, …, a_n, then P must contain a production of the form A → a_1 a_2 … a_n.
5. A leaf labeled ε has no siblings; that is, a vertex with a child labeled ε can have no other children.


A partial derivation tree satisfies conditions 3, 4 and 5, but not necessarily condition 1, and condition 2 is replaced by:
2a. Every leaf has a label from V ∪ T ∪ {ε}.

Consider the grammar G with productions:
S → aAB,
A → bBb,
B → A | ε
Find a partial derivation tree of G.
Find the language of G.

Consider the grammar G with productions:
S → abB,
A → aaBb,
B → bbAa,
A → ε
Find the language of G.

Find context-free grammars for the following languages (with n ≥ 0, m ≥ 0):
1. L = {a^n b^m : n ≤ m}
2. L = {w w^R}
3. L = {w ∈ {a, b}* : n_a(w) ≠ n_b(w)}

Methods for transforming Grammars:
Chomsky Normal Form:
When working with CFGs it is often convenient to have them in simplified form. One of the simplest and most useful forms is called the Chomsky normal form. Chomsky normal form, or CNF, is useful when giving algorithms for working with CFGs.


Definition:
A context-free grammar is in Chomsky normal form if every rule is of the form:
A → BC
A → a
where a is any terminal and A, B and C are any nonterminals, except that B and C may not be the start symbol. In addition, we permit the rule S → ε, where S is the start symbol.

Theorem:
Any context-free language is generated by a context-free grammar in Chomsky normal form.

Proof by construction:
1. Add a new start variable S0 and the rule S0 → S, where S was the original start nonterminal. This guarantees that the new start variable does not occur on the right-hand side of a rule.
2. We take care of all ε-rules. We remove an ε-rule of the form A → ε, where A is not the start symbol. This is how to do it: for each occurrence of an A on the right-hand side of a rule, we add a new rule with that occurrence deleted. For example, if R → uAv is a rule, we add the rule R → uv. We do so for each occurrence of the nonterminal A, so the rule R → uAvAw causes us to generate
R → uvAw
R → uAvw
R → uvw
If R → A is a rule, we add R → ε unless we had previously removed the rule R → ε. We repeat these steps for all rules of the form nonterminal → ε.
3. We handle unit rules, that is, rules of the form A → B. We remove the rule A → B. Then, whenever a rule B → u appears, we add the rule A → u, unless this was a unit rule previously removed. Here u is a string of variables and terminals. We repeat these steps until we eliminate all unit rules.
4. We convert all remaining rules into the proper form. We replace each rule of the form A → u_1 u_2 … u_k, where k ≥ 3 and each u_i is a variable or terminal symbol, with the rules
A → u_1 A_1, A_1 → u_2 A_2, …, A_(k-2) → u_(k-1) u_k
where the A_i are new variables. Finally, wherever a terminal u_i appears in a body of length two or more, we replace it with a new variable U_i and add the rule U_i → u_i.

Greibach Normal Form:
If L is an ε-free context-free language, there exists a grammar in Greibach normal form for it.
A context-free grammar is said to be in Greibach normal form if all productions have the form
A → ax
where a is a terminal symbol and x is a string of zero or more nonterminal symbols, so every body begins with a terminal that is followed only by nonterminals.
For example, the grammar
S → AB
A → aA | bB | b
B → b
is not in Greibach normal form, because the body AB begins with a nonterminal. Substituting the bodies of A into S → AB gives an equivalent grammar in Greibach normal form:
S → aAB | bBB | bB
A → aA | bB | b
B → b

This concept is important for what follows:
Let A be a context-free language; then there is a CFG G that generates it.
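Returning to the Greibach normal form example above, a quick sanity check is to enumerate the strings that both grammars generate up to some length and compare the two sets. The sketch below assumes single-character symbols and grammars without ε-rules (so a sentential form never shrinks); the names language, G1 and G2 are made up for this illustration. Agreement up to a bounded length is evidence of equivalence, not a proof.

```python
def language(rules, start, max_len):
    """Terminal strings of length <= max_len derivable from `start`.

    Assumes every symbol is one character and no rule body is empty,
    so forms longer than max_len can be discarded safely.
    """
    seen, frontier, result = {start}, [start], set()
    while frontier:
        form = frontier.pop()
        # Position of the leftmost variable, if any.
        idx = next((i for i, c in enumerate(form) if c in rules), None)
        if idx is None:
            result.add(form)          # only terminals left
            continue
        for body in rules[form[idx]]:
            new = form[:idx] + body + form[idx + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                frontier.append(new)
    return result

G1 = {"S": ["AB"], "A": ["aA", "bB", "b"], "B": ["b"]}                 # original grammar
G2 = {"S": ["aAB", "bBB", "bB"], "A": ["aA", "bB", "b"], "B": ["b"]}   # Greibach normal form

print(sorted(language(G1, "S", 4)))
print(language(G1, "S", 8) == language(G2, "S", 8))
```

Because every body in G2 starts with a terminal, each derivation step in G2 produces exactly one more input symbol, which is what makes the form convenient for parsing arguments.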


Procedure on how to convert G into an equivalent PDA, called P:
1. Place the marker symbol $ and the start variable on the stack.
2. Repeat the following steps forever:
a. If the top of the stack is a variable (nonterminal) symbol A, nondeterministically select one of the rules for A and substitute A by the string on the right-hand side of the rule.
b. If the top of the stack is a terminal symbol a, read the next symbol from the input and compare it to a. If they match, pop it and repeat step 2. If they do not match, reject on this branch of the nondeterminism.
c. If the top of the stack is the symbol $, enter the accept state. Doing so accepts the input if it has all been read.

Find the automaton that corresponds to the following CFG:
S → aSA
S → a
A → aA | bB | b
B → b
The underlying idea is to construct a PDA that can carry out a leftmost derivation of any string in the language.

Deterministic Pushdown Automata and Deterministic Context-Free Languages:
A deterministic pushdown accepter (dpda) is a pushdown automaton that never has a choice in its move.

Definition:
A pushdown automaton M = (Q, Σ, Γ, δ, q0, z, F) is said to be deterministic if it is an automaton as described above with the added restriction that for every q ∈ Q, a ∈ Σ ∪ {ε} and b ∈ Γ:


1. δ(q, a, b) contains at most one element,
2. if δ(q, ε, b) is not empty, then δ(q, c, b) must be empty for every c ∈ Σ.
The first of these conditions requires that for any given input symbol and any stack top, at most one move can be made. The second condition requires that when an ε-move is possible for some configuration, no input-consuming alternative is available.
Note that we want to retain ε-moves; since the top of the stack plays a role in determining the next move, the presence of ε-moves does not imply nondeterminism.

A language L is said to be a deterministic context-free language if and only if there exists a dpda M such that L = L(M).
The importance of deterministic context-free languages lies in the fact that they can be parsed efficiently. If we view the pushdown automaton as a parsing device, there will be no backtracking, so we can easily write a computer program for it.

Let us see what grammars might be suitable for the description of deterministic context-free languages.
Suppose we are parsing top down, attempting to find a leftmost derivation. We scan the input w from left to right while developing a sentential form, and we want the currently scanned symbol to determine exactly one applicable grammar rule, so that backtracking is avoided. This is impossible for general CFGs.

s-grammars:
A CF grammar G = (V, T, S, P) is said to be a simple grammar or s-grammar if all the productions are of the form
A → ax
where A ∈ V, a ∈ T, x ∈ V*, and any pair (A, a) occurs at most once in P.
The grammar
S → aS | bSS | c
is an s-grammar.
The grammar
S → aS | bSS | aSS | c


is not an s-grammar, since the pair (S, a) occurs in two productions.
In s-grammars, at every stage of the parsing we know exactly which production rule has to be applied.
Although s-grammars are useful, they are too restrictive to capture all aspects of the syntax of programming languages.

LL grammars:
In an LL grammar, we still have the property that we can, by looking at a limited part of the input (consisting of the scanned symbol plus a finite number of symbols following it), predict exactly which production rule must be used. The term LL is standard usage in books on compilers. The first L indicates that symbols are scanned from Left to right, the second L that Leftmost derivations are constructed.
The grammar
S → aSb | ab
is not an s-grammar but it is LL. (Why?)
We say that a grammar is LL(k) if we can uniquely identify the correct production, given the currently scanned symbol and a “look-ahead” of the next k-1 symbols.
The example above is LL(?)

The pumping lemma is an effective tool for showing that certain languages are not regular. Similar pumping lemmas are used for other language families.

A pumping lemma for Context-Free Languages:
The pumping lemma for context-free languages states that every context-free language has a special value called the pumping length such that all longer strings in the language can be “pumped”. This means that the string can be divided into five parts so that the second and the fourth parts may be repeated together, and the resulting string is still in the language.


Theorem:
If A is a context-free language, then there is a number p (the pumping length) where, if s is any string in A of length at least p, then s may be divided into five pieces s = uvxyz satisfying the conditions:
1. for each i ≥ 0, u v^i x y^i z ∈ A,
2. |vy| > 0, and
3. |vxy| ≤ p.
When s is being divided into uvxyz, condition 2 says that either v or y is not the empty string.
The pumping lemma is useful in showing that a language does not belong to the family of context-free languages.

Show that the language L = {a^n b^n c^n : n ≥ 0} is not context-free.
Suppose L were context-free, and let m be its pumping length. We pick the string a^m b^m c^m, which is in L and has length at least m.
If we choose vxy to contain only a’s, then the pumped string is a^k b^m c^m with k ≠ m, which is not in L.
If we choose a decomposition so that v consists of a’s and y of the same number of b’s, then the pumped string a^k b^k c^m with k ≠ m is not in the language.
The remaining cases are similar: since |vxy| ≤ m, the substring vxy cannot contain a’s, b’s and c’s all at once, so pumping changes the count of at least one letter while leaving the count of another unchanged.
So L = {a^n b^n c^n : n ≥ 0} is not context-free.
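The case analysis can be double-checked by brute force for small m. The following sketch, with hypothetical names pumpable, in_abc and in_ab, tries every split s = uvxyz with |vy| > 0 and |vxy| ≤ m and tests the pumped strings for i = 0 and i = 2 only. If no split survives even these two values of i, the string cannot be pumped; a surviving split, on the other hand, is only evidence, since the lemma demands all i ≥ 0.

```python
def in_abc(s):
    """Membership in {a^n b^n c^n : n >= 0}."""
    n = len(s) // 3
    return len(s) % 3 == 0 and s == "a" * n + "b" * n + "c" * n

def in_ab(s):
    """Membership in the context-free language {a^n b^n : n >= 0}."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def pumpable(s, m, member):
    """True if some split s = uvxyz with |vy| > 0 and |vxy| <= m
    keeps u v^i x y^i z in the language for i = 0 and i = 2."""
    for start in range(len(s)):
        for end in range(start + 1, min(start + m, len(s)) + 1):
            # s[start:end] plays the role of vxy
            for v_end in range(start, end + 1):
                for y_start in range(v_end, end + 1):
                    u, v = s[:start], s[start:v_end]
                    x, y, z = s[v_end:y_start], s[y_start:end], s[end:]
                    if not v and not y:
                        continue  # violates |vy| > 0
                    if all(member(u + v * i + x + y * i + z) for i in (0, 2)):
                        return True
    return False

for m in range(1, 5):
    s = "a" * m + "b" * m + "c" * m
    print(m, pumpable(s, m, in_abc))   # False for every m tried

print(pumpable("aabb", 2, in_ab))      # True: u = a, v = a, x empty, y = b, z = b works
```

The contrast with a^n b^n shows that the checker is not trivially negative: there the a’s and b’s can be pumped together, while no window of length at most m reaches all three blocks of a^m b^m c^m.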
