Context free languages A language recognizer is a device that ...

Context free languagesA language recognizer is a device that accepts valid strings produced in agiven language. Finite state automata are formalized types of languagerecognizers.A language generator begins, when given a start signal, to construct a string.Its operation is not completely determined from the beginning but is limitedby a set of rules.Regular expressions can be viewed as language generators. For example,consider the regular expression a(a* ∪ b*)b. A verbal description of how togenerate a string in accordance with this expression would be:First output an a. Then do one of the followingtwo things:Either output a number of a’s or output a number ofb’sFinally output a b.In this chapter, we will study certain more complex sorts of languagegenerators, called context-free grammars, which are based on a morecomplete understanding of the structure of the strings belonging to thelanguage.To take the example of : a(a* ∪ b*)b, we can describe the strings generatedby this language as consisting of a leading a, followed by a middle part M,to be made explicit later on, followed by a trailing b. Let S be a new symbolinterpreted as “ a string in the language”, we can express this observation bywriting:S aMb is read can be. We call such an expression a rule.What can M be? either a string of a’s or a string of b’s. We express this byadding:M AM BWhere A and B are new symbols that stand for strings of a’s and b’srespectively.

Now, let’s define a string of a’s.First, it can be empty, so we add the rule:A εor it may consist of a leading a followed by a string of a’s’A aA.Similarly for B:B ε andB bBThe language denoted by the regular expression a(a* ∪ b*)b can be definedby the following rules:S aMbM AM BA εA aA.B εB bBFor example to generate the string aaab:1. We start with the symbol S, as specified;2. We replace S by aMb according to the first rule, S aMb.3. To aMb we apply the rule MA and obtain aAb.4. We then twice apply the rule A aA to get the string aaaAb.5. Finally we apply the rule Aε6. In the resulting string aaab we cannot identify any symbol thatappears to the left of in some rule, thus the operation of ourgenerator has ended and aaab was produced.A context-free grammar is a language generator that operates with a setof rules. Context-free grammars are a more powerful method ofrepresenting languages.Such grammars can describe features that have a recursive structure.In a context free grammar, some symbols appear to the left of in rules-S, M, A, B- in our example, and some do not -a and b-Symbols of the latter kind are called terminals, since the string is madeup of these symbols.

The other symbols are called non-terminal symbols. The rule of the formX Y is called a production rule or a substitution rule.An important application of context-free grammars occurs in thespecification and compilation of programming languages.Definition of Context-Free Grammars:Four important components in a grammatical description of a language:1. There is a finite set of symbols that forms strings of the languagebeing defined. We call this set the terminals or terminal symbols.2. There is a finite set of variables, also called no terminals orsyntactic categories. Each variable represents a language; i.e. a setof strings3. One of the variables represents the language being defined; it iscalled the start symbol. Other variables represent auxiliary classesof strings that are used to help define the language of the startsymbol.4. There is a finite set of productions or rules that represent therecursive definition of a language. Each production consists of:a. A variable that is being (partially) defined by the production.This variable is often called the head of the production.b. The production symbol c. A string of zero or more terminals and variables. This stringcalled the body of the production, represents one way toform strings in the language of the variable of the head. In sodoing, we leave terminals unchanged and substitute for eachvariable of the body any string that is known to be in thelanguage.The four components just described form a context free grammar.Formal Definition of a Context Free Grammar:A context free grammar is a 4-tuple (V,∑, R, S) where:1. V is a finite set called the variables, or non terminals2. ∑ is a finite set, disjoint from V, called the terminals3. R is a finite set of rules, with each rule being a variable and a stringof variables and terminals and4. S ∈ V is the start variable

If u, v and w are strings of variables and terminals, and A w is a rule ofthe grammar, we say that uAv yields uwv.We say that u derives v, writtenu ⇒* v, if1. u =v or2. if a sequence u 1 , u 2 , …u k exists for k ≥ 0, such that u ⇒ u 1 ⇒ u 2 …..⇒u k ⇒ vConsider grammar G4= (V, ∑, R, )V ={ , , } and∑={a, +, x, (, )}.The rules are: + | x | () | aThe two strings a+a x a and (a+ a) x a can be generated with grammar G.ParsingA compiler translates code written in a programming language into anotherform, usually more suitable for execution. This process is called parsing.Parse TreesA parse tree for a grammar G is a tree where:• the root is the start symbol for G• the interior nodes are the non terminals of G• the leaf nodes are the terminal symbols of G.• the children of a node T (from left to right) correspond to the symbols onthe right hand side of some production for T in G.Every terminal string generated by a grammar has a corresponding parsetree; every valid parse tree represents a string generated by the grammar(called the yield of the parse tree).

Example: Given the following grammar, find a parse tree for the string1 + 2 * 3:1. 2. ( )3. + 4. - 5. * 6. / 7. 0 | 1 | 2 | ... 9The parse tree is:E --> E --> D --> 1+E --> E --> D --> 2*E --> D --> 3Ambiguous GrammarsA grammar for which there are two different parse trees for the sameterminal string is said to be ambiguous.The grammar for balanced parentheses given below is an example of anambiguous grammar:P --> ( P ) | P P | epsilonWe can prove this grammar is ambiguous by demonstrating two parse treesfor the same terminal string.Here are two parse trees for the empty string:1) P --> P --> epsilonP --> epsilon2) P --> epsilon

Here are two parse trees for ():P --> P --> (P --> epsilon)P --> epsilonP --> P --> epsilonP --> (P --> epsilon)To prove that a grammar is ambiguous, it is sufficient to demonstrate thatthe grammar generates two distinct parse trees for the same terminal string.An unambiguous grammar for the same language (that is, the set of stringsconsisting of balanced parentheses) is:P --> ( P ) P | epsilonThe Problem of Ambiguous GrammarsA parse tree is supposed to display the structure used by a grammar togenerate an input string. This structure is not unique if the grammar isambiguous. A problem arises if we attempt to impart meaning to an inputstring using a parse tree; if the parse tree is not unique, then the string hasmultiple meanings.We typically use a grammar to define the syntax of a programminglanguage. The structure of the parse tree produced by the grammar impartssome meaning on the strings of the language.If the grammar is ambiguous, the compiler has no way to determine whichof two meanings to use. Thus, the code produced by the compiler is not fullydetermined by the program input to the compiler.

Ambiguous PrecedenceThe grammar for expressions given earlier:1. number2. ( )3. + 4. - 5. * 6. / This grammar is ambiguous as shown by the two parse trees for the inputstring: “number + number * number”:E --> E --> number+E --> E --> number*E --> numberE --> E --> E --> number+E --> number*E --> numberThe first parse tree gives precedence to multiplication over addition; thesecond parse tree gives precedence to addition over multiplication. In mostprogramming languages, only the former meaning is correct. As written, thisgrammar is ambiguous with respect to the precedence of the arithmeticoperators.Ambiguous AssociativityConsider again the same grammar for expressions:1. --> number2. --> ( )3. --> + 4. --> - 5. --> * 6. --> /

This grammar is ambiguous even if we only consider operators at the sameprecedence level, as in the input string “number - number + number”:E --> E --> number-E --> E --> number+E --> numberE --> E --> E --> number-E --> number+E --> numberThe first parse tree (incorrectly) gives precedence to the addition operator;the second parse tree gives precedence to the subtraction operator. Since wenormally group operators left to right within a precedence level, only thelatter interpretation is correct.Definition:If L is a context-free language for which there exists an unambiguousgrammar, then L is said to be unambiguous. If every grammar that generatesL is ambiguous then the language L is called inherently ambiguous.Consider the language:L= {a n b n c m }∪{a n b m c m }L = L1 ∪ L2L1 is generated by:S1 ACA aAb|epsilonC cC| epsilon

L2 is generated by:S2 BDB aB|epsilonD bDc|epsilonThen L is generated by a combination of these 2 grammars with theadditional productionSS1|S2The grammar is ambiguous since the string a n b n c n has 2 derivationsAn Unambiguous Grammar for ExpressionsIt is possible to write a grammar for arithmetic expressions that• is unambiguous• enforces the precedence of * and / over + and -• enforces left associativityHere is one such grammar:1. + | - | 2. * | / | 3. ( ) | numberIf we attempt to build a parse tree for number - number + number, we seethere is only one such tree:E --> E --> E --> T --> F --> number-T --> F --> number+T --> number

This parse tree correctly represents left associativity by using recursion onthe left. If we rewrote the grammar to use recursion on the right, we wouldrepresent right associativity:1. --> + | - | 2. --> * | / | 3. --> ( ) | numberOur grammar also correctly represents precedence levels by introducing anew non-terminal symbol for each precedence level. According to ourgrammar, expressions consist of the sum or difference of terms (or a singleterms), where a term consists of the product or division of factors (or asingle factor), and a factor is a nested expression or a number.Leftmost and Rightmost Derivations:In order to restrict the choices we have in deriving a string, it is often usefulto require that at each step, we replace the leftmost variable by one of itsproduction bodies. Such a derivation is called a leftmost derivation.Similarly, it is possible to require that at each step, the rightmost variable isreplaced by one of its bodies. So we call this derivation rightmost.Derivation Trees:A derivation tree is an ordered tree in which nodes are labeled with the leftsides of productions and in which the children of a node represent itscorresponding right sides.Definition:Let G=(V, T, S, P) be a context-free grammar. An ordered tree is aderivation tree for G if and only if it has the following properties:1. The root is labeled S.2. Every leaf has a label from T ∪ {ε}3. Every interior vertex (which is not a leaf) has a label from V4. if a vertex has label A∈V, and its children are labeled(from left toright) a 1 , a 2 , …, a n , then P must contain a production of the form Aa 1 a 2 …a n5. A leaf labeled ε has no siblings, that are a vertex with a child labeled εcan have no other children.

A partial derivation tree satisfies conditions 3, 4 and 5, but not 1necessarily and 2 is replaced by:2a. Every leaf has a label from V ∪T ∪ {ε}Consider the grammar G with productions:S aAB,A bBb,B A|εFind a partial derivation tree of G.Find the language of G.Consider the grammar G with productions:S abB,A aaBb,B bbAaA εFind the language of G.Find context-free grammars for the following languages(with n >=0, m>=0)1. L= {a n b m : n ≤ m}2. L= { ww R }3. L={w ∈ {a, b}*: n a (w) ≠ n b (w)}Methods for transforming Grammars:Chomsky Normal Form:When working with CFG it is often convenient to have them in simplifiedform. One of the simplest and most useful forms is called the Chomskynormal form. Chomsky normal form or CNF is useful when givingalgorithms for working with CFG.

Definition:A context free grammar is in Chomsky normal form if every rule is of theform:ABCA aWhere a is any terminal and A, B, and C are non terminals different from thestart symbol. In addition, we permit the rule:Sepsilon, where S is the start symbol.Theorem:Any context free language is generated by a context free grammar inChomsky normal form.Proof by construction:1. Add a new start variable S0S, where S was the original startnonterminal. This will guarantee that the new start variable does notoccur on the right-hand side of a rule.2. We take care of all epsilon rules. We remove ε-rule of the form Aε,where A is not the start symbol.This is how to do it:For each occurrence of an A on the right-hand side of a rule, we add anew rule with that occurrence deleted.For example: if R uAv is a rule, we add the rule R uv. We do so foreach occurrence of the non-terminal A .So the rule R uAvAw causes us to generateR uvAwR uAvwR uvwWe repeat these steps for all rules of the form non-terminal ε3. We handle unit rules, or rules of the form AB. We remove thepreceding rule. Then, whenever a rule Bu appears, we add a rule

Au. u is a string of variables and terminals. We repeat these stepsuntil we eliminate all unit rules.4. We convert all remaining rules into the proper form. We replace eachrule of the form A u 1 u 2 …u k , where k ≥ 3 and each u i is a variable orterminal symbol, with the rulesA u 1 A 1 , A 1 u 2 A 2 ,…., and A k-2 u k-1 u kGreibach Normal Form:If L is an ε-free context-free language, there exists a grammar inGreibach normal form for it.A context free grammar is said to be in Greibach normal form if allproductions have the form:A ax where a is terminal symbol and x is one or morenonterminal symbols preceded by a terminal symbol:S ABA aA|bB|bB bGreibach Normal Form:S aAB| bBB|bBA aA|bB|bB bThis concept is important for what follows:Let A be a Context free language; then there is a CFG G thatgenerates it.

Procedure on how to convert G into an equivalentPDA, called P:1. Place the marker symbol $ and the start variable on thestack2. Repeat the following steps forever:a. if the top of stack is a variable (non-terminal) symbol A, nondeterministically select one of the rules for A and substituteA by the string on the right-hand side of the rule.b. If the top of stack is a terminal symbol a, read the nextsymbol from the input and compare it to a. If they match,pop it repeat step 2. If they do not match, reject on thisbranch of the non-determinism.c. if the top of stack is the symbol $, enter the accept state.Doing so accepts the input if it has all been read.Find the automaton that corresponds to the following CFG:S aSASaA aA|bB|bB bThe underlying idea is to construct a pda that can carry out a leftmostderivation of any string in the language.Deterministic Pushdown Automata and Deterministic ContextFree Languages:A deterministic pushdown accepter(dpda) is a pushdown automatonthat never has a choice in its moveDefinition:A pushdown automaton M = (Q, Σ, Γ, δ, q0, z, F) is said to bedeterministic if it is an automaton as described above with the addedrestriction that for every q ∈ Q and a ∈ Σ ∪ {ε} and b ∈Γ

1. δ(q, a, b) contains at most one element2. δ(q, ε, b) is not empty, then δ(q, c, b) must be empty for every c∈ ΣThe first of these conditions requires that for any given inputsymbol and any stack top, at most one move can be made.The second condition is when an ε-move is possible for someconfiguration, no input-consuming alternative is available.Note that we want to retain ε-moves; since the top of the stackplays a role in determining the next move, the presence of ε-movedoes not imply non determinism.A language L is said to be a deterministic context free language ifand only if there exists a dpda M such thatL =L(M).The importance of deterministic context free languages lies in thefact that they can be parsed efficiently. If we view the pushdownautomaton as a parsing device, there will be no backtracking, sowe can easily write a computer program for it.Let us see what grammars might be suitable for the description ofdeterministic context free grammars.Suppose we are parsing top down, attempting to find the leftmostderivations . We scan the input w from left to right, whiledeveloping a sentential form and proceed to match the currentlyscanned symbol with exactly the relevant grammar rule and avoidbacktracking. This would be impossible for general CFGs-grammars:A CF grammar G = (V, T, S, P) is said to be a simple grammar ors-grammar if all the productions are of the form:Aaxwhere A ∈ V, a ∈ T, x ∈ V*, and any pair (A, a) occurs at mostonce in P.The grammarS aS|bSS|cis an s-grammar.The grammarSaS|bSS|aSS|c

is not an s-grammar, since the pair (S, a) occurs in twoproductions.In s-grammars, at every stage of the parsing we know exactlywhich production rule has to be applied.Although s-grammars are useful, they are too restrictive to captureall aspects of the syntax of programming languages.LL grammars:In an LL grammar, we still have the property that we can, bylooking at a limited part of the input (consisting of the scannedsymbol, plus a finite number of symbols following it), predictexactly which production rule must be used. The term LL isstandard usage in books on compilers. The first L indicates thatsymbols are scanned from Left to right, the second L that Leftmostderivations are constructed.The grammarS aSb|abis not an s-grammar but it is LL. (why ?)We say that a grammar is LL(k) if we can uniquely identify thecorrect production, given the currently scanned symbol and a“look-ahead” of the next k-1 symbols.The example above is LL(?)The pumping lemma is an effective tool for showing that certainlanguages are not regular. Similar pumping lemmas are used forother language families.A pumping lemma for Context-Free Languages:The pumping lemma for context free languages states that everycontext free language has a special value called the pumping lengthsuch that all longer strings in the language can be “pumped”. Thismeans that the string can be divided into five parts so that the secondand the fourth parts may be repeated together.

Theorem:If A is a context language, then there is a number p( the pumpinglength) where, if s is any string in A of length at least p, then s may bedivided into five pieces s= uvxyz satisfying the conditions:1. for each i ≥ 0, uv i xy i z ∈ A2. |vy| > 0 and3. |vxy| ≤ pWhen s is being divided into uvxyz, condition 2 says that either vor y is not the empty string.The pumping lemma is useful in showing that a language does notbelong to the family of context-free languages.Show that the language L = {a n b n c n : n ≥ 0} is not context free.Let’s pick m, we pick the string a m b m c m which is in L.If we choose vxy to contain only a’s, then the pumped string is b m c mwhich is not in L.If we choose a decomposition so that v and y are composed of anequal number of a’s and b’s then the pumped string a k b k c m with k≠mis not in the language.So L = {a n b n c n : n ≥ 0} is not context free.

Context free languages A language recognizer is a device that ...

Create successful ePaper yourself

Delete template?

Save as template?