10.07.2015 Views

Notes on Modeling XML Element Content Models

Notes on Modeling XML Element Content Models

Notes on Modeling XML Element Content Models

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<str<strong>on</strong>g>Notes</str<strong>on</strong>g> <strong>on</strong><strong>Modeling</strong> <strong>XML</strong> <strong>Element</strong> C<strong>on</strong>tent <strong>Models</strong>Harrie PassierJune 16, 2011AbstractWe describe a method for modeling element c<strong>on</strong>tent of <strong>XML</strong>-elements.Often, these models are simple, but complex <strong>on</strong>es occur. Because modelingthe complex <strong>on</strong>es is not trivial, designers of these models may haveproblems with drawing up correct and c<strong>on</strong>cise models. As far as we know,study books about <strong>XML</strong> do not present methods for this type of modeling.The method we propose is based <strong>on</strong> modeling regular expressi<strong>on</strong>s.Using this, it is possible to model element c<strong>on</strong>tent of <strong>XML</strong>-elements moresystematically.keywords: <strong>XML</strong>, modeling, regular expressi<strong>on</strong>, didactics of informatics1 Introducti<strong>on</strong>DTD’s and c<strong>on</strong>tent models The type of an <strong>XML</strong> document is describedby a grammar, for example a document type definiti<strong>on</strong> (DTD) or an <strong>XML</strong>-Schema. In these notes, we focus <strong>on</strong> DTD’s. Roughly, a DTD not <strong>on</strong>ly liststhe elements an <strong>XML</strong> document c<strong>on</strong>sists of but also lists the attributes as partof each element. An element declarati<strong>on</strong> looks like . An example of a DTD is:There are four types of c<strong>on</strong>tent models:1. Empty. The element does not c<strong>on</strong>tain child elements. For example:.2. Any. The c<strong>on</strong>tent of the element can c<strong>on</strong>sist of any sequence of characterdata or elements. These elements must be declared by corresp<strong>on</strong>dingelement declarati<strong>on</strong>s in the DTD. For example: .1


3. Mixed. The c<strong>on</strong>tent model has the form of (#PCDATA | e 1 | e 2 ...)*,where each e i is an element name. This means the c<strong>on</strong>tent may c<strong>on</strong>tainarbitrary character data, interspersed with any number of elements of thespecified names. For example .4. <strong>Element</strong> c<strong>on</strong>tent. The element c<strong>on</strong>tains a specific number of child elementsin a specific order. For example .In our example, the elements books and book have a c<strong>on</strong>tent model of typeelement c<strong>on</strong>tent. The c<strong>on</strong>tent model of element books is book + , which means<strong>on</strong>e or more (+) book elements. Other cardinalities are ∗: zero or more times,and ?: zero or <strong>on</strong>e times. The c<strong>on</strong>tent model of the element book c<strong>on</strong>sistsof a sequence of <strong>on</strong>e title element, followed by <strong>on</strong>e ore more author elements,followed by <strong>on</strong>e of more chapter elements. A sequence of elements is denoted bya comma separated list. Another possibility is a choice, depicted by a verticalbar |. For example, car ∗ |cycle + means zero or more car elements or <strong>on</strong>e or morecycle elements. In the example above, the elements title, author, and chapterare of type parsed character data, i.e. of type string. Parsed chacacter data(#PCDATA) is the text found between the start and end tag of an <strong>XML</strong> element.A possible <strong>XML</strong> instance document of the example DTD is:<strong>XML</strong>: Theory and applicati<strong>on</strong>sL. WiegeringH. PootjesH. PassierIntroducti<strong>on</strong>Basics of <strong>XML</strong>In these notes we focus <strong>on</strong> the fourth type of c<strong>on</strong>tent model, element c<strong>on</strong>tent,and how to model this type systematically.Difficulties with modeling element c<strong>on</strong>tent In the example above, theelement c<strong>on</strong>tent of the elements books and book is simple. But in some casesfinding the right model, i.e. representing the order and cardinalities of childelements, can be difficult. In the following example, the element c<strong>on</strong>tent ofelement rec is more complex.2


Each rec element c<strong>on</strong>tains a particular sequences of elements a, b, c and/or d.All these elements are empty, i.e. do not c<strong>on</strong>tain other elements and/or characterdata. A corresp<strong>on</strong>ding c<strong>on</strong>tent model of element rec could be (b, a?, c)|(a, b, c, d?).Finding this c<strong>on</strong>cise model by hand in an efficient way is not a trivial task andrequires some knowledge and training.Existing educati<strong>on</strong>al material As far as we know, there is no text bookabout <strong>XML</strong> that presents a systematic way for modeling c<strong>on</strong>tent models ofDTD’s (and/or <strong>XML</strong>-Schema’s). In <strong>on</strong>e book, it is observed that element c<strong>on</strong>tentcan be written using a variati<strong>on</strong> of the regular expressi<strong>on</strong> notati<strong>on</strong> [16].But it is not explained how to reach a correct and c<strong>on</strong>cise model.<strong>XML</strong>-editors Most <strong>XML</strong>-editors c<strong>on</strong>tain an algorithm for inferring c<strong>on</strong>tentmodels as part of a functi<strong>on</strong> for generating DTD’s or <strong>XML</strong>-Schema’s (see for example[1]). Often, these generated models can be made more c<strong>on</strong>cise. C<strong>on</strong>sider,for example, the following <strong>XML</strong>-document:Using <strong>XML</strong>Spy, the generated c<strong>on</strong>tent model of element rec is(a, b, (a, b, (a, b)?)?)?. This model is precise, i.e. <strong>on</strong>ly zero, <strong>on</strong>e, two or threesequences of a- and b-elements are allowed. On the other hand, the model(a, b) ∗ is more c<strong>on</strong>cise and often usable. The price we have paid is a decrease inprecisi<strong>on</strong>, i.e. all sequences of a- en b-elements are allowed.Furthermore, DTD-generators can be err<strong>on</strong>eous. For example, the c<strong>on</strong>tentmodel of element b in the following <strong>XML</strong>-document3


generated by <strong>XML</strong>-Spy (Enterprise editi<strong>on</strong> versi<strong>on</strong> 2007) is ((c|d), c)?, whichis of course incorrect, since <strong>on</strong>e c-element as c<strong>on</strong>tent (the sec<strong>on</strong>d b-element) isnot part of the generated model: empty, (c, c) and (d, c). A correct <strong>on</strong>e is forexample (d?, c)?. These examples make clear that manual inference remainsnecessary.Literature Approximately half of the <strong>XML</strong> documents available do not referto a schema [2, 15]. Furthermore, about two-thirds of <strong>XML</strong> Schema definiti<strong>on</strong>sgathered from schema repositories and from the web at large are not valid withrespect to the W3C <strong>XML</strong> Schema specificati<strong>on</strong> and are prove to be err<strong>on</strong>eous,rendering them essentially useless for immediate applicati<strong>on</strong> [4, 14]. A similarobservati<strong>on</strong> was made c<strong>on</strong>cerning DTD’s [8].Approach We propose a method for modeling element c<strong>on</strong>tent of <strong>XML</strong>elementsby rewriting element c<strong>on</strong>tent using rules for manipulating regular expressi<strong>on</strong>sin a systematic way. This will help us finding a correct, precise, aswell as a c<strong>on</strong>cise c<strong>on</strong>tent model in cases where element c<strong>on</strong>tent has a complexstructure. We introduce the method for DTD-elements, but the method will beuseful for modeling <strong>XML</strong>-Schema complex types with complex c<strong>on</strong>tent too.These notes In these notes, we describe the approach in an informal style. Insecti<strong>on</strong> 2 we introduce the c<strong>on</strong>cept regular expressi<strong>on</strong>, including the languageand the rules for rewriting those expressi<strong>on</strong>s. Secti<strong>on</strong> 3 shows the relati<strong>on</strong>between <strong>XML</strong> element c<strong>on</strong>tent and regular expressi<strong>on</strong>s. In secti<strong>on</strong> 4 we showhow the rules can be applied to rewrite element c<strong>on</strong>tent. Secti<strong>on</strong> 5 discusseswhat deterministic c<strong>on</strong>tent models are, and how to handle n<strong>on</strong> deterministic<strong>on</strong>es. In secti<strong>on</strong> 6 we explain briefly what a rewrite system is, define the goals wewant to reach in rewriting element c<strong>on</strong>tent, and present a strategy for applyingthe rules more systematically. Secti<strong>on</strong> 7 formalizes the approach described.Secti<strong>on</strong> 8 briefly describes the main idea behind the algorithms used in most<strong>XML</strong>-editors. Finally, secti<strong>on</strong> 9 c<strong>on</strong>cludes and summarizes the results of thesenotes.4


2 Regular expressi<strong>on</strong>sIn this secti<strong>on</strong> we introduce the c<strong>on</strong>cept of regular expressi<strong>on</strong>s and the rules forrewriting such expressi<strong>on</strong>s in a certain form. Regular expressi<strong>on</strong>s are an importantformalism used in most schema languages for describing element c<strong>on</strong>tentas part of the c<strong>on</strong>tent model of <strong>XML</strong>-elements [16].2.1 Definiti<strong>on</strong>Definiti<strong>on</strong> A regular expressi<strong>on</strong> (RE) over an alphabet Σ c<strong>on</strong>sisting of someset of atoms, that are typically characters or names, is based <strong>on</strong> the followingrules [11]:1. the empty set ∅ is a regular expressi<strong>on</strong>;2. the empty string ɛ is a regular expressi<strong>on</strong>;3. each atom in Σ is by itself a regular expressi<strong>on</strong>;4. if R and S are regular expressi<strong>on</strong>s, then the following are also regularexpressi<strong>on</strong>s: R?, R ∗ , R + , RS, R|S, and (R)where the unary operators ?, ∗, + express cardinalities, i.e. zero or <strong>on</strong>e time (?),zero or more times (∗) and <strong>on</strong>e or more times (+). The binary operator | meansa choice between two expressi<strong>on</strong>s. C<strong>on</strong>catenati<strong>on</strong> of two expressi<strong>on</strong>s is denotedby juxtapositi<strong>on</strong>. The operators ?, ∗, + bind str<strong>on</strong>ger than c<strong>on</strong>catenati<strong>on</strong>,and c<strong>on</strong>catenati<strong>on</strong> binds str<strong>on</strong>ger than the choice operator. For example, theexpressi<strong>on</strong> ab ∗ |c, where the alphabet is a, b, c, must be interpreted as (a(b ∗ ))|cand not as a(b ∗ |c) or as (ab) ∗ |c.Remark The definiti<strong>on</strong>s R? and R + are actually redundant, because R? isequivalent with R|ɛ and R + is equivalent with RR ∗ .2.2 The language of a regular expressi<strong>on</strong>Not every regular expressi<strong>on</strong> describes a single string. Instead, a regular expressi<strong>on</strong>describes a language that is a set of strings. We now give the rules toassociate every well formed regular expressi<strong>on</strong> with the language it represents.Definiti<strong>on</strong> We define inductively the language L(R), represented by a regularexpressi<strong>on</strong> R, as follows [12]:L(∅) = ∅L(ɛ) = {ɛ}L(a) = {a}L(ST ) = L(S)L(T )L(S|T ) = L(S) ∪ L(T )L(T ∗ ) = (L(T )) ∗L(T ?) = L(T |ɛ) = L(T ) ∪ {ɛ}L(T + ) = L(T T ∗ ) = L(T )L(T ∗ )5


The definiti<strong>on</strong> for sequence uses set c<strong>on</strong>catenati<strong>on</strong>. The c<strong>on</strong>catenati<strong>on</strong> of twosets K and M is defined as: KM = {xy | x ∈ K, y ∈ M}. If for example K ={a, b} and M = {c, d}, then KM = {ac, ad, bc, bd}, i.e. all combinati<strong>on</strong>s.Example As an example of a language of a regular expressi<strong>on</strong>, we derive, fromleft to right, the language of the expressi<strong>on</strong> ((c|d)c)?, which was the incorrectc<strong>on</strong>tent model generated by <strong>XML</strong>-Spy and Oxygen in secti<strong>on</strong> 1:2.3 RulesL(((c|d)c)?)= zero or more timesL((c|d)c) ∪ {ɛ}= sequenceL(c|d)L(c) ∪ {ɛ}= choice(L(c) ∪ L(d))L(c) ∪ {ɛ}= atom({c} ∪ L(d))L(c) ∪ {ɛ}= atom({c} ∪ {d})L(c) ∪ {ɛ}= set uni<strong>on</strong>{c, d} L(c) ∪ {ɛ}= atom{c, d} {c} ∪ {ɛ}= set c<strong>on</strong>catenati<strong>on</strong>{cc, dc} ∪ {ɛ}= set uni<strong>on</strong>{cc, dc, ɛ}The choice operator (|) is associative, commutative, and idempotent. The c<strong>on</strong>catenati<strong>on</strong>operator is associative and the empty sequence ɛ is the unit of c<strong>on</strong>catenati<strong>on</strong>.In formulas this reads for all regular expressi<strong>on</strong>s R, S, and T [12]:R|(S|T ) = (R|S)|TR|S = S|RR|R = RR(ST ) = (RS)TRɛ = R (= ɛR)(2.1a)(2.1b)(2.1c)(2.1d)(2.1e)6


The rules for cardinalities are:R ∗ = ɛ|RR ∗(2.2a)R + = RR ∗(2.2b)R? = ɛ|R (2.2c)Other equati<strong>on</strong>s are:RR ∗ = R ∗ R(2.3a)(R ∗ S ∗ ) ∗ = (R|S) ∗ (2.3b)(RS) ∗ R = R(SR) ∗(2.3c)XR|XS = X(R|S)(2.3d)The last rule says that c<strong>on</strong>catenati<strong>on</strong> distributes over choice, i.e. if three arbitraryexpressi<strong>on</strong>s R, S, and X match the pattern <strong>on</strong> the left hand site, then wecan replace the left hand by the right hand site. Using the distributi<strong>on</strong> rule,ab|ac can be rewritten as a(b|c) and abc|abd as ab(c|d).The distributi<strong>on</strong> rule XR|XS = X(R|S) appears in another form, namelyRX|SX = (R|S)X. The first <strong>on</strong>e is called left-factoring and the sec<strong>on</strong>d <strong>on</strong>e iscalled right-factoring. We will explain the use of the distributi<strong>on</strong> rule (especiallyleft-factoring) further in secti<strong>on</strong> 5.All rules can be read from left to right and vice versa. Some rules havemultiple occurrences due to commutativity (for example R? = ɛ|R versus R? =R|ɛ).Whereas the empty string ɛ often occurs, we will never use the empty set ∅in modeling <strong>XML</strong> element c<strong>on</strong>tent. For completeness, we present the rules forthe empty set:∅|R = R = R|∅∅R = ∅ = ∅R∅ ∗ = ɛ(2.4a)(2.4b)(2.4c)Remark Notice that some combinati<strong>on</strong>s of these rules lead to new useful <strong>on</strong>es.For example, the combinati<strong>on</strong> of the rules 2.1.e, 2.2.c, and 2.3.d leads to thefollowing meaningful rule 1 :RX|X = R?X(2.5a)Remark We have no evidence that this set of rules is complete.Remark All of the rules listed can be proven using the definiti<strong>on</strong> of the languageof regular expressi<strong>on</strong>s(L). We give an example:Proof To be proven: R? = ɛ|R.Proof by using the definiti<strong>on</strong> of L: L(R?) = L(ɛ|R)L(R?) = L(R|ɛ) = L(ɛ|R)1 The derivati<strong>on</strong> is as follows: RX|X 2.1.e → RX|ɛX 2.3.d → (R|ɛ)X 2.2.c → R?X.□7


Rules 2.3.a, 2.3.b, and 2.3.c can be proven by inducti<strong>on</strong> too. As an example, wegive a proof of rule 2.3.c.Proof To be proven: (RS) ∗ R = R(SR) ∗ .Proof by inducti<strong>on</strong>:Base step: (RS) 0 R = R = R(SR) 0Assume: (RS) k R = R(SR) k for all k < n, then we have to prove that (RS) k+1 R =R(SR) k+1 .(RS) k R = RS(RS) k−1 R, but (RS) k−1 R = R(SR) k−1 by the inducti<strong>on</strong> hypothesis,thus:(RS) k R = RSR(SR) k−1 = R(SR)(SR) k−1 = R(SR) k .□3 <strong>XML</strong> element c<strong>on</strong>tent and regular expressi<strong>on</strong>sAs we have seen, the c<strong>on</strong>tent of an element definiti<strong>on</strong> specifies the order andnumber of occurrences of child elements. The DTD-notati<strong>on</strong> for specifyingelement c<strong>on</strong>tent is almost the same as the regular expressi<strong>on</strong> notati<strong>on</strong> [16, 17].Thus specifying element c<strong>on</strong>tent we can use a variati<strong>on</strong> <strong>on</strong> the regular expressi<strong>on</strong>notati<strong>on</strong>:• the alphabet c<strong>on</strong>sists of element names;• c<strong>on</strong>catenati<strong>on</strong> is represented by comma (,) instead of using juxtapositi<strong>on</strong>of its operands. As before, choice is represented by a vertical bar (|).C<strong>on</strong>catenati<strong>on</strong> and choice behave as explained in secti<strong>on</strong> 2;• the cardinality operators +, ∗ and ? are represented and behave as explainedin secti<strong>on</strong> 2;• empty element c<strong>on</strong>tent corresp<strong>on</strong>ds to the empty string ɛ.As with regular expressi<strong>on</strong>s, the operators ?, ∗, + bind str<strong>on</strong>ger than c<strong>on</strong>catenati<strong>on</strong>,and c<strong>on</strong>catenati<strong>on</strong> binds str<strong>on</strong>ger than the choice operator. Furthermore,<strong>on</strong>ly a restricted form of regular expressi<strong>on</strong>s called deterministic regular expressi<strong>on</strong>sis permitted. We will explain this type of expressi<strong>on</strong> in secti<strong>on</strong> 5.An <strong>XML</strong> element can have empty c<strong>on</strong>tent which is represented by the symbolɛ. As such, ɛ is a valid representati<strong>on</strong> of <strong>XML</strong> element c<strong>on</strong>tent. Because theDTD-language does not allow empty c<strong>on</strong>tent as part of a composite c<strong>on</strong>tentmodel, ɛ may not occur as a sub expressi<strong>on</strong> in a final model. For example ɛ|Rmust be rewritten in R?.The definiti<strong>on</strong>s of + and ∗ need some attenti<strong>on</strong> in the c<strong>on</strong>text of <strong>XML</strong>. Bothoperators are often applied, because they can dramatically reduce the expressi<strong>on</strong>’ssize. For example, a|a, a|a, a, a can be rewritten to a + . The price we haveto pay for this reducti<strong>on</strong> in size is a decrease in precisi<strong>on</strong>: the expressi<strong>on</strong> a + allowsall sequences of <strong>on</strong>e ore more a’s, i.e. the language described by a|a, a|a, a, ais a subset of the language described by a + . To express this property, we introducea partial ordering [9] between expressi<strong>on</strong>s: R ≤ S if L(R) ⊆ L(S). And as8


a c<strong>on</strong>sequence:∀k : k >= 1 : R k ≤ R +∀k : k >= 0 : R k ≤ R ∗(3.1a)(3.1b)where R 0 = ɛ and R k+1 = R, R k .Remark Actually, we need a new set of rules that fit the syntax of <strong>XML</strong>c<strong>on</strong>tent models. However, we will use the rules from the previous secti<strong>on</strong> forboth regular expressi<strong>on</strong>s as well as <strong>XML</strong> c<strong>on</strong>tent models (the most importantdifference is the notati<strong>on</strong> for c<strong>on</strong>catenati<strong>on</strong>: juxtapositi<strong>on</strong> instead of comma).Now we know how the DTD-notati<strong>on</strong> of element c<strong>on</strong>tent matches the notati<strong>on</strong>for regular expressi<strong>on</strong>s, we can use the rules to rewrite regular expressi<strong>on</strong>s.4 Rewriting element c<strong>on</strong>tentIn this secti<strong>on</strong> we give three examples showing how we can rewrite elementc<strong>on</strong>tent using the rules from secti<strong>on</strong> 2 and 3. For now, we c<strong>on</strong>centrate <strong>on</strong>c<strong>on</strong>ciseness of c<strong>on</strong>tent models.Example We take the c<strong>on</strong>tent model of element book from our book exampletitle, author, author, author, chapter, chapter. Using the rewrite rules, we canrewrite this model as follows:≤≤title, author, author, author, chapter, chapter(3.1.b)title, author + , chapter, chapter(3.1.b)title, author + , chapter +We have applied rule 3.1.b twice and have reached a c<strong>on</strong>cise form. In thisrewriting we assume the number of authors and chapters may be <strong>on</strong>e or more.Example Assume the c<strong>on</strong>tent of element recs: a, b, c|b, c|b, a, c|a, b, c, d. Thismodel can be rewritten as follows:a, b, c|b, c|b, a, c|a, b, c, d= (2.1.b twice)b, c|b, a, c|a, b, c|a, b, c, d= (2.3.d)b, (c|a, c)|a, b, c|a, b, c, d= (2.5.a)b, a?, c|a, b, c|a, b, c, d= (2.5.a)b, a?, c|a, b, c, d?9


Notice that this rewriting has been shortened using the derived rule RX|X =R?X (2.5.a). Furthermore, in this case other rewritings are possible with otherfinal models.Example C<strong>on</strong>sider a, (b, a) ∗ , b|(a, b) ∗ . This model can be rewritten as follows:a, (b, a) ∗ , b|(a, b) ∗= (2.3.c)a, b, (a, b) ∗ |(a, b) ∗= (2.2.a)a, b, (a, b) ∗ |a, b, (a, b) ∗ |ɛ= (2.1.c)a, b, (a, b) ∗ |ɛ= (2.2.a)(a, b) ∗In this rewriting we have introduced the empty element as part of a compositec<strong>on</strong>tent model in the sec<strong>on</strong>d step. As stated earlier, element c<strong>on</strong>tent can’tc<strong>on</strong>tain an empty element as sub expressi<strong>on</strong>. We always have to remove thissymbol before ending the rewriting. Furthermore, this rewriting shows thatsometimes an (intermediate) expressi<strong>on</strong> has to ’grow’, i.e. the number of symbolsincreases before further simplificati<strong>on</strong> can happen. Applicati<strong>on</strong> of rule 2.3.b.expands a, b, (a, b) ∗ |(a, b) ∗ into a, b, (a, b) ∗ |a, b, (a, b) ∗ |ɛ. After this expansi<strong>on</strong>,we are able to reduce the expressi<strong>on</strong> to the simple form (a, b) ∗ . In fairness, thisexample is more or less exotic in the c<strong>on</strong>text of <strong>XML</strong>!The sequence of rule applicati<strong>on</strong>s 2.2.a, 2.1.c, and 2.2.a in example 3 introducesanother usable rule, namely:RR ∗ |R ∗ = R + |R ∗ = R ∗(4.1a)Using this rule the rewriting reduces from four to two steps.5 Deterministic c<strong>on</strong>tent modelsNow we know the rules and how to use them, we can introduce another importanttopic we will use. The W3C has decided that <strong>XML</strong> should be compatiblewith SGML. As a c<strong>on</strong>sequence, c<strong>on</strong>tent models of DTD’s should be deterministictoo. A c<strong>on</strong>tent model is deterministic if for each element part of an instance<strong>XML</strong>-document, the bel<strong>on</strong>ging element in the c<strong>on</strong>tent model can be determinedwithout looking forward.Example The c<strong>on</strong>tent model (a, b)|(a, c) is not deterministic, because if an<strong>XML</strong>-parser reaches an element a, the parser can not c<strong>on</strong>clude, without lookingforward, which a should be taken in the c<strong>on</strong>tent model. If the next element is10


a b, the first <strong>on</strong>e (a, b) should be taken, if the next <strong>on</strong>e is a c the sec<strong>on</strong>d <strong>on</strong>e(a, c) should be taken.Formally, let ¯r stand for the expressi<strong>on</strong> obtained from the regular expressi<strong>on</strong> rby replacing the ith occurrence of symbol σ in r by σ i , for every i and σ. Forexample, for r = b ∗ a(b ∗ a) ∗ we have ¯r = b ∗ 1a 1 (b ∗ 2a 2 ) ∗ .Definiti<strong>on</strong> A regular expressi<strong>on</strong> r is deterministic if there are no words wσ i vand wσ j v ′ in L(¯r) such that i ≠ j [3].This definiti<strong>on</strong> is declarative. A functi<strong>on</strong>al <strong>on</strong>e, which we will use for determiningwhether or not a c<strong>on</strong>tent model is deterministic, is as follows:Definiti<strong>on</strong> Generally, there are two situati<strong>on</strong>s in which n<strong>on</strong> determinism occurs[19]:1. If the c<strong>on</strong>tent model c<strong>on</strong>tains X|Y and the sets of atoms that can start astring generated by X and Y are not disjoint. An example of this situati<strong>on</strong>is (a, b)|(a, c), in which case the set of starters generated by (a, b) is {a},and the set of starters generated by (a, c) is {a} too.2. If the c<strong>on</strong>tent model c<strong>on</strong>tains X ∗ , X + , and/or X?, and the set of atomsthat can start a string generated by X is not disjoint from the set of atomsthat can follow in this particular c<strong>on</strong>text. An example of this situati<strong>on</strong> isa ∗ , a, in which case the set of starters generated by a ∗ equals the set ofstarters generated by a, namely {a}.Removing n<strong>on</strong> determinism In the first situati<strong>on</strong> of the functi<strong>on</strong>al definiti<strong>on</strong>,n<strong>on</strong> determinism can always be removed by left factorizati<strong>on</strong>, using therules 2.3.d or 2.5.a.. Left factorizati<strong>on</strong> is a term from grammar transformati<strong>on</strong>theory [10]. Applicati<strong>on</strong> of rule 2.3.d <strong>on</strong> (a, b)|(a, c) results in: a, (b|c). Now,both c<strong>on</strong>tent models are semantically equal but the sec<strong>on</strong>d <strong>on</strong>e is deterministic,i.e. the parser does not have to look <strong>on</strong>e element forwards knowing whichc<strong>on</strong>tent model should be chosen.Sometimes, preparatory steps are needed to isolate the part that has to befactored. The next example shows such a situati<strong>on</strong>.Example As an example, the c<strong>on</strong>tent model (a?, b)|(b, a?) is not deterministic.11


This model can be turned into a deterministic <strong>on</strong>e, as follows:(a?, b)|(b, a?)= (2.5.a twice)b|(a, b)|b|(b, a)= (2.1.b)(a, b)|b|b|(b, a)= (2.1.c)(a, b)|b|(b, a)= (2.5.a)(a, b)|(b, a?)By firstly applying rule 2.5.a twice, we turn the expressi<strong>on</strong> into case <strong>on</strong>e of thefuncti<strong>on</strong>al definiti<strong>on</strong>: if the parser reads a ’b’, it is unclear which model shouldbe chosen. The first three steps prepare the model for factorizati<strong>on</strong>. The n<strong>on</strong>determinism is removed by factorizati<strong>on</strong> in the last step.The sec<strong>on</strong>d situati<strong>on</strong> of the functi<strong>on</strong>al definiti<strong>on</strong> is more complicated. In thecase of a ∗ , a, equivalent deterministic expressi<strong>on</strong>s are a, a ∗ (applicati<strong>on</strong> of rule2.3.a) and a + (applicati<strong>on</strong> of the rules 2.3.a. and 2.2.b). We give some extraexamples:Example C<strong>on</strong>sider a|(a, b) + :Example C<strong>on</strong>sider (a, b) ∗ , a:a|(a, b) += (2.2.b)a|a, b, (a, b) ∗= (2.3.d)a, (ɛ|b, (a, b) ∗ )= (2.2.c)a, (b, (a, b) ∗ )?(a, b) ∗ , a= (2.3.c)a, (b, a) ∗12


Example C<strong>on</strong>sider (a, b) ∗ |a:Example C<strong>on</strong>sider (a, b) ∗ |(a, c) ∗ :(a, b) ∗ |a= (2.2.a)a, b, (a, b) ∗ |ɛ|a= (2.1.b)ɛ|a|a, b, (a, b) ∗= (2.3.d)ɛ|a, (ɛ|b, (a, b) ∗ )= (2.2.c)ɛ|a, (b, (a, b) ∗ )?= (2.2.c)(a, (b, (a, b) ∗ )?)?(a, b) ∗ |(a, c) ∗= (2.3.b twice)a, b, (a, b) ∗ |ɛ|a, c, (a, c) ∗ |ɛ= (2.1.c and 2.1.b)ɛ|a, b, (a, b) ∗ |a, c, (a, c) ∗= (2.3.d)ɛ|a, (b, (a, b) ∗ |c, (a, c) ∗ )= (2.2.c)(a, (b, (a, b) ∗ |c, (a, c) ∗ ))?Although most n<strong>on</strong> deterministic regular repressi<strong>on</strong>s can be made deterministic,there are n<strong>on</strong> deterministic regular expressi<strong>on</strong>s that can not be rewritten intoan equivalent deterministic <strong>on</strong>e [3, 6, 7]. Examples of expressi<strong>on</strong>s that can’t bemade deterministic are:• (a, b) ∗ , (a, c) ∗• (a, b) ∗ , a?• (a, b) ∗ , (a|b)?Remark After introducti<strong>on</strong> of the ∗ operator, the subexpressi<strong>on</strong> enclosed bythe this operator could be part of n<strong>on</strong> determinism. Looking at the four examplesc<strong>on</strong>cerned this situati<strong>on</strong>, a useful strategy removing this n<strong>on</strong> determinismmight be:1. Expand expressi<strong>on</strong>s of the form R ∗ , R + , or R? using rules 2.2.a, 2.2.b,2.2.c, and 2,3.b;13


Normal form A normal form (or standard or can<strong>on</strong>ical form) of an expressi<strong>on</strong>in the domain of a term-rewriting system is an expressi<strong>on</strong> which has a certainform and cannot be rewritten anymore. Known examples in the field of logicare the disjunctive normal form (informally (. . . ∧ . . . ∧ . . .) ∨ . . . ∨ (. . . ∧ . . .)),and the c<strong>on</strong>junctive normal form (informally (. . . ∨ . . . ∨ . . .) ∧ . . . ∧ (. . . ∨ . . .)).Known normal forms of regular expressi<strong>on</strong>s are the star normal form (snf ) andthe epsil<strong>on</strong> normal form (enf ) [6]. These types of normal forms are not usefulin case of simplifying <strong>XML</strong> c<strong>on</strong>tent models. In secti<strong>on</strong> 6.3 we will introduce anormal form we can use in modeling <strong>XML</strong> element c<strong>on</strong>tent.Strategy A strategy can be regarded as a descripti<strong>on</strong> of how the rewrite rulesmay be combined, i.e. in which order they may be applied, in order to effectivelyreach a normal form. In secti<strong>on</strong> 6.4 we will describe a strategy for reaching anormal form for <strong>XML</strong> element c<strong>on</strong>tent.Rewriting <strong>XML</strong> element c<strong>on</strong>tent To develop a system for the systematicalrewriting of <strong>XML</strong> element c<strong>on</strong>tent, we have to solve two problems. The first isa lack of an appropriate existing normal form of regular expressi<strong>on</strong>s for <strong>XML</strong>element c<strong>on</strong>tent. For this, we describe a normal form we can use for modeling<strong>XML</strong> element c<strong>on</strong>tent in secti<strong>on</strong> 6.3. The sec<strong>on</strong>d <strong>on</strong>e is that of a strategy. Aswe will explain, for a certain category of <strong>XML</strong> c<strong>on</strong>tent models we are able togive a strategy. For all other models, we describe an approach in terms of rulesof thumb.Besides a normal form, we introduce some other aspects which are of importancein modeling <strong>XML</strong> element c<strong>on</strong>tent. These are the correctness, precisi<strong>on</strong>,and c<strong>on</strong>ciseness of a c<strong>on</strong>tent model.6.2 Aspects of <strong>XML</strong> c<strong>on</strong>tent models6.2.1 Precisi<strong>on</strong>A precise c<strong>on</strong>tent model c<strong>on</strong>tains exactly those sequences of child elements thatwe want to have and no other <strong>on</strong>e.Definiti<strong>on</strong> If R is a c<strong>on</strong>tent model and T represents all sequences of childelements that we want to have, then R is a precise model if and <strong>on</strong>ly if L(R) = Tholds.It is very easy to obtain a precise model. The <strong>on</strong>ly thing we have to do is towrite down the element c<strong>on</strong>tent as a number of choices based <strong>on</strong> the example<strong>XML</strong>-file(s).Example If we have for example the following <strong>XML</strong>-file:15


a precise c<strong>on</strong>tent model of element rec is: a, b, a, b, a, b|a, b|ɛ|a, b, a, b.We call such a first precise c<strong>on</strong>tent model the starting form (SF).Definiti<strong>on</strong> The starting form (SF) of an <strong>XML</strong> c<strong>on</strong>tent model is an expressi<strong>on</strong>c<strong>on</strong>sisting of a number of choices expr 1 |expr 2 |...|expr n , where each expr i isa sequence of child elements of the element c<strong>on</strong>sidered in the <strong>XML</strong> instancedocument(s)By rewriting an SF, we can transform the model into another form and makethe model more c<strong>on</strong>cise.The precisi<strong>on</strong> of a c<strong>on</strong>tent model cannot be tested using a validating parser:we have to check that <strong>on</strong>ly sequences of child elements we want are part of thesequences the c<strong>on</strong>tent model encloses. The <strong>on</strong>ly approach here is to derive thelanguage of the c<strong>on</strong>tent model (see secti<strong>on</strong> 2.2) and check whether the sequencesof child elements we want to have are the same as the language of the c<strong>on</strong>tentmodel.6.2.2 CorrectnessA correct c<strong>on</strong>tent model c<strong>on</strong>tains at least all sequences of child elements thatwe want to have.Definiti<strong>on</strong> If R is a c<strong>on</strong>tent model and T represents all sequences of childelements we want to have, then R is a correct model if and <strong>on</strong>ly if T is a subsetof the language generated by R: L(R) ⊇ T .Example The c<strong>on</strong>tent model (ab) ∗ encloses the model a, b, a, b, a, b|a, b|ɛ|a, b, a, band is as such a correct model.The c<strong>on</strong>tent model ANY is always a correct model.Definiti<strong>on</strong> The form EF of an <strong>XML</strong> c<strong>on</strong>tent model is a number of choicesatom 1 |atom 2 |...|atom n , where each atom i is an element name that can occur aschild element of the element c<strong>on</strong>sidered in the <strong>XML</strong> instance document(s).Another always correct c<strong>on</strong>tent model is EF ∗ .Example The atoms of the model a, b, a, b, a, b|a, b|ɛ|a, b, a, b are a and b. Assuch, another correct model is (a|b) ∗ . Notice that L(ANY) ⊇ L((a|b) ∗ ) ⊇L((a, b) ∗ ).16


The use of the cardinality operators ∗ and + needs some attenti<strong>on</strong> in this matter,because they will introduce imprecise models. Assume the c<strong>on</strong>tent modeltitle, author, author, author, chapter, chapter. The operator + may <strong>on</strong>ly be usedif <strong>on</strong>e or more authors and/or chapters are allowed. If, in this case, exactly threeauthors and two chapters are required, the operator + may not be used sinceL(author, author, author) L(author + ). The same goes for ∗. Whether theoperators + and ∗ can be used depends <strong>on</strong> the c<strong>on</strong>text.Note that <strong>XML</strong> Schema has the attributes minOccurs and maxOccurs tomake precise definiti<strong>on</strong>s for such situati<strong>on</strong>s. These definiti<strong>on</strong>s can be expressedusing regular expressi<strong>on</strong>s with numeric occurrence indicators (#res) [13], whichis an extensi<strong>on</strong> of traditi<strong>on</strong>al regular expressi<strong>on</strong>s. They allow to define the numberof required and allowed iterati<strong>on</strong>s of sub-expressi<strong>on</strong>s with numeric parameters.An expressi<strong>on</strong> R m..n denotes the catenati<strong>on</strong> of R with itself at least mand at most n times.Example As a sec<strong>on</strong>d example, the model ((c|d), c)? generated by <strong>XML</strong>-Spyis not a correct model (see secti<strong>on</strong> 1), because L(((c|d), c)?) = {ɛ, cc, dc} and{ɛ, cc, dc} {ɛ, c, dc}.The correctness of a c<strong>on</strong>tent model can be tested using a validating parser: itchecks all sequences of child elements in the example <strong>XML</strong>-document againstthe c<strong>on</strong>tent model.6.2.3 C<strong>on</strong>cisenessWe define c<strong>on</strong>ciseness in terms of the size of an expressi<strong>on</strong>. The expressi<strong>on</strong>size is defined as the sum of all symbols, excluding brackets, where we assumebrackets are used sparingly. The assumpti<strong>on</strong> is that expressi<strong>on</strong>s with a minimalnumber of symbols show the structure of the model most clearly. For example(a|b) ∗ is equal to (a ∗ , b ∗ ) ∗ , but the first <strong>on</strong>e expresses the structure of the modelmore clearly: a sequence (∗) of a’s and b’s in varying order (|).Definiti<strong>on</strong> The size of an expressi<strong>on</strong> is inductively defined by functi<strong>on</strong> sizewhich takes a regular expressi<strong>on</strong> as argument and returns the number of symbolsand operators of that expressi<strong>on</strong>:size atom = 1size(R | S) = size R + 1 + size Ssize(R , S) = size R + 1 + size Ssize R? = 1 + size Rsize R ∗ = 1 + size Rsize R + = 1 + size RRemark Notice that a case for empty c<strong>on</strong>tent (ɛ) is not included, becauseempty c<strong>on</strong>tent can not be part of an <strong>XML</strong> c<strong>on</strong>tent model.17


Example When we calculate the size of author + , title + stepwise, we get:size(author + , title + )= 3-th casesize(author + ) + 1 + size(title + )= 6-th case twice1 + size(author) + 1 + 1 + size(title)= first case twice1 + 1 + 1 + 1 + 1= arithmetic plus5Remark In cases of correct models, the c<strong>on</strong>tent model EF ∗ is always the mostc<strong>on</strong>cise form but often not strict enough. Therefore, c<strong>on</strong>ciseness al<strong>on</strong>e shouldnot be a criteri<strong>on</strong>, but should be evaluated in combinati<strong>on</strong> with the level ofimprecisi<strong>on</strong>.6.3 A normal form for <strong>XML</strong> c<strong>on</strong>tent modelsAssuming we start with an <strong>XML</strong> c<strong>on</strong>tent model in SF, our first model is a numberof choices: expr 1 |expr 2 |...|expr n , where each expr i is a particular sequence ofchild elements of the element c<strong>on</strong>sidered in the <strong>XML</strong> instance document(s)(seesecti<strong>on</strong> 6.2.1). The <strong>on</strong>ly requirement we further ask for is determinism (seesecti<strong>on</strong> 5).Definiti<strong>on</strong> A normal form of an <strong>XML</strong> c<strong>on</strong>tent model is a starting form (SF)which is deterministic. If we express the property that an expressi<strong>on</strong> expr isdeterministic as isDet(expr), then a normal form for <strong>XML</strong> element c<strong>on</strong>tent(XNF) is:isDet(SF ) (6.1)Notice that the property of determinism includes that all double occurrences areremoved (applicati<strong>on</strong> of rule R|R = R (2.1.c)). Additi<strong>on</strong>al, because the DTDlanguagedoes not allow empty c<strong>on</strong>tent as part of a composite c<strong>on</strong>tent model,ɛ-signs must be removed (applicati<strong>on</strong> of rule R? = ɛ|R (2.2.c). In secti<strong>on</strong> 7 wewill prove that every <strong>XML</strong> c<strong>on</strong>tent model in SF can be made deterministic.Example Assuming the c<strong>on</strong>tent model of element rec in the <strong>XML</strong>-file recs (seesecti<strong>on</strong> 6.2.1), our first model equals:a, b, a, b, a, b|a, b|ɛ|a, b, a, bBecause this model is not deterministic, we rewrite the model into a determin-18


istic <strong>on</strong>e:a, b, a, b, a, b|a, b|ɛ|a, b, a, b= (2.1.b twice)ɛ|a, b|a, b, a, b|a, b, a, b, a, b= (2.5.a)ɛ|a, b|a, b, a, b, (a, b)?= (2.5.a)ɛ|a, b, (a, b, (a, b)?)?= (2.2.c)(a, b, (a, b, (a, b)?)?)?6.4 StrategiesWe assume we start with an <strong>XML</strong> c<strong>on</strong>tent model in SF. In the case of a precisemodel, we can give a strategy for transforming an SF into a deterministic precisemodel. In the case of a correct model, we are not able to give a strategy. Instead,we present some rules of thumb.6.4.1 A strategy for Precise <strong>Models</strong>We start with an <strong>XML</strong> c<strong>on</strong>tent model in SF based <strong>on</strong> <strong>on</strong>e or more <strong>XML</strong> instancedocuments. Firstly, we remove redundant choices of the form R|R by applyingrule R|R = R. On the resulting model, we search iteratively under top-downc<strong>on</strong>trol for subexpressi<strong>on</strong>s of the form X|Y of which the sets of starters of Xand Y are not disjoined. If we find such a subexpressi<strong>on</strong>, we apply <strong>on</strong>e of thefactorizati<strong>on</strong> rules XR|XS = X(R|S) and XR|X = XR? <strong>on</strong> this expressi<strong>on</strong>,eventually after applicati<strong>on</strong> of the commutativity rule R|S = S|R. This partof the procedure stops when for all subexpressi<strong>on</strong> of the form X|Y the sets ofstarters of X and Y are disjoined. Because the SF started with can c<strong>on</strong>tainempty c<strong>on</strong>tent and empty c<strong>on</strong>tent can be introduced by applicati<strong>on</strong> of ruleXR|XS = X(R|S), we end with removing ɛ signs by applying rule ɛ|R = R?.The result of this strategy is a c<strong>on</strong>tent model in XNF. All rules used, must beapplied from left to right. More formally:Input: a c<strong>on</strong>tent model expr in SFRemove redundant choices of the form R|R by applying rule R|R = R.Repeat until isDet(expr)Search under top down c<strong>on</strong>trol for a subexpressi<strong>on</strong> of the form X|Yof which the sets of starters of X and Y are not disjoined.If needed, rearrange using the commutativity rule: R|S = S|R.Factorize this choice using <strong>on</strong>e of the rules XR|XS = X(R|S)or XR|X = XR? from left to right.If needed, remove empty c<strong>on</strong>tent by applying rule ɛ|R = R? from left to rightOutput: the c<strong>on</strong>tent model expr in XNF.19


After applying this strategy <strong>on</strong> a c<strong>on</strong>tent model in SF, the resulting expressi<strong>on</strong>is deterministic and the size of the resulting model is less or equals the size ofthe input model (see secti<strong>on</strong> 7).For determining the set of starters of an expressi<strong>on</strong>, we will define a functi<strong>on</strong>look-ahead in secti<strong>on</strong> 7.Remark The applied rewrite policy, searching and rewriting iteratively undertop-down c<strong>on</strong>trol, is also known as outermost reducti<strong>on</strong>, where each step reducesan outermost redex (a reducible expressi<strong>on</strong>) [5].Example Assume as an SF the following c<strong>on</strong>tent model: a, b|a, b, a, b|ɛ|c|c, c|a, b, c.We will develop a precise model in XNF:a, b|a, b, a, b|ɛ|c|c, c|a, b, c= (rearrange using 2.1.b)ɛ|a, b|a, b, a, b|a, b, c|c|c, c= (2.5.a)ɛ|a, b(a, b)?|a, b, c|c|c, c= (2.3.d)ɛ|a, b, ((a, b)?|c))|c|c, c= (2.5.a)ɛ|a, b, ((a, b)?|c))|c, c?= (2.2.c)This model is precise and in XNF.(a, b, ((a, b)?|c))|c, c?)?Example Assume as an SF the following c<strong>on</strong>tent model: a|a, x|a, x, b. We willdevelop a precise model in XNF:a|a, x|a, x, b= (2.5.a)a, x?|a, x, b= (2.3.d)a, (x?|x, b)= (2.2.c)a, (ɛ|x|x, b)= (2.5.a)a, (ɛ|x, b?)= (2.2.c)a, (x, b?)?20


This model is precise and in XNF. This derivati<strong>on</strong> looks like l<strong>on</strong>g for this model.The reas<strong>on</strong> is the top-down approach. An alternative rewriting is as follows:a|a, x|a, x, b= (2.3.d)a|a, (x|x, b)= (2.5.a)a|a, (x, b?)= (2.5.a)a, (x, b?)?6.4.2 Rules of thumb for Correct <strong>Models</strong>Again, we start with an <strong>XML</strong> c<strong>on</strong>tent model in SF based <strong>on</strong> <strong>on</strong>e or more <strong>XML</strong>instance documents. During rewriting, it is allowed to introduce the cardinalityoperators ∗ and +. Now, we have to deal with two problems;• Introducti<strong>on</strong> of these cardinality operators can introduce n<strong>on</strong> determinism.As we have menti<strong>on</strong>ed in secti<strong>on</strong> 5, this type of n<strong>on</strong> determinism can’talways be removed by rewriting the expressi<strong>on</strong>. As we have stated, wecould solve this situati<strong>on</strong> by make the expressi<strong>on</strong> less precise or by theintroducti<strong>on</strong> of extra levels in the <strong>XML</strong>-tree.• Whereas we have a strategy for precise models guaranteeing a c<strong>on</strong>tentmodel in XNF, for correct models we are unable to develop such a strategywhich is valuable. In secti<strong>on</strong> 6.2.2, we have seen that the model EF ∗ isalways a correct model. But, most times, this model is not strict enough.The problem is that a strategy for correct models can not automaticallydecide between precisi<strong>on</strong> and c<strong>on</strong>ciseness. Furthermore, whether the ∗operator can be introduced, depends <strong>on</strong> the c<strong>on</strong>text. For both problems,human interventi<strong>on</strong> is needed. As a result, in stead of a strategy we givesome rules of thumb.When to introduce the operators ∗ and +?in introducing these operators:There are two when-questi<strong>on</strong>s1. Do we introduce the operators ∗ and + before or after factorizati<strong>on</strong>? Generally,factorizati<strong>on</strong> removes repetitive parts of an expressi<strong>on</strong> and as a resultreduces the possibility to introduce the operators ∗ and +. If we havefor example the model a, b|a, b, a, b|b, factorizati<strong>on</strong> results in the modela, b, (a, b)?|b. On the other hand, if we assume the subexpressi<strong>on</strong> a, b isrepetitive and we first introduce the operator +, the resulting model couldbe (a, b) + |b. We recommend introducti<strong>on</strong> of the operators ∗ and/or + beforefactorizati<strong>on</strong> take place.2. Are there semantic c<strong>on</strong>strains for introducing the operators ∗ and +?The answer is certainly yes. A book c<strong>on</strong>sists of <strong>on</strong>e or more chapters, so a21


c<strong>on</strong>tent model of chapter + is reas<strong>on</strong>able. But a model as isbn ∗ , chapter + isincorrect, because each book has exactly <strong>on</strong>e isbn (Internati<strong>on</strong>al StandardBook Number). This questi<strong>on</strong> points to the fact that whereas modeling<strong>XML</strong>-c<strong>on</strong>tent can be supported by computer programs, introducti<strong>on</strong> ofcardinality operators is man’s handwork.Rules of thumbAgain, we assume as input a c<strong>on</strong>tent model in SF.Step 1. Remove redundant choices of the form R|R by applying rule R|R = R.Step 2. Introduce the operators ∗ and/or +. We distinguish three situati<strong>on</strong>s:i) After introducti<strong>on</strong> of the operator ∗ or +, the subexpressi<strong>on</strong> enclosedby the operator is not part of n<strong>on</strong> determinism. For example,the subexpressi<strong>on</strong> a, b|a, b, a, b in a, b|a, b, a, b|b can be replaced by(a, b) + , resulting in the model (a, b) + |b. The subexpressi<strong>on</strong> (a, b) +is not part of n<strong>on</strong> determinism. Our experience shows that thissituati<strong>on</strong> is most frequent.ii) After introducti<strong>on</strong> of the operator ∗ or +, the subexpressi<strong>on</strong> enclosedby the operator is part of n<strong>on</strong> determinism, but the n<strong>on</strong>determinism can be removed by rewriting the expressi<strong>on</strong>. For example,the subexpressi<strong>on</strong> a, b|a, b, a, b in a, b|a, b, a, b|a, b, c can bereplaced by (a, b) + resulting in (a, b) + |a, b, c. The subexpressi<strong>on</strong>(a, b) + is part of n<strong>on</strong> determinism. This n<strong>on</strong> determinism can beremoved by (see secti<strong>on</strong> 5):1) Expand expressi<strong>on</strong>s of the form R ∗ , R + , or R? using rules 2.2.a,2.2.b, 2.2.c, and 2,3.b;2) If needed, rearranging parts using rules 2.1.b, 2.3.a, and 2.3.c;3) Apply factorizati<strong>on</strong> using rules 2.3.d and 2.5.a.Applying this approach to our example results in:(a, b) + |a, b, c= (2.2.b)a, b, (a, b) ∗ |a, b, c= (2.3.d)(a, b, ((a, b) + |c))In many cases, the n<strong>on</strong> determinism introduced can be removed bythis approach.iii) After introducti<strong>on</strong> of the operator ∗ or +, the subexpressi<strong>on</strong> enclosedby the operator is part of n<strong>on</strong> determinism, but the n<strong>on</strong>determinism can not be removed by rewriting the expressi<strong>on</strong>. Asstated in secti<strong>on</strong> 5, there are two possibilities to solve such a situati<strong>on</strong>:22


a) Make the expressi<strong>on</strong> less precise, i.e. enlarge the language theregular expressi<strong>on</strong> expresses. For example, the expressi<strong>on</strong> (a, b) ∗ , (a, c) ∗can be rewritten into (a, (b|c)) ∗ , which expressi<strong>on</strong> is deterministic.b) Introduce extra levels in the <strong>XML</strong>-tree. Is we have for examplethe c<strong>on</strong>tent model (a, b) ∗ , (a, c) ∗ as in:we could remove the n<strong>on</strong> determinism by introducing <strong>on</strong>e extralevel:Step 3. Optimize the expressi<strong>on</strong> by applying factorizati<strong>on</strong> and removing ɛ signs.Remark In almost all cases, we can start with a c<strong>on</strong>tent model in SF whichsumming up exactly all sequences of child elements of the element c<strong>on</strong>sidered.Sometimes however, we can not sum up exactly these sequences. An exampleof such a situati<strong>on</strong> is a c<strong>on</strong>tent model representing the moves of chess game:(white, black) + , white?. We know the moves will alternate white and black, butthe number of moves and which color will be the last <strong>on</strong>e are uncertain. In suchcases, we start directly with step 2.Example Again, assume as an SF the c<strong>on</strong>tent model: a, b|a, b, a, b|ɛ|c|c, c|a, b, c.We will develop a correct model in XNF, Now, the cardinality operators ∗ and+ are allowed.a, b|a, b, a, b|ɛ|c|c, c|a, b, c= (rearrange using 2.1.b)≤ɛ|a, b|a, b, a, b|a, b, c|c|c, c(introduce * operator)(a, b) ∗ |a, b, c|c|c, c= (2.2.a)ɛ|a, b, (a, b) ∗ |a, b, c|c|c, c= (2.3.d)ɛ|a, b, ((a, b) ∗ |c)|c|c, c= (rearrange using 2.1.b)≤a, b, ((a, b) ∗ |c)|ɛ|c|c, c(introduce * operator)a, b, ((a, b) ∗ |c)|c ∗This model is correct and in XNF. If we assume all numbers of ab’s followed byall numbers of c’s are allowed, we get:(a, b) ∗ , c ∗23


This model is correct and in XNF too. The sec<strong>on</strong>d <strong>on</strong>e is more c<strong>on</strong>cise then thefirst <strong>on</strong>e. But the sec<strong>on</strong>d <strong>on</strong>e is less precise. Often, c<strong>on</strong>ciseness is at the expenseof precisi<strong>on</strong>.Remark The methods described for modeling precise and correct models, areapplicable in cases where we have <strong>on</strong>e <strong>XML</strong>-file with <strong>on</strong>e of more examplerecords, a number of <strong>XML</strong> files, and even without an example <strong>XML</strong>-file atall. In all these cases, we are able to sum up, by reading and/or thinking, allsequences of child elements of each element we want to have, i.e. to draw up ac<strong>on</strong>tent model in SF. In a sec<strong>on</strong>d phase, we can rewrite this form in precise orcorrect XNF.7 Formalizati<strong>on</strong>8 Algorithmic approachBesides rewriting regular expressi<strong>on</strong>s as explained in the previous secti<strong>on</strong>, thereare algorithmic soluti<strong>on</strong>s for simplifying regular expressi<strong>on</strong>s. For the sake ofcompleteness, we briefly menti<strong>on</strong> this approach in this secti<strong>on</strong>. More informati<strong>on</strong>can be found in [11, 10].These algorithms are not based <strong>on</strong> rewrite rules. Instead, these algorithmstransform a regular expressi<strong>on</strong> into a Finite-state Automati<strong>on</strong> (FA). There aretwo types of FA’s, namely N<strong>on</strong>-deterministic (NFA) and Deterministic <strong>on</strong>es(DFA). If the transformati<strong>on</strong> of a regular expressi<strong>on</strong> results in an NFA, thisNFA should be firstly transformed into a DFA. Turning an NFA into a DFAusually increases the size of the automati<strong>on</strong> by some factor. Often, a DFAcan be simplified by applying a DFA minimizati<strong>on</strong> algorithm. Finally, the optimizedDFA is transformed back into a regular expressi<strong>on</strong> (however, the lasttransformati<strong>on</strong>, from DFA to a deterministic regular expressi<strong>on</strong>, is not alwayspossible).Another approach uses a probabilistic algorithm that learns c<strong>on</strong>tent models,and selects the <strong>on</strong>e that best describes the samples [3].9 C<strong>on</strong>clusi<strong>on</strong>In these notes, a method is introduced for modeling <strong>XML</strong>-c<strong>on</strong>tent models systematically.The method is based <strong>on</strong> rewriting regular expressi<strong>on</strong>s, which <strong>XML</strong>c<strong>on</strong>tent models actually are. The method is introduced for DTD-elements, butwill be useful for modeling <strong>XML</strong>-Schema complex types with complex c<strong>on</strong>tenttoo.AcknowledgementsThe author thanks Bastiaan Heeren, Arnold van der Leer, and Lex Bijlsma forreading and commenting <strong>on</strong> draft versi<strong>on</strong>s of these notes.24


References[1] Altova. Xmlspy. http://www.altova.com, 2009.[2] D. Barbosa, L. Mignet, and P. Veltri. Studying the xml web: Gatheringstatistics from an xml sample. World Wide Web, 8:413–438, 2005.[3] G. Bex, W. Gelade, F. Neven, and S. Vansummeren. Learning deterministicregular expressi<strong>on</strong>s for the inference of schemas from xml data. WWW2008, 2008.[4] Geert Jan Bex, Frank Neven, and Jan Van den Bussche. Dtds versusxml schema: a practical study. In WebDB ’04: Proceedings of the 7thInternati<strong>on</strong>al Workshop <strong>on</strong> the Web and Databases, pages 79–84, New York,NY, USA, 2004. ACM.[5] R. Bird. Introducti<strong>on</strong> to functi<strong>on</strong>al programming using Haskell. PrenticeHall, 1988. 2rd editi<strong>on</strong>.[6] A. Bruggemann-klein and D. Wood. One-nambiguous regular languages.Informati<strong>on</strong> and Computati<strong>on</strong>, 140:229–253, 1998.[7] Anne Bruggemann-klein. Regular expressi<strong>on</strong>s into finite automata. TheoreticalComputer Science, 120:87–98, 1993.[8] Byr<strong>on</strong> Choi. A few tips for good xml design. Technical report, 2000.[9] B.A. Davey and H.A. Priestley. Introducti<strong>on</strong> to lattices and order. Cambridge,2002. 2e editi<strong>on</strong>.[10] D. Grune and C.J.H. Jacobs. Parsing Techniques. Springer, 2008.[11] J.E. Hopcroft and J.D. Ullman. Introducti<strong>on</strong> to Automata Theory, Languages,and Computati<strong>on</strong>. Addis<strong>on</strong>-Wesley, 1979.[12] J. Jeuring and D. Swierstra. Grammars and Parsing. Universiteit Utrecht,2001. Lecture notes - Not published.[13] P. Kilpelainen and R. Tuhkanen. Regular expressi<strong>on</strong>s with numerical occurrenceindicators—preliminary results. In In Proc. of the Eighth Symposium<strong>on</strong> Programming Languages and Software Tools, pages 163 - 173, 2003.[14] Wim Martens, Frank Neven, Thomas Schwentick, and Geert Jan Bex. Expressivenessand complexity of xml schema. ACM Trans. Database Syst.,31(3):770–813, 2006.[15] Laurent Mignet, Denils<strong>on</strong> Barbosa, and Pierangelo Veltri. The xml web: afirst study. In WWW ’03: Proceedings of the 12th internati<strong>on</strong>al c<strong>on</strong>ference<strong>on</strong> World Wide Web, pages 500–510, New York, NY, USA, 2003. ACM.[16] A. Moller and M. Schwartzback. An Introducti<strong>on</strong> to <strong>XML</strong> and Web Technologies.WileyAddis<strong>on</strong> Wesley, 2006. 1st editi<strong>on</strong>.25


[17] C.M. Sperberg-McQueen. Applicati<strong>on</strong>s of brzozowski derivates to xmlschema processing. Extreme Markup Languages, page 26, IDEAlliance,2005.[18] Jan van Leeuwen, editor. Handbook of Theoretical Computer Science, VolumeB: Formal <strong>Models</strong> and Sematics. Elsevier and MIT Press, 1990.[19] D.A. Watt and D.F. Brown. Programming language processors in Java.Prentice Hall, 2000.26

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!