Glushkov Automata - sbes - 2007
22nd Brazilian Symposium on Databases
SBBD 2007
Assisting XML Schema Evolution that Preserves Validity
Béatrice Bouchou, Laboratoire d'Informatique (LI), bouchou@univ-tours.fr
Denio Duarte, Centro Tecnológico (Cetec), denio@unochapeco.edu.br
Agenda
● Motivation
● Theoretical Background
● Approach
● Final Considerations
Motivation
[Figure: XML documents that are valid with respect to a schema]
● Constraint: publications are journal papers, organized by subject and year of publication.
● Content model: Publication : Subject (Year Journal+)*
Motivation
● The labs decide to consider conference papers as publications.
● Current content model: Publication : Subject (Year Journal+)*
● The schema must be updated.
Motivation
Candidate updates (Schema'):
● Publication : Subject (Year Journal+ Conference)*
– Conference is mandatory, so documents without conference papers become invalid.
● Publication : Subject (Year Journal+ Conference?)*
– Only one conference paper per year?
● Publication : Subject (Year Journal+ Conference+)*
– Conference is still mandatory.
● Publication : Subject (Year Journal+ Conference*)*
– It's OK!
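The four candidates can also be compared mechanically. The sketch below is not part of the original slides: it assumes a single-letter encoding of the element names (S, Y, J, C), enumerates every child sequence of bounded length accepted by the old content model, and checks which candidates still accept all of them with Python's standard `re` module. A bounded check of this kind is only evidence, not a proof of language inclusion.

```python
import re
from itertools import product

# Single-letter encoding (an assumption of this sketch):
# S = Subject, Y = Year, J = Journal, C = Conference.
OLD = "S(YJ+)*"
CANDIDATES = {
    "Conference": "S(YJ+C)*",
    "Conference?": "S(YJ+C?)*",
    "Conference+": "S(YJ+C+)*",
    "Conference*": "S(YJ+C*)*",
}

def words(pattern, alphabet="SYJC", max_len=6):
    """All strings up to max_len over the alphabet that fully match pattern."""
    rx = re.compile(pattern)
    return {
        "".join(w)
        for n in range(max_len + 1)
        for w in product(alphabet, repeat=n)
        if rx.fullmatch("".join(w))
    }

def preserves_validity(old, new, max_len=6):
    """True if every bounded-length word valid for old is also valid for new."""
    rx_new = re.compile(new)
    return all(rx_new.fullmatch(w) for w in words(old, max_len=max_len))

results = {name: preserves_validity(OLD, rx) for name, rx in CANDIDATES.items()}
print(results)
```

Under this encoding the two mandatory variants reject the old word SYJ, while the `?` and `*` variants accept every old word; the `?` variant is validity-preserving but allows at most one conference paper per year.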
Motivation
● With Publication : Subject (Year Journal+ Conference*)*, one of the laboratories sends several entries for conferences and journals.
● The administrator must reorder them before insertion, since the schema requires journals to appear before conferences.
● An order-independent alternative: Publication : Subject (Year (Journal|Conference)+)*
Motivation
✔ It is difficult to evolve XML schemas, especially if the administrator is not a computer science expert.
Motivation
● The demand for tools designed for administrators outside the computer science community
● The cost (time and money) of the revalidation process
● Distributed XML databases that follow the same schema
Theoretical Background
● Theoretical notions used in this work
– Regular expressions (RE) define the allowed sub-elements of an element.
– Finite state automata (FSA) verify whether the sub-elements respect the constraints imposed by the element.
– Transformation of RE to FSA
● Glushkov's algorithm ⇒ Glushkov automaton
Glushkov Automata
● Given an RE, a Glushkov automaton is built as follows:
– All symbols in the RE are subscripted by their positions:
● Subject (Year Journal+)* becomes Subject₁ (Year₂ Journal₃+)*
– We add an end mark (#) to the RE:
● Subject₁ (Year₂ Journal₃+)* #₄
– Positions become the states of the FSA, and each transition is labelled with the symbol of the position it enters.
– Each state represents a symbol (except the initial state).
Glushkov Automata
● Subject₁ (Year₂ Journal₃+)* #₄
[Figure: Glushkov automaton with states 0 to 4 and transitions 0→1 (Subject), 1→2 (Year), 2→3 (Journal), 3→3 (Journal), 3→2 (Year), 1→4 (#), 3→4 (#)]
[Figure: the corresponding Glushkov graph (homogeneous) over nodes 0 to 4]
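As an illustration of the construction, here is a minimal sketch of the classical position-automaton computation (nullable, first, last, follow) applied to the running example. The class and function names are assumptions of this sketch and do not come from the slides; the end mark # is treated as an ordinary symbol at position 4.

```python
from dataclasses import dataclass

# Tiny regular-expression AST; the class names are assumptions of this sketch.
@dataclass(frozen=True)
class Sym:
    name: str
    pos: int

@dataclass(frozen=True)
class Cat:
    left: object
    right: object

@dataclass(frozen=True)
class Plus:
    body: object

@dataclass(frozen=True)
class Star:
    body: object

def nullable(e):
    if isinstance(e, Sym):
        return False
    if isinstance(e, Cat):
        return nullable(e.left) and nullable(e.right)
    if isinstance(e, Plus):
        return nullable(e.body)
    return True  # Star

def first(e):
    if isinstance(e, Sym):
        return {e.pos}
    if isinstance(e, Cat):
        return first(e.left) | (first(e.right) if nullable(e.left) else set())
    return first(e.body)

def last(e):
    if isinstance(e, Sym):
        return {e.pos}
    if isinstance(e, Cat):
        return last(e.right) | (last(e.left) if nullable(e.right) else set())
    return last(e.body)

def follow(e, table):
    """Fill table[p] with the positions that may come right after position p."""
    if isinstance(e, Sym):
        table.setdefault(e.pos, set())
    elif isinstance(e, Cat):
        follow(e.left, table)
        follow(e.right, table)
        for p in last(e.left):
            table[p] |= first(e.right)
    else:  # Plus or Star: the body may repeat
        follow(e.body, table)
        for p in last(e.body):
            table[p] |= first(e.body)
    return table

def glushkov(e):
    """Return the arcs of the position automaton; state 0 is the initial state."""
    table = follow(e, {})
    arcs = {(0, q) for q in first(e)}
    arcs |= {(p, q) for p, qs in table.items() for q in qs}
    return arcs

# Subject_1 (Year_2 Journal_3+)* #_4
expr = Cat(
    Cat(Sym("Subject", 1), Star(Cat(Sym("Year", 2), Plus(Sym("Journal", 3))))),
    Sym("#", 4),
)
arcs = glushkov(expr)
print(sorted(arcs))
```

Because every arc entering a state carries that state's symbol, the arcs alone are enough to describe the graph; the labels can be recovered from the target position.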
Glushkov Automata
● Subject₁ (Year₂ Journal₃+)* #₄
[Figure: Glushkov graph over nodes 0 to 4 with its cycles highlighted]
● Cycles are called orbits.
● The orbits represent the starred subexpressions of the regular expression.
● The hierarchy of orbits H is formed by the orbits in the graph and the set of all nodes, ordered by set inclusion:
– H = {{3}, {2,3}, {0,1,2,3,4}}
Glushkov Automata
● The notion of orbit gives us another notion:
– Context: the set of symbols that appear in an orbit.
– A general context contains the symbols that do not belong to any other context.
– Contexts are disjoint sets.
● In our example:
– Context for Journal: {Journal}, corresponding to the orbit {3}
– Context for Year: {Year}, corresponding to the orbit {2,3}
– General context: {Subject}
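One way to read the example is that each orbit keeps only the symbols that no smaller orbit already claims, which is what makes the contexts disjoint. The sketch below follows that reading; it is an interpretation of the slide, and the names `SYMBOL`, `H` and `contexts` are assumptions made for illustration.

```python
# Position -> symbol for Subject_1 (Year_2 Journal_3+)* (positions from the slides).
SYMBOL = {1: "Subject", 2: "Year", 3: "Journal"}
# Hierarchy of orbits from the slides; 0 is the initial state and 4 the end mark.
H = [{3}, {2, 3}, {0, 1, 2, 3, 4}]

def contexts(hierarchy, symbol):
    """Map each orbit to the symbols it contains that no smaller orbit contains."""
    result = {}
    for orbit in hierarchy:
        # Union of all orbits strictly included in this one.
        inner = set().union(*(o for o in hierarchy if o < orbit))
        own = sorted(symbol[p] for p in orbit - inner if p in symbol)
        result[frozenset(orbit)] = own
    return result

ctx = contexts(H, SYMBOL)
for orbit, symbols in ctx.items():
    print(sorted(orbit), symbols)
```

The set of all nodes plays the role of the general context, so Subject ends up there because it belongs to no orbit.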
Glushkov Automata
● We can transform a Glushkov automaton into an RE following the approach of Caron and Ziadi (Caron & Ziadi, 2000, Theoretical Computer Science):
– First, the orbits are removed: for each orbit, all arcs producing a cycle are deleted, first for orbit {2,3} and then for orbit {3}.
– The orbits are stored in the hierarchy of orbits H.
[Figure: graph over nodes 0 to 4 after each orbit removal]
● We have a graph without orbits
● And the hierarchy of orbits H = {{3}, {2,3}, {0,1,2,3,4}}
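The orbit-removal step can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes that the arcs "producing a cycle" are those going from a node with an arc leaving the orbit to a node with an arc entering it, and it uses a naive reachability search that is only suitable for small graphs. All function names are hypothetical.

```python
def reachable(arcs, start, allowed):
    """Nodes of `allowed` reachable from start through at least one arc."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for a, b in arcs:
            if a == u and b in allowed and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def maximal_orbits(arcs, nodes):
    """Strongly connected subsets of `nodes` that contain at least one cycle."""
    orbits, done = [], set()
    for n in sorted(nodes):
        if n in done:
            continue
        forward = reachable(arcs, n, nodes)
        if n not in forward:  # n lies on no cycle inside `nodes`
            continue
        orbit = frozenset(m for m in forward if n in reachable(arcs, m, nodes))
        orbits.append(orbit)
        done |= orbit
    return orbits

def remove_orbits(arcs, nodes):
    """Return (graph without orbits, hierarchy of orbits sorted by size)."""
    arcs = set(arcs)
    found = []
    todo = [frozenset(nodes)]
    while todo:
        scope = todo.pop()
        for orbit in maximal_orbits(arcs, scope):
            entries = {b for a, b in arcs if b in orbit and a not in orbit}
            exits = {a for a, b in arcs if a in orbit and b not in orbit}
            # Assumed reading: the cycle-producing arcs go from exits to entries.
            back = {(a, b) for a, b in arcs if a in exits and b in entries}
            if not back:
                raise ValueError("orbit without exit-to-entry arcs")
            arcs -= back
            found.append(orbit)
            todo.append(orbit)  # look for nested orbits inside this one
    found.append(frozenset(nodes))
    return arcs, sorted(found, key=len)

# Glushkov graph of Subject_1 (Year_2 Journal_3+)* #_4
G = {(0, 1), (1, 2), (1, 4), (2, 3), (3, 2), (3, 3), (3, 4)}
G_w, H = remove_orbits(G, range(5))
print(sorted(G_w))
print([sorted(o) for o in H])
```

On the running example the arcs 3→2 and 3→3 are deleted, in that order, which matches the order in which the slides remove orbit {2,3} and then orbit {3}.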
Glushkov Automata
● We can transform a Glushkov automaton into an RE:
– Second, we start a reduction process over the graph without orbits.
– The process is driven by H and respects set inclusion: the innermost orbits are reduced first.
Glushkov Automata
● We can transform a Glushkov automaton into an RE:
– Applying three rules:
● Rule 1: a node x followed by a node y is reduced to a single node xy.
● Rule 2: two parallel nodes x and y are reduced to a single node x|y.
● Rule 3: a node x that can be bypassed is reduced to x?.
Glushkov Automata
● We can transform a Glushkov automaton into an RE:
– Moreover, if a node represents a whole orbit, it is decorated with a + (positive closure).
Glushkov Automata
● Reduction process:
– Start: nodes 0, 1, 2, 3, 4 with H = {{3}, {2,3}, {0,1,2,3,4}}
– Node 3 represents the whole orbit {3}, so it becomes 3+; H = {{2,3}, {0,1,2,3,4}}
– Rule 1 merges 2 and 3+ into (2 3+), which represents the orbit {2,3} and becomes (2 3+)+; H = {{0,1,2,3,4}}
– Rule 3 makes this node optional: (2 3+)+ becomes (2 3+)*
– Rule 1 merges the remaining nodes: 0 1 (2 3+)* 4
● Result: Subject (Year Journal+)*
Schema Update Primitives
● Primitives:
– Insertion, replacement, deletion, creation
● Attributes, elements, content models
– Cardinality changes
– Constraints
● Functional dependencies, types
Schema Update Primitives
● Guerrini et al. [WIDM 2005, Workshop on Web Information and Data Management] have shown the impact of schema update primitives on XML documents:
– We can propose primitives that may be consistency-preserving
Schema Update Primitives
● They are:
– Insertion as optional:
● Attributes, elements
– Creation:
● Content model
– Changes to an element's cardinality:
● 1 to 0
● 1 or 0 to 0:n
Approach<br />
Proposed Conservative Primitives
● Inserting a sub-element into an element's content model:
– ins
● Making an element optional:
– makeOpt
● Extending the cardinality:
– ExtendCard
● Creating a new content model:
– createCM
Insertion
● User tool (prototype):
You have chosen to insert Conference into the content model of Publication.
Select an element whose meaning is close to that of Conference:
Subject (Year Journal+)*
Select where you want to insert Conference:
– relative to Journal
– relative to (Journal+)
Select how you want to insert Conference:
– as a choice: Journal | Conference
– before: Conference Journal
– after: Journal Conference
Do you want Conference to be repeated?
– Yes
– No
Insertion
● To insert a new sub-element e' into the content model E of an element e:
– A node n (representing e') is inserted into the corresponding Glushkov graph without orbits G_w, at a given position τ and context ζ:
ins(e, e', τ, context, mode, times)
● context = true (e' is inserted relative to ζ) or false (relative to τ)
● mode is choice, sequence-after or sequence-before
● times = true: e' is decorated with +
– Example: ins(Publication, Conference, 3, false, choice, false)
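To make the parameters concrete, here is a minimal sketch of an `ins`-like operation. It deliberately simplifies the slides: it rewrites a small regular-expression tree instead of the Glushkov graph G_w, it takes the content model itself as first argument rather than the element name, and it only handles context = true when position τ is directly wrapped by a repetition, as in (Journal+). All class and helper names are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sym:
    name: str
    pos: int

@dataclass(frozen=True)
class Seq:
    items: tuple

@dataclass(frozen=True)
class Alt:
    items: tuple

@dataclass(frozen=True)
class Rep:
    body: object
    op: str  # "+", "*" or "?"

def show(e, top=True):
    """Render the tree back to the notation used in the slides."""
    if isinstance(e, Sym):
        return e.name
    if isinstance(e, Rep):
        return show(e.body, False) + e.op
    sep = " " if isinstance(e, Seq) else "|"
    text = sep.join(show(i, False) for i in e.items)
    return text if top else "(" + text + ")"

def ins(model, new, pos, context, mode, times):
    """Return a copy of model with `new` inserted next to position pos."""
    added = Rep(new, "+") if times else new

    def combine(target):
        if mode == "choice":
            return Alt((target, added))
        if mode == "sequence-after":
            return Seq((target, added))
        return Seq((added, target))  # sequence-before

    def walk(node):
        if isinstance(node, Sym):
            return combine(node) if node.pos == pos and not context else node
        if isinstance(node, Rep):
            hit = isinstance(node.body, Sym) and node.body.pos == pos
            if hit and context:
                return combine(node)  # insert relative to the whole repetition
            return Rep(walk(node.body), node.op)
        return type(node)(tuple(walk(i) for i in node.items))

    return walk(model)

# Publication : Subject (Year Journal+)*
publication = Seq((
    Sym("Subject", 1),
    Rep(Seq((Sym("Year", 2), Rep(Sym("Journal", 3), "+"))), "*"),
))
updated = ins(publication, Sym("Conference", 5), 3, False, "choice", False)
print(show(updated))
```

With context = false and mode = choice, Conference lands inside the repetition that already wraps Journal, so old documents containing only journal papers remain valid under the new content model.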
Insertion
Publication : Subject (Year Journal+)*
ins(Publication, Conference, 3, false, choice, false)
[Figure: Glushkov graph G, graph without orbits G_w, and G'_w obtained by adding node 5 (Conference) next to node 3]
● H = {{3}, {2,3}, {0,1,2,3,4}}
● H' = {{5,3}, {2,3,5}, {0,1,2,3,4,5}}
● Reducing G'_w: nodes 3 and 5 are merged into 3|5, and the process ends with 0 1 (2 (3|5)+)* 4 and H' = {}
● Result: Subject (Year (Journal|Conference)+)*
Other Primitives (Syntax)
● Making an element optional:
– makeOpt(e,τ)
● Extending the cardinality:
– ExtendCard(e,τ)
● Creating a new content model:
– createCM(e,E)
High Level Primitives
● Expressing complex updates in a more compact way:
– insSubExp(e,β,τ,context,mode,times)
● Making a sub-expression optional:
– makeSubExpOpt(e,β)
● Extending the cardinality:
– extendSubExp(e,β)
Final Considerations<br />
Conclusions
● Glushkov automata (and graphs) allow us to identify the starred sub-expressions of a regular expression.
● The updates are performed in an intuitive way.
● The proposed framework is consistency-preserving:
– Documents that are valid with respect to the old schema are also valid with respect to the new one.
Conclusions
● Evolving an XML schema is still a challenge:
– Revalidation costs
– Access to the documents to be revalidated
– Data loss (when transforming an invalid document into a valid one)
– Transfer costs, if the documents to be revalidated are stored at different sites
Directions
● The proposed primitives are not complete
● Use our approach together with non-conservative primitives in a general framework
● Apply this method to document integration
● Consider other types of schema representation
● Extend this approach to handle facet updates
Thank You!