
Suffix and Prefix Arrays for Gappy Phrase Discovery

Dale Gerdemann
Department of Linguistics, Computational Linguistics Section
D-72074 Tübingen

First Tübingen Workshop on Machine Learning, 2010



Outline

1 Introduction
2 Terminology
3 The Algorithm
4 Toy Corpus Examples
5 Bridge Partners
6 Conclusion



Introduction

Goal: discovery of recurrent patterns in texts.

Examples:

more than
never even
no idea what
of the
. . .

The approach ignores traditional linguistic categories (Det, N, NP, etc.). Such linguistic knowledge can be rediscovered and applied to multi-word expressions rather than to whitespace-delimited "words."



Gappy Phrases

Recurrent patterns often occur with slight variation. Gappy phrase discovery is an approach for finding sequences of two phrases separated from each other by a distance not greater than d.

Examples:

from one X to the other
upside [|-] down
един и същи, една и съща, едно и също, едни и същи (Bulgarian "one and the same," inflected for gender and number)
Bézier curve . . . control point
at the top of [her|his|his shrill little|its] voice
autumn , when the leaves are [|getting] brown
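The definition above can be sketched in a few lines of Python. This is a hypothetical helper for illustration, not the author's implementation: given a tokenized text, a left phrase, a right phrase, and a gap bound d, it collects the token sequences that fill the gap.

```python
def gappy_pairs(tokens, left, right, d):
    """Collect the bridges of a gappy phrase: for each occurrence of
    `left`, take the nearest following occurrence of `right` whose gap
    is at most d tokens, and record the tokens in between."""
    def starts(phrase):
        k = len(phrase)
        return [i for i in range(len(tokens) - k + 1)
                if tokens[i:i + k] == phrase]
    right_starts = starts(right)
    bridges = []
    for i in starts(left):
        gap_start = i + len(left)
        for j in right_starts:
            if gap_start <= j <= gap_start + d:
                bridges.append(tuple(tokens[gap_start:j]))
                break                       # nearest right occurrence only
    return bridges

tokens = "from one side to the other and from one end to the other".split()
print(gappy_pairs(tokens, ["from", "one"], ["to", "the", "other"], d=2))
# → [('side',), ('end',)]
```

The quadratic scan is only for exposition; the later slides replace it with suffix and prefix arrays.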



Bidirectional Examples

Übungsaufgabe [|dem geneigten Leser] überlassen
als [|leichte|triviale|untergeordneter] Übungsaufgabe
Beweis [für unser|habe ich Ihnen] bereits
Beweis [|geschieht] durch
Beweis [|nun schon] zu
den [|geforderten] Beweis
Höhe [|eines Fahnenmastes|eines Hochhauses] bestimmen
Höhe [sowie|wissen und nicht] die Länge
aus [einer bekannten|seiner] Höhe

To group these together, suffix arrays are used for the forward case, and prefix arrays are used for the backward case.



Tokenizing by Character

Many interesting alternations can be found by tokenizing by character (here also converting capitalization into markup):

〈cap〉zum 〈cap〉beispiel könn[|t]en 〈cap〉sie

But many junk alternations are also found:

versch[ieb|l|w]u
  Zeitverschiebung, 1x
  Phasenverschiebung, 1x
  Rotverschiebung, 3x
  verschluckt, 1x
  verschwunden, 1x

Challenge: filter out junk; retain interesting cases.
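A minimal sketch of this preprocessing step, under two assumptions not fixed by the slide: the 〈cap〉 marker is written as the ASCII string <cap>, and runs of whitespace collapse to a single boundary token.

```python
def char_tokenize(text):
    """Character tokenization with capitalization as markup: each
    uppercase letter becomes a <cap> token plus its lowercase form;
    runs of whitespace become a single boundary token."""
    out = []
    for ch in text:
        if ch.isspace():
            if out and out[-1] != " ":
                out.append(" ")
        elif ch.isupper():
            out.extend(["<cap>", ch.lower()])
        else:
            out.append(ch)
    return out

print("".join(char_tokenize("Zum Beispiel")))  # → <cap>zum <cap>beispiel
```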



Motivation

Lexicography and language research
Machine translation
Plagiarism detection, detection of cut-and-paste problems and of other intended repetitions
Corpus construction/consistency checking
Text categorization (see the MA thesis of Maria Tchalakova)
Morphology induction
German compound noun splitting, Chinese segmentation



Terminology

In a pattern such as "from one X to the other", the sequence "from one" is the left part, "to the other" is the right part, and the set of values for X is the bridge. The parameter d specifies the length of the longest sequence in X. Procedurally, the part that is found first is called the initial part.

A phrase is a sequence of tokens which has occurrences in a text. A phrase is saturated, or maximal, if it has at least two occurrences and cannot be extended without losing some occurrences. Suppose, for example, that mumbo and mumbo jumbo each have 5 occurrences, but any extended phrase, such as speak mumbo jumbo, has fewer than 5 occurrences. Then mumbo jumbo is maximal, but mumbo is not.

Maximal phrases are interesting for two reasons:

it is more efficient to use only maximal phrases as the initial part
non-maximal phrases may be linguistically incomplete
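The maximality test can be spelled out directly from the definition. This is a naive sketch for illustration (the invented names occurrences and is_maximal are not from the slides): a phrase is maximal if it occurs at least twice and no one-token extension keeps all of its occurrences.

```python
def occurrences(tokens, phrase):
    """Number of occurrences of `phrase` (a token list) in `tokens`."""
    k = len(phrase)
    return sum(1 for i in range(len(tokens) - k + 1)
               if tokens[i:i + k] == phrase)

def is_maximal(tokens, phrase):
    """Maximal (saturated): at least two occurrences, and every
    one-token extension to the left or right loses occurrences."""
    n = occurrences(tokens, phrase)
    if n < 2:
        return False
    return all(occurrences(tokens, [w] + phrase) < n and
               occurrences(tokens, phrase + [w]) < n
               for w in set(tokens))

toks = "they speak mumbo jumbo and she speaks mumbo jumbo too".split()
print(is_maximal(toks, ["mumbo"]), is_maximal(toks, ["mumbo", "jumbo"]))
# → False True
```

Here mumbo fails because extending it to mumbo jumbo keeps both occurrences, exactly as in the mumbo jumbo example above.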



Combinatorics

A text of length n has a maximum of (n+1 choose 2) = n(n+1)/2 n-grams. But the number of maximal phrases with at least 2 occurrences is no greater than n. Yamamoto and Church (2001) prove this for right maximal phrases; of maximal phrases there are even fewer. A phrase is right maximal if it cannot be extended to the right without losing some of its occurrences.

The set of all right maximal phrases is easy to represent with a suffix tree or a suffix array. Getting just the maximal phrases requires some filtering; Abouelhoda et al. (2004) use a Burrows-Wheeler transform table for this purpose.

Term and document frequencies are calculated for equivalence classes of phrases. Statistical measures for determining "interesting" phrases are based on these frequencies.
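The at-most-n bound can be made concrete with a naive suffix-array sketch (illustrative only; a real implementation would build the array in linear time): every right maximal phrase is the longest common prefix of some adjacent pair of sorted suffixes, and there are only n - 1 such pairs.

```python
def right_maximal_phrases(tokens):
    """Enumerate right maximal repeated phrases via a naive suffix
    array: the longest common prefix of each adjacent pair of sorted
    suffixes is right maximal, and every right maximal phrase shows
    up this way, so there can be at most n of them."""
    n = len(tokens)
    sa = sorted(range(n), key=lambda i: tokens[i:])
    def lcp(i, j):
        k = 0
        while i + k < n and j + k < n and tokens[i + k] == tokens[j + k]:
            k += 1
        return k
    phrases = set()
    for a, b in zip(sa, sa[1:]):
        k = lcp(a, b)
        if k > 0:
            phrases.add(tuple(tokens[a:a + k]))
    return phrases
```

On "a rose is a rose is a rose".split() this yields five phrases, including ('rose',) and ('a', 'rose') but not ('a',), which is always followed by rose and hence not right maximal — comfortably below n = 8.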



Terminology for Gappy Phrases

A gappy phrase can be extended in 4 different ways: by extending the left or the right part, to the left or to the right. It is possible that the left and right parts are individually maximal, and yet the combination l . . . r is not.

Example: accept . . . changes

it is recommended that one accept all the defaults
cut, copy, paste, undo changes and so on
The user then may accept or reject all changes or
the user can accept or reject the changes individually or
in order to turn off the function or accept or reject changes



Nearness Condition

How many occurrences of row . . . your boat are there in row, row, row your boat? The nearness condition states that only the nearest occurrence of "row" counts, so that there is only one occurrence.

It is not clear that the nearness condition is always desirable. For the phrase nail the X into the wall, it would seem reasonable for one instantiation of X to be nail.

The nearness condition is tricky to implement, involving two binary searches. For example, when "row" is found, a binary search is made to find the closest instance of "your boat," and then a second binary search to the left is performed to see if there is a closer instance of "row." For an alternative method, see Apostolico and Satta (2009).
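The two binary searches can be sketched with Python's bisect module. This is a simplification, not the author's code: positions are start indices of the two parts in the token stream, and d bounds the distance between them.

```python
from bisect import bisect_left

def nearness_occurrences(left_pos, right_pos, d):
    """Count gappy occurrences under the nearness condition.
    left_pos / right_pos are sorted token positions of the two parts;
    a left occurrence i pairs with the nearest right occurrence j in
    (i, i + d], unless another left occurrence lies between i and j."""
    count = 0
    for i in left_pos:
        k = bisect_left(right_pos, i + 1)   # nearest right part after i
        if k == len(right_pos) or right_pos[k] > i + d:
            continue                        # no right part close enough
        j = right_pos[k]
        m = bisect_left(left_pos, i + 1)    # nearest left part after i
        if m < len(left_pos) and left_pos[m] < j:
            continue                        # a closer "row" intervenes
        count += 1
    return count

# row, row, row your boat: "row" at 0, 1, 2; "your boat" at 3
print(nearness_occurrences([0, 1, 2], [3], d=3))  # → 1
```

Only the third "row" survives: for the first two, a later "row" sits between them and "your boat," which is exactly the slide's point.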



Nearness Condition Continued

Example where the nearness condition interacts with maximality:

The following step-by-step procedure shows
The following brief step-by-step example
The following is a step-by-step procedure
The following provides a step-by-step procedure

It appears that step is maximal, since it occurs with several right contexts: {procedure, example, -}. But the instances with following . . . procedure or following . . . example fail the nearness condition. The correct maximal phrase for the right part is step-by-step.



Overlap

Assuming that "to the" and "the ground" are both maximal, then here we have overlap:

to the ground

Note that the overlap alternates with non-overlapping instances:

to the one on the ground
to the pail on the ground

Another example:

really like to know
really like to get to know



More Overlap<br />

<strong>Gappy</strong> <strong>Phrase</strong><br />

<strong>Discovery</strong><br />

Gerdemann<br />

Introduction<br />

Terminology<br />

The Algorithm<br />

Toy Corpus<br />

Examples<br />

Bridge<br />

Partners<br />

Conclusion<br />

Commonly involved in overlap:<br />

determiners: the, a<br />

prepositions: of, on<br />

infinitive marker: to<br />

punctuation: comma<br />

More unusual examples:<br />

long click the<br />

long click or short click the<br />

extra credit card<br />

old men ’s gym<br />

The current version deals with overlap in an asymmetric, ad hoc way.<br />

A possibly better approach appears in Apostolico and Satta (2009).


Lcp-Interval Tree for: mining engineering<br />

[Diagram: the lcp-interval tree over the suffix array of “mining engineering”, with labeled lcp intervals e, g, in, ing, n, ng and root ε.]<br />

Red nodes are for binary search through a possibly large<br />

alphabet; black nodes represent longest-common-prefix<br />

intervals (Kim et al., 2008).


The KS Algorithm (Greatly Simplified)<br />

Given a text as a sequence of integers, build an<br />

lcp-interval tree (a triple of arrays) in linear time.<br />

Traverse the interval tree. Upon encountering a (left-maximal)<br />

black node, record the set of positions of phrases in a set p1.<br />

Recursively build an lcp-interval tree for the set of positions<br />

within distance d to the right of a position in p1. (The left<br />

case is similar.)<br />

Traverse the recursively built interval tree. Upon<br />

encountering a black node, record the set of positions of<br />

phrases in a set p2. For each position in p2, find (using<br />

binary search) a corresponding position in p1 that satisfies<br />

the nearness condition. Record these positions in p3. For<br />

each position t ∈ p3, find (binary search again) the<br />

corresponding position s ∈ p2 and record the pair (s, t) in<br />

p4. The set p4 now contains indices of the left and right<br />

parts of candidate gappy phrases.
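The pairing of left and right parts can be pictured with the brute-force sketch below. This is only an illustration of the idea, not the linear-time method of the talk: repeated-phrase detection by n-gram indexing stands in for the lcp-interval tree, and exhaustive counting stands in for the binary searches; the names `repeated_phrases` and `gappy_candidates` are invented here, while `d` is the distance bound from the slides.

```python
from collections import defaultdict

def repeated_phrases(tokens):
    """All token n-grams occurring at least twice, with their start
    positions -- a brute-force stand-in for the black (left-maximal)
    nodes of the lcp-interval tree."""
    pos = defaultdict(list)
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            pos[tuple(tokens[i:j])].append(i)
    return {p: ps for p, ps in pos.items() if len(ps) > 1}

def gappy_candidates(tokens, d):
    """Count (left, right) pairs of repeated phrases where the right
    part starts within distance d after the left part ends (gap of at
    least one token); pairs seen twice or more are candidates."""
    phrases = repeated_phrases(tokens)
    counts = defaultdict(int)
    for left, lpos in phrases.items():
        for right, rpos in phrases.items():
            n = sum(1 for i in lpos for j in rpos
                    if i + len(left) < j <= i + len(left) + d)
            if n >= 2:
                counts[(left, right)] = n
    return counts
```

On the overlap example from earlier, “to the … the ground” with gaps “one on” and “pail on” comes out as a candidate with count 2.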


The KS Algorithm (Greatly Simplified)<br />

Filter out non-maximal gappy phrases.<br />

Record term frequencies and document frequencies for the<br />

gappy phrase and also for the left and right parts as<br />

standalone phrases. These frequencies could be used to<br />

filter out cases that occur by chance according to some<br />

probability model.<br />

Note: Finding tf and df for the right part involves binary<br />

search through the interval tree, so it is important to use<br />

a data structure that is adapted to a large alphabet (Kim<br />

et al., 2008).
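What is being recorded here can be stated in a few lines of brute-force code (the suffix-array way of computing these counts is the subject of Yamamoto and Church (2001) in the reading list; the function name below is illustrative):

```python
def tf_df(docs, phrase):
    """Term frequency (total occurrence count) and document frequency
    (number of documents with at least one occurrence) of a phrase,
    where docs and phrase are token lists."""
    n, tf, df = len(phrase), 0, 0
    for doc in docs:
        hits = sum(1 for i in range(len(doc) - n + 1)
                   if doc[i:i + n] == phrase)
        tf += hits
        df += 1 if hits else 0
    return tf, df
```

With character tokens from the toy corpus, the phrase se occurs in “horse” and “house”, giving tf = 2 over those two documents and df = 2.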


A Toy Corpus<br />

A corpus of 4 texts, with a sentinel at the end of each:<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

A regular expression for tokenizing:<br />

\$\s |<br />

\#\s |<br />

\%\s |<br />

\&\s |<br />

[\w\W]<br />

Other characters, such as punctuation and paragraph<br />

boundaries, can also be treated like sentinels.
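The tokenizing regex above can be applied directly with Python's `re` module: a sentinel plus its following whitespace becomes a single two-character token, and every other character is a token of its own.

```python
import re

# Alternatives are tried left to right, so "$ ", "# ", "% ", "& " win
# over the catch-all [\w\W], which matches any single character.
TOKEN = re.compile(r"\$\s|\#\s|\%\s|\&\s|[\w\W]")

corpus = ("tortoise that supports the earth$ "
          "put the cart before the horse# "
          "a comedy to those who think% "
          "eat you out of house and home&")

tokens = TOKEN.findall(corpus)
```

Since each of the three interior sentinels absorbs its following space into one token, the token list is exactly three shorter than the character count of the corpus.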


Initial Phrase Discovery<br />

Non-gappy phrase discovery finds the “phrase” se:<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

Note that se is maximal; in fact, se is supermaximal.<br />

A phrase is supermaximal if it is maximal and not a proper<br />

substring of another maximal phrase.
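Maximality in this sense can be tested directly: a repeated phrase is maximal when its occurrences cannot all be extended by the same token on the left or on the right. A naive sketch (the function name is illustrative), using character tokens:

```python
def is_maximal(tokens, phrase):
    """A repeated phrase is maximal when its occurrences have at least
    two distinct left contexts AND two distinct right contexts, so it
    cannot be extended in either direction across all occurrences."""
    n = len(phrase)
    occ = [i for i in range(len(tokens) - n + 1)
           if tokens[i:i + n] == phrase]
    if len(occ) < 2:
        return False
    lefts = {tokens[i - 1] if i > 0 else None for i in occ}
    rights = {tokens[i + n] if i + n < len(tokens) else None for i in occ}
    return len(lefts) > 1 and len(rights) > 1
```

In "abcxabcy", the repeat "abc" is maximal, but "ab" is not, since every occurrence is followed by "c".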


leftward Search Space<br />

Select prefix end points within distance d of the initial phrase.<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

These prefixes are represented by the following numbers, and<br />

the numbers are sorted rather than the prefixes themselves.<br />

tor #3<br />

tort #4<br />

torto #5<br />

tortoi #6<br />

. . . the #57<br />

etc
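The idea of sorting the numbers rather than the prefix strings can be pictured as follows: each end point is ordered by the reversed text preceding it, so occurrences that share phrase-final material land in adjacent slots, mirroring a suffix array. The function name and interface here are illustrative, not from the talk.

```python
def prefix_array(text, endpoints):
    """Sort prefix end positions by the reversed text preceding them.
    This is the mirror image of a suffix array: prefixes with a common
    ending (shared phrase-final material) become neighbours."""
    return sorted(endpoints, key=lambda e: text[:e][::-1])
```

For example, the two prefixes of "to the one on the" ending after each "the" sort next to each other because both reversals start with "eht ".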


leftward Search Space<br />

We discover that se is preceded 3 times by ho, and 2 times by<br />

the longer phrase ho,<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

Note: If any phrase were found from rtoi, tho, and hou, then<br />

this phrase would have to be rejected. Such a phrase would be<br />

rediscovered when searching leftward from se.


Overlapping leftward Search Space<br />

What happens when the leftward search spaces overlap?<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

If d = 4, we get the following search space:<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

The phrase t is found twice within this search space. These 2<br />

occurrences need to be matched with corresponding right parts.


Too close to the edge<br />

Should we “discover” that ea combines with<br />

tortoise that supports the earth$<br />

put the cart before the horse#<br />

a comedy to those who think%<br />

eat you out of house and home&<br />

th to the left?<br />

To rule this out, we need to say that “%” and other sentinels<br />

are illegal as part of a bridge alternative.<br />

Additional illegal bridge elements can be specified with a<br />

regular expression.
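A minimal sketch of such a filter, assuming the names and the exact pattern here (they are illustrative, not from the talk): a candidate bridge is rejected if any of its tokens is a sentinel or matches a user-supplied pattern of additional illegal bridge elements.

```python
import re

# Sentinels are always illegal inside a bridge; the pattern could be
# extended with further user-specified illegal elements.
ILLEGAL = re.compile(r"[$#%&]")

def bridge_ok(bridge_tokens, illegal=ILLEGAL):
    """True iff no token of the bridge matches the illegal pattern."""
    return not any(illegal.search(tok) for tok in bridge_tokens)
```

So a bridge spanning the “%” between “think” and “eat” in the toy corpus is ruled out, while a bridge inside a single text passes.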


Reversing the Problem<br />

It is trivial to adapt the algorithm to find pairs of phrases that<br />

co-occur in the bridge.<br />

have a [concussion|favorite word|life|. . . |thesis] or<br />

life <strong>and</strong> thesis occur<br />

in same bridge<br />

Definition: two phrases occurring as different alternatives in a<br />

bridge are called bridge partners.<br />

than the [life|thesis] of a<br />

life <strong>and</strong> thesis occur<br />

in same bridge again<br />

It can’t be a coincidence, can it?
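Collecting bridge partners from already-discovered gappy phrases reduces to a grouping step; a minimal sketch (the input representation and names are assumptions made for illustration):

```python
from collections import defaultdict

def bridge_partners(contexts):
    """contexts: (left, filler, right) triples.  Fillers occurring as
    different alternatives inside the same bridge (same left and right
    parts) are partners; returns how many bridges each pair shares."""
    fillers_by_bridge = defaultdict(set)
    for left, filler, right in contexts:
        fillers_by_bridge[(left, right)].add(filler)
    pair_counts = defaultdict(int)
    for fillers in fillers_by_bridge.values():
        for a in fillers:
            for b in fillers:
                if a < b:
                    pair_counts[(a, b)] += 1
    return pair_counts
```

On the example above, life and thesis share the bridges “have a ~ or” and “than the ~ of a”, so the pair gets count 2.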


Random Sample Output from 6000 lines<br />

Science humor collected by Joachim Verhagen<br />

(asked|trying)/[(was~to explain), (who was~to)]<br />

(making|the)/[(,~light of the), (of~use of)]<br />

(had to|have)/[(I~also), (I~come up with)]<br />

(matter|the problem)/[(that~cannot be), (the nature of<br />

(don’t|really don’t)/[(I~care), (I~know .), (I~need to<br />

(made|told)/[(I~it to), (were~to use)]<br />

(obvious|possible)/[(, it is~that), (it is~. "), (some<br />

(a specific|the)/[(do~job), (with~focus)]<br />

(answer|one)/[(be the~to), (the correct~.)]<br />

(building|bulb)/[(hold the~and), (on the~at the top)]<br />

(for|to get)/[(enough~all), (enough~it to)]<br />

(You|you)/[("~need to), (" Do~believe in), ((~could),<br />

(first|top)/[(His~-), (of the~ten)]<br />

(statement|story)/[(a true~.), (true~:)]<br />

(certain|sure)/[(I can’t be~.), (make~that), (never~wh


Big, Little, Large, Small<br />

See: Lynne Murphy, Semantic Relations and the Lexicon<br />

(big|great)/[(a~deal), (is a~place)]<br />

(bigger|equal)/[(are~, choose), (may be~, or)]<br />

(big|real)/[(Coulomb got a~charge out of the), (the~deal), (you could ge<br />

(big|large)/[(a~ball of fire), (a~charge), (a~lump)]<br />

(bigger|greater)/[(have a~chance of), (times~than)]<br />

(big|good)/[(After a~meal), (a~deal), (a~f * *)]<br />

(little|lot)/[(a~faster), (be a~more)]<br />

(little|lot of)/[(a~fun .), (a~heat)]<br />

(bit of|little)/[(a~fun .), (a~help from)]<br />

(a little|the)/[(. What~acorn), (for~known), (lecture ,~old)]<br />

(large chunk|piece)/[(a~of potassium), (such a~of)]<br />

(a large|the)/[(at~university , the), (in~empty), (to~engineering), (to~<br />

(large|real)/[(a~beaker), (a~charge)]<br />

(large|small)/[(a~South), (a~beaker), (a~fence), (a~group of), (a~order)<br />

(late|small)/[(too~)), (too~<strong>for</strong> it)]<br />

(short|small)/[(a~step), (so~that you)]<br />

(a small|the)/[(of~South), (puts~fence around)]<br />

(problems|small children)/[(had~with him), (he had~with)]


Jason Riggle’s use of the word “set”<br />

(generator|set)/[(candidate~and), (candidate~with a), (on the candidate~<br />

(number of|set of)/[(in the~nodes), (the~ERCs that), (the~actual), (the~<br />

(machine|set of constraints)/[(a~, it is), (the entire~.), (the intersec<br />

(pair|set)/[(, another~of), (a~of nodes), (a new~of nodes at), (each~of<br />

(ranked set|set)/[(a~of constraints), (takes a~of), (with a~of)]<br />

(constraints|set)/[(how the~of), (the~corresponds to), (the~that make up<br />

(set of|sets of)/[(finite~candidates), (ranked~constraints), (the~cost v<br />

(infinite set|set)/[(the~of candidates), (the~of input), (the~of paths),<br />

(constraints|set of constraints)/[(by the~.), (intersected~.), (ranked~.<br />

(cost|set)/[(, once the~of), (and finds the~of), (find the~of)]<br />

(most harmonic|set of)/[(the~final states), (the~paths through M), (the~<br />

(number|set)/[(exactly the~of), (get the same~of violations), (in the~of<br />

(set|sets)/[(finite~. If), (finite~of candidates), (finite candidate~.),<br />

(set|string)/[(of a~of), (the empty~and), (with the empty~as)]<br />

(machine|set H)/[(added to the~.), (in the~. The), (of the nodes in the~<br />

(same set|set)/[(exactly the~of), (the~of contenders), (the~of nodes), (<br />

(cascade|set)/[(a~of finite), (of a~of), (with a~of)]<br />

(range|set)/[(generates an infinite~of), (in a~of), (of the infinite~of<br />

(optimal|set of)/[(in the~candidates), (the~candidates .), (the~paths th<br />

(set of|single)/[(a~finite state), (a~optimal), (the~cheapest)]<br />

(bound|set)/[(particular~of), (to a~of), (violation~. The)]<br />

(representation|set)/[(, but the~of), (a finite~of), (the size of the~of


Applications of Bridge Partners<br />

Could bridge partners be used as features for machine<br />

learning algorithms?<br />

Could bridge partners be useful for named entity<br />

recognition? A capitalized word with 3rd, sing, masc<br />

pronoun as bridge partner is likely to be the name of a<br />

male person.<br />

(Стефан|Той)/[(. ~ се качи в), (Макс . ~ се), (удоволствие . ~ беше)]<br />

Could bridge partners be useful for learning morphology?<br />

Note, for example, that corresponding singular and plural<br />

forms tend to occur as partners:<br />

(grammar|grammars)/[(. Gerdemann and Van Noord call the resulting˜for<br />

such cases), (the˜defined by), (the˜won’t)]<br />

(model|models)/[(Gerdemann and Van Noord’s˜is), (computational˜of<br />

Optimality Theory), (finite state˜of OT)]


Applications of Bridge Partners (Continued)<br />

Could educational programs use bridge partners as indicators<br />

that a learner relates the concepts? Consider the pair machine<br />

and set of constraints:<br />

(machine|set of constraints)/[(a˜, it is), (the entire˜.), (the intersected˜.)]<br />

Does this mean that Jason’s concept of a machine is similar to<br />

his concept of a set of constraints?<br />

Alternatives:<br />

LSA Use a term-document matrix with Latent<br />

Semantic Analysis for dimensionality reduction.<br />

Disadvantage: One needs a large number of texts<br />

from the learner.<br />

Proximity A simple indicator of relatedness is mention of<br />

both concepts within a given radius. Ex: Bézier<br />

curve <strong>and</strong> control point in the OpenOffice user<br />

manual.
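The proximity indicator can be stated in a few lines (the function name and default radius are assumptions for illustration):

```python
def related_by_proximity(tokens, a, b, radius=20):
    """Crude relatedness indicator: do terms a and b ever occur
    within `radius` tokens of each other?"""
    pa = [i for i, t in enumerate(tokens) if t == a]
    pb = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= radius for i in pa for j in pb)
```

Unlike LSA, this needs no matrix decomposition and works on a single short text, at the cost of being a much blunter instrument.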


Conclusions<br />

Extraction of gapped phrases is feasible using suffix/prefix<br />

arrays.<br />

Linear time?<br />

Gapped phrases should be maximal/saturated.<br />

There are various kinds of gapped phrases.<br />

Gap fillers (bridge partners) are also interesting.<br />

Work is needed primarily on filtering out uninteresting<br />

cases.


For Further Reading I<br />

Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno<br />

Ohlebusch. Replacing suffix trees with enhanced suffix<br />

arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.<br />

Alberto Apostolico and Giorgio Satta. Discovering subword<br />

associations in strings in time linear in the output size.<br />

Journal of Discrete Algorithms, 7(2):227–238, 2009.<br />

Dong Kyue Kim, Minhwan Kim, and Heejin Park. Linearized<br />

suffix tree: an efficient index data structure with the<br />

capabilities of suffix trees and suffix arrays. Algorithmica,<br />

52(3):350–377, 2008.<br />

Mikio Yamamoto and Kenneth W. Church. Using suffix arrays<br />

to compute term frequency and document frequency for all<br />

substrings in a corpus. Computational Linguistics, 27(1):1–30, 2001.
