Caravan: Sequential Pattern Mining in OCaml - The Lack Thereof
Caravan: Sequential Pattern Mining in OCaml<br />
Brock Wilcox<br />
wilcox6@uiuc.edu<br />
ABSTRACT<br />
When attempting to expand upon existing Sequential Pattern<br />
Mining software, namely the PrefixSpan, CloSpan, and<br />
CISpan series of algorithms, I found that the source code for<br />
each didn’t readily lend itself to analysis and improvement.<br />
Each of these algorithms builds upon the previous, and there<br />
are experiments in between each that remain in the code.<br />
Ideally, algorithms like these would be implemented in a<br />
way that allows individual improvements to be added, while<br />
keeping the overlapping components shared. Additionally, I<br />
found that none of these algorithms have been implemented<br />
in OCaml, which I prefer for implementing software that<br />
manipulates complex data structures.<br />
In this paper I present Caravan, the foundation for an OCaml-based<br />
sequential pattern mining framework, which aims to<br />
avoid the shortcomings of existing implementations and provide<br />
the first sequential pattern mining implementation written<br />
in OCaml.<br />
1. INTRODUCTION<br />
While attempting to make some changes to the CISpan [9]<br />
algorithm, I found that the current code was too integrated<br />
with the author’s CP-Miner [5] experiment to be useful outside<br />
of that context. Similarly, I began work on the CloSpan<br />
algorithm on which CISpan is based, and found that the<br />
code was filled with experiments and alternative implementations,<br />
making it difficult to work with. Additionally, the<br />
code uses pointer arithmetic extensively, making it very difficult<br />
to verify the correctness of the implementation. Based<br />
on this experience, I created Caravan as a way to explore the<br />
systematic overlap of the components of common sequential<br />
pattern mining algorithms.<br />
One of the basic sequential pattern mining algorithms is<br />
PrefixSpan [6], which constructs a prefix tree of discovered<br />
sequential patterns along with projected databases of postfix<br />
sequences to explore. Building on top of that, both algorithmically<br />
and implementation-wise, is CloSpan [8]. CloSpan<br />
adds several pruning techniques to narrow the search space.<br />
The two primary additions are (1) efficiently detecting identical<br />
projected databases and then only constructing and<br />
storing them a single time (using a prefix lattice instead of<br />
a prefix tree), and (2) avoiding searching through projected<br />
databases of non-closed patterns when the closed version can<br />
be detected. These additions improve both the time and<br />
space efficiency of PrefixSpan by an order of magnitude, allowing<br />
CloSpan to explore longer sequential patterns than<br />
is possible with PrefixSpan.<br />
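The prefix-projection idea behind PrefixSpan can be sketched in a few lines of OCaml.<br />
This is a minimal illustration rather than Caravan's code: it assumes every itemset holds<br />
a single item, so a sequence is just a list of items, and the names project, frequent_items,<br />
grow and prefixspan are hypothetical.<br />

```ocaml
(* Simplifying assumption: a sequence is a plain list of items
   (one item per itemset), not a list of itemsets. *)

(* Postfix of a sequence after the first occurrence of [item], if any. *)
let rec project item = function
  | [] -> None
  | x :: rest -> if x = item then Some rest else project item rest

(* Items contained in at least [min_sup] sequences, with their support,
   sorted by item. Each sequence counts an item at most once. *)
let frequent_items min_sup db =
  let counts = Hashtbl.create 16 in
  let bump item =
    let c = try Hashtbl.find counts item with Not_found -> 0 in
    Hashtbl.replace counts item (c + 1)
  in
  List.iter (fun s -> List.iter bump (List.sort_uniq compare s)) db;
  Hashtbl.fold
    (fun item c acc -> if c >= min_sup then (item, c) :: acc else acc)
    counts []
  |> List.sort compare

(* Extend [prefix] (kept reversed) with every frequent item of its
   projected database, then recurse into the new projected database. *)
let rec grow min_sup prefix db =
  frequent_items min_sup db
  |> List.map (fun (item, support) ->
         let prefix' = item :: prefix in
         let projected = List.filter_map (project item) db in
         (List.rev prefix', support) :: grow min_sup prefix' projected)
  |> List.concat

(* All frequent sequential patterns of [db] with their support. *)
let prefixspan min_sup db = grow min_sup [] db
```

For example, prefixspan 2 [[1; 2; 3]; [1; 3]; [2; 3]] yields the patterns [1], [1; 3], [2], [2; 3] and [3].<br />
The sketch copies each postfix; the pseudo-projection described in Section 3.2 avoids that copy.<br />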
Finally, CISpan [9] augments CloSpan with the ability to<br />
handle a new incremental version of the database. CISpan<br />
does this by only dealing with inserts and deletes, constructing<br />
a second lattice of just these database entries and then<br />
merging the original lattice with the incremental lattice.<br />
Each of these three algorithms resides in the same codebase,<br />
along with several smaller experiments on alternate techniques.<br />
They are structured as #ifdef blocks in the C++ source,<br />
with each algorithm overlaid directly on top of the previous<br />
one. Unfortunately, this makes following the logic of the algorithm<br />
in the code very difficult, and makes adding to the<br />
algorithms even more so. With this project, Caravan, I hope<br />
to explore the possibility of utilizing shared data structures<br />
instead of reimplementing them or mangling them for each<br />
new algorithm.<br />
2. RELATED WORK<br />
A significant inspiration for this framework is SPMF [3], a<br />
Java-based collection of data mining algorithms with similar<br />
origins. SPMF consists of a collection of algorithms,<br />
which overlap only to a limited extent. For example, there<br />
are several nearly identical implementations of the Item and<br />
Itemset classes. However, the way that classes are structured<br />
and how they relate to each other is fairly consistent<br />
across the codebase.<br />
Another collection of algorithms can be found in IlliMine<br />
[4], which extends well beyond frequent pattern mining. It<br />
is, however, just a packaging of individual implementations.<br />
None of the algorithms share source code (some don’t even<br />
have source code), meaning none of them can benefit from<br />
improvements in shared libraries.<br />
Though both of these collections are partially open source,<br />
neither provides a convenient way for community contributions<br />
or expansion, beyond submitting patches directly to<br />
the maintainers.<br />
Figure 1: Basic shared structures.<br />
Figure 2: Performance of Caravan vs SPMF<br />
3. ARCHITECTURE<br />
3.1 OCaml<br />
I’ve utilized the OCaml programming language and libraries<br />
for this framework. OCaml is a functional language in the<br />
ML programming language family. Besides my personal<br />
preference for the language, it has many beneficial characteristics<br />
when it comes to manipulating complex data structures<br />
such as the ones I use in data mining. During development<br />
of this framework, I found that the strong type<br />
system helped quickly identify issues in the code that might<br />
otherwise have taken much longer to discover. OCaml is also<br />
known for its speed; for example, since all types are known<br />
at compile time, type information can be erased and does not need<br />
to be checked at run time.<br />
One possible drawback is that OCaml uses automated memory<br />
management (garbage collection). This adds some overhead<br />
to data structures so that they can be organized in<br />
memory, decreasing the theoretical limit of how much raw<br />
data I can hold in memory. It will be interesting to see how<br />
well data mining algorithms can scale, as their use tends to<br />
push the edges of hardware and software capabilities.<br />
3.2 Basic Shared Structures<br />
The goal of the architecture is to promote reuse of code<br />
between algorithms, while still allowing for extensibility.<br />
Here I use functors (parameterized, polymorphic modules)<br />
and objects to create the individual data structures necessary<br />
for mining sequential patterns. The individual functors<br />
can be composed to build the final data structures needed by<br />
individual algorithms.<br />
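As a rough sketch of what composing functors can look like, the following builds an<br />
itemset module and a sequence module from any item type. The names ITEM, MakeItemset<br />
and MakeSequence are assumptions made for illustration and are not taken from Caravan;<br />
only Set.Make is part of the OCaml standard library.<br />

```ocaml
(* Hypothetical signature for anything usable as an item. *)
module type ITEM = sig
  type t
  val compare : t -> t -> int
  val to_string : t -> string
end

(* An itemset is a set of items; Set.Make is the standard-library set functor. *)
module MakeItemset (I : ITEM) = struct
  include Set.Make (I)
  let to_string s =
    "(" ^ String.concat " " (List.map I.to_string (elements s)) ^ ")"
end

(* A sequence is an ordered list of itemsets, built on the itemset functor. *)
module MakeSequence (I : ITEM) = struct
  module Itemset = MakeItemset (I)
  type t = Itemset.t list
  let of_lists lists = List.map Itemset.of_list lists
  let to_string seq = String.concat " " (List.map Itemset.to_string seq)
end

module IntItem = struct
  type t = int
  let compare = compare
  let to_string = string_of_int
end

module StringItem = struct
  type t = string
  let compare = compare
  let to_string s = s
end

(* Same sequence code, two different item types. *)
module IntSequence = MakeSequence (IntItem)
module StringSequence = MakeSequence (StringItem)
```

With these definitions, IntSequence.to_string (IntSequence.of_lists [[2; 1]; [3]]) gives "(1 2) (3)",<br />
and switching to string items only requires applying the same functor to StringItem.<br />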
In Figure 1 I’ve identified the core structures which are<br />
shared by a variety of sequential pattern mining algorithms.<br />
Most of this structure comes directly from the problem space<br />
itself, which consists of finding patterns in an ordered list of<br />
itemsets. These data structures are based on the PrefixSpan,<br />
CloSpan, and CISpan algorithms. As more algorithms<br />
are added, these shared structures will expand. For example,<br />
I anticipate adding a Lattice structure to accommodate<br />
CloSpan and CISpan, even though it isn’t required for PrefixSpan.<br />
Each of these structures is encapsulated in a way that allows<br />
changes without altering their indirect dependencies. It<br />
would be trivial, for example, to allow string-based items instead<br />
of the integers that are used by default. More importantly,<br />
each component includes the logic for manipulating<br />
its own internal state. This allows for future improvements,<br />
such as parallel processing, that are transparent to the larger algorithm<br />
that uses the components.<br />
Currently I’m using classes (objects) for each of these components,<br />
though I’ve done some experiments with using functors<br />
(parameterized polymorphic modules) instead, which<br />
might allow for more complex composition of the components.<br />
As is, each component is at least to some degree polymorphic.<br />
The Database, for example, can hold either normal<br />
sequences or pseudo-sequences. Pseudo-sequences refer to a<br />
normal sequence, but with a specific itemset and item offset,<br />
allowing projected databases to take significantly less space<br />
(as per the original PrefixSpan paper). The database component<br />
will not, however, allow you to mix pseudo-sequences<br />
with normal sequences, which I believe is a beneficial trait.<br />
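A pseudo-sequence of this kind can be pictured as a small record that points at the shared<br />
sequence and stores the two offsets. The sketch below is an assumption about the shape of<br />
such a structure, not Caravan's definition; the field and function names are hypothetical,<br />
and items are plain integers for brevity.<br />

```ocaml
(* Illustration only: a sequence is an array of itemsets,
   and an itemset is a list of integer items. *)
type sequence = int list array

(* A pseudo-sequence shares the underlying sequence and only
   records where its postfix begins. *)
type pseudo = {
  base : sequence;    (* shared with the original database, never copied *)
  itemset_off : int;  (* index of the first itemset of the postfix *)
  item_off : int;     (* items skipped inside that first itemset *)
}

(* Drop the first [n] elements of a list. *)
let rec drop n l =
  if n <= 0 then l
  else match l with [] -> [] | _ :: t -> drop (n - 1) t

(* Number of itemsets visible through the pseudo-sequence. *)
let length p = Array.length p.base - p.itemset_off

(* Itemset [i] of the postfix; only the first one is cut by [item_off]. *)
let itemset p i =
  let s = p.base.(p.itemset_off + i) in
  if i = 0 then drop p.item_off s else s

(* Materialize the postfix, e.g. for printing or testing. *)
let to_lists p = List.init (length p) (itemset p)
```

Projecting a sequence then costs one small record instead of a copy of the remaining itemsets.<br />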
4. PREFIXSPAN IMPLEMENTATION<br />
The implementation of PrefixSpan started as a direct translation<br />
of the implementation in SPMF from Java to OCaml.<br />
I was, however, able to eliminate several classes by making<br />
existing classes polymorphic. The best example of this is<br />
the use of the Database module for holding different types<br />
of sequences. At various points in the program it holds (1)<br />
the original sequences, (2) the projected pseudo-sequences,<br />
and (3) the final patterns.<br />
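One way to get a single container for all three uses is a class with a type parameter.<br />
The following is a minimal sketch under that assumption; the class name database and its<br />
methods are hypothetical and do not reproduce Caravan's Database.<br />

```ocaml
(* Hypothetical database class, parameterized over what it stores. *)
class ['seq] database = object
  val mutable rows : 'seq list = []
  method add (s : 'seq) = rows <- s :: rows
  method size = List.length rows
  method to_list = List.rev rows  (* rows in insertion order *)
end

(* Two illustrative kinds of content. *)
type sequence = int list list
type pseudo = { base : sequence; itemset_off : int; item_off : int }

let originals : sequence database = new database
let projected : pseudo database = new database

let () =
  originals#add [[1; 2]; [3]];
  projected#add { base = [[1; 2]; [3]]; itemset_off = 1; item_off = 0 }
  (* Adding a pseudo value to [originals] would be rejected by the
     type checker, so the two kinds cannot be mixed in one database. *)
```

The type parameter is fixed per instance, which is what rules out mixing normal sequences and pseudo-sequences.<br />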
5. PERFORMANCE<br />
I compare Caravan’s PrefixSpan performance to the PrefixSpan<br />
implementation in the SPMF framework. To do this, I<br />
created 9 input files using the seq data generator from the<br />
IlliMine distribution. All parameters were held constant except<br />
for ’ncust’, the number of customers. The benchmark<br />
was run several times, all with very similar results.<br />
In Figure 2 you can see that there is a significant difference<br />
between the two implementations, even though they<br />
have a relatively similar structure and follow the same basic<br />
algorithm. In this case the hypothesis that OCaml would<br />
be faster, since it is a compiled language and otherwise has<br />
memory management similar to Java’s, is disproved.<br />
I believe that there is a memory leak in Caravan, causing it<br />
to slow down in general. Unfortunately none of the memory<br />
profiling techniques I tried provided accurate results, but I’ll<br />
continue to investigate. It is also possible that the algorithm<br />
is somehow implemented incorrectly, though this can only<br />
be partially true since the actual discovered patterns are<br />
identical between Caravan and SPMF.<br />
While it is disappointing that Caravan didn’t match, if not<br />
surpass, the performance of SPMF, it is an interesting illustration<br />
of how competing implementations of an identical<br />
algorithm can have significant performance differences.<br />
6. FUTURE WORK<br />
My hope is that this effort can lay the groundwork for implementing<br />
many other sequential pattern mining algorithms in<br />
OCaml. In the immediate future I plan to add<br />
CloSpan, CISpan, and BIDE [7], with a careful eye toward<br />
reusing existing data structures and algorithms as much as<br />
possible. One important aspect of this which the current<br />
codebase lacks is standardized logging and benchmarking. I<br />
found that logging the steps of the implementation greatly<br />
aided in determining its correctness, and having a common<br />
approach would be helpful.<br />
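A shared timing and logging helper could be as small as the sketch below. It is only an<br />
assumption about what such a common approach might look like; the names time and timed<br />
are hypothetical, and Sys.time measures processor time rather than wall-clock time.<br />

```ocaml
(* Run [f x] and return its result with the CPU seconds it took. *)
let time f x =
  let start = Sys.time () in
  let result = f x in
  (result, Sys.time () -. start)

(* Hypothetical logging wrapper: report a labelled timing on stderr
   and pass the result through unchanged. *)
let timed label f x =
  let result, seconds = time f x in
  Printf.eprintf "[%s] %.3fs\n%!" label seconds;
  result
```

Any mining step could then be wrapped, for instance timed "project" (List.map succ) [1; 2; 3], without changing its result.<br />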
Secondly, I hope to leverage the common parts of various algorithms<br />
to improve efficiency. Keeping the same interface<br />
but changing the internal design should make it possible<br />
to speed up and scale individual components without altering<br />
higher level algorithms. The main example of this is<br />
allowing transparent disk-based projected databases. The<br />
extreme example would be allowing for parallel processing<br />
at a low level. Existing work on parallel mining<br />
of sequential patterns includes [1], and I should examine the<br />
possibility of using JoCaml [2] to manage parallel execution.<br />
Ultimately, I’d like to see OCaml Caravan expanded from<br />
being a Sequential Pattern Mining Framework to be a more<br />
general Data Mining and Machine Learning framework. I’d<br />
also like to have other parties contribute to the system. In<br />
order to promote improvements from such sources, I’ve created<br />
internal documentation on the structure of the system<br />
and put the code repository up on my website, along with a<br />
mirror on github. See http://thelackthereof.org/Caravan.<br />
8. ACKNOWLEDGMENT<br />
I especially appreciate being able to learn from existing implementations.<br />
Thanks to Ding Yuan for providing the CISpan<br />
source, the IlliMine project for their codebase which includes<br />
CloSpan, and Philippe Fournier-Viger for providing<br />
source code to SPMF.<br />
9. REFERENCES<br />
[1] S. Cong, J. Han, and D. Padua. Parallel mining of<br />
closed sequential patterns. In Proceedings of the<br />
eleventh ACM SIGKDD international conference on<br />
Knowledge discovery in data mining, page 567, 2005.<br />
[2] C. Fournet, F. L. Fessant, L. Maranget, and<br />
A. Schmitt. JoCaml: a language for concurrent<br />
distributed and mobile programming. Advanced<br />
Functional Programming, page 1948–1948, 2003.<br />
[3] P. Fournier-Viger, R. Nkambou, and E. Nguifo. A<br />
knowledge discovery framework for learning task<br />
models from user interactions in intelligent tutoring<br />
systems. MICAI 2008: Advances in Artificial<br />
Intelligence, page 765–778, 2008.<br />
[4] J. Han. Data Mining Group. IlliMine project.<br />
University of Illinois Urbana-Champaign Database and<br />
Information Systems Laboratory. 2005.<br />
[5] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: a<br />
tool for finding copy-paste and related bugs in<br />
operating system code. 2006.<br />
[6] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen,<br />
U. Dayal, and M. C. Hsu. PrefixSpan: mining<br />
sequential patterns efficiently by prefix-projected<br />
pattern growth. In icccn, page 0215, 2001.<br />
[7] J. Wang and J. Han. BIDE: efficient mining of frequent<br />
closed sequences. In Proceedings of the 20th<br />
International Conference on Data Engineering,<br />
page 79, 2004.<br />
[8] X. Yan, J. Han, and R. Afshar. CloSpan: mining closed<br />
sequential patterns in large datasets. In Proc. of SIAM<br />
Int. Conf. on Data Mining, 2003.<br />
[9] D. Yuan, K. Lee, H. Cheng, G. Krishna, Z. Li, X. Ma,<br />
Y. Zhou, and J. Han. CISpan: comprehensive<br />
incremental mining algorithms of closed sequential<br />
patterns for Multi-Versional software mining. 2008.<br />
7. CONCLUSION<br />
I have presented a new framework, Caravan, for implementing<br />
sequential pattern mining algorithms, and eventually<br />
data mining algorithms in general, written in OCaml. I implemented<br />
PrefixSpan as a demonstration of the framework,<br />
showing that the core architecture is sound. However, when<br />
comparing the performance of PrefixSpan implemented on<br />
Caravan to the same algorithm implemented on SPMF (a<br />
Java-based implementation), I found Caravan to be significantly<br />
slower.