17.11.2014 Views


Caravan: Sequential Pattern Mining in OCaml

Brock Wilcox

wilcox6@uiuc.edu

ABSTRACT

When attempting to expand upon existing Sequential Pattern Mining software, namely the PrefixSpan, CloSpan, and CISpan series of algorithms, I found that the source code for each didn’t readily lend itself to analysis and improvement. Each of these algorithms builds upon the previous one, and the experiments made between them remain in the code. Ideally, algorithms like these would be implemented in a way that allows individual improvements to be added while keeping the overlapping components shared. Additionally, I found that none of these algorithms has been implemented in OCaml, which I prefer for implementing software that manipulates complex data structures.

In this paper I present Caravan, the foundation for an OCaml-based sequential pattern mining framework, which aims to avoid the shortcomings of existing implementations and provide the first sequential pattern mining implementation written in OCaml.

1. INTRODUCTION

While attempting to make some changes to the CISpan [9] algorithm, I found that the current code was too integrated with the author’s CP-Miner [5] experiment to be useful outside of that context. Similarly, I began work on the CloSpan algorithm on which CISpan is based, and found that the code was filled with experiments and alternative implementations, making it difficult to work with. Additionally, the code uses pointer arithmetic extensively, making it very difficult to verify the correctness of the implementation. Based on this experience, I created Caravan as a way to explore the systematic overlap of the components of common sequential pattern mining algorithms.

One of the basic sequential pattern mining algorithms is PrefixSpan [6], which constructs a prefix tree of discovered sequential patterns along with projected databases of postfix sequences to explore. Building on top of that, both algorithmically and implementation-wise, is CloSpan [8]. CloSpan adds several pruning techniques to narrow the search space. The two primary additions are (1) efficiently detecting identical projected databases and then constructing and storing them only once (using a prefix lattice instead of a prefix tree), and (2) avoiding searches through projected databases of non-closed patterns when the closed version can be detected. These additions improve both the time and space efficiency of PrefixSpan by an order of magnitude, allowing CloSpan to explore longer sequential patterns than is possible with PrefixSpan.
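To make the pattern-growth idea concrete, the following is a minimal OCaml sketch of the PrefixSpan recursion. It is not Caravan’s code. It assumes that every element of a sequence is a single integer item rather than an itemset, it copies suffixes instead of using pseudo-projection, and the names `project`, `supports`, and `prefixspan` are invented for illustration.

```ocaml
module IntMap = Map.Make (Int)

(* Suffix of a sequence after the first occurrence of [item], if any. *)
let rec project item = function
  | [] -> None
  | x :: rest -> if x = item then Some rest else project item rest

(* For each item, the number of sequences in [db] that contain it. *)
let supports db =
  List.fold_left
    (fun acc seq ->
      (* Deduplicate so an item counts once per sequence. *)
      List.sort_uniq compare seq
      |> List.fold_left
           (fun acc item ->
             IntMap.update item
               (function None -> Some 1 | Some n -> Some (n + 1))
               acc)
           acc)
    IntMap.empty db

(* All patterns extending [prefix] with support >= [min_sup], paired with
   their support. [db] is the database projected on [prefix]. *)
let rec prefixspan ~min_sup prefix db =
  IntMap.fold
    (fun item sup acc ->
      if sup < min_sup then acc
      else
        let pattern = prefix @ [ item ] in
        (* Projected database: the postfixes left to explore. *)
        let projected = List.filter_map (project item) db in
        ((pattern, sup) :: prefixspan ~min_sup pattern projected) @ acc)
    (supports db) []
```

Calling `prefixspan ~min_sup:2 [] db` on a list of integer sequences returns every pattern contained in at least two of them, each paired with its support. CloSpan’s pruning would hook into the point where `projected` is built, which is where identical projected databases could be detected.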

Finally, CISpan [9] augments CloSpan with the ability to handle a new incremental version of the database. CISpan does this by dealing only with inserts and deletes, constructing a second lattice of just these database entries and then merging the original lattice with the incremental lattice.

Each of these three algorithms resides in the same codebase, along with several smaller experiments on alternate techniques. They are structured as #ifdef blocks in the C++ source, with each algorithm overlaid directly on top of the previous one. Unfortunately, this makes following the logic of the algorithm in the code very difficult, and makes adding to the algorithms even more so. With this project, Caravan, I hope to explore the possibility of utilizing shared data structures instead of reimplementing them or mangling them for each new algorithm.

2. RELATED WORK

A significant inspiration for this framework is SPMF [3], a Java-based collection of data mining algorithms with similar origins. SPMF consists of a collection of algorithms, which overlap only to a limited extent. For example, there are several nearly identical implementations of the Item and Itemset classes. However, the way that classes are structured and how they relate to each other is fairly consistent across the codebase.

Another collection of algorithms can be found in IlliMine [4], which extends well beyond frequent pattern mining. It is, however, just a packaging of individual implementations. None of the algorithms share source code (some don’t even have source code), meaning none of them can benefit from improvements in shared libraries.

Though both of these collections are partially open source, neither provides a convenient way for community contributions or expansion, beyond submitting patches directly to the maintainers.

Figure 1: Basic shared structures.

Figure 2: Performance of Caravan vs SPMF

3. ARCHITECTURE

3.1 OCaml

I’ve utilized the OCaml programming language and libraries for this framework. OCaml is a functional language in the ML programming language family. Besides my personal preference for the language, it has many beneficial characteristics when it comes to manipulating complex data structures such as the ones I use in data mining. During development of this framework, I found that the strong type system helped quickly identify issues in the code that might otherwise have taken much longer to discover. OCaml is also known for its speed; for example, since all types are known at compile time, type information can be stripped and does not need to be checked at run time.

One possible drawback is that OCaml uses automated memory management (garbage collection). This adds some overhead to data structures so that they can be organized in memory, decreasing the theoretical limit of how much raw data I can hold in memory. It will be interesting to see how well data mining algorithms can scale, as their use tends to push the edges of hardware and software capabilities.

3.2 Basic Shared Structures

The goal of the architecture is to promote reuse of code between algorithms, while still allowing for extensibility. Here I use functors (parameterized, polymorphic modules) and objects to create the individual data structures necessary for mining sequential patterns. The individual functors can be composed to build the final data structures needed by individual algorithms.

In Figure 1 I’ve identified the core structures which are shared by a variety of sequential pattern mining algorithms. Most of this structure comes directly from the problem space itself, which consists of finding patterns in an ordered list of itemsets. These data structures are based on the PrefixSpan, CloSpan, and CISpan algorithms. As more algorithms are added, these shared structures will expand. For example, I anticipate adding a Lattice structure to accommodate CloSpan and CISpan, even though it isn’t required for PrefixSpan.

Each of these structures is encapsulated so that it can be changed without altering the components that depend on it. It would be trivial, for example, to allow string-based items instead of the integers that are used by default. More importantly, each component includes the logic for manipulating its own internal state. This allows future improvements, such as parallel processing, to be made transparently to the larger algorithm that uses the components.
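As an illustration of how a functor could make the item type swappable, here is a small sketch. The signature `ITEM`, the functor `MakeSequence`, and the instances `IntSeq` and `StringSeq` are hypothetical names and do not reflect Caravan’s actual interfaces.

```ocaml
(* Hypothetical signature for the item type. *)
module type ITEM = sig
  type t
  val compare : t -> t -> int
  val to_string : t -> string
end

(* An itemset is a set of items; a sequence is an ordered list of itemsets. *)
module MakeSequence (Item : ITEM) = struct
  module Itemset = Set.Make (Item)

  type t = Itemset.t list

  let of_lists (lists : Item.t list list) : t = List.map Itemset.of_list lists

  (* True if [pattern] is a subsequence of [seq]: each pattern itemset is
     contained in a distinct itemset of [seq], in the same order. *)
  let rec contains (seq : t) (pattern : t) =
    match pattern, seq with
    | [], _ -> true
    | _, [] -> false
    | p :: ps, s :: ss ->
      if Itemset.subset p s then contains ss ps else contains ss pattern

  let to_string (seq : t) =
    seq
    |> List.map (fun set ->
           let items = List.map Item.to_string (Itemset.elements set) in
           "(" ^ String.concat " " items ^ ")")
    |> String.concat " "
end

(* Integer items, the default described above. *)
module IntSeq = MakeSequence (struct
  type t = int
  let compare = Int.compare
  let to_string = string_of_int
end)

(* String items: only the functor argument changes. *)
module StringSeq = MakeSequence (struct
  type t = string
  let compare = String.compare
  let to_string s = s
end)
```

Both instances share the same `contains` logic, and the type checker keeps integer sequences and string sequences from being mixed by accident.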

Currently I’m using classes (objects) for each of these components, though I’ve done some experiments with using functors (parameterized polymorphic modules) instead, which might allow for more complex composition of the components.

As is, each component is at least to some degree polymorphic. The Database, for example, can hold either normal sequences or pseudo-sequences. A pseudo-sequence refers to a normal sequence, but with a specific itemset and item offset, allowing projected databases to take significantly less space (as per the original PrefixSpan paper). The Database component will not, however, allow you to mix pseudo-sequences with normal sequences, which I believe is a beneficial trait.
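A rough sketch of the pseudo-sequence idea follows. The record fields and function names are assumptions made for illustration, and itemsets are represented as plain integer lists. The point is that projection only moves two offsets while the underlying sequence is shared, and that a type parameter on the database prevents mixing the two kinds of entry.

```ocaml
(* A normal sequence: an ordered list of itemsets. *)
type sequence = int list list

(* A pseudo-sequence points into a shared sequence (illustrative fields). *)
type pseudo = {
  base : sequence;    (* the original sequence, shared and never copied *)
  itemset_off : int;  (* index of the first itemset of the suffix *)
  item_off : int;     (* items already consumed within that itemset *)
}

(* One element type per database: sequences or pseudo-sequences, not both. *)
type 'a database = { entries : 'a list }

let rec drop n = function
  | _ :: rest when n > 0 -> drop (n - 1) rest
  | l -> l

(* Materialize the suffix that a pseudo-sequence denotes. *)
let suffix (p : pseudo) : sequence =
  match drop p.itemset_off p.base with
  | [] -> []
  | first :: rest -> drop p.item_off first :: rest

(* Move the offsets just past the first occurrence of [item] in the suffix. *)
let project (item : int) (p : pseudo) : pseudo option =
  let rec scan i skip = function
    | [] -> None
    | set :: rest ->
      let rec find j = function
        | [] -> None
        | x :: xs -> if j >= skip && x = item then Some j else find (j + 1) xs
      in
      (match find 0 set with
       | Some j -> Some { p with itemset_off = i; item_off = j + 1 }
       | None -> scan (i + 1) 0 rest)
  in
  scan p.itemset_off p.item_off (drop p.itemset_off p.base)

let of_sequences (seqs : sequence list) : pseudo database =
  { entries = List.map (fun s -> { base = s; itemset_off = 0; item_off = 0 }) seqs }

let project_db item (db : pseudo database) : pseudo database =
  { entries = List.filter_map (project item) db.entries }
```

Each projected entry costs one small record regardless of how long the remaining suffix is, which is the space saving that pseudo-projection is meant to provide.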

4. PREFIXSPAN IMPLEMENTATION

The implementation of PrefixSpan started as a direct translation of the implementation in SPMF from Java to OCaml. I was, however, able to eliminate several classes by making existing classes polymorphic. The best example of this is the use of the Database module for holding different types of sequences. At various points in the program it holds (1) the original sequences, (2) the projected pseudo-sequences, and (3) the final patterns.

5. PERFORMANCE

I compare Caravan’s PrefixSpan performance to the PrefixSpan implementation in the SPMF framework. To do this, I created 9 input files using the seq data generator from the IlliMine distribution. All parameters were held constant except for ’ncust’, the number of customers. The benchmark was run several times, all with very similar results.
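A repeated-run measurement of this kind could be scripted along the following lines. This is only a sketch: the harness actually used for the benchmark is not described here, `time_runs` and `median` are hypothetical helpers, and `Sys.time` reports processor time rather than wall-clock time.

```ocaml
(* Run [f] [runs] times and return the CPU seconds used by each run. *)
let time_runs ~runs f =
  List.init runs (fun _ ->
      let start = Sys.time () in
      ignore (f ());
      Sys.time () -. start)

(* Median of a non-empty list of timings (upper median for even lengths). *)
let median timings =
  let sorted = List.sort compare timings in
  List.nth sorted (List.length sorted / 2)
```

Reporting the median of several runs per input file gives one data point per value of ’ncust’ and dampens the effect of an occasional slow run.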

In Figure 2 you can see that there is a significant difference between the two implementations, with Caravan the slower of the two, even though they have a relatively similar structure and follow the same basic algorithm. In this case the hypothesis that OCaml would be faster, since it is a compiled language and otherwise has memory management similar to Java’s, was not borne out.

I believe that there is a memory leak in Caravan, causing it to slow down in general. Unfortunately, none of the memory profiling techniques I tried provided accurate results, but I’ll continue to investigate. It is also possible that the algorithm is somehow implemented incorrectly, though this can only be partially true since the discovered patterns are identical between Caravan and SPMF.

While it is disappointing that Caravan didn’t match, if not surpass, the performance of SPMF, it is an interesting illustration of how competing implementations of an identical algorithm can have significant performance differences.

6. FUTURE WORK

My hope is that this effort can lay the groundwork for implementing many other sequential pattern mining algorithms in OCaml. In the immediate future I plan to add CloSpan, CISpan, and BIDE [7], with a careful eye toward reusing existing data structures and algorithms as much as possible. One important aspect of this which the current codebase lacks is standardized logging and benchmarking. I found that logging the steps of the implementation greatly aided in determining its correctness, and having a common approach would be helpful.

Secondly, I hope to leverage the common parts of various algorithms to improve efficiency. Keeping the same interface but changing the internal design should make it possible to speed up and scale individual components without altering higher-level algorithms. The main example of this is allowing transparent disk-based projected databases. The extreme example would be allowing for parallel processing at a low level. Existing work on parallel mining of sequential patterns [1] is worth examining, as is the possibility of using JoCaml [2] to manage parallel execution.
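The idea of keeping the interface fixed while swapping the storage strategy might look like the sketch below. The signature `STORE` and the modules `Memory`, `Disk`, and `Support` are hypothetical. The disk backend simply marshals the whole list to a temporary file, which is far cruder than a real disk-based projected database, and it never deletes that file.

```ocaml
(* Hypothetical storage interface shared by all backends. *)
module type STORE = sig
  type 'a t
  val of_list : 'a list -> 'a t
  val to_list : 'a t -> 'a list
end

(* Backend 1: keep everything in memory. *)
module Memory : STORE = struct
  type 'a t = 'a list
  let of_list l = l
  let to_list l = l
end

(* Backend 2: keep only a file path in memory. The type parameter is a
   phantom that records what was written. Not suitable for closures. *)
module Disk : STORE = struct
  type 'a t = string

  let of_list l =
    let path = Filename.temp_file "caravan" ".db" in
    let oc = open_out_bin path in
    Marshal.to_channel oc l [];
    close_out oc;
    path

  let to_list path =
    let ic = open_in_bin path in
    let l = Marshal.from_channel ic in
    close_in ic;
    l
end

(* A higher-level routine written once against the interface. *)
module Support (S : STORE) = struct
  let count pred (db : 'a S.t) =
    List.length (List.filter pred (S.to_list db))
end

module InMemory = Support (Memory)
module OnDisk = Support (Disk)
```

`InMemory.count` and `OnDisk.count` run the same code. Only the functor argument decides where the sequences live, so the algorithm that calls them does not need to change.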

Ultimately, I’d like to see OCaml Caravan expanded from being a Sequential Pattern Mining framework to being a more general Data Mining and Machine Learning framework. I’d also like to have other parties contribute to the system. In order to promote improvements from such sources, I’ve created internal documentation on the structure of the system and put the code repository up on my website, along with a mirror on GitHub. See http://thelackthereof.org/Caravan.

7. CONCLUSION

I have presented a new framework, Caravan, for implementing sequential pattern mining algorithms, and eventually data mining algorithms in general, written in OCaml. I implemented PrefixSpan as a demonstration of the framework, showing that the core architecture is sound. However, when comparing the performance of PrefixSpan implemented on Caravan to the same algorithm implemented on SPMF (a Java-based implementation), I found Caravan to be significantly slower.

8. ACKNOWLEDGMENT

I especially appreciate being able to learn from existing implementations. Thanks to Ding Yuan for providing the CISpan source, the IlliMine project for their codebase which includes CloSpan, and Philippe Fournier-Viger for providing source code to SPMF.

9. REFERENCES

[1] S. Cong, J. Han, and D. Padua. Parallel mining of closed sequential patterns. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, page 567, 2005.

[2] C. Fournet, F. L. Fessant, L. Maranget, and A. Schmitt. JoCaml: a language for concurrent distributed and mobile programming. Advanced Functional Programming, page 1948–1948, 2003.

[3] P. Fournier-Viger, R. Nkambou, and E. Nguifo. A knowledge discovery framework for learning task models from user interactions in intelligent tutoring systems. MICAI 2008: Advances in Artificial Intelligence, page 765–778, 2008.

[4] J. Han. Data Mining Group. IlliMine project. University of Illinois Urbana-Champaign Database and Information Systems Laboratory. 2005.

[5] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. 2006.

[6] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In icccn, page 0215, 2001.

[7] J. Wang and J. Han. BIDE: efficient mining of frequent closed sequences. In Proceedings of the 20th International Conference on Data Engineering, page 79, 2004.

[8] X. Yan, J. Han, and R. Afshar. CloSpan: mining closed sequential patterns in large datasets. In Proc. of SIAM Int. Conf. on Data Mining, 2003.

[9] D. Yuan, K. Lee, H. Cheng, G. Krishna, Z. Li, X. Ma, Y. Zhou, and J. Han. CISpan: comprehensive incremental mining algorithms of closed sequential patterns for Multi-Versional software mining. 2008.
