Caravan: Sequential Pattern Mining in OCaml - The Lack Thereof

Caravan: Sequential Pattern Mining in OCaml 

Brock Wilcox 

wilcox6@uiuc.edu 

ABSTRACT 

When attempting to expand upon existing Sequential Pattern 

Mining software, namely the PrefixSpan, CloSpan, and 

CISpan series of algorithms, I found that the source code for 

each didn’t readily lend itself to analysis and improvement. 

Each of these algorithms builds upon the previous, and there 

are experiments in between each that remain in the code. 

Ideally algorithms like these would be implemented in a 

way that allows individual improvements to be added, while 

keeping the overlapping components shared. Additionally I 

found that none of these algorithms have been implemented 

in OCaml, which I prefer for implementing software that

manipulates complex data structures.

In this paper I present Caravan, the foundation for an OCaml-based

sequential pattern mining framework, which aims to 

avoid the shortcomings of existing implementations and provide 

the first sequential pattern mining implementation written 

in OCaml. 

1. INTRODUCTION 

While attempting to make some changes to the CISpan [9] 

algorithm, I found that the current code was too integrated 

with the author’s CPMiner [5] experiment to be useful outside 

of that context. Similarly, I began work on the CloSpan 

algorithm on which CISpan is based, and found that the 

code was filled with experiments and alternative implementations, 

making it difficult to work with. Additionally, the

code uses pointer arithmetic extensively, making it very difficult

to verify the correctness of the implementation. Based

on this experience, I created Caravan as a way to explore the 

systematic overlap of the components of common sequential 

pattern mining algorithms. 

One of the basic sequential pattern mining algorithms is 

PrefixSpan [6], which constructs a prefix tree of discovered 

sequential patterns along with projected databases of postfix 

sequences to explore. Building on top of that, both algorithmically 

and implementation-wise, is CloSpan [8]. CloSpan 

adds several pruning techniques to narrow the search space. 

The two primary additions are (1) efficiently detecting identical 

projected databases and then only constructing and 

storing them a single time (using a prefix lattice instead of 

a prefix tree), and (2) avoiding searching through projected 

databases of non-closed patterns when the closed version can 

be detected. These additions improve both the time and 

space efficiency of PrefixSpan by an order of magnitude, allowing 

CloSpan to explore longer sequential patterns than 

possible in PrefixSpan. 
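To make the notion of a closed pattern concrete, here is a minimal sketch that filters a list of already-mined patterns down to the closed ones: a pattern is closed when no proper super-sequence has the same support. This is a naive post-filter written for illustration, not CloSpan's search-space pruning, and it assumes single-item sequence elements. The names `is_subsequence` and `closed_only` are hypothetical.

```ocaml
(* A mined pattern: its items in order, and its support count. *)
type pattern = int list * int

(* [is_subsequence a b] holds when [a] can be obtained from [b]
   by deleting zero or more items. *)
let rec is_subsequence a b =
  match a, b with
  | [], _ -> true
  | _, [] -> false
  | x :: xs, y :: ys ->
    if x = y then is_subsequence xs ys else is_subsequence a ys

(* Keep only the patterns that have no distinct super-sequence
   with exactly the same support. Quadratic in the number of patterns. *)
let closed_only (patterns : pattern list) : pattern list =
  List.filter
    (fun (p, sup) ->
      not
        (List.exists
           (fun (q, sup') -> sup = sup' && p <> q && is_subsequence p q)
           patterns))
    patterns

let () =
  (* [1] is absorbed by [1; 2] because both have support 2. *)
  closed_only [ ([ 1 ], 2); ([ 1; 2 ], 2); ([ 2 ], 3) ]
  |> List.iter (fun (p, sup) ->
         Printf.printf "%s : %d\n"
           (String.concat " " (List.map string_of_int p))
           sup)
```

CloSpan avoids generating most non-closed patterns in the first place, which is where its time and space savings come from; the filter above only shows which patterns would survive.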

Finally, CISpan [9] augments CloSpan with the ability to 

handle a new incremental version of the database. CISpan 

does this by only dealing with inserts and deletes, constructing 

a second lattice of just these database entries and then 

merging the original lattice with the incremental lattice. 

Each of these three algorithms resides in the same codebase, 

along with several smaller experiments on alternate techniques. 

They are structured as #ifdef blocks in the C++ source,

with each algorithm overlaid directly on top of the previous 

one. Unfortunately, this makes following the logic of the algorithm 

in the code very difficult, and makes adding to the 

algorithms even more so. With this project, Caravan, I hope 

to explore the possibility of utilizing shared data structures 

instead of reimplementing them or mangling them for each 

new algorithm. 

2. RELATED WORK 

A significant inspiration for this framework is SPMF [3], a 

Java-based collection of data mining algorithms with similar

origins. SPMF consists of a collection of algorithms, 

which overlap only to a limited extent. For example, there 

are several nearly identical implementations of the Item and 

Itemset classes. However, the way that classes are structured 

and how they relate to each other is fairly consistent 

across the codebase. 

Another collection of algorithms can be found in Illimine 

[4], which extends well beyond frequent pattern mining. It 

is, however, just a packaging of individual implementations. 

None of the algorithms share source code (some don’t even 

have source code), meaning none of them can benefit from 

improvements in shared libraries. 

Though both of these collections are partially open source, 

neither provides a convenient way for community contributions 

or expansion, beyond submitting patches directly to

the maintainers.

Figure 1: Basic shared structures.

Figure 2: Performance of Caravan vs SPMF

3. ARCHITECTURE 

3.1 OCaml 

I’ve utilized the OCaml programming language and libraries 

for this framework. OCaml is a functional language in the 

ML programming language family. Besides my personal 

preference for the language, it has many beneficial characteristics 

when it comes to manipulating complex data structures 

such as the ones I use in data mining. During development 

of this framework, I found that the strong type 

system helped quickly identify issues in the code that might 

have taken much longer to otherwise discover. OCaml is also 

known for its speed; for example, since all types are known

at compile time, all types can be stripped and do not need 

to be checked at run time. 

One possible drawback is that OCaml uses automatic memory

management (garbage collection). This adds some overhead 

to data structures so that they can be organized in 

memory, decreasing the theoretical limit of how much raw 

data I can hold in memory. It will be interesting to see how 

well data mining algorithms can scale, as their use tends to 

push the edges of hardware and software capabilities. 

3.2 Basic Shared Structures 

The goal in the architecture is to promote reuse of code 

between algorithms, while still allowing for extensibility.

Here I use functors (parameterized, polymorphic modules)

and objects to create the individual data structures necessary 

for mining sequential patterns. The individual functors 

can be composed to build the final data structures needed by

individual algorithms. 
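As a minimal sketch of this kind of composition, the following functor builds itemsets and sequences from any item type. It is an illustration under my own assumptions rather than Caravan's actual code: the `ITEM` signature and the names `MakeSequence`, `IntItem`, and `StringItem` are hypothetical. Only `Set.Make` comes from the OCaml standard library.

```ocaml
(* Hypothetical signature for an item: any ordered, printable type. *)
module type ITEM = sig
  type t
  val compare : t -> t -> int
  val to_string : t -> string
end

(* An itemset is a set of items; a sequence is an ordered list of itemsets. *)
module MakeSequence (I : ITEM) = struct
  module Itemset = Set.Make (I)

  type t = Itemset.t list

  (* Build a sequence from nested lists of items. *)
  let of_lists (xss : I.t list list) : t = List.map Itemset.of_list xss

  (* True if some itemset of the sequence contains the item. *)
  let contains_item (item : I.t) (s : t) : bool =
    List.exists (Itemset.mem item) s

  (* Render each itemset in parentheses, items in sorted order. *)
  let to_string (s : t) : string =
    s
    |> List.map (fun set ->
           "("
           ^ String.concat " " (List.map I.to_string (Itemset.elements set))
           ^ ")")
    |> String.concat " "
end

module IntItem = struct
  type t = int
  let compare : int -> int -> int = compare
  let to_string = string_of_int
end

module StringItem = struct
  type t = string
  let compare = String.compare
  let to_string s = s
end

(* The same functor yields integer-based and string-based sequences. *)
module IntSeq = MakeSequence (IntItem)
module StrSeq = MakeSequence (StringItem)

let () =
  let s = IntSeq.of_lists [ [ 1; 2 ]; [ 3 ]; [ 2; 1 ] ] in
  print_endline (IntSeq.to_string s);
  let t = StrSeq.of_lists [ [ "milk"; "bread" ]; [ "eggs" ] ] in
  print_endline (StrSeq.to_string t);
  Printf.printf "contains eggs: %b\n" (StrSeq.contains_item "eggs" t)
```

Changing the item representation only requires supplying a different `ITEM` module; the sequence code itself is untouched.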

In Figure 1 I’ve identified the core structures which are 

shared by a variety of sequential pattern mining algorithms. 

Most of this structure comes directly from the problem space 

itself, which consists of finding patterns in an ordered list of 

itemsets. These data structures are based on the PrefixSpan,

CloSpan, and CISpan algorithms. As more algorithms 

are added, these shared structures will expand. For example, 

I anticipate adding a Lattice structure to accommodate 

CloSpan and CISpan, even though it isn’t required for PrefixSpan. 

Each of these structures is encapsulated in a way to allow 

changes without altering their indirect dependencies. It 

would be trivial, for example, to allow string-based items instead 

of the integers that are used by default. More importantly, 

each component includes the logic for manipulating 

its own internal state. This allows for future improvements,

such as parallel processing, that remain transparent to the larger

algorithm using the components.

Currently I’m using classes (objects) for each of these components, 

though I’ve done some experiments with using functors

(parameterized polymorphic modules) instead, which

might allow for more complex composition of the components.

As is, each component is at least to some degree polymorphic. 

The Database, for example, can hold either normal 

sequences or pseudo-sequences. Pseudo-sequences refer to a 

normal sequence, but with a specific itemset and item offset, 

allowing projected databases to take significantly less space 

(as per the original PrefixSpan paper). The database component 

will not, however, allow you to mix pseudo-sequences

with normal sequences, which I believe is a beneficial trait. 
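The idea can be sketched as follows. This is an illustrative reconstruction, not the Caravan source: the record fields, the `suffix` helper, and the `database` class are hypothetical names, and items are assumed to be integers stored in plain lists.

```ocaml
(* A normal sequence: an ordered list of itemsets, each a list of items. *)
type sequence = int list list

(* A pseudo-sequence refers to an existing sequence and records where the
   projected suffix starts. No items are copied. *)
type pseudo_sequence = {
  base : sequence;       (* the shared original sequence *)
  itemset_offset : int;  (* index of the first itemset of the suffix *)
  item_offset : int;     (* index of the first item within that itemset *)
}

(* Materialise the suffix a pseudo-sequence stands for (for inspection only). *)
let suffix (p : pseudo_sequence) : sequence =
  let rec drop n l =
    if n <= 0 then l else match l with [] -> [] | _ :: t -> drop (n - 1) t
  in
  match drop p.itemset_offset p.base with
  | [] -> []
  | first :: rest -> drop p.item_offset first :: rest

(* A database that is polymorphic in its element type. One instance holds
   either sequences or pseudo-sequences, never a mixture of the two. *)
class ['a] database = object
  val mutable entries : 'a list = []
  method add (x : 'a) = entries <- x :: entries
  method size = List.length entries
  method to_list = List.rev entries
end

let () =
  let s : sequence = [ [ 1; 2 ]; [ 3 ]; [ 4; 5 ] ] in
  let originals : sequence database = new database in
  originals#add s;
  let projected : pseudo_sequence database = new database in
  projected#add { base = s; itemset_offset = 0; item_offset = 1 };
  (* projected#add s  would be rejected by the type checker *)
  List.iter
    (fun p -> Printf.printf "suffix has %d itemsets\n" (List.length (suffix p)))
    projected#to_list;
  Printf.printf "%d original, %d projected\n" originals#size projected#size
```

Because the element type is fixed when a database is created, the restriction on mixing the two kinds of sequence is enforced at compile time rather than by a run-time check.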

4. PREFIXSPAN IMPLEMENTATION 

The implementation of PrefixSpan started as a direct translation 

of the implementation in SPMF from Java to OCaml. 

I was, however, able to eliminate several classes by making 

existing classes polymorphic. The best example of this is 

the use of the Database module for holding different types 

of sequences. At various points in the program it holds (1)

the original sequences, (2) the projected pseudo-sequences,

and (3) the final patterns.
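For readers unfamiliar with the algorithm, the core recursion of PrefixSpan can be sketched in a few lines. This is a deliberately simplified illustration and not the Caravan implementation: it assumes every element of a sequence is a single item rather than an itemset, and the function names are hypothetical. The suffix returned by `project` is the tail of the original immutable list, so it shares memory with the input in the spirit of a pseudo-projection.

```ocaml
(* Simplifying assumption: each element of a sequence is a single item. *)
type sequence = int list

(* Suffix of a sequence after the first occurrence of [item], if any.
   The returned list is a shared tail of the input, not a copy. *)
let rec project item = function
  | [] -> None
  | x :: rest -> if x = item then Some rest else project item rest

(* Distinct items of one sequence, so each sequence counts once per item. *)
let distinct (s : sequence) = List.sort_uniq compare s

(* For every item, the number of sequences in [db] that contain it. *)
let item_supports (db : sequence list) : (int * int) list =
  let counts = Hashtbl.create 16 in
  List.iter
    (fun s ->
      List.iter
        (fun item ->
          let c = try Hashtbl.find counts item with Not_found -> 0 in
          Hashtbl.replace counts item (c + 1))
        (distinct s))
    db;
  Hashtbl.fold (fun item c acc -> (item, c) :: acc) counts []
  |> List.sort compare

(* All sequential patterns with support >= [min_support], found by growing
   each frequent prefix within its own projected database. *)
let prefixspan ~min_support (db : sequence list) : (int list * int) list =
  let rec grow prefix db =
    item_supports db
    |> List.filter (fun (_, c) -> c >= min_support)
    |> List.map (fun (item, c) ->
           let pattern = prefix @ [ item ] in
           let projected = List.filter_map (project item) db in
           (pattern, c) :: grow pattern projected)
    |> List.concat
  in
  grow [] db

let () =
  let db = [ [ 1; 2; 3 ]; [ 1; 3 ]; [ 2; 3 ]; [ 1; 2 ] ] in
  prefixspan ~min_support:2 db
  |> List.iter (fun (pattern, support) ->
         Printf.printf "%s : %d\n"
           (String.concat " " (List.map string_of_int pattern))
           support)
```

Each recursive call only scans the projected database for its prefix, which is the property that distinguishes pattern growth from generate-and-test approaches.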

5. PERFORMANCE 

I compare Caravan’s PrefixSpan performance to the PrefixSpan 

implementation in the SPMF framework. To do this, I 

created 9 input files using the seq data generator from the 

Illimine distribution. All parameters were held constant except 

for ’ncust’, the number of customers. The benchmark 

was run several times, all with very similar results. 

In Figure 2 you can see that there is a significant difference

between the two implementations, with Caravan the slower of the two,

even though they have a relatively similar structure and follow the

same basic algorithm. In this case the hypothesis that OCaml would

be faster, since it is a compiled language and otherwise has

memory management similar to Java’s, is not supported by this benchmark.

I believe that there is a memory leak in Caravan, causing it 

to slow down in general. Unfortunately none of the memory 

profiling techniques I tried provided accurate results, but I’ll 

continue to investigate. It is also possible that the algorithm 

is somehow implemented incorrectly, though this can only 

be partially true since the actual discovered patterns are 

identical between Caravan and SPMF. 

While it is disappointing that Caravan didn’t match, if not 

surpass, the performance of SPMF, it is an interesting illustration 

of how competing implementations of an identical 

algorithm can have significant performance differences. 

6. FUTURE WORK 

My hope is that this effort can lay the groundwork for implementing 

many other sequential pattern mining algorithms in 

OCaml. In the immediate future I plan to add

CloSpan, CISpan, and BIDE [7], with a careful eye toward 

reusing existing data structures and algorithms as much as 

possible. One important aspect of this which the current 

codebase lacks is standardized logging and benchmarking. I 

found that logging the steps of the implementation greatly 

aided in determining its accuracy, and having a common 

approach would be helpful. 
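One possible shape for such a common approach is a single wrapper that every algorithm calls around its phases. The helper below is a hypothetical sketch, not part of the current codebase. It uses `Sys.time` from the standard library, which reports processor time rather than wall-clock time.

```ocaml
(* Run [f], log how much processor time it took under [label] on stderr,
   and return both the result and the elapsed time in seconds. *)
let timed (label : string) (f : unit -> 'a) : 'a * float =
  let start = Sys.time () in
  let result = f () in
  let elapsed = Sys.time () -. start in
  Printf.eprintf "[%s] %.3fs\n%!" label elapsed;
  (result, elapsed)

let () =
  (* Stand-in workload; a real caller would wrap a mining phase instead. *)
  let total, _ =
    timed "sum" (fun () -> List.fold_left ( + ) 0 [ 1; 2; 3; 4 ])
  in
  Printf.printf "total = %d\n" total
```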

Secondly I hope to leverage the common parts of various algorithms 

to improve efficiency. Keeping the same interface 

but changing the internal design should make it possible 

to speed up and scale individual components without altering 

higher level algorithms. The main example of this is 

allowing transparent disk-based projected databases. The 

extreme example would be allowing for parallel processing 

at a low level. Prior work on parallel mining of closed

sequential patterns [1] is a starting point, and I plan to examine the

possibility of using JoCaml [2] to manage parallel execution.

Ultimately, I’d like to see Caravan expanded from

a sequential pattern mining framework into a more

general data mining and machine learning framework. I’d

also like to have other parties contribute to the system. In 

order to promote improvements from such sources, I’ve created 

internal documentation on the structure of the system 

and put the code repository up on my website, along with a 

mirror on GitHub. See http://thelackthereof.org/Caravan.

7. CONCLUSION

I have presented a new framework, Caravan, for implementing

sequential pattern mining algorithms, and eventually

data mining algorithms in general, written in OCaml. I implemented

PrefixSpan as a demonstration of the framework,

showing that the core architecture is sound. However, when

comparing the performance of PrefixSpan implemented on

Caravan to the same algorithm implemented on SPMF (a

Java-based implementation), I found Caravan to be significantly

slower.

8. ACKNOWLEDGMENT

I especially appreciate being able to learn from existing implementations.

Thanks to Ding Yuan for providing the CISpan

source, the Illimine project for their codebase which includes

CloSpan, and Philippe Fournier-Viger for providing

source code to SPMF.

9. REFERENCES

[1] S. Cong, J. Han, and D. Padua. Parallel mining of closed sequential patterns. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, page 567, 2005.

[2] C. Fournet, F. L. Fessant, L. Maranget, and A. Schmitt. JoCaml: a language for concurrent distributed and mobile programming. Advanced Functional Programming, page 1948–1948, 2003.

[3] P. Fournier-Viger, R. Nkambou, and E. Nguifo. A knowledge discovery framework for learning task models from user interactions in intelligent tutoring systems. MICAI 2008: Advances in Artificial Intelligence, page 765–778, 2008.

[4] J. Han. Data Mining Group. IlliMine project. University of Illinois Urbana-Champaign Database and Information Systems Laboratory. 2005.

[5] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. 2006.

[6] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In icccn, page 0215, 2001.

[7] J. Wang and J. Han. BIDE: efficient mining of frequent closed sequences. In Proceedings of the 20th International Conference on Data Engineering, page 79, 2004.

[8] X. Yan, J. Han, and R. Afshar. CloSpan: mining closed sequential patterns in large datasets. In Proc. of SIAM Int. Conf. on Data Mining, 2003.

[9] D. Yuan, K. Lee, H. Cheng, G. Krishna, Z. Li, X. Ma, Y. Zhou, and J. Han. CISpan: comprehensive incremental mining algorithms of closed sequential patterns for Multi-Versional software mining. 2008.
