
Assignment 2

(Question 1)

(1.1) There are 2^10 nonempty cuboids.

(1.2) First, let's list out all of the relevant aggregate selection cases. We only have to select on the first three dimensions, since all other dimensions are the same for each of our base cells.

Case           Count
(*, *, *)      3
(*, *, c3)     1
(*, *, b3)     2
(*, b2, *)     1
(*, c2, *)     2
(a1, *, *)     2
(b1, *, *)     1
(*, b2, c3)    1
(*, c2, b3)    2
(a1, *, c3)    1
(a1, *, b3)    1
(b1, *, b3)    1
(a1, b2, *)    1
(a1, c2, *)    1
(b1, c2, *)    1

Cells in base cuboid: 3
Total cells (with multiplicity): 3*2^10
Common dimensions: 7
Count-3 cells: 2^7 (these are counted three times, overlapping twice)
Count-2 cells: 4*2^7 (these are counted twice, overlapping once)

Nonempty aggregate cells:
Total - 2*(count-3 cells) - 1*(count-2 cells) - base cells
= 3*2^10 - 2*(2^7) - 1*(4*2^7) - 3
= 3*2^10 - 6*2^7 - 3
= 2301

(1.3) Iceberg cells (where count >= 2):
count-3 cells + count-2 cells
= 2^7 + 4*2^7
= 5*2^7
= 640
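As a sanity check on (1.2) and (1.3), here is a small Python sketch; the values of the seven common dimensions are placeholders I made up, since only the first three dimensions differ across the base cells:

from itertools import product

# The three base cells differ only in the first three dimensions; the other
# seven dimensions share the same values (placeholders here).
common = tuple("x%d" % i for i in range(4, 11))
base_cells = {("a1", "b2", "c3") + common,
              ("a1", "c2", "b3") + common,
              ("b1", "c2", "b3") + common}

# Every cell covering a base cell is obtained by replacing any subset of its
# ten dimension values with '*'.
counts = {}
for cell in base_cells:
    for mask in product((False, True), repeat=10):
        agg = tuple("*" if m else v for m, v in zip(mask, cell))
        counts[agg] = counts.get(agg, 0) + 1

print(len(counts) - len(base_cells))              # aggregate cells: 2301 = 3*2^10 - 6*2^7 - 3
print(sum(1 for c in counts.values() if c >= 2))  # iceberg cells:   640  = 5*2^7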

(1.4) There are 6 closed cells: (a1,b2,c3), (a1,c2,b3), (b1,c2,b3), (a1,*,*), (*,c2,b3), and (*,*,*).

Brock Wilcox
wilcox6@uiuc.edu
CS 412
2009-10-18

(Question 2)

(2.1) We construct a 4-D array from the base data and then partition each dimension into 2 parts, giving a total of 2^4 = 16 chunks. Each cell within a chunk can be assigned a unique offset index. Then, in order to compress this structure (especially important for even sparser data), we use this offset to represent each populated cell as a (chunk_id, offset) pair. Now we must decide the best order to visit these cells, so that we can efficiently (using the least amount of memory) build up the more general aggregated data: the 3-D, 2-D, 1-D, and 'all' cuboids.

To do this we must find the size of each 3-D cuboid, so that we can keep as little of the biggest one in memory as possible:

ABC = 2*4*2 = 16
ABD = 2*4*3 = 24
ACD = 2*2*3 = 12
BCD = 4*2*3 = 24

So we will try to fill ABD, then BCD, then ABC, and lastly ACD, with the goal of minimizing the amount of memory we consume at any point in time. To do this, we'd first move along the C dimension, then the A dimension, then the D dimension, and finally the B dimension, i.e. (A0,B0,C0,D0), (A0,B0,C1,D0), (A1,B0,C0,D0), (A1,B0,C1,D0), and so on. We'd repeat this process to determine the traversal order of each 3-D, 2-D, and 1-D cuboid to continue minimizing memory usage.
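Here is a minimal Python sketch of the (chunk_id, offset) idea for this 2x4x2x3 array; the split points and the example cell are illustrative assumptions, not part of the assignment:

dim_sizes = {"A": 2, "B": 4, "C": 2, "D": 3}
splits    = {"A": 1, "B": 2, "C": 1, "D": 2}   # each dimension cut into two parts at this index

def chunk_and_offset(cell):
    """cell holds 0-based coordinates, e.g. {"A": 1, "B": 3, "C": 0, "D": 2}."""
    chunk_id, offset = 0, 0
    for d in "ABCD":
        cut = splits[d]
        part = 0 if cell[d] < cut else 1
        lo, hi = (0, cut) if part == 0 else (cut, dim_sizes[d])
        chunk_id = chunk_id * 2 + part                  # 2 parts per dimension -> 2^4 = 16 chunk ids
        offset = offset * (hi - lo) + (cell[d] - lo)    # position inside the chunk
    return chunk_id, offset

print(chunk_and_offset({"A": 1, "B": 3, "C": 0, "D": 2}))   # e.g. (13, 1)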

(2.2) To be most efficient, start by noting that from highest to lowest cardinality we will work with B, then D, then A, and then C. Using the Counting Sort algorithm, sort all of the base cells by dimension B. The counting step tells us how many instances of each value we end up with, e.g. 2 for b3. If we were building an iceberg cube, we could at this step already know whether or not we'd need to calculate further cuboids descended from the 1-D cell for b3 (this is the Apriori property).

We now consider each distinct value of B to define a partition for the descendant cuboids. We next work within each partition one at a time, sorting on the D dimension, and so on. When working within a partition, the recursive work never has to re-partition the top level, and thus saves work.
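A compact Python sketch of this partition-and-recurse idea (BUC-style), processing dimensions in B, D, A, C order; the toy rows and the iceberg threshold are assumed for illustration:

from collections import defaultdict

MIN_SUP = 1   # assumed iceberg threshold; with 1, nothing is pruned

def buc(rows, start_dim, group_by, out):
    out[group_by] = len(rows)                    # count for the current cell
    for d in range(start_dim, 4):                # specialize on each remaining dimension
        buckets = defaultdict(list)              # the counting/bucketing step of the sort
        for row in rows:
            buckets[row[d]].append(row)
        for value, part in buckets.items():
            if len(part) >= MIN_SUP:             # Apriori-style pruning point
                buc(part, d + 1, group_by + ((d, value),), out)

# Assumed toy base tuples, already in (B, D, A, C) column order.
tuples = [("b1", "d1", "a1", "c1"),
          ("b3", "d2", "a1", "c2"),
          ("b3", "d2", "a2", "c1")]
counts = {}
buc(tuples, 0, (), counts)
print(counts[((0, "b3"),)])   # 2: the size of the b3 partition, before recursing into D, A, C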

(2.3) For Star-Cubing, we first construct a star-tree. In this case we don't need a star table, because we aren't doing any pruning (so we don't have any star nodes).


[Star-tree diagram from the original: root:5 with children b1:1, b2:1, b3:2, b4:1; beneath these, the d-level nodes d1:1, d1:1, d2:1, d3:1, d2:1; then the a-level nodes a1:1, a1:1, a1:1, a2:1, a2:1; and the c-level leaves c1:1, c2:1, c1:1, c2:1, c1:1. The leftmost path is root -> b1:1 -> d1:1 -> a1:1 -> c1:1.]

Without having any star nodes, we are losing out on a big part of the usefulness of the Star-Cubing algorithm. To start building up our aggregate data, we traverse the tree bottom-up and depth-first. We descend to b1:1, d1:1, a1:1, and then c1:1. As we go we are building several subtrees: DAC, ABC/B, BCD/BD, and BDA/BDA. At that point we've reached a leaf and return back up the tree. We can't actually complete the BDA branch yet, likely because of our lack of star nodes. However, we will be able to complete the BDA cuboid as soon as we hit the bottom of the second main tree. We continue like this, filling in cuboids as we are able, and keeping track of the current counts as we go.
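Without star nodes the tree above is just a prefix tree over the tuples in B, D, A, C order, with a count in every node. A tiny Python sketch of that construction follows; the rows are placeholders except the first, which matches the tree's leftmost branch:

def insert(node, tup):
    node["count"] = node.get("count", 0) + 1
    if tup:
        child = node.setdefault("children", {}).setdefault(tup[0], {})
        insert(child, tup[1:])

root = {}
for row in [("b1", "d1", "a1", "c1"),   # leftmost path of the tree above
            ("b2", "d1", "a1", "c2"),   # remaining rows are assumed placeholders
            ("b3", "d2", "a1", "c1")]:
    insert(root, row)

print(root["count"])                     # number of base tuples inserted
print(root["children"]["b1"]["count"])   # 1, i.e. the b1:1 node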

(Question 3)

Note: some of the information for answer #3 came from Xin, Dong, et al., "Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration."

(3.1) The MultiWay algorithm is fairly efficient for large datasets with a small number of dimensions, because it is able to fill in overlapping parts of cuboids. By choosing a good traversal order, it is able to keep the amount of data held in memory small. Data skew can actually make MultiWay perform better, since the overlapping parts each benefit from skipped cells during computation.

The BUC algorithm works on one dimension at a time, and allows for iceberg pruning. It doesn't, however, have a mechanism for using the overlapping parts of cuboids; it must instead effectively recalculate each cuboid. For especially dense data, BUC performs poorly compared to MultiWay or Star-Cubing.

Star-Cubing has efficiency similar to the MultiWay algorithm, since it avoids doing duplicate calculations. Plus, it has pruning abilities similar to BUC that allow it to work with iceberg cubes. It effectively keeps memory usage low while still avoiding duplicate aggregate computation.

(3.2) The MultiWay algorithm can't be used for computing only iceberg cubes: since it starts at the base cuboid, it doesn't know that a part of the lattice should be pruned until that part has already been computed.

The BUC algorithm does better, in that it can prune iceberg cubes as it goes. It performs best when the data isn't too skewed, and when it can start with the dimension of highest cardinality and move toward dimensions of lower cardinality. With real-world data, this order is the most likely to allow for pruning. However, very skewed data (where a dimension has 3 possible values but 99% of the rows fall into one of them, for example) can take away from this efficient pruning.

Star-Cubing is also good at pruning iceberg cubes as it constructs its cuboids. Before it expands child nodes, it verifies that the node itself meets the iceberg condition. If it does not, then the algorithm won't recurse to its child nodes.

(3.3) The basic idea of high-dimensional OLAP is to create vertical partitions of the dimensions, building smaller cubes (called shell fragments) which contain some extra data (inverted indexes, a.k.a. tid-lists), and then use these shells to efficiently answer queries. For the given example, (a1, ?, ?, *, c5, *, …, *) can be computed by first getting the shell fragments containing A and C: if there is a shell fragment that contains both, then we take the AC cuboid from that shell. If they aren't in the same shell fragment, then we take the tid-list for a1 and the tid-list for c5 and intersect them to use as our base table.

Then, for the two inquired dimensions, we take all possible values and their associated tid-lists (again taking higher cuboids from a fragment that contains both), and again intersect those tid-lists with our base table. Using this new base table, we build up a new data cube (using MultiWay or Star-Cubing, for example). This data cube is effectively a way to explore the two inquired dimensions for any value, along with the locked-down a1 and c5 values.
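A tiny Python sketch of the tid-list intersection step; the tid-lists are made-up placeholders, since in the real method they come from the inverted indexes stored with the shell fragments:

tid_a1 = {1, 2, 4, 5, 8}   # assumed rows having A = a1
tid_c5 = {2, 3, 5, 8, 9}   # assumed rows having C = c5

base_tids = tid_a1 & tid_c5   # rows satisfying both instantiated dimensions
print(sorted(base_tids))      # [2, 5, 8]: the base table for the local cube

# Each inquired dimension is handled the same way: intersect the tid-list of
# every one of its values with base_tids, then build a small cube over the result.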

(Question 4)

(4.1)

Items: A, C, D, E, I, K, M, N, O, U, and Y. Absolute minimum support = 5*0.6 = 3.

Now, using the Apriori algorithm, we go through two phases at each level: building a list of candidate sets, and then keeping the candidates whose counts meet the support threshold as the frequent itemsets.

C1 = {A}:1, {C}:2, {D}:1, {E}:4, {I}:1, {K}:5, {M}:3, {N}:2, {O}:3, {U}:1, {Y}:3
L1 = {E}:4, {K}:5, {M}:3, {O}:3, {Y}:3
C2 = {E,K}:4, {E,M}:2, {E,O}:3, {E,Y}:2, {K,M}:3, {K,O}:3, {K,Y}:3, {M,O}:1, {M,Y}:2, {O,Y}:2
L2 = {E,K}:4, {E,O}:3, {K,M}:3, {K,O}:3, {K,Y}:3
C3 = {E,K,O}:3 (all other 3-item candidates are pruned because they contain an infrequent 2-itemset)
L3 = {E,K,O}:3
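As a quick check of L1 through L3, here is a brute-force Python sketch over the frequent items of each transaction (the same per-transaction lists shown in the FP-growth table below), with minimum support count 3:

from itertools import combinations
from collections import Counter

transactions = [{"K", "E", "O", "M", "Y"},   # T100
                {"K", "E", "O", "Y"},        # T200
                {"K", "E", "M"},             # T300
                {"K", "M", "Y"},             # T400
                {"K", "E", "O"}]             # T500

for size in (1, 2, 3):
    counts = Counter(frozenset(c)
                     for t in transactions
                     for c in combinations(sorted(t), size))
    frequent = {tuple(sorted(s)): n for s, n in counts.items() if n >= 3}
    print("L%d:" % size, frequent)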

Using the FP-growth algorithm, we take the L1 from above and sort it by descending frequency:

L = { {K}:5, {E}:4, {O}:3, {M}:3, {Y}:3 }

We then sort the frequent items of each transaction into this order:

TID     Items bought (sorted)
T100    K, E, O, M, Y
T200    K, E, O, Y
T300    K, E, M
T400    K, M, Y
T500    K, E, O

Header table:

Item    Count
K       5
E       4
O       3
M       3
Y       3

FP-tree (each node shown as item:count):

null{}
  K:5
    E:4
      O:3
        M:1
          Y:1
        Y:1
      M:1
    M:1
      Y:1


Then we build the FP-tree, as shown above. Once we've constructed the tree, we can use it directly to extract our frequent itemsets. We begin with the least-frequent item and build up the list as in this table:

Item   Conditional Pattern Base               Conditional FP-tree   Frequent Patterns
Y      { {K,E,O,M}:1, {K,E,O}:1, {K,M}:1 }    <K:3>                 {K,Y}:3
M      { {K,E,O}:1, {K,E}:1, {K}:1 }          <K:3>                 {K,M}:3
O      { {K,E}:3 }                            <K:3, E:3>            {K,O}:3, {E,O}:3, {K,E,O}:3
E      { {K}:4 }                              <K:4>                 {K,E}:4

This gives us the same patterns, with the same counts, as we got using Apriori, but with only two scans of the dataset (one to count the items, one to build the tree) rather than a scan per candidate level. The tree keeps us from re-counting candidate subsets, making FP-growth much more efficient.
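Here is a short Python sketch of how the conditional pattern bases in the table are read off: for each item, collect the prefix of more-frequent items that precede it in each L-ordered transaction, which is equivalent to walking the FP-tree's prefix paths in this example:

order = ["K", "E", "O", "M", "Y"]            # L, in descending support order
sorted_txns = [["K", "E", "O", "M", "Y"],
               ["K", "E", "O", "Y"],
               ["K", "E", "M"],
               ["K", "M", "Y"],
               ["K", "E", "O"]]

for item in reversed(order[1:]):             # least frequent first: Y, M, O, E
    base = [tuple(t[:t.index(item)]) for t in sorted_txns if item in t]
    print(item, base)
# Y [('K', 'E', 'O', 'M'), ('K', 'E', 'O'), ('K', 'M')]
# M [('K', 'E', 'O'), ('K', 'E'), ('K',)]
# O [('K', 'E'), ('K', 'E'), ('K', 'E')]
# E [('K',), ('K',), ('K',), ('K',)]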

(4.2) There is only one frequent itemset with 3 items, {K,E,O}, so we will use it to generate matches to the given metarule. The support in each case will be 3/5 (60%), since that is how often we see {K,E,O}. The confidence in each case is 3 divided by the support count of the two items on the left-hand side of the rule. This gives us:

buys(X, K) and buys(X, E) -> buys(X, O) [60%, 75%] (does not meet min confidence)
buys(X, K) and buys(X, O) -> buys(X, E) [60%, 100%]
buys(X, E) and buys(X, O) -> buys(X, K) [60%, 100%]

Since the first rule doesn't meet the minimum confidence, only the last two association rules make the final list matching the metarule.
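And a small Python sketch that recomputes the support and confidence figures for the three candidate rules directly from the transactions:

transactions = [{"K", "E", "O", "M", "Y"}, {"K", "E", "O", "Y"},
                {"K", "E", "M"}, {"K", "M", "Y"}, {"K", "E", "O"}]
n = len(transactions)

def sup(items):
    # number of transactions containing every item in 'items'
    return sum(1 for t in transactions if set(items) <= t)

for lhs, rhs in ((("K", "E"), "O"), (("K", "O"), "E"), (("E", "O"), "K")):
    support = sup(lhs + (rhs,)) / n
    confidence = sup(lhs + (rhs,)) / sup(lhs)
    print(lhs, "->", rhs, "[%d%%, %d%%]" % (100 * support, 100 * confidence))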
