
Assignment 2

(Question 1)

(1.1) There are 2^10 nonempty cuboids.

(1.2) First, let's list out all of the relevant aggregate selection cases. We only have to select on the first three dimensions, since all other dimensions are the same for each of our base cells.

Case           Count
(*, *, *)      3
(*, *, c3)     1
(*, *, b3)     2
(*, b2, *)     1
(*, c2, *)     2
(a1, *, *)     2
(b1, *, *)     1
(*, b2, c3)    1
(*, c2, b3)    2
(a1, *, c3)    1
(a1, *, b3)    1
(b1, *, b3)    1
(a1, b2, *)    1
(a1, c2, *)    1
(b1, c2, *)    1

Cells in base cuboid: 3
Total cells (with multiplicity): 3*2^10
Common dimensions: 7
Count-3 cells: 2^7 (these are counted three times, overlapping twice)
Count-2 cells: 4*2^7 (these are counted twice, overlapping once)

Nonempty aggregate cells:
Total - 2*(count-3 cells) - 1*(count-2 cells) - base cells
= 3*2^10 - 2*(2^7) - 1*(4*2^7) - 3
= 3*2^10 - 6*2^7 - 3
= 2301

(1.3) Iceberg cells (where count >= 2):
count-3 cells + count-2 cells
= 2^7 + 4*2^7
= 5*2^7
= 640
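As a sanity check on (1.2) and (1.3), here is a small Python sketch; the values of the seven common dimensions are placeholders I made up, since only the first three dimensions differ across the base cells:

from itertools import product

# The three base cells differ only in the first three dimensions; the other
# seven dimensions share the same values (placeholders here).
common = tuple("x%d" % i for i in range(4, 11))
base_cells = {("a1", "b2", "c3") + common,
              ("a1", "c2", "b3") + common,
              ("b1", "c2", "b3") + common}

# Every cell covering a base cell is obtained by replacing any subset of its
# ten dimension values with '*'.
counts = {}
for cell in base_cells:
    for mask in product((False, True), repeat=10):
        agg = tuple("*" if m else v for m, v in zip(mask, cell))
        counts[agg] = counts.get(agg, 0) + 1

print(len(counts) - len(base_cells))              # aggregate cells: 2301 = 3*2^10 - 6*2^7 - 3
print(sum(1 for c in counts.values() if c >= 2))  # iceberg cells:   640  = 5*2^7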

(1.4) There are 6 closed cells: (a1,b2,c3), (a1,c2,b3), (b1,c2,b3), (a1,*,*), (*,c2,b3), and (*,*,*).

Brock Wilcox
wilcox6@uiuc.edu
CS 412
2009-10-18

(Question 2)

(2.1) We construct a 4-D array from the base data and then partition each dimension into 2 parts, giving a total of 2^4 = 16 chunks. Each cell within a chunk can be assigned a unique offset index. Then, in order to compress this structure (especially important for even sparser data), we use this offset to represent each populated cell as a (chunk_id, offset) pair. Now we must decide the best order to visit these cells, so that we can efficiently (using the least amount of memory) build up the more general aggregated data: the 3-D, 2-D, 1-D, and 'all' cuboids.

To do this we must find the size of each 3-D cuboid, so that we can keep as little of the biggest one in memory as possible:

ABC = 2*4*2 = 16
ABD = 2*4*3 = 24
ACD = 2*2*3 = 12
BCD = 4*2*3 = 24

So we will try to fill ABD, then BCD, then ABC, and lastly ACD, with the goal of minimizing the amount of memory we consume at any point in time. To do this, we'd first move along the C dimension, then the A dimension, then the D dimension, and finally the B dimension, i.e. (A0,B0,C0,D0), (A0,B0,C1,D0), (A1,B0,C0,D0), (A1,B0,C1,D0), and so on. We'd repeat this process to determine the traversal order of each 3-D, 2-D, and 1-D cuboid to continue minimizing memory usage.
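Here is a minimal Python sketch of the (chunk_id, offset) idea for this 2x4x2x3 array; the split points and the example cell are illustrative assumptions, not part of the assignment:

dim_sizes = {"A": 2, "B": 4, "C": 2, "D": 3}
splits    = {"A": 1, "B": 2, "C": 1, "D": 2}   # each dimension cut into two parts at this index

def chunk_and_offset(cell):
    """cell holds 0-based coordinates, e.g. {"A": 1, "B": 3, "C": 0, "D": 2}."""
    chunk_id, offset = 0, 0
    for d in "ABCD":
        cut = splits[d]
        part = 0 if cell[d] < cut else 1
        lo, hi = (0, cut) if part == 0 else (cut, dim_sizes[d])
        chunk_id = chunk_id * 2 + part                  # 2 parts per dimension -> 2^4 = 16 chunk ids
        offset = offset * (hi - lo) + (cell[d] - lo)    # position inside the chunk
    return chunk_id, offset

print(chunk_and_offset({"A": 1, "B": 3, "C": 0, "D": 2}))   # e.g. (13, 1)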

(2.2) To be most efficient, start by noting that from highest to lowest cardinality we will work with B, then D, then A, and then C. Using the Counting Sort algorithm, sort all of the base cells by dimension B. The counting step tells us how many instances of each value we end up with, e.g. 2 for b3. If we were building an iceberg cube, we could at this step already know whether or not we'd need to calculate further cuboids descended from the 1-D cell for b3 (this is the Apriori property).

We now consider each distinct value of B to define a partition for the descendant cuboids. We next work within each partition one at a time, sorting on the D dimension, and so on. When working within a partition, the recursive work never has to re-partition the top level, and thus saves work.
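A compact Python sketch of this partition-and-recurse idea (BUC-style), processing dimensions in B, D, A, C order; the toy rows and the iceberg threshold are assumed for illustration:

from collections import defaultdict

MIN_SUP = 1   # assumed iceberg threshold; with 1, nothing is pruned

def buc(rows, start_dim, group_by, out):
    out[group_by] = len(rows)                    # count for the current cell
    for d in range(start_dim, 4):                # specialize on each remaining dimension
        buckets = defaultdict(list)              # the counting/bucketing step of the sort
        for row in rows:
            buckets[row[d]].append(row)
        for value, part in buckets.items():
            if len(part) >= MIN_SUP:             # Apriori-style pruning point
                buc(part, d + 1, group_by + ((d, value),), out)

# Assumed toy base tuples, already in (B, D, A, C) column order.
tuples = [("b1", "d1", "a1", "c1"),
          ("b3", "d2", "a1", "c2"),
          ("b3", "d2", "a2", "c1")]
counts = {}
buc(tuples, 0, (), counts)
print(counts[((0, "b3"),)])   # 2: the size of the b3 partition, before recursing into D, A, C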

(2.3) For Star-Cubing, we first construct a star-tree. In this case we don't need a star table, because we aren't doing any pruning (so we don't have any star nodes).


[Star-tree diagram from the original: root:5 with children b1:1, b2:1, b3:2, b4:1; beneath these, the d-level nodes d1:1, d1:1, d2:1, d3:1, d2:1; then the a-level nodes a1:1, a1:1, a1:1, a2:1, a2:1; and the c-level leaves c1:1, c2:1, c1:1, c2:1, c1:1. The leftmost path is root -> b1:1 -> d1:1 -> a1:1 -> c1:1.]

Without having any star nodes, we are losing out on a big part of the usefulness of the Star-Cubing algorithm. To start building up our aggregate data, we traverse the tree bottom-up and depth-first. We descend to b1:1, d1:1, a1:1, and then c1:1. As we go we are building several subtrees: DAC, ABC/B, BCD/BD, and BDA/BDA. At that point we've reached a leaf and return back up the tree. We can't actually complete the BDA branch yet, likely because of our lack of star nodes. However, we will be able to complete the BDA cuboid as soon as we hit the bottom of the second main tree. We continue like this, filling in cuboids as we are able, and keeping track of the current counts as we go.
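Without star nodes the tree above is just a prefix tree over the tuples in B, D, A, C order, with a count in every node. A tiny Python sketch of that construction follows; the rows are placeholders except the first, which matches the tree's leftmost branch:

def insert(node, tup):
    node["count"] = node.get("count", 0) + 1
    if tup:
        child = node.setdefault("children", {}).setdefault(tup[0], {})
        insert(child, tup[1:])

root = {}
for row in [("b1", "d1", "a1", "c1"),   # leftmost path of the tree above
            ("b2", "d1", "a1", "c2"),   # remaining rows are assumed placeholders
            ("b3", "d2", "a1", "c1")]:
    insert(root, row)

print(root["count"])                     # number of base tuples inserted
print(root["children"]["b1"]["count"])   # 1, i.e. the b1:1 node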

(Question 3)

Note: some of the information for answer #3 came from Xin, Dong, et al., "Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration."

(3.1) The MultiWay algorithm is fairly efficient for large datasets with a small number of dimensions, because it is able to fill in overlapping parts of cuboids. By choosing a good traversal order, it is able to keep the amount of data held in memory small. Data skew can actually make MultiWay perform better, since the overlapping parts each benefit from skipped cells during computation.

The BUC algorithm works on one dimension at a time, and allows for iceberg pruning. It doesn't, however, have a mechanism for using the overlapping parts of cuboids; it must instead effectively recalculate each cuboid. For especially dense data, BUC performs poorly compared to MultiWay or Star-Cubing.

Star-Cubing has efficiency similar to the MultiWay algorithm, since it avoids doing duplicate calculations. Plus, it has pruning abilities similar to BUC that allow it to work with iceberg cubes. It effectively keeps memory usage low while still avoiding duplicate aggregate computation.

(3.2) The MultiWay algorithm can't be used for computing only iceberg cubes: since it starts at the base cuboid, it doesn't know that a part of the lattice should be pruned until that part has already been computed.

The BUC algorithm does better, in that it can prune iceberg cubes as it goes. It performs best when the data isn't too skewed, and when it can start with the dimension of highest cardinality and move toward dimensions of lower cardinality. With real-world data, this order is the most likely to allow for pruning. However, very skewed data (where a dimension has 3 possible values but 99% of the rows fall into one of them, for example) can take away from this efficient pruning.

Star-Cubing is also good at pruning iceberg cubes as it constructs its cuboids. Before it expands child nodes, it verifies that the node itself meets the iceberg condition. If it does not, then the algorithm won't recurse to its child nodes.

(3.3) The basic idea of high-dimensional OLAP is to create vertical partitions of the dimensions, building smaller cubes (called shell fragments) which contain some extra data (inverted indexes, a.k.a. tid-lists), and then use these shells to efficiently answer queries. For the given example, (a1, ?, ?, *, c5, *, …, *) can be computed by first getting the shell fragments containing A and C: if there is a shell fragment that contains both, then we take the AC cuboid from that shell. If they aren't in the same shell fragment, then we take the tid-list for a1 and the tid-list for c5 and intersect them to use as our base table.

Then, for the two inquired dimensions, we take all possible values and their associated tid-lists (again taking higher cuboids from a fragment that contains both), and again intersect those tid-lists with our base table. Using this new base table, we build up a new data cube (using MultiWay or Star-Cubing, for example). This data cube is effectively a way to explore the two inquired dimensions for any value, along with the locked-down a1 and c5 values.
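A tiny Python sketch of the tid-list intersection step; the tid-lists are made-up placeholders, since in the real method they come from the inverted indexes stored with the shell fragments:

tid_a1 = {1, 2, 4, 5, 8}   # assumed rows having A = a1
tid_c5 = {2, 3, 5, 8, 9}   # assumed rows having C = c5

base_tids = tid_a1 & tid_c5   # rows satisfying both instantiated dimensions
print(sorted(base_tids))      # [2, 5, 8]: the base table for the local cube

# Each inquired dimension is handled the same way: intersect the tid-list of
# every one of its values with base_tids, then build a small cube over the result.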

(Question 4)

(4.1)

Items: A, C, D, E, I, K, M, N, O, U, and Y. Absolute minimum support = 5*0.6 = 3.

Now, using the Apriori algorithm, we go through two phases at each level: building a list of candidate sets, and then keeping the candidates whose counts meet the support threshold as the frequent itemsets.

C1 = {A}:1, {C}:2, {D}:1, {E}:4, {I}:1, {K}:5, {M}:3, {N}:2, {O}:3, {U}:1, {Y}:3
L1 = {E}:4, {K}:5, {M}:3, {O}:3, {Y}:3
C2 = {E,K}:4, {E,M}:2, {E,O}:3, {E,Y}:2, {K,M}:3, {K,O}:3, {K,Y}:3, {M,O}:1, {M,Y}:2, {O,Y}:2
L2 = {E,K}:4, {E,O}:3, {K,M}:3, {K,O}:3, {K,Y}:3
C3 = {E,K,O}:3 (all other 3-item candidates are pruned because they contain an infrequent 2-itemset)
L3 = {E,K,O}:3
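As a quick check of L1 through L3, here is a brute-force Python sketch over the frequent items of each transaction (the same per-transaction lists shown in the FP-growth table below), with minimum support count 3:

from itertools import combinations
from collections import Counter

transactions = [{"K", "E", "O", "M", "Y"},   # T100
                {"K", "E", "O", "Y"},        # T200
                {"K", "E", "M"},             # T300
                {"K", "M", "Y"},             # T400
                {"K", "E", "O"}]             # T500

for size in (1, 2, 3):
    counts = Counter(frozenset(c)
                     for t in transactions
                     for c in combinations(sorted(t), size))
    frequent = {tuple(sorted(s)): n for s, n in counts.items() if n >= 3}
    print("L%d:" % size, frequent)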

Using the FP-growth algorithm, we take the L1 from above and sort it by descending frequency:

L = { {K}:5, {E}:4, {O}:3, {M}:3, {Y}:3 }

We then sort the frequent items of each transaction into this order:

TID     Items bought (sorted)
T100    K, E, O, M, Y
T200    K, E, O, Y
T300    K, E, M
T400    K, M, Y
T500    K, E, O

Header table:

Item    Count
K       5
E       4
O       3
M       3
Y       3

FP-tree (each node shown as item:count):

null{}
  K:5
    E:4
      O:3
        M:1
          Y:1
        Y:1
      M:1
    M:1
      Y:1


Then we build the FP-tree, as shown above. Once we've constructed the tree, we can use it directly to extract our frequent itemsets. We begin with the least-frequent item and build up the list as in this table:

Item   Conditional Pattern Base               Conditional FP-tree   Frequent Patterns
Y      { {K,E,O,M}:1, {K,E,O}:1, {K,M}:1 }    <K:3>                 {K,Y}:3
M      { {K,E,O}:1, {K,E}:1, {K}:1 }          <K:3>                 {K,M}:3
O      { {K,E}:3 }                            <K:3, E:3>            {K,O}:3, {E,O}:3, {K,E,O}:3
E      { {K}:4 }                              <K:4>                 {K,E}:4

This gives us the same patterns, with the same counts, as we got using Apriori, but with only two scans of the dataset (one to count the items, one to build the tree) rather than a scan per candidate level. The tree keeps us from re-counting candidate subsets, making FP-growth much more efficient.
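Here is a short Python sketch of how the conditional pattern bases in the table are read off: for each item, collect the prefix of more-frequent items that precede it in each L-ordered transaction, which is equivalent to walking the FP-tree's prefix paths in this example:

order = ["K", "E", "O", "M", "Y"]            # L, in descending support order
sorted_txns = [["K", "E", "O", "M", "Y"],
               ["K", "E", "O", "Y"],
               ["K", "E", "M"],
               ["K", "M", "Y"],
               ["K", "E", "O"]]

for item in reversed(order[1:]):             # least frequent first: Y, M, O, E
    base = [tuple(t[:t.index(item)]) for t in sorted_txns if item in t]
    print(item, base)
# Y [('K', 'E', 'O', 'M'), ('K', 'E', 'O'), ('K', 'M')]
# M [('K', 'E', 'O'), ('K', 'E'), ('K',)]
# O [('K', 'E'), ('K', 'E'), ('K', 'E')]
# E [('K',), ('K',), ('K',), ('K',)]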

(4.2) There is only one frequent itemset with 3 items, {K,E,O}, so we will use it to generate matches to the given metarule. The support in each case will be 3/5 (60%), since that is how often we see {K,E,O}. The confidence in each case is 3 divided by the support count of the two items on the left-hand side of the rule. This gives us:

buys(X, K) and buys(X, E) -> buys(X, O) [60%, 75%] (does not meet min confidence)
buys(X, K) and buys(X, O) -> buys(X, E) [60%, 100%]
buys(X, E) and buys(X, O) -> buys(X, K) [60%, 100%]

Since the first rule doesn't meet the minimum confidence, only the last two association rules make the final list matching the metarule.
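And a small Python sketch that recomputes the support and confidence figures for the three candidate rules directly from the transactions:

transactions = [{"K", "E", "O", "M", "Y"}, {"K", "E", "O", "Y"},
                {"K", "E", "M"}, {"K", "M", "Y"}, {"K", "E", "O"}]
n = len(transactions)

def sup(items):
    # number of transactions containing every item in 'items'
    return sum(1 for t in transactions if set(items) <= t)

for lhs, rhs in ((("K", "E"), "O"), (("K", "O"), "E"), (("E", "O"), "K")):
    support = sup(lhs + (rhs,)) / n
    confidence = sup(lhs + (rhs,)) / sup(lhs)
    print(lhs, "->", rhs, "[%d%%, %d%%]" % (100 * support, 100 * confidence))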
