31.10.2012 Views

NON-ADDICTIVE SAMPLE BASED FACTS - Deetc



attribute keys and K is the remaining attribute. From an OLAP point of view, $X_1, X_2, \ldots, X_n$ turn out to be the axes of the cube, and each value of attribute K lies at the intersection of those axes. We can express K as a function

$$K = f(X_1, X_2, \ldots, X_n) \qquad (5)$$

If one of the axes is omitted from (5), e.g. $X_n$, we are performing an operation similar to a dimensional reduction

$$K_{new} = f(X_1, X_2, \ldots, X_{n-1}) \qquad (6)$$

However, in (5) the result set does not contain duplicates, because the key set is contained in the projection result. In (6), to guarantee distinctness, K needs to be aggregated for each distinct tuple of the projection, over the set PS defined as $\{(x_1, x_2, x_3, \ldots, x_n) \mid x_n \in \mathrm{dom}(X_n)\}$. Using the sum as the aggregation function, the attribute K is summarised as

$$\sum_{x_n \in \mathrm{dom}(X_n)} f(x_1, x_2, \ldots, x_{n-1}, x_n) \qquad (7)$$
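The aggregation in (7) can be sketched in a few lines of Python. This is only an illustration under assumed data: the cube contents, the attribute values and the helper name `roll_up_last_axis` are hypothetical and do not come from the paper.

```python
from collections import defaultdict

# Hypothetical cube with three key attributes (X1, X2, X3) and fact K.
# Keys are unique, so K = f(X1, X2, X3) has no duplicates, as in (5).
cube = {
    ("mon", "news", "male"): 10,
    ("mon", "news", "female"): 14,
    ("mon", "film", "male"): 7,
    ("tue", "news", "female"): 5,
}

def roll_up_last_axis(facts):
    """Drop the last key attribute and sum K over its domain, as in (7)."""
    reduced = defaultdict(int)
    for key, k in facts.items():
        reduced[key[:-1]] += k  # tuples sharing (x1, ..., x_{n-1}) collapse
    return dict(reduced)

reduced = roll_up_last_axis(cube)
print(reduced)
```

Without the aggregation step, the two tuples that share `("mon", "news")` would appear as duplicates in the reduced result.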

The facts in the OLAP cube do not always point directly to a single attribute. Their definition can be based upon a mathematical operation applied to one or more data elements. These are generally referred to as derived data or derived facts [5]. If the operation defines a ratio³, the application of (7) results in a meaningless value, because ratios are non-additive. The performance indicators cannot be pre-calculated and stored directly in the fact table; they must be implemented as derived facts.
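A small numeric sketch shows why summing a ratio across a dimension is not meaningful. The segment names and counts below are made up for illustration only.

```python
# Hypothetical per-segment counts: (viewers, contacted individuals).
segments = {
    "male": (30, 100),
    "female": (10, 400),
}

# Pre-computed ratio per segment, as if stored directly in the fact table.
ratios = {name: v / n for name, (v, n) in segments.items()}

# Applying the sum aggregation to the stored ratios.
sum_of_ratios = sum(ratios.values())

# Derived fact: aggregate the additive components first, then divide.
total_viewers = sum(v for v, _ in segments.values())
total_contacts = sum(n for _, n in segments.values())
ratio_of_sums = total_viewers / total_contacts

print(sum_of_ratios, ratio_of_sums)
```

The two results differ, which is why the ratio has to be computed at query time from its additive components rather than stored and summed.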

To calculate the performance indicators for the domain properly, e.g. rat, a corrective weight must be used, as can be seen in (4). Since a query’s targets are user dependent, determined at runtime by the users’ restrictions over the Socio-Demographic dimension, one can’t pre-determine the right weight to use. It’s necessary to store all the possible weights, using (3) to calculate them. For a cube with a 4-attribute key, we are dealing with a theoretical value of 15 possible combinations, $\sum_{k=1}^{4} \binom{4}{k} = 15$. If one decided to store all the possible weights as facts, 14 unnecessary values would be stored for each tuple of the fact table, as only one is valid for each query. Besides, the number of possible targets increases this number.
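The count of 15 can be checked by enumerating every non-empty subset of four attributes. The attribute names below are placeholders chosen for the example, not the cube's actual keys.

```python
from itertools import combinations
from math import comb

# Hypothetical names for the four key attributes.
attributes = ["sex", "age", "region", "class"]
n = len(attributes)

# Sum of binomial coefficients C(4, k) for k = 1..4.
count = sum(comb(n, k) for k in range(1, n + 1))

# Explicit enumeration of the attribute combinations a query may restrict.
subsets = [c for k in range(1, n + 1) for c in combinations(attributes, k)]

print(count, len(subsets))
```

The same number follows from $2^n - 1$, so the count of stored weights per tuple would grow exponentially with the number of key attributes.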

This approach is also not feasible because, for some indicators, it is necessary to store the weights for non-viewers, that is, individuals who didn’t watch television during the analysis period. Fact tables store only one type of occurrence, in this case the fact that some individual watched television; it’s not good practice to store the opposite fact too. One must use a more straightforward approach, using some domain knowledge.

³ Ratios are not the only type of non-additive facts.

The data are quota sample-based, which means the sample was designed to be a representative subset of the population with regard to some descriptive characteristics. In this sense, it’s impossible for an individual to hold two or more values of an attribute simultaneously. For example, an individual who is present in the target males with ages ranging from 4 to 14 cannot be part of another target, females with ages ranging from 4 to 14.

Property 1. Let $T_a$ be a target with a restriction over a socio-demographic attribute $a$. For distinct values $v_1 \neq v_2$,

$$T_{a=v_1} \cap T_{a=v_2} = \emptyset$$

If the weights are normalised, which can be done during the ETL⁴ process, then the weights of any two or more disjoint subsets that together cover the sample sum to one.

Property 2. Let $T_b$ be a target with a restriction over a two-valued socio-demographic attribute $b$. Then

$$\sum_{i \in T_{b=0}} PW(i) + \sum_{i \in T_{b=1}} PW(i) = 1$$

is always true.
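Both properties can be checked on a toy sample. This is a sketch under assumptions: the individuals, the raw weights and the normalisation by the overall total are illustrative stand-ins, since the paper's weight definition (3) is not reproduced in this section.

```python
# Hypothetical sample: individual id -> (sex, raw weight).
sample = {
    1: ("male", 120.0),
    2: ("female", 80.0),
    3: ("male", 50.0),
    4: ("female", 150.0),
}

# Normalise so that the weights of the whole sample sum to one
# (assumed to happen during ETL).
total = sum(w for _, w in sample.values())
normalised = {i: w / total for i, (_, w) in sample.items()}

def target(value):
    """Individuals whose two-valued attribute equals `value`."""
    return {i for i, (sex, _) in sample.items() if sex == value}

males, females = target("male"), target("female")

# Property 1: targets on different values of one attribute are disjoint.
overlap = males & females

# Property 2: the weights of the two complementary targets sum to one.
weight_sum = sum(normalised[i] for i in males) + sum(normalised[i] for i in females)

print(overlap, weight_sum)
```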

Knowing that, it’s possible to create a fact table that stores the weights for all possible targets, including the non-viewers. That fact table shares the Date and Socio-Demographic dimensions. Each tuple in the fact table represents the reference value for a specific combination of socio-demographic values, for a given day, obtained by applying (3) to each distinct combination. Figure 2 illustrates the Contact star-schema.

Figure 2: Illustration of the Contact star-schema model.
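One way to picture the Contact fact table is as one cell per day and socio-demographic combination, from which any target's weight is recovered by summing the disjoint cells it covers (Property 1). The sketch below is a hypothetical illustration: the rows, the attribute names and the function `target_weight` are assumptions, and a plain sum of normalised weights stands in for (3), which is not shown in this section.

```python
from collections import defaultdict

# Hypothetical contacted individuals: (date, sex, age_band, normalised weight).
contacts = [
    ("2012-10-01", "male", "4-14", 0.10),
    ("2012-10-01", "male", "15-24", 0.30),
    ("2012-10-01", "female", "4-14", 0.25),
    ("2012-10-01", "female", "15-24", 0.35),
]

# One fact row per (date, socio-demographic combination).
contact_fact = defaultdict(float)
for date, sex, age_band, weight in contacts:
    contact_fact[(date, sex, age_band)] += weight

def target_weight(date, sex=None, age_band=None):
    """Sum the cells matching the restrictions; None means unrestricted."""
    return sum(
        w
        for (d, s, a), w in contact_fact.items()
        if d == date and sex in (None, s) and age_band in (None, a)
    )

males = target_weight("2012-10-01", sex="male")
children = target_weight("2012-10-01", age_band="4-14")
everyone = target_weight("2012-10-01")
print(males, children, everyone)
```

Because the cells are disjoint, the weight for a target chosen at runtime comes from a simple sum, with no need to store one value per possible attribute combination in every tuple.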

The former model provides the weights for all the contacted individuals, viewers or not, for one day. To calculate the performance indicators for the domain, e.g. rating, it’s also necessary to determine the weights of the viewers. To address this issue, it’s necessary to create another fact table that stores the viewers’ daily corrective weight. Since this table is indexed by all of the dimensions discussed so far, Date, Time, Program and Socio-Demographic, a tuple represents

⁴ Extraction, Transformation and Loading
