NON-ADDICTIVE SAMPLE BASED FACTS - Deetc

attribute keys and K is the remaining attribute. From an OLAP point of view, $X_1, X_2, \ldots, X_n$ become the axes of the cube, and each value of attribute K lies at the intersection of those axes. We can express K as a function

$$K = f(X_1, X_2, \ldots, X_n) \qquad (5)$$

If one of the axes, e.g. $X_n$, is omitted from (5), we perform an operation similar to a dimensional reduction:

$$K_{new} = f(X_1, X_2, \ldots, X_{n-1}) \qquad (6)$$

In (5) the result set contains no duplicates, because the key set is contained in the projection result. In (6), to guarantee distinct tuples, K must be aggregated for each distinct tuple of the projection set PS, defined as

$$PS = \{(x_1, x_2, x_3, \ldots, x_n) \mid x_n \in \mathrm{dom}(X_n)\}$$

Using the sum as the aggregation function, attribute K is summarised as

$$\sum_{x_n \in \mathrm{dom}(X_n)} f(x_1, x_2, \ldots, x_{n-1}, x_n) \qquad (7)$$

The facts in the OLAP cube do not always point directly to a single attribute. Their definition can be based on a mathematical operation applied to one or more data elements. These are generally referred to as derived data or derived facts [5]. If the operation defines a ratio³, applying (7) yields a meaningless value, because ratios are non-additive. The performance indicators therefore cannot be pre-calculated and stored directly in the fact table; they must be implemented as derived facts.

To calculate the performance indicators for the domain properly, e.g. rat, a corrective weight must be used, as can be seen in (4). Since the targets of a query are user-dependent, determined at run time by the users' restrictions over the Socio-Demographic dimension, the right weight cannot be determined in advance. It is necessary to store all the possible weights, using (3) to calculate them. For a cube with a 4-attribute key, there are theoretically 15 possible combinations, $\sum_{k=1}^{4} \binom{4}{k} = 15$. If one decided to store all the possible weights as facts, each tuple of the fact table would carry 14 unnecessary values, as only one is valid for each query.
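The non-additivity argument and the count of key combinations can be checked with a small sketch. The cell values, the channel and sex labels, and the helper name `rollup_sum` are hypothetical and serve only to illustrate the sum aggregation in (7); they are not taken from the original data.

```python
from collections import defaultdict
from math import comb

# Hypothetical cube cells: (x1, x2) -> (numerator, denominator) of a ratio fact.
cells = {
    ("ch1", "male"): (30.0, 100.0),
    ("ch1", "female"): (10.0, 50.0),
}

def rollup_sum(values):
    """Drop the last key attribute and sum K over its domain, as in (7)."""
    out = defaultdict(float)
    for key, k in values.items():
        out[key[:-1]] += k
    return dict(out)

# Summing pre-computed ratios: 0.30 + 0.20, which is not a meaningful ratio.
ratios = {key: n / d for key, (n, d) in cells.items()}
summed_ratio = rollup_sum(ratios)[("ch1",)]

# Derived fact: roll up the additive components first, then divide.
numerator = rollup_sum({key: n for key, (n, _) in cells.items()})[("ch1",)]
denominator = rollup_sum({key: d for key, (_, d) in cells.items()})[("ch1",)]
derived_ratio = numerator / denominator  # 40 / 150

# Number of non-empty subsets of a 4-attribute key.
key_subsets = sum(comb(4, k) for k in range(1, 5))

print(summed_ratio, derived_ratio, key_subsets)
```

The two results differ, which is why the ratio has to be computed after the roll-up rather than stored and summed.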
The number of possible targets increases this number further. This approach is also infeasible because, for some indicators, it is necessary to store the weights for non-viewers, that is, individuals who did not watch television during the analysis period. Fact tables store only one type of occurrence, in this case the fact that some individual watched television; it is not good practice to store the opposite fact too. One must use a more straightforward approach, using some domain knowledge.

³ Ratios are not the only type of non-additive facts.

The data are quota sample-based, which means the sample was designed to be a representative subset of the population with respect to some descriptive characteristics. In this sense, it is impossible for an individual to hold two or more values of the same attribute simultaneously. For example, an individual in the target "males aged 4 to 14" cannot be part of another target, "females aged 4 to 14".

Property 1. Let $T_a$ be a target with a restriction over a socio-demographic attribute a. For any two distinct values $v_1 \neq v_2$, $T_{a=v_1} \cap T_{a=v_2} = \emptyset$.

If the weights are normalised, which can be done during the ETL⁴ process, the weights of disjoint subsets that together cover all values of an attribute sum to one.

Property 2. Let $T_b$ be a target with a restriction over a two-valued socio-demographic attribute b. Then $\sum_{i \in T_{b=0}} PW(i) + \sum_{i \in T_{b=1}} PW(i) = 1$ is always true.

Knowing that, it is possible to create a fact table that stores the weights for all possible targets, including non-viewers. That fact table shares the Date and Socio-Demographic dimensions. Each tuple in the fact table represents the reference value for a specific combination of socio-demographic values for a given day, obtained by applying (3) to each distinct combination. Figure 2 illustrates the Contact star schema.

Figure 2: Illustration of the contact star-schema model.

This model provides the weights for all the contacted individuals, viewers or not, for one day.
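Properties 1 and 2 can be illustrated with a toy panel. This is a minimal sketch under stated assumptions: the individuals, their raw weights and the normalisation by the panel total are hypothetical stand-ins, since the actual weight formula (3) is not reproduced in this section.

```python
# Hypothetical panel: individual id -> (sex, raw weight).
panel = {
    1: ("male", 120.0),
    2: ("female", 80.0),
    3: ("male", 50.0),
    4: ("female", 150.0),
}

# Assumed normalisation: divide each raw weight by the panel total.
total = sum(weight for _, weight in panel.values())
normalised = {i: weight / total for i, (_, weight) in panel.items()}

def target(value):
    """Individuals whose two-valued attribute 'sex' equals the given value."""
    return {i for i, (sex, _) in panel.items() if sex == value}

males, females = target("male"), target("female")

# Property 1: targets built on distinct values of one attribute are disjoint.
disjoint = males.isdisjoint(females)

# Property 2: the normalised weights of the two targets sum to one.
weight_males = sum(normalised[i] for i in males)
weight_females = sum(normalised[i] for i in females)

print(disjoint, weight_males + weight_females)
```

Because the two targets partition the panel, the weight of one can always be recovered from the other, which is what makes a single reference table for all contacted individuals sufficient.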
To calculate the performance indicators for the domain, e.g. rating, it is also necessary to determine the weights of the viewers. To address this, another fact table is created that stores the viewers' daily corrective weight. Since this table is indexed by all of the dimensions discussed so far (Date, Time, Program and Socio-Demographic), a tuple represents a contact of a set of viewers sharing equal socio-demographic values, for one minute, a particular program/channel and a specific day. From (4), it is necessary to store the weights and the number of minutes of the interval. Figure 3 illustrates the audience star schema, the main portion of the overall model.

⁴ Extraction, Transformation and Loading.

Figure 3: Illustration of the audience star-schema model.

3.3 Optimization of the model

The model is not optimized. The theoretical number of tuples per day is approximately 9.7 million (1,440 minutes × 6,720 socio-demographic combinations = 9,676,800). Even for the more realistic 1/3 of that value, the table grows by 40 million tuples per month. Any optimization of the model must be weighed in terms of cost vs. benefit, as it requires creating auxiliary aggregate models based on requirements.

The first aggregate takes advantage of the typical targets used in TV analysis. Not all of the socio-demographic combinations are interesting, so only 8 targets were considered here: Universe, all of the viewers; Class AB, the wealthiest viewers; 4:14, young children and teenagers; Housewife, viewers who are responsible for acquiring essential products for the house; Adults, viewers above 14 years old; ABC1 25:34, young active working viewers from the wealthiest social classes, aged 25 to 34; ABC1 15:34, young viewers from the wealthiest social classes, aged 15 to 34; ABC 25:54, active working viewers from all social strata except the lower classes. This aggregate represents a 70% reduction of the tuples needed, compared with the model illustrated in Figure 3. A new dimension must be created to accommodate these targets. The aggregate model shares the Date dimension with the others. The fact table stores the sum of the daily weight for the target.

Another possible optimization is to aggregate by time periods. Several reports calculate the indicators for specific periods, e.g. prime time.
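As a rough illustration of the target aggregate described above, the sketch below sums daily weights per (date, target). The row layout, the social-class codes (A, B, C1, C2, D), the example date and the predicate for each target are assumptions made for the example; the paper only names the targets and does not give their exact definitions in this section.

```python
from collections import defaultdict

# Hypothetical rows: (date, social_class, age, is_housewife, daily_weight).
rows = [
    ("2005-01-01", "A", 30, False, 0.10),
    ("2005-01-01", "C1", 10, False, 0.05),
    ("2005-01-01", "D", 45, True, 0.20),
    ("2005-01-01", "B", 16, False, 0.15),
]

ABC1 = {"A", "B", "C1"}
ABC = {"A", "B", "C1", "C2"}  # assumed reading of "all strata except lower classes"

# Assumed predicates for the 8 targets named in the text.
TARGETS = {
    "Universe": lambda cls, age, hw: True,
    "Class AB": lambda cls, age, hw: cls in {"A", "B"},
    "4:14": lambda cls, age, hw: 4 <= age <= 14,
    "Housewife": lambda cls, age, hw: hw,
    "Adults": lambda cls, age, hw: age > 14,
    "ABC1 25:34": lambda cls, age, hw: cls in ABC1 and 25 <= age <= 34,
    "ABC1 15:34": lambda cls, age, hw: cls in ABC1 and 15 <= age <= 34,
    "ABC 25:54": lambda cls, age, hw: cls in ABC and 25 <= age <= 54,
}

def build_target_aggregate(fact_rows):
    """Sum the daily weight of every row into each target it belongs to."""
    aggregate = defaultdict(float)
    for date, cls, age, hw, weight in fact_rows:
        for name, belongs in TARGETS.items():
            if belongs(cls, age, hw):
                aggregate[(date, name)] += weight
    return dict(aggregate)

target_facts = build_target_aggregate(rows)
for key in sorted(target_facts):
    print(key, round(target_facts[key], 2))
```

Targets overlap (every row belongs to Universe, for instance), so the aggregate is keyed by target rather than derived by partitioning the rows.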
In this model, the Time dimension is replaced by a new dimension, TimePeriod, a projection of the former with a single attribute, Period, which takes 8 values. Consequently, the Program dimension is incompatible with the granularity of the new aggregate. From Program, another new dimension is derived, Channel, with only one attribute, channel. This represents a reduction by a factor of 180 in the first dimension (1,440 minutes collapse into 8 periods) and of 60% in the second. The remaining dimensions of the main model (Figure 3) are compatible. The fact table stores the sum of the daily weight for a specific channel, period, date and socio-demographic combination, together with the time interval in minutes. Figure 4 illustrates the overall model. The darker rectangles represent dimensions; the remaining ones are the fact tables.

Figure 4: Illustration of the overall model. Bounded tables are aggregate-specific and exist only to increase the model's performance.

3.4 Evaluation

The evaluation of the model was based not on performance but on flexibility and simplicity. One of the main goals was to prove that generic OLAP tools can fulfil the needs of the audience analysis domain, with the same degree of freedom and leading to the same results. A series of reports was produced using Telereport [13], and the results were kept as reference values. Those reports are expected to be representative of the daily needs of an audience analyst. For lack of space, we address neither the ETL process nor its evaluation. Suffice it to say that the OLAP cubes were created and populated with one year of data, using SQL Server Data Transformation Services [12] for the ETL and SQL Server Analysis Services [11] as the multidimensional engine. For each report, an MDX query was developed to mimic it and then executed against our data. Every execution confirmed the expected reference values, demonstrating that, for the tested reports, the model is adequate. The benefit of the aggregates was tested not in terms of performance but in terms of simplicity.
Listing 1 illustrates the necessary