<strong>NON</strong>-<strong>ADDITIVE</strong> <strong>SAMPLE</strong> <strong>BASED</strong> <strong>FACTS</strong><br />

A dimensional model applied to audience analysis<br />

Nuno Datia, Helder Pita<br />

Instituto Superior de Engenharia de Lisboa (ISEL),<br />

Departamento de Engenharia Electrónica e Telecomunicações e de Computadores (DEETC), Lisboa, Portugal<br />

datia@isel.ipl.pt,hp@isel.ipl.pt<br />

João Moura-Pires<br />

Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa (FCT-UNL),<br />

Departamento de Informática, Monte da Caparica, Portugal<br />

jmp@di.fct.unl.pt<br />

Keywords: Data warehouse, data mart, dimension modelling, non-additive facts, online analytical processing, people<br />

meter data, decision support.<br />

Abstract: In every Online Analytical Processing (OLAP) cube there are certain types of measures, generally referred to as facts, that must be treated carefully because of their non-additive nature: they cannot be added directly to present a summarised result. This paper focuses on a specific type of non-additive facts, those based on quota samples. These are generally ratios that need to be normalised against a reference value calculated from a subset of the quota sample, a process that guarantees the representativeness of the measure. However, the reference value is not static, as it changes with the chosen subset. This subset is user dependent, resulting from the restrictions the user imposes on the data through the analysis interface. Using the audience analysis domain, where almost all performance indicators are non-additive, this paper discusses a specific OLAP-oriented dimensional model capable of addressing the singularities of non-additive facts. The model is based on a star schema, which addresses both efficiency and simplicity, and is targeted at the audience analysis of generic TV programs.<br />


The data warehouse, along with the operational system, is today the foundation of an organisation's data centre. Whereas the operational system seeks to hold accurate, current data, the data warehouse has a much broader job: to hold a series of snapshots of data over time. The subject is sufficiently mature that the most valuable work can be summarised in just two schools of thought: the bottom-up approach proposed by Kimball [9] and the top-down approach proposed by Inmon [6]. Despite their divergences, both agree that the operational system and data warehouse worlds are different. Those differences lead to a change in both the development methodology and the model used to store the data. The key characteristics of a data warehouse are [6]: (i) subject-oriented; (ii) integrated; (iii) time-variant; and (iv) non-volatile.<br />

Dimension modelling consists of a simple, performance-oriented relational model capable of storing the happenings of the business. It uses two types of tables:<br />

• Fact tables, where the numerical performance<br />

measurements of the business are stored;<br />

• Dimension tables, which store the textual attributes that give context to the facts.<br />

The fact table expresses the many-to-many relationships between dimensions. The primary key of the fact table is, generally, the set of foreign keys to the dimensions. Dimensions thus give context to the measures stored in the fact table.<br />

From a conceptual point of view, a dimension model can be represented as an N-dimensional sparse space, where the axes represent dimensions and the intersections of those axes represent measures. This space is normally referred to as an Online Analytical Processing cube [2]. The cube is manipulated through a series of operations [7] (slice, pivot, drill down and drill up) and an aggregation function that summarises facts, whose nature dictates the set of functions that can be used. Facts can be [9]:<br />

• Additive, can be summed up through all of the<br />

dimensions;<br />

• Semi-additive, can be summed up only for a subset<br />

of the dimensions;<br />

• Non-additive, cannot be summed for any of the<br />

dimensions.<br />

Non-additivity is not directly related to the measure being numerical; it is related to the usefulness (and accuracy) of the summed value, which may simply make no sense to the business. Ratios are an example of a non-additive fact.<br />
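To make the point about ratios concrete, the following is a minimal sketch with made-up numbers (they are not taken from the people meter data). It contrasts the sum of two ratios with the ratio recomputed from its additive components.

```python
# Two hypothetical groups, each given as (numerator, denominator).
# The values are illustrative only.
groups = [(20.0, 100.0), (90.0, 300.0)]

# Each group's ratio on its own.
ratios = [num / den for num, den in groups]

# Summing the ratios produces a number with no business meaning.
summed_ratios = sum(ratios)

# A meaningful summary recomputes the ratio from the additive parts.
total_num = sum(num for num, _ in groups)
total_den = sum(den for _, den in groups)
ratio_of_sums = total_num / total_den

print(summed_ratios, ratio_of_sums)
```

The numerators and denominators are additive, while the ratio itself is not. This is the property the model discussed later relies on.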

The audience analysis domain shares with OLAP and data warehousing all of the characteristics listed above. The volumes of data are also similar in size. However, the audience analysis world tends to be closed and proprietary, and its applications do not directly apply generic OLAP techniques. Programs like [10] and [13] use pivot-table-like interfaces, typical in OLAP, but do not use more of the OLAP reporting facilities. The latter, [13], is developed by one of the leading software houses in audience analysis; it uses a proprietary text file format to store information, not an OLAP engine. In this paper, the authors evaluate the feasibility of applying general OLAP techniques to the audience analysis domain. The purpose is to determine whether it is possible to use general, non-proprietary solutions to achieve the same degree of freedom in audience analysis as the tools mentioned above. In particular, we discuss a dimension model that addresses the singularity of the audience analysis performance indicators, almost all of them quota sample based and non-additive. Some of the values needed to compute these indicators, mainly related to representativeness, are only known at runtime, after the user has finished setting the restrictions of the analysis.<br />

We could not find any related work on dimension models for quota sample based facts. The main data warehouse and OLAP literature [6, 9] does not address this kind of fact. Moreover, being a relatively closed world, audience analysis is not an easy domain in which to survey the state of the art [14]. The few publicly available articles relate to data mining and audience pattern analysis, not to the application of data warehouse and OLAP techniques.<br />

This paper is organised as follows: section 2 describes the data, the requirements and some performance indicators used in audience analysis; section 3 presents the dimensional model, using Kimball's approach, for a generic TV content analysis data mart; finally, in section 4 the authors draw some conclusions from their work.<br />


2.1 The Data<br />

The meter system seldom produces data for the entire viewers' panel, since several practical problems may occur, ranging from meter misuse to data communication problems, which endanger the desired panel representativeness. To adjust the panel representativeness, a weighting procedure is used that attempts to correct undesirable non-representative tendencies in the data. This is the Rim Weighting algorithm [4] (also known as Iterative Proportional Fitting [1]), which provides a daily weight for each viewer in an attempt to recover the representativeness of the panel sample.<br />
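As an illustration of the idea behind Iterative Proportional Fitting, here is a minimal sketch that rakes individual weights towards target shares on two attributes. It is not the procedure used by the panel operator: the attribute names, the target shares, the starting weights of 1.0 and the fixed iteration count are all assumptions made for this example.

```python
def rim_weights(panel, targets, iterations=50):
    """Rake one weight per panellist towards the target shares.

    panel:   list of dicts mapping attribute name -> category.
    targets: attribute name -> {category: desired share of total weight}.
    """
    weights = [1.0] * len(panel)
    for _ in range(iterations):
        for attr, shares in targets.items():
            total = sum(weights)
            # Current weighted total of each category of this attribute.
            current = {}
            for person, w in zip(panel, weights):
                current[person[attr]] = current.get(person[attr], 0.0) + w
            # Rescale so that each category reaches its target share.
            weights = [
                w * shares[person[attr]] * total / current[person[attr]]
                for person, w in zip(panel, weights)
            ]
    return weights


def share(panel, weights, attr, category):
    """Weighted share of one category after raking."""
    total = sum(weights)
    return sum(w for p, w in zip(panel, weights) if p[attr] == category) / total


# Hypothetical five-person panel and hypothetical population shares.
panel = [
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
]
targets = {"sex": {"M": 0.5, "F": 0.5}, "age": {"young": 0.4, "old": 0.6}}
weights = rim_weights(panel, targets)
```

Each pass adjusts one attribute at a time, which slightly disturbs the previous one; repeating the passes brings all margins close to their targets when every category is present in the panel.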

People meter data have three distinct components: (i) socio-demographic information; (ii) TV content data; (iii) viewing data. The socio-demographic information, which includes characteristics such as age, occupation and social class, is important for determining the panel representativeness, but may also be used for other purposes (mostly advertising). TV content is characterized by its type, duration and corresponding TV channel. Finally, meter data measure the viewing behavior of each viewer, represented by a sequence of watch/non-watch indicators for each second of the day, as illustrated by figure 1. These indicators are registered independently for each channel.<br />

Figure 1: Representation of a daily viewing pattern for an<br />

individual.<br />

2.2 Requirements<br />

The requirements were gathered through the analysis of several audience reports and from on-site, day-to-day usage of the audience analysis program Telereport [13]. The reports are sufficiently broad to cover a range of heterogeneous users, from television to advertising. They cover the most important aspects of general TV content performance analysis. However, they do not address the singularities of advertising spots. Audience analysts are mainly interested in determining the audiences by channel and/or program for a set of targets and time periods; the analysis is never carried out on individual viewers. Viewers are grouped into targets, which closely relate to socio-demographic information. For example, a typical target is AB, consisting of high-income viewers. Time periods range from minute-by-minute analysis to standard audience periods of two or more hours. Periods longer than a day must be treated carefully, as the corrective weights change daily. More details can be found in [3].<br />

2.3 Performance Indicators<br />

Audience analysis defines more than a dozen performance indicators [3]. However, on a regular basis, audience analysts tend to use only a few. For demonstration purposes, this paper uses only one: TV rating (rat).<br />

TV rating gives the percentage of the population that, for a given period, watched a program/channel. It gives the probability of an individual from the population becoming a viewer. Let P be the daily viewers' panel and W the set of corrective weights. The rating for a single minute can be calculated as<br />

$$\mathit{rat}_{minute} = \sum_{i=0}^{|V|} V_i \ast W(i), \qquad V \subset P \tag{1}$$<br />

where V is the subset of individuals that watch television on a specific channel. (1) can be extended to accommodate a set of minutes M<br />

$$\mathit{rat}_{period} = \frac{\sum_{m \in M} \mathit{rat}_{minute}(m)}{|M|} \tag{2}$$<br />

However, (2) does not fully address the representativeness for a target. By definition, most of the performance indicators relate to a reference value and can be displayed as percentages. When one says that program X got a 20% rating, that is the percentage of the viewers from the panel that share the desired socio-demographic values and happened to be watching TV during the analysis period. Let $T_{sd}$ be the subset of the panel that shares the desired socio-demographic values sd. One can determine the sum of the corrective weights $TW$ for the target as follows<br />

$$TW = \sum_{T_{sd} \in P} W(i) \tag{3}$$<br />

Normalising (2) using (3) gives the representative value for the desired target<br />

$$\mathit{rat} = \frac{\mathit{rat}_{period}}{TW} \tag{4}$$<br />

The rating calculated from (4) is a valid measure, not only for the panel's viewers, but also for the entire population of TV viewers.<br />
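Equations (1) to (4) can be sketched in a few lines of code. The data layout below (a dictionary of viewers per minute and a dictionary of daily weights) and all identifiers are assumptions made for illustration, as is restricting the per-minute sum to the members of the target; the sketch covers a single day and a single channel.

```python
def rating(viewing, weights, target, minutes):
    """Rating of one channel for one target over a set of minutes.

    viewing: minute -> set of panellist ids watching the channel.
    weights: panellist id -> daily corrective weight W(i).
    target:  set of panellist ids sharing the desired values (T_sd).
    minutes: the analysis period M, within a single day.
    """
    # Equation (1), evaluated per minute for the viewers in the target.
    per_minute = [
        sum(weights[i] for i in viewing.get(m, set()) & target)
        for m in minutes
    ]
    # Equation (2): average over the period.
    rat_period = sum(per_minute) / len(minutes)
    # Equation (3): reference value for the target.
    tw = sum(weights[i] for i in target)
    # Equation (4): normalise against the reference value.
    return rat_period / tw


# Hypothetical panel of three individuals and two minutes of viewing.
weights = {"a": 1.0, "b": 2.0, "c": 1.0}
target = {"a", "b"}
viewing = {0: {"a"}, 1: {"a", "b", "c"}}
result = rating(viewing, weights, target, [0, 1])
```

Here the target's weighted audience is 1.0 in the first minute and 3.0 in the second, the average is 2.0, and the target's reference value is 3.0, so the rating is two thirds.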

3 THE MODEL<br />

3.1 Dimensions<br />

From the requirements, each performance indicator is given context by: (i) Time, detailed to the minute; (ii) Date, giving access to the daily corrective weight; (iii) Program, the object of the performance analysis; and (iv) Socio-demographic information, enabling target analysis. Date is a regular dimension, whose characteristics and typical attributes are described in [9] and [6]. The Time dimension represents the daily regularity, but also typical time periods and a domain-dependent frame of reference. The periods considered are the TV standard<sup>1</sup>: "02h30-08h00", "08h00-12h00", "12h00-14h30", "14h30-18h30", "18h30-20h00", "20h00-23h00", "23h00-24h30", "24h30-26h30".<br />
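The standard periods run past midnight (up to 26h30), so a clock time must be shifted into the TV day before it can be assigned to a period. The sketch below shows one way to do this; treating the intervals as half-open and mapping clock times before 02h30 to the previous TV day are assumptions of this example, and the function names are hypothetical.

```python
PERIODS = [
    "02h30-08h00", "08h00-12h00", "12h00-14h30", "14h30-18h30",
    "18h30-20h00", "20h00-23h00", "23h00-24h30", "24h30-26h30",
]


def to_minutes(label):
    """Convert a label such as '24h30' into minutes since 00h00."""
    hours, minutes = label.split("h")
    return int(hours) * 60 + int(minutes)


def period_of(hour, minute):
    """Return the standard period containing a wall-clock time."""
    total = hour * 60 + minute
    # Assumption: times before 02h30 belong to the previous TV day.
    if total < to_minutes("02h30"):
        total += 24 * 60
    for label in PERIODS:
        start, end = label.split("-")
        # Assumption: each interval includes its start, excludes its end.
        if to_minutes(start) <= total < to_minutes(end):
            return label
    raise ValueError("time outside the TV day")
```

For example, `period_of(21, 15)` falls in "20h00-23h00", while `period_of(1, 0)` is read as 25h00 and falls in "24h30-26h30".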

The Socio-demographic dimension holds some of the most used demographic information, such as Gender, Region, Social Class, Occupation and Housewife status. Age information is generally divided into 5 to 7 intervals. The authors decided to use the seven-interval division: "[4-14]", "[15-24]", "[25-34]", "[35-44]", "[45-54]", "[55-64]", ">64". The dimension is populated with all the possible combinations of attribute values. For the data used<sup>2</sup>, there are 6720 combinations.<br />

The Program dimension holds all the information related to a TV show: its name, description, type, classification and duration. TV shows are classified with a taxonomy of at most three levels. The first level indicates the topmost classification, e.g. Fiction. The second level classifies the show inside the first level, e.g. Fiction-Film. Finally, the last level gives the most detailed classification for the show, e.g. Fiction-Film-Comedy. Notice that not all shows have a second or third level of detail. Classification is an embedded hierarchy inside the Program dimension [9]. However, it is not very deep and its depth is known; it can be modelled simply with three attributes, one for each level, with no loss of generality.<br />
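Flattening the classification into one attribute per level can be sketched as follows. The hyphen-separated input format follows the examples in the text; the function name and the use of `None` for a missing second or third level are assumptions of this sketch.

```python
def classification_levels(path, depth=3):
    """Split 'Fiction-Film-Comedy' into one attribute per level.

    Levels that a show does not have are padded with None.
    """
    parts = path.split("-")
    if len(parts) > depth:
        raise ValueError("classification deeper than the modelled hierarchy")
    return tuple(parts + [None] * (depth - len(parts)))


full = classification_levels("Fiction-Film-Comedy")
partial = classification_levels("Fiction")
```

Because the depth is fixed and small, three plain columns are enough and no recursive parent-child structure is required.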

With these dimensions, the granularity of the fact table was fixed at the minute. Note that such a model does not address the needs of advertising spots, which are outside the scope of this work.<br />

<sup>1</sup> For Portuguese TV audience analysis, as of 2006.<br />

<sup>2</sup> From 2001.<br />

3.2 Facts<br />

From a relational point of view, an OLAP cube is the projection of a relation R, where $X_1, X_2, \ldots, X_n$ are the key attributes and K is the remaining attribute. From an OLAP point of view, $X_1, X_2, \ldots, X_n$ become the axes of the cube and each value of attribute K lies at the intersection of those axes. We can express K as a function<br />

$$K : f(X_1, X_2, \ldots, X_n) \tag{5}$$<br />

If one of the axes is omitted from (5), e.g. $X_n$, we are performing an operation similar to a dimensional reduction<br />

$$K_{new} : f(X_1, X_2, \ldots, X_{n-1}) \tag{6}$$<br />

However, in (5) the result set does not contain duplicates, because the key set is contained in the projection result. In (6), to guarantee distinctness, K needs to be aggregated for each distinct tuple of the projection set PS, defined as $\{(x_1, x_2, x_3, \ldots, x_n) \mid x_n \in dom(X_n)\}$. Using the sum as the aggregation function, the attribute K is summarised as<br />

$$\sum_{x_n \in dom(X_n)} f(x_1, x_2, \ldots, x_{n-1}, x_n) \tag{7}$$<br />
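The dimensional reduction in (6) with the sum of (7) can be sketched over a cube held as a dictionary keyed by tuples. The sample keys and values are hypothetical, and the measure is assumed to be additive.

```python
from collections import defaultdict


def drop_last_axis(cube):
    """Remove the last axis of a cube, summing the measure over it.

    cube: dict mapping key tuples (x1, ..., xn) to an additive measure K.
    Returns a dict keyed by (x1, ..., x_{n-1}), as in equation (7).
    """
    reduced = defaultdict(float)
    for key, value in cube.items():
        reduced[key[:-1]] += value
    return dict(reduced)


# Hypothetical cube over (date, channel, social class).
cube = {
    ("2001-03-01", "ch1", "AB"): 2.0,
    ("2001-03-01", "ch1", "C1"): 3.0,
    ("2001-03-01", "ch2", "AB"): 1.0,
}
by_channel = drop_last_axis(cube)
```

The result is only meaningful when K is additive over the dropped axis.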

The facts in an OLAP cube do not always point directly to a single attribute. Their definition can be based on a mathematical operation applied to one or more data elements. These are generally referred to as derived data or derived facts [5]. If the operation defines a ratio<sup>3</sup>, the application of (7) results in a meaningless value, because ratios are non-additive. The performance indicators cannot be pre-calculated and stored directly in the fact table; they must be implemented as derived facts.<br />

To properly calculate the performance indicators for the domain, e.g. rat, a corrective weight must be used, as can be seen in (4). Since the targets of a query are user dependent, determined at runtime by the user's restrictions over the Socio-demographic dimension, one cannot pre-determine the right weight to use. It is necessary to store all the possible weights, using (3) to calculate them. For a cube with a 4-attribute key, we are dealing with a theoretical value of 15 possible combinations, $\sum_{k=1}^{4} \binom{4}{k}$. If one decided to store all the possible weights as facts, 14 unnecessary values would be stored for each tuple of the fact table, as only one is valid for each query. Besides, the number of possible targets increases this number.<br />
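The figure of 15 combinations is the number of non-empty subsets of the four key attributes. A short check, using the four dimension names from this paper as labels:

```python
from itertools import combinations
from math import comb

keys = ["Date", "Time", "Program", "SocioDemographic"]

# Every non-empty subset of the key attributes.
subsets = [
    subset
    for k in range(1, len(keys) + 1)
    for subset in combinations(keys, k)
]

# The same count from the binomial sum: C(4,1) + C(4,2) + C(4,3) + C(4,4).
count = sum(comb(len(keys), k) for k in range(1, len(keys) + 1))
```

Both approaches give 4 + 6 + 4 + 1 = 15, which equals 2^4 - 1.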

This approach is also not feasible because, for some indicators, it is necessary to store the weights for non-viewers, that is, individuals that did not watch television during the analysis period. Fact tables store only one type of occurrence, in this case the fact that some individual watched television; it is not good practice to store the opposite fact too. One must use a more straightforward approach, using some domain knowledge.<br />

<sup>3</sup> Ratios are not the only type of non-additive facts.<br />

The data are quota sample based, which means the sample was designed to be a representative subset of the population with respect to some descriptive characteristics. In this sense, it is impossible for an individual to hold two or more values for an attribute simultaneously. For example, an individual in the target "males aged 4 to 14" cannot be part of another target, "females aged 4 to 14".<br />

Property 1. Let $T_a$ be a target with a restriction over a socio-demographic attribute a. For any two distinct values $v_1 \neq v_2$ of a,<br />

$$T_{a=v_1} \cap T_{a=v_2} = \emptyset$$<br />

If the weights are normalised, which can be done during the ETL<sup>4</sup> process, the weights of disjoint subsets that together cover the panel sum to one.<br />

Property 2. Let $T_b$ be a target with a restriction over a two-valued socio-demographic attribute b. Then<br />

$$\sum_{T_{b=0} \in P} W(i) + \sum_{T_{b=1} \in P} W(i) = 1$$<br />

is always true.<br />
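Both properties can be checked on a toy panel. The four panellists, their raw weights and the two-valued attribute below are hypothetical, and normalisation is assumed to mean dividing each weight by the panel total.

```python
def normalise(weights):
    """Scale the weights so that the whole panel sums to one."""
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Hypothetical panel: each individual holds exactly one value of the attribute.
panel = {"p1": "M", "p2": "F", "p3": "F", "p4": "M"}
weights = normalise({"p1": 1.2, "p2": 0.8, "p3": 1.5, "p4": 0.5})

males = {i for i, value in panel.items() if value == "M"}
females = {i for i, value in panel.items() if value == "F"}

# Property 1: targets on different values of one attribute do not overlap.
disjoint = males.isdisjoint(females)

# Property 2: the two targets together account for all the weight.
total = sum(weights[i] for i in males) + sum(weights[i] for i in females)
```

Since no individual is counted twice, the weights of the two targets can simply be added.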

Knowing that, it is possible to create a fact table that stores the weights for all possible targets, including non-viewers. That fact table shares the Date and Socio-demographic dimensions. Each tuple in the fact table represents the reference value for a specific combination of socio-demographic values for a given day, obtained by applying (3) to each distinct combination. Figure 2 illustrates the Contact star schema.<br />

Figure 2: Illustration of the contact star schema model.<br />
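The following sketch shows how such a contact table could be built and then queried for an arbitrary target. It is an in-memory illustration only, not the authors' ETL; the row layout, the tuple used as socio-demographic key and the function names are assumptions.

```python
from collections import defaultdict


def build_contact_facts(panel_rows):
    """Sum the daily weights per socio-demographic combination.

    panel_rows: iterable of (date, sd_key, weight), one row per panellist
    per day, for viewers and non-viewers alike.
    Returns (date, sd_key) -> summed weight, i.e. (3) per combination.
    """
    facts = defaultdict(float)
    for date, sd_key, weight in panel_rows:
        facts[(date, sd_key)] += weight
    return dict(facts)


def target_weight(facts, date, predicate):
    """Reference value for a target chosen at query time."""
    return sum(
        weight
        for (day, sd_key), weight in facts.items()
        if day == date and predicate(sd_key)
    )


# Hypothetical rows; the key is (gender, age interval).
rows = [
    ("2001-03-01", ("M", "4-14"), 0.25),
    ("2001-03-01", ("M", "4-14"), 0.25),
    ("2001-03-01", ("F", "4-14"), 0.5),
]
facts = build_contact_facts(rows)
children = target_weight(facts, "2001-03-01", lambda key: key[1] == "4-14")
```

Because the combinations are disjoint (Property 1), the reference value of any target is the plain sum of the stored weights of the combinations it covers.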

The previous model provides the weights for all the contacted individuals, viewers or not, for one day. To calculate the performance indicators for the domain, e.g. rating, it is also necessary to determine the weights of the viewers. To address this issue, another fact table is needed, one that stores the viewers' daily corrective weight. Since this table is indexed by all of the dimensions discussed so far (Date, Time, Program and Socio-demographic), a tuple represents a contact of a set of viewers sharing equal socio-demographic values, for one minute, a particular program/channel and a specific day. From (4), it is necessary to store the weights and the number of minutes of the interval. Figure 3 illustrates the audience star schema, the main portion of the overall model.<br />

Figure 3: Illustration of the audience star schema model.<br />

<sup>4</sup> Extraction, Transformation and Loading.<br />

3.3 Optimization of the model<br />

The model is not optimized. The theoretical number of tuples per day is about 9.7 million (1440 minutes × 6720 socio-demographic combinations). Even for the more realistic 1/3 of that value, the table grows by 40 million tuples per month. Any optimization of the model must be thought of in terms of cost versus benefit, as it requires creating auxiliary aggregate models based on the requirements.<br />

The first aggregate takes advantage of the typical targets used in TV analysis. Not all of the socio-demographic combinations are interesting, so only 8 targets were considered here: Universe, all of the viewers; Class AB, the wealthiest viewers; 4:14, young children and teenagers; Housewife, viewers responsible for acquiring essential products for the house; Adults, viewers above 14 years old; ABC1 25:34, young active working viewers from the wealthiest social classes, aged 25 to 34; ABC1 15:34, young viewers from the wealthiest social classes, aged 15 to 34; ABC 25:54, active working viewers from all social strata except the lower classes.<br />

This aggregate represents a 70% reduction in the number of tuples needed, compared to the model illustrated in figure 3. A new dimension is necessary to accommodate the previous targets. The aggregate's model shares the Date dimension with the others. The fact table stores the sum of the daily weights for the target.<br />

Another possible optimization is to aggregate by time period. Several reports calculate the indicators for specific periods, e.g. prime time. In this model, the Time dimension is replaced by a new dimension, TimePeriod, a projection of the former with only the 8-valued attribute Period. Consequently, the Program dimension is incompatible with the granularity of the new aggregate. From Program, another new dimension is derived, Channel, with only one attribute, channel. This represents a 180-fold reduction in the first dimension (from 1440 minutes to 8 periods) and a reduction of 60% in the second. The remaining dimensions of the main model (figure 3) are compatible. The fact table stores the sum of the daily weights for a specific channel, period, date and socio-demographic combination, and the time interval in minutes. Figure 4 illustrates the overall model. The darker rectangles represent dimensions; the remaining ones are the fact tables.<br />

Figure 4: Illustration of the overall model. Bounded tables are aggregate specific and only exist to increase the model's performance.<br />

3.4 Evaluation<br />

The evaluation of the model was based not on performance but on flexibility and simplicity. One of the main goals was to show that generic OLAP tools can fulfil the needs of the audience analysis domain, with the same degree of freedom and leading to the same results. A series of reports was produced using Telereport [13] and the results were kept as reference values. Those reports are expected to be representative of the daily needs of an audience analyst. For lack of space, we address neither the ETL process nor its evaluation. It suffices to say that the OLAP cubes were created and populated with one year of data, using SQL Server Data Transformation Services [12] for the ETL and SQL Server Analysis Services [11] as the multidimensional engine. For each report, an MDX query was developed to mimic it and then executed against our data. Every execution confirmed the expected reference values, demonstrating that, for the tested reports, the model is adequate. The benefit of the aggregates was tested not in performance but in simplicity. Listing 1 illustrates the code needed to implement one of the reports. In this particular case, the query displays the rating for the main targets, by time period, for one specific channel.<br />

-- Reference weight of the current target, read from the ContactTarget aggregate<br />

WITH MEMBER Measures.TargetTotal as<br />

'LookupCube("ContactTarget",<br />

"(Measures.weight," + membertostr(<br />

Target.currentmember) + ")")'<br />

-- Rating as in (4): viewers' weight / target weight / number of minutes<br />

MEMBER Measures.rat as<br />

'Measures.weight / Measures.TargetTotal<br />

/ Measures.num_min'<br />

, FORMAT_STRING = 'Percent'<br />

SELECT {Target.currentmember} ON COLUMNS,<br />

time.Period.Members ON ROWS<br />

FROM AudienceTarget<br />

WHERE (Measures.rat, Program.Canal.[2])<br />

Listing 1: The rating for the main targets by time periods<br />

for channel 2 using the target aggregate<br />

The code is rather simple because each target weight is pre-calculated in the ContactTarget aggregate. The LookupCube function looks up the value for each target by querying it. In the absence of this cube, each target weight is calculated at runtime, by looking up each socio-demographic variable that makes up the target. Not only do the execution times rise, but the query code also grows.<br />
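For readers unfamiliar with MDX, the calculation in Listing 1 can be mirrored outside the OLAP engine. The sketch below is an analogue, not a translation: the two dictionaries stand in for the AudienceTarget and ContactTarget aggregates, all values are hypothetical, and a single channel and day are assumed.

```python
def rating_by_period(audience, contact_target, target):
    """Compute weight / TargetTotal / num_min for each period of a target.

    audience:       (target, period) -> (summed viewer weight, minutes).
    contact_target: target -> summed daily weight of the whole target.
    """
    # The role LookupCube plays in the listing: fetch the reference value.
    total = contact_target[target]
    return {
        period: weight / total / num_min
        for (name, period), (weight, num_min) in audience.items()
        if name == target
    }


# Hypothetical pre-aggregated values for one channel and one day.
audience = {
    ("AB", "20h00-23h00"): (54.0, 180),
    ("AB", "08h00-12h00"): (12.0, 240),
    ("Adults", "20h00-23h00"): (90.0, 180),
}
contact_target = {"AB": 1.5, "Adults": 4.0}
ratings = rating_by_period(audience, contact_target, "AB")
```

With the target total stored once per target, the per-period calculation reduces to two divisions, which is the simplification the aggregate is meant to provide.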

4 CONCLUSION<br />

This work presents a dimensional model capable of addressing the specificities of quota sample based data. The goal was twofold: first, to demonstrate that it is possible to address audience analysis requirements using non-proprietary repositories and technologies; second, to present a possible solution for other domains where data is also quota sample based. In the audience domain, the data tendencies are corrected by a daily individual weight, which must be taken into account if the indicators are meant to be representative of the entire population, not just the panel's individuals. To deal with the audience performance indicators, mostly non-additive and quota dependent, it is necessary to normalise each one with a reference value calculated from the daily corrective weights. The present solution relies on the creation of an auxiliary contact table to store daily weights for each possible combination of socio-demographic values. In this way the authors transform the non-additive facts into additive ones, sacrificing the ability to pre-calculate their values and store them directly in a fact table. All of the calculations must be done at runtime. To ensure an efficient solution, it is necessary to create a series of domain-dependent aggregates with specific dimension models. The performance indicator results obtained with the proprietary program and with generic OLAP tools using the discussed model matched.<br />

The authors believe the same methodology is appropriate for other domains if the data is quota sample based and the performance indicator values are always relative to the subset of the sample used in their calculation.<br />


