
A Guide to Small Area Estimation - Version 1.1 

Key Clients: National Statistical Centres and Client Services 

May 2006

A Guide to Small Area Estimation - Version 1.1 05/05/2006 

The ABS intends to periodically update this manual. Therefore, the 

ABS would welcome any comments and suggestions from users. 

Readers who would like more information or who would like to 

forward comments on this manual may contact any of the 

following ABS officers: 

Location Contact Name Phone Number Email address 

Central Office Daniel Elazar +61 2 6252 6962 daniel.elazar@abs.gov.au 

NSW Edward Szoldra +61 2 9268 4214 edward.szoldra@abs.gov.au 

QLD Brett Frazer +61 7 3222 6028 brett.frazer@abs.gov.au

QLD John Preston +61 7 3222 6229 john.preston@abs.gov.au

SA Justin Lokhorst +61 8 8237 7476 justin.lokhorst@abs.gov.au

SA Philip Bell +61 8 8237 7304 philip.bell@abs.gov.au

TAS Keith Farwell +61 3 6222 5889 keith.farwell@abs.gov.au 

VIC Elsa Lapiz +61 3 9615 7364 elsa.lapiz@abs.gov.au 

WA Carl Mackin +61 8 9360 5250 carl.mackin@abs.gov.au 

Australian Bureau of Statistics


Contents 

1 Introduction
1.1 What are Small Area Estimates?
1.2 Background of the Small Area Practice Manual
1.3 Purpose
1.4 What are the primary uses for Small Area Estimates?
1.5 When should Small Area Estimates be Produced?

2 Assessing User Requirements
2.1 User Requirements

3 Some issues in Small Area Estimation
3.1 Sources of Additional Information
3.2 Basic Conditions for Success
3.3 Choice of Small Area
3.4 Variable of Interest
3.5 Quality of Auxiliary Data
3.6 Confidentiality

4 Choice of Small Area Techniques
4.1 Types of Small Area Estimation Techniques
4.1.1 Simple Small Area Methods
4.1.2 Regression Methods
4.2 The Modelling Framework
4.3 Trade-off between Quality, Cost, Time and Effort

5 Case Studies of Small Area Applications
5.1 Simple Small Area Models
5.1.1 Broad Area Ratio Estimator with No Auxiliary Data
5.1.2 Broad Area Ratio Estimator with Auxiliary Data
5.2 Regression Based Models
5.2.1 Overview
5.2.2 Framework for Regression Based Models
5.2.3 Regression Based Synthetic Estimates
5.2.4 Generating Small Area Estimates from Person Level Models
5.2.5 Discussion of Examples 1-4

6 Diagnostics For the Quality Small Area Estimates
6.1 Introduction
6.2 Diagnostics From Case Study
6.3 Assessment of Models Against Diagnostics

7 Communicating Quality to Users
7.1 Introduction
7.2 Sources of Error
7.3 Impact of Errors
7.4 Explanatory Notes

8 Summary
8.1 Points to Consider
8.2 What Areas of the ABS Provide Small Area Estimates

9 Frequently Asked Questions

APPENDICES
Appendix 1: List of Previous Small Area Work
Appendix 2: Technical Notes of Estimators
Appendix 3: SAS Datasets and Codes
Appendix 4: Diagnostics Graphs
Appendix 5: Explanatory Notes
Appendix 6: Quality Declaration

BIBLIOGRAPHY

LIST OF ACRONYMS



1. Introduction

1.1 What are Small Area Estimates?

Most ABS surveys are designed to provide statistically reliable, design based estimates 

only at the national and/or state/territory geographic levels. The sheer practical 

difficulties and cost of implementing and conducting sample surveys that would provide 

reliable estimates at levels finer than state/territory are generally prohibitive, both in 

terms of the increased sample size required and the added burden on providers of 

survey data (respondents). For the purposes of this manual, small area estimation refers

to methods of producing sufficiently reliable estimates for geographic areas that are too

fine to be estimated with adequate precision using direct survey estimation methods. By direct

estimation we mean classical design based survey estimation methods (Saei and 

Chambers, 2003) that utilise only the sample units contained in each small area. Small 

area estimation methods are used to overcome the problem of small sample sizes to

produce small area estimates that improve upon the quality of direct survey estimates 

obtained from the sample in each small area. The more sophisticated of these methods 

work by taking advantage of various relationships in the data, and involve, either

implicitly or explicitly, a statistical model (see footnote 1) to describe these relationships

(see Sections 4.1 and 4.2 for further discussion).

Although conceptually similar, small domain estimates refer to estimates disaggregated

to fine classificatory levels, such as by socioeconomic status, income, labour force status

or industry. It is important to note that we have not undertaken any empirical study of

small domain estimation methods for this manual, although intuitively we would expect

that most techniques covered in this manual would still apply. The empirical analysis of 

this manual is based on knowledge and experience derived from only one empirical 

study, this being a study of the incidence of disability in Australia. This study uses data 

from the Survey of Disability, Ageing and Carers (SDAC) (see ABS (2003) for more 

details). 

Footnote 1: A statistical model is a mathematical representation of the relationship we assume to exist

between the variable we are interested in predicting (known as the response or dependent

variable) and other associated variables (known as the auxiliary, explanatory or independent

variables). A model is then fitted to data that contains observed values for both the

dependent variable and the auxiliary variables for each unit. The fitting process produces 

estimates of the model parameters such as intercepts and slopes. The unit here may be a 

person, a business or a small area itself, depending upon the level at which we wish to fit the 

model. The model also includes one or more error terms to describe the degree of 

stochastic or random variation with which predicted values for the response variable deviate 

from the observed values. 
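
As a hedged illustration of the kind of model the footnote describes, the simplest case with one auxiliary variable can be written as below. The symbols are notation assumed here for illustration and are not taken from the manual: y_i is the response for unit i, x_i is the auxiliary variable, beta_0 and beta_1 are the intercept and slope to be estimated, and e_i is the random error term with zero mean and variance sigma^2.

```latex
y_i = \beta_0 + \beta_1 x_i + e_i, \qquad \mathrm{E}(e_i) = 0, \quad \mathrm{Var}(e_i) = \sigma^2
```

Fitting the model yields estimates of beta_0 and beta_1, and the predicted value for a unit is the estimated intercept plus the estimated slope times that unit's auxiliary value.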



1.2 Background to the Small Area Practice Manual 

The small area practice manual project was developed to give a simple and clear guide on

how to undertake small area estimation. The ABS has previously carried out a number of

small area estimation projects (see Appendix 1). In recent years user demand for these

kinds of statistics has increased. Most of this increase in demand has become apparent 

during specific consultations between the ABS and key government users to gauge and 

assess users’ medium and long term statistical data requirements. Consolidated 

examples of these can be found in the Information Development Plan (IDP) (ABS, 2005) 

(Catalogue 1362.0 to be released in early 2006) and State Statistical Priorities (ABS 

Corporate Information - State Statistical Forum 16 February 2005).

This reflects the growing statistical sophistication of users. Also, local government bodies

such as cities, councils and shires are taking on a greater role in the long term planning 

and socio-economic development of their regions. This increase in demand for small 

area data is occurring globally. In response, more advanced methods for producing 

reliable small area statistics are being developed and are gaining methodological 

acceptance. This was recognised at an international conference held on small area 

statistics in Riga, Latvia in 1999 where the Deputy Australian Statistician encouraged 

National Statistical Organisations to make greater use of model-based methods to 

produce small area statistics (Trewin 1999). The paper also noted that explaining quality 

is an especially important issue for a National Statistics Office when producing these 

types of estimates and products. 

Various areas within the ABS have been involved in the provision of small area estimates 

to varying levels of sophistication in both the methods used and the quality of the 

estimates produced. Table A.1 of Appendix 1 contains a selection of the major pieces of 

small area work that have been conducted to date. In addition, there has been no 

definitive set of clear, ABS wide guidelines on how to assess the quality of small area 

estimates and what should be the agreed minimum level of quality required before 

releasing small area statistics to external clients. In other words, what needs to be 

developed is a cohesive, coordinated approach to the production of small area 

estimates. 

There is a strong need to set up a framework for the practice of small area estimation at 

all levels of involvement in the small area statistical process. These include client services 

areas in regional offices or Central Office (CO), Methodology Division, National 

Statistical Centres and senior managers responsible for clearing and releasing small area 

output. Such a framework is important for ensuring that consistent practices are used 

across the ABS in producing small area estimates and that these practices accord with 

best practices used both in the ABS and in statistical agencies overseas. 



A consistent approach to the production of small area estimates is important for the 

following reasons: 

o a need for the ABS to more precisely understand users' small area needs, ie how they utilise small area estimates in their decision making. Getting this right at the outset will ensure effort is efficiently directed to producing small area estimates that are fit for purpose.

o to ensure that small area estimates are produced with sufficient quality and are appropriate to user requirements.

o to ensure that users fully understand the assumptions and conditions underpinning output data and the fitness for use.

o to ensure small area estimation methodologies are sound, robust and practicable for a large range of small area estimation problems.

Linked to this is the broader issue of the circumstances in which the ABS should or 

should not be producing small area estimates. These decisions need to be made by 

determining the risk that the provision of such data will detract from informed decision 

making. 

This manual has been prepared on the basis of work done on small area estimates of 

disability. Although the wording of the manual inadvertently reflects the context of a 

household based population survey, the small area methods described can also be 

applied to the context of economic/business collections. As further empirical studies are 

applied to other data contexts we anticipate that the manual will be expanded and 

adapted to include examples relating to economic data. 

1.3 Purpose 

This volume of the manual is the first of two; the second will contain a more technical

treatment. This volume aims to provide a simple, non-technical guide to the

production, uses, quality and validation of small area estimates. The intended audience

includes survey practitioners, consultants, methodologists and users of small area data. 

The broad objectives of the Small Area Estimation Practice Manual are as follows: 

o To build a stable bridge between the knowledge, the theory and the practice of small area estimation while taking account of ABS priorities and policies with regard to the production of small area statistics. This should result in a more consistent and quality assured approach to producing small area estimates within the ABS.

o To realise a quantum increase in the level of ABS knowledge and understanding of small area estimation techniques, how and under what conditions they can be applied, and how to measure or assess the quality of the small area estimates produced.

o To provide coherent, relevant, accurate and accessible information on small area estimation practices and techniques which are used regularly by their intended audiences and updated to reflect increases in knowledge and understanding.

o To ensure that practitioners within the ABS have a clear understanding of the quality and assumptions underpinning the small area estimates produced and that these are clearly communicated to users so that small area estimates are used appropriately and for the purposes intended.

Intended audience 



o This guide aims to give advice for National Statistical Centres and regional offices on how to advise, respond to and incorporate small area estimates into their work, so they can apply simple models themselves and know when to draw on methodological skills for more complex models.

o A second volume of the manual will cover the more technical aspects of small area estimation and will be primarily aimed at methodologists and technical analysts involved in producing modeled small area estimates. The technical manual will cover in more detail the methodological and statistical issues that arise in small area estimation.

o The content of this manual contains material on the application of basic statistical models. The manual therefore assumes the reader has a basic familiarity with the theory and application of such models. Some parts of the manual contain references to somewhat more advanced methods. In such instances warning boxes strongly recommend to the reader that further methodological advice should be obtained from Methodology Division (ABS) before applying such techniques.

What it is - A Guide to: 

o what issues need to be thought through before undertaking a small area exercise,

o the methods and techniques available in small area estimation, the relative advantages and disadvantages and assumptions involved in each,

o who to talk to, who has implemented specific approaches into practice already and where to find relevant documentation,

o the trips and traps of putting various techniques into practice,

o how to best measure the reliability of small area predictions,

o how to detect model misspecification and what diagnostics are available for assessing the overall quality of small area estimates.

What it is not 

o An up-to-date encyclopedia of all the literature on small area techniques. The focus of this manual is much more on the practice of small area estimation in the production of government statistics. Compiling and maintaining an up-to-date summary of the technical literature would be highly resource intensive as the field is relatively new and rapidly evolving. It would also make the manual more difficult for the practitioner to use.

Finally, we emphasise that this manual has been written under the assumption that the 

primary goal of small area data users is to obtain descriptive statistics of the relative 

characteristics of small areas rather than obtain the form of some dynamic structural 

process which generates those small area characteristics. The manual is therefore 

premised upon a descriptive framework for the ultimate decision making objectives, 

even though analytical methods are used to construct the models used to predict those 

small area characteristics. In other words, we assume users are primarily interested in 

the predictions from those models, not just the form and structure of the models per se. 



1.4 What are the primary uses for Small Area Estimates? 

Federal, state and local government bodies involved in program funding / evaluation or 

regional planning are typically the primary users of ABS small area data. They require 

estimates of specified accuracy to assist them in making informed decisions on how to 

allocate resources or apply for additional resources. The need for government services 

to justify their decision making and be accountable to the community is seen as a very 

important factor. 

Small area estimates are often used by program administrators to determine or 

benchmark their funding allocations. Without the small area information, the 

administrators have difficulty in assessing the actual need for goods and services in each 

area. This can result in undesirable scenarios such as "the squeaky wheel gets the 

grease", whereby interest groups or areas which are most vocal receive a greater share of 

the funding allocations. Small area estimates provide detailed information on each area 

allowing for objective and informed decision making. 

Local government demand for small area data has also increased as local governments become

increasingly aware of, and interested in, the role statistics can play in informing them about

what is happening in their own jurisdictions. 

1.5 When should Small Area Estimates be Produced? 

Small area estimates should only be produced when there is strong and justified user 

demand as well as no alternate data at the small area level that will serve the required 

purpose. In addition there needs to be adequate survey and auxiliary data to ensure that 

the outputs produced will be of sufficient quality to fit their intended purpose. 

Small area estimates should primarily be considered where key policy making decisions 

require discerning between relative needs of different small areas and such information 

does not currently exist or requires updating (e.g. disability data). Producing small area

estimates requires significant staff time to develop, check and obtain approval for

release. The complexity of most small area estimation exercises and the

difficulty in validating the reliability of the output make it very difficult to fully automate

the production process. To a large extent, each small area undertaking has to be tailored 

to the nature and specifics of the problem at hand. Therefore, care needs to be taken to 

ensure the need for the small area estimates warrants the effort required. 

The first step is to discuss with users whether state or part of state estimates would

be adequate. If there is not much variation between the small areas then broader

estimates may be adequate. It is also worth investigating any sources of administrative

data that can be used as auxiliary data for a small area model. Finally, it is worthwhile

checking that the chosen small area model is appropriate for the data

and that the assumptions inherent in the model hold at least approximately. For example,

fitting a linear model to the data would require that the errors are identically and

independently distributed with zero mean and constant variance. It is therefore prudent

to check that such assumptions are reasonable before relying on the fitted

model.
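
The kind of assumption check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an ABS procedure: the data are simulated, the function names `fit_line` and `residual_checks` are hypothetical, and the check shown (comparing residual spread between the lower and upper halves of the fitted values) is only one rough indicator of non-constant variance.

```python
import random
import statistics


def fit_line(x, y):
    """Least squares intercept and slope for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residual_checks(x, y):
    """Return the mean residual and a crude constant-variance indicator."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    # Order residuals by fitted value and compare the spread in each half.
    ordered = [r for _, r in sorted(zip(fitted, resid))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[half:]
    return {
        "mean_residual": statistics.fmean(resid),
        "variance_ratio": statistics.variance(high) / statistics.variance(low),
    }


# Simulated data (an assumption for illustration): a linear relationship
# with independent errors of constant variance.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

checks = residual_checks(x, y)
print(checks)
```

A variance ratio far from 1 would suggest that the spread of the errors changes with the fitted value, in which case the constant variance assumption deserves a closer look. In practice a residual plot is usually examined as well.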



2. Assessing User Requirements 

2.1 User Requirements 

Understanding user requirements for small area estimates is paramount for providing a 

high quality small area product that meets the client's decision making requirements. 

The importance of gaining a thorough understanding of user requirements at this initial 

phase cannot be overemphasised. Shortcuts taken at this phase will often lead to an

inferior quality product and/or valuable time and resources lost along the way. With the 

right questions, users will be able to give a clear indication as to what information is 

critical in their decision making. Users are also a valuable resource in helping to 

determine the best potential sources for borrowing strength. 

Where complex techniques need to be applied, Methodology Division (MD) staff will 

need to be involved in performing the methodological work and it is highly 

recommended that MD staff are directly involved in the discussion with clients at the 

earliest possible opportunity. 

Table 2.1 below displays a checklist of the key questions to ask clients when 

commencing a small area exercise. 

Table 2.1 Checklist of Questions to Ask Users. 

Question 

A) What are the key policy making or program funding decisions that require small area data?

B) What are the organisation's strategic context, goals and desired outcomes, in which these decision making requirements are nested?

C) What small area data do users think would best meet their decision making requirements and what level of geography is required?

D) What are the consequences for users’ decision making outcomes if the small area data is incorrect, say, by 5%, 10%, 20%, etc? Which small area estimates have the greatest priority in terms of accuracy requirements?

E) Are there any conceptual models, either social or economic, that are believed to describe the process which influences the variable(s) for which we are to calculate small area estimates?

F) What administrative data is available and relevant as auxiliary information to support the modeling of the small area estimates? How is this data collected, for what purpose is it used, and how accurate is it likely to be?

G) Will small area estimates be required to be disaggregated by other categories?

H) What previous studies have been used, if any, to undertake the policy/funding decision for which small area estimates are required?



A) What are the key policy making or program funding decisions that 

require small area data? 

Knowing how the small area data will be used as input to users’ decision making process

is essential in ensuring the small area output meets user requirements. User decision 

making requirements can vary considerably. Some may be quite sophisticated and 

quantitatively based. Others may be quite informal and qualitatively based. In the former 

case, the decision making process should be identified and well understood as inherent 

assumptions may help determine just how accurate small area data really needs to be. It 

is also important to ensure, where possible, that the small area data is consistent and 

compatible with the users’ decision making process, and that the output of this process 

meets user expectations, not just the ABS small area output. A quality assessment should 

include measures of the fitness for purpose of small area output. 

However, many users do not have sophisticated, quantitatively based decision making

processes, and may have difficulty in articulating the very nature of the problem they 

wish to solve. 

Before undertaking the project it is worth investigating whether the small area estimates

requested may suit the needs of a wider range of clients. Quite often similar data is

required by different clients. Incorporating

their needs into the project increases the value of the final product at minimal

additional cost.

B) What are the organisation's strategic context, goals and desired 

outcomes, in which these decision making requirements are nested? 

Ask users what the data problem is, why the data needs to be obtained, what decision

making processes are used, and what they are trying to find out and why. This can be

matched against what it is possible to estimate from the available data. Any

limitations can then be identified early, so that additional information can be sought or the

user can be made aware of them. When the final product is created, the user then has a good

understanding of its limitations and the product is as close as possible to what they

need.

C) What small area data do users think would best meet their decision 

making requirements and what level of geography is required? 

A minimum level of information on the variable of interest is needed in each small area. 

Given the available data, the user needs to be aware that, for a given level of quality of

the small area estimates, there is a trade-off between the level of geography

and the level of detail in the data that it is possible to model. That is, in the context of

household based collections, a reasonably common characteristic of the variable of

interest (say, greater than 10%) may be estimated at a reasonably fine level of geography

such as Statistical Local Area (SLA). However, a variable of interest representing less

than 1% of the population can only be reliably estimated at a broader level of geography

such as Statistical Sub-Division (SSD). For example, in the disability study, estimates for

physical disability (which accounts for more than 10%) could be obtained at a reasonably

fine level of geography and detail compared to psychological disability (which is

around 1%). This choice also depends on the quality of data, which is discussed in the

next section.
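
A rough calculation shows why a rarer characteristic needs a broader level of geography. The sketch below is an illustration only and rests on simplifying assumptions that are not from the manual: simple random sampling, no finite population correction, no design effect, and a hypothetical sample of 400 persons in the area. Under those assumptions the relative standard error (RSE) of an estimated proportion p is sqrt((1 - p) / (n p)).

```python
import math


def relative_standard_error(p, n):
    """Approximate RSE of an estimated proportion p from a simple random
    sample of size n (no finite population correction or design effect)."""
    return math.sqrt((1.0 - p) / (n * p))


n = 400  # hypothetical sample size within one area
rse_common = relative_standard_error(0.10, n)  # characteristic held by 10%
rse_rare = relative_standard_error(0.01, n)    # characteristic held by 1%

print(f"RSE at 10% prevalence: {rse_common:.3f}")
print(f"RSE at  1% prevalence: {rse_rare:.3f}")
```

With the same sample size, the estimate for the 1% characteristic is more than three times as variable in relative terms, so a comparable level of reliability requires pooling sample over a larger area.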

D) What are the consequences for users’ decision making outcomes if the 

small area data is incorrect, say, by 5%, 10%, 20%, etc? Which small 

area estimates have the greatest priority in terms of accuracy 

requirements? 

The answer to this question will drive the level of quality and hence resources required 

to produce small area estimates of acceptable quality to users. If large funds from a 

government program are to be allocated to regions based on the small area estimates, 

then a high level of quality assurance and validation is required. However, if all that is 

required is an approximate guide to indicate areas where there may be unmet need, say 

for program evaluation purposes, then broad quality checks may be adequate. 

To assess how accurate small area estimates need to be before they start to adversely 

impact upon decision making outcomes, it is important to understand the entire 

decision making process and the way in which small area estimates feeding into that 

process impact upon the outputs. This involves understanding the assumptions implicit 

in the process. The analyst needs to work out, in consultation with users, how accurate 

final decision making outcomes need to be. By working backwards, it may be possible 

to work out what level of accuracy in the small area estimates will give this level of

accuracy in the decision making outcomes. A sensitivity analysis is another approach

that can be undertaken to determine how sensitive final decisions are to changes in

the small area estimates.
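
A simple form of sensitivity analysis can be sketched as follows. Everything in this example is assumed for illustration: the three areas, their estimates, the budget, and the rule that funds are allocated in proportion to each area's estimate. Real funding formulas are usually more complex, and the function names `allocate` and `sensitivity` are hypothetical.

```python
def allocate(budget, estimates):
    """Allocate a budget in proportion to each area's estimate."""
    total = sum(estimates.values())
    return {area: budget * value / total for area, value in estimates.items()}


def sensitivity(budget, estimates, area, relative_error):
    """Change in every area's allocation when one area's estimate is
    perturbed by the given relative error (e.g. 0.10 for +10%)."""
    base = allocate(budget, estimates)
    perturbed = dict(estimates)
    perturbed[area] = estimates[area] * (1.0 + relative_error)
    shifted = allocate(budget, perturbed)
    return {a: shifted[a] - base[a] for a in estimates}


# Hypothetical small area estimates (number of persons in need) and budget.
estimates = {"Area A": 1200.0, "Area B": 800.0, "Area C": 2000.0}
budget = 1_000_000.0

changes = sensitivity(budget, estimates, "Area A", 0.10)
for area, change in changes.items():
    print(f"{area}: {change:+.0f}")
```

Repeating the calculation for each area and for a range of error sizes (5%, 10%, 20%) gives users a concrete picture of how much a given level of inaccuracy would move the final decision.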

Zaslavsky and Schirm (2002) discuss, in the context of funding allocations, how 

interactions between the provisions of the funding formula, data sources and estimation 

procedures used to derive formula inputs can have unanticipated consequences that are 

inconsistent with the policy goals of a program. 

E) Are there any conceptual models, either social or economic, that are 

believed to describe the process which influences the variable(s) for 

which we are to calculate small area estimates?

This is a great opportunity to get expert advice on what variables should have a 

relationship with the population of interest. This will give a theoretical base to look at 

certain variables which can then be confirmed by statistical analysis. A widely accepted 

theoretical model or framework, published in the literature and/or supported by 

empirical investigations can greatly assist in deciding which variables, interaction terms 

and contextual effects should be included in the small area model or in validating the 

predicted estimates. Should you decide to include other variables not included in the

framework or exclude variables that are included, you should be aware of the potential need to

justify the decision.



F) What administrative data is available and relevant as auxiliary 

information to support the modeling of the small area estimates? How is 

this data collected, for what purpose is it used, and how accurate is it 

likely to be? 

It is important to cast the net wide in considering all potential sources of auxiliary data 

that may help improve the goodness of fit and specification of the small area model. 

The importance of understanding differences between auxiliary data and the survey data 

cannot be overstated. Administrative datasets may not reflect the entire population of

interest, or may be less reliable, because they are captured as part of some other process (e.g. tax collection).

A careful assessment should be made of the differences in: 

o concepts,

o data item definitions,

o (standard) classifications used

o scope

o mode of data collection

o reference periods

o editing procedures

across all the data sources in order to at least understand the limitations of the small 

area model. 

G) Will small area estimates be required to be disaggregated by other 

categories? 

Users often request a whole range of small area data at different levels that may actually 

be superfluous to their needs. Here it is useful to find out what is the minimum level of 

data and geographic detail required to meet their needs. Prioritise any further

breakdowns, at either the geographic or sub-population level, so that modelling

time is spent on the essential models.

H) What previous studies have been used, if any, by the clients to 

undertake the policy/funding decision for which small area estimates are 

required? 

This allows the project to compare its results with current or previous studies, which gives

a good indication of whether they are consistent with other research. It also allows research into what

problems have come up in the past.



3. Some issues in Small Area Estimation 

3.1 Sources of Additional Information 

The aim of small area estimation is to output a set of reliable estimates for each small 

area for the target variable(s) of interest. The challenge therefore, in small area 

estimation, is how best to use innovative approaches that take advantage of additional 

information to circumvent the small sample size problem and provide estimates with 

improved quality. Small area estimation methods are effective when they can draw upon 

intrinsic relationships within and between the survey data and other data sources, from 

which they borrow strength. These relationships, which are schematically represented in 

Figure 3.1, may be found: 

o between the survey based direct estimate and auxiliary information available from administrative data sources, censuses or other surveys or

o in correlations between direct estimates observed across time or

o in spatial relationships between neighbouring small areas or

o in cross-sectional relationships between units with similar characteristics observed in different small areas within some broader region

o or any combinations of the above.

Figure 3.1: Possible sources of additional information

[Diagram: a central "Small Area Model" box linked to five sources: Auxiliary Data (Demographic Information), Cross-sectional Relationships, Time Series Relationships, Multivariate Correlations and Spatial Effects.]

It turns out that, in most cases, by far the most important source from which to borrow

strength is auxiliary data.

Auxiliary data 



One of the more important prerequisites for the successful production of small area 

estimates is the availability of accurate auxiliary data that is well correlated with the 

target variable. By auxiliary data we mean one or more variables obtained from either 

administrative data sources or a census that are included in the model as explanatory 

variables. The auxiliary data should: 

o comprehensively cover the entire population scope for which small area estimates are required. If an auxiliary data item is not available for the unselected part of the population then small area predictions cannot be made and the affected data items cannot be included in the model.

o include reliable geographic information so that all units belonging to a small area can be accurately identified, and

o be contemporaneous with the target variable and other auxiliary data used in the model

Model based small area estimates are produced by firstly fitting the model to the 

sample data to estimate model parameters, which include the intercept and slope 

parameters. The estimated model is then applied to the population auxiliary data to 

produce the small area predicted estimates. 

In the case of a purely area level model, the target variable and auxiliary variables are 

all at the small area level, so it is relatively straightforward to produce small area 

estimates as described above. However in the case of unit or person level models, the 

second step referred to above is a little more complex as the model fitted to the 

sampled units is generally applied to those population units not selected in the 

sample. Small area estimates are compiled by taking the sum of the sample unit 

values for the target variable (obtained from the survey data) and adding to it the sum 

of the model predictions for the non-sampled units. 
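The unit level calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a simple linear model with one auxiliary variable fitted by ordinary least squares. The function names (fit_linear, small_area_totals) and all data values are hypothetical and are not taken from any ABS system.

```python
import numpy as np

def fit_linear(x, y):
    """Fit y = intercept + slope * x to the sampled units by least squares."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, slope]

def small_area_totals(y_s, x_s, area_s, x_r, area_r):
    """Observed sample sum plus predicted sum for non-sampled units, per area."""
    beta = fit_linear(x_s, y_s)
    pred_r = beta[0] + beta[1] * x_r  # predictions for non-sampled units
    totals = {}
    for a in np.unique(np.concatenate([area_s, area_r])):
        observed = y_s[area_s == a].sum()
        predicted = pred_r[area_r == a].sum()
        totals[str(a)] = float(observed + predicted)
    return totals

# Hypothetical sample (s) and non-sampled (r) units in two small areas
y_s = np.array([1.0, 3.0, 5.0, 7.0])
x_s = np.array([0.0, 1.0, 2.0, 3.0])
area_s = np.array(["A", "A", "B", "B"])
x_r = np.array([4.0, 5.0, 1.0])
area_r = np.array(["A", "B", "B"])

totals = small_area_totals(y_s, x_s, area_s, x_r, area_r)
print(totals)
```

The model is fitted once to the pooled sample, and each small area total then combines the survey values actually observed in that area with model predictions for the remaining units.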

This approach naturally applies if the survey data can be reliably matched to the 

auxiliary information using a hard matching identifier such as Medicare number or tax 

file number. This is common practice in a number of European Union countries 

where national identifiers exist. However due to privacy considerations and related 

issues, this practice rarely occurs in Australia. Where it is not possible to distinguish 

between sampled and non-sampled units on the auxiliary data sources, there are two 

options available: 

- apply the model fitted to the sample data to the entire population data file, or 

- group population units within each small area (eg age by sex), fit a model to the 

small area by sub-group level sample data and then apply this model to make 

predictions for the non-sampled population in each small area sub-group. 

The first approach suffers from the disadvantage that the prediction error for the small 

area estimates will be increased slightly because target variable values for the sampled 

units are predicted from the model, thereby contributing to total model error. It would 

be preferable, however, to make use of the available survey response values, which

are not subject to model error. If the sampling fraction is very small then this should not 

be a major concern. 

The second approach has the advantage that only population counts of the non-sampled 

population in each small area sub-group are required to make predictions. The 

predicted totals for the non-sampled population (at the small area sub-group level) can 



then be added to the corresponding sample totals to form small area estimates. A 

potential disadvantage of this approach is that the small area sub-group level model may 

be less efficient than a unit level model. 
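The second approach can be illustrated with the following sketch. It assumes the simplest possible sub-group model, namely a pooled sample mean for each sub-group, which is an assumption made here for illustration rather than the model used in any ABS study. The function name and the data are hypothetical.

```python
from collections import defaultdict

def subgroup_estimates(sample, nonsampled_counts):
    """sample: list of (area, group, y) records.
    nonsampled_counts: {(area, group): count of non-sampled population}."""
    # Assumed model: the pooled sample mean of y within each sub-group.
    sums, counts = defaultdict(float), defaultdict(int)
    for _, group, y in sample:
        sums[group] += y
        counts[group] += 1
    group_mean = {g: sums[g] / counts[g] for g in sums}

    # Start each small area total with the observed sample values.
    totals = defaultdict(float)
    for area, _, y in sample:
        totals[area] += y

    # Add the predicted total for the non-sampled population in each cell.
    # A sub-group with no sample would raise KeyError; a real application
    # would need a fallback for such cells.
    for (area, group), n in nonsampled_counts.items():
        totals[area] += n * group_mean[group]
    return dict(totals)

# Hypothetical data: 1 = has the characteristic, 0 = does not
sample = [("A", "young", 0), ("A", "old", 1), ("B", "young", 1), ("B", "old", 1)]
nonsampled = {("A", "young"): 10, ("A", "old"): 4,
              ("B", "young"): 6, ("B", "old"): 8}

est = subgroup_estimates(sample, nonsampled)
print(est)
```

Only the counts of non-sampled people in each small area by sub-group cell are needed, which is why this approach works when individual records cannot be matched.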

Auxiliary data may be available at area-level or person/unit level or a combination of 

both. However, in practice due to confidentiality or security reasons, data from 

government administrative sources are more likely to be available at some aggregated 

level. The choice between a unit/person level or area level model will depend on the 

level at which data for the variable of interest and explanatory variables are available as 

well as the efficiency of the small area estimates generated. For example, if data for 

the target variable and the auxiliary variables are only available at the area level, fitting 

an area level model will be the only option. However if unit level data is available for 

all variables, either an area level or unit level model is an option. It is also possible to 

fit a model in which the target variable is at the unit level but some auxiliary variables 

are at the unit level while others are at area level. Further discussion on the choice of 

small area model is provided in Section 4.2 below. 

In practice, the efficiency of predicted small area estimates may be improved by 

including some auxiliary variables as small area averages. Such covariates are referred to 

as contextual effects and may be included as an additional covariate even if the variable 

already appears in the model as a unit level auxiliary variable. Contextual effects allow 

differences in the area level characteristics in which a person lives to be accounted for in 

the model. For example, high income earners living in low income areas may have quite 

different characteristics to people on similarly high incomes living in high income areas, 

and it may be important to take account of this in the model. 
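As an illustration of a contextual effect, the sketch below builds a design matrix in which income appears twice: once as a unit level value and once as the average for the small area in which the person lives. The function name and the data are hypothetical. For simplicity the area average is computed from the supplied records, whereas in practice it would more likely come from census or administrative data.

```python
import numpy as np

def add_contextual_effect(x_unit, area):
    """Return columns: intercept, unit level x, small area average of x."""
    area_mean = np.empty_like(x_unit, dtype=float)
    for a in np.unique(area):
        mask = area == a
        area_mean[mask] = x_unit[mask].mean()  # same value for every unit in the area
    return np.column_stack([np.ones(len(x_unit)), x_unit, area_mean])

# Hypothetical incomes (in $'000) for people in a low and a high income area
income = np.array([10.0, 20.0, 30.0, 50.0, 70.0, 90.0])
area = np.array(["A", "A", "A", "B", "B", "B"])

X = add_contextual_effect(income, area)
print(X)
```

A model fitted with this matrix has separate coefficients for a person's own income and for the income level of their area, which is what allows the two situations in the example above to be distinguished.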

We now give an example of the data sources and auxiliary variables that were considered 

for the disability empirical study. The target variable was whether or not a person has a 

disability. The auxiliary data was drawn from the survey, a census as well as 

administrative data sources and comprised: 

- Survey of Disability, Ageing and Carers (SDAC) (ABS, 1998) 

- Census of Population and Housing, 2001 (ABS) 

- Socio-Economic Indexes For Areas (SEIFA) (ABS) 

- Disability Support Pension (DSP) data from Centrelink 

Given these sources of data, the following auxiliary variables were considered: 

- proportion of people in the small area receiving the DSP, 

- age and sex, income, household structure (from SDAC),

- Socio-Economic Indexes For Areas (SEIFA) score for the small area,

- indicator of remoteness.

Some of these variables were only available at the area level while those sourced from 

SDAC/Census, for example, age, sex and income, were available at the person level. 

These SDAC variables were chosen subject to the requirement that these variables were 

similarly defined and available from the census. 

Another key issue relating to auxiliary data concerns the case where survey data cannot 

be matched to auxiliary data sources. In order to make predictions for each small area, 

auxiliary variables obtained from the survey must correspond closely with similar data 

items available for the rest of the population. If this is not the case then model 

predictions may be significantly biased. For example in the empirical study of small area 



estimates of disability, we used auxiliary variables such as age, sex, income and 

household structure, found on the SDAC survey file to fit the model and then used the 

corresponding variables on the population census file to make the small area 

predictions. 

When considering potential sources of auxiliary data it is highly advisable to cast a wide 

net and assess the value of data that may not on first reflection appear highly relevant. 

For example, in the context of disability data, an economic variable may have good

predictive power in addition to health related variables. Some caution, however, needs to be

exercised as it is possible that the correlation between the target and some of the more 

tenuous auxiliary variables is more due to coincidence than to an intrinsic real world 

relationship between the two. Such auxiliary variables are referred to as spurious 

auxiliary variables. 

Demographic information is a particular form of auxiliary information, relating to 

population attributes such as age and sex. Many social variables will have some 

relationship to such demographic data thereby necessitating its use. However there is 

another reason for using demographic information and that is where the population size 

or demographic composition of small areas varies considerably. In Australia, with its 

extreme variation in population densities, this is a very common issue. 

Cross-sectional relationships 

Cross-sectional correlations are intrinsic relationships between units (observed at the 

same time point) with similar characteristics, even if they are not in the same small area. 

For example, units with the same age, sex and occupational characteristics may have 

similar health outcomes regardless of whether they live in Sydney or Melbourne. Small 

area methods borrow strength cross-sectionally by pooling sample data across a broader 

area (thus obtaining more statistical reliability) and then adjusting each small area 

estimate according to its age-sex-occupation profile. In practice, borrowing strength

cross-sectionally may be restricted to a predefined broader region if it is believed that 

cross-sectional relationships are likely to be different between regions. For example 

exposure to air pollutants is likely to be similar for Sydney and Melbourne but different 

to that of other cities. Hence Sydney and Melbourne may be combined into a broader 

region within which cross-sectional relationships can be drawn upon. 

Time Series Relationships 

Borrowing strength across time enables the practitioner to effectively pool sample data 

across time. The sample in each small area may be very sparse at a given time point, 

however if a sufficiently long time series exists and auto-correlations across time are 

reasonably strong, data from a number of time points can be pooled together giving a 

larger effective sample size to utilize in each small area. Time series auto-correlations are 

utilised to adjust for the degree of similarity or dissimilarity between units observed at 

specified time periods apart. This approach also has the benefit of reducing the impact 

of an observed value that is discordant with its neighbouring values in time. Borrowing 

strength across time adds a considerable degree of complexity to small area estimation 

and should only be contemplated where statistical expertise is available. 



Spatial Relationships 

Spatial relationships in the data can be harnessed in much the same way that time series 

relationships can be. Thus, if we hypothesize that different units bear some relationship 

to each other that depends upon the distance and direction between them, units can 

then be pooled together to give a greater effective sample size for each small area 

estimate. This approach also has the benefit of reducing the impact of the odd unit value 

that is discordant with its neighbouring values. Spatial methods are commonly used in 

the contexts of health, disease, agricultural or environmental data but may be quite 

applicable to other specific topics. 

As in the case of time series relationships, borrowing strength through spatial 

relationships adds further complexity to small area estimation and should only be

contemplated where statistical expertise is available. 

Multivariate Relationships 

In a univariate model the response or target variable is a single variable. In this manual 

the models referred to are univariate models. So using the example of disability type 

(physical, sensory, intellectual, psychological/psychiatric, head injury/acquired brain 

damage), a separate univariate model is fitted to each of the disability types. In a 

multivariate model, the target variable is a vector of these variables and the model is 

fitted to these variables simultaneously. 

A multivariate approach may be more efficient in terms of producing more accurate 

predictions if there are strong correlations between the constituent variables. For 

example, physical impairment may have a strong correlation with sensory impairment. A 

multivariate approach that takes advantage of this additional information should be 

more robust and give more accurate estimates. However, multivariate models add 

further complexity to small area estimation and should only be contemplated where

statistical expertise is available. 

3.2 Basic Conditions for Success 

The first step in undertaking a small area exercise is to determine the quality of the 

direct estimates and the auxiliary data at the small area level. The variable of interest is 

often drawn from a sample survey, which cannot provide reliable estimates at a fine level due to

small sample size in each small area and correspondingly high Relative Standard Errors

(RSEs). Auxiliary data can be obtained from many sources including administrative

datasets, survey variables and census counts. Table 3.1 outlines some issues that will 

help in determining whether the basic conditions for producing quality small area 

estimates are being met. 



Table 3.1: Recipe for Success

Small Area Size

o Ingredient: Each small area should have a reasonable sample. Few small areas should have no sample.
  Reason: The smaller the sample the harder it is to reliably discern the characteristics of individual small areas. More reliance is then placed on the assumption that the small area is similar to others. It also becomes more difficult to identify relationships either in the data or with auxiliary data. This will lead to lower quality small area estimates.

Variable of Interest

o Ingredient: Reasonably common population characteristic.
  Reason: Similar reason to small area size. In the context of household surveys, the rarer the characteristic the smaller the likely sample.

o Ingredient: Consistent estimates across small areas.
  Reason: Key assumption with simple synthetic models.

Model Specification

o Ingredient: Model is well-specified, meaning that all main determinants or explanators (auxiliary variables) for the target variable are included in the model, and the model reflects the correct form of the relationship between the target variable and the auxiliary variables (eg linear, quadratic, logistic etc) and that variance structures are accounted for correctly.
  Reason: Mis-specification may result in incorrect predictions and incorrect measures of the statistical reliability of those predictions.

Auxiliary Data

o Ingredient: Strong theoretical relationship between auxiliary variable and population of interest.
  Reason: Allows easy identification of potential auxiliary variables and aids in explanation of method to users.

o Ingredient: Statistically significant relationships between auxiliary data and small area estimates.
  Reason: Allows a reasonable small area model to be estimated.

o Ingredient: The auxiliary data has been accurately collected and maintained and uses similar scope and definitions to the survey data.
  Reason: Eliminates a further source of error that would otherwise impact upon the quality of the final small area output.

o Ingredient: No missing values.
  Reason: Missing values can bias estimates or cause model failure. Where possible ensure these have been accounted for before modelling.

o Ingredient: Compatibility of auxiliary data with census data in terms of consistency of definitions of variables, measurement, timing and other issues.
  Reason: Reduces further sources of errors caused due to inconsistency of definitions, measurement and other changes over time.

Confidentiality

o Ingredient: Maintain confidentiality standards.
  Reason: ABS mission statement provides an assurance concerning the confidentiality of the data it collects.



3.3 Choice of Small Area 

Within the ABS, the choice of small areas generally aligns with pre-specified boundaries 

as defined by the Australian Standard Geographical Classification (ASGC). Each area 

within Australia is broken up into Census Collection Districts (CDs) within Statistical 

Local Areas (SLAs) within Statistical Subdivisions (SSDs) within Statistical Divisions (SDs) 

within states. If possible, it is generally advisable to use ASGC classifications as they 

provide a consistent and integrated framework with a readily available set of 

concordances. However, other agencies often have different boundaries for their 

administrative areas. These boundaries generally line up with council boundaries which 

again line up with SLAs. Another common boundary is the postcode which can be 

related to a CD, although only approximately. A new geographical unit, called the 

meshblock, will be introduced in the 2006 population census for output purposes. The 

meshblock is considerably smaller than the CD, and with the help of the Geocoded 

National Address File (G-NAF) will improve the accuracy with which locations are coded 

to other ASGC classifications. In Section 2 we saw how it is important to find out from 

users the broadest area that will meet their small area requirements in order to improve 

the reliability of the modelled estimates. Figure 3.2 depicts the different choices that 

must be made to get reasonable estimates. 

Figure 3.2: Choosing the Appropriate Small Area 



As discussed in Section 3.1 under "Auxiliary data", a choice exists as to the level at which

small area models should be applied. In practice, users may require small area data at 

different levels of aggregation and it may be expedient to fit the model at the finest level, 

produce small area estimates at that level and let users aggregate those estimates up to 

the required levels of aggregation. However it is important to realise that this may run 

the risk of incurring what is known as the atomistic fallacy (PAHO, 2003). 

The atomistic fallacy occurs when trying to draw inferences about units defined at a 

higher level of aggregation from a model fitted at a lower level of aggregation. 

Relationships between lower levels of aggregated units may not be the same as those 

between higher level aggregated units. Hence if another model was fitted at the higher 

level, the estimated model parameters and the predictions may be quite different to 

those for the model fitted at the lower level. A similar fallacy, called the ecological 

fallacy, may occur in the reverse situation of fitting a model at a broad regional level and 

assuming that the inferences drawn can be readily applied to small areas within those 

regions. Wherever possible it is advisable to make model inferences (that is predicted 

estimates and their associated measures of accuracy) at the small area level required by 

users. Where small area estimates are to be aggregated the extent of the aggregation 

should be kept to a minimum. If small area estimates are produced at the LGA level, 

aggregation to user defined regions consisting of only a few LGAs may be acceptable but 

aggregation of many LGA estimates should be met with caution. 
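The following sketch gives a small numerical illustration of why a model fitted at one level of aggregation need not carry over to another. The data are entirely hypothetical and constructed so that the relationship among individual units differs from the relationship among area averages; a straight-line fit is assumed at both levels.

```python
import numpy as np

# Hypothetical unit level data: three small areas with three units each.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([3, 2, 1, 6, 5, 4, 9, 8, 7], dtype=float)
area = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])

# Slope of a straight line fitted to the individual units (lower level).
unit_slope = np.polyfit(x, y, 1)[0]

# Slope of a straight line fitted to the area averages (higher level).
areas = np.unique(area)
x_bar = np.array([x[area == a].mean() for a in areas])
y_bar = np.array([y[area == a].mean() for a in areas])
area_slope = np.polyfit(x_bar, y_bar, 1)[0]

print(f"unit level slope: {unit_slope:.2f}")
print(f"area level slope: {area_slope:.2f}")
```

With these values the unit level slope is about 0.8 while the area level slope is 1.0, and within each area the relationship is actually negative. Inferences drawn at one level would therefore be misleading if applied uncritically at another.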

In choosing the most appropriate small area to use, consideration needs to be given to 

the sample size in each small area. This needs to be sufficient so that the model can 

produce appropriately reliable estimates. The required sample size will depend on the

strength of the cross-sectional relationships or other sources for borrowing strength. If

these are quite strong then perhaps as few as ten or twenty units will be sufficient. In the

absence of strong relationships in the data, a larger sample size of perhaps a couple of 

hundred units may be required in each small area. The sample sizes referred to here 

should be interpreted as a very rough guide as, apart from model strength, the required 

sample size will also depend upon the variation of units within each small area. 

The number of small areas is also important especially if units are clustered within small 

areas. Generally having more small areas will help improve the goodness of fit of the 

small area model. Consideration should also be given to the geographical distribution 

of the sample through each small area. In ABS household surveys clustering is used in 

the sample design to help reduce costs, with the result that in remote areas all of the 

sampled dwellings in a small area may have been selected from one or more small towns 

and none from throughout the vast rural expanse. This is likely to result in bias if the 

characteristics of people in those towns are different to those in the remote rural areas. 

The allocation of the sample across small areas will often reflect the relative frequency 

with which the characteristic or variable of interest occurs in the population. For a

common sub-population, such as the number of persons employed or the number of

persons with a disability, the local government area (LGA) may be a suitable choice of

small area. For rare characteristics, such as Indigenous status or a particular type of

illness, larger areas such as the Statistical Subdivision (SSD) or broader regions may be

required to give reasonable estimates.

Of course the decision on which level of geography to choose for the small area will 



ultimately hinge upon user decision making requirements. It makes sense to choose 

small areas that are as close as possible to the areas used for program planning and

implementation. However, such areas are often really no more than administrative 

regions, chosen for pragmatic or logistical reasons such as transport costs or workforce 

management efficiency. Statistical units within these administrative regions are not 

necessarily homogenous with respect to the variable we are trying to calculate small area 

estimates for. If this is the case it may be worth considering (subject to the minimum 

sample size requirement) small areas at a finer level with greater homogeneity to obtain 

a better fitting model. Small area estimates at this level can then be aggregated to the 

required administrative regional level. 

For example, in the disability empirical study, disability programs are funded and 

administered at the level of Disability and Health Services Regions (DHSR) which are 

aggregations of usually a few LGAs. LGA was considered sufficiently close for modelling 

purposes while also having the advantage of sufficient sample sizes and higher level of 

homogeneity with respect to disability characteristics. 

Another example is that of producing small area estimates of water usage. One might 

consider using water catchment areas because that is the level required by users, 

however these are not always standardised across water and energy authorities. There is 

also the problem of geocoding ASGC classifications on which ABS data is based to the 

water catchment area. Water catchment areas can also be vast along major river systems, 

encompassing very different land uses, rainfall patterns and geological drainage features. 

3.4 Variable of Interest 

The variable of interest is typically measured from an ABS sample survey. This forms our 

dependent variable to build the small area model around. If the proportion of the 

population with a characteristic of interest is constant across broad geographic areas 

(e.g. assuming each small area within NSW has, say, the same rate of heart attacks), then

auxiliary data are not really needed and a simple technique such as the broad area ratio 

estimator will give good results. 

In practice, however, this will be a strong assumption to make. If we believe that small 

area proportions vary with other factors then auxiliary information will be required to 

build a model. The auxiliary data can help explain the variation between small areas and 

assist in creating quality small area estimates. 

Another point for consideration is that in many applications there will be not just one 

but a number of variables of interest requiring small area estimates. Auxiliary data may 

not be available for each of these and the strength of the relationship between each 

variable of interest and the available auxiliary variables may vary markedly. Prioritising 

the variables of interest with users will assist in focusing effort to improve the quality of 

those estimates that matter most. 

3.5 Quality of Auxiliary Data 

Potential auxiliary data should be evaluated for their relationship to the variable(s) of 

interest, both theoretically and statistically as well as the accuracy and reliability with 

which they have been collected. The theoretical relationship should emanate from 

tested social or economic theories. A careful examination should be made to understand 



any major differences between the auxiliary data and the variables of interest. 

Consideration should be given to the purpose for which the data was initially collected, 

how it was processed and edited, what conceptual definitions were used and the

scope of the auxiliary data holdings. This will allow appropriate auxiliary information to

be chosen to improve the model, aid in explaining to users what factors are driving the 

small area estimates and help pinpoint potential sources of error. 

In summary the following aspects should always be examined carefully when 

considering administrative data for use as auxiliary variables: 

o Population scope of the data

o Definitions of variables / concepts used

o Purpose for collecting data / what is it used for

o Reference period

o Questionnaire (or form) and collection methodology used to collect the data

o Survey design used

o Quality of the framework from which units were selected

o The extent of missing data. What, if any, imputation treatments were used?

o Classifications used

o Editing or data validation process used

In the disability study, auxiliary data was sourced from Centrelink on the number of 

people receiving the Disability Support Pension (DSP). In areas with a greater 

proportion of people receiving the DSP we would expect a higher incidence of disability. 

However, a person’s eligibility to receive the DSP is related to their ability to undertake

employment related activities, whereas the ABS Survey of Disability, Ageing and Carers 

(SDAC) concept of disability relates to a person’s ability to undertake a wide range of 

household, social as well as employment activities. 

There are a number of simple approaches for evaluating the strength of the statistical 

relationship between the variable of interest and the auxiliary data. The strength and 

statistical significance of this relationship can be analysed through simple scatter plots, 

correlations or simple models. Where substantial differences between the data do 

appear, it may be possible in some circumstances to improve the statistical relationship 

by the application of suitable adjustments or imputation methods to the auxiliary data to 

make it more comparable with the response variable. The aim of these adjustments may 

be to reduce the impact of scope or definitional differences or to treat outliers in the 

auxiliary data. Such adjustments may help to improve the statistical relationship between 

the auxiliary data and the response variable. However it is important that a statistician be 

consulted before applying such adjustments. 
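A first check of the statistical relationship can be as simple as a correlation coefficient computed across small areas. The sketch below is illustrative only: the function name and the area level figures are hypothetical and are not results from the disability study.

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation between an auxiliary variable and the target."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical area level data for five small areas
dsp_rate = np.array([0.02, 0.04, 0.05, 0.07, 0.09])         # proportion receiving DSP
disability_rate = np.array([0.10, 0.14, 0.15, 0.20, 0.24])  # direct survey estimates

r = pearson_correlation(dsp_rate, disability_rate)
print(f"correlation: {r:.3f}")
```

A value close to 1 or -1 suggests the auxiliary variable is worth testing in a model, although with direct estimates based on small samples the observed correlation will itself be subject to sampling error.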



3.6 Confidentiality 

Protecting the confidentiality of data provided to the ABS is of utmost importance and is 

enshrined in the Census and Statistics Act, 1905. The risk of breaches of confidentiality 

needs to be carefully assessed in the case of small area data releases, as such releases

naturally produce a higher level of detail than is normally the case. Hence care must be 

taken to ensure that the potential for identifying individual persons or businesses is 

greatly reduced. The risk of identification is increased when: 

o The population of interest is quite rare

o The geographic area is very small

o A major part of the small area estimate can be attributed to units with unusual characteristics (such as in the case of doctors in remote areas or the telecommunications sector)

The release of small area estimates should follow the standard ABS guidelines. While the

fine level of geography increases the risk of identification, this risk may to some extent 

be mitigated by the inherent smoothing of the data and additional model error 

introduced by the modelling process itself. However this does not mean that all caution

can be thrown to the wind. Most small area projects will be commissioned by external 

agencies and individuals in these or other agencies may be realistically expected to be in 

a position to obtain knowledge of the models used to produce the small area estimates. 

Such information could possibly be used to identify individuals. Another issue is that 

although most small area estimates will be modelled and hence incur model error and

further smoothing, there is the risk that an individual is correctly identified from the data 

although using incorrect logic. There will still be a public perception that the Act has 

been breached. In conclusion, all possible steps to avoid disclosure should be taken in 

preparing small area data for release and the Data Access and Confidentiality 

Methodology Unit should be consulted prior to release. 



4. Choice of Small Area Techniques 

4.1 Types of Small Area Estimation Techniques 

In this section we discuss some of the more common techniques available for small area 

estimation. We consider these techniques under the general headings of "Simple Small 

Area Methods" (Section 4.1.1) and "Regression Methods" (Section 4.1.2). Although the

methods discussed under Section 4.1.1 can be formulated in terms of a regression 

model, and hence would conceptually belong under Section 4.1.2, we have treated them 

separately because they are: 

1. simple to implement and require less statistical expertise. They are also commonly used to produce small area estimates by many government agencies.

2. often appropriate as an initial exercise to obtain "rough" small area estimates, before attempting more rigorous techniques.

4.1.1 Simple Small Area Methods 

Here we discuss the simpler methods that involve weighted survey estimates derived for 

a given level of geography that can be applied without the explicit application of 

statistical models. These methods include the: 

o Direct Estimator

Direct estimates are classical design-based estimators that are obtained by 

applying survey weights to the sample units in each small area (Saei and 

Chambers, 2003). Since most ABS surveys are designed to provide reliable 

estimates only at the national or state levels, sample sizes are often too small at 

the small area level to produce reliable direct estimates. Small area estimation 

is therefore concerned with alternative techniques that can produce small area 

estimates with higher accuracy than that of direct estimates. 

o Broad Area Ratio Estimator (BARE)

This estimator is one of the simplest types of synthetic estimators. It is 

calculated by pro-rating a broad area direct estimate by the ratio of the small 

area to broad area populations. This estimator applies the reliable broad area 

estimate proportionately across all small areas contained in the broad region. 

The success of the BARE estimator hinges largely on the choice of the broad 

area. The broad area needs to be chosen large enough to afford a direct 

estimate that is sufficiently reliable but small enough that all small areas within 

the broad area are sufficiently homogenous in the characteristic of interest. It 

is important to note that if small areas are in fact not homogenous within the 

broad region then the BARE will be biased. In practice this is difficult to verify,

so caution should be exercised when using the BARE. It should only be

used when users are aware of, and fully prepared to accept, this assumption.



o Calibration Estimator

To produce calibration estimators, the original survey weights (usually the 

inverse probabilities of inclusion in the sample) are replaced with new 

"calibrated" weights that are in some sense as close as possible to the original 

weights, but are calibrated on some auxiliary variable available for the 

population (Chambers, 2005). The small area estimate for this auxiliary 

variable, calculated using the calibrated weights, will agree with the known 

population totals. A simple example of calibration is where population age by 

gender demographic totals are known for each small area. The survey weights 

are then adjusted so that estimates of population count by age and gender

agree with the known population counts. 

Two points should be noted about the calibration estimator. Firstly, it is
straightforward to put into production because the calibrated weights can be stored
on the survey file and used to produce estimates at the desired level of aggregation.
Secondly, the auxiliary variables should be chosen with care and should be related to
the variables for which estimates are required. If the calibrated weights are used to
produce estimates for variables that are not related to the auxiliary variable(s) used
in determining the weights, the resulting estimates may be biased. In general,
calibrated estimates possess good design-based properties. Government statisticians
have historically preferred the design-based approach to the model-based approach
because the resulting estimates are not subject to the consequences of model
mis-specification.

4.1.2 Regression Methods 

Where a higher level of accuracy is required for small area estimates, an alternative is
to use regression or model-based approaches. These methods, however, require a higher
level of statistical expertise to implement and to interpret. A wide variety of
regression techniques is available but, for the purposes of this manual, they are
divided into two main categories: synthetic and random effects regression models.


Synthetic Regression Models 

Synthetic regression models make use of available auxiliary data to 

mathematically express a deterministic relationship between those auxiliary 

variables and the target (response) variable we are trying to predict in each 

small area. Synthetic models assume that all the systematic variability in the 

response variable is explained by the variability in the values of the auxiliary 

variables. The remaining variability, which is referred to as the "random noise" 

or "stochastic variation", is represented by the difference between the 

predicted value for the response variable under the model and the value 

observed from the data. These differences are called random errors, residuals 

or disturbances. 

In the case of small area models, synthetic models assume that the same deterministic
relationship between the variable of interest and the auxiliary variables holds across
a range of small areas, for example all small areas within a state.



Synthetic models work well when all relevant auxiliary variables that help 

predict the response variable are available, accurate and can be included in the 

model. However, in practice this is more the exception than the rule.
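A minimal sketch of a synthetic model follows, assuming a single area-level auxiliary variable and a simple linear relationship fitted by ordinary least squares. The data, the variable names and the choice of one covariate are all hypothetical; a real application would normally use several auxiliary variables and a statistical package.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical sampled areas: auxiliary value and direct estimate for each.
aux_sampled = [10.0, 20.0, 30.0, 40.0]
direct = [2.1, 3.9, 6.2, 7.8]
intercept, slope = fit_line(aux_sampled, direct)

# The same fitted relationship is applied to every small area,
# whether or not the area contributed sample (the synthetic assumption).
aux_all = {"A": 10.0, "B": 25.0, "C": 50.0}
synthetic = {area: intercept + slope * x for area, x in aux_all.items()}

# Residuals: observed minus predicted for the sampled areas.
residuals = [yi - (intercept + slope * xi)
             for xi, yi in zip(aux_sampled, direct)]
```

Area C has an auxiliary value outside the range of the sampled areas, so its prediction rests entirely on the assumption that the fitted relationship continues to hold there.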


Random Effects Regression Models

When fitting a synthetic model, the residuals should look like "white noise". In
practice, however, they often display significant between-area variation, which
indicates that there is some other systematic variation in the response variable
between small areas that is not accounted for by the auxiliary variables. This implies
that the synthetic model is missing certain auxiliary variables which, had they been
available, would have helped to predict differences between small areas.

This problem can be addressed by incorporating a random effect into the model. This is
done by treating the constant or intercept term in the model as a fixed constant plus
a random component known as the random effect. In effect, each small area is assigned
its own intercept term, which is allowed to vary from one small area to another around
some overall constant value. This is usually sufficient to take account of
between-area variation. It is also possible to include a random effect in a regression
coefficient rather than in the intercept term, but doing so adds further complexity
and is not covered in this manual.

For models fitted to small area level data, the inclusion of random effects may give a
distinct advantage over the synthetic model approach, possibly leading to estimates
with higher precision and robustness. In the case of linear models, random effects
models can be shown theoretically to give small area estimates that reflect the best
trade-off between the accuracy of the direct estimate and the uncertainty associated
with the synthetic model. For a small area that has a low sampling error (e.g. because
of a large sample size) relative to the total error (the sum of the sampling error of
the direct estimate and the synthetic model error), a random effects model will give
more weight to the direct estimate for that small area. On the other hand, for a small
area with a high sampling error, more weight will be given to the model-based
estimate, as this will be more reliable.
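The weighting behaviour described above can be sketched numerically. The manual does not give a formula, so the form below is an assumption: it is the standard composite for an area-level linear model in which the sampling variance of the direct estimate and the variance of the random effect (the synthetic model error) are treated as known. The function and argument names are hypothetical.

```python
def composite_estimate(direct, synthetic, sampling_var, model_var):
    """Weighted combination of a direct and a synthetic estimate.

    sampling_var -- variance of the direct estimate for this area
    model_var    -- variance of the area random effect (model error)
    Returns the composite estimate and the weight given to the direct estimate.
    """
    total_var = sampling_var + model_var
    if total_var <= 0:
        raise ValueError("total variance must be positive")
    # The weight on the direct estimate grows as its sampling variance shrinks.
    gamma = model_var / total_var
    return gamma * direct + (1.0 - gamma) * synthetic, gamma


# Hypothetical area with a large sample (low sampling variance).
well_sampled, w_large = composite_estimate(10.0, 14.0, sampling_var=1.0, model_var=4.0)

# Hypothetical area with a small sample (high sampling variance).
poorly_sampled, w_small = composite_estimate(10.0, 14.0, sampling_var=16.0, model_var=4.0)
```

With the same direct and synthetic estimates, the well sampled area stays close to its direct estimate while the poorly sampled area is pulled towards the synthetic prediction.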

Although they are more complex than synthetic models, random effects models can be
estimated using a variety of statistical techniques. However, because of their
technical nature, this manual does not go into further detail about how to apply
them. A more detailed treatment will be given in the forthcoming technical manual.



4.2 The Modelling Framework 

Figure 4.1 presents a schematic representation of the small area modelling framework
followed in this manual. Figure 4.2 complements Figure 4.1 by listing key questions
intended to make decisions about small area modelling reasonably systematic. The
objective of these questions is to help the modeller/analyst better understand the
modelling framework (Figure 4.1) and hence choose the most appropriate technique for a
given set of data. These are not, however, the only questions that need to be raised
in this kind of exercise.

The left-hand side of Figure 4.1 shows the simplest small area methods, the Direct and
Broad Area Ratio estimators, which are frequently used in the absence of good quality
auxiliary data. The answer to question 1 of Figure 4.2 is important because good
quality auxiliary data is a key requisite for proceeding to the regression-based small
area estimators. We take good auxiliary data to mean area-level and/or unit-level data
that can be expected, both theoretically and empirically, to be correlated with the
variable of interest. Section 3.5 discusses some of the ways in which the quality of
auxiliary data can be determined. The quality of the auxiliary data has a large
bearing on the reliability of model predictions for the variable of interest. In other
words, when good quality auxiliary data is available, one can choose among a number of
regression-based estimators that “borrow strength” from the relationship between the
variable of interest and the auxiliary data, thereby improving the quality of the
small area estimates/predictions.



Figure 4.1: Small Area Modelling Framework

[Diagram not reproduced. Labels: Small Area Methods, ordered from less complex to more
complex; Simple Small Area Models (Direct Estimator, Broad Area Ratio Estimator); With
No Auxiliary data / With Auxiliary data; Regression based Models (Synthetic Regression
Models, Random Effects Models); Area Level Analysis / Unit Level Analysis; Linear
Models for continuous data; Generalised Linear Models for count data (Poisson model)
and binary data (logistic model); Univariate Analysis / Multivariate Analysis.]

The classes of regression-based estimators are shown on the right-hand side of Figure
4.1. These estimators fall into two major categories: synthetic regression models and
random effects models, the latter being relatively more complex than their synthetic
counterparts. For the moment let us focus on the synthetic models. Once this choice is
made, the next choice is between a linear and a generalised linear model. The linear
model, which is the simplest of all, is suitable if the variable of interest is
continuous (e.g. income or age). If the variable of interest is not continuous (binary
or count data), one can select appropriately from a wide range of generalised linear
models. The most common examples are the logistic and Poisson models, which are used
to model binary and count data respectively.



Clearly, as indicated in questions 2 and 3 of Figure 4.2, the choice of any of these or
other models depends on the following important interrelated factors:

i. the level at which the small area estimates are required: are they required at area
level or for some other sub-population, such as age by sex groups?

ii. the nature of the auxiliary data available that is related to the variable of
interest, including whether the data is at unit level (person level), area level or
both.

iii. the nature of the variable of interest, i.e. whether it is continuous, binary or
count data.

iv. users' quality requirements for small area estimates.

v. access to statistical expertise.

Small area models can be fitted at either area level or person level. Area-level models
are fitted when the variable of interest and the associated covariates in the auxiliary
data are observed at the level of the specific geographic area; this is referred to in
Figure 4.1 as area-level analysis. A unit/person-level analysis, on the other hand,
refers to a model that makes use of individual (unit-level) data. When a model is
fitted using unit/person-level data, the predictions based on this model must be
aggregated to produce area-level estimates. It is also possible to fit a
unit/person-level model involving both individual and area-level covariates.
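The aggregation step for a unit-level model can be sketched as follows. The sketch assumes, purely for illustration, that predicted probabilities are available for population cells (for example age by sex groups) within each small area and that the population count of each cell is known. The tuple layout and the numbers are hypothetical.

```python
from collections import defaultdict


def aggregate_to_area(cell_predictions):
    """Population-weighted average of unit-level predictions by area.

    cell_predictions -- iterable of (area, predicted_probability, population)
    Returns a dict mapping area -> predicted proportion for the area.
    """
    expected = defaultdict(float)    # expected number with the characteristic
    population = defaultdict(float)  # total population of the area
    for area, prob, pop in cell_predictions:
        expected[area] += prob * pop
        population[area] += pop
    return {area: expected[area] / population[area] for area in expected}


# Hypothetical predictions for population cells in two small areas.
cells = [
    ("A", 0.10, 400),
    ("A", 0.30, 100),
    ("B", 0.20, 250),
]
area_proportions = aggregate_to_area(cells)
```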

Choosing the right model for the right type of data is crucial in the modelling
process. For example, if the auxiliary information consists of data observed at area
or unit level and the variable of interest is continuous, then it is appropriate to
use a linear model to estimate the variable of interest. Alternatively, if we have
unit-level data where the variable of interest is binary (e.g. 1 = person has a
disability and 0 = person has no disability), as is the case in many small area
models, then we would choose a model that captures the binary nature of the
observations, such as the logistic regression model. Similarly, if our data provides,
say, area-level counts of people with a disability, then a suitable choice would be
the Poisson model, which is appropriate for count data. It is also possible to use two
or more models (e.g. unit-level and area-level models), provided that the dataset is
amenable to such analyses. For instance, as we will see in the examples of Section 5,
the logistic and Poisson models are used to predict person-level and area-level
disability proportions respectively.
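The model choices described above can be summarised as a small lookup. The sketch below is only an aide-memoire under stated assumptions: the function name, argument names and returned labels are hypothetical, and a real decision would also weigh user requirements, data quality and available expertise.

```python
def suggest_model(has_good_auxiliary, variable_type,
                  unexplained_area_differences=False):
    """Suggest a candidate small area method.

    has_good_auxiliary           -- is good quality auxiliary data available?
    variable_type                -- 'continuous', 'binary' or 'count'
    unexplained_area_differences -- are major differences between areas
                                    likely to remain after using the
                                    auxiliary data?
    """
    # Without good auxiliary data, only the simple methods are candidates.
    if not has_good_auxiliary:
        return "direct or broad area ratio estimator"
    families = {
        "continuous": "linear model",
        "binary": "logistic model",
        "count": "Poisson model",
    }
    if variable_type not in families:
        raise ValueError("variable_type must be continuous, binary or count")
    kind = "random effects" if unexplained_area_differences else "synthetic"
    return f"{kind} {families[variable_type]}"


# Hypothetical case: area-level disability counts with good auxiliary data.
choice = suggest_model(True, "count")
```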



Figure 4.2: Key Questions for Small Area Modelling

Q1. Do you have good quality auxiliary data?
    If No: Simple Direct or Broad Area Ratio Estimators are the likely candidates.
    If Yes: Use Linear or Generalised Linear models, depending on your data.

Q2. Is the variable of interest continuous, binary or count data?
    If continuous data: Linear model
    If binary data: Logistic model
    If count data: Poisson model

Q3. Shall I use an area-level or unit-level model or both?

Q4. At what level is my auxiliary data available and of good quality?
    Good area-level or unit-level continuous data
    Good unit-level binary data
    Good area-level count data

Q5. Are there likely to be major differences between small areas that are not taken
into account by the auxiliary data?
    If Yes: Use random effects model.
    Consult methodology staff for technical advice.

The next key question (question 5 of Figure 4.2) is when and why random effects models
should be used in preference to synthetic models. To start with, the preceding
discussion on the choice of models (linear versus generalised linear) applies to
random effects models as well. However, random effects models differ in that they
include an additional error component to account for differences between small areas
or units that are not explained by the auxiliary variables. Synthetic models, by
contrast, assume that the variable of interest can be determined from the same
functional relationship with the auxiliary variables, and that this relationship
applies across all small areas.

This assumption, however, could be restrictive for a number of reasons. For example, in
the disability data some small areas are located in remote regions with limited
support facilities and services, while others are in big cities with better
infrastructure and services, to which people with a disability may move to take
advantage of the improved services. Some areas may have a larger population of
Indigenous people than others, which again may affect disability rates in different
areas. Still others are located in coastal areas that attract people of retirement age
and the elderly. These factors are not fully accounted for in the auxiliary data.
Thus, unless these and other factors are taken into account in the model, they could
limit the predictive ability of synthetic models for some small areas or units. Such
differences therefore call for a more general and flexible specification of the model
that captures the area-specific (or person-specific) factors remaining after the
auxiliary variables have been taken into account, and hence for random effects models.
The choice between random effects models and synthetic models could be made on the
basis of one or more of the following factors:

i. prior knowledge of small areas or units vis-a-vis the auxiliary data, gained from
experience or through discussions with subject matter specialists, users/stakeholders,
etc. (for example, we may not have a lot of faith in our auxiliary variables or in the
synthetic model).

ii. statistical outcomes based on the models: a close assessment or evaluation of the
small area estimates/predictions from comparative synthetic and random effects models,
to see whether they meet expectations.

iii. statistical/econometric tests (a battery of diagnostic and statistical tests) of
the adequacy of the models.

iv. a wish for small areas with large samples to be less affected by the model, because
the direct estimates for such areas can be expected to be quite reliable in their own
right. The random effects model allows for a suitable trade-off between the
reliability of the direct estimates and the reliability of the model estimates.

v. a wish to apply the model to areas with no sample in them (out-of-sample areas).
Random effects models allow for greater flexibility in applying the model to make
predictions for areas other than those to which it was fitted.

Clearly, once random effects models are chosen, they require a higher level of
statistical skill and some familiarity with specialised software. It is also true that
more complex models may not necessarily provide better results. This is particularly
so if sufficiently strong relationships from which to borrow strength are simply not
present in the data. One should be aware that results from simple models may be as
good as those from complex ones. In other words, as will be discussed later in this
section, the gains in efficiency of estimates from using more complex models need to
be assessed.

An important aspect of the modelling process, which may also have a significant bearing
on the complexity and quality of the analysis, is whether the variable of interest
calls for a univariate or a multivariate analytical framework. For example, in the
disability study, if we simply wish to predict whether or not a person has an
impairment (i.e. 1 = person has a disability and 0 = person has no disability),
regardless of the type of impairment, then the analysis is within a univariate
framework. On the other hand, a breakdown of the variable of interest by type of
impairment (e.g. physical, mental, sensory) would involve a multivariate framework.
The real issue here is that, while a univariate analysis is simpler to undertake, a
multivariate analysis provides an opportunity to exploit additional information on the
correlations that exist between the various types of impairment and hence to improve
the reliability of estimates.



4.3 Trade-off between Quality, Cost, Time and Effort 

In this manual, the term ‘quality’ is used to indicate the overall level of accuracy, 

acceptability and reliability of small area estimates, both from a statistical point of view 

and in terms of providing a more informed and reliable decision making capability for 

users. More specifically, we borrow from the characterisation of quality as having six 

dimensions, these being: relevance, accuracy, timeliness, accessibility, interpretability 

and coherence (Allen, 2001). The ABS has a strategic policy of ensuring the quality of all 

its output and clearly demonstrating that quality to users (ABS, 2002), and this is 

particularly relevant and important for the production and release of small area 

estimates. 

A key aspect that needs to be taken into consideration is whether the gains in terms of 

quality of outputs from using more complex methods outweigh the time, costs and 

effort required to generate, interpret and validate the results. Regardless of the degree of 

sophistication contemplated at the outset, the small area practitioner is well advised to 

commence with simpler techniques (say, synthetic models). Should resources and user 

requirements permit, more rigorous statistical techniques may be applied in stages 

resulting in a choice of competing models (say, fitting both synthetic and random effects 

models to the data). Choosing the best model in light of expert knowledge and 

informed judgment would lead to improved results and decision making outcomes. 

Figure 4.3 below provides some indication of the trade-off between quality, cost, time
and effort in small area modelling. It should be clear that these relationships are
not linear and that one cannot authoritatively represent them in a simple
two-dimensional diagram like this. The purpose, however, is to give a rough idea of
the kind of relationships that may exist between quality and cost/time/effort. In
Figure 4.3, quality is represented by the vertical axis and could range from, say, low
to high levels of quality. The horizontal axis represents cost/time/effort combined.
The three terms (cost/time/effort) are presented in this way because they are
interrelated and each is implicitly defined by the others. In simple terms, it is
assumed that increased effort implies a longer time frame and presupposes more
resources and higher costs.



Figure 4.3: Trade-off between Quality and Cost / Time / Effort

[Diagram not reproduced. Vertical axis: Quality. Horizontal axis: Cost/time/effort. The
curve runs from simple models to complex models and is annotated with: Good auxiliary
data; Level of precision; Finer disaggregation. Issues shown above the curve:
Statistical expertise; Robustness of results; Understanding results; Interpretability;
Validity of assumptions; User requirements; Availability of resources;
Timeliness/deadlines.]

As you can see from Figure 4.3, good quality auxiliary data is a crucial prerequisite for 

obtaining quality small area estimates. Quality is of course a relative term and depends 

very much upon the clients’ decision making requirements. 

Assuming that we have good quality auxiliary data, we would expect more sophisticated
methods to provide results of a higher level of quality, as indicated by the upward
slope of the cost-quality curve. The same curve also indicates that somewhere along
the continuum there is an optimal point (a level of precision) beyond which any
additional effort/cost/time has either marginal or declining effects on quality. More
elaborate techniques may give only marginal improvements in accuracy while decreasing
timeliness, an important dimension of quality. Overall quality may also be eroded when
ever smaller areas or finer disaggregations of the data are demanded. For example, in
the disability analysis, disaggregating disability by type of impairment, level of
severity and age group, in addition to the small area level, leads to poor quality
estimates, especially for the rarest impairment types, such as sensory.

There are also other issues, shown just above the cost-quality curve in Figure 4.3,
that may have a significant bearing on the quality and cost of small area estimates.
For example, the use of more complex models may require a higher level of subject
matter knowledge and expertise to assist in understanding and interpreting model
results. Such knowledge is also important for testing the validity of the assumptions
inherent in the model and for checking the robustness and sensitivity of model
results.

Finally, two important points have to be made in relation to the quality versus cost
issue discussed above. Firstly, simplicity is an important aspect of quality in that
it aids the interpretability of small area output. We do not intend to imply from
Figure 4.3 that simpler methods always produce poor quality estimates. More complex
methods should only be attempted where there are likely to be demonstrable gains in
the accuracy of small area estimates. Secondly, the use of sophisticated methods may
not necessarily lead to higher costs. For instance, once a strong analytic capability
to undertake small area estimation has been established (in terms of statistical skill
and other resources), any increase in cost, effort and time for undertaking complex
small area methods may be marginal.
