SAE Manual Sections 1 to 4_1 (May 06).pdf - National Statistical ...
A Guide to Small Area Estimation - Version 1.1
Key Clients: National Statistical Centres and Client Services
May 2006
A Guide to Small Area Estimation - Version 1.1 05/05/2006
The ABS intends to periodically update this manual and would therefore welcome any comments and suggestions from users. Readers who would like more information or who would like to forward comments on this manual may contact any of the following ABS officers:
Location        Contact Name     Phone Number      Email address
Central Office  Daniel Elazar    +61 2 6252 6962   daniel.elazar@abs.gov.au
NSW             Edward Szoldra   +61 2 9268 4214   edward.szoldra@abs.gov.au
QLD             Brett Frazer     +61 7 3222 6028   brett.frazer@abs.gov.au
QLD             John Preston     +61 7 3222 6229   john.preston@abs.gov.au
SA              Justin Lokhorst  +61 8 8237 7476   justin.lokhorst@abs.gov.au
SA              Philip Bell      +61 8 8237 7304   philip.bell@abs.gov.au
TAS             Keith Farwell    +61 3 6222 5889   keith.farwell@abs.gov.au
VIC             Elsa Lapiz       +61 3 9615 7364   elsa.lapiz@abs.gov.au
WA              Carl Mackin      +61 8 9360 5250   carl.mackin@abs.gov.au
Contents
1 Introduction
1.1 What are Small Area Estimates?
1.2 Background to the Small Area Practice Manual
1.3 Purpose
1.4 What are the primary uses for Small Area Estimates?
1.5 When should Small Area Estimates be Produced?
2 Assessing User Requirements
2.1 User Requirements
3 Some issues in Small Area Estimation
3.1 Sources of Additional Information
3.2 Basic Conditions for Success
3.3 Choice of Small Area
3.4 Variable of Interest
3.5 Quality of Auxiliary Data
3.6 Confidentiality
4 Choice of Small Area Techniques
4.1 Types of Small Area Estimation Techniques
4.1.1 Simple Small Area Methods
4.1.2 Regression Methods
4.2 The Modelling Framework
4.3 Trade-off between Quality, Cost, Time and Effort
5 Case Studies of Small Area Applications
5.1 Simple Small Area Models
5.1.1 Broad Area Ratio Estimator with No Auxiliary Data
5.1.2 Broad Area Ratio Estimator with Auxiliary Data
5.2 Regression Based Models
5.2.1 Overview
5.2.2 Framework for Regression Based Models
5.2.3 Regression Based Synthetic Estimates
5.2.4 Generating Small Area Estimates from Person Level Models
5.2.5 Discussion of Examples 1-4
6 Diagnostics For the Quality Small Area Estimates
6.1 Introduction
6.2 Diagnostics From Case Study
6.3 Assessment of Models Against Diagnostics
7 Communicating Quality to Users
7.1 Introduction
7.2 Sources of Error
7.3 Impact of Errors
7.4 Explanatory Notes
8 Summary
8.1 Points to Consider
8.2 What Areas of the ABS Provide Small Area Estimates
9 Frequently Asked Questions
APPENDICES
Appendix 1: List of Previous Small Area Work
Appendix 2: Technical Notes of Estimators
Appendix 3: SAS Datasets and Codes
Appendix 4: Diagnostics Graphs
Appendix 5: Explanatory Notes
Appendix 6: Quality Declaration
BIBLIOGRAPHY
LIST OF ACRONYMS
1. Introduction
1.1 What are Small Area Estimates?
Most ABS surveys are designed to provide statistically reliable, design based estimates only at the national and/or state/territory geographic levels. The practical difficulties and cost of conducting sample surveys that would provide reliable estimates at levels finer than state/territory are generally prohibitive, both in terms of the increased sample size required and the added burden on providers of survey data (respondents). For the purposes of this manual, small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to estimate with precision using direct survey estimation methods. By direct estimation we mean classical design based survey estimation methods (Saei and Chambers, 2003) that utilise only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small sample sizes and to produce small area estimates that improve upon the quality of direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model (see footnote 1) to describe these relationships. (See Sections 4.1 and 4.2 for further discussion.)
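To illustrate why direct estimates become unreliable in small areas, the following is a minimal sketch in Python. It assumes simple random sampling with no survey weights and no finite population correction, which is a simplification of a real ABS survey design; the sample figures and the function name direct_estimate are hypothetical.

```python
import math

def direct_estimate(sample):
    """Direct estimate of a proportion and its standard error.

    Uses only the 0/1 responses observed in one area.
    Assumption: simple random sampling, no weights, no finite
    population correction.
    """
    n = len(sample)
    p_hat = sum(sample) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

# Hypothetical samples: 10% of respondents report the characteristic.
state_sample = [1] * 200 + [0] * 1800   # state level, n = 2000
area_sample = [1] * 2 + [0] * 18        # one small area, n = 20

state_p, state_se = direct_estimate(state_sample)
area_p, area_se = direct_estimate(area_sample)

print(f"state: p = {state_p:.3f}, SE = {state_se:.4f}")
print(f"area:  p = {area_p:.3f}, SE = {area_se:.4f}")
```

Both samples give the same point estimate, but the standard error for the small area is ten times larger because it rests on one hundredth of the sample. This loss of precision is the problem that small area methods aim to address.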
Although conceptually similar, small domain estimates refer to estimates disaggregated to fine classificatory levels, such as by socioeconomic status, income, labour force status or industry. We have not undertaken any empirical study of small domain estimation methods for this manual, although intuitively we would expect most techniques covered in this manual to still apply. The empirical analysis in this manual is based on knowledge and experience derived from only one empirical study: a study of the incidence of disability in Australia. This study uses data from the Survey of Disability, Ageing and Carers (SDAC) (see ABS (2003) for more details).
Footnote 1: A statistical model is a mathematical representation of the relationship we assume to exist between the variable we are interested in predicting (known as the response or dependent variable) and other associated variables (known as the auxiliary, explanatory or independent variables). A model is then fitted to data that contain observed values for both the dependent variable and the auxiliary variables for each unit. The fitting process produces estimates of the model parameters such as intercepts and slopes. The unit here may be a person, a business or a small area itself, depending upon the level at which we wish to fit the model. The model also includes one or more error terms to describe the degree of stochastic or random variation with which predicted values for the response variable deviate from the observed values.
1.2 Background to the Small Area Practice Manual
The small area practice manual project was developed to give a simple and clear guide on how to undertake small area estimation. The ABS has previously carried out a number of small area estimation projects (see Appendix 1). In recent years user demand for these kinds of statistics has increased. Most of this increase in demand has become apparent during specific consultations between the ABS and key government users to gauge and assess users’ medium and long term statistical data requirements. Consolidated examples of these can be found in the Information Development Plan (IDP) (ABS, 2005) (Catalogue 1362.0, to be released in early 2006) and State Statistical Priorities (ABS Corporate Information - State Statistical Forum, 16 February 2005).
This reflects the growing statistical sophistication of users. In addition, local government bodies such as cities, councils and shires are taking on a greater role in the long term planning and socio-economic development of their regions. This increase in demand for small area data is occurring globally. In response, more advanced methods for producing reliable small area statistics are being developed and are gaining methodological acceptance. This was recognised at an international conference on small area statistics held in Riga, Latvia in 1999, where the Deputy Australian Statistician encouraged National Statistical Organisations to make greater use of model-based methods to produce small area statistics (Trewin 1999). The paper also noted that explaining quality is an especially important issue for a National Statistics Office when producing these types of estimates and products.
Various areas within the ABS have been involved in the provision of small area estimates, with varying levels of sophistication in both the methods used and the quality of the estimates produced. Table A.1 of Appendix 1 contains a selection of the major pieces of small area work that have been conducted to date. In addition, there has been no definitive set of clear, ABS wide guidelines on how to assess the quality of small area estimates or on the agreed minimum level of quality required before releasing small area statistics to external clients. In other words, what needs to be developed is a cohesive, coordinated approach to the production of small area estimates.
There is a strong need to set up a framework for the practice of small area estimation at all levels of involvement in the small area statistical process. These include client services areas in regional offices or Central Office (CO), Methodology Division, National Statistical Centres and senior managers responsible for clearing and releasing small area output. Such a framework is important for ensuring that consistent practices are used across the ABS in producing small area estimates and that these practices accord with best practices used both in the ABS and in statistical agencies overseas.
A consistent approach to the production of small area estimates is important for the following reasons:
o the ABS needs to understand users' small area needs more precisely, i.e. how they utilise small area estimates in their decision making. Getting this right at the outset will ensure effort is efficiently directed to producing small area estimates that are fit for purpose.
o to ensure that small area estimates are produced with sufficient quality and are appropriate to user requirements.
o to ensure that users fully understand the assumptions and conditions underpinning output data and the fitness for use.
o to ensure small area estimation methodologies are sound, robust and practicable for a large range of small area estimation problems.
Linked to this is the broader issue of the circumstances in which the ABS should or should not be producing small area estimates. These decisions need to be made by determining the risk that the provision of such data will detract from informed decision making.
This manual has been prepared on the basis of work done on small area estimates of disability. Although the wording of the manual inadvertently reflects the context of a household based population survey, the small area methods described can also be applied in the context of economic/business collections. As further empirical studies are undertaken in other data contexts, we anticipate that the manual will be expanded and adapted to include examples relating to economic data.
1.3 Purpose
This volume of the manual is the first of two volumes; the second will contain a more technical treatment. This volume aims to provide a simple non-technical guide on the production, uses, quality and validation of small area estimates. The intended audience includes survey practitioners, consultants, methodologists and users of small area data.
The broad objectives of the Small Area Estimation Practice Manual are as follows:
o To build a stable bridge between the knowledge, the theory and the practice of small area estimation while taking account of ABS priorities and policies with regard to the production of small area statistics. This should result in a more consistent and quality assured approach to producing small area estimates within the ABS.
o To realise a quantum increase in the level of ABS knowledge and understanding of small area estimation techniques, how and under what conditions they can be applied, and how to measure or assess the quality of the small area estimates produced.
o To provide coherent, relevant, accurate and accessible information on small area estimation practices and techniques which is used regularly by its intended audiences and updated to reflect increases in knowledge and understanding.
o To ensure that practitioners within the ABS have a clear understanding of the quality and assumptions underpinning the small area estimates produced and that these are clearly communicated to users so that small area estimates are used appropriately and for the purposes intended.
Intended audience
o This guide aims to give advice for National Statistical Centres and regional offices on how to advise on, respond to and incorporate small area estimates into their work, so they can apply simple models themselves and know when to draw on methodological skills for more complex models.
o A second volume of the manual will cover the more technical aspects of small area estimation and will be primarily aimed at methodologists and technical analysts involved in producing modeled small area estimates. The technical manual will cover in more detail the methodological and statistical issues that arise in small area estimation.
o This manual contains material on the application of basic statistical models. It therefore assumes the reader has a basic familiarity with the theory and application of such models. Some parts of the manual contain references to somewhat more advanced methods. In such instances warning boxes strongly recommend that the reader obtain further methodological advice from Methodology Division (ABS) before applying such techniques.
What it is - A Guide to:
o what issues need to be thought through before undertaking a small area exercise,
o the methods and techniques available in small area estimation, and the relative advantages, disadvantages and assumptions involved in each,
o who to talk to, who has already put specific approaches into practice and where to find relevant documentation,
o the trips and traps of putting various techniques into practice,
o how best to measure the reliability of small area predictions,
o how to detect model misspecification and what diagnostics are available for assessing the overall quality of small area estimates.
What it is not
o An up-to-date encyclopedia of all the literature on small area techniques. The focus of this manual is much more on the practice of small area estimation in the production of government statistics. Compiling and maintaining an up-to-date summary of the technical literature would be highly resource intensive, as the field is relatively new and rapidly evolving. It would also make the manual more difficult for the practitioner to access.
Finally, we emphasise that this manual has been written under the assumption that the primary goal of small area data users is to obtain descriptive statistics of the relative characteristics of small areas, rather than to obtain the form of some dynamic structural process which generates those small area characteristics. The manual is therefore premised upon a descriptive framework for the ultimate decision making objectives, even though analytical methods are used to construct the models used to predict those small area characteristics. In other words, we assume users are primarily interested in the predictions from those models, not just the form and structure of the models per se.
1.4 What are the primary uses for Small Area Estimates?
Federal, state and local government bodies involved in program funding / evaluation or regional planning are typically the primary users of ABS small area data. They require estimates of specified accuracy to assist them in making informed decisions on how to allocate resources or apply for additional resources. The need for government services to justify their decision making and be accountable to the community is seen as a very important factor.
Small area estimates are often used by program administrators to determine or benchmark their funding allocations. Without small area information, administrators have difficulty in assessing the actual need for goods and services in each area. This can result in undesirable scenarios such as "the squeaky wheel gets the grease", whereby the interest groups or areas which are most vocal receive a greater share of the funding allocations. Small area estimates provide detailed information on each area, allowing for objective and informed decision making.
Local government demand for small area data has also increased as local governments become increasingly aware of, and interested in, the role statistics can play in informing them about what is happening in their own jurisdictions.
1.5 When should Small Area Estimates be Produced?
Small area estimates should only be produced when there is strong and justified user demand and no alternative data at the small area level that will serve the required purpose. In addition, there needs to be adequate survey and auxiliary data to ensure that the outputs produced will be of sufficient quality to fit their intended purpose.
Small area estimates should primarily be considered where key policy making decisions require discerning between the relative needs of different small areas and such information does not currently exist or requires updating (e.g. disability data). Developing small area estimates requires significant staff time to develop, check and obtain approval for release. The complexity of most small area estimation exercises and the difficulty in validating the reliability of the output make it very difficult to fully automate the production process. To a large extent, each small area undertaking has to be tailored to the nature and specifics of the problem at hand. Therefore, care needs to be taken to ensure the need for the small area estimates warrants the effort required.
The first step is to discuss with users whether state or part of state estimates would be adequate. If there is not much variation between the small areas, then broader estimates would be adequate. It is also worth investigating any sources of administrative data that can be used as auxiliary data for a small area model. Finally, it is worthwhile checking that the chosen small area model is appropriate for the data to which it is fitted and that the inherent assumptions in the model do at least approximately hold. For example, fitting a linear model to the data would require that the errors are independently and identically distributed with zero mean and constant variance. It is therefore prudent to check that such assumptions are reasonable before the model is used to produce estimates.
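As a rough illustration of the kind of assumption check described above, the sketch below fits a straight line by ordinary least squares and compares the spread of the residuals for low and high values of the auxiliary variable. It is written in Python using only the standard library; the data and the function names fit_line and residual_checks are hypothetical and are not taken from the manual or its SAS code. A variance ratio far from 1 hints that the constant variance assumption may not hold. Note that with an intercept in the model the residuals average to zero by construction, so the mean residual only confirms the fit was computed correctly.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def residual_checks(x, y):
    """Crude residual diagnostics for a fitted straight line."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Split the residuals into the lower and upper half of x values
    # and compare their variances.
    order = sorted(range(len(x)), key=lambda i: x[i])
    half = len(order) // 2
    low = [residuals[i] for i in order[:half]]
    high = [residuals[i] for i in order[half:]]
    ratio = statistics.pvariance(high) / statistics.pvariance(low)
    return {"mean_residual": statistics.fmean(residuals),
            "variance_ratio": ratio}

# Hypothetical data in which the scatter grows as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.5, 11.4, 14.9, 15.2]

checks = residual_checks(x, y)
print(checks)
```

In practice a residual plot and formal tests would normally be used alongside a summary like this, and more complex models should be referred to Methodology Division as the manual recommends.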
2. Assessing User Requirements
2.1 User Requirements
Understanding user requirements for small area estimates is paramount for providing a high quality small area product that meets the client's decision making requirements. The importance of gaining a thorough understanding of user requirements at this initial phase cannot be overemphasised. Shortcuts taken at this phase will often lead to an inferior quality product and/or valuable time and resources lost along the way. With the right questions, users will be able to give a clear indication as to what information is critical in their decision making. Users are also a valuable resource in helping to determine the best potential sources for borrowing strength.
Where complex techniques need to be applied, Methodology Division (MD) staff will need to be involved in performing the methodological work, and it is highly recommended that MD staff are directly involved in the discussion with clients at the earliest possible opportunity.
Table 2.1 below displays a checklist of the key questions to ask clients when commencing a small area exercise.
Table 2.1 Checklist of Questions to Ask Users
Question
A) What are the key policy making or program funding decisions that require small area data?
B) What are the organisation's strategic context, goals and desired outcomes, in which these decision making requirements are nested?
C) What small area data do users think would best meet their decision making requirements and what level of geography is required?
D) What are the consequences for users’ decision making outcomes if the small area data is incorrect, say, by 5%, 10%, 20%, etc? Which small area estimates have the greatest priority in terms of accuracy requirements?
E) Are there any conceptual models, either social or economic, that are believed to describe the process which influences the variable(s) for which we are to calculate small area estimates?
F) What administrative data is available and relevant as auxiliary information to support the modeling of the small area estimates? How is this data collected, for what purpose is it used, and how accurate is it likely to be?
G) Will small area estimates be required to be disaggregated by other categories?
H) What previous studies have been used, if any, to undertake the policy/funding decision for which small area estimates are required?
A) What are the key policy making or program funding decisions that require small area data?
Knowing how the small area data will be used as input to users’ decision making processes is essential in ensuring the small area output meets user requirements. User decision making requirements can vary considerably. Some may be quite sophisticated and quantitatively based. Others may be quite informal and qualitatively based. In the former case, the decision making process should be identified and well understood, as its inherent assumptions may help determine just how accurate the small area data really needs to be. It is also important to ensure, where possible, that the small area data is consistent and compatible with the users’ decision making process, and that the output of this process, not just the ABS small area output, meets user expectations. A quality assessment should include measures of the fitness for purpose of small area output.
However, many users do not have sophisticated, quantitatively based decision making processes, and may have difficulty in articulating the very nature of the problem they wish to solve.
Before undertaking the project it is worth investigating whether the small area estimates requested may suit the needs of a wider range of clients. Quite often similar data is required by different clients and can be useful for a wide range of users. Incorporating their needs into the project increases the value of the final product with minimal additional cost.
B) What are the organisation's strategic context, goals and desired outcomes, in which these decision making requirements are nested?
Ask users what the data problem is, why data needs to be obtained, what decision making processes are used, and what the users are trying to find out and why. This can then be matched against what it is possible to estimate from the available data. Any limitations can then be identified early, so that either additional information can be sought or the user can be made aware of them. When the final product is created, the user then has a good understanding of its limitations and the product is as close as possible to what they need.
C) What small area data do users think would best meet their decision making requirements and what level of geography is required?
A minimum level of information on the variable of interest is needed in each small area. Given the available data, the user needs to be aware that, for a given level of quality of the small area estimates, there is a trade-off between the level of geography and the level of detail in the data that it is possible to model. That is, in the context of household based collections, a reasonably common characteristic of the variable of interest (say, greater than 10% of the population) may be estimated at a reasonably fine level of geography such as Statistical Local Area (SLA). However, a variable of interest representing less than 1% of the population can only be reliably estimated at a broader level of geography such as Statistical Sub-Division (SSD). For example, in the disability study, estimates for physical disability (which accounts for more than 10%) could be obtained at a finer level of geography and detail than estimates for psychological disability (which is around 1%). This choice also depends on the quality of the data, which is discussed in the next section.
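The link between how common a characteristic is and how finely it can be estimated can be seen from the relative standard error (RSE) of a direct proportion estimate. The sketch below is illustrative only: it assumes simple random sampling, under which RSE = sqrt((1 - p) / (n p)), and the sample size of 500 and the function names are hypothetical rather than figures from the disability study.

```python
import math

def relative_standard_error(p, n):
    """RSE of an estimated proportion p from a sample of size n.

    Assumption: simple random sampling, ignoring design effects.
    """
    return math.sqrt((1 - p) / (n * p))

def sample_size_for_rse(p, target_rse):
    """Sample size needed to reach a target RSE for proportion p."""
    return (1 - p) / (p * target_rse ** 2)

n = 500  # hypothetical sample size available in one area
common = relative_standard_error(0.10, n)  # characteristic held by 10%
rare = relative_standard_error(0.01, n)    # characteristic held by 1%

print(f"common (10%): RSE = {common:.1%}")
print(f"rare (1%):    RSE = {rare:.1%}")

# Sample size the rare characteristic would need to match the
# precision achieved for the common one.
needed = sample_size_for_rse(0.01, common)
print(f"sample needed for the rare characteristic: {needed:.0f}")
```

Under these assumptions the rare characteristic needs a sample roughly eleven times larger to reach the same relative precision, which is why it can usually only be supported at a broader level of geography where more sample is pooled.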
D) What are the consequences for users’ decision making outcomes if the<br />
small area data is incorrect, say, by 5%, 10%, 20%, etc? Which small<br />
area estimates have the greatest priority in terms of accuracy<br />
requirements?<br />
The answer <strong>to</strong> this question will drive the level of quality and hence resources required<br />
<strong>to</strong> produce small area estimates of acceptable quality <strong>to</strong> users. If large funds from a<br />
government program are <strong>to</strong> be allocated <strong>to</strong> regions based on the small area estimates,<br />
then a high level of quality assurance and validation is required. However, if all that is<br />
required is an approximate guide <strong>to</strong> indicate areas where there may be unmet need, say<br />
for program evaluation purposes, then broad quality checks may be adequate.<br />
To assess how accurate small area estimates need to be before they start to adversely<br />
impact upon decision making outcomes, it is important to understand the entire<br />
decision making process and the way in which small area estimates feeding into that<br />
process impact upon the outputs. This involves understanding the assumptions implicit<br />
in the process. The analyst needs to work out, in consultation with users, how accurate<br />
final decision making outcomes need to be. By working backwards, it may be possible<br />
to work out what level of accuracy in the small area estimates will give this level of<br />
accuracy in the decision making outcomes. A sensitivity analysis is another approach<br />
that can be undertaken to determine how sensitive final decisions are to changes in<br />
the small area estimates.<br />
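One way to carry out such a sensitivity analysis is to perturb a small area estimate by a given<br />
percentage and observe how the decision outcome moves. The sketch below assumes a<br />
hypothetical funding formula that shares a fixed budget in proportion to the estimates; the<br />
area names, figures and function names are invented for illustration only.<br />

```python
def allocate(estimates, budget):
    """Share a fixed budget across areas in proportion to their estimates."""
    total = sum(estimates.values())
    return {area: budget * value / total for area, value in estimates.items()}


def allocation_shift(estimates, budget, area, relative_error):
    """Change in every area's allocation if one area's estimate is off by relative_error."""
    baseline = allocate(estimates, budget)
    perturbed = dict(estimates)
    perturbed[area] = estimates[area] * (1 + relative_error)
    shifted = allocate(perturbed, budget)
    return {a: shifted[a] - baseline[a] for a in estimates}


# Hypothetical small area estimates (persons in need) and program budget.
estimates = {"Area A": 1200, "Area B": 800, "Area C": 400}
budget = 1_000_000

# Errors of 5%, 10% and 20% in a single small area estimate.
for error in (0.05, 0.10, 0.20):
    shift = allocation_shift(estimates, budget, "Area C", error)
    print(f"{error:.0%} error in Area C:", {a: round(s) for a, s in shift.items()})
```

Because the budget is fixed, an overstated estimate in one area draws funds away from every<br />
other area, which is why the accuracy requirement depends on the whole decision process.<br />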
Zaslavsky and Schirm (2002) discuss, in the context of funding allocations, how<br />
interactions between the provisions of the funding formula, data sources and estimation<br />
procedures used to derive formula inputs can have unanticipated consequences that are<br />
inconsistent with the policy goals of a program.<br />
E) Are there any conceptual models, either social or economic, that are<br />
believed to describe the process which influences the variable(s) for<br />
which we are to calculate small area estimates?<br />
This is a great opportunity to get expert advice on which variables should have a<br />
relationship with the population of interest. This will give a theoretical basis for examining<br />
certain variables, which can then be confirmed by statistical analysis. A widely accepted<br />
theoretical model or framework, published in the literature and/or supported by<br />
empirical investigations, can greatly assist in deciding which variables, interaction terms<br />
and contextual effects should be included in the small area model or in validating the<br />
predicted estimates. Should you decide to include other variables not included in the<br />
framework or exclude variables that are included, be aware of the potential need to<br />
justify the decision.<br />
F) What administrative data is available and relevant as auxiliary<br />
information to support the modelling of the small area estimates? How is<br />
this data collected, for what purpose is it used, and how accurate is it<br />
likely to be?<br />
It is important to cast the net wide in considering all potential sources of auxiliary data<br />
that may help improve the goodness of fit and specification of the small area model.<br />
The importance of understanding differences between auxiliary data and the survey data<br />
cannot be overstated. Administrative datasets may not reflect the entire population of<br />
interest, or be as reliable, because they are captured during some other process (e.g. tax collection).<br />
A careful assessment should be made of the differences in:<br />
o concepts,<br />
o data item definitions,<br />
o (standard) classifications used,<br />
o scope,<br />
o mode of data collection,<br />
o reference periods,<br />
o editing procedures<br />
across all the data sources in order to at least understand the limitations of the small<br />
area model.<br />
G) Will small area estimates be required to be disaggregated by other<br />
categories?<br />
Users often request a whole range of small area data at different levels that may actually<br />
be superfluous to their needs. Here it is useful to find out the minimum level of<br />
data and geographic detail required to meet their needs. Prioritise any further<br />
breakdowns, either at the geographic or sub-population level, so that modelling<br />
time is best spent on the essential models.<br />
H) What previous studies have been used, if any, by the clients to<br />
undertake the policy/funding decision for which small area estimates are<br />
required?<br />
This allows the project to compare results with current or previous studies, which will give<br />
a good indication of whether they are consistent with other research. It also allows research<br />
into the problems that have come up in the past.<br />
3. Some issues in Small Area Estimation<br />
3.1 Sources of Additional Information<br />
The aim of small area estimation is to output a set of reliable estimates for each small<br />
area for the target variable(s) of interest. The challenge therefore, in small area<br />
estimation, is how best to use innovative approaches that take advantage of additional<br />
information to circumvent the small sample size problem and provide estimates with<br />
improved quality. Small area estimation methods are effective when they can draw upon<br />
intrinsic relationships within and between the survey data and other data sources, from<br />
which they borrow strength. These relationships, which are schematically represented in<br />
Figure 3.1, may be found:<br />
o between the survey based direct estimate and auxiliary information available from<br />
administrative data sources, censuses or other surveys, or<br />
o in correlations between direct estimates observed across time, or<br />
o in spatial relationships between neighbouring small areas, or<br />
o in cross-sectional relationships between units with similar characteristics observed in<br />
different small areas within some broader region,<br />
o or any combination of the above.<br />
Figure 3.1: Possible sources of additional information<br />
[Diagram: Auxiliary Data (Demographic Information), Cross-sectional Relationships, Time Series<br />
Relationships, Multivariate Correlations and Spatial Effects, arranged around the Small Area Model]<br />
It turns out that, in most cases, by far the most important source from which to borrow<br />
strength is auxiliary data.<br />
Auxiliary data<br />
One of the more important prerequisites for the successful production of small area<br />
estimates is the availability of accurate auxiliary data that is well correlated with the<br />
target variable. By auxiliary data we mean one or more variables obtained from either<br />
administrative data sources or a census that are included in the model as explanatory<br />
variables. The auxiliary data should:<br />
o comprehensively cover the entire population scope for which small area estimates<br />
are required. If an auxiliary data item is not available for the unselected part of the<br />
population then small area predictions cannot be made and the affected data items<br />
cannot be included in the model.<br />
o include reliable geographic information so that all units belonging to a small area can<br />
be accurately identified, and<br />
o be contemporaneous with the target variable and other auxiliary data used in the<br />
model.<br />
Model-based small area estimates are produced by first fitting the model to the<br />
sample data to estimate model parameters, which include the intercept and slope<br />
parameters. The estimated model is then applied to the population auxiliary data to<br />
produce the small area predicted estimates.<br />
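The two steps can be sketched as follows for an area level model with a single auxiliary<br />
variable. This is a minimal illustration only: it uses ordinary least squares rather than any<br />
particular small area model, and the figures, area names and function name are hypothetical.<br />

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Step 1: fit the model to the sample data (hypothetical values).
sample_aux = [0.02, 0.04, 0.06, 0.08]     # auxiliary variable for sampled areas
sample_direct = [0.10, 0.14, 0.18, 0.22]  # direct survey estimates of the target rate
intercept, slope = fit_line(sample_aux, sample_direct)

# Step 2: apply the estimated model to the population auxiliary data,
# which must be available for every small area, sampled or not.
population_aux = {"Area 1": 0.03, "Area 2": 0.05, "Area 3": 0.10}
predictions = {area: intercept + slope * x for area, x in population_aux.items()}
print(predictions)
```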
In the case of a purely area level model, the target variable and auxiliary variables are<br />
all at the small area level, so it is relatively straightforward to produce small area<br />
estimates as described above. However, in the case of unit or person level models, the<br />
second step referred to above is a little more complex as the model fitted to the<br />
sampled units is generally applied to those population units not selected in the<br />
sample. Small area estimates are compiled by taking the sum of the sample unit<br />
values for the target variable (obtained from the survey data) and adding to it the sum<br />
of the model predictions for the non-sampled units.<br />
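In symbols, the unit level estimate of a small area total can be sketched as below. The notation<br />
is an assumption introduced here for illustration: s_i and r_i denote the sampled and<br />
non-sampled units in small area i, y_ij is the observed survey value, and the prediction is<br />
written for a linear model with auxiliary values x_ij and estimated coefficients.<br />

```latex
\hat{Y}_i \;=\; \sum_{j \in s_i} y_{ij} \;+\; \sum_{j \in r_i} \hat{y}_{ij},
\qquad \hat{y}_{ij} = \mathbf{x}_{ij}^{\top}\hat{\boldsymbol{\beta}}
```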
This approach naturally applies if the survey data can be reliably matched to the<br />
auxiliary information using a hard matching identifier such as Medicare number or tax<br />
file number. This is common practice in a number of European Union countries<br />
where national identifiers exist. However, due to privacy considerations and related<br />
issues, this practice rarely occurs in Australia. Where it is not possible to distinguish<br />
between sampled and non-sampled units on the auxiliary data sources, there are two<br />
options available:<br />
- apply the model fitted to the sample data to the entire population data file, or<br />
- group population units within each small area (e.g. age by sex), fit a model to the<br />
small area by sub-group level sample data and then apply this model to make<br />
predictions for the non-sampled population in each small area sub-group.<br />
The first approach suffers from the disadvantage that the prediction error for the small<br />
area estimates will be increased slightly because target variable values for the sampled<br />
units are predicted from the model, thereby contributing to total model error. It would<br />
be preferable, however, to make use of the available survey response values, which<br />
are not subject to model error. If the sampling fraction is very small then this should not<br />
be a major concern.<br />
The second approach has the advantage that only population counts of the non-sampled<br />
population in each small area sub-group are required <strong>to</strong> make predictions. The<br />
predicted totals for the non-sampled population (at the small area sub-group level) can<br />
then be added to the corresponding sample totals to form small area estimates. A<br />
potential disadvantage of this approach is that the small area sub-group level model may<br />
be less efficient than a unit level model.<br />
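The second approach can be sketched for a single small area as follows. The sub-groups,<br />
counts and predicted rates are hypothetical, the function name is invented for illustration,<br />
and survey weights are ignored for simplicity.<br />

```python
def subgroup_estimate(sample_totals, sample_counts, population_counts, predicted_rates):
    """Small area total = observed sample total + predicted total for non-sampled people."""
    estimate = 0.0
    for group, population in population_counts.items():
        non_sampled = population - sample_counts.get(group, 0)
        predicted_total = predicted_rates[group] * non_sampled
        estimate += sample_totals.get(group, 0) + predicted_total
    return estimate


# Hypothetical age by sex sub-groups within one small area.
population_counts = {"F 0-64": 1000, "F 65+": 200, "M 0-64": 950, "M 65+": 150}
sample_counts = {"F 0-64": 10, "F 65+": 5, "M 0-64": 10, "M 65+": 5}
sample_totals = {"F 0-64": 1, "F 65+": 2, "M 0-64": 1, "M 65+": 2}  # sampled persons with the characteristic
predicted_rates = {"F 0-64": 0.10, "F 65+": 0.40, "M 0-64": 0.12, "M 65+": 0.45}  # from the sub-group model

estimate = subgroup_estimate(sample_totals, sample_counts, population_counts, predicted_rates)
print(round(estimate, 2))
```

Only the sub-group population counts are needed here, not a unit record file for the population.<br />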
Auxiliary data may be available at area level or person/unit level or a combination of<br />
both. However, in practice, due to confidentiality or security reasons, data from<br />
government administrative sources are more likely to be available at some aggregated<br />
level. The choice between a unit/person level or area level model will depend on the<br />
level at which data for the variable of interest and explanatory variables are available as<br />
well as the efficiency of the small area estimates generated. For example, if data for<br />
the target variable and the auxiliary variables are only available at the area level, fitting<br />
an area level model will be the only option. However, if unit level data is available for<br />
all variables, either an area level or unit level model is an option. It is also possible to<br />
fit a model in which the target variable is at the unit level but some auxiliary variables<br />
are at the unit level while others are at area level. Further discussion on the choice of<br />
small area model is provided in Section 4.2 below.<br />
In practice, the efficiency of predicted small area estimates may be improved by<br />
including some auxiliary variables as small area averages. Such covariates are referred to<br />
as contextual effects and may be included as an additional covariate even if the variable<br />
already appears in the model as a unit level auxiliary variable. Contextual effects allow<br />
differences in the characteristics of the area in which a person lives to be accounted for in<br />
the model. For example, high income earners living in low income areas may have quite<br />
different characteristics to people on similarly high incomes living in high income areas,<br />
and it may be important to take account of this in the model.<br />
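A contextual effect can be built by attaching the area average of a variable to each unit record,<br />
so that the model sees both the person's own value and the value typical of their area. The<br />
sketch below is illustrative only: the records and function name are hypothetical, and the<br />
average is computed from the listed records, whereas in practice it would more likely come<br />
from census or administrative data covering the whole area.<br />

```python
from collections import defaultdict


def add_area_mean(units, variable, area_key="area"):
    """Return copies of the unit records with the area average of `variable` attached."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for unit in units:
        totals[unit[area_key]] += unit[variable]
        counts[unit[area_key]] += 1
    means = {area: totals[area] / counts[area] for area in totals}
    new_name = f"area_mean_{variable}"
    return [{**unit, new_name: means[unit[area_key]]} for unit in units]


# Hypothetical person level records (income in $'000).
units = [
    {"area": "A", "income": 90},
    {"area": "A", "income": 30},
    {"area": "A", "income": 30},
    {"area": "B", "income": 90},
    {"area": "B", "income": 110},
]
with_context = add_area_mean(units, "income")

# Two people with the same income now differ on the contextual covariate.
print(with_context[0], with_context[3])
```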
We now give an example of the data sources and auxiliary variables that were considered<br />
for the disability empirical study. The target variable was whether or not a person has a<br />
disability. The auxiliary data was drawn from the survey, a census as well as<br />
administrative data sources and comprised:<br />
- Survey of Disability, Ageing and Carers (SDAC) (ABS, 1998)<br />
- Census of Population and Housing, 2001 (ABS)<br />
- Socio-Economic Indexes For Areas (SEIFA) (ABS)<br />
- Disability Support Pension (DSP) data from Centrelink<br />
Given these sources of data, the following auxiliary variables were considered:<br />
- proportion of people in the small area receiving the DSP,<br />
- age and sex, income, household structure (from SDAC),<br />
- Socio-Economic Indexes For Areas (SEIFA) score for the small area,<br />
- indicator of remoteness<br />
Some of these variables were only available at the area level while those sourced from<br />
SDAC/Census, for example, age, sex and income, were available at the person level.<br />
These SDAC variables were chosen subject to the requirement that they were<br />
similarly defined and available from the census.<br />
Another key issue relating to auxiliary data concerns the case where survey data cannot<br />
be matched to auxiliary data sources. In order to make predictions for each small area,<br />
auxiliary variables obtained from the survey must correspond closely with similar data<br />
items available for the rest of the population. If this is not the case then model<br />
predictions may be significantly biased. For example in the empirical study of small area<br />
estimates of disability, we used auxiliary variables such as age, sex, income and<br />
household structure, found on the SDAC survey file, to fit the model and then used the<br />
corresponding variables on the population census file to make the small area<br />
predictions.<br />
When considering potential sources of auxiliary data it is highly advisable to cast a wide<br />
net and assess the value of data that may not on first reflection appear highly relevant.<br />
For example, in the context of disability data, an economic variable in addition to health<br />
related variables may have good predictive power. Some caution, however, needs to be<br />
exercised as it is possible that the correlation between the target and some of the more<br />
tenuous auxiliary variables is due more to coincidence than to an intrinsic real world<br />
relationship between the two. Such auxiliary variables are referred to as spurious<br />
auxiliary variables.<br />
Demographic information is a particular form of auxiliary information, relating to<br />
population attributes such as age and sex. Many social variables will have some<br />
relationship to such demographic data, thereby necessitating its use. However there is<br />
another reason for using demographic information and that is where the population size<br />
or demographic composition of small areas varies considerably. In Australia, with its<br />
extreme variation in population densities, this is a very common issue.<br />
Cross-sectional relationships<br />
Cross-sectional correlations are intrinsic relationships between units (observed at the<br />
same time point) with similar characteristics, even if they are not in the same small area.<br />
For example, units with the same age, sex and occupational characteristics may have<br />
similar health outcomes regardless of whether they live in Sydney or Melbourne. Small<br />
area methods borrow strength cross-sectionally by pooling sample data across a broader<br />
area (thus obtaining more statistical reliability) and then adjusting each small area<br />
estimate according to its age-sex-occupation profile. In practice, borrowing strength<br />
cross-sectionally may be restricted to a predefined broader region if it is believed that<br />
cross-sectional relationships are likely to be different between regions. For example,<br />
exposure to air pollutants is likely to be similar for Sydney and Melbourne but different<br />
to that of other cities. Hence Sydney and Melbourne may be combined into a broader<br />
region within which cross-sectional relationships can be drawn upon.<br />
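The pooling and adjustment described above can be sketched as follows: rates are estimated<br />
from the sample pooled across the broader region, then applied to each small area's own<br />
population profile. The groups, sample and population counts are hypothetical, as are the<br />
function names, and only two groups are used to keep the illustration short.<br />

```python
def pooled_rates(sample):
    """Rate of the characteristic in each group, pooled over the broader region."""
    totals, counts = {}, {}
    for group, has_characteristic in sample:
        totals[group] = totals.get(group, 0) + has_characteristic
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in counts}


def synthetic_estimate(rates, profile):
    """Apply the pooled group rates to one small area's population profile."""
    return sum(rates[group] * count for group, count in profile.items())


# Hypothetical pooled sample: (group, 1 if the person has the characteristic else 0).
sample = ([("young", 1)] * 2 + [("young", 0)] * 18
          + [("old", 1)] * 4 + [("old", 0)] * 6)
rates = pooled_rates(sample)

# Hypothetical population profiles for two small areas of equal size.
profiles = {
    "Area X": {"young": 800, "old": 200},
    "Area Y": {"young": 500, "old": 500},
}
small_area_estimates = {area: synthetic_estimate(rates, p) for area, p in profiles.items()}
print(small_area_estimates)
```

The two areas have the same population but different estimates, purely because their profiles differ.<br />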
Time Series Relationships<br />
Borrowing strength across time enables the practitioner to effectively pool sample data<br />
across time. The sample in each small area may be very sparse at a given time point;<br />
however, if a sufficiently long time series exists and auto-correlations across time are<br />
reasonably strong, data from a number of time points can be pooled together, giving a<br />
larger effective sample size to utilise in each small area. Time series auto-correlations are<br />
utilised to adjust for the degree of similarity or dissimilarity between units observed at<br />
specified time periods apart. This approach also has the benefit of reducing the impact<br />
of an observed value that is discordant with its neighbouring values in time. Borrowing<br />
strength across time adds a considerable degree of complexity to small area estimation<br />
and should only be contemplated where statistical expertise is available.<br />
Spatial Relationships<br />
Spatial relationships in the data can be harnessed in much the same way that time series<br />
relationships can be. Thus, if we hypothesise that different units bear some relationship<br />
to each other that depends upon the distance and direction between them, units can<br />
then be pooled together to give a greater effective sample size for each small area<br />
estimate. This approach also has the benefit of reducing the impact of the odd unit value<br />
that is discordant with its neighbouring values. Spatial methods are commonly used in<br />
the contexts of health, disease, agricultural or environmental data but may be quite<br />
applicable to other specific topics.<br />
As in the case of time series relationships, borrowing strength through spatial<br />
relationships adds complexity to the small area estimation and should only be<br />
contemplated where statistical expertise is available.<br />
Multivariate Relationships<br />
In a univariate model the response or target variable is a single variable. In this manual<br />
the models referred to are univariate models. So using the example of disability type<br />
(physical, sensory, intellectual, psychological/psychiatric, head injury/acquired brain<br />
damage), a separate univariate model is fitted to each of the disability types. In a<br />
multivariate model, the target variable is a vector of these variables and the model is<br />
fitted to these variables simultaneously.<br />
A multivariate approach may be more efficient in terms of producing more accurate<br />
predictions if there are strong correlations between the constituent variables. For<br />
example, physical impairment may have a strong correlation with sensory impairment. A<br />
multivariate approach that takes advantage of this additional information should be<br />
more robust and give more accurate estimates. However, multivariate models add<br />
complexity to small area estimation and should only be contemplated where<br />
statistical expertise is available.<br />
3.2 Basic Conditions for Success<br />
The first step in undertaking a small area exercise is to determine the quality of the<br />
direct estimates and the auxiliary data at the small area level. The variable of interest is<br />
often drawn from a sample survey, which cannot provide estimates at a fine level due to<br />
small sample size in each small area and correspondingly high Relative Standard Errors<br />
(RSEs). Auxiliary data can be obtained from many sources including administrative<br />
datasets, survey variables and census counts. Table 3.1 outlines some issues that will<br />
help in determining whether the basic conditions for producing quality small area<br />
estimates are being met.<br />
Table 3.1: Recipe for Success<br />
Small Area Size<br />
Ingredient: Each small area should have a reasonable sample. Few small areas should have no sample.<br />
Reason: The smaller the sample the harder it is to reliably discern the characteristics of individual small areas. More reliance is then placed on the assumption that the small area is similar to others. It also becomes more difficult to identify relationships either in the data or with auxiliary data. This will lead to lower quality small area estimates.<br />
Variable of Interest<br />
Ingredient: Reasonably common population characteristic.<br />
Reason: Similar reason to small area size. In the context of household surveys, the rarer the characteristic the smaller the likely sample.<br />
Ingredient: Consistent estimates across small areas.<br />
Reason: Key assumption with simple synthetic models.<br />
Model Specification<br />
Ingredient: Model is well-specified, meaning that:<br />
o all main determinants or explanators (auxiliary variables) for the target variable are included in the model and<br />
o the model reflects the correct form of the relationship between the target variable and the auxiliary variables (e.g. linear, quadratic, logistic etc.) and that variance structures are accounted for correctly.<br />
Reason: Mis-specification may result in incorrect predictions and incorrect measures of the statistical reliability of those predictions.<br />
Auxiliary Data<br />
Ingredient: Strong theoretical relationship between auxiliary variable and population of interest.<br />
Reason: Allows easy identification of potential auxiliary variables and aids in explanation of method to users.<br />
Ingredient: Statistically significant relationships between auxiliary data and small area estimates.<br />
Reason: Allows a reasonable small area model to be estimated.<br />
Ingredient: The auxiliary data has been accurately collected and maintained and uses similar scope and definitions to the survey data.<br />
Reason: Eliminates a further source of error that would otherwise impact upon the quality of the final small area output.<br />
Ingredient: No missing values.<br />
Reason: Missing values can bias estimates or cause model failure. Where possible ensure these have been accounted for before modelling.<br />
Ingredient: Compatibility of auxiliary data with census data in terms of consistency of definitions of variables, measurement, timing and other issues.<br />
Reason: Reduces further sources of error caused by inconsistency of definitions, measurement and other changes over time.<br />
Confidentiality<br />
Ingredient: Maintain confidentiality standards.<br />
Reason: ABS mission statement provides an assurance concerning the confidentiality of the data it collects.<br />
3.3 Choice of Small Area<br />
Within the ABS, the choice of small areas generally aligns with pre-specified boundaries<br />
as defined by the Australian Standard Geographical Classification (ASGC). Each area<br />
within Australia is broken up into Census Collection Districts (CDs) within Statistical<br />
Local Areas (SLAs) within Statistical Subdivisions (SSDs) within Statistical Divisions (SDs)<br />
within states. If possible, it is generally advisable to use ASGC classifications as they<br />
provide a consistent and integrated framework with a readily available set of<br />
concordances. However, other agencies often have different boundaries for their<br />
administrative areas. These boundaries generally line up with council boundaries which<br />
again line up with SLAs. Another common boundary is the postcode which can be<br />
related to a CD, although only approximately. A new geographical unit, called the<br />
meshblock, will be introduced in the 2006 population census for output purposes. The<br />
meshblock is considerably smaller than the CD, and with the help of the Geocoded<br />
National Address File (G-NAF) will improve the accuracy with which locations are coded<br />
to other ASGC classifications. In Section 2 we saw how it is important to find out from<br />
users the broadest area that will meet their small area requirements in order to improve<br />
the reliability of the modelled estimates. Figure 3.2 depicts the different choices that<br />
must be made to get reasonable estimates.<br />
Figure 3.2: Choosing the Appropriate Small Area<br />
As discussed in Section 3.1 under "Auxiliary data", a choice exists as to the level at which<br />
small area models should be applied. In practice, users may require small area data at<br />
different levels of aggregation and it may be expedient to fit the model at the finest level,<br />
produce small area estimates at that level and let users aggregate those estimates up to<br />
the required levels of aggregation. However, it is important to realise that this may run<br />
the risk of incurring what is known as the atomistic fallacy (PAHO, 2003).<br />
The atomistic fallacy occurs when trying to draw inferences about units defined at a<br />
higher level of aggregation from a model fitted at a lower level of aggregation.<br />
Relationships between lower levels of aggregated units may not be the same as those<br />
between higher level aggregated units. Hence if another model was fitted at the higher<br />
level, the estimated model parameters and the predictions may be quite different to<br />
those for the model fitted at the lower level. A similar fallacy, called the ecological<br />
fallacy, may occur in the reverse situation of fitting a model at a broad regional level and<br />
assuming that the inferences drawn can be readily applied to small areas within those<br />
regions. Wherever possible it is advisable to make model inferences (that is, predicted<br />
estimates and their associated measures of accuracy) at the small area level required by<br />
users. Where small area estimates are to be aggregated the extent of the aggregation<br />
should be kept to a minimum. If small area estimates are produced at the LGA level,<br />
aggregation to user defined regions consisting of only a few LGAs may be acceptable but<br />
aggregation of many LGA estimates should be treated with caution.<br />
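The following sketch illustrates how the fitted relationship can change with the level of<br />
aggregation. The data are contrived for illustration (three hypothetical areas of three units<br />
each) and the model is a simple least squares slope, not a small area model.<br />

```python
def slope(points):
    """Least squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Contrived unit level data: (auxiliary value, target value) within each area.
units = {
    "Area 1": [(1, 3), (2, 2), (3, 1)],
    "Area 2": [(4, 6), (5, 5), (6, 4)],
    "Area 3": [(7, 9), (8, 8), (9, 7)],
}

# Model fitted at the lower (unit) level, pooling all units.
unit_points = [point for points in units.values() for point in points]
unit_slope = slope(unit_points)

# Model fitted at the higher (area) level, using area averages.
area_points = [
    (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
    for pts in units.values()
]
area_slope = slope(area_points)

# Relationship among units inside a single area.
within_slope = slope(units["Area 1"])

print(unit_slope, area_slope, within_slope)
```

In this contrived case the slope is negative within an area but positive across areas, and the<br />
pooled unit level slope matches neither, so inferences do not transfer cleanly between levels.<br />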
In choosing the most appropriate small area to use, consideration needs to be given to<br />
the sample size in each small area. This needs to be sufficient so that the model can<br />
produce appropriately reliable estimates. The size of the sample required will depend on the<br />
strength of the cross-sectional relationships or other sources for borrowing strength. If<br />
these are quite strong then perhaps as few as ten or twenty units will be sufficient. In the<br />
absence of strong relationships in the data, a larger sample size of perhaps a couple of<br />
hundred units may be required in each small area. The sample sizes referred to here<br />
should be interpreted as a very rough guide as, apart from model strength, the required<br />
sample size will also depend upon the variation of units within each small area.<br />
The number of small areas is also important especially if units are clustered within small<br />
areas. Generally having more small areas will help improve the goodness of fit of the<br />
small area model. Consideration should also be given <strong>to</strong> the geographical distribution<br />
of the sample through each small area. In ABS household surveys clustering is used in<br />
the sample design to help reduce costs, with the result that in remote areas all of the<br />
sampled dwellings in a small area may have been selected from one or more small towns<br />
and none from throughout the vast rural expanse. This is likely to result in bias if the<br />
characteristics of people in those towns are different to those in the remote rural areas.<br />
The allocation of the sample across small areas will often reflect the relative frequency<br />
with which the characteristic or variable of interest occurs in the population. In the case<br />
of a common sub-population, such as the number of persons employed or the number of<br />
persons with a disability, local government area (LGA) may be a suitable choice of<br />
small area. For rare characteristics, such as Indigenous status or a particular type of<br />
illness, larger areas may be required to give reasonable estimates. For this, the<br />
Statistical Subdivision (SSD) or broader regions may be required.<br />
Of course the decision on which level of geography to choose for the small area will<br />
ultimately hinge upon user decision making requirements. It makes sense <strong>to</strong> choose<br />
small area that are as close as possible <strong>to</strong> the areas used for program planning and<br />
implementation. However, such areas are often really no more than administrative<br />
regions, chosen for pragmatic or logistical reasons such as transport costs or workforce<br />
management efficiency. <strong>Statistical</strong> units within these administrative regions are not<br />
necessarily homogenous with respect <strong>to</strong> the variable we are trying <strong>to</strong> calculate small area<br />
estimates for. If this is the case it may be worth considering (subject <strong>to</strong> the minimum<br />
sample size requirement) small areas at a finer level with greater homogeneity <strong>to</strong> obtain<br />
a better fitting model. Small area estimates at this level can then be aggregated <strong>to</strong> the<br />
required administrative regional level.<br />
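The aggregation step described above can be sketched as follows. This is a minimal illustration only: the area names, the estimates and the concordance between fine-level areas and administrative regions are all hypothetical, and the sketch assumes the small area estimates are totals (counts of persons), which can simply be summed.

```python
from collections import defaultdict

# Hypothetical modelled estimates (counts of persons) for fine-level small areas.
lga_estimates = {"LGA_A": 1200.0, "LGA_B": 800.0, "LGA_C": 450.0, "LGA_D": 950.0}

# Hypothetical concordance from each small area to its administrative region.
lga_to_region = {
    "LGA_A": "Region_1",
    "LGA_B": "Region_1",
    "LGA_C": "Region_2",
    "LGA_D": "Region_2",
}


def aggregate_to_regions(estimates, concordance):
    """Sum small area estimates of totals within each administrative region."""
    totals = defaultdict(float)
    for area, value in estimates.items():
        totals[concordance[area]] += value
    return dict(totals)


region_estimates = aggregate_to_regions(lga_estimates, lga_to_region)
print(region_estimates)
```

Estimates of proportions or rates cannot be summed in this way; they would need to be combined using population-weighted averages. Measures of accuracy for the aggregated estimates also require separate treatment.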
For example, in the disability empirical study, disability programs are funded and<br />
administered at the level of Disability and Health Services Regions (DHSR) which are<br />
aggregations of usually a few LGAs. LGA was considered sufficiently close for modelling<br />
purposes while also having the advantage of sufficient sample sizes and higher level of<br />
homogeneity with respect <strong>to</strong> disability characteristics.<br />
Another example is that of producing small area estimates of water usage. One might<br />
consider using water catchment areas because that is the level required by users,<br />
however these are not always standardised across water and energy authorities. There is<br />
also the problem of geocoding ASGC classifications on which ABS data is based <strong>to</strong> the<br />
water catchment area. Water catchment areas can also be vast along major river systems,<br />
encompassing very different land uses, rainfall patterns and geological drainage features.<br />
3.4 Variable of Interest<br />
The variable of interest is typically measured from an ABS sample survey. This forms our<br />
dependent variable <strong>to</strong> build the small area model around. If the proportion of the<br />
population with a characteristic of interest is constant across broad geographic areas<br />
(e.g. assuming each small area within NSW has, say, the same rate of heart attacks), then<br />
auxiliary data are not really needed and a simple technique such as the broad area ratio<br />
estima<strong>to</strong>r will give good results.<br />
In practice, however, this will be a strong assumption <strong>to</strong> make. If we believe that small<br />
area proportions vary with other fac<strong>to</strong>rs then auxiliary information will be required <strong>to</strong><br />
build a model. The auxiliary data can help explain the variation between small areas and<br />
assist in creating quality small area estimates.<br />
Another point for consideration is that in many applications there will be not just one<br />
but a number of variables of interest requiring small area estimates. Auxiliary data may<br />
not be available for each of these and the strength of the relationship between each<br />
variable of interest and the available auxiliary variables may vary markedly. Prioritising<br />
the variables of interest with users will assist in focusing effort <strong>to</strong> improve the quality of<br />
those estimates that matter most.<br />
3.5 Quality of Auxiliary Data<br />
Potential auxiliary data should be evaluated for their relationship <strong>to</strong> the variable(s) of<br />
interest, both theoretically and statistically as well as the accuracy and reliability with<br />
which they have been collected. The theoretical relationship should emanate from<br />
tested social or economic theories. A careful examination should be made <strong>to</strong> understand<br />
any major differences between the auxiliary data and the variables of interest.<br />
Consideration should be given to the purpose for which the data was initially collected,<br />
how it was processed and edited, what conceptual definitions were used and what the<br />
scope of the auxiliary data holdings is. This will allow appropriate auxiliary information to<br />
be chosen <strong>to</strong> improve the model, aid in explaining <strong>to</strong> users what fac<strong>to</strong>rs are driving the<br />
small area estimates and help pinpoint potential sources of error.<br />
In summary the following aspects should always be examined carefully when<br />
considering administrative data for use as auxiliary variables:<br />
o Population scope of the data<br />
o Definitions of variables / concepts used<br />
o Purpose for collecting data / what is it used for<br />
o Reference period<br />
o Questionnaire (or form) and collection methodology used to collect the data<br />
o Survey design used<br />
o Quality of the framework used to select units from<br />
o The extent of missing data. What if any, imputation treatments were used?<br />
o Classifications used<br />
o Editing or data validation process used<br />
In the disability study, auxiliary data was sourced from Centrelink on the number of<br />
people receiving the Disability Support Pension (DSP). In areas with a greater<br />
proportion of people receiving the DSP we would expect a higher incidence of disability.<br />
A person’s eligibility <strong>to</strong> receive the DSP is related <strong>to</strong> their ability <strong>to</strong> undertake<br />
employment related activities, whereas the ABS Survey of Disability, Ageing and Carers<br />
(SDAC) concept of disability relates <strong>to</strong> a person’s ability <strong>to</strong> undertake a wide range of<br />
household, social as well as employment activities.<br />
There are a number of simple approaches for evaluating the strength of the statistical<br />
relationship between the variable of interest and the auxiliary data. The strength and<br />
statistical significance of this relationship can be analysed through simple scatter plots,<br />
correlations or simple models. Where substantial differences between the data do<br />
appear, it may be possible in some circumstances <strong>to</strong> improve the statistical relationship<br />
by the application of suitable adjustments or imputation methods <strong>to</strong> the auxiliary data <strong>to</strong><br />
make it more comparable with the response variable. The aim of these adjustments may<br />
be <strong>to</strong> reduce the impact of scope or definitional differences or <strong>to</strong> treat outliers in the<br />
auxiliary data. Such adjustments may help <strong>to</strong> improve the statistical relationship between<br />
the auxiliary data and the response variable. However it is important that a statistician be<br />
consulted before applying such adjustments.<br />
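As a rough illustration of the simple correlation check mentioned above, the sketch below computes a Pearson correlation between an auxiliary variable and the variable of interest at area level. The figures are invented for illustration (they are not SDAC or Centrelink data), and the function name is our own.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences of area-level values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical area-level data: proportion receiving a pension (auxiliary variable)
# and the direct survey estimate of the disability rate (variable of interest).
pension_rate = [0.03, 0.05, 0.04, 0.08, 0.06]
disability_rate = [0.15, 0.19, 0.18, 0.26, 0.21]

r = pearson_correlation(pension_rate, disability_rate)
print(round(r, 3))
```

A correlation close to 1 or -1 suggests the auxiliary variable may be useful in a model; a scatter plot of the same data should also be inspected for outliers and non-linear patterns.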
3.6 Confidentiality<br />
Protecting the confidentiality of data provided <strong>to</strong> the ABS is of utmost importance and is<br />
enshrined in the Census and Statistics Act, 1905. The risk of breaches of confidentiality<br />
needs to be carefully assessed in the case of small area data releases, as such releases<br />
naturally produce a higher level of detail than is normally the case. Hence care must be<br />
taken <strong>to</strong> ensure that the potential for identifying individual persons or businesses is<br />
greatly reduced. The risk of identification is increased when:<br />
o The population of interest is quite rare<br />
o The geographic area is very small<br />
o A major part of the small area estimate can be attributed to units with unusual<br />
characteristics (such as in the case of doctors in remote areas or the<br />
telecommunications sector)<br />
The release of small area estimates should follow the standard ABS guidelines. While the<br />
fine level of geography increases the risk of identification, this risk may <strong>to</strong> some extent<br />
be mitigated by the inherent smoothing of the data and additional model error<br />
introduced by the modeling process itself. However this does not mean that all caution<br />
can be thrown <strong>to</strong> the wind. Most small area projects will be commissioned by external<br />
agencies and individuals in these or other agencies may be realistically expected <strong>to</strong> be in<br />
a position <strong>to</strong> obtain knowledge of the models used <strong>to</strong> produce the small area estimates.<br />
Such information could possibly be used <strong>to</strong> identify individuals. Another issue is that<br />
although most small area estimates will be modeled and hence incur model error and<br />
further smoothing, there is the risk that an individual is correctly identified from the data,<br />
albeit through incorrect logic. In that case there would still be a public perception that the Act has<br />
been breached. In conclusion, all possible steps to avoid disclosure should be taken in<br />
preparing small area data for release and the Data Access and Confidentiality<br />
Methodology Unit should be consulted prior <strong>to</strong> release.<br />
4. Choice of Small Area Techniques<br />
4.1 Types of Small Area Estimation Techniques<br />
In this section we discuss some of the more common techniques available for small area<br />
estimation. We consider these techniques under the general headings of "Simple Small<br />
area Methods" (Section 4.1.1) and "Regression Methods" (Section 4.1.2). Although the<br />
methods discussed under Section 4.1.1 can be formulated in terms of a regression<br />
model, and hence would conceptually belong under Section 4.1.2, we have treated them<br />
separately because they are:<br />
1. simple to implement and require less statistical expertise. They are also commonly<br />
used to produce small area estimates by many government agencies.<br />
2. often appropriate as an initial exercise to obtain "rough" small area<br />
estimates, before attempting more rigorous techniques.<br />
4.1.1 Simple Small Area Methods<br />
Here we discuss the simpler methods that involve weighted survey estimates derived for<br />
a given level of geography that can be applied without the explicit application of<br />
statistical models. These methods include the:<br />
o Direct Estimator<br />
Direct estimates are classical design-based estima<strong>to</strong>rs that are obtained by<br />
applying survey weights <strong>to</strong> the sample units in each small area (Saei and<br />
Chambers, 2003). Since most ABS surveys are designed <strong>to</strong> provide reliable<br />
estimates only at the national or state levels, sample sizes are often <strong>to</strong>o small at<br />
the small area level <strong>to</strong> produce reliable direct estimates. Small area estimation<br />
is therefore concerned with alternative techniques that can produce small area<br />
estimates with higher accuracy than that of direct estimates.<br />
o Broad Area Ratio Estimator (BARE)<br />
This estima<strong>to</strong>r is one of the simplest types of synthetic estima<strong>to</strong>rs. It is<br />
calculated by pro-rating a broad area direct estimate by the ratio of the small<br />
area <strong>to</strong> broad area populations. This estima<strong>to</strong>r applies the reliable broad area<br />
estimate proportionately across all small areas contained in the broad region.<br />
The success of the BARE estima<strong>to</strong>r hinges largely on the choice of the broad<br />
area. The broad area needs <strong>to</strong> be chosen large enough <strong>to</strong> afford a direct<br />
estimate that is sufficiently reliable but small enough that all small areas within<br />
the broad area are sufficiently homogenous in the characteristic of interest. It<br />
is important <strong>to</strong> note that if small areas are in fact not homogenous within the<br />
broad region then the BARE will be biased. In practice this is difficult <strong>to</strong> verify<br />
hence caution should be exercised when using the BARE. It should only be<br />
used when users are aware of, and fully prepared <strong>to</strong> accept this assumption.<br />
o Calibration Estimator<br />
To produce calibration estima<strong>to</strong>rs, the original survey weights (usually the<br />
inverse probabilities of inclusion in the sample) are replaced with new<br />
"calibrated" weights that are in some sense as close as possible <strong>to</strong> the original<br />
weights, but are calibrated on some auxiliary variable available for the<br />
population (Chambers, 2005). The small area estimate for this auxiliary<br />
variable, calculated using the calibrated weights, will agree with the known<br />
population <strong>to</strong>tals. A simple example of calibration is where population age by<br />
gender demographic <strong>to</strong>tals are known for each small area. The survey weights<br />
are then adjusted so that estimates of population count by age and gender,<br />
agree with the known population counts.<br />
There are a couple of points <strong>to</strong> note about the calibration estima<strong>to</strong>r. Firstly it is<br />
a straightforward method <strong>to</strong> put in<strong>to</strong> production because the resulting<br />
adjusted (calibrated) weights can be s<strong>to</strong>red on the survey file and used <strong>to</strong><br />
produce estimates at the desired level of aggregation. Secondly the auxiliary<br />
variables should be chosen with care and should relate <strong>to</strong> variables we wish <strong>to</strong><br />
produce estimates for. If the calibrated weights are used <strong>to</strong> produce estimates<br />
for variables that aren’t related <strong>to</strong> the auxiliary variable(s) used in determining<br />
the calibrated weights, the resulting estimates may be biased. In general<br />
calibrated estimates possess good design-based properties. Government<br />
statisticians have his<strong>to</strong>rically preferred the design-based <strong>to</strong> the model-based<br />
approach as the resulting estimates are not subject <strong>to</strong> the consequences of<br />
model mis-specification.<br />
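The three simple methods above can be compared on a toy example. The following is a minimal sketch, not an ABS implementation: the sample records, weights and population counts are invented, the function names are our own, and the calibration shown is the simplest possible case (scaling weights to a single known population count per area, rather than to age by gender totals).

```python
# Hypothetical sample records: (small area, survey weight, has characteristic: 1 or 0).
SAMPLE = [
    ("A", 100.0, 1), ("A", 100.0, 0), ("A", 100.0, 0),
    ("B", 150.0, 1), ("B", 150.0, 1),
    ("C", 200.0, 0),
]

# Hypothetical known population counts for the small areas in one broad area.
POPULATION = {"A": 400, "B": 300, "C": 300}


def direct_estimates(sample):
    """Direct estimator: weighted total of the characteristic in each small area."""
    totals = {}
    for area, weight, y in sample:
        totals[area] = totals.get(area, 0.0) + weight * y
    return totals


def bare_estimates(sample, population):
    """BARE: pro-rate the broad area direct estimate by population share."""
    broad_total = sum(weight * y for _, weight, y in sample)
    broad_population = sum(population.values())
    return {area: broad_total * count / broad_population
            for area, count in population.items()}


def calibrate_weights(sample, population):
    """Scale weights within each area so they sum to the known population count."""
    weight_sums = {}
    for area, weight, _ in sample:
        weight_sums[area] = weight_sums.get(area, 0.0) + weight
    return [(area, weight * population[area] / weight_sums[area], y)
            for area, weight, y in sample]


direct = direct_estimates(SAMPLE)
bare = bare_estimates(SAMPLE, POPULATION)
calibrated = direct_estimates(calibrate_weights(SAMPLE, POPULATION))

for area in sorted(POPULATION):
    print(area, direct[area], round(bare[area], 1), round(calibrated[area], 1))
```

In this toy data, area C has a single sampled unit without the characteristic, so its direct and calibrated estimates are both zero, while the BARE assigns it a share of the broad area total. This illustrates both the instability of direct estimates in small samples and the homogeneity assumption on which the BARE relies.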
4.1.2 Regression Methods<br />
Where a higher level of accuracy is required for small area estimates, an alternative is <strong>to</strong><br />
use regression or model-based approaches. However, these methods require a higher<br />
level of statistical expertise to implement and to interpret the results. A wide variety of<br />
different regression techniques are available, but for the purposes of this manual, they<br />
are divided in<strong>to</strong> two main categories: synthetic and random effects regression models.<br />
o Synthetic Regression Models<br />
Synthetic regression models make use of available auxiliary data <strong>to</strong><br />
mathematically express a deterministic relationship between those auxiliary<br />
variables and the target (response) variable we are trying <strong>to</strong> predict in each<br />
small area. Synthetic models assume that all the systematic variability in the<br />
response variable is explained by the variability in the values of the auxiliary<br />
variables. The remaining variability, which is referred <strong>to</strong> as the "random noise"<br />
or "s<strong>to</strong>chastic variation", is represented by the difference between the<br />
predicted value for the response variable under the model and the value<br />
observed from the data. These differences are called random errors, residuals<br />
or disturbances.<br />
In the case of small area models, synthetic models assume that the same<br />
deterministic relationship between the variable of interest and the auxiliary<br />
variables, holds across a range of small areas, say for example within a state.<br />
Synthetic models work well when all relevant auxiliary variables that help<br />
predict the response variable are available, accurate and can be included in the<br />
model. However in practice this is more the exception than the rule.<br />
o Random Effects Regression Models<br />
When fitting a synthetic model, the residuals should look like "white noise",<br />
however in practice they often display significant between area variation which<br />
indicates that there is some other systematic variation in the response variable<br />
between different small areas that is not being accounted for by the auxiliary<br />
variables. This implies that the synthetic model is missing certain auxiliary<br />
variables, the values of which would, had they been available, better help<br />
predict differences between small areas.<br />
This problem can be addressed by incorporating a random effect in<strong>to</strong> the<br />
model. This is done by treating the constant or intercept term in the model as<br />
a fixed constant plus a random component known as the random effect. The<br />
interpretation of this is that each small area is assigned an intercept term in the<br />
model which is allowed <strong>to</strong> vary, around some overall constant value, from one<br />
small area <strong>to</strong> another. This is usually sufficient <strong>to</strong> take account of between area<br />
variation, however it is possible <strong>to</strong> include a random effect in a parameter<br />
coefficient rather than the intercept term. Doing this further adds <strong>to</strong> the level<br />
of complexity and is not covered in this manual.<br />
For models fitted <strong>to</strong> small area level data, the inclusion of random effects may<br />
give a distinct advantage over the synthetic model approach, possibly leading<br />
<strong>to</strong> estimates with higher precision and robustness. In the case of linear models,<br />
random effects models can theoretically be shown to give small area estimates<br />
that reflect the best trade-off between the accuracy of the direct estimate and<br />
the uncertainty associated with the synthetic model. So for a small area that<br />
happens to have a low sampling error (e.g. because of a large sample size)<br />
relative <strong>to</strong> the <strong>to</strong>tal error (sum of sampling error of the direct estimate and<br />
synthetic model error), a random effects model will give more weight <strong>to</strong> the<br />
direct estimate for that small area. On the other hand, for a small area with<br />
high sampling error, more weight will be given <strong>to</strong> the model based estimate as<br />
this will be more reliable.<br />
While being more complex than synthetic models, random effects models can<br />
be estimated using a variety of statistical techniques. However due <strong>to</strong> their<br />
technical nature, this manual will not go in<strong>to</strong> any further detail about how <strong>to</strong><br />
apply random effects models. A more detailed treatment will be given in the<br />
forthcoming technical manual.<br />
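The trade-off described above can be illustrated for the linear case. The sketch below uses one standard formulation of an area-level linear random effects model, in which the small area estimate is a weighted average of the direct and synthetic estimates. It assumes, for illustration only, that the variance of the random effect (`model_variance`) and the sampling variance of the direct estimate (`sampling_variance`) are known; in practice they must be estimated. All names and numbers are hypothetical.

```python
def shrinkage_weight(model_variance, sampling_variance):
    """Weight given to the direct estimate: model variance as a share of total variance."""
    return model_variance / (model_variance + sampling_variance)


def composite_estimate(direct, synthetic, model_variance, sampling_variance):
    """Weighted average of the direct estimate and the synthetic (model) estimate."""
    gamma = shrinkage_weight(model_variance, sampling_variance)
    return gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical areas sharing the same direct and synthetic estimates but with
# different sampling variances (a large sample gives a small sampling variance).
MODEL_VARIANCE = 4.0
for label, sampling_variance in [("large sample", 1.0), ("small sample", 36.0)]:
    weight = shrinkage_weight(MODEL_VARIANCE, sampling_variance)
    estimate = composite_estimate(10.0, 20.0, MODEL_VARIANCE, sampling_variance)
    print(label, round(weight, 2), round(estimate, 2))
```

The area with the large sample stays close to its direct estimate, while the area with the small sample is pulled most of the way towards the synthetic estimate.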
4.2 The Modelling Framework<br />
Figure 4.1 presents a schematic representation of the small area modeling framework<br />
followed in this manual. Figure 4.2 complements Figure 4.1 by providing a list of key<br />
questions the purpose of which is <strong>to</strong> aid the decision making process of small area<br />
modeling in a reasonably systematic approach. The objective of these questions is <strong>to</strong><br />
help the modeller/analyst better understand the modeling framework (Figure 4.1) and<br />
hence be able <strong>to</strong> choose the most appropriate technique for a given set of data. This,<br />
however, does not mean these are the only questions that need <strong>to</strong> be raised in this kind<br />
of exercise.<br />
The left-hand-side of Figure 4.1 shows the simplest small area methods, these being the<br />
Direct and Broad Area Ratio estima<strong>to</strong>rs, which are frequently used in the absence of<br />
good quality auxiliary data. The answer <strong>to</strong> question 1 of Figure 4.2 is important as good<br />
quality auxiliary data is a key requisite in order <strong>to</strong> proceed <strong>to</strong> the regression-based small<br />
area estima<strong>to</strong>rs. We take good auxiliary data <strong>to</strong> mean area-level and/or unit-level data<br />
that are potentially correlated (both theoretically and empirically) with the variable of<br />
interest. Section 3.5 discusses some of the ways the quality of auxiliary data can be<br />
determined. The quality of the auxiliary data, therefore, has a large bearing on the<br />
reliability of model predictions for the variable of interest. In other words, when good<br />
quality auxiliary data is available one can choose among a number of regression-based<br />
estima<strong>to</strong>rs that “borrow strength” from the relationship between the variable of interest<br />
and the auxiliary data; thereby improving the quality of small area estimates/predictions.<br />
Figure 4.1: Small Area Modelling Framework<br />
[Figure: tree diagram of Small Area Methods, ordered from less complex to more complex.<br />
Simple Small Area Models (with no auxiliary data): Direct Estimator; Broad Area Ratio Estimator.<br />
Regression based Models (with auxiliary data): Synthetic Regression Models; Random Effects Models.<br />
Further labels on the regression branch: Area Level Analysis; Unit Level Analysis;<br />
Linear Models for continuous data; Generalised Linear Models for count data (Poisson model)<br />
and binary data (logistic model); Univariate Analysis; Multivariate Analysis.]<br />
The classes of regression based estima<strong>to</strong>rs are shown in the right-hand-side of Figure<br />
4.1. These estima<strong>to</strong>rs can be classified in<strong>to</strong> two major categories, namely, the synthetic<br />
regression models and the random effects models which are relatively more complex<br />
than their synthetic counterparts. For the moment let us focus on the synthetic models.<br />
Once this choice is made the next choice is between a linear or generalised linear<br />
model. The linear model, which is the simplest of all, is suitable if the variable of interest<br />
is continuous (e.g. income, age, etc.). If the variable of interest is not continuous<br />
(binary or count data) one can select appropriately from a wide range of Generalised<br />
Linear Models. The most common examples are the Logistic and Poisson models which<br />
are used <strong>to</strong> model binary and count data, respectively.<br />
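For reference, the three model types mentioned above can be written in their usual textbook form. The notation is our own and does not come from this manual: $y_i$ is the variable of interest for area or unit $i$, $\mathbf{x}_i$ is the vector of auxiliary variables and $\boldsymbol{\beta}$ is the vector of regression coefficients.

```latex
\begin{align*}
\text{Linear (continuous data):} \quad & E(y_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta} \\
\text{Logistic (binary data):} \quad & \log\frac{p_i}{1-p_i} = \mathbf{x}_i^{\top}\boldsymbol{\beta},
  \qquad p_i = P(y_i = 1) \\
\text{Poisson (count data):} \quad & \log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta},
  \qquad \mu_i = E(y_i)
\end{align*}
```

In each case the right-hand side is the same linear combination of auxiliary variables; only the link between that combination and the expected value of the variable of interest changes.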
Clearly, as indicated in questions 2 <strong>to</strong> 3 of Figure 4.2, the choice of any of these or other<br />
models depends on the following important interrelated fac<strong>to</strong>rs:<br />
i. the level at which the small area estimates are required. Are small area estimates<br />
required at area-level or for some other sub-population such as age by sex group?<br />
ii. the nature of the auxiliary data available related to the variable of interest. Again,<br />
this may include whether the data is at the unit-level (person-level), area-level or<br />
both.<br />
iii. the nature of the variable of interest, i.e., whether it is continuous, binary or count<br />
data.<br />
iv. users' quality requirements for small area estimates<br />
v. access to statistical expertise<br />
Small area models can be fitted either at area-level or person-level. Area level models are<br />
fitted when the variable of interest and associated covariates in the auxiliary data are<br />
observed at the level of the specific geographic area, which is referred to in Figure 4.1 as<br />
area-level analysis. On the other hand, a unit/person-level analysis refers to a<br />
unit/person-level model that makes use of individual/unit level data in the analysis. When<br />
a model is fitted using unit/person-level data then the predictions based on this model<br />
must be aggregated <strong>to</strong> produce area-level estimates. It is also possible <strong>to</strong> fit a unit/person<br />
level model involving both individual and area-level covariates.<br />
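The aggregation of unit-level predictions to area-level estimates can be sketched as follows. This is an illustration under stated assumptions only: the logistic coefficients are invented rather than fitted, the single person-level covariate (an aged 65 and over flag) is hypothetical, and population counts by area and age group are assumed to be known from an external source.

```python
import math

# Hypothetical logistic coefficients, assumed to come from a fitted unit-level model.
INTERCEPT = -2.0
AGE65_EFFECT = 1.5


def predicted_probability(aged_65_plus):
    """Predicted probability that a person has the characteristic (logistic model)."""
    linear_predictor = INTERCEPT + AGE65_EFFECT * aged_65_plus
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical population counts by small area and age group (key 1 = aged 65 and over).
POPULATION_CELLS = {
    "A": {0: 800, 1: 200},
    "B": {0: 500, 1: 500},
}


def area_estimates(cells):
    """Aggregate person-level predictions to a count and a proportion for each area."""
    results = {}
    for area, counts in cells.items():
        expected_count = sum(n * predicted_probability(flag) for flag, n in counts.items())
        results[area] = {
            "count": expected_count,
            "proportion": expected_count / sum(counts.values()),
        }
    return results


estimates = area_estimates(POPULATION_CELLS)
for area, result in estimates.items():
    print(area, round(result["count"], 1), round(result["proportion"], 3))
```

Because the model here contains no area-specific term, differences between the two areas arise solely from their different age profiles, which is the defining feature of a synthetic estimate.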
Choosing the right model for the right type of data is crucial in the modelling process.<br />
For example, if the auxiliary information consists of data observed at area or unit level<br />
and the variable of interest is of a continuous nature, then it will be appropriate <strong>to</strong> use a<br />
linear model <strong>to</strong> estimate the variable of interest. Alternatively, if we have unit level data<br />
where the variable of interest is binary (e.g., 1 = person has a disability and 0 = person<br />
has no disability), as is the case in many small area models, then we would choose<br />
a model that captures the binary nature of the observations, such as the logistic<br />
regression model. Similarly, if our data provides, say, area level counts of people<br />
with a disability then a suitable choice would be the Poisson model, which is appropriate<br />
for count data. It is also possible to use two or more models (e.g., unit-level and<br />
area-level models) provided that the dataset is amenable to such analyses. For instance,<br />
as we will see in the examples of Section 5, the logistic and Poisson models are used to<br />
predict person-level and area-level disability proportions, respectively.<br />
Figure 4.2: Key Questions for Small Area Modelling<br />
Q1. Do you have good quality auxiliary data?<br />
If No: Simple Direct or Broad Area Ratio Estimators are the likely candidates.<br />
If Yes: Use Linear or Generalised Linear models, depending on your data.<br />
Q2. Is the variable of interest continuous, binary or count data?<br />
If continuous data: Linear model<br />
If binary data: Logistic model<br />
If count data: Poisson model<br />
Q3. Shall I use an area-level or unit-level model or both?<br />
Q4. At what level is my auxiliary data available and of good quality?<br />
Good area level or unit level continuous data<br />
Good unit level binary data<br />
Good area level count data<br />
Q5. Are there likely to be major differences between small areas that are not taken into account by the auxiliary data?<br />
If Yes: Use random effects model<br />
Consult methodology staff for technical advice<br />
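The decision path through questions 1, 2 and 5 of Figure 4.2 can be summarised in a few lines of code. This is only a schematic aid with a function name and argument names of our own choosing; it does not encode questions 3 and 4 (area-level versus unit-level modelling) and is no substitute for advice from methodology staff.

```python
MODEL_FAMILIES = {"continuous": "linear", "binary": "logistic", "count": "Poisson"}


def suggest_technique(has_good_auxiliary_data, variable_type=None,
                      unexplained_area_differences=False):
    """Return a candidate technique following questions 1, 2 and 5 of Figure 4.2."""
    # Q1: without good auxiliary data only the simple methods are available.
    if not has_good_auxiliary_data:
        return "direct or broad area ratio estimator"
    # Q2: the type of the variable of interest determines the model family.
    if variable_type not in MODEL_FAMILIES:
        raise ValueError("variable_type must be 'continuous', 'binary' or 'count'")
    # Q5: unexplained differences between small areas point to a random effects model.
    kind = "random effects" if unexplained_area_differences else "synthetic"
    return f"{kind} {MODEL_FAMILIES[variable_type]} model"


print(suggest_technique(False))
print(suggest_technique(True, "binary"))
print(suggest_technique(True, "count", unexplained_area_differences=True))
```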
The next key question (as indicated by question 5 of Figure 4.2) is when and why we<br />
use the random effects models as compared to the synthetic models. To start with, the<br />
preceding discussion on the choice of models (linear versus generalised linear) also<br />
applies to the random effects models. However, the random effects models are<br />
different in that they include an additional error component <strong>to</strong> account for differences<br />
between units that aren’t explained by the auxiliary variables. In other words, synthetic<br />
models assume that the variable of interest can be determined from the same functional<br />
relationship with the auxiliary variables, and that this relationship applies across all small<br />
areas.<br />
This assumption, however, could be restrictive for a number of reasons. For example, in<br />
the disability data some small areas are located in remote areas with limited support<br />
facilities and services while others are in big cities with better infrastructure and services,<br />
to which people with a disability may move to take advantage of the improved<br />
services. Some areas may have a larger population of indigenous people relative to<br />
others, which again may affect disability rates in different areas. Yet others are located in<br />
coastal areas that attract people of retirement age and the elderly. These factors are not<br />
fully accounted for in the auxiliary data. Thus, unless these and other fac<strong>to</strong>rs are taken<br />
in<strong>to</strong> account in the model, they could limit the predictive abilities of synthetic models<br />
for some small areas or units. Such differences, therefore, call for a more general/flexible<br />
specification of the models <strong>to</strong> capture the area-specific (person-specific) fac<strong>to</strong>rs after<br />
taking account of the auxiliary variables - and hence the random effects models. Thus,<br />
the choice between random effects models versus synthetic models could be made on<br />
the basis of one or more of the following fac<strong>to</strong>rs:<br />
i. prior knowledge of small areas or units vis-a-vis the auxiliary data gained from<br />
experience or through discussions with subject matter specialists,<br />
users/stakeholders, etc. (for example, we may not have a lot of faith in our auxiliary<br />
variables / the synthetic model).<br />
ii. statistical outcomes based on the models: a close assessment or evaluation of<br />
the small area estimates/predictions from comparative synthetic and random effects<br />
models to see whether they meet expectations.<br />
iii. statistical/econometric tests (a battery of diagnostic and statistical<br />
tests) on the adequacy of the models.<br />
iv. when one wants small areas with large samples to be less affected by the model<br />
because the direct estimates for such areas can be expected to be quite reliable in<br />
their own right. The random effects model allows for a suitable trade-off between the<br />
reliability of the direct estimates and reliability of model estimates.<br />
v. when one wants to apply the model to areas with no sample in them (out of sample<br />
areas). Random effects models allow for greater flexibility in applying the model to<br />
make predictions for areas other than those to which it was fitted.<br />
Clearly, random effects models, once chosen, require a higher level of<br />
statistical skill and some familiarity with specialised software. It is also true that more<br />
complex models may not necessarily provide better results. This is particularly true if<br />
sufficiently strong relationships from which to borrow strength are simply<br />
not present in the data. One should be aware that results from simple models may be as<br />
good as those from complex ones. In other words, as will be discussed later in this<br />
section, the gains in efficiency of estimates from using more complex models need to be<br />
assessed.<br />
An important aspect of the modeling process which may also have a significant bearing on<br />
the complexity and quality of the analysis is whether the variable of interest involves a<br />
univariate or multivariate analytical framework, that is, whether the variable of<br />
interest takes a univariate or multivariate form. For example, in the<br />
disability study, if our aim is simply to predict whether a person has an<br />
impairment or not (i.e., 1 = person has a disability and 0 = person has no disability)<br />
regardless of the type of impairment, then this is within a univariate framework. On the<br />
other hand, a breakdown of the variable of interest by type of impairment (e.g., physical,<br />
mental, sensory etc.) would involve a multivariate framework. The real issue here is that<br />
while a univariate analysis is simpler to undertake, a multivariate analysis provides an<br />
opportunity to exploit additional information on the correlations that exist between the<br />
various types of impairment and hence improve the reliability of estimates.<br />
4.3 Trade-off between Quality, Cost, Time and Effort<br />
In this manual, the term ‘quality’ is used to indicate the overall level of accuracy,<br />
acceptability and reliability of small area estimates, both from a statistical point of view<br />
and in terms of providing a more informed and reliable decision making capability for<br />
users. More specifically, we borrow from the characterisation of quality as having six<br />
dimensions, these being: relevance, accuracy, timeliness, accessibility, interpretability<br />
and coherence (Allen, 2001). The ABS has a strategic policy of ensuring the quality of all<br />
its output and clearly demonstrating that quality to users (ABS, 2002), and this is<br />
particularly relevant and important for the production and release of small area<br />
estimates.<br />
A key aspect that needs to be taken into consideration is whether the gains in terms of<br />
quality of outputs from using more complex methods outweigh the time, costs and<br />
effort required to generate, interpret and validate the results. Regardless of the degree of<br />
sophistication contemplated at the outset, the small area practitioner is well advised to<br />
commence with simpler techniques (say, synthetic models). Should resources and user<br />
requirements permit, more rigorous statistical techniques may be applied in stages<br />
resulting in a choice of competing models (say, fitting both synthetic and random effects<br />
models to the data). Choosing the best model in light of expert knowledge and<br />
informed judgment would lead to improved results and decision making outcomes.<br />
Figure 4.3 below provides some indication of the trade-off between quality, cost, time<br />
and effort in small area modeling. It should be clear that these relationships are not<br />
linear in nature and one cannot authoritatively represent such relationships in a simple<br />
two-dimensional diagram like this. The purpose is, however, to provide a rough idea of<br />
the kind of relationships that may exist between quality and cost/time/effort. In Figure<br />
4.3, quality is represented by the vertical axis and could range from, say, low to high<br />
levels of quality. The horizontal axis represents cost/time/effort combined. The three<br />
terms (cost/time/effort) are presented in this way as they are interrelated and each is<br />
implicitly defined by the others. In simple terms, it is assumed that increased effort<br />
implies a longer time frame and presupposes more resources and higher costs.<br />
Figure 4.3 : Trade-off between Quality and Cost / Time / Effort<br />
[Figure: a curve of Quality (vertical axis) against Cost/time/effort (horizontal axis), running from simple models to complex models. Labels on the diagram: Good auxiliary data, Level of precision, Finer disaggregation. Issues listed above the curve: Statistical expertise, Robustness of results, Understanding results, Interpretability, Validity of assumptions, User requirements, Availability of resources, Timeliness/deadlines.]<br />
As you can see from Figure 4.3, good quality auxiliary data is a crucial prerequisite for<br />
obtaining quality small area estimates. Quality is of course a relative term and depends<br />
very much upon the clients’ decision making requirements.<br />
Assuming that we have good quality auxiliary data, we would expect more sophisticated<br />
methods to provide results of a higher level of quality, as indicated by the upward slope<br />
of the cost-quality curve. The same curve also indicates that somewhere in the<br />
continuum there exists an optimal point (a level of precision) beyond which any additional<br />
effort/cost/time has either marginal or declining effects on quality.<br />
More elaborate techniques may give only marginal improvements in accuracy but<br />
decrease timeliness, an important dimension of quality. Overall quality may also be<br />
eroded when ever smaller areas or finer disaggregations of the data are<br />
demanded. For example, in the disability analysis, disaggregating disability by type of<br />
impairment, level of severity and age group, in addition to the small area level, leads to<br />
poor quality estimates, especially for the rarest impairment types such as sensory.<br />
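A rough sketch of why finer disaggregation erodes quality is given here. It computes the<br />
relative standard error (RSE) of an estimated proportion under simple random sampling,<br />
ignoring design effects. The function name, prevalences and sample sizes are all hypothetical<br />
and are not taken from the disability study.<br />

```python
import math


def rse_of_proportion(p, n):
    """Relative standard error of a proportion p estimated from n units,
    assuming simple random sampling (no design effect)."""
    return math.sqrt(p * (1.0 - p) / n) / p


# Hypothetical small area with 400 sampled persons.
rse_any_disability = rse_of_proportion(p=0.20, n=400)   # any impairment
rse_rare_type = rse_of_proportion(p=0.02, n=400)        # one rare impairment type
rse_rare_type_by_age = rse_of_proportion(p=0.02, n=100) # same type, one age group

for label, rse in [
    ("any disability", rse_any_disability),
    ("rare type", rse_rare_type),
    ("rare type within one age group", rse_rare_type_by_age),
]:
    print(f"{label}: RSE = {rse:.0%}")
```

Under these assumed figures, each further breakdown either lowers the prevalence being<br />
estimated or shrinks the sample in the cell, and the RSE rises accordingly.<br />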
There are also other issues, as shown just above the cost-curve in Figure 4.3, that may<br />
have a significant bearing on the quality and cost of small area estimates. For<br />
example, the use of more complex models may require a higher level of subject matter<br />
knowledge and expertise to assist in understanding and interpreting model results.<br />
Such knowledge is also important for testing the validity of assumptions inherent in the<br />
model and in checking the robustness and sensitivity of model results.<br />
Finally, two important points have to be made in relation to the quality versus<br />
cost issue discussed above. Firstly, simplicity is an important aspect of quality in<br />
that it aids the interpretability of small area output. We do not intend to imply from<br />
Figure 4.3 that simpler methods always imply poor quality estimates. More complex<br />
methods should only be attempted where there are likely to be demonstrable gains in<br />
the accuracy of small area estimates. Secondly, the use of sophisticated methods may<br />
not necessarily lead to higher costs. For instance, once a strong analytic capability to<br />
undertake small area estimation has been established (in terms of statistical skill and<br />
other resources) any increase in cost, effort and time for undertaking complex small area<br />
methods may be marginal.<br />