presentation - VirtualRDC @ Cornell - Cornell University
presentation - VirtualRDC @ Cornell - Cornell University
presentation - VirtualRDC @ Cornell - Cornell University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
GoBack
Confidentiality Protection in the Census<br />
Bureau’s Quarterly Workforce Indicators<br />
John M. Abowd 1,2 , Bryce E. Stephens 2,3 , and Lars Vilhuber 1<br />
1 <strong>Cornell</strong> <strong>University</strong>, 2 U.S. Census Bureau, 3 <strong>University</strong> of Maryland<br />
August 10, 2005 - p. 1/31
Introduction<br />
■ We will describe the confidentiality protection mechanism as<br />
applied to a new statistical product: the Quarterly Workforce<br />
Indicators (QWI).<br />
■ The underlying data infrastructure was designed by the<br />
Longitudinal Employer-Household Dynamics Program at the<br />
Census Bureau and is described in detail in Abowd et al<br />
(2005).<br />
■ From a longitudinally integrated frame of state<br />
unemployment insurance wage records, we create measures<br />
of employment, wages, hiring, separation, job creation and<br />
job destruction.<br />
August 10, 2005 - p. 2/31
Core problem<br />
■ Disclosure proofing is required to protect the information<br />
about individuals and businesses that contribute to<br />
✦ (confidential) unemployment insurance (UI) wage records<br />
✦ (confidential) Quarterly Census of Employment and<br />
Wages (QCEW, also known as ES-202) reports<br />
✦ as well as information from Census Bureau demographic<br />
data that have been integrated with these sources.<br />
■ Primary concern of the confidentiality protection mechanism<br />
is thus with small cells, i.e., cells that reflect data on few<br />
individuals or few firms.<br />
August 10, 2005 - p. 3/31
Protection provided<br />
■ In general, data are considered protected when<br />
aggregate cell values do not closely approximate data for any one<br />
respondent in the cell (Cox and Zayatz, 1993, pg. 5)<br />
■ In the QWI confidentiality protection scheme, confidential<br />
micro-data are considered protected by noise infusion if<br />
1. any inference regarding the magnitude of a particular<br />
respondent’s data differs from the confidential quantity by<br />
at least c% even if that inference is made by a coalition of<br />
respondents with exact knowledge of their own answers<br />
or<br />
2. any inference regarding the magnitude of an item is<br />
incorrect with probability no less than y%, where c and y<br />
are confidential but generally large<br />
August 10, 2005 - p. 4/31
Quality of the disclosed data<br />
■ The confidentiality-protected data must be inference-valid for<br />
a well-defined set of analyses<br />
■ We show that<br />
✦ the theoretical properties of the disclosure-proofing<br />
mechanism are designed to maintain analytical validity for<br />
trend analysis;<br />
✦ in practice, the disclosure-proofed data are not biased;<br />
✦ in practice, the time-series properties of the<br />
disclosure-proofed data remain intact.<br />
August 10, 2005 - p. 5/31
Three-layer confidentiality protection in<br />
QWI (I)<br />
■ Layer 1: Multiplicative noise-infusion at the establishment<br />
level, with three very important properties<br />
1. every establishment-level data item is distorted by some<br />
minimum amount<br />
2. distortion amount and direction are time-invariant: data<br />
are always distorted in the same direction (increased or<br />
decreased) by the same percentage amount in every<br />
period.<br />
3. when estimates are aggregated, the effects of the<br />
distortion cancel out for the vast majority of the estimates<br />
August 10, 2005 - p. 6/31
Three-layer confidentiality protection in<br />
QWI (II)<br />
■ Layer 2: Weighting of estimates at higher levels ( e.g.,<br />
sub-state geography and industry detail)<br />
✦ construct weights such that state-level beginning of<br />
quarter employment for all private employers matches the<br />
first month in quarter employment in QCEW.<br />
✦ the establishment-level weight is used for every indicator<br />
in the QWIs<br />
August 10, 2005 - p. 7/31
Three-layer confidentiality protection in<br />
QWI (II)<br />
■ Layer 3: Small-cell editing (Suppression or synthesizing)<br />
✦ Some aggregate estimates are based on fewer than three<br />
persons or establishments.<br />
✦ Currently, these estimates are suppressed and a flag set<br />
to indicate suppression.<br />
✦ In next version of disclosure-proofing system, these<br />
estimates are replaced with synthetic values.<br />
Note: Editing is only used when the combination of noise<br />
infusion and weighting may not distort the publication data<br />
with a high enough probability to meet the criteria layed out<br />
above.<br />
✦ Count data such as employment are subject to editing.<br />
✦ Continuous dollar measures like payroll are not.<br />
✦ Regardless of small-cell editing, all published estimates<br />
are still substantially influenced by the noise that was<br />
infused in the first layer of the protection system.<br />
August 10, 2005 - p. 8/31
Implementation of multiplicative noise<br />
model<br />
■ a random fuzz factor δ j is drawn for each establishment j<br />
p (δ j ) =<br />
F (δ j ) =<br />
⎧<br />
⎪⎨<br />
⎪⎩<br />
⎧<br />
⎪⎨<br />
⎪⎩<br />
(b − δ)/(b − a) 2 , δ ∈ [a, b]<br />
(b + δ − 2)/(b − a) 2 , δ ∈ [2 − b, 2 − a]<br />
0, otherwise<br />
[<br />
0, δ < 2 − b<br />
(δ + b − 2) 2 / 2 (b − a) 2] , δ ∈ [2 − b, 2 − a]<br />
[<br />
0.5, δ ∈ (2 − a, a)<br />
0.5 + (b − a) 2 − (b − δ) 2] , δ ∈ [a, b]<br />
1, δ > b<br />
where a = 1 + c/100 and b = 1 + d/100 are constants chosen<br />
such that the true value is distorted by a minimum of c<br />
percent and a maximum of d percent.<br />
August 10, 2005 - p. 9/31
Distribution of Fuzz Factors<br />
0<br />
0 2−b 2−a 1 a b 2<br />
August 10, 2005 - p. 10/31
Distorting magnitudes and counts<br />
The exact implementation depends on the type of estimate:<br />
■ Magnitudes and counts<br />
X ∗ jt = δ j X jt ,<br />
where X jt is an establishment level statistic among B, E, M,<br />
F , A, S, H, R, FA, FS, FH, W k , WFH, NA, NH, NR, NS<br />
August 10, 2005 - p. 11/31
Distorting ratios<br />
Ratios are distorted by distorting numerators (magnitudes), but<br />
using undistorted denominators:<br />
ZY ∗<br />
jt =<br />
Y ∗<br />
jt<br />
B(Y ) jt<br />
= δ j<br />
Y jt<br />
B(Y ) jt<br />
,<br />
This method is used for<br />
■ average earnings (ZW k ) and<br />
■ average periods of nonemployment (ZN_) for various groups<br />
August 10, 2005 - p. 12/31
Distorting flows<br />
■ Distorted net job flow (JF ) is computed at the aggregate (k =<br />
geography, industry, or combination of the two for the<br />
appropriate age and sex categories) level as the product of<br />
the aggregated, undistorted rate of growth and the<br />
aggregated distorted employment:<br />
JF ∗ kt = G kt × Ē∗ kt = JF kt × Ē∗ kt<br />
Ē kt<br />
.<br />
■ The formulas for distorting gross job creation (JC) and job<br />
destruction (JD) are similar.<br />
■ Same logic is used to distort wage changes for subgroups<br />
August 10, 2005 - p. 13/31
Item suppression<br />
■ Some disclosure risk remains for counts based on very few<br />
entities in a cell (fewer than three individuals or employers)<br />
■ Variables affected are: B, E, M, F , A, S, H, R, FA, FH,<br />
FS, JC, JD, JF , FJC, FJD, FJF .<br />
➜ item suppression based on the number of either workers or<br />
the number of employers that contribute data for that item in<br />
a cell k in time period t, where a cell represents a particular<br />
combination of geography × industry × age × sex.<br />
■ Because of noise infusion, no complementary suppressions<br />
are needed<br />
■ Some denominators may be zeroes - the ratio or rate cannot<br />
be computed.<br />
August 10, 2005 - p. 14/31
Economic concepts in QCEW and QWI<br />
Difference in the economic concepts underlying the Quarterly<br />
Census of Employment and Wages (QCEW) and the QWI<br />
statistics<br />
■ QCEW: employment on the 12th day of the first month in the<br />
quarter (QCEW 1,jt )<br />
■ QWI: several measures of employment, derived from reports<br />
of quarterly employment and wages of individual workers at<br />
particular employers (state UI accounts).<br />
■ Key definition: Beginning of quarter employment B jt<br />
employees at establishment j in both quarter t and t − 1, and<br />
by inference, on the 1st day of quarter t.<br />
August 10, 2005 - p. 15/31
Protection by Weighting<br />
■ QCEW 1,jt and B jt are not identical because<br />
✦ they do not refer to exactly the same point in time,<br />
✦ the in-scope establishments differ slightly, and<br />
✦ they are computed from different universe data.<br />
■ Actual differences captured by the QWI weighting scheme:<br />
time-series of adjustment weights are defined by<br />
∑<br />
w t b jt = ∑ QCEW 1,jt<br />
j<br />
■ All variables are weighted by w t<br />
j<br />
August 10, 2005 - p. 16/31
Protection by Imputation<br />
■ no actual confidential micro-data measured at the<br />
establishment level in QWI<br />
■ workplace characteristics ( geography, industry) are<br />
multiply-imputed for multi-unit employers<br />
➜ these establishments are protected by a form of synthetic<br />
data.<br />
August 10, 2005 - p. 17/31
Protection<br />
Table 1: Small Cells: B, Raw vs. Weighted<br />
(a) Illinois<br />
Weighted count<br />
Unweighted<br />
5 or<br />
count 0 1 2 3 4 more<br />
0 99.33 0.66 0.00 0.00 0.00 0.00<br />
1 0.10 96.76 3.13 0.00 0.00 0.00<br />
2 0.01 2.00 84.68 13.26 0.04 0.01<br />
3 0.01 0.01 3.42 75.72 20.26 0.59<br />
4 0.00 0.00 0.01 4.49 67.62 27.87<br />
5 or more 0.00 0.00 0.00 0.01 0.59 99.39<br />
Total number of cells: 14,229,968 . For details, see text.<br />
August 10, 2005 - p. 18/31
Protection<br />
Table 2: Small Cells: B, Undistorted vs. Distorted<br />
(a) Illinois<br />
Distorted count<br />
Undistorted<br />
5 or<br />
count 0 1 2 3 4 more<br />
0 99.86 0.14 0.00 0.00 0.00 0.00<br />
1 0.91 95.75 3.34 0.00 0.00 0.00<br />
2 0.00 4.27 87.25 8.47 0.00 0.00<br />
3 0.00 0.00 10.69 77.20 12.11 0.00<br />
4 0.00 0.00 0.00 14.73 67.49 17.78<br />
5 or more 0.00 0.00 0.00 0.00 1.93 98.07<br />
Total number of cells: 14,229,968 . Both comparisons are for weighted<br />
data. For details, see text.<br />
August 10, 2005 - p. 19/31
Protection<br />
Table 3: Small Cells: B, Raw vs. Published<br />
(a) Illinois<br />
Published count<br />
Unweighted<br />
5 or<br />
count Suppressed 0 1 2 3 4 more<br />
0 0.79 99.21 0.00 0.00 0.00 0.00 0.00<br />
1 99.91 0.08 0.00 0.00 0.00 0.00 0.00<br />
2 94.02 0.01 0.00 0.00 5.87 0.09 0.01<br />
3 34.33 0.00 0.00 0.00 47.75 16.98 0.94<br />
4 25.87 0.00 0.00 0.00 5.56 43.24 25.32<br />
5 or more 15.20 0.00 0.00 0.00 0.03 0.82 83.95<br />
Total number of cells: 14,229,968 . Raw is unweighted and undistorted.<br />
Published is after weighting, distorting, and suppression. For details,<br />
see text.<br />
August 10, 2005 - p. 20/31
Analytic validity<br />
Analytic validity is obtained when<br />
■ the data display no bias and<br />
■ the additional dispersion can be quantified so<br />
that statistical inferences can be adjusted to<br />
accomodate it.<br />
Evidence on<br />
■ time-series properties of the distorted and published data,<br />
■ cross-sectional unbiasedness of the published data<br />
Unit of analysis:<br />
■ interior sub-state geography × industry × age × sex cell kt.<br />
■ Sub-state geography = county<br />
■ Industry classification = SIC (Division, 2- and 3-digit)<br />
August 10, 2005 - p. 21/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Distorted Data<br />
∆r = r − r ∗<br />
Percentile B A S F JF<br />
01 -0.069373 -0.049274 -0.052155 -0.066461 -0.007969<br />
05 -0.041585 -0.031460 -0.032934 -0.039787 -0.004651<br />
10 -0.028849 -0.022166 -0.023733 -0.027926 -0.002785<br />
25 -0.011920 -0.009996 -0.010161 -0.011913 -0.001003<br />
50 0.000571 0.000384 0.000768 0.000306 -0.000044<br />
75 0.013974 0.011806 0.012891 0.012632 0.000776<br />
90 0.030948 0.025152 0.026290 0.028299 0.002263<br />
95 0.044233 0.033871 0.037198 0.040565 0.004375<br />
99 0.078519 0.054415 0.060327 0.074212 0.007845<br />
SIC-division × County, State of Illinois. r from AR(1) estimated for each<br />
cell’s time series.<br />
August 10, 2005 - p. 22/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Distorted Data,<br />
SIC Division, F<br />
Distribution of Delta r<br />
IL, Division<br />
25 50 75<br />
−.1<br />
−.05<br />
−.025<br />
.025 .05<br />
0 .1<br />
Delta r(f)<br />
August 10, 2005 - p. 23/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Distorted Data,<br />
SIC Division, JF<br />
Distribution of Delta r<br />
IL, Division<br />
25 50 75<br />
−.1<br />
−.05<br />
−.025<br />
.025 .05<br />
0 .1<br />
Delta r(jf)<br />
August 10, 2005 - p. 24/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Distorted Data,<br />
SIC Division, B<br />
Distribution of Delta r<br />
IL, Division<br />
25 50 75<br />
−.1<br />
−.05<br />
−.025<br />
.025 .05<br />
0 .1<br />
Delta r(b)<br />
August 10, 2005 - p. 25/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Published<br />
Data, SIC Division, B<br />
Distribution of Delta r<br />
IL, Division<br />
25 50 75<br />
−.1<br />
−.05<br />
−.025<br />
.025 .05<br />
0 .1<br />
Delta r(b)<br />
August 10, 2005 - p. 26/31
Distribution of the Error in the First Order<br />
Serial Correlation, Raw vs. Published<br />
Data, SIC2 B<br />
Distribution of Delta r<br />
IL, SIC2<br />
25 50 75<br />
−.1<br />
−.05<br />
−.025<br />
.025 .05<br />
0 .1<br />
Delta r(b)<br />
August 10, 2005 - p. 27/31
Cross-sectional unbiasedness: B<br />
Fraction<br />
0 .05 .1 .15<br />
−d −c 0 +c +d<br />
percent<br />
August 10, 2005 - p. 28/31
Cross-sectional unbiasedness: W 1<br />
Fraction<br />
0 .05 .1 .15<br />
−d −c 0 +c +d<br />
percent<br />
August 10, 2005 - p. 29/31
Conclusion<br />
■ First large-scale implementation of confidentiality protection<br />
by noise-infusion<br />
✦ absence of table-level or complementary suppressions,<br />
despite significant number of count item suppressions<br />
✦ All ratios and magnitudes are released without<br />
suppression<br />
■ High-quality data:<br />
✦ remarkable consistency in the serial correlation<br />
coefficients between the undistorted and the distorted<br />
data series at highly detailed levels<br />
✦ little or no bias in induced on average by the confidentiality<br />
protection mechanism<br />
✦ distributions of bias are tightly centered around the modal<br />
bias of zero<br />
✦ properties of the raw data distortion can be used to<br />
correct inferences from statistical models<br />
August 10, 2005 - p. 30/31
Download of the <strong>presentation</strong> and paper<br />
This <strong>presentation</strong> and the accompanying paper can be<br />
downloaded from the <strong>Cornell</strong> <strong>VirtualRDC</strong> website<br />
http://lservices.ciser.cornell.edu/news/index.phpitemid=40<br />
and soon on the U.S. Census LEHD website<br />
http://lehd.dsd.census.gov<br />
August 10, 2005 - p. 31/31