27.12.2014 Views

presentation - VirtualRDC @ Cornell - Cornell University

presentation - VirtualRDC @ Cornell - Cornell University

presentation - VirtualRDC @ Cornell - Cornell University

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

GoBack


Confidentiality Protection in the Census<br />

Bureau’s Quarterly Workforce Indicators<br />

John M. Abowd 1,2 , Bryce E. Stephens 2,3 , and Lars Vilhuber 1<br />

1 <strong>Cornell</strong> <strong>University</strong>, 2 U.S. Census Bureau, 3 <strong>University</strong> of Maryland<br />

August 10, 2005 - p. 1/31


Introduction<br />

■ We will describe the confidentiality protection mechanism as<br />

applied to a new statistical product: the Quarterly Workforce<br />

Indicators (QWI).<br />

■ The underlying data infrastructure was designed by the<br />

Longitudinal Employer-Household Dynamics Program at the<br />

Census Bureau and is described in detail in Abowd et al<br />

(2005).<br />

■ From a longitudinally integrated frame of state<br />

unemployment insurance wage records, we create measures<br />

of employment, wages, hiring, separation, job creation and<br />

job destruction.<br />

August 10, 2005 - p. 2/31


Core problem<br />

■ Disclosure proofing is required to protect the information<br />

about individuals and businesses that contribute to<br />

✦ (confidential) unemployment insurance (UI) wage records<br />

✦ (confidential) Quarterly Census of Employment and<br />

Wages (QCEW, also known as ES-202) reports<br />

✦ as well as information from Census Bureau demographic<br />

data that have been integrated with these sources.<br />

■ Primary concern of the confidentiality protection mechanism<br />

is thus with small cells, i.e., cells that reflect data on few<br />

individuals or few firms.<br />

August 10, 2005 - p. 3/31


Protection provided<br />

■ In general, data are considered protected when<br />

aggregate cell values do not closely approximate data for any one<br />

respondent in the cell (Cox and Zayatz, 1993, pg. 5)<br />

■ In the QWI confidentiality protection scheme, confidential<br />

micro-data are considered protected by noise infusion if<br />

1. any inference regarding the magnitude of a particular<br />

respondent’s data differs from the confidential quantity by<br />

at least c% even if that inference is made by a coalition of<br />

respondents with exact knowledge of their own answers<br />

or<br />

2. any inference regarding the magnitude of an item is<br />

incorrect with probability no less than y%, where c and y<br />

are confidential but generally large<br />

August 10, 2005 - p. 4/31


Quality of the disclosed data<br />

■ The confidentiality-protected data must be inference-valid for<br />

a well-defined set of analyses<br />

■ We show that<br />

✦ the theoretical properties of the disclosure-proofing<br />

mechanism are designed to maintain analytical validity for<br />

trend analysis;<br />

✦ in practice, the disclosure-proofed data are not biased;<br />

✦ in practice, the time-series properties of the<br />

disclosure-proofed data remain intact.<br />

August 10, 2005 - p. 5/31


Three-layer confidentiality protection in<br />

QWI (I)<br />

■ Layer 1: Multiplicative noise-infusion at the establishment<br />

level, with three very important properties<br />

1. every establishment-level data item is distorted by some<br />

minimum amount<br />

2. distortion amount and direction are time-invariant: data<br />

are always distorted in the same direction (increased or<br />

decreased) by the same percentage amount in every<br />

period.<br />

3. when estimates are aggregated, the effects of the<br />

distortion cancel out for the vast majority of the estimates<br />

August 10, 2005 - p. 6/31


Three-layer confidentiality protection in<br />

QWI (II)<br />

■ Layer 2: Weighting of estimates at higher levels ( e.g.,<br />

sub-state geography and industry detail)<br />

✦ construct weights such that state-level beginning of<br />

quarter employment for all private employers matches the<br />

first month in quarter employment in QCEW.<br />

✦ the establishment-level weight is used for every indicator<br />

in the QWIs<br />

August 10, 2005 - p. 7/31


Three-layer confidentiality protection in<br />

QWI (II)<br />

■ Layer 3: Small-cell editing (Suppression or synthesizing)<br />

✦ Some aggregate estimates are based on fewer than three<br />

persons or establishments.<br />

✦ Currently, these estimates are suppressed and a flag set<br />

to indicate suppression.<br />

✦ In next version of disclosure-proofing system, these<br />

estimates are replaced with synthetic values.<br />

Note: Editing is only used when the combination of noise<br />

infusion and weighting may not distort the publication data<br />

with a high enough probability to meet the criteria layed out<br />

above.<br />

✦ Count data such as employment are subject to editing.<br />

✦ Continuous dollar measures like payroll are not.<br />

✦ Regardless of small-cell editing, all published estimates<br />

are still substantially influenced by the noise that was<br />

infused in the first layer of the protection system.<br />

August 10, 2005 - p. 8/31


Implementation of multiplicative noise<br />

model<br />

■ a random fuzz factor δ j is drawn for each establishment j<br />

p (δ j ) =<br />

F (δ j ) =<br />

⎧<br />

⎪⎨<br />

⎪⎩<br />

⎧<br />

⎪⎨<br />

⎪⎩<br />

(b − δ)/(b − a) 2 , δ ∈ [a, b]<br />

(b + δ − 2)/(b − a) 2 , δ ∈ [2 − b, 2 − a]<br />

0, otherwise<br />

[<br />

0, δ < 2 − b<br />

(δ + b − 2) 2 / 2 (b − a) 2] , δ ∈ [2 − b, 2 − a]<br />

[<br />

0.5, δ ∈ (2 − a, a)<br />

0.5 + (b − a) 2 − (b − δ) 2] , δ ∈ [a, b]<br />

1, δ > b<br />

where a = 1 + c/100 and b = 1 + d/100 are constants chosen<br />

such that the true value is distorted by a minimum of c<br />

percent and a maximum of d percent.<br />

August 10, 2005 - p. 9/31


Distribution of Fuzz Factors<br />

0<br />

0 2−b 2−a 1 a b 2<br />

August 10, 2005 - p. 10/31


Distorting magnitudes and counts<br />

The exact implementation depends on the type of estimate:<br />

■ Magnitudes and counts<br />

X ∗ jt = δ j X jt ,<br />

where X jt is an establishment level statistic among B, E, M,<br />

F , A, S, H, R, FA, FS, FH, W k , WFH, NA, NH, NR, NS<br />

August 10, 2005 - p. 11/31


Distorting ratios<br />

Ratios are distorted by distorting numerators (magnitudes), but<br />

using undistorted denominators:<br />

ZY ∗<br />

jt =<br />

Y ∗<br />

jt<br />

B(Y ) jt<br />

= δ j<br />

Y jt<br />

B(Y ) jt<br />

,<br />

This method is used for<br />

■ average earnings (ZW k ) and<br />

■ average periods of nonemployment (ZN_) for various groups<br />

August 10, 2005 - p. 12/31


Distorting flows<br />

■ Distorted net job flow (JF ) is computed at the aggregate (k =<br />

geography, industry, or combination of the two for the<br />

appropriate age and sex categories) level as the product of<br />

the aggregated, undistorted rate of growth and the<br />

aggregated distorted employment:<br />

JF ∗ kt = G kt × Ē∗ kt = JF kt × Ē∗ kt<br />

Ē kt<br />

.<br />

■ The formulas for distorting gross job creation (JC) and job<br />

destruction (JD) are similar.<br />

■ Same logic is used to distort wage changes for subgroups<br />

August 10, 2005 - p. 13/31


Item suppression<br />

■ Some disclosure risk remains for counts based on very few<br />

entities in a cell (fewer than three individuals or employers)<br />

■ Variables affected are: B, E, M, F , A, S, H, R, FA, FH,<br />

FS, JC, JD, JF , FJC, FJD, FJF .<br />

➜ item suppression based on the number of either workers or<br />

the number of employers that contribute data for that item in<br />

a cell k in time period t, where a cell represents a particular<br />

combination of geography × industry × age × sex.<br />

■ Because of noise infusion, no complementary suppressions<br />

are needed<br />

■ Some denominators may be zeroes - the ratio or rate cannot<br />

be computed.<br />

August 10, 2005 - p. 14/31


Economic concepts in QCEW and QWI<br />

Difference in the economic concepts underlying the Quarterly<br />

Census of Employment and Wages (QCEW) and the QWI<br />

statistics<br />

■ QCEW: employment on the 12th day of the first month in the<br />

quarter (QCEW 1,jt )<br />

■ QWI: several measures of employment, derived from reports<br />

of quarterly employment and wages of individual workers at<br />

particular employers (state UI accounts).<br />

■ Key definition: Beginning of quarter employment B jt<br />

employees at establishment j in both quarter t and t − 1, and<br />

by inference, on the 1st day of quarter t.<br />

August 10, 2005 - p. 15/31


Protection by Weighting<br />

■ QCEW 1,jt and B jt are not identical because<br />

✦ they do not refer to exactly the same point in time,<br />

✦ the in-scope establishments differ slightly, and<br />

✦ they are computed from different universe data.<br />

■ Actual differences captured by the QWI weighting scheme:<br />

time-series of adjustment weights are defined by<br />

∑<br />

w t b jt = ∑ QCEW 1,jt<br />

j<br />

■ All variables are weighted by w t<br />

j<br />

August 10, 2005 - p. 16/31


Protection by Imputation<br />

■ no actual confidential micro-data measured at the<br />

establishment level in QWI<br />

■ workplace characteristics ( geography, industry) are<br />

multiply-imputed for multi-unit employers<br />

➜ these establishments are protected by a form of synthetic<br />

data.<br />

August 10, 2005 - p. 17/31


Protection<br />

Table 1: Small Cells: B, Raw vs. Weighted<br />

(a) Illinois<br />

Weighted count<br />

Unweighted<br />

5 or<br />

count 0 1 2 3 4 more<br />

0 99.33 0.66 0.00 0.00 0.00 0.00<br />

1 0.10 96.76 3.13 0.00 0.00 0.00<br />

2 0.01 2.00 84.68 13.26 0.04 0.01<br />

3 0.01 0.01 3.42 75.72 20.26 0.59<br />

4 0.00 0.00 0.01 4.49 67.62 27.87<br />

5 or more 0.00 0.00 0.00 0.01 0.59 99.39<br />

Total number of cells: 14,229,968 . For details, see text.<br />

August 10, 2005 - p. 18/31


Protection<br />

Table 2: Small Cells: B, Undistorted vs. Distorted<br />

(a) Illinois<br />

Distorted count<br />

Undistorted<br />

5 or<br />

count 0 1 2 3 4 more<br />

0 99.86 0.14 0.00 0.00 0.00 0.00<br />

1 0.91 95.75 3.34 0.00 0.00 0.00<br />

2 0.00 4.27 87.25 8.47 0.00 0.00<br />

3 0.00 0.00 10.69 77.20 12.11 0.00<br />

4 0.00 0.00 0.00 14.73 67.49 17.78<br />

5 or more 0.00 0.00 0.00 0.00 1.93 98.07<br />

Total number of cells: 14,229,968 . Both comparisons are for weighted<br />

data. For details, see text.<br />

August 10, 2005 - p. 19/31


Protection<br />

Table 3: Small Cells: B, Raw vs. Published<br />

(a) Illinois<br />

Published count<br />

Unweighted<br />

5 or<br />

count Suppressed 0 1 2 3 4 more<br />

0 0.79 99.21 0.00 0.00 0.00 0.00 0.00<br />

1 99.91 0.08 0.00 0.00 0.00 0.00 0.00<br />

2 94.02 0.01 0.00 0.00 5.87 0.09 0.01<br />

3 34.33 0.00 0.00 0.00 47.75 16.98 0.94<br />

4 25.87 0.00 0.00 0.00 5.56 43.24 25.32<br />

5 or more 15.20 0.00 0.00 0.00 0.03 0.82 83.95<br />

Total number of cells: 14,229,968 . Raw is unweighted and undistorted.<br />

Published is after weighting, distorting, and suppression. For details,<br />

see text.<br />

August 10, 2005 - p. 20/31


Analytic validity<br />

Analytic validity is obtained when<br />

■ the data display no bias and<br />

■ the additional dispersion can be quantified so<br />

that statistical inferences can be adjusted to<br />

accomodate it.<br />

Evidence on<br />

■ time-series properties of the distorted and published data,<br />

■ cross-sectional unbiasedness of the published data<br />

Unit of analysis:<br />

■ interior sub-state geography × industry × age × sex cell kt.<br />

■ Sub-state geography = county<br />

■ Industry classification = SIC (Division, 2- and 3-digit)<br />

August 10, 2005 - p. 21/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data<br />

∆r = r − r ∗<br />

Percentile B A S F JF<br />

01 -0.069373 -0.049274 -0.052155 -0.066461 -0.007969<br />

05 -0.041585 -0.031460 -0.032934 -0.039787 -0.004651<br />

10 -0.028849 -0.022166 -0.023733 -0.027926 -0.002785<br />

25 -0.011920 -0.009996 -0.010161 -0.011913 -0.001003<br />

50 0.000571 0.000384 0.000768 0.000306 -0.000044<br />

75 0.013974 0.011806 0.012891 0.012632 0.000776<br />

90 0.030948 0.025152 0.026290 0.028299 0.002263<br />

95 0.044233 0.033871 0.037198 0.040565 0.004375<br />

99 0.078519 0.054415 0.060327 0.074212 0.007845<br />

SIC-division × County, State of Illinois. r from AR(1) estimated for each<br />

cell’s time series.<br />

August 10, 2005 - p. 22/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, F<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(f)<br />

August 10, 2005 - p. 23/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, JF<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(jf)<br />

August 10, 2005 - p. 24/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, B<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 25/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Published<br />

Data, SIC Division, B<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 26/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Published<br />

Data, SIC2 B<br />

Distribution of Delta r<br />

IL, SIC2<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 27/31


Cross-sectional unbiasedness: B<br />

Fraction<br />

0 .05 .1 .15<br />

−d −c 0 +c +d<br />

percent<br />

August 10, 2005 - p. 28/31


Cross-sectional unbiasedness: W 1<br />

Fraction<br />

0 .05 .1 .15<br />

−d −c 0 +c +d<br />

percent<br />

August 10, 2005 - p. 29/31


Conclusion<br />

■ First large-scale implementation of confidentiality protection<br />

by noise-infusion<br />

✦ absence of table-level or complementary suppressions,<br />

despite significant number of count item suppressions<br />

✦ All ratios and magnitudes are released without<br />

suppression<br />

■ High-quality data:<br />

✦ remarkable consistency in the serial correlation<br />

coefficients between the undistorted and the distorted<br />

data series at highly detailed levels<br />

✦ little or no bias in induced on average by the confidentiality<br />

protection mechanism<br />

✦ distributions of bias are tightly centered around the modal<br />

bias of zero<br />

✦ properties of the raw data distortion can be used to<br />

correct inferences from statistical models<br />

August 10, 2005 - p. 30/31


Download of the <strong>presentation</strong> and paper<br />

This <strong>presentation</strong> and the accompanying paper can be<br />

downloaded from the <strong>Cornell</strong> <strong>VirtualRDC</strong> website<br />

http://lservices.ciser.cornell.edu/news/index.phpitemid=40<br />

and soon on the U.S. Census LEHD website<br />

http://lehd.dsd.census.gov<br />

August 10, 2005 - p. 31/31

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!