27.12.2014 Views

presentation - VirtualRDC @ Cornell - Cornell University

presentation - VirtualRDC @ Cornell - Cornell University

presentation - VirtualRDC @ Cornell - Cornell University

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

GoBack


Confidentiality Protection in the Census<br />

Bureau’s Quarterly Workforce Indicators<br />

John M. Abowd 1,2 , Bryce E. Stephens 2,3 , and Lars Vilhuber 1<br />

1 <strong>Cornell</strong> <strong>University</strong>, 2 U.S. Census Bureau, 3 <strong>University</strong> of Maryland<br />

August 10, 2005 - p. 1/31


Introduction<br />

■ We will describe the confidentiality protection mechanism as<br />

applied to a new statistical product: the Quarterly Workforce<br />

Indicators (QWI).<br />

■ The underlying data infrastructure was designed by the<br />

Longitudinal Employer-Household Dynamics Program at the<br />

Census Bureau and is described in detail in Abowd et al<br />

(2005).<br />

■ From a longitudinally integrated frame of state<br />

unemployment insurance wage records, we create measures<br />

of employment, wages, hiring, separation, job creation and<br />

job destruction.<br />

August 10, 2005 - p. 2/31


Core problem<br />

■ Disclosure proofing is required to protect the information<br />

about individuals and businesses that contribute to<br />

✦ (confidential) unemployment insurance (UI) wage records<br />

✦ (confidential) Quarterly Census of Employment and<br />

Wages (QCEW, also known as ES-202) reports<br />

✦ as well as information from Census Bureau demographic<br />

data that have been integrated with these sources.<br />

■ Primary concern of the confidentiality protection mechanism<br />

is thus with small cells, i.e., cells that reflect data on few<br />

individuals or few firms.<br />

August 10, 2005 - p. 3/31


Protection provided<br />

■ In general, data are considered protected when<br />

aggregate cell values do not closely approximate data for any one<br />

respondent in the cell (Cox and Zayatz, 1993, pg. 5)<br />

■ In the QWI confidentiality protection scheme, confidential<br />

micro-data are considered protected by noise infusion if<br />

1. any inference regarding the magnitude of a particular<br />

respondent’s data differs from the confidential quantity by<br />

at least c% even if that inference is made by a coalition of<br />

respondents with exact knowledge of their own answers<br />

or<br />

2. any inference regarding the magnitude of an item is<br />

incorrect with probability no less than y%, where c and y<br />

are confidential but generally large<br />

August 10, 2005 - p. 4/31


Quality of the disclosed data<br />

■ The confidentiality-protected data must be inference-valid for<br />

a well-defined set of analyses<br />

■ We show that<br />

✦ the theoretical properties of the disclosure-proofing<br />

mechanism are designed to maintain analytical validity for<br />

trend analysis;<br />

✦ in practice, the disclosure-proofed data are not biased;<br />

✦ in practice, the time-series properties of the<br />

disclosure-proofed data remain intact.<br />

August 10, 2005 - p. 5/31


Three-layer confidentiality protection in<br />

QWI (I)<br />

■ Layer 1: Multiplicative noise-infusion at the establishment<br />

level, with three very important properties<br />

1. every establishment-level data item is distorted by some<br />

minimum amount<br />

2. distortion amount and direction are time-invariant: data<br />

are always distorted in the same direction (increased or<br />

decreased) by the same percentage amount in every<br />

period.<br />

3. when estimates are aggregated, the effects of the<br />

distortion cancel out for the vast majority of the estimates<br />

August 10, 2005 - p. 6/31


Three-layer confidentiality protection in<br />

QWI (II)<br />

■ Layer 2: Weighting of estimates at higher levels ( e.g.,<br />

sub-state geography and industry detail)<br />

✦ construct weights such that state-level beginning of<br />

quarter employment for all private employers matches the<br />

first month in quarter employment in QCEW.<br />

✦ the establishment-level weight is used for every indicator<br />

in the QWIs<br />

August 10, 2005 - p. 7/31


Three-layer confidentiality protection in<br />

QWI (II)<br />

■ Layer 3: Small-cell editing (Suppression or synthesizing)<br />

✦ Some aggregate estimates are based on fewer than three<br />

persons or establishments.<br />

✦ Currently, these estimates are suppressed and a flag set<br />

to indicate suppression.<br />

✦ In next version of disclosure-proofing system, these<br />

estimates are replaced with synthetic values.<br />

Note: Editing is only used when the combination of noise<br />

infusion and weighting may not distort the publication data<br />

with a high enough probability to meet the criteria layed out<br />

above.<br />

✦ Count data such as employment are subject to editing.<br />

✦ Continuous dollar measures like payroll are not.<br />

✦ Regardless of small-cell editing, all published estimates<br />

are still substantially influenced by the noise that was<br />

infused in the first layer of the protection system.<br />

August 10, 2005 - p. 8/31


Implementation of multiplicative noise<br />

model<br />

■ a random fuzz factor δ j is drawn for each establishment j<br />

p (δ j ) =<br />

F (δ j ) =<br />

⎧<br />

⎪⎨<br />

⎪⎩<br />

⎧<br />

⎪⎨<br />

⎪⎩<br />

(b − δ)/(b − a) 2 , δ ∈ [a, b]<br />

(b + δ − 2)/(b − a) 2 , δ ∈ [2 − b, 2 − a]<br />

0, otherwise<br />

[<br />

0, δ < 2 − b<br />

(δ + b − 2) 2 / 2 (b − a) 2] , δ ∈ [2 − b, 2 − a]<br />

[<br />

0.5, δ ∈ (2 − a, a)<br />

0.5 + (b − a) 2 − (b − δ) 2] , δ ∈ [a, b]<br />

1, δ > b<br />

where a = 1 + c/100 and b = 1 + d/100 are constants chosen<br />

such that the true value is distorted by a minimum of c<br />

percent and a maximum of d percent.<br />

August 10, 2005 - p. 9/31


Distribution of Fuzz Factors<br />

0<br />

0 2−b 2−a 1 a b 2<br />

August 10, 2005 - p. 10/31


Distorting magnitudes and counts<br />

The exact implementation depends on the type of estimate:<br />

■ Magnitudes and counts<br />

X ∗ jt = δ j X jt ,<br />

where X jt is an establishment level statistic among B, E, M,<br />

F , A, S, H, R, FA, FS, FH, W k , WFH, NA, NH, NR, NS<br />

August 10, 2005 - p. 11/31


Distorting ratios<br />

Ratios are distorted by distorting numerators (magnitudes), but<br />

using undistorted denominators:<br />

ZY ∗<br />

jt =<br />

Y ∗<br />

jt<br />

B(Y ) jt<br />

= δ j<br />

Y jt<br />

B(Y ) jt<br />

,<br />

This method is used for<br />

■ average earnings (ZW k ) and<br />

■ average periods of nonemployment (ZN_) for various groups<br />

August 10, 2005 - p. 12/31


Distorting flows<br />

■ Distorted net job flow (JF ) is computed at the aggregate (k =<br />

geography, industry, or combination of the two for the<br />

appropriate age and sex categories) level as the product of<br />

the aggregated, undistorted rate of growth and the<br />

aggregated distorted employment:<br />

JF ∗ kt = G kt × Ē∗ kt = JF kt × Ē∗ kt<br />

Ē kt<br />

.<br />

■ The formulas for distorting gross job creation (JC) and job<br />

destruction (JD) are similar.<br />

■ Same logic is used to distort wage changes for subgroups<br />

August 10, 2005 - p. 13/31


Item suppression<br />

■ Some disclosure risk remains for counts based on very few<br />

entities in a cell (fewer than three individuals or employers)<br />

■ Variables affected are: B, E, M, F , A, S, H, R, FA, FH,<br />

FS, JC, JD, JF , FJC, FJD, FJF .<br />

➜ item suppression based on the number of either workers or<br />

the number of employers that contribute data for that item in<br />

a cell k in time period t, where a cell represents a particular<br />

combination of geography × industry × age × sex.<br />

■ Because of noise infusion, no complementary suppressions<br />

are needed<br />

■ Some denominators may be zeroes - the ratio or rate cannot<br />

be computed.<br />

August 10, 2005 - p. 14/31


Economic concepts in QCEW and QWI<br />

Difference in the economic concepts underlying the Quarterly<br />

Census of Employment and Wages (QCEW) and the QWI<br />

statistics<br />

■ QCEW: employment on the 12th day of the first month in the<br />

quarter (QCEW 1,jt )<br />

■ QWI: several measures of employment, derived from reports<br />

of quarterly employment and wages of individual workers at<br />

particular employers (state UI accounts).<br />

■ Key definition: Beginning of quarter employment B jt<br />

employees at establishment j in both quarter t and t − 1, and<br />

by inference, on the 1st day of quarter t.<br />

August 10, 2005 - p. 15/31


Protection by Weighting<br />

■ QCEW 1,jt and B jt are not identical because<br />

✦ they do not refer to exactly the same point in time,<br />

✦ the in-scope establishments differ slightly, and<br />

✦ they are computed from different universe data.<br />

■ Actual differences captured by the QWI weighting scheme:<br />

time-series of adjustment weights are defined by<br />

∑<br />

w t b jt = ∑ QCEW 1,jt<br />

j<br />

■ All variables are weighted by w t<br />

j<br />

August 10, 2005 - p. 16/31


Protection by Imputation<br />

■ no actual confidential micro-data measured at the<br />

establishment level in QWI<br />

■ workplace characteristics ( geography, industry) are<br />

multiply-imputed for multi-unit employers<br />

➜ these establishments are protected by a form of synthetic<br />

data.<br />

August 10, 2005 - p. 17/31


Protection<br />

Table 1: Small Cells: B, Raw vs. Weighted<br />

(a) Illinois<br />

Weighted count<br />

Unweighted<br />

5 or<br />

count 0 1 2 3 4 more<br />

0 99.33 0.66 0.00 0.00 0.00 0.00<br />

1 0.10 96.76 3.13 0.00 0.00 0.00<br />

2 0.01 2.00 84.68 13.26 0.04 0.01<br />

3 0.01 0.01 3.42 75.72 20.26 0.59<br />

4 0.00 0.00 0.01 4.49 67.62 27.87<br />

5 or more 0.00 0.00 0.00 0.01 0.59 99.39<br />

Total number of cells: 14,229,968 . For details, see text.<br />

August 10, 2005 - p. 18/31


Protection<br />

Table 2: Small Cells: B, Undistorted vs. Distorted<br />

(a) Illinois<br />

Distorted count<br />

Undistorted<br />

5 or<br />

count 0 1 2 3 4 more<br />

0 99.86 0.14 0.00 0.00 0.00 0.00<br />

1 0.91 95.75 3.34 0.00 0.00 0.00<br />

2 0.00 4.27 87.25 8.47 0.00 0.00<br />

3 0.00 0.00 10.69 77.20 12.11 0.00<br />

4 0.00 0.00 0.00 14.73 67.49 17.78<br />

5 or more 0.00 0.00 0.00 0.00 1.93 98.07<br />

Total number of cells: 14,229,968 . Both comparisons are for weighted<br />

data. For details, see text.<br />

August 10, 2005 - p. 19/31


Protection<br />

Table 3: Small Cells: B, Raw vs. Published<br />

(a) Illinois<br />

Published count<br />

Unweighted<br />

5 or<br />

count Suppressed 0 1 2 3 4 more<br />

0 0.79 99.21 0.00 0.00 0.00 0.00 0.00<br />

1 99.91 0.08 0.00 0.00 0.00 0.00 0.00<br />

2 94.02 0.01 0.00 0.00 5.87 0.09 0.01<br />

3 34.33 0.00 0.00 0.00 47.75 16.98 0.94<br />

4 25.87 0.00 0.00 0.00 5.56 43.24 25.32<br />

5 or more 15.20 0.00 0.00 0.00 0.03 0.82 83.95<br />

Total number of cells: 14,229,968 . Raw is unweighted and undistorted.<br />

Published is after weighting, distorting, and suppression. For details,<br />

see text.<br />

August 10, 2005 - p. 20/31


Analytic validity<br />

Analytic validity is obtained when<br />

■ the data display no bias and<br />

■ the additional dispersion can be quantified so<br />

that statistical inferences can be adjusted to<br />

accomodate it.<br />

Evidence on<br />

■ time-series properties of the distorted and published data,<br />

■ cross-sectional unbiasedness of the published data<br />

Unit of analysis:<br />

■ interior sub-state geography × industry × age × sex cell kt.<br />

■ Sub-state geography = county<br />

■ Industry classification = SIC (Division, 2- and 3-digit)<br />

August 10, 2005 - p. 21/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data<br />

∆r = r − r ∗<br />

Percentile B A S F JF<br />

01 -0.069373 -0.049274 -0.052155 -0.066461 -0.007969<br />

05 -0.041585 -0.031460 -0.032934 -0.039787 -0.004651<br />

10 -0.028849 -0.022166 -0.023733 -0.027926 -0.002785<br />

25 -0.011920 -0.009996 -0.010161 -0.011913 -0.001003<br />

50 0.000571 0.000384 0.000768 0.000306 -0.000044<br />

75 0.013974 0.011806 0.012891 0.012632 0.000776<br />

90 0.030948 0.025152 0.026290 0.028299 0.002263<br />

95 0.044233 0.033871 0.037198 0.040565 0.004375<br />

99 0.078519 0.054415 0.060327 0.074212 0.007845<br />

SIC-division × County, State of Illinois. r from AR(1) estimated for each<br />

cell’s time series.<br />

August 10, 2005 - p. 22/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, F<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(f)<br />

August 10, 2005 - p. 23/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, JF<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(jf)<br />

August 10, 2005 - p. 24/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Distorted Data,<br />

SIC Division, B<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 25/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Published<br />

Data, SIC Division, B<br />

Distribution of Delta r<br />

IL, Division<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 26/31


Distribution of the Error in the First Order<br />

Serial Correlation, Raw vs. Published<br />

Data, SIC2 B<br />

Distribution of Delta r<br />

IL, SIC2<br />

25 50 75<br />

−.1<br />

−.05<br />

−.025<br />

.025 .05<br />

0 .1<br />

Delta r(b)<br />

August 10, 2005 - p. 27/31


Cross-sectional unbiasedness: B<br />

Fraction<br />

0 .05 .1 .15<br />

−d −c 0 +c +d<br />

percent<br />

August 10, 2005 - p. 28/31


Cross-sectional unbiasedness: W 1<br />

Fraction<br />

0 .05 .1 .15<br />

−d −c 0 +c +d<br />

percent<br />

August 10, 2005 - p. 29/31


Conclusion<br />

■ First large-scale implementation of confidentiality protection<br />

by noise-infusion<br />

✦ absence of table-level or complementary suppressions,<br />

despite significant number of count item suppressions<br />

✦ All ratios and magnitudes are released without<br />

suppression<br />

■ High-quality data:<br />

✦ remarkable consistency in the serial correlation<br />

coefficients between the undistorted and the distorted<br />

data series at highly detailed levels<br />

✦ little or no bias in induced on average by the confidentiality<br />

protection mechanism<br />

✦ distributions of bias are tightly centered around the modal<br />

bias of zero<br />

✦ properties of the raw data distortion can be used to<br />

correct inferences from statistical models<br />

August 10, 2005 - p. 30/31


Download of the <strong>presentation</strong> and paper<br />

This <strong>presentation</strong> and the accompanying paper can be<br />

downloaded from the <strong>Cornell</strong> <strong>VirtualRDC</strong> website<br />

http://lservices.ciser.cornell.edu/news/index.phpitemid=40<br />

and soon on the U.S. Census LEHD website<br />

http://lehd.dsd.census.gov<br />

August 10, 2005 - p. 31/31

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!