

Then there exist universal constants $C_1, C_2 \in (0, \infty)$ (not depending on $r$, $n_r$, $M_r$ and $N_r$) such that

$$
\left| P\!\left( \frac{X_r - n_r p_r}{\sigma_r} \le x \right) - \Phi(x) \right|
\;\le\; \frac{C_1}{(1 + |x|)\, \sigma_r\, \kappa_r^2(x)}
\exp\!\bigl( -C_2\, x^2 \kappa_r^2(x) \bigr)
\tag{2.5}
$$

for all $x \in \mathbb{R}$, where $\kappa_r(x) = q_r\, I(x \ge 0) + p_r\, I(x < 0)$, with $p_r = M_r/N_r$ and $q_r = 1 - p_r$.

Theorem 1 is a non-uniform Berry–Esseen theorem for the Hypergeometric distribution. It shows that the error of Normal approximation to the Hypergeometric distribution decays at a sub-Gaussian rate in the tails. The only condition needed for the validity of this bound is (2.4). It is easy to check that

$$
[\ldots]_r \in \left[ \tfrac{1}{25},\ \tfrac{1}{20} \right]
\tag{2.6}
$$

for all $r$ satisfying (2.4). Hence, the bound in (2.5) is available for all $r$ such that $[\ldots]_r \ge 25$.
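As a quick finite-sample illustration of the quantity bounded in (2.5), here is a minimal numerical sketch (assuming the standardization $\sigma^2 = Nf(1-f)p(1-p)$ of (3.1); the function name and the parameter values are illustrative and not taken from the paper) that evaluates the Normal-approximation error of the standardized Hypergeometric cdf with SciPy:

```python
# A minimal sketch (not from the paper): evaluate |P((X - n p)/sigma <= x) - Phi(x)|
# for X ~ Hyp(n, M, N), with sigma^2 = N f (1 - f) p (1 - p) as in (3.1).
import numpy as np
from scipy.stats import hypergeom, norm

def normal_approx_error(N, M, n, xs):
    """Pointwise Normal-approximation error of the standardized Hypergeometric cdf."""
    p, f = M / N, n / N
    sigma = np.sqrt(N * f * (1 - f) * p * (1 - p))
    # SciPy's parameter order is hypergeom(population size, #successes, #draws).
    X = hypergeom(N, M, n)
    return np.abs(X.cdf(n * p + sigma * xs) - norm.cdf(xs))

xs = np.linspace(-4.0, 4.0, 81)
err = normal_approx_error(N=200, M=100, n=40, xs=xs)    # p = 0.5, f = 0.2 (illustrative)
print(f"maximum error on the grid: {err.max():.4f}")
```

Consistent with the non-uniform form of (2.5), the largest errors occur near the centre of the distribution, while the error decays rapidly in the tails.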

As pointed out in Section 1, when both the sequences $\{p_r\}_{r \ge 1}$ and $\{f_r\}_{r \ge 1}$ are bounded away from 0 and 1, the rate of approximation in Theorem 1 matches the standard rate $O(n_r^{-1/2})$ of Normal approximation for the sum of $n_r$ iid random variables with a finite third moment. Although the Hypergeometric random variable $X_r$ can be written as a sum of $n_r$ dependent Bernoulli($p_r$) variables, the lack of independence of the summands does not affect the rate of Normal approximation as long as the sequence $\{p_r\}_{r \ge 1}$ is bounded away from 0 and 1 and $\{f_r\}_{r \ge 1}$ is bounded away from 1. On the other hand, if either of the sequences $\{p_r\}_{r \ge 1}$ and $\{f_r\}_{r \ge 1}$ converges to one of the extreme values 0 and 1, then

$$
\sigma_r = o\bigl(n_r^{1/2}\bigr) \quad \text{as } r \to \infty,
$$

and the rate of Normal approximation to the Hypergeometric distribution is indeed worse than the standard rate $O(n_r^{-1/2})$ in such nonstandard cases.
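A one-line verification of this claim, assuming Theorem 1 uses the same standardization $\sigma_r^2 = N_r f_r(1 - f_r) p_r q_r$ as (3.1):

$$
\frac{\sigma_r}{n_r^{1/2}}
= \sqrt{\frac{N_r f_r (1 - f_r)\, p_r q_r}{n_r}}
= \sqrt{(1 - f_r)\, p_r q_r}
\;\longrightarrow\; 0
$$

whenever $p_r \to 0$, $p_r \to 1$ or $f_r \to 1$, which is precisely the statement $\sigma_r = o(n_r^{1/2})$.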

An immediate consequence of Theorem 1 is the following exponential (sub-Gaussian) probability bound on the tails of $X_r$.

Theorem 2. Suppose that $X_r \sim \mathrm{Hyp}(n_r, M_r, N_r)$, $r \in \mathbb{N}$. Then, there exist universal constants $C_3, C_4 \in (0, \infty)$ (not depending on $r$, $n_r$, $M_r$, $N_r$) such that for all $r$ satisfying (2.4),

$$
P\!\left( \left| \frac{X_r - n_r p_r}{\sigma_r} \right| \ge x \right)
\;\le\; \frac{C_3}{(p_r \wedge q_r)^3}
\exp\!\bigl( -C_4\, x^2 (p_r \wedge q_r)^2 \bigr)
\quad \text{for all } x > 0.
$$
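The shape of this bound is easy to probe numerically. Below is a minimal sketch (the universal constants $C_3$, $C_4$ are not made explicit in the theorem, so only the exponential factor $\exp(-x^2 (p_r \wedge q_r)^2)$ is compared against the exact two-sided tail; the parameter values are illustrative and not from the paper):

```python
# A minimal sketch (not from the paper): compare the exact tail P(|X - n p|/sigma >= x)
# with the sub-Gaussian factor exp(-x^2 min(p, q)^2) appearing in Theorem 2.
# The universal constants C3, C4 are not specified, so only the decay shape is shown.
import numpy as np
from scipy.stats import hypergeom

N, M, n = 2000, 400, 500                  # population, #successes, #draws (illustrative)
p, f = M / N, n / N
q = 1 - p
sigma = np.sqrt(N * f * (1 - f) * p * q)
X = hypergeom(N, M, n)                    # SciPy order: (population, #successes, #draws)

for x in (1.0, 2.0, 3.0, 4.0):
    tail = X.cdf(n * p - sigma * x) + X.sf(n * p + sigma * x)   # two-sided tail probability
    factor = np.exp(-x**2 * min(p, q)**2)
    print(f"x = {x:.0f}:  tail = {tail:.2e},  exp(-x^2 min(p,q)^2) = {factor:.2e}")
```

For configurations with $p_r \wedge q_r$ well away from 0, the exact tail sits far below the exponential factor; the content of the theorem is that a bound of this form holds with universal constants, including the nonstandard cases where $p_r \wedge q_r$ is small.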

3. Numerical results<br />

To gain some insight into the quality of Normal approximation to the Hypergeometric distribution in finite samples and to compare it with the accuracy in the case of the Binomial distribution, first we consider some joint plots of the cdfs of normalized Hypergeometric and Binomial random variables against the standard Normal cdf. Figs. 1–5 show these plots for different values of the parameters $n$ and $p$ for $N = 60, 200$.

From the figures, it follows that the quality of Normal approximation to the Hypergeometric distribution is comparable to that for the Binomial distribution for values of $f$ and $p$ close to 0.5, but there is a stark loss of accuracy for high values of $f$ and $p$.
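This comparison is straightforward to reproduce. Here is a minimal sketch (the configuration $N = 60$, $p = f = 0.9$ is one of the combinations discussed in the text; the code is illustrative and not the code behind Figs. 1–5) that overlays the cdfs of the normalized Hypergeometric and Binomial variables on the standard Normal cdf:

```python
# A minimal sketch (not from the paper): overlay the cdfs of the normalized
# Hypergeometric and Binomial variables against the standard Normal cdf,
# in the spirit of Figs. 1-5.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import hypergeom, binom, norm

N, p, f = 60, 0.9, 0.9                               # one illustrative configuration
M, n = round(N * p), round(N * f)
sigma_hyp = np.sqrt(N * f * (1 - f) * p * (1 - p))   # standardization from (3.1)
sigma_bin = np.sqrt(n * p * (1 - p))

k = np.arange(0, n + 1)
plt.step((k - n * p) / sigma_hyp, hypergeom.cdf(k, N, M, n), where="post",
         label="Hypergeometric")
plt.step((k - n * p) / sigma_bin, binom.cdf(k, n, p), where="post",
         label="Binomial")
xs = np.linspace(-4, 4, 400)
plt.plot(xs, norm.cdf(xs), "k--", label="N(0, 1)")
plt.xlim(-4, 4)
plt.legend()
plt.show()
```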

Next, to get a quantitative picture of the error of Normal approximation, we conducted a moderately large numerical study with different values of the population size $N$ and of the parameters $p$ and $f$. The population sizes considered were $N = 60, 200, 500, 2000$. For a given value of $N$, the set of values of $p$ and $f$ considered was $\{0.5, 0.6, 0.7, 0.8, 0.9\}$. We considered the Kolmogorov distance, i.e., the maximal distance between the cdfs of the normalized Hypergeometric variable and a standard Normal variable, as a measure of accuracy. More specifically, the measure of accuracy for the Hypergeometric case is defined as

$$
\Delta(N, p, f) = \sup_{x \in \mathbb{R}} \left| P\!\left( \frac{X - np}{\sigma} \le x \right) - \Phi(x) \right|,
\tag{3.1}
$$

where $X \sim \mathrm{Hyp}(n, M, N)$, $f = n/N$, $p = M/N$, $\sigma^2 = Nf(1 - f)p(1 - p)$ and $\Phi(\cdot)$ denotes the cdf of the $N(0, 1)$ distribution.
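Since the supremum in (3.1) is attained at a jump point of the discrete cdf, $\Delta(N, p, f)$ can be computed exactly. Here is a minimal sketch over the grid of $N$, $p$ and $f$ values described above (the helper name kolmogorov_distance and the rounding of $Np$ and $Nf$ to integers are illustrative choices, not taken from the paper):

```python
# A minimal sketch (not from the paper): compute the Kolmogorov distance in (3.1)
# for X ~ Hyp(n, M, N) with M = round(N p) and n = round(N f), over the study's grid.
import numpy as np
from scipy.stats import hypergeom, norm

def kolmogorov_distance(N, p, f):
    """sup_x |P((X - n p)/sigma <= x) - Phi(x)| with sigma^2 = N f (1-f) p (1-p)."""
    M, n = round(N * p), round(N * f)
    p, f = M / N, n / N                      # realized p and f after rounding
    sigma = np.sqrt(N * f * (1 - f) * p * (1 - p))
    k = np.arange(0, n + 1)
    x = (k - n * p) / sigma
    cdf = hypergeom.cdf(k, N, M, n)          # SciPy order: (k, population, #successes, #draws)
    # The supremum is attained at a jump of the discrete cdf; check both one-sided limits.
    return max(np.max(np.abs(cdf - norm.cdf(x))),
               np.max(np.abs(hypergeom.cdf(k - 1, N, M, n) - norm.cdf(x))))

for N in (60, 200, 500, 2000):
    for p in (0.5, 0.6, 0.7, 0.8, 0.9):
        for f in (0.5, 0.6, 0.7, 0.8, 0.9):
            print(f"N={N:4d}  p={p:.1f}  f={f:.1f}  Delta={kolmogorov_distance(N, p, f):.4f}")
```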
