Quantitative Data Analysis
Prof. Treiman
Exercise 4: Illustrative Answer

A. Direct Standardization via Stata

Here is the table called for in the assignment.
                                            Education
Premarital sex is...           < H.S. grad   H.S. grad   Some coll.   College +
-------------------------------------------------------------------------------
Observed percentages
  Always wrong                     37.4         32.2        26.5         19.7
  Almost always wrong              12.1          7.5         8.8         10.0
  Sometimes wrong                  10.1         16.4        14.7         20.5
  Not wrong at all                 40.4         43.9        50.0         49.8
  Total                           100.0        100.0       100.0        100.0
  N                                (99)        (214)       (238)        (239)
Percentages adjusted for age
  Always wrong                     25.8         31.4        26.4         19.5
  Almost always wrong              15.8          6.9        10.2         10.4
  Sometimes wrong                  10.3         16.2        15.3         20.8
  Not wrong at all                 48.1         45.5        48.1         49.3
The top panel of the table shows a modest association between education and attitudes regarding premarital sex, with the better educated less likely to think that premarital sex is wrong. About 37 per cent of those lacking a high school education think that premarital sex is always wrong, compared to less than 20 per cent of those with college degrees. At the other extreme, only about 40 per cent of those with less than high school think that premarital sex is not wrong at all, compared to about half of those with at least some college.

We might suspect that these results simply reflect the tendency of older people both to be more poorly educated and to have less liberal attitudes regarding premarital sex than younger people. Indeed, both of these associations hold for late 20th century Americans. [I have estimated but have not presented the tables, but you can easily make them. In a full analysis, you would want to show these tables.] But standardizing for age reveals that the association between education and acceptance of premarital sex does not arise simply from their joint dependence on age. Indeed, standardizing for age has virtually no impact on the coefficients in the table, except for those with less than a high school education. The disapproval of premarital sex by those with less than a high school education is apparently due in part to the fact that the poorly educated are disproportionately elderly, since there is a noticeable shift toward greater acceptance of premarital sex in this group after standardizing for age.

The -do- file that created these computations is shown in the Appendix.
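The arithmetic of direct standardization is simple enough to verify outside Stata: a group's adjusted rate is the sum, over age strata, of the standard population's share of each stratum times the group's stratum-specific rate. The sketch below (Python, illustrative only) reproduces the age-adjusted "always wrong" proportion for the lowest education group from the stratum-specific figures reported in the Appendix.

```python
# Direct standardization by hand. Numbers are copied from the dstdize
# panel for edcat=1 ("always wrong") in the Appendix log.
std_dist = [0.228, 0.244, 0.194, 0.154, 0.180]       # age distribution, whole sample
rates    = [0.1818, 0.1250, 0.0000, 0.4737, 0.6286]  # stratum rates, < H.S. group

# Adjusted rate = sum of (standard-population share) x (group-specific rate).
adjusted = sum(p * r for p, r in zip(std_dist, rates))
print(round(adjusted, 3))  # ~0.258, vs. the crude rate of 0.374
```

This matches Stata's "Adjusted Rate: 0.2581" for the < H.S. group, and makes clear why the adjusted figure falls: the standard population gives far less weight to the 65+ stratum, where disapproval is concentrated, than this group's own age distribution does.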
B. Simple Correlation and Regression

1. Correlation ratios
a) Table 1A shows a strong relationship between religious affiliation and tolerance of those opposed to religion, among Americans queried in 1974. Nearly three quarters of the Jews, nearly two thirds of those without religion, but only two fifths of the Catholics and hardly more than a quarter of the Protestants are in the highest tolerance category, endorsing the right of those "against religion" to speak at a public meeting, teach in a college or university, and have their books included in public library collections.

b) On a three-point tolerance scale (the number of endorsements of the rights mentioned in answer to (1a)), the means (and standard deviations) for the four religious categories are as follows:

                   Mean    S.D.
   Protestants     1.44    1.40
   Catholics       1.83    1.36
   Jews            2.36    1.30
   No religion     2.34    1.21
   Total           1.63    1.42

From visual inspection ("eyeballing" the data) we are led to a similar conclusion from the means as from the percentage distributions. Jews and those without religion are substantially more tolerant than are Catholics, who are in turn substantially more tolerant than are Protestants.
c) The correlation ratio, η², is computed from the ratio of the sum of squared deviations of observations from their subgroup means (the within-group sum of squares) to the sum of squared deviations of observations from the grand mean for the entire sample (the total sum of squares). It tells us what proportion of the variance in the dependent variable can be explained by knowledge of which of several categories of the independent variable an observation falls into. In the present case, the correlation ratio tells us how much of the variance in tolerance of anti-religionists we can explain by knowing the religious affiliation of respondents. The easiest way to compute η² is to compute sums of squares separately for each subgroup and then add them up to find the "within-group sum of squares." Then compute the sum of squares for the total column. This, obviously, is the "total sum of squares." Then simply make the computation indicated in the assignment. Recall from the assignment that:

   η² = 1 − (within-group sum of squares)/(total sum of squares)

This can be tabulated by hand. But it is possible to exploit Stata to shorten your hand computations and improve the accuracy. The trick is to read the data into Stata, treating the tolerance score, the frequencies for each religious group, and the total frequency as variables and the four rows (excluding the total) as observations. The -do- file I used to get the correlation ratio is shown as the second section of the Appendix.

For this problem, η² = .056. This indicates that about six per cent of the variance in tolerance of anti-religionists can be accounted for by the religious affiliations of the population. The apparent inconsistency between the small percentage of variance explained by religious affiliation and the large differences in the tolerance levels of the religious groups indicated by the percentage tables is due to the fact that the more tolerant groups, Jews and those without religion, are very small, and hence cannot contribute much to the total variance in the population. Put differently, religious affiliation cannot explain much of the variance in tolerance because most people have the same religion (65 per cent of the sample is Protestant). This is an important point, which will recur repeatedly in our subsequent discussion. Be sure you understand it.
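As a check on the hand (or Stata) computation, the same η² can be reproduced in a few lines of Python from the Table 1 frequencies (the frequencies below are the ones read into Stata in the second section of the Appendix):

```python
# Tolerance scores and frequencies from Table 1 (Part B.1);
# one frequency list per religious group, ordered by score 3, 2, 1, 0.
scores = [3, 2, 1, 0]
freqs = {
    "Protestant": [265, 212, 155, 322],
    "Catholic":   [151,  93,  50,  82],
    "Jewish":     [ 32,   3,   2,   7],
    "None":       [ 65,  18,   5,  13],
}

def wmean(scores, f):
    # frequency-weighted mean of the scores
    return sum(s * w for s, w in zip(scores, f)) / sum(f)

def ss_about(scores, f, center):
    # frequency-weighted sum of squared deviations about `center`
    return sum(w * (s - center) ** 2 for s, w in zip(scores, f))

# Within-group SS: deviations about each group's own mean, summed.
wgss = sum(ss_about(scores, f, wmean(scores, f)) for f in freqs.values())

# Total SS: deviations about the grand mean of the pooled sample.
tot = [sum(col) for col in zip(*freqs.values())]
tss = ss_about(scores, tot, wmean(scores, tot))

eta_sq = 1 - wgss / tss
print(round(eta_sq, 4))  # ~0.0558, matching the Stata log
```

The group means computed this way (1.44, 1.83, 2.36, 2.34) also reproduce the table in (1b).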
2. Correlation and regression

The Stata do file that generates the output for Part B.2 of the exercise is shown as the third section of the Appendix. I have extensively documented my do file.

2.a. See the do file.

2.b. See the do file.

2.b.1. As we see from the R-squared, 15.9 per cent of the variance in income is explained by variance in education.

2.b.2. To get predicted levels of income for particular levels of education, we substitute the level of education into the equation:
4.3
For high school graduates: = - 24,846 + 5141(12) = 36,846<br />
For someone with a B.A.: = - 24,846 + 5141(16) = 57,410<br />
For someone with 4 years after B.A.: = - 24,846 + 5141(20) = 77,974<br />
Yˆ<br />
Yˆ<br />
Yˆ<br />
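These substitutions are trivial to script. A minimal Python check of the three predicted values (using the regression coefficients from the Appendix):

```python
# Predicted income from the fitted line: Y-hat = -24,846 + 5,141 * educ.
def yhat(educ):
    return -24846 + 5141 * educ

print(yhat(12), yhat(16), yhat(20))  # 36846 57410 77974
```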
2.c. The correlation ratio is .188. This is three percentage points larger than the squared product moment correlation (.159). The correlation ratio will always be larger than (or, at the limit, equal to) the squared product moment correlation, since the squared product moment correlation gives 1 minus the ratio of the sum of squared deviations around a straight regression line to the sum of squared deviations around the mean of the dependent variable, while the correlation ratio gives 1 minus the ratio of the sum of squared deviations around the subgroup means to the sum of squared deviations around the mean of the dependent variable. Recall from your elementary statistics course that the mean is the point that minimizes the sum of squared deviations. If the means do not increase in an exactly linear fashion, the sum of squares around the subgroup means necessarily will be smaller than the sum of squares around the regression line.

2.d. See the do file.

2.e. The graph is shown as Fig. 1. The relationship between education and income appears to be curvilinear. First, no one (in these data) has negative income, and so the points for those with little education are uniformly above the regression line (although we should be cautious since there are relatively few cases in this part of the graph; very few men in the sample have less than 8 years of schooling).¹ [I know this not only from inspecting the scatter plot but also from a tabulation of education that I made but haven't shown; see the do file. In the course of your work, you should make many such tabulations to check whether the data are behaving as expected, even if you never include them in your final paper.] Second, it is probable that the presence of very high (arbitrarily coded at $150,000) incomes among some of those with high amounts of education would pull the line up on the right side were it not constrained to be linear. A better fit to the data might be found by positing a curvilinear relationship between education and income, which we will learn how to do shortly. The relatively large variance in income at each level of education tells us why the squared correlation is so low: while average income tends to increase with education, there is a great deal of individual variation in the incomes of equally well educated men. Note finally that the mean for those with 19 years of school is higher than for those with 20 or more years of school, which probably reflects the fact that lawyers are better paid than professors.
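The claim that η² can never fall below r² can be illustrated with a tiny made-up data set (pure Python; the numbers are hypothetical, not from the GSS). Both quantities are 1 minus a residual sum of squares over the total sum of squares; the subgroup means minimize the within-group sum of squares, while the regression line is constrained to be straight.

```python
# Toy data: three x-groups whose y-means (1.5, 4.5, 4.5) are NOT linear in x.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
tss = sum((v - my) ** 2 for v in y)  # total sum of squares

# r-squared: 1 - SS(residuals around least-squares line) / TSS
b = sum((a - mx) * (v - my) for a, v in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx
ss_line = sum((v - (a0 + b * a)) ** 2 for a, v in zip(x, y))
r_sq = 1 - ss_line / tss

# eta-squared: 1 - SS(deviations around subgroup means) / TSS
groups = {g: [v for a, v in zip(x, y) if a == g] for g in set(x)}
ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                for g in groups.values())
eta_sq = 1 - ss_within / tss

print(round(r_sq, 4), round(eta_sq, 4))
assert eta_sq >= r_sq  # always true; strict when the means are not collinear
```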
¹ When estimating correlation ratios, one should be cautious about the possibility of "over-fitting" the data by capitalizing on sampling variation. When the subgroups are small, the subgroup means may reflect sampling variability rather than true non-linearity in the population. Because of the danger of over-fitting data, it usually is preferable to estimate r² rather than η² when both variables are measured at the interval level—or to fit a smooth curve if the relationship is curvilinear. An alternative is to collapse small categories into a single larger category. In the present case, each year of schooling less than eight has fewer than 10 cases, so the estimated means are highly subject to sampling variability. I might, therefore, have collapsed years 0-8 into a single category to get a more reliable estimate of the mean for this group.
4.4
3. Individual and aggregate correlations

3.a. We wish to estimate an equation of the form Î = a + b(E), where I = the mean annual income of white males in each occupation and E = the years of schooling of white males in each occupation. Our problem is to solve for a and b. We get these from Table 2 by making use of the relationship b_YX = r_YX(s_Y/s_X) = .72(2153/2.3) = 674 and the relationship Ȳ = a + b(X̄), which implies that a = Ȳ − b(X̄). Thus, a = 5,311 − 674(10.2) = −1,564.

3.b. r²_IE = .72² = .5184. So 52 per cent of the variance over occupations in mean annual income can be attributed to variance in the average years of school completed by incumbents.
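These two formulas are easy to verify mechanically. A short Python sketch, using the summary statistics quoted from Table 2, recovers a and b:

```python
# Recovering the aggregate-level regression from summary statistics:
#   b = r * (s_Y / s_X),  a = mean(Y) - b * mean(X).
# Values below are the Table 2 figures quoted in the text.
r, s_y, s_x = 0.72, 2153, 2.3
mean_y, mean_x = 5311, 10.2

b = r * (s_y / s_x)
a = mean_y - b * mean_x
print(round(b), round(a))  # ~674 and ~-1564
```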
3.c. Recall from Part 2 that 16 per cent of the variance in the income of individuals can be explained by variance in their education, compared to 52 per cent of the variance in the average income of occupations explained by variance in their average education. It will almost always be the case that aggregated data will show higher correlations than disaggregated data involving the same variables, because most of the random variability is removed when summary characteristics of central tendency (medians, means, per cent above a certain level, etc.) are correlated. Be sure you understand this point. In the present case, this is easy to see by thinking about specific occupations. For example, while an occasional truck driver might be well educated and a few truck drivers (not necessarily the well educated ones) might make a good deal of money, truck drivers on the whole neither are very well educated nor make much money.
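A small sketch makes the aggregation point concrete. The "occupations" and numbers below are invented for illustration only: group means that line up well, plus large individual scatter within groups.

```python
def pearson(xs, ys):
    # Pearson product-moment correlation, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical (education, income) pairs for individuals in three occupations.
occs = {
    "A": [(10, 10), (10, 30)],
    "B": [(12, 20), (12, 40)],
    "C": [(14, 30), (14, 50)],
}
ind_x = [x for pts in occs.values() for x, _ in pts]
ind_y = [y for pts in occs.values() for _, y in pts]

# Aggregate to occupation means; the within-occupation scatter averages out.
agg_x = [sum(x for x, _ in pts) / len(pts) for pts in occs.values()]
agg_y = [sum(y for _, y in pts) / len(pts) for pts in occs.values()]

print(round(pearson(ind_x, ind_y), 3), round(pearson(agg_x, agg_y), 3))
```

Here the individual-level correlation is about .63, while the correlation between occupation means is 1.0, because the means happen to fall exactly on a line. Real data are less extreme, but the mechanism is the same as in the 16-vs-52-per-cent comparison above.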
4.5
[Fig. 1 appears here: a scatter plot of annual income (y-axis, 0 to 150,000) by years of schooling (x-axis, 0 to 20), showing the observed points, the linear fit, and the mean income at each year of schooling.]

Fig. 1. Annual Income in 2003 by Years of School Completed, U.S. Males, 2004 (Open-ended Upper Category Coded to $150,000); N=855.
Appendix - Log File for the Computations

--------------------------------------------------------------------------------------
       log:  d:\teach\soc212ab\2007-2008\computing\ex05.log
  log type:  text
 opened on:  2 Nov 2007, 22:07:26

. version 10.0;
. #delimit;
delimiter now ;
. clear;
. set more 1;
. program drop _all;
. set mem 100m;

Current memory allocation

                      current                               memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.909M
    set memory         100M     max. data space                100.000M
    set matsize         500     max. RHS vars in models          1.949M
                                                            -----------
                                                               103.858M

. *EX05.DO (DJT 8/31/03, last modified 11/2/06);
4.6
. ***********************************************
> This -do- file does the computations for Ex. 5.
> ***********************************************;

. ********************************
> *Part A: Direct standardization.
> ********************************;

. use c:\e_old\data\gss\gssy2004.dta,replace;

. *Create a categorical variable for education.;
. recode educ
> (0/11=1 "Some H.S. or less")
> (12=2 "H.S. grad")
> (13/15=3 "Some college")
> (16/20=4 "Coll.grad.or more")
> (*=.),gen(edcat) label(edcat);
(2809 differences between educ and edcat)

. *Create a categorical variable for age.;
. recode age
> (25/34=1 "25-34")
> (35/44=2 "35-44")
> (45/54=3 "45-54")
> (55/64=4 "55-64")
> (65/89=5 "65+")
> (*=.),gen(agecat) label(agecat);
(2803 differences between age and agecat)

. *Convert the missing data codes for premarital sex to the Stata missing
> value.;
. recode premarsx 0 8 9=.;
(premarsx: 18 changes made)

. tab premarsx;

      sex before |
        marriage |      Freq.     Percent        Cum.
-----------------+-----------------------------------
    always wrong |        238       26.95       26.95
almst always wrg |         79        8.95       35.90
 sometimes wrong |        157       17.78       53.68
not wrong at all |        409       46.32      100.00
-----------------+-----------------------------------
           Total |        883      100.00

. *Mark the "good"--that is, non-missing--data to ensure that all
> analysis is based on the same data set.;
. mark good if premarsx~=. & agecat~=. & edcat~=.;

. *Get the necessary cross-tabs to understand the relationships among
> the three variables.;
. tab premarsx edcat if good==1,col;
4.7
+-------------------+
| Key               |
|-------------------|
| frequency         |
| column percentage |
+-------------------+

                 | RECODE of educ (highest year of school completed)
      sex before |
        marriage | Some H.S.  H.S. grad  Some coll  Coll.grad |     Total
-----------------+--------------------------------------------+----------
    always wrong |        37         69         63         47 |       216
                 |     37.37      32.24      26.47      19.67 |     27.34
-----------------+--------------------------------------------+----------
almst always wrg |        12         16         21         24 |        73
                 |     12.12       7.48       8.82      10.04 |      9.24
-----------------+--------------------------------------------+----------
 sometimes wrong |        10         35         35         49 |       129
                 |     10.10      16.36      14.71      20.50 |     16.33
-----------------+--------------------------------------------+----------
not wrong at all |        40         94        119        119 |       372
                 |     40.40      43.93      50.00      49.79 |     47.09
-----------------+--------------------------------------------+----------
           Total |        99        214        238        239 |       790
                 |    100.00     100.00     100.00     100.00 |    100.00

. tab premarsx agecat if good==1,col;

+-------------------+
| Key               |
|-------------------|
| frequency         |
| column percentage |
+-------------------+

                 | RECODE of age (age of respondent)
      sex before |
        marriage |     25-34      35-44      45-54      55-64        65+ |     Total
-----------------+-------------------------------------------------------+----------
    always wrong |        41         49         38         33         55 |       216
                 |     22.78      25.39      24.84      27.05      38.73 |     27.34
-----------------+-------------------------------------------------------+----------
almst always wrg |        11         14         13         11         24 |        73
                 |      6.11       7.25       8.50       9.02      16.90 |      9.24
-----------------+-------------------------------------------------------+----------
 sometimes wrong |        31         29         23         20         26 |       129
                 |     17.22      15.03      15.03      16.39      18.31 |     16.33
-----------------+-------------------------------------------------------+----------
not wrong at all |        97        101         79         58         37 |       372
                 |     53.89      52.33      51.63      47.54      26.06 |     47.09
-----------------+-------------------------------------------------------+----------
           Total |       180        193        153        122        142 |       790
                 |    100.00     100.00     100.00     100.00     100.00 |    100.00

. tab agecat edcat if good==1,row;

+----------------+
| Key            |
|----------------|
| frequency      |
| row percentage |
+----------------+
4.8
 RECODE of |
  age (age |
        of | RECODE of educ (highest year of school completed)
respondent |
         ) | Some H.S.  H.S. grad  Some coll  Coll.grad |     Total
-----------+--------------------------------------------+----------
     25-34 |        22         32         57         69 |       180
           |     12.22      17.78      31.67      38.33 |    100.00
-----------+--------------------------------------------+----------
     35-44 |        16         61         71         45 |       193
           |      8.29      31.61      36.79      23.32 |    100.00
-----------+--------------------------------------------+----------
     45-54 |         7         39         48         59 |       153
           |      4.58      25.49      31.37      38.56 |    100.00
-----------+--------------------------------------------+----------
     55-64 |        19         32         38         33 |       122
           |     15.57      26.23      31.15      27.05 |    100.00
-----------+--------------------------------------------+----------
       65+ |        35         50         24         33 |       142
           |     24.65      35.21      16.90      23.24 |    100.00
-----------+--------------------------------------------+----------
     Total |        99        214        238        239 |       790
           |     12.53      27.09      30.13      30.25 |    100.00

. *Convert the premarital sex variable to a set of dichotomous variables,
> necessary to get directly standardized percentages.;
. tab premarsx if good==1,gen(pm);

      sex before |
        marriage |      Freq.     Percent        Cum.
-----------------+-----------------------------------
    always wrong |        216       27.34       27.34
almst always wrg |         73        9.24       36.58
 sometimes wrong |        129       16.33       52.91
not wrong at all |        372       47.09      100.00
-----------------+-----------------------------------
           Total |        790      100.00

. *Create the "popvar" variable, here called "tot," required by Stata's
> "dstdize" command. Note that the way to make dstdize work on unit
> rather than tabular data is to set the popvar variable = 1 for each
> case.;
. gen tot=1 if good==1;
(2022 missing values generated)

. *Do the direct standardization.;
. for num 1/4:dstdize pmX tot agecat if good==1,by(edcat);

-> dstdize pm1 tot agecat if good==1,by(edcat)
----------------------------------------------------------
-> edcat= 1

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      22      4     0.222   0.1818   0.228  0.0414
   35-44      16      2     0.162   0.1250   0.244  0.0305
   45-54       7      0     0.071   0.0000   0.194  0.0000
   55-64      19      9     0.192   0.4737   0.154  0.0732
     65+      35     22     0.354   0.6286   0.180  0.1130
----------------------------------------------------------
4.9
Totals:       99     37      Adjusted Cases:   25.6
                                 Crude Rate:  0.3737
                              Adjusted Rate:  0.2581
                         95% Conf. Interval: [0.1878, 0.3284]

----------------------------------------------------------
-> edcat= 2

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      32      9     0.150   0.2813   0.228  0.0641
   35-44      61     21     0.285   0.3443   0.244  0.0841
   45-54      39     13     0.182   0.3333   0.194  0.0646
   55-64      32      6     0.150   0.1875   0.154  0.0290
     65+      50     20     0.234   0.4000   0.180  0.0719
----------------------------------------------------------
Totals:      214     69      Adjusted Cases:   67.1
                                 Crude Rate:  0.3224
                              Adjusted Rate:  0.3136
                         95% Conf. Interval: [0.2507, 0.3765]

----------------------------------------------------------
-> edcat= 3

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      57     16     0.239   0.2807   0.228  0.0640
   35-44      71     18     0.298   0.2535   0.244  0.0619
   45-54      48     11     0.202   0.2292   0.194  0.0444
   55-64      38     12     0.160   0.3158   0.154  0.0488
     65+      24      6     0.101   0.2500   0.180  0.0449
----------------------------------------------------------
Totals:      238     63      Adjusted Cases:   62.8
                                 Crude Rate:  0.2647
                              Adjusted Rate:  0.2640
                         95% Conf. Interval: [0.2062, 0.3218]

----------------------------------------------------------
-> edcat= 4

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      69     12     0.289   0.1739   0.228  0.0396
   35-44      45      8     0.188   0.1778   0.244  0.0434
   45-54      59     14     0.247   0.2373   0.194  0.0460
   55-64      33      6     0.138   0.1818   0.154  0.0281
     65+      33      7     0.138   0.2121   0.180  0.0381
----------------------------------------------------------
Totals:      239     47      Adjusted Cases:   46.7
                                 Crude Rate:  0.1967
                              Adjusted Rate:  0.1952
                         95% Conf. Interval: [0.1438, 0.2466]

Summary of Study Populations:
       edcat        N     Crude    Adj_Rate       Confidence Interval
--------------------------------------------------------------------------
           1       99  0.373737    0.258100   [ 0.187773,  0.328426]
           2      214  0.322430    0.313598   [ 0.250660,  0.376536]
           3      238  0.264706    0.263981   [ 0.206202,  0.321759]
           4      239  0.196653    0.195220   [ 0.143805,  0.246635]

-> dstdize pm2 tot agecat if good==1,by(edcat)
[Log omitted for pm2-pm4 to save space.]

. *****************************
> Part B.1: Correlation ratios.
> *****************************;

. preserve;
. *Read in the frequencies from Table 1 in the assignment and confirm correct.;
. clear;
. infile score p c j n tot using ex6b.raw;
(4 observations read)

. list;

     +-----------------------------------+
     | score     p     c    j    n   tot |
     |-----------------------------------|
  1. |     3   265   151   32   65   513 |
  2. |     2   212    93    3   18   326 |
  3. |     1   155    50    2    5   212 |
  4. |     0   322    82    7   13   424 |
     +-----------------------------------+

. *Get the within-group sum of squared deviations.;
. for var p c j n: sum score [aw=X] \ gen sdX=X*((score-r(mean))^2)
> \ gen ssX=sum(sdX);

-> sum score [aw=p]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         954    1.440252    1.403347          0          3

-> gen sdp=p*((score-r(mean))^2)
-> gen ssp=sum(sdp)
-> sum score [aw=c]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         376    1.832447    1.355896          0          3

-> gen sdc=c*((score-r(mean))^2)
-> gen ssc=sum(sdc)
-> sum score [aw=j]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4          44    2.363636    1.304791          0          3

-> gen sdj=j*((score-r(mean))^2)
-> gen ssj=sum(sdj)
-> sum score [aw=n]
    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         101    2.336634    1.208083          0          3

-> gen sdn=n*((score-r(mean))^2)
-> gen ssn=sum(sdn)

. egen wgss=rsum(ssp ssc ssj ssn) in 4;
(3 missing values generated)

. *Get the total sum of squared deviations.;
. sum score [aw=tot];

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4        1475    1.629153    1.416017          0          3

. gen gmean=r(mean);
. for var p c j n: gen tdX=X*((score-gmean)^2) \ gen tsX=sum(tdX);

-> gen tdp=p*((score-gmean)^2)
-> gen tsp=sum(tdp)
-> gen tdc=c*((score-gmean)^2)
-> gen tsc=sum(tdc)
-> gen tdj=j*((score-gmean)^2)
-> gen tsj=sum(tdj)
-> gen tdn=n*((score-gmean)^2)
-> gen tsn=sum(tdn)

. egen tss=rsum(tsp tsc tsj tsn) in 4;
(3 missing values generated)

. *Get eta-squared.;
. gen etasq=1 - wgss/tss;
(3 missing values generated)

. list etasq;

     +----------+
     |    etasq |
     |----------|
  1. |        . |
  2. |        . |
  3. |        . |
  4. | .0558446 |
     +----------+

. restore;

. **************************************
> *Part B.2: Correlation and Regression.
> **************************************;

. *Recode income;
. recode rincom98 1=500 2=2000 3=3500 4=4500 5=5500 6=6500 7=7500 8=9000
> 9=11250 10=13750 11=16250 12=18750 13=21250 14=23750 15=27500
> 16=32500 17=37500 18=45000 19=55000 20=67500 21=82500
> 22=100000 23=150000 *=.,gen(inc);
(1849 differences between rincom98 and inc)

. *Mark complete data for Problem 2.;
. replace educ=. if educ>20;
(0 real changes made)

. mark good2 if sex==1;
. markout good2 inc educ;

. *Get the regression of income on education.;
. reg inc educ if good2==1;

      Source |       SS       df       MS              Number of obs =     855
-------------+------------------------------           F(  1,   853) =  161.61
       Model |  2.0172e+11     1  2.0172e+11           Prob > F      =  0.0000
    Residual |  1.0647e+12   853  1.2482e+09           R-squared     =  0.1593
-------------+------------------------------           Adj R-squared =  0.1583
       Total |  1.2664e+12   854  1.4829e+09           Root MSE      =   35329

------------------------------------------------------------------------------
         inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   5141.039   404.4016    12.71   0.000       4347.3   5934.778
       _cons |  -24846.61   5824.733    -4.27   0.000     -36279.1  -13414.12
------------------------------------------------------------------------------

. *Get the correlation ratio of income on education.;
. anova inc educ if good2==1;

                           Number of obs =     855     R-squared     =  0.1881
                           Root MSE      = 35068.8     Adj R-squared =  0.1707

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  2.3826e+11    18  1.3237e+10      10.76     0.0000
             |
        educ |  2.3826e+11    18  1.3237e+10      10.76     0.0000
             |
    Residual |  1.0281e+12   836  1.2298e+09
  -----------+----------------------------------------------------
       Total |  1.2664e+12   854  1.4829e+09

. *Now get the mean income for each year of schooling. It is a
> good idea to make a tabulation of mean income by year of schooling,
> to determine how many cases there are in each category.;
. tab educ if good2==1,s(inc) mean freq;

    highest |  Summary of RECODE of
    year of | rincom98 (respondents
     school |        income)
  completed |      Mean       Freq.
------------+------------------------
          1 |     37500           1
          2 |     50000           2
          3 |     36875           2
          5 |     24375           2
          6 |     27750           5
          7 | 17583.333           3
          8 |     16250           6
          9 | 21973.684          19
         10 | 28009.615          26
         11 |  25481.25          40
         12 |   36769.9         201
         13 | 40492.308          65
         14 | 43408.397         131
         15 |     42310          50
         16 | 56506.494         154
         17 | 65546.875          32
         18 | 72534.091          44
         19 |     86650          25
         20 | 84984.043          47
------------+------------------------
      Total | 47590.936         855

. egen avginc = mean(inc) if good2==1, by(educ);
(1957 missing values generated)

. *Graph the relationship between actual income, mean income given
> education, predicted income from the linear regression of income
> on education, and education.;

. *Note: The graphics commands in Stata 8.0 and 9.0 are completely rewritten
> from previous versions. They are much more powerful, but take some
> studying to figure out. The thing to do is first to make a simple
> graph, without labels, and then to successively add refinements.
> This is what I did here. We will discuss graphics many times in class.
> But the commands are too complicated for a simple exposition here to be
> very helpful. One way you can take advantage of my work is to study my
> command and compare it to the graph, to see how I have achieved various
> labeling.;

. label var inc "Observed";
. label var avginc "Mean|Yrs. of School";

. *I strongly prefer the "lean1" scheme programmed by Juul (see the article
> posted on the course web page). The most important reason for this is that
> the y-axis labels are shown horizontally rather than vertically. But there
> are other improvements as well, discussed in the article. Thus, I set "lean1"
> (which I have downloaded from Stata's web page) as my permanent graphics
> scheme.;
. set scheme lean1,permanent;
(set scheme preference recorded)

. graph twoway
> (scatter inc educ,msymbol(Oh) mcolor(black) jitter(5))
> (lfit inc educ,sort clwidth(thick) clpattern(solid) clcolor(red))
> (line avginc educ,sort clwidth(thick) clpattern(solid) clcolor(blue))
> if good2==1,
> legend(label(2 "Linear fit") cols(1) ring(0) position(11))
> ylab(0(50000)150000)
> ymtick(0(10000)150000)
> xlab(0(4)20)
> xmtick(0(1)20)
> ytitle("Annual Income in 1993")
> xtitle("Years of Schooling")
> saving(ex05.gph,replace);
(file ex05.gph saved)
. *A note on getting graphs into your word processor document. The
> simplest way to do this (there may be others) is, when the graph
> is on the screen, to click on "edit" and then "copy graph" and then
> toggle to your word processor and paste the graph into it. (When
> I tried this in MS Word, it didn't work for me but an alternative
> did: simply click cntrl-c (the standard Windows copy command) with
> your cursor on the graph, then toggle to MS Word and click cntrl-v
> (the standard Windows paste command).
>
> You should always save your graphs so that you can access them without
> having to rerun the do file that created them---a useful time saving
> when one has a complex do file that takes a long time to execute. A
> new and happy feature of Stata 8.0 and 9.0 is that you can edit saved graphs.;

. log close;
       log:  d:\teach\soc212ab\2007-2008\computing\ex05.log
  log type:  text
 closed on:  2 Nov 2007, 22:07:39
--------------------------------------------------------------------------------------