Quantitative Data Analysis
Prof. Treiman
Exercise 4: Illustrative Answer

A. Direct Standardization via Stata

Here is the table called for in the assignment.
                                            Education
Premarital sex is...           < H.S. grad   H.S. grad   Some coll.   College +
-------------------------------------------------------------------------------
Observed percentages
  Always wrong                     37.4         32.2        26.5         19.7
  Almost always wrong              12.1          7.5         8.8         10.0
  Sometimes wrong                  10.1         16.4        14.7         20.5
  Not wrong at all                 40.4         43.9        50.0         49.8
  Total                           100.0        100.0       100.0        100.0
  N                                (99)        (214)       (238)        (239)
Percentages adjusted for age
  Always wrong                     25.8         31.4        26.4         19.5
  Almost always wrong              15.8          6.9        10.2         10.4
  Sometimes wrong                  10.3         16.2        15.3         20.8
  Not wrong at all                 48.1         45.5        48.1         49.3
The top panel of the table shows a modest association between education and attitudes regarding premarital sex, with the better educated less likely to think that premarital sex is wrong. About 37 per cent of those lacking a high school education think that premarital sex is always wrong, compared to less than 20 per cent of those with college degrees. At the other extreme, only about 40 per cent of those with less than high school think that premarital sex is not wrong at all, compared to about half of those with at least some college.

We might suspect that these results simply reflect the tendency of older people both to be more poorly educated and to have less liberal attitudes regarding premarital sex than younger people. Indeed, both of these associations hold for late 20th century Americans. [I have estimated but have not presented the tables, but you can easily make them. In a full analysis, you would want to show these tables.] But standardizing for age reveals that the association between education and acceptance of premarital sex does not arise simply from their joint dependence on age. Indeed, standardizing for age has virtually no impact on the coefficients in the table, except for those with less than a high school education. The disapproval of premarital sex by those with less than a high school education is apparently due in part to the fact that the poorly educated are disproportionately elderly, since there is a noticeable shift toward greater acceptance of premarital sex in this group after standardizing for age.

The -do- file that created these computations is shown in the Appendix.
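The arithmetic of direct standardization is simple enough to verify outside Stata: a group's adjusted rate is the sum, over age strata, of the standard population's share of each stratum times the group's stratum-specific rate. The sketch below (Python, illustrative only) reproduces the age-adjusted "always wrong" proportion for the lowest education group from the stratum-specific figures reported in the Appendix.

```python
# Direct standardization by hand. Numbers are copied from the dstdize
# panel for edcat=1 ("always wrong") in the Appendix log.
std_dist = [0.228, 0.244, 0.194, 0.154, 0.180]       # age distribution, whole sample
rates    = [0.1818, 0.1250, 0.0000, 0.4737, 0.6286]  # stratum rates, < H.S. group

# Adjusted rate = sum of (standard-population share) x (group-specific rate).
adjusted = sum(p * r for p, r in zip(std_dist, rates))
print(round(adjusted, 3))  # ~0.258, vs. the crude rate of 0.374
```

This matches Stata's "Adjusted Rate: 0.2581" for the < H.S. group, and makes clear why the adjusted figure falls: the standard population gives far less weight to the 65+ stratum, where disapproval is concentrated, than this group's own age distribution does.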
B. Simple Correlation and Regression

1. Correlation ratios
a) Table 1A shows a strong relationship between religious affiliation and tolerance of those opposed to religion, among Americans queried in 1974. Nearly three quarters of the Jews, nearly two thirds of those without religion, but only two fifths of the Catholics and hardly more than a quarter of the Protestants are in the highest tolerance category, endorsing the right of those "against religion" to speak at a public meeting, teach in a college or university, and have their books included in public library collections.

b) On a three-point tolerance scale (the number of endorsements of the rights mentioned in answer to (1a)), the means (and standard deviations) for the four religious categories are as follows:

                   Mean    S.D.
   Protestants     1.44    1.40
   Catholics       1.83    1.36
   Jews            2.36    1.30
   No religion     2.34    1.21
   Total           1.63    1.42

From visual inspection ("eyeballing" the data) we are led to a similar conclusion from the means as from the percentage distributions. Jews and those without religion are substantially more tolerant than are Catholics, who are in turn substantially more tolerant than are Protestants.
c) The correlation ratio, η², is computed from the ratio of the sum of squared deviations of observations from their subgroup means (the within-group sum of squares) to the sum of squared deviations of observations from the grand mean for the entire sample (the total sum of squares). It tells us what proportion of the variance in the dependent variable can be explained by knowledge of which of several categories of the independent variable an observation falls into. In the present case, the correlation ratio tells us how much of the variance in tolerance of anti-religionists we can explain by knowing the religious affiliation of respondents. The easiest way to compute η² is to compute sums of squares separately for each subgroup and then add them up to find the "within-group sum of squares." Then compute the sum of squares for the total column. This, obviously, is the "total sum of squares." Then simply make the computation indicated in the assignment. Recall from the assignment that:

   η² = 1 − (within-group sum of squares)/(total sum of squares)

This can be tabulated by hand. But it is possible to exploit Stata to shorten your hand computations and improve the accuracy. The trick is to read the data into Stata, treating the tolerance score, the frequencies for each religious group, and the total frequency as variables and the four rows (excluding the total) as observations. The -do- file I used to get the correlation ratio is shown as the second section of the Appendix.

For this problem, η² = .056. This indicates that about six per cent of the variance in tolerance of anti-religionists can be accounted for by the religious affiliations of the population. The apparent inconsistency between the small percentage of variance explained by religious affiliation and the large differences in the tolerance levels of the religious groups indicated by the percentage tables is due to the fact that the more tolerant groups, Jews and those without religion, are very small, and hence cannot contribute much to the total variance in the population. Put differently, religious affiliation cannot explain much of the variance in tolerance because most people have the same religion (65 per cent of the sample is Protestant). This is an important point, which will recur repeatedly in our subsequent discussion. Be sure you understand it.
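As a check on the hand (or Stata) computation, the same η² can be reproduced in a few lines of Python from the Table 1 frequencies (the frequencies below are the ones read into Stata in the second section of the Appendix):

```python
# Tolerance scores and frequencies from Table 1 (Part B.1);
# one frequency list per religious group, ordered by score 3, 2, 1, 0.
scores = [3, 2, 1, 0]
freqs = {
    "Protestant": [265, 212, 155, 322],
    "Catholic":   [151,  93,  50,  82],
    "Jewish":     [ 32,   3,   2,   7],
    "None":       [ 65,  18,   5,  13],
}

def wmean(scores, f):
    # frequency-weighted mean of the scores
    return sum(s * w for s, w in zip(scores, f)) / sum(f)

def ss_about(scores, f, center):
    # frequency-weighted sum of squared deviations about `center`
    return sum(w * (s - center) ** 2 for s, w in zip(scores, f))

# Within-group SS: deviations about each group's own mean, summed.
wgss = sum(ss_about(scores, f, wmean(scores, f)) for f in freqs.values())

# Total SS: deviations about the grand mean of the pooled sample.
tot = [sum(col) for col in zip(*freqs.values())]
tss = ss_about(scores, tot, wmean(scores, tot))

eta_sq = 1 - wgss / tss
print(round(eta_sq, 4))  # ~0.0558, matching the Stata log
```

The group means computed this way (1.44, 1.83, 2.36, 2.34) also reproduce the table in (1b).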
2. Correlation and regression

The Stata do file that generates the output for Part B.2 of the exercise is shown as the third section of the Appendix. I have extensively documented my do file.

2.a. See the do file.

2.b. See the do file.

2.b.1. As we see from the R-squared, 15.9 per cent of the variance in income is explained by variance in education.

2.b.2. To get predicted levels of income for particular levels of education, we substitute the level of education into the equation:
4.3
For high school graduates: = - 24,846 + 5141(12) = 36,846<br />
For someone with a B.A.: = - 24,846 + 5141(16) = 57,410<br />
For someone with 4 years after B.A.: = - 24,846 + 5141(20) = 77,974<br />
Yˆ<br />
Yˆ<br />
Yˆ<br />
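These substitutions are trivial to script. A minimal Python check of the three predicted values (using the regression coefficients from the Appendix):

```python
# Predicted income from the fitted line: Y-hat = -24,846 + 5,141 * educ.
def yhat(educ):
    return -24846 + 5141 * educ

print(yhat(12), yhat(16), yhat(20))  # 36846 57410 77974
```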
2.c. The correlation ratio is .188. This is three percentage points larger than the squared product moment correlation (.159). The correlation ratio will always be larger than (or, at the limit, equal to) the squared product moment correlation, since the squared product moment correlation gives 1 minus the ratio of the sum of squared deviations around a straight regression line to the sum of squared deviations around the mean of the dependent variable, while the correlation ratio gives 1 minus the ratio of the sum of squared deviations around the subgroup means to the sum of squared deviations around the mean of the dependent variable. Recall from your elementary statistics course that the mean is the point that minimizes the sum of squared deviations. If the means do not increase in an exactly linear fashion, the sum of squares around the subgroup means necessarily will be smaller than the sum of squares around the regression line.

2.d. See the do file.

2.e. The graph is shown as Fig. 1. The relationship between education and income appears to be curvilinear. First, no one (in these data) has negative income, and so the points for those with little education are uniformly above the regression line (although we should be cautious since there are relatively few cases in this part of the graph; very few men in the sample have less than 8 years of schooling).¹ [I know this not only from inspecting the scatter plot but also from a tabulation of education that I made but haven't shown; see the do file. In the course of your work, you should make many such tabulations to check whether the data are behaving as expected, even if you never include them in your final paper.] Second, it is probable that the presence of very high (arbitrarily coded at $150,000) incomes among some of those with high amounts of education would pull the line up on the right side were it not constrained to be linear. A better fit to the data might be found by positing a curvilinear relationship between education and income, which we will learn how to do shortly. The relatively large variance in income at each level of education tells us why the squared correlation is so low: while average income tends to increase with education, there is a great deal of individual variation in the incomes of equally well educated men. Note finally that the mean for those with 19 years of school is higher than for those with 20 or more years of school, which probably reflects the fact that lawyers are better paid than professors.
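The claim that η² can never fall below r² can be illustrated with a tiny made-up data set (pure Python; the numbers are hypothetical, not from the GSS). Both quantities are 1 minus a residual sum of squares over the total sum of squares; the subgroup means minimize the within-group sum of squares, while the regression line is constrained to be straight.

```python
# Toy data: three x-groups whose y-means (1.5, 4.5, 4.5) are NOT linear in x.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
tss = sum((v - my) ** 2 for v in y)  # total sum of squares

# r-squared: 1 - SS(residuals around least-squares line) / TSS
b = sum((a - mx) * (v - my) for a, v in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx
ss_line = sum((v - (a0 + b * a)) ** 2 for a, v in zip(x, y))
r_sq = 1 - ss_line / tss

# eta-squared: 1 - SS(deviations around subgroup means) / TSS
groups = {g: [v for a, v in zip(x, y) if a == g] for g in set(x)}
ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g)
                for g in groups.values())
eta_sq = 1 - ss_within / tss

print(round(r_sq, 4), round(eta_sq, 4))
assert eta_sq >= r_sq  # always true; strict when the means are not collinear
```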
¹ When estimating correlation ratios, one should be cautious about the possibility of "over-fitting" the data by capitalizing on sampling variation. When the subgroups are small, the subgroup means may reflect sampling variability rather than true non-linearity in the population. Because of the danger of over-fitting data, it usually is preferable to estimate r² rather than η² when both variables are measured at the interval level—or to fit a smooth curve if the relationship is curvilinear. An alternative is to collapse small categories into a single larger category. In the present case, each year of schooling less than eight has fewer than 10 cases, so the estimated means are highly subject to sampling variability. I might, therefore, have collapsed years 0-8 into a single category to get a more reliable estimate of the mean for this group.
4.4
3. Individual and aggregate correlations

3.a. We wish to estimate an equation of the form Î = a + b(E), where I = the mean annual income of white males in each occupation and E = the years of schooling of white males in each occupation. Our problem is to solve for a and b. We get these from Table 2 by making use of the relationship b_YX = r_YX(s_Y/s_X) = .72(2153/2.3) = 674 and the relationship Ȳ = a + b(X̄), which implies that a = Ȳ − b(X̄). Thus, a = 5,311 − 674(10.2) = −1,564.

3.b. r²_IE = .72² = .5184. So 52 per cent of the variance over occupations in mean annual income can be attributed to variance in the average years of school completed by incumbents.
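These two formulas are easy to verify mechanically. A short Python sketch, using the summary statistics quoted from Table 2, recovers a and b:

```python
# Recovering the aggregate-level regression from summary statistics:
#   b = r * (s_Y / s_X),  a = mean(Y) - b * mean(X).
# Values below are the Table 2 figures quoted in the text.
r, s_y, s_x = 0.72, 2153, 2.3
mean_y, mean_x = 5311, 10.2

b = r * (s_y / s_x)
a = mean_y - b * mean_x
print(round(b), round(a))  # ~674 and ~-1564
```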
3.c. Recall from Part 2 that 16 per cent of the variance in the income of individuals can be explained by variance in their education, compared to 52 per cent of the variance in the average income of occupations explained by variance in their average education. It will almost always be the case that aggregated data will show higher correlations than disaggregated data involving the same variables, because most of the random variability is removed when summary characteristics of central tendency (medians, means, per cent above a certain level, etc.) are correlated. Be sure you understand this point. In the present case, this is easy to see by thinking about specific occupations. For example, while an occasional truck driver might be well educated and a few truck drivers (not necessarily the well educated ones) might make a good deal of money, truck drivers on the whole neither are very well educated nor make much money.
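A small sketch makes the aggregation point concrete. The "occupations" and numbers below are invented for illustration only: group means that line up well, plus large individual scatter within groups.

```python
def pearson(xs, ys):
    # Pearson product-moment correlation, computed from first principles.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical (education, income) pairs for individuals in three occupations.
occs = {
    "A": [(10, 10), (10, 30)],
    "B": [(12, 20), (12, 40)],
    "C": [(14, 30), (14, 50)],
}
ind_x = [x for pts in occs.values() for x, _ in pts]
ind_y = [y for pts in occs.values() for _, y in pts]

# Aggregate to occupation means; the within-occupation scatter averages out.
agg_x = [sum(x for x, _ in pts) / len(pts) for pts in occs.values()]
agg_y = [sum(y for _, y in pts) / len(pts) for pts in occs.values()]

print(round(pearson(ind_x, ind_y), 3), round(pearson(agg_x, agg_y), 3))
```

Here the individual-level correlation is about .63, while the correlation between occupation means is 1.0, because the means happen to fall exactly on a line. Real data are less extreme, but the mechanism is the same as in the 16-vs-52-per-cent comparison above.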
4.5
[Fig. 1 appears here: a scatter plot of annual income (y-axis, 0 to 150,000) by years of schooling (x-axis, 0 to 20), showing the observed points, the linear fit, and the mean income at each year of schooling.]

Fig. 1. Annual Income in 2003 by Years of School Completed, U.S. Males, 2004 (Open-ended Upper Category Coded to $150,000); N=855.
Appendix - Log File for the Computations

--------------------------------------------------------------------------------------
       log:  d:\teach\soc212ab\2007-2008\computing\ex05.log
  log type:  text
 opened on:  2 Nov 2007, 22:07:26

. version 10.0;
. #delimit;
delimiter now ;
. clear;
. set more 1;
. program drop _all;
. set mem 100m;

Current memory allocation

                      current                               memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.909M
    set memory         100M     max. data space                100.000M
    set matsize         500     max. RHS vars in models          1.949M
                                                            -----------
                                                               103.858M

. *EX05.DO (DJT 8/31/03, last modified 11/2/06);
4.6
. ***********************************************
> This -do- file does the computations for Ex. 5.
> ***********************************************;

. ********************************
> *Part A: Direct standardization.
> ********************************;

. use c:\e_old\data\gss\gssy2004.dta,replace;

. *Create a categorical variable for education.;
. recode educ
> (0/11=1 "Some H.S. or less")
> (12=2 "H.S. grad")
> (13/15=3 "Some college")
> (16/20=4 "Coll.grad.or more")
> (*=.),gen(edcat) label(edcat);
(2809 differences between educ and edcat)

. *Create a categorical variable for age.;
. recode age
> (25/34=1 "25-34")
> (35/44=2 "35-44")
> (45/54=3 "45-54")
> (55/64=4 "55-64")
> (65/89=5 "65+")
> (*=.),gen(agecat) label(agecat);
(2803 differences between age and agecat)

. *Convert the missing data codes for premarital sex to the Stata missing
> value.;
. recode premarsx 0 8 9=.;
(premarsx: 18 changes made)

. tab premarsx;

      sex before |
        marriage |      Freq.     Percent        Cum.
-----------------+-----------------------------------
    always wrong |        238       26.95       26.95
almst always wrg |         79        8.95       35.90
 sometimes wrong |        157       17.78       53.68
not wrong at all |        409       46.32      100.00
-----------------+-----------------------------------
           Total |        883      100.00

. *Mark the "good"--that is, non-missing--data to ensure that all
> analysis is based on the same data set.;
. mark good if premarsx~=. & agecat~=. & edcat~=.;

. *Get the necessary cross-tabs to understand the relationships among
> the three variables.;
. tab premarsx edcat if good==1,col;
4.7
+-------------------+
| Key               |
|-------------------|
| frequency         |
| column percentage |
+-------------------+

                 | RECODE of educ (highest year of school completed)
      sex before |
        marriage | Some H.S.  H.S. grad  Some coll  Coll.grad |     Total
-----------------+--------------------------------------------+----------
    always wrong |        37         69         63         47 |       216
                 |     37.37      32.24      26.47      19.67 |     27.34
-----------------+--------------------------------------------+----------
almst always wrg |        12         16         21         24 |        73
                 |     12.12       7.48       8.82      10.04 |      9.24
-----------------+--------------------------------------------+----------
 sometimes wrong |        10         35         35         49 |       129
                 |     10.10      16.36      14.71      20.50 |     16.33
-----------------+--------------------------------------------+----------
not wrong at all |        40         94        119        119 |       372
                 |     40.40      43.93      50.00      49.79 |     47.09
-----------------+--------------------------------------------+----------
           Total |        99        214        238        239 |       790
                 |    100.00     100.00     100.00     100.00 |    100.00

. tab premarsx agecat if good==1,col;

+-------------------+
| Key               |
|-------------------|
| frequency         |
| column percentage |
+-------------------+

                 | RECODE of age (age of respondent)
      sex before |
        marriage |     25-34      35-44      45-54      55-64        65+ |     Total
-----------------+-------------------------------------------------------+----------
    always wrong |        41         49         38         33         55 |       216
                 |     22.78      25.39      24.84      27.05      38.73 |     27.34
-----------------+-------------------------------------------------------+----------
almst always wrg |        11         14         13         11         24 |        73
                 |      6.11       7.25       8.50       9.02      16.90 |      9.24
-----------------+-------------------------------------------------------+----------
 sometimes wrong |        31         29         23         20         26 |       129
                 |     17.22      15.03      15.03      16.39      18.31 |     16.33
-----------------+-------------------------------------------------------+----------
not wrong at all |        97        101         79         58         37 |       372
                 |     53.89      52.33      51.63      47.54      26.06 |     47.09
-----------------+-------------------------------------------------------+----------
           Total |       180        193        153        122        142 |       790
                 |    100.00     100.00     100.00     100.00     100.00 |    100.00

. tab agecat edcat if good==1,row;

+----------------+
| Key            |
|----------------|
| frequency      |
| row percentage |
+----------------+
4.8
 RECODE of |
  age (age |
        of | RECODE of educ (highest year of school completed)
respondent |
         ) | Some H.S.  H.S. grad  Some coll  Coll.grad |     Total
-----------+--------------------------------------------+----------
     25-34 |        22         32         57         69 |       180
           |     12.22      17.78      31.67      38.33 |    100.00
-----------+--------------------------------------------+----------
     35-44 |        16         61         71         45 |       193
           |      8.29      31.61      36.79      23.32 |    100.00
-----------+--------------------------------------------+----------
     45-54 |         7         39         48         59 |       153
           |      4.58      25.49      31.37      38.56 |    100.00
-----------+--------------------------------------------+----------
     55-64 |        19         32         38         33 |       122
           |     15.57      26.23      31.15      27.05 |    100.00
-----------+--------------------------------------------+----------
       65+ |        35         50         24         33 |       142
           |     24.65      35.21      16.90      23.24 |    100.00
-----------+--------------------------------------------+----------
     Total |        99        214        238        239 |       790
           |     12.53      27.09      30.13      30.25 |    100.00

. *Convert the premarital sex variable to a set of dichotomous variables,
> necessary to get directly standardized percentages.;
. tab premarsx if good==1,gen(pm);

      sex before |
        marriage |      Freq.     Percent        Cum.
-----------------+-----------------------------------
    always wrong |        216       27.34       27.34
almst always wrg |         73        9.24       36.58
 sometimes wrong |        129       16.33       52.91
not wrong at all |        372       47.09      100.00
-----------------+-----------------------------------
           Total |        790      100.00

. *Create the "popvar" variable, here called "tot," required by Stata's
> "dstdize" command. Note that the way to make dstdize work on unit
> rather than tabular data is to set the popvar variable = 1 for each
> case.;
. gen tot=1 if good==1;
(2022 missing values generated)

. *Do the direct standardization.;
. for num 1/4:dstdize pmX tot agecat if good==1,by(edcat);

-> dstdize pm1 tot agecat if good==1,by(edcat)
----------------------------------------------------------
-> edcat= 1

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      22      4     0.222   0.1818   0.228  0.0414
   35-44      16      2     0.162   0.1250   0.244  0.0305
   45-54       7      0     0.071   0.0000   0.194  0.0000
   55-64      19      9     0.192   0.4737   0.154  0.0732
     65+      35     22     0.354   0.6286   0.180  0.1130
----------------------------------------------------------
4.9
Totals:       99     37      Adjusted Cases:   25.6
                                 Crude Rate:  0.3737
                              Adjusted Rate:  0.2581
                         95% Conf. Interval: [0.1878, 0.3284]

----------------------------------------------------------
-> edcat= 2

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      32      9     0.150   0.2813   0.228  0.0641
   35-44      61     21     0.285   0.3443   0.244  0.0841
   45-54      39     13     0.182   0.3333   0.194  0.0646
   55-64      32      6     0.150   0.1875   0.154  0.0290
     65+      50     20     0.234   0.4000   0.180  0.0719
----------------------------------------------------------
Totals:      214     69      Adjusted Cases:   67.1
                                 Crude Rate:  0.3224
                              Adjusted Rate:  0.3136
                         95% Conf. Interval: [0.2507, 0.3765]

----------------------------------------------------------
-> edcat= 3

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      57     16     0.239   0.2807   0.228  0.0640
   35-44      71     18     0.298   0.2535   0.244  0.0619
   45-54      48     11     0.202   0.2292   0.194  0.0444
   55-64      38     12     0.160   0.3158   0.154  0.0488
     65+      24      6     0.101   0.2500   0.180  0.0449
----------------------------------------------------------
Totals:      238     63      Adjusted Cases:   62.8
                                 Crude Rate:  0.2647
                              Adjusted Rate:  0.2640
                         95% Conf. Interval: [0.2062, 0.3218]

----------------------------------------------------------
-> edcat= 4

                  -----Unadjusted-----        Std.
                   Pop.      Stratum          Pop.
 Stratum    Pop.  Cases      Dist.  Rate[s]  Dst[P]   s*P
----------------------------------------------------------
   25-34      69     12     0.289   0.1739   0.228  0.0396
   35-44      45      8     0.188   0.1778   0.244  0.0434
   45-54      59     14     0.247   0.2373   0.194  0.0460
   55-64      33      6     0.138   0.1818   0.154  0.0281
     65+      33      7     0.138   0.2121   0.180  0.0381
----------------------------------------------------------
Totals:      239     47      Adjusted Cases:   46.7
                                 Crude Rate:  0.1967
                              Adjusted Rate:  0.1952
                         95% Conf. Interval: [0.1438, 0.2466]

Summary of Study Populations:
       edcat        N     Crude    Adj_Rate       Confidence Interval
--------------------------------------------------------------------------
           1       99  0.373737    0.258100   [ 0.187773,  0.328426]
           2      214  0.322430    0.313598   [ 0.250660,  0.376536]
           3      238  0.264706    0.263981   [ 0.206202,  0.321759]
           4      239  0.196653    0.195220   [ 0.143805,  0.246635]

-> dstdize pm2 tot agecat if good==1,by(edcat)
[Log omitted for pm2-pm4 to save space.]

. *****************************
> Part B.1: Correlation ratios.
> *****************************;

. preserve;
. *Read in the frequencies from Table 1 in the assignment and confirm correct.;
. clear;
. infile score p c j n tot using ex6b.raw;
(4 observations read)

. list;

     +-----------------------------------+
     | score     p     c    j    n   tot |
     |-----------------------------------|
  1. |     3   265   151   32   65   513 |
  2. |     2   212    93    3   18   326 |
  3. |     1   155    50    2    5   212 |
  4. |     0   322    82    7   13   424 |
     +-----------------------------------+

. *Get the within-group sum of squared deviations.;
. for var p c j n: sum score [aw=X] \ gen sdX=X*((score-r(mean))^2)
> \ gen ssX=sum(sdX);

-> sum score [aw=p]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         954    1.440252    1.403347          0          3

-> gen sdp=p*((score-r(mean))^2)
-> gen ssp=sum(sdp)
-> sum score [aw=c]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         376    1.832447    1.355896          0          3

-> gen sdc=c*((score-r(mean))^2)
-> gen ssc=sum(sdc)
-> sum score [aw=j]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4          44    2.363636    1.304791          0          3

-> gen sdj=j*((score-r(mean))^2)
-> gen ssj=sum(sdj)
-> sum score [aw=n]
    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4         101    2.336634    1.208083          0          3

-> gen sdn=n*((score-r(mean))^2)
-> gen ssn=sum(sdn)

. egen wgss=rsum(ssp ssc ssj ssn) in 4;
(3 missing values generated)

. *Get the total sum of squared deviations.;
. sum score [aw=tot];

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       score |       4        1475    1.629153    1.416017          0          3

. gen gmean=r(mean);
. for var p c j n: gen tdX=X*((score-gmean)^2) \ gen tsX=sum(tdX);

-> gen tdp=p*((score-gmean)^2)
-> gen tsp=sum(tdp)
-> gen tdc=c*((score-gmean)^2)
-> gen tsc=sum(tdc)
-> gen tdj=j*((score-gmean)^2)
-> gen tsj=sum(tdj)
-> gen tdn=n*((score-gmean)^2)
-> gen tsn=sum(tdn)

. egen tss=rsum(tsp tsc tsj tsn) in 4;
(3 missing values generated)

. *Get eta-squared.;
. gen etasq=1 - wgss/tss;
(3 missing values generated)

. list etasq;

     +----------+
     |    etasq |
     |----------|
  1. |        . |
  2. |        . |
  3. |        . |
  4. | .0558446 |
     +----------+

. restore;

. **************************************
> *Part B.2: Correlation and Regression.
> **************************************;

. *Recode income;
. recode rincom98 1=500 2=2000 3=3500 4=4500 5=5500 6=6500 7=7500 8=9000
> 9=11250 10=13750 11=16250 12=18750 13=21250 14=23750 15=27500
> 16=32500 17=37500 18=45000 19=55000 20=67500 21=82500
> 22=100000 23=150000 *=.,gen(inc);
(1849 differences between rincom98 and inc)

. *Mark complete data for Problem 2.;
. replace educ=. if educ>20;
(0 real changes made)

. mark good2 if sex==1;
. markout good2 inc educ;

. *Get the regression of income on education.;
. reg inc educ if good2==1;

      Source |       SS       df       MS              Number of obs =     855
-------------+------------------------------           F(  1,   853) =  161.61
       Model |  2.0172e+11     1  2.0172e+11           Prob > F      =  0.0000
    Residual |  1.0647e+12   853  1.2482e+09           R-squared     =  0.1593
-------------+------------------------------           Adj R-squared =  0.1583
       Total |  1.2664e+12   854  1.4829e+09           Root MSE      =   35329

------------------------------------------------------------------------------
         inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   5141.039   404.4016    12.71   0.000       4347.3   5934.778
       _cons |  -24846.61   5824.733    -4.27   0.000     -36279.1  -13414.12
------------------------------------------------------------------------------

. *Get the correlation ratio of income on education.;
. anova inc educ if good2==1;

                           Number of obs =     855     R-squared     =  0.1881
                           Root MSE      = 35068.8     Adj R-squared =  0.1707

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |  2.3826e+11    18  1.3237e+10      10.76     0.0000
             |
        educ |  2.3826e+11    18  1.3237e+10      10.76     0.0000
             |
    Residual |  1.0281e+12   836  1.2298e+09
  -----------+----------------------------------------------------
       Total |  1.2664e+12   854  1.4829e+09

. *Now get the mean income for each year of schooling. It is a
> good idea to make a tabulation of mean income by year of schooling,
> to determine how many cases there are in each category.;
. tab educ if good2==1,s(inc) mean freq;

    highest |  Summary of RECODE of
    year of | rincom98 (respondents
     school |        income)
  completed |      Mean       Freq.
------------+------------------------
          1 |     37500           1
          2 |     50000           2
          3 |     36875           2
          5 |     24375           2
          6 |     27750           5
          7 | 17583.333           3
          8 |     16250           6
          9 | 21973.684          19
         10 | 28009.615          26
         11 |  25481.25          40
         12 |   36769.9         201
         13 | 40492.308          65
         14 | 43408.397         131
         15 |     42310          50
         16 | 56506.494         154
         17 | 65546.875          32
         18 | 72534.091          44
         19 |     86650          25
         20 | 84984.043          47
------------+------------------------
      Total | 47590.936         855

. egen avginc = mean(inc) if good2==1, by(educ);
(1957 missing values generated)

. *Graph the relationship between actual income, mean income given
> education, predicted income from the linear regression of income
> on education, and education.;

. *Note: The graphics commands in Stata 8.0 and 9.0 are completely rewritten
> from previous versions. They are much more powerful, but take some
> studying to figure out. The thing to do is first to make a simple
> graph, without labels, and then to successively add refinements.
> This is what I did here. We will discuss graphics many times in class.
> But the commands are too complicated for a simple exposition here to be
> very helpful. One way you can take advantage of my work is to study my
> command and compare it to the graph, to see how I have achieved various
> labeling.;

. label var inc "Observed";
. label var avginc "Mean|Yrs. of School";

. *I strongly prefer the "lean1" scheme programmed by Juul (see the article
> posted on the course web page). The most important reason for this is that
> the y-axis labels are shown horizontally rather than vertically. But there
> are other improvements as well, discussed in the article. Thus, I set "lean1"
> (which I have downloaded from Stata's web page) as my permanent graphics
> scheme.;
. set scheme lean1,permanent;
(set scheme preference recorded)

. graph twoway
> (scatter inc educ,msymbol(Oh) mcolor(black) jitter(5))
> (lfit inc educ,sort clwidth(thick) clpattern(solid) clcolor(red))
> (line avginc educ,sort clwidth(thick) clpattern(solid) clcolor(blue))
> if good2==1,
> legend(label(2 "Linear fit") cols(1) ring(0) position(11))
> ylab(0(50000)150000)
> ymtick(0(10000)150000)
> xlab(0(4)20)
> xmtick(0(1)20)
> ytitle("Annual Income in 1993")
> xtitle("Years of Schooling")
> saving(ex05.gph,replace);
(file ex05.gph saved)
. *A note on getting graphs into your word processor document. The
> simplest way to do this (there may be others) is, when the graph
> is on the screen, to click on "edit" and then "copy graph" and then
> toggle to your word processor and paste the graph into it. (When
> I tried this in MS Word, it didn't work for me but an alternative
> did: simply click cntrl-c (the standard Windows copy command) with
> your cursor on the graph, then toggle to MS Word and click cntrl-v
> (the standard Windows paste command).
>
> You should always save your graphs so that you can access them without
> having to rerun the do file that created them---a useful time saving
> when one has a complex do file that takes a long time to execute. A
> new and happy feature of Stata 8.0 and 9.0 is that you can edit saved graphs.;

. log close;
       log:  d:\teach\soc212ab\2007-2008\computing\ex05.log
  log type:  text
 closed on:  2 Nov 2007, 22:07:39
--------------------------------------------------------------------------------------