Predictive Modelling of Undergraduate Student Intake - aair
Predictive Modelling of<br />
Undergraduate Student Intake<br />
Anatoli Lightfoot<br />
Information Analyst, Statistical Services
Outline<br />
• Introduction (brief)<br />
• Theory of regression analysis (not so brief)<br />
• Some possible applications<br />
• What to aim for to obtain reliable predictions<br />
• Limitations of regression models<br />
• One model in detail (acceptance rates)<br />
AAIR Forum 2008<br />
2<br />
Anatoli Lightfoot – ANU
Why is this important?<br />
• Load management is vital to universities!<br />
So:<br />
• Student load has a major effect on university funding<br />
• The consequences of being under- or over-enrolled are potentially very serious<br />
• We want to get it right; and<br />
• We want to know how likely it is to go wrong<br />
Why should you listen to me?<br />
• You shouldn’t! (necessarily)<br />
• I aim to:<br />
• Explain some important basic statistics<br />
• Offer some food for thought<br />
• But:<br />
• This is not a substitute for a statistics degree<br />
• I am not a professional statistician (yet)<br />
Things this presentation does not cover:<br />
• Setting intake targets<br />
• Modelling continuing load<br />
• Financial outcomes/consequences<br />
And a warning:<br />
• The next 10 slides are statistical theory<br />
• Now is your chance to bail out!<br />
Time for some statistics!<br />
Regression<br />
• Relationship between variables (X and Y)<br />
$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />
• Y is the “response” or “dependent” variable<br />
• X is an “explanatory” or “independent” variable<br />
• α and β are constants<br />
• Equation of a straight line<br />
The regression equation<br />
• What do the i and ε signify?<br />
$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />
• The subscript i indexes observations<br />
• Each i-value represents a data point<br />
• Often omitted for clarity<br />
• $\varepsilon_i$ is an error term<br />
• The “residual” for each observation<br />
The regression equation - example<br />
• Height vs 100m sprint time<br />
$Y_i = \alpha + \beta X_i + \varepsilon_i$<br />
• For the i-th observation (person):<br />
• $Y_i$ is 100m sprint time<br />
• $X_i$ is height<br />
• Determining α and β is “fitting” a model<br />
• This is done using statistical software<br />
• α and β are chosen to minimise $\sum_i \varepsilon_i^2$<br />
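For a single explanatory variable, this minimisation has a well-known closed-form solution. It comes from standard least-squares theory rather than from the slides themselves. Writing $\bar{X}$ and $\bar{Y}$ for the sample means and using hats for fitted values:

```latex
\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\alpha} = \bar{Y} - \hat{\beta}\,\bar{X}
```

Statistical software uses more general matrix methods, but for one X variable they reduce to these two formulas.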
The regression equation - example<br />
i height(cm) time100(s)<br />
1 140 17.6<br />
2 142 14.3<br />
3 147 16.4<br />
4 150 15.1<br />
5 153 15.4<br />
6 159 15.2<br />
7 163 12.7<br />
8 164 13.9<br />
9 168 14.1<br />
10 170 13.7<br />
Fitted coefficients: α = 30, β = -0.1<br />
[Figure: scatter plot “Height vs 100m sprint times” with fitted line $Y_i = 30 - 0.1 X_i + \varepsilon_i$; x-axis: Height (cm), y-axis: 100m time (s)]<br />
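As a rough check, the ten data points in the table can be fitted with a few lines of plain Python. This is a minimal sketch using the textbook least-squares formulas, not the statistical software used for the presentation; the function name `fit_line` is made up for illustration.

```python
# Data copied from the table above: height in cm, 100m sprint time in seconds.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]


def fit_line(xs, ys):
    """Return (alpha, beta) that minimise the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta


alpha, beta = fit_line(heights, times)
print(f"alpha = {alpha:.2f}, beta = {beta:.4f}")
```

The fitted values come out at roughly 29.8 and -0.096, which round to the α = 30 and β = -0.1 quoted on the slide.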
The regression equation<br />
• The ε are used in model diagnostics<br />
• They can be used to:<br />
• Check basic assumptions<br />
• Check goodness-of-fit<br />
• Identify outliers<br />
$Y = \alpha + \beta X + \varepsilon$<br />
• They are also used to calculate confidence intervals when using a model to predict<br />
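To make the diagnostic use of residuals concrete, the sketch below refits the height and sprint-time example and scales each residual by the residual standard error. The cut-off of 2 for flagging a possible outlier is a common rule of thumb assumed here, not something stated in the slides, and the scaling is cruder than the studentised residuals most packages report.

```python
import math

# Height (cm) and 100m time (s) from the sprint example.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(times) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, times))
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Residual = observed value minus fitted value.
residuals = [y - (alpha + beta * x) for x, y in zip(heights, times)]

# Residual standard error; n - 2 because two coefficients were estimated.
sse = sum(e ** 2 for e in residuals)
sigma_hat = math.sqrt(sse / (n - 2))

# Crude scaled residuals and an assumed |z| > 2 rule of thumb for outliers.
scaled = [e / sigma_hat for e in residuals]
flagged = [i + 1 for i, z in enumerate(scaled) if abs(z) > 2.0]

for i, (e, z) in enumerate(zip(residuals, scaled), start=1):
    print(f"obs {i:2d}: residual {e:+.2f}, scaled {z:+.2f}")
print("flagged observations:", flagged)
```

Plotting the residuals against height or against the fitted values would be the usual next step for checking the constant-variance assumption.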
Regression – basic assumptions<br />
• The ε are independent<br />
• The ε are identically distributed<br />
• In particular, $\varepsilon \sim N(0, \sigma^2)$ where $\sigma^2$ is a constant<br />
• The sample is representative of the population<br />
• Vital for useful predictions<br />
Transformations<br />
• The variables used need not be “as measured”<br />
• Variables can be transformed:<br />
• Using square, square root, or higher order polynomial<br />
• Using inverse, logarithm, or exponential function<br />
• Using another function<br />
• By multiplying them together (“interaction” terms)<br />
• Transformations are often used on response<br />
variables which are not defined on (-∞,∞)<br />
Transformations – logit function<br />
• Maps (0,1) to (-∞,∞)<br />
• Used to transform a<br />
response variable which<br />
is a binomial proportion<br />
• Model is fitted to<br />
transformed Y-variable<br />
logit(Y) = α + βX + ε<br />
• Inverse function used to<br />
“un-transform” results<br />
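The transform and its inverse are short enough to write out directly. The sketch below uses the definition y = ln(x) - ln(1-x) from the slide; the function names `logit` and `inv_logit` are chosen here for illustration.

```python
import math


def logit(p):
    """Map a proportion in (0, 1) onto the whole real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p) - math.log(1.0 - p)


def inv_logit(z):
    """Map a real number back to a proportion in (0, 1)."""
    # Two branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# A proportion of 0.5 maps to 0; the inverse recovers the original value.
print(logit(0.5), inv_logit(logit(0.2)))
```

Note that proportions of exactly 0 or 1 cannot be transformed, so groups with no acceptances or all acceptances need separate handling before fitting.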
[Figure: the logit function, y = ln(x) - ln(1-x)]<br />
Predictions<br />
• Model is fit on observed (historical) data<br />
• To make predictions:<br />
$Y = \alpha + \beta X + \varepsilon$<br />
• Obtain new data which contains explanatory variables<br />
• Apply model equation to data<br />
• Output is predicted Y-values and confidence intervals<br />
• Make sure new data is from same population!<br />
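A minimal sketch of the prediction step, reusing the height and sprint-time example, is given below. It computes the interval for a single new observation (often called a prediction interval) using the standard simple-regression formula. The critical value 2.306 is the 97.5th percentile of Student's t with 8 degrees of freedom, hard-coded here because the sample has 10 points; the function name `predict` and the new height of 160 cm are illustrative assumptions.

```python
import math

# Historical data: height (cm) and 100m time (s) from the sprint example.
heights = [140, 142, 147, 150, 153, 159, 163, 164, 168, 170]
times = [17.6, 14.3, 16.4, 15.1, 15.4, 15.2, 12.7, 13.9, 14.1, 13.7]

n = len(heights)
x_bar = sum(heights) / n
y_bar = sum(times) / n
s_xx = sum((x - x_bar) ** 2 for x in heights)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, times))
beta = s_xy / s_xx
alpha = y_bar - beta * x_bar

# Residual standard error from the fitted model (n - 2 degrees of freedom).
sse = sum((y - (alpha + beta * x)) ** 2 for x, y in zip(heights, times))
sigma_hat = math.sqrt(sse / (n - 2))

T_CRIT = 2.306  # t quantile for a 95% interval with n - 2 = 8 degrees of freedom


def predict(x_new):
    """Return (prediction, lower, upper) for one new observation at x_new."""
    y_hat = alpha + beta * x_new
    se = sigma_hat * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / s_xx)
    return y_hat, y_hat - T_CRIT * se, y_hat + T_CRIT * se


y_hat, lower, upper = predict(160)
print(f"predicted time {y_hat:.2f}s, 95% interval ({lower:.2f}, {upper:.2f})")
```

The interval widens as the new height moves away from the mean of the historical heights, which is one numerical reason to keep new data within the range the model was fitted on.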
That’s it for the hard stuff<br />
So why use regression to model student intake?<br />
• You may already be using it!<br />
• Large body of knowledge exists<br />
• Ideally suited to large admissions datasets<br />
• Can provide confidence in predictions, not just<br />
an unqualified number!<br />
Applications of regression<br />
• Many and varied<br />
• I will discuss just two:<br />
• Predicting enrolments from TAC preferences<br />
• Predicting enrolments from simulated TAC offers<br />
Applications of regression<br />
• Historical datasets available from UAC are large<br />
• Many possible explanatory variables present<br />
• Bio & demo data (age, gender, location)<br />
• Education data (UAI, prior studies)<br />
• Preference information (which courses, what order)<br />
• What are the observations?<br />
• Hard to tell where to start!<br />
Applications of regression<br />
• Conversion of preferences to offers depends only on type of course (e.g. arts, science, etc.)<br />
• Model equation:<br />
$\mathrm{logit}(Y) = \alpha + \beta_1 X_1 + \varepsilon$<br />
• Results:<br />
• 1st preferences for B Arts will result in the same proportion of enrolments as 5th preferences for B Arts<br />
Applications of regression<br />
• Conversion of preferences to offers depends on both preference number and faculty<br />
• Model equation:<br />
$\mathrm{logit}(Y) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$<br />
• Results:<br />
• 1st and 5th preferences are now treated differently<br />
• What happens if the split between local and non-local applicants changes for arts courses?<br />
Model refinement<br />
• Iterative process<br />
• Add or remove variables and refit model<br />
• Examine model diagnostics<br />
• Compare to previous models<br />
• Rinse and repeat<br />
• Important to revisit basic assumptions<br />
• Can we treat each preference as a separate observation?<br />
• No!<br />
Model refinement<br />
• Treating preferences as observations is a bad idea<br />
• Outcomes of an applicant's preferences are not independent of each other<br />
• Treat each applicant as an observation instead<br />
• Group information from preferences together<br />
• Create additional variables<br />
• Often datasets require modifying in some way<br />
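The regrouping described above can be sketched as follows. The records, field names and derived variables are all hypothetical and are only meant to show the shape of the transformation from one row per preference to one row per applicant.

```python
from collections import defaultdict

# Hypothetical preference-level records:
# (applicant_id, preference_number, course_group)
preferences = [
    ("A001", 1, "arts"),
    ("A001", 2, "science"),
    ("A001", 3, "arts"),
    ("A002", 1, "science"),
    ("A003", 1, "arts"),
    ("A003", 2, "arts"),
]

# Collect every preference belonging to the same applicant.
by_applicant = defaultdict(list)
for applicant_id, pref_no, course_group in preferences:
    by_applicant[applicant_id].append((pref_no, course_group))

# Build one observation per applicant with derived (illustrative) variables.
applicants = []
for applicant_id, prefs in sorted(by_applicant.items()):
    prefs.sort()  # order by preference number
    applicants.append({
        "applicant_id": applicant_id,
        "n_preferences": len(prefs),
        "first_pref_group": prefs[0][1],
        "n_arts_preferences": sum(1 for _, g in prefs if g == "arts"),
    })

for row in applicants:
    print(row)
```

Each applicant now contributes exactly one row, so the independence assumption is no longer violated by several correlated preferences from the same person.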
Reliable models<br />
• Simple models are usually better models<br />
• Modelling is not an exact science<br />
• But the theory behind it is!<br />
• Many different models are possible<br />
• All of them may produce acceptable results<br />
• A model should make intuitive sense<br />
• If it doesn’t, something is probably wrong with it!<br />
Limitations of regression<br />
• There are times when it is not appropriate<br />
• Very small datasets can cause problems<br />
• Some datasets require specialised techniques<br />
• Time series analysis<br />
• Some datasets simply resist analysis<br />
• Other methods are available, e.g. non-parametric statistics<br />
Detailed example – acceptance rates<br />
• Model based on historical UAC data (3 years)<br />
• Basic observations are individual offers<br />
• Observations are grouped<br />
• Response is proportion of acceptances<br />
• Each group is weighted when fitting model<br />
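The grouping step might look like the sketch below. The offer records are invented for illustration, and using the number of offers in each group as its weight is an assumption; the slides do not say how the weights were chosen.

```python
from collections import defaultdict

# Hypothetical offer-level records: (course_group, preference_number, accepted)
offers = [
    ("arts", 1, True), ("arts", 1, True), ("arts", 1, False),
    ("arts", 2, False), ("arts", 2, True),
    ("science", 1, True), ("science", 1, False),
    ("science", 1, False), ("science", 1, True),
]

# Tally acceptances and offers for each (course_group, preference_number) group.
counts = defaultdict(lambda: [0, 0])  # key -> [acceptances, offers]
for course_group, pref_no, accepted in offers:
    cell = counts[(course_group, pref_no)]
    cell[0] += int(accepted)
    cell[1] += 1

# One grouped observation per key: response is the acceptance proportion,
# weight is the group size (an assumed weighting scheme).
grouped = [
    {"group": key, "proportion": acc / total, "weight": total}
    for key, (acc, total) in sorted(counts.items())
]

for row in grouped:
    print(row)
```

Weighting by group size means a proportion based on hundreds of offers influences the fit more than one based on a handful.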
Detailed example – acceptance rates<br />
• Simplified model equation:<br />
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_6 X_1 X_2 + \varepsilon$<br />
$X_1$ is a binary variable identifying ACT school-leavers<br />
$X_2$ represents 3 variables describing preference number<br />
$X_3$ identifies current and prior year school leavers<br />
$X_4$ represents 6 variables for different groups of courses<br />
The last term is an interaction term between preference number and ACT school-leaver<br />
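To show what the interaction term means in practice, the sketch below builds one row of explanatory variables for a single offer. The coding of preference number as three 0/1 indicators (for preferences 1, 2 and 3, with later preferences as the baseline) is a hypothetical choice, since the slides do not give the actual coding; the other variables in the model are left out to keep the example short.

```python
def design_row(is_act_leaver, preference_number):
    """Return [intercept, X1, X2 indicators..., X1*X2 interaction terms...]."""
    # X1: binary indicator for ACT school-leavers.
    x1 = 1 if is_act_leaver else 0
    # X2: three indicators for preference number (hypothetical coding;
    # preferences beyond the third form the baseline category).
    x2 = [1 if preference_number == k else 0 for k in (1, 2, 3)]
    # Interaction: product of X1 with each preference indicator.
    interaction = [x1 * v for v in x2]
    return [1, x1] + x2 + interaction


print(design_row(True, 2))   # ACT school-leaver, 2nd preference
print(design_row(False, 2))  # other applicant, 2nd preference
```

The interaction columns are non-zero only for ACT school-leavers, so the model can give preference number a different effect for that group than for everyone else.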
Detailed example – acceptance rates<br />
• Mostly additive model<br />
• Includes one interaction term<br />
• Preference number with ACT school-leaver<br />
• Many iterations to develop<br />
• More refinements are possible<br />
Thank you<br />
• Questions?<br />