Notes and Exercises on the SAS Data Step and Simulation

Notes on the SAS Data Step
and an Introduction to Simulation
W. John Braun
University of Western Ontario
Department of Statistical and Actuarial Sciences

Chapter 1
Introduction

1.1 Introduction to Data Analysis and Simulation

Given a set of data, one wishes to analyze it appropriately in order to make a decision or to
acquire some new insights into the population from which the data was extracted.

A data set is a collection of letters (characters) and/or numbers each representing information
in the form of measurements, counts or labels. The data sets we will consider in
this course will usually be in case-by-variable format which is a rectangular array (or matrix)
of data, where each row represents a set of measurements taken on a single subject or
case. Each column of the data set refers to a specific variable, such as age, gender or annual
income.

There are many different types of analysis that are possible. A few of them should be
familiar from an earlier course, such as simple regression analysis or ANOVA. Other kinds
of analyses will be introduced in this course. In all cases, the analysis of a data set involves
one or more of the following:
• checking for errors, missing values, etc. (data cleaning)
• graphical displays
• estimation
• prediction
• control
• measuring uncertainty
• statistical testing
• interpreting results

In order to be able to analyze a data set satisfactorily, a computer package is usually
necessary. Several are available, such as SPSS, Minitab, S-Plus and R. This course will focus
mainly on the use of SAS, and the goal of this set of notes on the SAS Data Step is to teach
you how to use SAS to simulate different kinds of data. Simulated data are generated by
the computer according to a pre-specified probability model, such as a normal distribution
or a t-distribution, or perhaps, something much more complicated. The way in which the
simulated data are generated is designed to make the data appear to be random, though in
fact, they are not truly random.

There are at least two reasons for learning how to simulate data. First, it gives you a way
of ‘making up’ data for your own future exercises so that you can test out different SAS
analysis procedures, and you will be able to find out what kinds of data are appropriate
for a given procedure. Second, knowing how to simulate a set of data is a step towards
understanding what kind of structure underlies the data or the mathematical model which is
being studied as an approximation to the real population. Thus, we will first be using SAS
to create artificial data of different types. Later on, we will learn how to use SAS procedures
to analyze real data; the artificial data can then be used for practice.

1.2 Introduction to SAS

You are about to be introduced to one of the most commonly used statistical packages: SAS
(Statistical Analysis System). Many companies use SAS, especially in the pharmaceutical
industry. Certain insurance companies and banks are also happy to have employees who can
use SAS to analyze data.

SAS is a software system for data analysis. SAS has been (and is continuing to be)
developed at the SAS Institute in Cary, North Carolina, near Research Triangle Park. We will
be using SAS Version 9.3 in this course. SAS has been in development for over 30 years,
and it now has capabilities to perform hundreds of kinds of data analyses. A number of
extensions, such as IML, have also been developed which give SAS even more flexibility and
power.

In this course, we will only learn the basics. What you learn here will give you the ability
to self-learn the rest of the system as needed.

In these notes, we will begin our introduction to the SAS system by showing how to get it
started in the computing lab in WSC 256 and how to use the graphical user interface. Then,
the Data Step will be considered in some detail. Matters of input/output and flow control
will be discussed. The main application will be to the generation of random numbers and
the creation of artificial data. The very important issue of documentation for SAS programs
will be considered briefly.

1.3 Accessing SAS at Western

We will begin by learning how to run SAS jobs in the Windows environment. In practice,
SAS is often run on Unix platforms, in which case the procedures for running the SAS jobs
differ from what will be described here, but the content of the SAS programs is almost
identical.

To invoke SAS in the lab (Room 256 WSC), begin by logging into the network using your
UWO id and password. Proceed through the following steps as illustrated in Figure 1.1:
1. Click on the Windows icon and choose “All Programs”.
2. Scroll down to the “STATISTICS” folder and click on it.
3. Click on the SAS folder and choose “SAS 9.3”.

You will then see the Program Editor window and the Log Window. You should see
something similar to what is shown in Figure 1.2.

The Program Editor is ready for you to type in a SAS program or to open an existing
program (using the File Menu).

Figure 1.1: Locating the SAS program on the Lab’s Windows system.
Figure 1.2: What should appear on the computer screen after invoking SAS 9.3.

1.4 Main Components of a SAS program

1. DATA step - for reading and manipulating data. Sometimes programming is done in
this step.
2. PROC step - for analyzing data. A SAS procedure is used to conduct the analysis on
data that is contained in a SAS dataset prepared during the DATA step. Thus, the
PROC step usually follows a DATA step.
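
As a rough illustration of how the two components fit together, the following sketch builds a small data set in a DATA step and then summarizes it in a PROC step. The data set name HEIGHTS and its two made-up observations are assumptions chosen only for this illustration; the individual statements (INPUT, DATALINES, PROC MEANS) are explained in Chapter 2.

```sas
/* DATA step: create a small SAS data set called HEIGHTS (hypothetical data) */
DATA HEIGHTS;
INPUT NAME $ HEIGHT;     /* NAME is a character variable, HEIGHT is numeric */
DATALINES;
JOE 149
MARY 151
;
RUN;

/* PROC step: analyze the data set prepared in the DATA step */
PROC MEANS DATA=HEIGHTS MEAN;   /* request only the sample mean */
VAR HEIGHT;
RUN;
QUIT;
```

With these two observations, the mean reported for HEIGHT should be 150.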

Chapter 2
The Data Step

2.1 Some Definitions

1. Data Value - a single measurement. e.g. the height of a person (Joe).
2. Observation - a set of data values for the same individual. e.g. name, height, weight,
age and sex of Joe.
3. Variable - a set of data values for the same measurement. e.g. the heights of 10 different
people.
4. Data set - a collection of observations. We usually think of the observations as being
the rows of the data set, while the variables make up the columns of the data set.

2.1.1 Example

Consider the following data set which consists of 4 observations on 5 different
variables (NAME, HEIGHT, WEIGHT, AGE, SEX).

NAME   HEIGHT   WEIGHT   AGE   SEX
JOE    149      54       13    M
MARY   151      60       28    F
SUE    154      45       21    F
TOM    174      72       26    M

Here, we have 3 numeric variables (HEIGHT, WEIGHT, AGE) and 2 character variables
(NAME, SEX).

2.1.2 Exercise

Consider the following data set:

TEMPERATURE   PRESSURE   MINIMUM WIND SPEED   MAXIMUM WIND SPEED
32            101.5      21                   42
31            101.3      15                   28
30            101.8      7                    35
24            101.2      12                   23
21            100.8      4                    22
22            100.9      18                   27

1. How many variables are there?
2. How many observations on each variable?

The Data Step is the point in the SAS program at which one or more SAS data sets are
created. These data sets may be read in from external files or created from within the SAS
program itself. It should be noted that a single SAS program can consist of more than one
Data Step, though we shall find a single Data Step sufficient for present purposes.

The Data Step consists of a sequence of statements, each ending with a semi-colon. These
statements are primarily concerned with the construction of data sets and the management
of data.

2.2 Data

The first line of the Data Step consists of the Data statement. This statement indicates that
a data step is starting, and it tells SAS the name of the SAS data set which is being created.

Syntax:
DATA setname;

The data set name is a word which is somehow descriptive of the data set with which it
is associated. It must consist of at most 32 letters and/or numbers. The first character must
be a letter.

2.2.1 Examples

The following statement tells SAS that a SAS data set called WEATHER is going to
be created.

DATA WEATHER;

The following statement tells SAS that a SAS data set called GRADES98 is going to
be created.

DATA GRADES98;

Some programming applications do not involve a data set. The following statement
tells SAS to begin a data step without creating a data set.

DATA _NULL_;

This type of data statement frees up memory that would possibly be used unnecessarily.
We will use it when doing simulations.

2.3 Numeric Assignment

The Assignment statement is used for creating new variables and modifying existing variables.

Syntax:
varname = value;

Naming Variables in SAS: A variable name must begin with a letter (or an underscore) and
may be 1 to 32 characters long; SAS releases before Version 7 allowed at most 8 characters.
e.g. NAME HEIGHT WEIGHT AGE SEX. e.g. If we have two samples of heights,
we could label the 2 height variables HEIGHT1 and HEIGHT2. 1HEIGHT and 2HEIGHT are not
valid variable names.

2.3.1 Example

TEMP = -21.7;

The above statement assigns the value -21.7 to the variable TEMP.

2.3.2 Example

We can create a SAS data set called WEATHER consisting of one observation on each
of 4 variables using the following sequence of assignment statements. Figure 2.1
shows what this should look like on your computer screen.

DATA WEATHER;
DATE = 22;
PRESSURE = 100.55;
WIND = 19;
TEMP = -21.7;
RUN;
QUIT;

When the program has run (as shown, for example, by pressing the ‘Runner’
button, in Figure 2.2), the resulting SAS data set is as follows:

WEATHER
DATE   PRESSURE   WIND   TEMP
22     100.55     19     -21.7

Note that the data set is not actually visible in the output. In fact, no output is actually
available; clicking on the ‘Output’ button at the bottom of the screen opens the ‘Output’
window, but nothing appears there, as indicated in the bottom panel of Figure 2.2.

Figure 2.1: Entering commands into the Editor window to assign data values to a number of variables.
Figure 2.2: To execute lines of SAS code, press the ‘Runner’ button as shown in the top panel. After the
lines of SAS code have been successfully executed, a record of what was done appears in the
log window. In this case, no errors were reported.

The problem is that we have simply created a ‘SAS dataset’ which is held internally by
the program. In order to see it, we would need to explicitly ask for it somehow. Later, we
will see how to do this.

A simpler way to read in data involves the Input and Datalines statements. The
following lines of code, upon execution, will produce the same SAS dataset as before.

DATA WEATHER;
INPUT DATE PRESSURE WIND TEMP;
DATALINES;
22 100.55 19 -21.7
;
RUN;
QUIT;

A major advantage of this approach is that it allows us to read in more than one observation
on the variables specified by the Input statement. This is accomplished by inserting
additional lines of data, noting that the data will be read into the resulting SAS dataset case
by case, where each case consists of observations on each of the Input variables.

2.3.3 Example

Create a SAS data set called GRADES98 containing the following data:

ID        EXAM   FINAL
3237332   58     61
4136229   71     68
2838823   43     49
2881266   62     58

The following lines of code will give the required SAS dataset.

DATA GRADES98;
INPUT ID EXAM FINAL;
DATALINES;
3237332 58 61
4136229 71 68
2838823 43 49
2881266 62 58
;
RUN;
QUIT;

2.4 INFILE and INPUT: Importing Data from an External File

Often, a data set has been entered into a text file, for example, from a spreadsheet or data
editor, or perhaps from another SAS program. The INFILE statement is used in the Data
Step to tell SAS where to find the data. Then, the INPUT statement specifies how to assign
the data values to specific variables in the newly created SAS dataset.

Syntax:
INFILE 'filename';
INPUT var1 var2 ... varn;

2.4.1 Example

Suppose the data set of the exercise in the previous section had been previously
entered into a file called weather.dat. We can produce a SAS data set called
WEATHER by executing the following program.

/* Example of reading data */
DATA WEATHER;
INFILE 'WEATHER.DAT';
INPUT TEMP PRESSURE MINWIND MAXWIND;
PROC PRINT NOOBS; /* This statement is NOT necessary, but it
                     allows one to see the contents of the SAS
                     data set in the Output window. */
RUN;              /* This statement IS necessary. The program
                     will not run otherwise. */
QUIT;

The PROC PRINT statement invokes the ‘Print Procedure’ which prints the SAS dataset
to the Output window. In this case, it consists of a single case on the four given variables.
It is pictured in Figure 2.3.

Figure 2.3: Output from the SAS Print Procedure. In this case, the single case of the SAS dataset WEATHER
has been printed to the Output window.
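
When no data set is named, PROC PRINT prints the most recently created SAS data set. A slightly more explicit variant, sketched below, names the data set with the DATA= option so that the intent stays clear in a program containing several Data Steps. The sketch assumes that the data set WEATHER has already been created in the same SAS session.

```sas
/* Print the data set WEATHER by name rather than relying on the default */
PROC PRINT DATA=WEATHER NOOBS;   /* NOOBS suppresses the observation-number column */
RUN;
QUIT;
```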

2.5 Comments and Documentation

It is often important to add documentation to any computer programs which you create.
Comment statements should be used to describe program contents. Proper documentation
allows you or other users to read and understand your program more easily. This is
particularly useful if the program is to be updated later.

In SAS, there are two forms of comment statements:

1. /* comment */
e.g.
/* The variable RADIUS measures the
cross-sectional radius of each tree at a distance of 1 meter from
the ground. */

2. * comment;
e.g.
* The variable RADIUS measures the cross-sectional
radius of each tree at a distance of 1 meter from the
ground.;

A useful form of documentation includes a statement at the beginning of the program
consisting of the title of the program, the name of the programmer, and the date (dates
of later revisions are important as well). Sometimes variables are defined here. A brief
description of the purpose of the program is useful as well. In the body of the program, it is
often useful to explain any special commands used there.

2.5.1 Example

The following lines would make up a SAS file:

/* Descriptive Analysis of a Sample of Four Individuals
   By P. Brooks
   January 15, 2007
   This program computes the mean and standard deviation for the
   height, weight and age of a random sample of people.
   Variables: HEIGHT = height in centimeters.
              WEIGHT = weight in kilograms.
              AGE = age in years. */
DATA SIZES; INFILE 'sizes.dat';
INPUT HEIGHT AGE WEIGHT;
PROC MEANS MEAN STD;
* The extra arguments produce only the sample mean and
  sample standard deviation for each variable;
RUN;
QUIT;

2.6 File and Put

• The FILE statement is used to specify an external output file.
Syntax:
FILE 'filename';
• The PUT statement causes SAS to print to the external file named in an earlier FILE
statement.
Syntax:
PUT varname1 varname2 ...;

2.6.1 Example

The following lines cause SAS to print the values 22, 100.55, 19, -21.7 to a file
called weather.txt.

DATA WEATHER;
FILE 'weather.txt';
INPUT DATE PRESSURE WIND TEMP;
PUT DATE PRESSURE WIND TEMP;
DATALINES;
22 100.55 19 -21.7
;
RUN;
QUIT;

Each occurrence of a Put statement causes the current value of the relevant variables to
be output to the file named in the File statement.

2.6.2 Example

DATA _NULL_;
FILE 'GRADES.08';
IF _N_=1 THEN PUT '2008 GRADES'; /* _N_ counts the observations
                                    as they are input to the dataset */
LENGTH NAME $ 8;    /* This Length statement ensures that the
                       variable NAME can contain values up to
                       8 characters in length. */
INPUT NAME $ GRADE; /* The $ tells SAS that NAME is a character
                       variable. */
PUT NAME GRADE;
DATALINES;
JOE 57.5
MARY 83
JENNIFER 64.5
;
RUN;
QUIT;

This produces a file called GRADES.08 containing the lines

2008 GRADES
JOE 57.5
MARY 83
JENNIFER 64.5

Note that the use of DATA _NULL_ results in no SAS dataset being created.

2.6.3 Exercises

1. Write out the contents of the file epa.dat produced by the following:

DATA _NULL_;
FILE 'epa.dat';
PUT 'SOME MILEAGE MEASUREMENTS';
LENGTH CAR $ 13;
CAR = 'BUICK CENTURY';
DISTANCE = 540;
FUEL = 40;
PUT CAR DISTANCE FUEL;
CAR = 'HONDA CRX';
DISTANCE = 720;
FUEL = 30;
PUT CAR DISTANCE FUEL;
RUN;
QUIT;

2. Check your answer by executing the above lines on a computer.
3. Was a SAS data set created? Check this by adding the line PROC PRINT NOOBS;
(then look in the Output window and the Log file for more information.)
4. Reorganize the program so that it uses the Datalines statement.

2.7 Arithmetic

SAS can be used as a calculator to perform simple arithmetic.

1. Addition:
varname = varname1 + varname2;
2. Subtraction:
varname = varname1 - varname2;
3. Multiplication:
varname = varname1 * varname2;
4. Division:
varname = varname1 / varname2;
5. Power (varname1 raised to the power varname2):
varname = varname1 ** varname2;
6. Modular arithmetic:
varname = MOD(varname1, varname2);
This computes the remainder resulting from division of varname1 by varname2 and
assigns this value to varname.

2.7.1 Example

DATA _NULL_;
/* some examples of arithmetic calculations */
FILE 'arith.out';
X = 15; Y = 6;
SUM = X + Y;
DIFF = X - Y;         /* DIFF = DIFFERENCE */
PRODUCT = X * Y;
QUOTIENT = X/Y;
POWER = X ** Y;
REMAIND = MOD(X,Y);   /* REMAIND = REMAINDER */
PUT X Y SUM DIFF PRODUCT;
PUT QUOTIENT POWER REMAIND;
RUN;
QUIT;

Execution of the above SAS program produces a file called arith.out which contains
the following lines (one line for each PUT statement):

15 6 21 9 90
2.5 11390625 3

2.7.2 Exercises

1. What are the contents of the file convert.tmp produced by the following
program?

DATA _NULL_;
FILE 'convert.tmp';
TEMPC = 20;
TEMPF = TEMPC*1.8 + 32;
PUT TEMPC ' degrees Celsius = ' TEMPF ' degrees Fahrenheit.';
RUN;
QUIT;

2. Suppose X = 45, Y = 32, and Z = 7. Find the value of the variable ANSWER
in each of the following:
(a) ANSWER = X - Y;
(b) ANSWER = Z ** Z;
(c) ANSWER = MOD(X,Y);
(d) ANSWER = MOD(Y,Z);
(e) ANSWER = MOD(X,Y) + MOD(X,Z);

3. Using the fact that 1 mile = 1.6 kilometers, write a complete SAS program
which converts a distance of 26 miles into kilometer units, and which prints
the following into a file called convert.dst:

A distance of 26 miles
is the same as a distance of 41.6
kilometers.

The Floor Function

Syntax:
varname = FLOOR(varname1);

This statement assigns the greatest integer less than or equal to varname1 to the variable varname.
For example, the greatest integer less than or equal to 27.34 is 27, and the greatest integer
less than or equal to -16.4 is -17.

2.7.3 Example

DATA _NULL_;
X = 47.39;
Y = FLOOR(X);
RUN;

The value of Y is 47.
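
The FLOOR and MOD functions are related: for positive values, the integer part of a quotient and the remainder together recover the number that was divided. The following sketch checks this with the illustrative values 17 and 5; these values and the variable names A, B, Q, R and CHECK are chosen only for this example. Since no FILE statement is given, the PUT statement writes to the SAS log rather than to an external file.

```sas
DATA _NULL_;
A = 17; B = 5;
Q = FLOOR(A / B);     /* integer part of 17/5 = 3.4, which is 3 */
R = MOD(A, B);        /* remainder when 17 is divided by 5, which is 2 */
CHECK = Q * B + R;    /* 3*5 + 2 = 17, which recovers A */
PUT A B Q R CHECK;    /* no FILE statement, so this line goes to the log */
RUN;
QUIT;
```

The log should then contain the line 17 5 3 2 17. The sketch is restricted to positive values because FLOOR and MOD treat negative numbers differently.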
2.7.4 Exercises<br />
1. Write out the contents of the file arith.dat produced by<br />
DATA _NULL_;<br />
FILE 'arith.dat';<br />
X = -42.49;<br />
Y = FLOOR(X);<br />
PUT X Y;<br />
RUN;<br />
QUIT;<br />
2. Modify the above program to compute the greatest integer less than or equal to<br />
(a) 0.47.<br />
(b) -0.47.<br />
(c) W, where W = 32X, and X = 0.217.<br />
Chapter 3<br />
If: Controlling Flow of Operations<br />
The IF statement is very important in database management. It is used to control the flow<br />
of operations which are applied to variables depending on the values of relevant variables.<br />
In other words, if a certain variable takes on a certain value, a certain operation might be<br />
performed; otherwise, the operation is not performed or a different operation is performed<br />
in its place.<br />
Syntax:<br />
IF (condition) THEN (SAS statement);<br />
ELSE (SAS statement);<br />
SAS evaluates the condition to determine whether it is true or false. If the condition is true,<br />
SAS proceeds to carry out the SAS statement. The ELSE statement is optional. It provides<br />
an alternative action if the condition is false.<br />
Possible conditions to test are<br />
varname GE constant, varname LE constant<br />
varname < constant, varname > constant<br />
varname = constant, varname NE constant<br />
Testing the first condition above amounts to testing whether the variable with name<br />
varname is greater than or equal to the specified constant (another variable name could be<br />
used here as well). The second condition tests for less than or equal to. The conditions on the next<br />
two lines test for strictly less than, strictly greater than, and equality, and the last<br />
condition (NE) tests whether varname is not equal to the constant.<br />
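As a small illustration of these comparisons, the sketch below tests a hypothetical variable X against a constant and reports each result in the SAS log. The value 5 and the messages are assumptions made for the example.<br />
DATA _NULL_;<br />
X = 5; /* hypothetical value */<br />
IF X GE 5 THEN PUT 'X GE 5 is true';<br />
ELSE PUT 'X GE 5 is false';<br />
IF X < 5 THEN PUT 'X < 5 is true';<br />
ELSE PUT 'X < 5 is false';<br />
IF X NE 5 THEN PUT 'X NE 5 is true';<br />
ELSE PUT 'X NE 5 is false';<br />
RUN;<br />
With X = 5, the first condition is true and the other two are false.<br />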
3.0.5 Example – Coding<br />
The variable SEX can take values 'M' and 'F'. It is sometimes more convenient<br />
to code this variable numerically using 1 for males and 0 for females. The IF<br />
statement can be used to do this as follows:<br />
IF SEX = 'M' THEN SEXCODE = 1;<br />
ELSE SEXCODE = 0;<br />
In other words, if the variable SEX takes the value 'M', then the new variable<br />
SEXCODE takes the value 1. Otherwise, SEXCODE takes the value 0.<br />
3.0.6 Example – Outlier Detection<br />
Suppose X is a variable whose mean is MU and standard deviation is SIGMA. We may<br />
decide that the value of X is to be considered outlying if it is more than 3 standard<br />
deviations from MU. The following SAS lines determine if the value of X is outlying.<br />
The variable OUTLIER is assigned the value 1 if X is an outlier, and it is assigned<br />
the value 0 if X is not an outlier.<br />
OUTLIER = 0;<br />
Z = (X - MU)/SIGMA;<br />
IF Z > 3 THEN OUTLIER = 1;<br />
ELSE IF Z < -3 THEN OUTLIER = 1;<br />
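The two tests above can also be combined into a single comparison. The following is a minimal sketch, assuming illustrative values X = 85, MU = 50 and SIGMA = 10 that do not come from the text. It uses the ABS function and relies on SAS evaluating a comparison to 1 when it is true and 0 when it is false.<br />
DATA _NULL_;<br />
X = 85; MU = 50; SIGMA = 10; /* hypothetical values */<br />
Z = (X - MU)/SIGMA; /* standardized value, here 3.5 */<br />
OUTLIER = (ABS(Z) > 3); /* 1 if more than 3 standard deviations from MU, else 0 */<br />
PUT Z= OUTLIER=;<br />
RUN;<br />
The log should show Z=3.5 OUTLIER=1.<br />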
3.0.7 Exercises<br />
1. Execute the following program and view the contents of the file demog.dat.<br />
DATA DEMOGRAP;<br />
FILE 'demog.dat';<br />
INPUT SEX $;<br />
IF SEX = 'M' THEN SEXCODE = 1;<br />
ELSE SEXCODE = 0;<br />
PUT SEXCODE;<br />
DATALINES;<br />
M<br />
F<br />
M<br />
M<br />
F<br />
;<br />
RUN;<br />
QUIT;<br />
2. The following data has been recorded over a period of 5 hours at a switch:<br />
0,1,1,1,0. The switch is off when the value of the above variable (called<br />
testcode) is 0, and on when the value is 1.<br />
Write a SAS program which assigns the value 'on' to the variable test when<br />
the testcode value is 1 and 'off' when the testcode value is 0.<br />
3. A random variable X has mean 14 and variance 49. Write a SAS program<br />
which determines which of the following values of X are outliers: 15, 23, -8,<br />
31, 17. The results should be output to a file called 'outliers.ex'.<br />
Chapter 4<br />
DOing things repeatedly<br />
The DO statement is often useful for simulation. It is also sometimes useful in other kinds<br />
of data preparation and analysis.<br />
4.1 Simple DO<br />
The simple DO statement (which is usually used in association with an IF statement) tells<br />
SAS to execute a set of SAS statements. This set of statements is usually referred to as a<br />
DO group.<br />
Syntax:<br />
DO;<br />
SAS statements<br />
END;<br />
4.1.1 Example<br />
DATA _NULL_;<br />
FILE 'do.eg';<br />
INPUT X Y;<br />
IF X > Y THEN DO;<br />
Z1 = X+Y;<br />
Z2 = X-Y;<br />
END;<br />
ELSE DO;<br />
Z1 = X-Y;<br />
Z2 = X+Y;<br />
END;<br />
PUT X Y Z1 Z2;<br />
DATALINES;<br />
3 4<br />
5 4<br />
;<br />
RUN;<br />
QUIT;<br />
Executing the above program results in a file called 'do.eg' which contains the<br />
following:<br />
3 4 -1 7<br />
5 4 9 1<br />
4.2 Iterative DO<br />
The iterative DO statement tells SAS to perform a computation several times.<br />
Syntax:<br />
DO varname = constant1 TO constant2 BY constant3;<br />
SAS statements<br />
END;<br />
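To see how the BY increment drives the loop, consider the following sketch. The index name I and the bounds are arbitrary choices made for illustration, and each value taken by the index is written to the SAS log.<br />
DATA _NULL_;<br />
DO I = 1 TO 10 BY 3;<br />
PUT I=; /* executed once for each value of the index */<br />
END;<br />
PUT 'After the loop: ' I=;<br />
RUN;<br />
The loop body runs for I = 1, 4, 7 and 10. When the BY part is omitted, SAS uses an increment of 1. After the loop ends, I holds the first value that exceeded the upper bound, here 13.<br />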
4.2.1 Example<br />
Suppose we wish to add up all the numbers from 1 to 100. The following SAS<br />
program does this for us:<br />
DATA _NULL_;<br />
/* NUMSUM is the variable which will ultimately contain the sum we are interested in. */<br />
NUMSUM = 0;<br />
DO INDEX = 1 TO 100;<br />
/* At each iteration of the DO group, the current value of INDEX is added to the current value of NUMSUM. */<br />
NUMSUM = NUMSUM + INDEX;<br />
END;<br />
FILE 'sum.100';<br />
PUT NUMSUM;<br />
RUN;<br />
QUIT;<br />
The file sum.100 will then contain the value 5050, which is the sum of the first 100<br />
integers.<br />
4.2.2 Example<br />
Suppose we wish to add up all the even numbers between 1 and 101. The following<br />
SAS program does this for us:<br />
DATA _NULL_;<br />
NUMSUM = 0;
DO INDEX = 2 TO 100 BY 2;<br />
NUMSUM = NUMSUM + INDEX;<br />
END;<br />
FILE 'even.sum';<br />
PUT NUMSUM;<br />
RUN;<br />
QUIT;<br />
The file even.sum will then contain the value 2550, which is the sum of the first<br />
50 even numbers.<br />
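This value can be checked by hand, since the even numbers in question are twice the integers from 1 to 50:<br />
$$\sum_{k=1}^{50} 2k = 2 \cdot \frac{50 \cdot 51}{2} = 2550.$$<br />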
4.2.3 Exercises<br />
1. Write a SAS program which calculates the sum of all multiples of 3 between 1<br />
and 121. Ans. 2460<br />
2. Modify the above program so that it calculates the sum of all integers from 51<br />
through 100. Ans. 3775<br />
3. Modify the above program so that it calculates the sum of all squares from 1<br />
to 100.<br />
4. Modify the above program so that it calculates the sum of square roots of even<br />
numbers between 1 and 101.<br />
5. Modify the above program so that it calculates 20! (the product of all integers<br />
between 1 and 20).<br />
4.3 DO While (optional)<br />
In order to use the iterative DO, one needs to know the number of times the computation is<br />
to be performed. Often, this number is not known beforehand. Instead, one might require<br />
that the computation is performed while a particular condition is satisfied.<br />
Syntax:<br />
DO WHILE (condition);<br />
SAS statements<br />
END;<br />
The SAS statements in the DO group are executed as long as the condition is found to<br />
be true. The condition is tested once before the beginning of each loop. The first time that<br />
the condition is found to be false, the DO group statements are no longer executed and SAS<br />
moves on beyond the END; statement.<br />
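Because the condition is tested before each pass, the DO group may not execute at all. The sketch below illustrates this with hypothetical variables COUNT and X; the condition is false from the start, so the statements inside the group are never executed.<br />
DATA _NULL_;<br />
COUNT = 0;<br />
X = 10; /* hypothetical starting value */<br />
DO WHILE (X < 5); /* false on the very first test */<br />
COUNT = COUNT + 1;<br />
X = X + 1;<br />
END;<br />
PUT COUNT= X=;<br />
RUN;<br />
The log should show COUNT=0 X=10.<br />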
4.3.1 Example<br />
Suppose we want to determine the largest value of n so that<br />
$$\sum_{i=1}^{n} i^2 < 10000.$$<br />
One approach to this problem is to successively add terms to the sum, while the<br />
sum is less than 10000, and to stop accumulating as soon as the sum reaches or exceeds this<br />
amount. The following statements accomplish this:<br />
DATA _NULL_;<br />
NUMSUM = 0;<br />
INDEX=0;<br />
DO WHILE (NUMSUM < 10000);<br />
INDEX=INDEX+1;<br />
NUMSUM = NUMSUM + INDEX**2;<br />
END;<br />
INDEX=INDEX-1;<br />
FILE 'sum.out';<br />
PUT INDEX;<br />
RUN;<br />
QUIT;<br />
The loop ends only after a term has pushed the sum to 10000 or beyond, so INDEX is reduced<br />
by 1 after the END statement. The final value of INDEX is the solution n. This single number should be contained<br />
in the file 'sum.out' after executing the above lines of code.<br />
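The result can be checked against the closed form for a sum of squares:<br />
$$\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}.$$<br />
For n = 30 this gives 9455, which is below 10000, while for n = 31 it gives 10416, so the file should contain the value 30.<br />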
4.3.2 Exercises<br />
1. Write a SAS program which finds the largest n satisfying<br />
$$\sum_{i=1}^{n} i^3 < 20000.$$<br />
2. Write a SAS program which finds the largest n satisfying n! < 100000.<br />
3. Write a SAS program which finds the smallest n satisfying n! > 100000.<br />
Chapter 5<br />
Simulation<br />
5.1 Generation of Pseudorandom Numbers<br />
We begin our discussion of simulation with a brief exploration of the mechanics of pseudorandom<br />
number generation. Pseudorandom numbers are useful in simulation studies.<br />
We will briefly describe a common method for simulating independent uniform random<br />
variables on the interval [0,1]. A multiplicative congruential random number generator produces<br />
a sequence of pseudorandom numbers, $u_0, u_1, u_2, \ldots$, which are approximately independent<br />
uniform random variables on the interval [0,1]. We now describe how to construct<br />
such a generator.<br />
Let m be a large integer, and let b be another integer which is smaller than m. b is often<br />
somewhere around the square root of m. To begin, an integer $x_0$ is chosen between 1 and<br />
m. $x_0$ is called the seed. It is best chosen in some non-systematic manner.<br />
Once the seed has been chosen, the generator proceeds as follows:<br />
$$x_1 = b x_0 \pmod{m}$$<br />
$$u_1 = x_1/m.$$<br />
$u_1$ is the first pseudorandom number. Since $x_1$ is a remainder on division by m, it lies between<br />
0 and m − 1, so dividing by m ensures that $u_1$ lies between 0 and 1. If m and b are chosen properly, it<br />
is difficult to predict the value of $u_1$, given the value of $x_0$ only. The second pseudorandom<br />
number is then obtained in the same manner:<br />
$$x_2 = b x_1 \pmod{m}$$<br />
$$u_2 = x_2/m.$$<br />
$u_2$ is another pseudorandom number, which is approximately independent of $u_1$. The method<br />
continues using the following formulas:<br />
$$x_n = b x_{n-1} \pmod{m}$$<br />
$$u_n = x_n/m.$$<br />
This method produces numbers which are in reality non-random, but if done properly,<br />
the numbers appear to be random (i.e. unpredictable).<br />
Different values of b and m give rise to pseudorandom number generators of varying<br />
quality. If they are not chosen with some care, then the generator will produce numbers that<br />
do not appear to be random. A number of statistical tests have been developed for assessing<br />
the quality of a pseudorandom number generator.<br />
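As an illustration of a poor choice, the following sketch runs the generator with the deliberately tiny, hypothetical values b = 10 and m = 11 and a seed of 1. These values are assumptions made only for this example, and the output is written to the SAS log.<br />
DATA _NULL_;<br />
B = 10; M = 11; /* hypothetical, deliberately poor choices */<br />
X = 1; /* seed */<br />
DO I = 1 TO 6;<br />
X = MOD(B*X, M); /* next integer in the sequence */<br />
U = X/M; /* rescaled to lie between 0 and 1 */<br />
PUT X= U=;<br />
END;<br />
RUN;<br />
The values of X simply alternate between 10 and 1, so the sequence of U values repeats after two steps and is entirely predictable.<br />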
5.1.1 Example<br />
The following lines of SAS create a file called RANDOM.DAT which contains 5 pseudorandom<br />
numbers based on the multiplicative congruential generator:<br />
$$x_n = 171 x_{n-1} \pmod{30269}$$<br />
$$u_n = x_n/30269$$<br />
with initial seed $x_0 = 23121$.<br />
/* Rudimentary Pseudorandom Number Generator */<br />
DATA _NULL_;<br />
FILE 'RANDOM.DAT';<br />
B = 171;<br />
M = 30269;<br />
SEED = 23121;<br />
X = SEED;<br />
DO I = 1 TO 5;<br />
X = MOD(B*X, M);<br />
U = X/M;<br />
PUT X U;<br />
END;<br />
RUN;<br />
QUIT;<br />
The results which are stored in the file RANDOM.DAT are as follows. The first column<br />
consists of the integers $x_1, x_2, \ldots, x_5$. The second column consists of numbers ranging<br />
between 0 and 1. These are the uniform pseudorandom numbers, $u_1, u_2, \ldots, u_5$.<br />
18721 0.61849<br />
23046 0.76137<br />
5896 0.19479<br />
9339 0.30853<br />
22981 0.75923<br />
A related operation is used internally by SAS to produce pseudorandom numbers automatically<br />
with the function UNIFORM.<br />
5.1.2 Example<br />
The following lines of SAS create a file called RANDOM.DAT which contains 50 uniform<br />
pseudorandom numbers based on the SAS generator UNIFORM with initial seed<br />
$x_0 = 27218$.<br />
/* Example demonstrating use of SAS RNG with fixed seed. */<br />
DATA _NULL_;<br />
SEED = 27218;
FILE 'RANDOM.DAT';<br />
DO I = 1 TO 50;<br />
U = UNIFORM(SEED);<br />
PUT U;<br />
END;<br />
RUN;<br />
QUIT;<br />
It is often of interest to look at the distribution of a set of pseudorandom numbers.<br />
For the numbers generated in the previous example, we would proceed as follows:<br />
DATA RANDOM;<br />
INFILE 'RANDOM.DAT';<br />
INPUT U;<br />
PROC CHART;<br />
VBAR U;<br />
RUN;<br />
QUIT;<br />
The bars of the histogram should all be roughly the same height, if the numbers<br />
are really uniformly distributed.<br />
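A variation, sketched below, skips the intermediate file by writing the numbers straight into a SAS data set with the OUTPUT statement. The LEVELS=10 option, which asks PROC CHART for 10 bars, is a choice made for this example rather than part of the original program.<br />
DATA RANDOM;<br />
SEED = 27218;<br />
DO I = 1 TO 50;<br />
U = UNIFORM(SEED);<br />
OUTPUT; /* adds one observation to the data set RANDOM */<br />
END;<br />
RUN;<br />
PROC CHART DATA=RANDOM;<br />
VBAR U / LEVELS=10; /* histogram with 10 bars */<br />
RUN;<br />
QUIT;<br />
With 50 numbers and 10 bars, each bar would contain about 5 observations on average, though considerable variation is to be expected in a sample this small.<br />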
5.1.3 Exercises<br />
1. Generate 200 random numbers using the generator from the first example with<br />
an initial seed of 2018.<br />
2. Write a program (or modify the second program in the second example) which<br />
produces a histogram of the numbers produced in the previous exercise.<br />
3. Generate 200 random numbers using the SAS UNIFORM generator from example<br />
2 with an initial seed of 2018. Produce a histogram of this simulated data.<br />
4. Modify the generator of the first example so that it produces 200 random<br />
numbers from the generator<br />
$$x_n = 172 x_{n-1} \pmod{30307}$$<br />
with initial seed $x_0 = 17218$.<br />
5. Generate 1000 pseudorandom numbers using the SAS function UNIFORM, and<br />
store them in a file called UNIF.DAT.<br />
6. Modify the above program to simulate the random variable Y = 1/(U +<br />
1) where U is a uniform random variable on the interval [0,1]. Specifically,<br />
generate 1000 values of this random variable and put them in a file called<br />
RANDOM.DAT.<br />
Also, plot the histogram of the random numbers $y_1, \ldots, y_{1000}$. Since Y is no<br />
longer a uniform random variable, the histogram will not be flat any longer;<br />
what is the shape of the distribution?<br />
7. Write a program which generates 100 independent observations on a uniformly<br />
distributed random variable on the interval [0, 100]. Estimate the mean, variance<br />
and standard deviation of such a uniform random variable.<br />
8. Use the FLOOR function together with UNIFORM to simulate 100 random integers<br />
between 0 and 99.<br />
5.2 Simulation of Bernoulli Trials<br />
A Bernoulli trial is an experiment in which there are 2 possible outcomes. For example, a<br />
light bulb may work or it may not work; these are the only possibilities. For another example,<br />
consider a student who guesses on a multiple choice test question which has 5 options; the<br />
student may guess correctly with probability 0.2 and incorrectly with probability 0.8.<br />
Suppose we would like to know how well such a student would do on a multiple choice<br />
test consisting of 100 questions. We can get an idea by using simulation:<br />
Each question corresponds to an independent Bernoulli trial with probability of success<br />
equal to 0.2. We can simulate the correctness of the student for each question by generating<br />
an independent uniform random number. If this number is less than .2, we say that the<br />
student guessed correctly; otherwise, we say that the student guessed incorrectly.<br />
This will work because the probability that a uniform random variable is less than .2 is<br />
exactly .2, while the probability that a uniform random variable exceeds .2 is exactly .8,<br />
which is the same as the probability that the student guesses incorrectly. Thus, the uniform<br />
random number generator is simulating the student. The SAS version of this is as follows:<br />
DATA _NULL_;<br />
SEED = 12883;<br />
FILE 'STUDENT.ANS';<br />
PUT 'CORRECT U';<br />
DO QUESTION = 1 TO 100;<br />
U = UNIFORM(SEED);<br />
IF U < .2 THEN CORRECT = 1;<br />
ELSE CORRECT = 0;<br />
PUT CORRECT U;<br />
END;<br />
RUN;<br />
QUIT;<br />
The first column of the file STUDENT.ANS contains the results of the student’s guesses. A 1<br />
is recorded each time the student correctly guesses the answer, while a 0 is recorded each<br />
time the student is wrong. The second column records the value of the variable U; note<br />
that whenever its value is less than .2, the value of CORRECT is 1, and when U takes a value<br />
exceeding .2, the value of CORRECT is 0.<br />
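To summarize how well the simulated student did, the guesses can be tallied. The sketch below is one way this might be done; the variable name SCORE is an assumption, and the result is written to the SAS log rather than to a file.<br />
DATA _NULL_;<br />
SEED = 12883;<br />
SCORE = 0; /* running count of correct guesses */<br />
DO QUESTION = 1 TO 100;<br />
U = UNIFORM(SEED);<br />
IF U < .2 THEN SCORE = SCORE + 1;<br />
END;<br />
PUT SCORE=;<br />
RUN;<br />
Since each question is answered correctly with probability 0.2, scores in the neighbourhood of 20 are to be expected.<br />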
5.2.1 Exercises<br />
1. Write a SAS program which simulates a student guessing at a True-False test<br />
consisting of 40 questions.<br />
2. Write a SAS program which simulates 500 light bulbs, each of which has<br />
probability .99 of working.<br />
3. Write a SAS program which simulates a binomial random variable Y with<br />
parameters n = 25 and p = .4. (Y is the sum of 25 independent Bernoulli<br />
random variables with p = .4.)<br />
• Now, modify the program so that it generates 100 of these binomial random<br />
variables and writes them to a file called binom.dat. In order to do this,<br />
you will need to nest one DO group inside another.<br />
• Write another program which reads the data from binom.dat into a SAS<br />
data set and produces a histogram. Estimate the mean and variance using<br />
PROC MEANS. Compare these estimates with their theoretical counterparts.<br />
Recall that the theoretical mean of a binomial random variable is np and<br />
the theoretical variance is np(1 − p).<br />
5.3 The Logistic Model<br />
In many biostatistical applications, interest centers on a dose-response relationship. For<br />
example, what dosage of a carcinogenic substance will produce cancer in a given percentage<br />
of a population? One would expect that higher dosages of carcinogen will yield higher rates<br />
of cancer. A first attempt at modelling this kind of relationship might be<br />
$$p = \alpha_0 + \alpha_1 x$$<br />
where p is the proportion of the population that would acquire cancer at dosage x; $\alpha_0$ and<br />
$\alpha_1$ are constants. This model is linear, and will have roughly the correct behaviour, with p increasing in x, if $\alpha_1$ is<br />
positive. However, it will give values of p outside the interval [0, 1] if x is too large or too<br />
small.<br />
The logistic model is often used as an alternative to handle this kind of problem. It<br />
is based on the logit transformation which maps values in (0, 1) to (−∞, ∞). The logit<br />
transformation is given by $l(p) = \log(p/(1 - p))$. Its inverse is given by the logistic function<br />
$$p(l) = \exp(l)/(1 + \exp(l)).$$<br />
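The inverse can be verified by solving $l = \log(p/(1-p))$ for p:<br />
$$e^{l} = \frac{p}{1-p} \;\Longrightarrow\; e^{l}(1-p) = p \;\Longrightarrow\; e^{l} = p\,(1+e^{l}) \;\Longrightarrow\; p = \frac{e^{l}}{1+e^{l}}.$$<br />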
We can then model the dose-response relationship with<br />
$$l(p) = \beta_0 + \beta_1 x$$<br />
where $\beta_0$ and $\beta_1$ are constants. This model says that when the dosage is x, the proportion<br />
of the population acquiring cancer will be p, where<br />
$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$<br />
Example<br />
Write SAS code to simulate the responses of 20 subjects who have been exposed to<br />
varying amounts of carcinogen under the logistic model assumption with $\beta_0 = -1.5$<br />
and $\beta_1 = 0.7$. Assume that the dosages are given by x = 0.1, 0.2, . . . , 2.0. Output<br />
should be printed to a file called 'doseresponsesim.txt'.<br />
DATA _NULL_;
SEED = 81818; B0 = -1.5; B1 = 0.7;<br />
FILE 'doseresponsesim.txt';<br />
PUT 'Response Dosage';<br />
DO X = 0.1 TO 2.0 BY 0.1;<br />
U = UNIFORM(SEED);<br />
TMP = EXP(B0 + B1*X);<br />
P = TMP/(1+TMP);<br />
IF U < P THEN CANCER = 1;<br />
ELSE CANCER = 0;<br />
PUT CANCER X;<br />
END;<br />
RUN;<br />
QUIT;<br />
Upon running the code, it should be apparent that as x increases, the incidence of<br />
cancer tends to increase (i.e. the incidence of 1's in the first column of simulated data<br />
increases).<br />
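One caution about the loop DO X = 0.1 TO 2.0 BY 0.1: since 0.1 has no exact binary floating-point representation, the accumulated value of X may drift slightly above 2.0, in which case the last dosage would be skipped. A hedged alternative, sketched below, loops over an integer index and computes the dosage from it; the index name I is an assumption.<br />
DATA _NULL_;<br />
SEED = 81818; B0 = -1.5; B1 = 0.7;<br />
FILE 'doseresponsesim.txt';<br />
PUT 'Response Dosage';<br />
DO I = 1 TO 20; /* integer index, so exactly 20 iterations */<br />
X = I/10; /* dosages 0.1, 0.2, ..., 2.0 */<br />
U = UNIFORM(SEED);<br />
TMP = EXP(B0 + B1*X);<br />
P = TMP/(1+TMP);<br />
IF U < P THEN CANCER = 1;<br />
ELSE CANCER = 0;<br />
PUT CANCER X;<br />
END;<br />
RUN;<br />
QUIT;<br />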
Exercises<br />
1. Run the code for the logistic model given in the above example. Then change the slope<br />
parameter $\beta_1$ to −0.7. How does this affect the pattern in the response?<br />
2. Modify the code given in the example so that dosages are given by 1.5, 1.7, 1.9, . . . , 3.5.<br />
3. Modify the example code so that the output enters a SAS dataset called 'DOSERESP'.<br />
Next, use the PLOT procedure to plot CANCER against X. Experiment with various<br />
values of $\beta_0$ and $\beta_1$ in order to see how these values affect the pattern of response.<br />
5.4 Binomial Random Numbers<br />
The RANBIN function can be used to automatically generate binomial random numbers.<br />
Syntax:<br />
Y = RANBIN(seed,n,p);<br />
The seed is any positive integer, while n and p are the binomial parameters. The function<br />
assigns a random binomial realization to the variable Y.<br />
5.4.1 Example<br />
Suppose 12% of a large population has recently been infected by a virus whose<br />
incubation period is 2 weeks long, but whose presence can be detected by a blood<br />
test. Suppose random testing for the virus is conducted, and 15 individuals are<br />
tested each hour. Simulate the number of positive test results for each hour over<br />
a 24-hour period. Record the simulated numbers of positive test results in a file<br />
called viruscounts.txt.<br />
Since 15 individuals are tested each hour and each individual has a 0.12 probability<br />
of being infected, independent of the state of the other individuals, the number<br />
of positive test results in one hour is a binomial random variable with n = 15<br />
and p = 0.12. To simulate the numbers of positive test results for each hour in a<br />
24-hour period, we need to generate 24 binomial random numbers:<br />
/* Simulation of infected individuals */<br />
DATA _NULL_;<br />
SEED = 3728;<br />
N = 15;<br />
P = .12;<br />
FILE 'viruscounts.txt';<br />
PUT 'HOUR NUMBER OF INFECTED';<br />
DO HOUR = 1 TO 24;<br />
INFECTED = RANBIN(SEED,N,P);<br />
PUT HOUR INFECTED;<br />
END;<br />
RUN;<br />
QUIT;<br />
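The binomial argument used in this example can also be simulated directly, by treating each of<br />
the 15 tests in an hour as a separate trial. The following is a minimal sketch under that<br />
assumption; the variable names TRIAL and POSITIVE are illustrative and do not come from the<br />
notes, and the counts will not match those produced by RANBIN for the same seed.<br />

```sas
/* Sketch: each hour's count is built from 15 individual yes/no trials. */
DATA _NULL_;
  SEED = 3728;
  N = 15;
  P = .12;
  DO HOUR = 1 TO 24;
    POSITIVE = 0;
    DO TRIAL = 1 TO N;
      U = UNIFORM(SEED);                      /* uniform variate on (0,1) */
      IF U < P THEN POSITIVE = POSITIVE + 1;  /* this individual tests positive */
    END;
    PUT HOUR POSITIVE;                        /* no FILE statement: written to the SAS log */
  END;
RUN;
```

Each value of POSITIVE is a sum of 15 independent 0/1 outcomes with success probability 0.12,<br />
which is the definition of a binomial variate with n = 15 and p = 0.12.<br />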
5.4.2 Exercises<br />
1. Generate 1000 binomial variates with n = 18 and p = .75 using RANBIN. Then use<br />
PROC MEANS to estimate the average and variance. Compare with the theoretical mean<br />
and variance. Repeat for binomial variates with n = 50 and p = .4.<br />
2. Generate 50 binomial variates $B_1, B_2, \ldots, B_{50}$, having n = 20 and where p satisfies<br />
$$l(p) = -2.0 + 0.5x$$<br />
where x = 0.1, 0.2, 0.3, ..., 5.0. Use the PLOT procedure to plot B against x and note<br />
the pattern of plotted points.<br />
3. Refer to the previous question. Calculate the expected value of $B_i$, for i = 1, 2, ..., 50.<br />
Plot these expected values against x.<br />
5.5 Poisson Random Numbers<br />
We can generate Poisson random numbers using SAS with the RANPOI function. It is similar<br />
to the RANBIN function, but there is only one parameter instead of two.<br />
Syntax:<br />
Y = RANPOI(seed, lambda);<br />
In this case, lambda is the mean of the Poisson random variable.<br />
5.5.1 Example<br />
Suppose traffic accidents occur at an intersection with a mean of 3.7 per year.<br />
Simulate the annual number of accidents for a 10-year period, assuming that the<br />
numbers occurring from year to year are independent.<br />
/* Example of Poisson variate generation -- Simulation of Traffic<br />
Accidents */<br />
DATA _NULL_;<br />
SEED = 497765;<br />
LAMBDA = 3.7;<br />
FILE 'ACCIDENT.DAT';<br />
PUT 'YEAR NUMBER OF ACCIDENTS';<br />
DO YEAR = 1 TO 10;<br />
ACCIDENT = RANPOI(SEED, LAMBDA);<br />
PUT YEAR ACCIDENT;<br />
END;<br />
RUN;<br />
QUIT;<br />
5.5.2 Exercises<br />
1. Modify the above program to simulate the number of accidents per year for<br />
15 years, when the average rate is 2.8 accidents per year.<br />
2. Simulate the number of surface defects in the finish of a sports car for 20 cars,<br />
where the mean is 1.2 defects per car.<br />
3. Estimate the mean and variance of a Poisson random variable whose mean<br />
rate is 7.2 by simulating 1000 such variates and using PROC MEANS. Compare<br />
with the theoretical values, recalling that the variance and mean are equal for<br />
Poisson random variables.<br />
4. A commonly used model is the Poisson regression model<br />
$$\log(\lambda) = \beta_0 + \beta_1 x$$<br />
where $\beta_0$ and $\beta_1$ are constants. Take $\beta_0 = -3$ and $\beta_1 = 0.5$, and suppose<br />
x = 0.1, 0.2, 0.3, ..., 4.0. Calculate the corresponding values of λ. (Store these<br />
values in a SAS variable called lambda.)<br />
5. Refer to the previous question. Simulate Poisson random variates which have<br />
the λ values. Plot the Poisson variates against the corresponding values of x.<br />
5.6 Exponential Random Numbers<br />
The exponential distribution can be used as a simple model for the time until a component<br />
fails, or until a light bulb burns out.<br />
A random variable T has an exponential distribution with mean λ if<br />
$$P(T \le t) = 1 - e^{-t/\lambda}$$<br />
for any non-negative t. The mean or expected value of T is λ and the variance of T is<br />
$\lambda^2$.<br />
The simplest way to simulate exponential random variables is to generate a uniform<br />
random variable U on [0,1], and set<br />
$$1 - e^{-T/\lambda} = U.$$<br />
Solving this for T, we have<br />
$$T = -\lambda \log(1 - U).$$<br />
It can be shown that T defined in this way has an exponential distribution with mean λ. The<br />
SAS function RANEXP can be used to generate random exponential variates with mean 1.<br />
Syntax:<br />
T = RANEXP(seed);<br />
This produces an exponential variate T having mean 1. To change the mean to lambda, we<br />
must use<br />
T = lambda * RANEXP(seed);<br />
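The uniform-based construction described above can also be coded directly, without RANEXP.<br />
The following is a minimal sketch; the seed, LAMBDA = 2.5 and N = 10 are illustrative values<br />
only, and the variates will not match those produced by RANEXP for the same seed.<br />

```sas
/* Sketch: exponential variates with mean LAMBDA from uniform variates. */
DATA _NULL_;
  SEED = 12238;
  LAMBDA = 2.5;
  N = 10;
  DO I = 1 TO N;
    U = UNIFORM(SEED);          /* uniform variate on (0,1) */
    T = -LAMBDA*LOG(1 - U);     /* solves 1 - EXP(-T/LAMBDA) = U for T */
    PUT I T;                    /* no FILE statement: written to the SAS log */
  END;
RUN;
```

Since U lies strictly between 0 and 1, the argument of LOG is always positive and T is always<br />
positive.<br />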
5.6.1 Example<br />
/* SIMULATION OF N EXPONENTIAL LAMBDA RANDOM VARIATES */<br />
DATA _NULL_;<br />
SEED = 12238;<br />
LAMBDA = 2.5;<br />
N = 10;<br />
FILE 'EXPO.RVS';<br />
DO I = 1 TO N;<br />
T = RANEXP(SEED)*LAMBDA;<br />
PUT T;<br />
END;<br />
RUN;<br />
QUIT;<br />
5.6.2 Exercises<br />
1. Suppose that a certain type of battery has a lifetime which is exponentially<br />
distributed with mean 55 hours. Simulate 1000 such lifetimes to estimate the<br />
mean and variance of the lifetime for this type of battery. Compare with the<br />
theoretical mean and variance.<br />
2. The central limit theorem says that the sample mean for a random sample<br />
of size n from a population with mean µ and variance $\sigma^2$ is approximately<br />
normally distributed with mean µ and variance $\sigma^2/n$, where the approximation<br />
improves as n increases.<br />
The following programs provide a demonstration for the case where the underlying<br />
population is exponentially distributed:<br />
/* PROGRAM 1: Computation of averages of samples of size N coming<br />
from exponential lambda populations */<br />
DATA _NULL_;<br />
SEED = 12238;<br />
LAMBDA = 2.5;<br />
NSAMPLES = 1000;<br />
N = 10;<br />
FILE 'EXPO.AVG';<br />
/* We are going to simulate NSAMPLES<br />
independent samples of size N, computing the average<br />
in each case. */<br />
DO NSAMPLE = 1 TO NSAMPLES;<br />
TSUM = 0;<br />
DO I = 1 TO N;<br />
T = RANEXP(SEED)*LAMBDA;<br />
/* Accumulating the sample<br />
values to form a sum */<br />
TSUM = TSUM + T;<br />
END;<br />
/* TAVG = average of the current<br />
sample. */<br />
TAVG = TSUM/N;<br />
/* Storing sample averages for<br />
use in next program where they will be<br />
plotted as a histogram. */<br />
PUT TAVG;<br />
END;<br />
RUN;<br />
QUIT;<br />
/* PROGRAM 2: Histogram of averages to demonstrate CLT */<br />
DATA EXPO_AVG;<br />
INFILE 'EXPO.AVG';<br />
INPUT TAVG;<br />
PROC CHART;<br />
VBAR TAVG;<br />
/* We've included this procedure to compare<br />
the mean and variance of the averages with what is<br />
expected by the theory */<br />
PROC MEANS MEAN VAR;<br />
VAR TAVG;<br />
RUN;<br />
QUIT;<br />
Run the above programs for N = 3, 6, 10, 20, 30, 40. Note how the histogram<br />
begins to resemble the familiar bell-shaped curve as N increases. How large<br />
would you say N should be in order for the normal approximation to be considered<br />
accurate, when the underlying population is exponential?<br />
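For comparison with the PROC MEANS output, the values expected by the theory can be worked<br />
out by hand. The calculation below is a sketch that assumes the settings of Program 1<br />
(LAMBDA = 2.5, so each simulated T has mean 2.5 and variance 2.5 squared).<br />

```latex
\begin{align*}
E(\bar{T}) &= \lambda = 2.5, \\
\mathrm{Var}(\bar{T}) &= \frac{\lambda^2}{N} = \frac{6.25}{N}, \\
N = 10:&\quad \mathrm{Var}(\bar{T}) = 0.625, \qquad
N = 40:\quad \mathrm{Var}(\bar{T}) = 0.15625.
\end{align*}
```

The simulated mean and variance of TAVG should be close to these values, though not exactly<br />
equal, since they are based on 1000 random samples.<br />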
5.7 Normal Random Numbers<br />
Standard normal random variables can be generated using the RANNOR function in SAS.<br />
Syntax:<br />
Z = RANNOR(seed);<br />
This produces a value of a normal random variable Z which has mean 0 and variance 1.<br />
Recall that if X is normal with mean µ and variance $\sigma^2$, then X can be written as<br />
$$X = \mu + \sigma Z$$<br />
where Z is normal with mean 0 and variance 1. Therefore, to simulate a normal random variable X having<br />
mean mu and standard deviation sigma, use<br />
X = mu + sigma*RANNOR(seed);<br />
5.7.1 Example<br />
Use simulation to estimate P(Z < 1.25) where Z is a standard normal random<br />
variable.<br />
Idea: Simulate a large number (say, 1000) of standard normal random variates and<br />
compute the proportion that lie below 1.25.<br />
DATA _NULL_;<br />
FILE 'NORMAL.PRB';<br />
SEED = 19218;<br />
N = 1000;<br />
VALUE = 1.25;<br />
COUNT = 0;<br />
DO I = 1 TO N;<br />
Z = RANNOR(SEED);<br />
IF Z < VALUE THEN COUNT = COUNT + 1;<br />
END;<br />
PROBEST = COUNT/N;<br />
PUT 'AN EMPIRICAL ESTIMATE OF P(Z < ' VALUE ') IS ' PROBEST;<br />
RUN;<br />
QUIT;<br />
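As a check on the simulation, the exact probability can be obtained from the PROBNORM<br />
function listed in Chapter 6. The following is a minimal sketch; the variable names EXACT and<br />
SE are illustrative, and the output goes to the SAS log because no FILE statement is given.<br />

```sas
/* Sketch: exact value of P(Z < 1.25) and the sampling error to expect. */
DATA _NULL_;
  N = 1000;
  VALUE = 1.25;
  EXACT = PROBNORM(VALUE);          /* exact P(Z < VALUE), approximately 0.894 */
  SE = SQRT(EXACT*(1 - EXACT)/N);   /* standard error of a proportion based on N draws */
  PUT 'THE EXACT VALUE OF P(Z < ' VALUE ') IS ' EXACT;
  PUT 'STANDARD ERROR OF THE SIMULATED ESTIMATE: ' SE;
RUN;
```

The empirical estimate PROBEST would be expected to fall within two or three standard errors<br />
of the exact value in most runs.<br />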
5.7.2 Exercises<br />
1. Simulate 100 normal random variates having mean 51 and standard deviation<br />
5.2. Compute the average and standard deviation of your simulated sample<br />
and compare with the theoretical values.<br />
2. Simulate 1000 standard normal random variates Z, and use your simulated<br />
sample to estimate<br />
(a) P(Z > 2.5).<br />
(b) P(0 < Z < 1.645).<br />
(c) P(1.2 < Z < 1.45).<br />
(d) P(−1.2 < Z < 1.3).<br />
Compare with the theoretical values (i.e. consult a normal table).<br />
3. Using the fact that a $\chi^2$ random variable on 1 degree of freedom has the same<br />
distribution as the square of a standard normal random variable, simulate 100<br />
independent values of such a $\chi^2$ random variable, and estimate its mean and<br />
variance. (Compare with the theoretical values: 1, 2.)<br />
4. A $\chi^2$ random variable on n degrees of freedom has the same distribution as<br />
the sum of the squares of n independent standard normal random variables. Simulate a $\chi^2$<br />
random variable on 8 degrees of freedom, and estimate its mean and variance.<br />
(Compare with the theoretical values: 8, 16.)<br />
5. A commonly used model is the simple regression model<br />
$$y = \beta_0 + \beta_1 x + \varepsilon$$<br />
where $\beta_0$ and $\beta_1$ are constants and ε is a normal random variable with mean 0 and<br />
variance $\sigma^2$. Take $\beta_0 = -3$ and $\beta_1 = 0.5$, and suppose x = 0.1, 0.2, 0.3, ..., 4.0.<br />
(a) Simulate 40 independent normal variates ε, supposing σ = 0.4. (Store<br />
these values in a SAS variable called epsilon.)<br />
(b) Simulate the corresponding values of y. (Store these values in a SAS variable<br />
called y.)<br />
(c) Plot the normal variates against the corresponding values of x. Note the<br />
pattern on the plot.<br />
6. Re-do the previous question using σ = 1.5.<br />
7. Repeat, using $\beta_0 = 5$ and $\beta_1 = -2$.<br />
Chapter 6<br />
REFERENCE: Other Data Step Functions<br />
A SAS DATASET<br />
X1 X2 X3 X4<br />
-1 3 2 2.3<br />
0.1 4 -1 2.1<br />
0.5 -1 -7 2.4<br />
1.9 -1.7 -4 1.9<br />
- used in some of the examples below.<br />
6.1 Arithmetic Functions<br />
• ABS(X) - returns the absolute value of X: |X|.<br />
EXAMPLE: Y=ABS(X1); (Y = 1 0.1 0.5 1.9).<br />
• MAX(X1,X2,...,XN) - returns the largest value among the values of the arguments.<br />
EXAMPLE: Y=MAX(X1,X2,X3,X4); (Y = 3 4 2.4 1.9).<br />
• MIN(X1,X2,...,XN) - returns the smallest value among the values of the arguments.<br />
EXAMPLE: Y=MIN(X1,X2,X3,X4); (Y = -1 -1 -7 -4).<br />
• MOD(N1,N2) - returns the remainder when N1 is divided by N2. The result has the same sign as N1.<br />
EXAMPLE: Y=MOD(X1,X2); (Y = -1 0.1 0.5 0.2).<br />
• SIGN(X) - returns the sign of X, or 0, if X is 0.<br />
EXAMPLE: Y=SIGN(X1); (Y = -1 1 1 1).<br />
• SQRT(X) - returns the square root of X: $\sqrt{X}$. When X is negative, it returns a missing<br />
value (.).<br />
EXAMPLE: Y=SQRT(X1); (Y = . 0.31622 0.70710 1.37840).<br />
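The examples in this section can be reproduced by reading the four-row dataset shown at the<br />
start of the chapter into SAS. The following is a minimal sketch; the dataset name EXAMPLE and<br />
the Y_ variable names are hypothetical and chosen only for illustration.<br />

```sas
/* Sketch: build the example dataset and apply some arithmetic functions. */
DATA EXAMPLE;
  INPUT X1 X2 X3 X4;
  Y_ABS  = ABS(X1);
  Y_MAX  = MAX(X1,X2,X3,X4);
  Y_MIN  = MIN(X1,X2,X3,X4);
  Y_SIGN = SIGN(X1);
  Y_SQRT = SQRT(X1);      /* missing (.) in the first row, where X1 is negative */
  DATALINES;
-1 3 2 2.3
0.1 4 -1 2.1
0.5 -1 -7 2.4
1.9 -1.7 -4 1.9
;
RUN;

PROC PRINT DATA=EXAMPLE;
RUN;
```

Each function is evaluated once per observation, so every Y_ variable has four values, one for<br />
each row of the dataset.<br />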
6.2 Truncation Functions<br />
• CEIL(X) - returns the smallest integer greater than or equal to X.<br />
• FLOOR(X) - returns the largest integer less than or equal to X.<br />
33
CHAPTER 6. REFERENCE: OTHER DATA STEP FUNCTIONS 34<br />
• INT(X) - returns the same value as FLOOR(X), if X is positive, and returns the same<br />
value as CEIL(X), if X is negative.<br />
• ROUND(X,Z) - returns the value of X rounded to the nearest unit of Z.<br />
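The differences among these functions show up most clearly for negative and non-integer<br />
arguments. The following is a minimal sketch; the test values and the rounding unit 0.5 are<br />
arbitrary choices made for illustration.<br />

```sas
/* Sketch: compare the truncation functions on a few test values. */
DATA _NULL_;
  DO X = -2.7, -2, 2, 2.7;
    C = CEIL(X);
    F = FLOOR(X);
    I = INT(X);
    R = ROUND(X, 0.5);      /* round to the nearest multiple of 0.5 */
    PUT X= C= F= I= R=;     /* written to the SAS log */
  END;
RUN;
/* Expected values:
   X=-2.7  C=-2  F=-3  I=-2  R=-2.5
   X=-2    C=-2  F=-2  I=-2  R=-2
   X=2     C=2   F=2   I=2   R=2
   X=2.7   C=3   F=2   I=2   R=2.5  */
```

Note that CEIL and FLOOR leave an integer argument unchanged, and that INT always truncates<br />
toward zero.<br />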
6.3 Special Mathematical Functions<br />
• EXP(X): $e^X$.<br />
• GAMMA(X): the complete gamma function, $\int_0^\infty t^{X-1} e^{-t}\,dt$.<br />
• LOG(X): the natural logarithm of X.<br />
• LOG2(X): the logarithm to the base 2 of X.<br />
• LOG10(X): the logarithm to the base 10 of X.<br />
6.4 Trigonometric and Hyperbolic Functions<br />
• ARCOS(X): inverse cosine of X.<br />
• ARSIN(X): inverse sine of X.<br />
• ATAN(X): inverse tangent of X.<br />
• COS(X): cosine of X.<br />
• COSH(X): hyperbolic cosine of X.<br />
• SIN(X): sine of X.<br />
• SINH(X): hyperbolic sine of X.<br />
• TAN(X): tangent of X.<br />
• TANH(X): hyperbolic tangent of X.<br />
6.5 Statistical functions<br />
• CSS(X1,X2,...,XN): the corrected sum of squares $\sum_{i=1}^{N} X_i^2 - N\bar{X}^2$<br />
• CV(X1,X2,...,XN): the coefficient of variation - the standard deviation of $X_1, \ldots, X_N$<br />
divided by the mean of $X_1, \ldots, X_N$ (SAS expresses the result as a percentage, i.e. multiplied by 100).<br />
• MEAN(X1,...,XN): $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$<br />
EXAMPLE: Y = MEAN(X1,X2,X3,X4); (Y = 1.575 1.3 -1.275 -0.475).<br />
• N(X1,...,XN): number of nonmissing arguments.<br />
EXAMPLE: Y=N(.,4.1,.,3.7,5.7); (Y = 3).<br />
• NMISS(X1,...,XN): number of missing values.<br />
EXAMPLE: Y=NMISS(.,4.1,.,3.7,5.7); (Y = 2).<br />
• RANGE(X1,...,XN): maximum minus the minimum.<br />
EXAMPLE: Y=RANGE(X1,X2,X3,X4); (Y = 4 5 9.4 5.9).<br />
• STD(X1,...,XN): standard deviation.<br />
• STDERR(X1,...,XN): standard error (standard deviation divided by $\sqrt{N}$).<br />
• SUM(X1,...,XN): $\sum_{i=1}^{N} X_i$<br />
• USS(X1,...,XN): uncorrected sum of squares $\sum_{i=1}^{N} X_i^2$<br />
• VAR(X1,...,XN): variance<br />
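The corrected and uncorrected sums of squares are linked by a standard identity, and the<br />
sample variance follows from CSS. The block below is a worked check for the first row of the<br />
example dataset (X1 = −1, X2 = 3, X3 = 2, X4 = 2.3). It assumes that VAR uses the usual<br />
divisor N − 1.<br />

```latex
\begin{align*}
\mathrm{CSS} &= \sum_{i=1}^{N}(X_i-\bar{X})^2 = \sum_{i=1}^{N}X_i^2 - N\bar{X}^2
              = \mathrm{USS} - N\bar{X}^2, \\
\mathrm{USS} &= (-1)^2 + 3^2 + 2^2 + 2.3^2 = 19.29, \qquad \bar{X} = 1.575, \\
\mathrm{CSS} &= 19.29 - 4(1.575)^2 = 19.29 - 9.9225 = 9.3675, \\
\mathrm{VAR} &= \frac{\mathrm{CSS}}{N-1} = \frac{9.3675}{3} = 3.1225.
\end{align*}
```

STD is then the square root of VAR, and STDERR is STD divided by $\sqrt{N}$.<br />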
6.6 Probability functions<br />
The following functions can be used to determine various probabilities. The syntax is similar<br />
to that used for the random number generator functions.<br />
• GAMINV(P,eta): returns the value of x such that<br />
$$P = \frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$$<br />
(0 ≤ P < 1, and η > 0).<br />
• POISSON(lambda,N): returns the probability that an observation from a Poisson distribution<br />
is less than or equal to N. λ is the mean parameter.<br />
i.e. $\mathrm{POISSON}(\lambda, N) = \sum_{j=0}^{N} \frac{e^{-\lambda}\lambda^j}{j!}$<br />
• PROBBNML(p,n,m): returns the probability that an observation from a binomial distribution<br />
with parameters p and n is less than or equal to m.<br />
i.e. $\mathrm{PROBBNML}(p,n,m) = \sum_{j=0}^{m} \binom{n}{j} p^j (1-p)^{n-j}$.<br />
• PROBCHI(x,nu): returns the probability that a random variable with a chi-square distribution<br />
on ν degrees of freedom falls below x.<br />
• PROBF(x,ndf,ddf): returns the probability that a random variable with an F distribution<br />
on ndf numerator degrees of freedom and ddf denominator degrees of freedom falls<br />
below x.<br />
• PROBGAM(x,eta): returns the probability that a random variable with a gamma distribution<br />
with shape parameter η falls below x.<br />
i.e. $\mathrm{PROBGAM}(x,\eta) = \frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$.<br />
• PROBIT(x): returns the inverse of the standard normal cumulative distribution function.<br />
i.e. If X is a standard normal random variable and 0 < x < 1, then x is the probability that X will<br />
take on a value less than PROBIT(x).<br />
• PROBNORM(x): returns the probability that a standard normal random variable will fall<br />
below x.<br />
• PROBT(x,nu): returns the probability that a random variable with Student's t distribution<br />
on ν degrees of freedom will fall below x.<br />
• TINV(p,nu): returns the p-th quantile (i.e. the 100p-th percentile) of Student's t distribution on ν degrees of<br />
freedom.<br />
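The probability functions and their inverses undo one another, which gives a simple way to<br />
check that they are being called correctly. The following is a minimal sketch; the arguments<br />
1.645, 0.95 and 8 are arbitrary illustrative values, and the variable names are hypothetical.<br />

```sas
/* Sketch: round trips between probability functions and their inverses. */
DATA _NULL_;
  P  = PROBNORM(1.645);     /* P(Z < 1.645) for standard normal Z, close to 0.95 */
  X  = PROBIT(P);           /* inverse CDF: recovers 1.645 up to rounding */
  Q  = TINV(0.95, 8);       /* 0.95 quantile of the t distribution on 8 d.f. */
  PT = PROBT(Q, 8);         /* recovers 0.95 up to rounding */
  PUT P= X= Q= PT=;         /* written to the SAS log */
RUN;
```

The same pattern applies to PROBGAM and GAMINV for the gamma distribution.<br />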
6.6.1 Example<br />
Find the probability that a random variable with a t distribution on 8 degrees of freedom is<br />
less than 1.4.<br />
i.e. P(T < 1.4) = ? where T is t-distributed on 8 d.f. The following program writes the<br />
correct probability into the file PROB.T.<br />
DATA _NULL_;<br />
FILE 'PROB.T';<br />
PROB = PROBT(1.4, 8);<br />
PUT PROB;<br />
RUN;<br />
6.6.2 Exercises<br />
1. Compute the probability that a Poisson random variable with mean rate 11.4<br />
takes <strong>on</strong> values less than<br />
(a) 1.<br />
(b) 2.<br />
(c) 5.<br />
(d) 11.<br />
(e) 15.<br />
(f) 21.<br />
2. Repeat the previous question for a binomial random variable with p = .45 and<br />
n = 24.<br />
3. The time that it takes a bus to arrive at the next stop is normally distributed<br />
with mean 10.4 minutes and standard deviation 1.2. Compute the probabilities<br />
that the bus will arrive in less than<br />
(a) 5 minutes.<br />
(b) 8 minutes.<br />
(c) 10.5 minutes.<br />
(d) 12.5 minutes.<br />
(e) 13.1 minutes.<br />
(f) 15.2 minutes.