Notes and Exercises on the SAS Data Step and Simulation

Notes on the SAS Data Step
and an Introduction to Simulation

W. John Braun
University of Western Ontario
Department of Statistical and Actuarial Sciences


Chapter 1

Introduction

1.1 Introduction to Data Analysis and Simulation

Given a set of data, one wishes to analyze it appropriately in order to make a decision or to acquire some new insights into the population from which the data were extracted.

A data set is a collection of letters (characters) and/or numbers, each representing information in the form of measurements, counts or labels. The data sets we will consider in this course will usually be in case-by-variable format, which is a rectangular array (or matrix) of data in which each row represents a set of measurements taken on a single subject or case. Each column of the data set refers to a specific variable, such as age, gender or annual income.

There are many different types of analysis that are possible. A few of them should be familiar from an earlier course, such as simple regression analysis or ANOVA. Other kinds of analyses will be introduced in this course. In all cases, the analysis of a data set involves one or more of the following:

• checking for errors, missing values, etc. (data cleaning)
• graphical displays
• estimation
• prediction
• control
• measuring uncertainty
• statistical testing
• interpreting results

In order to be able to analyze a data set satisfactorily, a computer package is usually necessary. Several are available, such as SPSS, Minitab, S-Plus and R. This course will focus mainly on the use of SAS, and the goal of this set of notes on the SAS Data Step is to teach you how to use SAS to simulate different kinds of data. Simulated data are generated by the computer according to a pre-specified probability model, such as a normal distribution or a t-distribution, or perhaps something much more complicated. The way in which the simulated data are generated is designed to make the data appear to be random, though in fact they are not truly random.
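As a preview of what simulated data look like, the following is a minimal sketch that draws five pseudo-random values from a standard normal distribution. It uses statements (DO loops, OUTPUT) and functions (RAND, CALL STREAMINIT) that these notes have not yet introduced; the data set name SIMDEMO, the variable names and the seed 2024 are arbitrary choices made only for this illustration.

```sas
/* Sketch: generate 5 pseudo-random values from a normal distribution.
   SIMDEMO, I, Z and the seed 2024 are illustrative choices. */
DATA SIMDEMO;
   CALL STREAMINIT(2024);        /* fix the seed so reruns give the same values */
   DO I = 1 TO 5;
      Z = RAND('NORMAL', 0, 1);  /* one draw with mean 0, standard deviation 1 */
      OUTPUT;                    /* write the current values as one observation */
   END;
RUN;

PROC PRINT DATA=SIMDEMO NOOBS;   /* display the simulated values */
RUN;
```

Because the seed is fixed, running the program twice should give the same five values, which illustrates why such data are only apparently random.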


There are at least two reasons for learning how to simulate data. First, it gives you a way of ‘making up’ data for your own future exercises so that you can test out different SAS analysis procedures, and you will be able to find out what kinds of data are appropriate for a given procedure. Second, knowing how to simulate a set of data is a step towards understanding what kind of structure underlies the data or the mathematical model which is being studied as an approximation to the real population. Thus, we will first be using SAS to create artificial data of different types. Later on, we will learn how to use SAS procedures to analyze real data; the artificial data can then be used for practice.

1.2 Introduction to SAS

You are about to be introduced to one of the most commonly used statistical packages: SAS (Statistical Analysis System). Many companies use SAS, especially in the pharmaceutical industry. Certain insurance companies and banks are also happy to have employees who can use SAS to analyze data.

SAS is a software system for data analysis. SAS has been (and is continuing to be) developed at the SAS Institute in Cary, North Carolina, near Research Triangle Park. We will be using SAS Version 9.3 in this course. SAS has been in development for over 30 years, and it now has capabilities to perform hundreds of kinds of data analyses. A number of extensions, such as IML, have also been developed which give SAS even more flexibility and power.

In this course, we will only learn the basics. What you learn here will give you the ability to learn the rest of the system on your own as needed.

In these notes, we will begin our introduction to the SAS system by showing how to start it in the computing lab in WSC 256 and how to use the graphical user interface. Then, the Data Step will be considered in some detail. Matters of input/output and flow control will be discussed. The main application will be to the generation of random numbers and the creation of artificial data. The very important issue of documentation for SAS programs will be considered briefly.

1.3 Accessing SAS at Western

We will begin by learning how to run SAS jobs in the Windows environment. In practice, SAS is often run on Unix platforms, in which case the procedures for running the SAS jobs differ from what will be described here, but the content of the SAS programs is almost identical.

To invoke SAS in the lab (Room 256 WSC), begin by logging into the network using your UWO id and password. Proceed through the following steps as illustrated in Figure 1.1:

1. Click on the Windows icon and choose “All Programs”.
2. Scroll down to the “STATISTICS” folder and click on it.
3. Click on the SAS folder and choose “SAS 9.3”.

You will then see the Program Editor window and the Log window. You should see something similar to what is shown in Figure 1.2.

The Program Editor is ready for you to type in a SAS program or to open an existing program (using the File menu).



Figure 1.1: Locating the SAS program on the Lab’s Windows system.

Figure 1.2: What should appear on the computer screen after invoking SAS 9.3.

1.4 Main Components of a SAS program

1. DATA step - for reading and manipulating data. Sometimes programming is done in this step.

2. PROC step - for analyzing data. A SAS procedure is used to conduct the analysis on data that is contained in a SAS dataset prepared during the DATA step. Thus, the PROC step usually follows a DATA step.
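The two components can be seen together in a minimal sketch such as the following. The data set name DEMO and the values assigned are made up for illustration, and PROC PRINT is used here only as a simple stand-in for an analysis procedure.

```sas
/* DATA step: build a small data set (values are made up for illustration) */
DATA DEMO;
   X = 3;
   Y = 4;
RUN;

/* PROC step: operate on the data set prepared by the DATA step above */
PROC PRINT DATA=DEMO;
RUN;
```

Each step ends with a RUN statement, and the PROC step names the data set created by the DATA step that precedes it.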


Chapter 2

The Data Step

2.1 Some Definitions

1. Data Value - a single measurement, e.g. the height of a person (Joe).

2. Observation - a set of data values for the same individual, e.g. name, height, weight, age and sex of Joe.

3. Variable - a set of data values for the same measurement, e.g. the heights of 10 different people.

4. Data set - a collection of observations. We usually think of the observations as being the rows of the data set, while the variables make up the columns of the data set.

2.1.1 Example

Consider the following data set, which consists of 4 observations on 5 different variables (NAME, HEIGHT, WEIGHT, AGE, SEX).

    NAME   HEIGHT   WEIGHT   AGE   SEX
    JOE       149       54    13   M
    MARY      151       60    28   F
    SUE       154       45    21   F
    TOM       174       72    26   M

Here, we have 3 numeric variables (HEIGHT, WEIGHT, AGE) and 2 character variables (NAME, SEX).
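For readers who want to see this data set inside SAS right away, the sketch below enters it using the INPUT and DATALINES statements, which are introduced later in these notes. The data set name PEOPLE is an arbitrary choice for this illustration; the $ after NAME and SEX marks them as character variables.

```sas
/* Sketch: enter the 4 observations on 5 variables from the example above.
   The data set name PEOPLE is an illustrative choice. */
DATA PEOPLE;
   INPUT NAME $ HEIGHT WEIGHT AGE SEX $;  /* $ marks a character variable */
   DATALINES;
JOE 149 54 13 M
MARY 151 60 28 F
SUE 154 45 21 F
TOM 174 72 26 M
;
RUN;

PROC PRINT DATA=PEOPLE NOOBS;
RUN;
```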

2.1.2 Exercise

Consider the following data set:

    TEMPERATURE   PRESSURE   MINIMUM WIND SPEED   MAXIMUM WIND SPEED
             32      101.5                   21                   42
             31      101.3                   15                   28
             30      101.8                    7                   35
             24      101.2                   12                   23
             21      100.8                    4                   22
             22      100.9                   18                   27

1. How many variables are there?

2. How many observations on each variable?

The Data Step is the point in the SAS program at which one or more SAS data sets are created. These data sets may be read in from external files or created from within the SAS program itself. It should be noted that a single SAS program can consist of more than one Data Step, though we shall find a single Data Step sufficient for present purposes.

The Data Step consists of a sequence of statements, each ending with a semicolon. These statements are primarily concerned with the construction of data sets and the management of data.

2.2 Data

The first line of the Data Step consists of the DATA statement. This statement indicates that a data step is starting, and it tells SAS the name of the SAS data set which is being created.

Syntax:

    DATA setname;

The data set name is a word which is somehow descriptive of the data set with which it is associated. It must consist of at most 32 characters, which may be letters, digits or underscores. The first character must be a letter or an underscore.

2.2.1 Examples

The following statement tells SAS that a SAS data set called WEATHER is going to be created.

    DATA WEATHER;

The following statement tells SAS that a SAS data set called GRADES98 is going to be created.

    DATA GRADES98;

Some programming applications do not involve a data set. The following statement tells SAS to begin a data step without creating a data set.

    DATA _NULL_;

This type of DATA statement avoids using memory and disk space for a data set that is not needed. We will use it when doing simulations.
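A minimal sketch of a data step that computes something without creating a data set is given below. It relies on the PUT statement, which is discussed later in these notes; when no FILE statement is present, PUT writes to the Log window. The variable name X is an arbitrary choice.

```sas
/* Sketch: a data step used purely for computation; no data set is created */
DATA _NULL_;
   X = 2 + 3;
   PUT 'The value of X is ' X;   /* with no FILE statement, PUT writes to the log */
RUN;
```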

2.3 Numeric Assignment

The Assignment statement is used for creating new variables and modifying existing variables.

Syntax:

    varname = value;

Naming Variables in SAS: A variable name must begin with a letter (or an underscore) and may be 1 to 32 characters long; versions of SAS before Version 7 allowed at most 8 characters. Examples of valid names are NAME, HEIGHT, WEIGHT, AGE and SEX. If we have two samples of heights, we could label the 2 height variables HEIGHT1 and HEIGHT2. 1HEIGHT and 2HEIGHT are not valid variable names.
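The naming rule can be checked with a short sketch like the one below. The data set name NAMES and the values assigned are hypothetical; the invalid name is left inside a comment because, if it were uncommented, SAS would report a syntax error in the log.

```sas
/* Sketch: valid and invalid variable names (values are made up) */
DATA NAMES;
   HEIGHT1 = 172;   /* valid: begins with a letter */
   HEIGHT2 = 168;   /* valid: begins with a letter */
   /* 1HEIGHT = 172;   invalid: begins with a digit */
RUN;
```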



2.3.1 Example

    TEMP = -21.7;

The above statement assigns the value -21.7 to the variable TEMP.

2.3.2 Example

We can create a SAS data set called WEATHER consisting of one observation on each of 4 variables using the following sequence of assignment statements. Figure 2.1 shows what this should look like on your computer screen.

    DATA WEATHER;
    DATE = 22;
    PRESSURE = 100.55;
    WIND = 19;
    TEMP = -21.7;
    RUN;
    QUIT;

When the program has been run (for example, by pressing the ‘Runner’ button, as shown in Figure 2.2), the resulting SAS data set is as follows:

    WEATHER

    DATE   PRESSURE   WIND    TEMP
      22     100.55     19   -21.7

Note that the data set is not actually visible in the output. In fact, no output is available at all: clicking on the ‘Output’ button at the bottom of the screen opens the ‘Output’ window, but nothing appears there, as indicated in the bottom panel of Figure 2.2.

Figure 2.1: Entering commands into the Editor window to assign data values to a number of variables.

Figure 2.2: To execute lines of SAS code, press the ‘Runner’ button as shown in the top panel. After the lines of SAS code have been successfully executed, a record of what was done appears in the log window. In this case, no errors were reported.

The problem is that we have simply created a ‘SAS dataset’ which is held internally by the program. In order to see it, we would need to explicitly ask for it somehow. Later, we will see how to do this.

A simpler way to read in data involves the INPUT and DATALINES statements. The following lines of code, upon execution, will produce the same SAS dataset as before.

    DATA WEATHER;
    INPUT DATE PRESSURE WIND TEMP;
    DATALINES;
    22 100.55 19 -21.7
    ;
    RUN;
    QUIT;

A major advantage of this approach is that it allows us to read in more than one observation on the variables specified by the INPUT statement. This is accomplished by inserting additional lines of data. The data are read into the resulting SAS dataset case by case, where each case consists of one value for each of the INPUT variables.

2.3.3 Example

Create a SAS data set called GRADES98 containing the following data:

    ID        EXAM   FINAL
    3237332     58      61
    4136229     71      68
    2838823     43      49
    2881266     62      58

The following lines of code will give the required SAS dataset.

    DATA GRADES98;
    INPUT ID EXAM FINAL;
    DATALINES;
    3237332 58 61
    4136229 71 68
    2838823 43 49
    2881266 62 58
    ;
    RUN;
    QUIT;

2.4 INFILE and INPUT: Importing Data from an External File

Often, a data set has been entered into a text file, for example, from a spreadsheet or data editor, or perhaps from another SAS program. The INFILE statement is used in the Data Step to tell SAS where to find the data. Then, the INPUT statement specifies how to assign the data values to specific variables in the newly created SAS dataset.

Syntax:

    INFILE 'filename';
    INPUT var1 var2 ... varn;

2.4.1 Example

Suppose the data set of the exercise in the previous section had already been entered into a file called weather.dat. We can produce a SAS data set called WEATHER by executing the following program.


    /* Example of reading data */
    DATA WEATHER;
    INFILE 'WEATHER.DAT';
    INPUT TEMP PRESSURE MINWIND MAXWIND;
    PROC PRINT NOOBS;   /* This statement is NOT necessary, but it
                           allows one to see the contents of the SAS
                           data set in the Output window. */
    RUN;                /* This statement IS necessary. The program
                           will not run otherwise. */
    QUIT;

The PROC PRINT statement invokes the ‘Print Procedure’, which prints the SAS dataset to the Output window. In this case, it consists of a single case on the four given variables. It is pictured in Figure 2.3.

Figure 2.3: Output from the SAS Print Procedure. In this case, the single case of the SAS dataset WEATHER has been printed to the Output window.
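Text files exported from spreadsheets often separate values with commas rather than blanks. A hedged sketch of how such a file might be read is shown below, using the DLM= option of the INFILE statement to name the delimiter. The file name weather.csv and the data set name WEATHER2 are hypothetical and are not part of the course materials.

```sas
/* Sketch: read a comma-separated file.
   weather.csv and WEATHER2 are hypothetical names for illustration. */
DATA WEATHER2;
   INFILE 'weather.csv' DLM=',';          /* values are separated by commas */
   INPUT TEMP PRESSURE MINWIND MAXWIND;   /* same variables as in the example above */
RUN;

PROC PRINT DATA=WEATHER2 NOOBS;
RUN;
```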

2.5 Comments and Documentation

It is important to add documentation to any computer programs which you create. Comment statements should be used to describe program contents. Proper documentation allows you or other users to read and understand your program more easily. This is particularly useful if the program is to be updated later.

In SAS, there are two forms of comment statements:

1. /* comment */

   e.g.

    /* The variable RADIUS measures the
       cross-sectional radius of each tree at a distance of 1 meter from
       the ground. */

2. * comment;

   e.g.

    * The variable RADIUS measures the cross-sectional
      radius of each tree at a distance of 1 meter from the
      ground.;

A useful form of documentation includes a statement at the beginning of the program consisting of the title of the program, the name of the programmer, and the date (dates of later revisions are important as well). Sometimes variables are defined here. A brief description of the purpose of the program is useful as well. In the body of the program, it is often useful to explain any special commands used there.

2.5.1 Example

The following lines would make up a SAS file:

    /* Descriptive Analysis of a Sample of Four Individuals
       By P. Brooks
       January 15, 2007

       This program computes the mean and standard deviation for the
       height, weight and age of a random sample of people.

       Variables: HEIGHT = height in centimeters.
                  WEIGHT = weight in kilograms.
                  AGE    = age in years. */

    DATA SIZES;
    INFILE 'sizes.dat';
    INPUT HEIGHT AGE WEIGHT;
    PROC MEANS MEAN STD;
    * The extra arguments produce only the sample mean and
      sample standard deviation for each variable;
    RUN;

2.6 File and Put

• The FILE statement is used to specify an external output file.

  Syntax:

    FILE 'filename';

• The PUT statement causes SAS to print to the external file named in an earlier FILE statement.

  Syntax:

    PUT varname1 varname2 ...;



2.6.1 Example

The following lines cause SAS to print the values 22, 100.55, 19, -21.7 to a file called weather.txt.

    DATA WEATHER;
    FILE 'weather.txt';
    INPUT DATE PRESSURE WIND TEMP;
    PUT DATE PRESSURE WIND TEMP;
    DATALINES;
    22 100.55 19 -21.7
    ;
    RUN;
    QUIT;

Each occurrence of a PUT statement causes the current value of the relevant variables to be output to the file named in the FILE statement.

2.6.2 Example

    DATA _NULL_;
    FILE 'GRADES.08';
    IF _N_=1 THEN PUT '2008 GRADES';   /* _N_ counts the observations
                                          as they are input to the dataset */
    LENGTH NAME $ 8;                   /* This Length statement ensures that the
                                          variable NAME can contain values up to
                                          8 characters in length. */
    INPUT NAME $ GRADE;                /* The $ tells SAS that NAME is a character
                                          variable. */
    PUT NAME GRADE;
    DATALINES;
    JOE 57.5
    MARY 83
    JENNIFER 64.5
    ;
    RUN;
    QUIT;

This produces a file called GRADES.08 containing the lines

    2008 GRADES
    JOE 57.5
    MARY 83
    JENNIFER 64.5

Note that the use of DATA _NULL_ results in no SAS dataset being created.



2.6.3 Exercises

1. Write out the contents of the file epa.dat produced by the following:

    DATA _NULL_;
    FILE 'epa.dat';
    PUT 'SOME MILEAGE MEASUREMENTS';
    LENGTH CAR $ 13;
    CAR = 'BUICK CENTURY';
    DISTANCE = 540;
    FUEL = 40;
    PUT CAR DISTANCE FUEL;
    CAR = 'HONDA CRX';
    DISTANCE = 720;
    FUEL = 30;
    PUT CAR DISTANCE FUEL;
    RUN;
    QUIT;

2. Check your answer by executing the above lines on a computer.

3. Was a SAS data set created? Check this by adding the line PROC PRINT NOOBS; (then look in the Output window and the Log file for more information).

4. Reorganize the program so that it uses the DATALINES statement.

2.7 Arithmetic

SAS can be used as a calculator to perform simple arithmetic.

1. Addition:

    varname = varname1 + varname2;

2. Subtraction:

    varname = varname1 - varname2;

3. Multiplication:

    varname = varname1 * varname2;

4. Division:

    varname = varname1 / varname2;

5. Power (varname1 raised to the power varname2):

    varname = varname1 ** varname2;

6. Modular arithmetic:

    varname = MOD(varname1, varname2);

   This computes the remainder resulting from division of varname1 by varname2 and assigns this value to varname.



2.7.1 Example

    DATA _NULL_;
    /* some examples of arithmetic calculations */
    FILE 'arith.out';
    X = 15; Y = 6;
    SUM = X + Y;
    DIFF = X - Y;            /* DIFF = DIFFERENCE */
    PRODUCT = X * Y;
    QUOTIENT = X/Y;
    POWER = X ** Y;
    REMAIND = MOD(X,Y);      /* REMAIND = REMAINDER */
    PUT X Y SUM DIFF PRODUCT;
    PUT QUOTIENT POWER REMAIND;
    RUN;
    QUIT;

Execution of the above SAS program produces a file called arith.out which contains the following lines (one line per PUT statement):

    15 6 21 9 90
    2.5 11390625 3

2.7.2 Exercises

1. What are the contents of the file convert.tmp produced by the following program?

    DATA _NULL_;
    FILE 'convert.tmp';
    TEMPC = 20;
    TEMPF = TEMPC*1.8 + 32;
    PUT TEMPC ' degrees Celsius = ' TEMPF ' degrees Fahrenheit.';
    RUN;
    QUIT;

2. Suppose X = 45, Y = 32, and Z = 7. Find the value of the variable ANSWER in each of the following:

    (a) ANSWER = X - Y;
    (b) ANSWER = Z ** Z;
    (c) ANSWER = MOD(X,Y);
    (d) ANSWER = MOD(Y,Z);
    (e) ANSWER = MOD(X,Y) + MOD(X,Z);

3. Using the fact that 1 mile = 1.6 kilometers, write a complete SAS program which converts a distance of 26 miles into kilometer units, and which prints the following into a file called convert.dst:

    A distance of 26 miles
    is the same as a distance of 41.6
    kilometers.



The Floor Function

Syntax:

    varname = FLOOR(varname1);

This statement assigns the greatest integer less than or equal to varname1 to the variable varname. For example, the greatest integer not exceeding 27.34 is 27, and the greatest integer not exceeding -16.4 is -17.

2.7.3 Example

    DATA _NULL_;
    X = 47.39;
    Y = FLOOR(X);
    RUN;

The value of Y is 47.
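The behaviour of FLOOR for positive, negative and whole-number arguments can be checked with a sketch like the following, which writes the results to the Log window (no FILE statement is used). The variable names A, B and C are arbitrary.

```sas
/* Sketch: FLOOR applied to a positive, a negative and a whole number */
DATA _NULL_;
   A = FLOOR(47.39);   /* 47 */
   B = FLOOR(-16.4);   /* -17: rounding is towards minus infinity */
   C = FLOOR(27);      /* 27: a whole number is returned unchanged */
   PUT A B C;          /* with no FILE statement, PUT writes to the log */
RUN;
```

Note that for negative arguments FLOOR moves away from zero, so it is not the same as simply dropping the decimal part.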

2.7.4 Exercises

1. Write out the contents of the file arith.dat produced by

    DATA _NULL_;
    FILE 'arith.dat';
    X = -42.49;
    Y = FLOOR(X);
    PUT X Y;
    RUN;
    QUIT;

2. Modify the above program to compute the greatest integer less than or equal to

    (a) 0.47.
    (b) -0.47.
    (c) W, where W = 32X, and X = 0.217.


Chapter 3<br />

If: Controlling Flow of Operations<br />
The IF statement is very important in database management. It controls which operations<br />
are applied to variables, depending on the values of relevant variables.<br />
In other words, if a certain variable takes on a certain value, a certain operation might be<br />
performed; otherwise, the operation is not performed or a different operation is performed<br />
in its place.<br />

Syntax:<br />

IF (condition) THEN (SAS statement);<br />
ELSE (SAS statement);<br />
SAS evaluates the condition to determine whether it is true or false. If the condition is true,<br />
SAS carries out the statement that follows THEN. The ELSE statement is optional. It provides<br />
an alternative action if the condition is false.<br />

Possible conditions to test are<br />
varname GE constant, varname LE constant<br />
varname < constant, varname > constant<br />
varname = constant, varname NE constant<br />
Testing the first condition above amounts to testing whether the variable with name<br />
varname is greater than or equal to the specified constant (another variable name could be<br />
used here as well). The second condition listed concerns less than or equal, and the last<br />
condition involves testing for inequality.<br />
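The mnemonic operators GE, LE and NE can also be written with the symbols >=, <= and ^=, just as <, > and = have the mnemonics LT, GT and EQ. The following sketch, which uses a made-up variable AGE, writes a message to the log for each test so that the two spellings can be compared.<br />

```sas
DATA _NULL_;
   AGE = 21;                                     /* illustrative value only */
   IF AGE GE 18 THEN PUT 'AGE GE 18 is true';
   IF AGE >= 18 THEN PUT 'AGE >= 18 is true';    /* same test, symbolic form */
   IF AGE NE 21 THEN PUT 'AGE NE 21 is true';
   ELSE PUT 'AGE NE 21 is false';                /* ELSE pairs with the IF just above */
RUN;
```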

3.0.5 Example – Coding<br />

The variable SEX can take values 'M' and 'F'. It is sometimes more convenient<br />
to code this variable numerically using 1 for males and 0 for females. The IF<br />
statement can be used to do this as follows:<br />
IF SEX = 'M' THEN SEXCODE = 1;<br />
ELSE SEXCODE = 0;<br />
In other words, if the variable SEX takes the value 'M', then the new variable<br />
SEXCODE takes the value 1. Otherwise, SEXCODE takes the value 0.<br />


3.0.6 Example – Outlier Detection<br />
Suppose X is a variable whose mean is MU and standard deviation is SIGMA. We may<br />
decide that the value of X is to be considered outlying if it is more than 3 standard<br />
deviations from MU. The following SAS lines determine if the value of X is outlying.<br />
The variable OUTLIER is assigned the value 1 if X is an outlier, and it is assigned<br />
the value 0 if X is not an outlier.<br />

OUTLIER = 0;<br />

Z = (X - MU)/SIGMA;<br />

IF Z > 3 THEN OUTLIER = 1;<br />

ELSE IF Z < -3 THEN OUTLIER = 1;<br />
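The lines above are a fragment. One way to embed them in a complete step is sketched below, under assumed values MU = 50 and SIGMA = 10 and three made-up observations. The two tests on Z are combined into one with the built-in ABS function, which is a substitution for, not a copy of, the IF ... ELSE IF pair above.<br />

```sas
DATA _NULL_;
   MU = 50;                          /* assumed mean               */
   SIGMA = 10;                       /* assumed standard deviation */
   INPUT X;
   Z = (X - MU)/SIGMA;               /* standardized value of X    */
   IF ABS(Z) > 3 THEN OUTLIER = 1;   /* more than 3 SDs from MU in either direction */
   ELSE OUTLIER = 0;
   PUT X= Z= OUTLIER=;               /* results go to the SAS log  */
   DATALINES;
55
85
12
;
RUN;
```

With these inputs the standardized values are 0.5, 3.5 and -3.8, so the second and third observations are flagged.<br />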

3.0.7 Exercises<br />
1. Execute the following program and view the contents of the file demog.dat.<br />
DATA DEMOGRAP;<br />
FILE 'demog.dat';<br />
INPUT SEX $;<br />
IF SEX = 'M' THEN SEXCODE = 1;<br />
ELSE SEXCODE = 0;<br />
PUT SEXCODE;<br />

DATALINES;<br />

M<br />

F<br />

M<br />

M<br />

F<br />

;<br />

RUN;<br />

QUIT;<br />

2. The following data has been recorded over a period of 5 hours at a switch:<br />
0,1,1,1,0. The switch is off when the value of the above variable (called<br />
testcode) is 0, and on when the value is 1.<br />
Write a SAS program which assigns the value 'on' to the variable test when<br />
the testcode value is 1 and 'off' when testcode is 0.<br />
3. A random variable X has mean 14 and variance 49. Write a SAS program<br />
which determines which of the following values of X are outliers: 15, 23, -8,<br />
31, 17. The results should be output to a file called 'outliers.ex'.


Chapter 4<br />

DOing things repeatedly<br />

The DO statement is often useful for simulation. It is also sometimes useful in other kinds<br />
of data preparation and analysis.<br />
4.1 Simple DO<br />
The simple DO statement (which is usually used in association with an IF statement) tells<br />
SAS to execute a set of SAS statements. This set of statements is usually referred to as a<br />
DO group.<br />

Syntax:<br />

DO;<br />

SAS statements<br />

END;<br />

4.1.1 Example<br />

DATA _NULL_;<br />

FILE 'do.eg';<br />

INPUT X Y;<br />

IF X > Y THEN DO;<br />

Z1 = X+Y;<br />

Z2 = X-Y;<br />

END;<br />

ELSE DO;<br />

Z1 = X-Y;<br />

Z2 = X+Y;<br />

END;<br />

PUT X Y Z1 Z2;<br />

DATALINES;<br />

3 4<br />

5 4<br />

;<br />

RUN;<br />


QUIT;<br />

Executing the above program results in a file called 'do.eg' which contains the<br />
following:<br />
3 4 -1 7<br />
5 4 9 1<br />

4.2 Iterative DO<br />

The iterative DO statement tells SAS to perform a computation several times.<br />
Syntax:<br />
DO varname = constant1 TO constant2 BY constant3;<br />
SAS statements<br />
END;<br />
If BY constant3 is omitted, the increment is 1.<br />

4.2.1 Example<br />

Suppose we wish to add up all the numbers from 1 to 100. The following SAS<br />
program does this for us:<br />
DATA _NULL_;<br />
/* NUMSUM is the variable which will ultimately contain the sum we are interested in. */<br />
NUMSUM = 0;<br />
DO INDEX = 1 TO 100;<br />
/* At each iteration of the DO group, the current value of INDEX is added to the current value of NUMSUM. */<br />
NUMSUM = NUMSUM + INDEX;<br />
END;<br />
FILE 'sum.100';<br />
PUT NUMSUM;<br />
RUN;<br />
QUIT;<br />
The file sum.100 will then contain the value 5050, which is the sum of the first 100<br />
positive integers.<br />
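The value 5050 can be cross-checked against the closed form n(n + 1)/2 for the sum of the first n positive integers. The sketch below repeats the loop and prints both numbers to the log; the variable names N and FORMULA are introduced here only for illustration.<br />

```sas
DATA _NULL_;
   N = 100;
   NUMSUM = 0;
   DO INDEX = 1 TO N;
      NUMSUM = NUMSUM + INDEX;   /* accumulate 1 + 2 + ... + N */
   END;
   FORMULA = N*(N + 1)/2;        /* closed form for the same sum */
   PUT NUMSUM= FORMULA=;         /* the two values should agree  */
RUN;
```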

4.2.2 Example<br />

Suppose we wish to add up all the even numbers between 1 and 101. The following<br />
SAS program does this for us:<br />
DATA _NULL_;<br />
NUMSUM = 0;<br />
DO INDEX = 2 TO 100 BY 2;<br />
NUMSUM = NUMSUM + INDEX;<br />
END;<br />
FILE 'even.sum';<br />
PUT NUMSUM;<br />
RUN;<br />
QUIT;<br />
The file even.sum will then contain the value 2550, which is the sum of the first<br />
50 positive even numbers.<br />

4.2.3 Exercises<br />
1. Write a SAS program which calculates the sum of all multiples of 3 between 1<br />
and 121. Ans. 2460<br />
2. Modify the above program so that it calculates the sum of all integers from 51<br />
through 100. Ans. 3775<br />
3. Modify the above program so that it calculates the sum of all squares from 1<br />
to 100.<br />
4. Modify the above program so that it calculates the sum of square roots of even<br />
numbers between 1 and 101.<br />
5. Modify the above program so that it calculates 20! (the product of all integers<br />
from 1 through 20).<br />

4.3 DO While (optional)<br />
In order to use the iterative DO, one needs to know the number of times the computation is<br />
to be performed. Often, this number is not known beforehand. Instead, one might require<br />
that the computation be performed while a particular condition is satisfied.<br />
Syntax:<br />
DO WHILE (condition);<br />
SAS statements<br />
END;<br />
The SAS statements in the DO group are executed as long as the condition is found to<br />
be true. The condition is tested once before the beginning of each loop. The first time that<br />
the condition is found to be false, the DO group statements are no longer executed and SAS<br />
moves on beyond the END; statement.<br />

4.3.1 Example<br />

Suppose we want to determine the largest value of n so that<br />
\[ \sum_{i=1}^{n} i^2 < 10000. \]<br />
One approach to this problem is to successively add terms to the sum while the<br />
sum is less than 10000, and to stop accumulating as soon as the sum reaches or exceeds this<br />
amount. The following statements accomplish this:<br />

DATA _NULL_;<br />

NUMSUM = 0;<br />

INDEX=0;<br />

DO WHILE (NUMSUM < 10000);<br />

INDEX=INDEX+1;<br />

NUMSUM = NUMSUM + INDEX**2;<br />

END;<br />

INDEX=INDEX-1;<br />

FILE 'sum.out';<br />

PUT INDEX;<br />

RUN;<br />

QUIT;<br />

The loop stops only after a term has been added that brings the sum to 10000 or more, which is why INDEX is reduced by 1 after the END; statement.<br />
The final value of INDEX is the solution n. This single number should be contained<br />
in the file 'sum.out' after executing the above lines of code.<br />
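An alternative arrangement, sketched below, tests the sum that would result from adding the next term, so that no correction is needed after the loop. This is a variation on the example rather than the program given above; the variable N is introduced here in place of INDEX, and the result is written to the log instead of to a file.<br />

```sas
DATA _NULL_;
   NUMSUM = 0;
   N = 0;
   /* Continue only if adding the next square keeps the sum below 10000. */
   DO WHILE (NUMSUM + (N + 1)**2 < 10000);
      N = N + 1;
      NUMSUM = NUMSUM + N**2;
   END;
   PUT N= NUMSUM=;
RUN;
```

Since 1² + 2² + ... + 30² = 9455 and adding 31² = 961 would give 10416, the step ends with N = 30 and NUMSUM = 9455.<br />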

4.3.2 Exercises<br />
1. Write a SAS program which finds the largest n satisfying<br />
\[ \sum_{i=1}^{n} i^3 < 20000. \]<br />
2. Write a SAS program which finds the largest n satisfying n! < 100000.<br />
3. Write a SAS program which finds the smallest n satisfying n! > 100000.


Chapter 5<br />

Simulation<br />
5.1 Generation of Pseudorandom Numbers<br />
We begin our discussion of simulation with a brief exploration of the mechanics of pseudorandom<br />
number generation. Pseudorandom numbers are useful in simulation studies.<br />
We will briefly describe a common method for simulating independent uniform random<br />
variables on the interval [0,1]. A multiplicative congruential random number generator produces<br />
a sequence of pseudorandom numbers, $u_0, u_1, u_2, \ldots$, which are approximately independent<br />
uniform random variables on the interval [0,1]. We now describe how to construct<br />
such a generator.<br />

Let m be a large integer, and let b be another integer which is smaller than m. b is often<br />
somewhere around the square root of m. To begin, an integer $x_0$ is chosen between 1 and<br />
m. $x_0$ is called the seed. It is best chosen in some non-systematic manner.<br />

Once the seed has been chosen, the generator proceeds as follows:<br />
\[ x_1 = b x_0 \pmod{m}, \qquad u_1 = x_1/m. \]<br />
$u_1$ is the first pseudorandom number. Since $x_1$ is a remainder after division by m, it lies between 0 and m − 1,<br />
so dividing by m ensures that $u_1$ lies between 0 and 1. If m and b are chosen properly, it<br />
is difficult to predict the value of $u_1$, given the value of $x_0$ only. The second pseudorandom<br />
number is then obtained in the same manner:<br />
\[ x_2 = b x_1 \pmod{m}, \qquad u_2 = x_2/m. \]<br />
$u_2$ is another pseudorandom number, which is approximately independent of $u_1$. The method<br />
continues using the following formulas:<br />
\[ x_n = b x_{n-1} \pmod{m}, \qquad u_n = x_n/m. \]<br />
This method produces numbers which are in reality non-random, but if done properly,<br />
the numbers appear to be random (i.e. unpredictable).<br />

Different values of b and m give rise to pseudorandom number generators of varying<br />
quality. If they are not chosen with some care, then the generator will produce numbers that<br />
do not appear to be random. A number of statistical tests have been developed for assessing<br />
the quality of a pseudorandom number generator.<br />


5.1.1 Example<br />

The following lines of SAS create a file called RANDOM.DAT which contains 5 pseudorandom<br />
numbers based on the multiplicative congruential generator<br />
\[ x_n = 171 x_{n-1} \pmod{30269}, \qquad u_n = x_n/30269, \]<br />
with initial seed $x_0 = 23121$.<br />
/* Rudimentary Pseudorandom Number Generator */<br />
DATA _NULL_;<br />
FILE 'RANDOM.DAT';<br />

B = 171;<br />

M = 30269;<br />

SEED = 23121;<br />

X = SEED;<br />

DO I = 1 TO 5;<br />

X = MOD(B*X, M);<br />

U = X/M;<br />

PUT X U;<br />

END;<br />

RUN;<br />

QUIT;<br />

The results which are stored in the file RANDOM.DAT are as follows. The first column<br />
consists of the integers $x_1, x_2, \ldots, x_5$. The second column consists of numbers ranging<br />
between 0 and 1. These are the uniform pseudorandom numbers, $u_1, u_2, \ldots, u_5$.<br />

18721 0.61849<br />

23046 0.76137<br />

5896 0.19479<br />

9339 0.30853<br />

22981 0.75923<br />

A related operation is used internally by SAS to produce pseudorandom numbers automatically<br />
with the function UNIFORM.<br />
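The first two rows of the table can be verified by hand, which may help in following what MOD(B*X, M) does at each pass through the loop. The arithmetic, worked from the seed 23121 of the first example, is:<br />

```latex
\[
\begin{aligned}
171 \times 23121 &= 3953691 = 130 \times 30269 + 18721,
  & x_1 &= 18721, & u_1 &= 18721/30269 \approx 0.61849,\\
171 \times 18721 &= 3201291 = 105 \times 30269 + 23046,
  & x_2 &= 23046, & u_2 &= 23046/30269 \approx 0.76137.
\end{aligned}
\]
```

In each row the remainder after removing whole multiples of 30269 becomes the next x, and dividing it by 30269 gives the next u.<br />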

5.1.2 Example<br />

The following lines of SAS create a file called RANDOM.DAT which contains 50 uniform<br />
pseudorandom numbers based on the SAS generator UNIFORM with initial seed<br />
$x_0 = 27218$.<br />
/* Example demonstrating use of SAS RNG with fixed seed. */<br />
DATA _NULL_;<br />
SEED = 27218;<br />
FILE 'RANDOM.DAT';<br />

DO I = 1 TO 50;<br />

U = UNIFORM(SEED);<br />

PUT U;<br />

END;<br />

RUN;<br />

QUIT;<br />

It is often of interest to look at the distribution of a set of pseudorandom numbers.<br />
For the numbers generated in the previous example, we would proceed as follows:<br />
DATA RANDOM;<br />
INFILE 'RANDOM.DAT';<br />

INPUT U;<br />

PROC CHART;<br />

VBAR U;<br />

RUN;<br />

QUIT;<br />

The bars of the histogram should all be roughly the same height if the numbers<br />
are really uniformly distributed.<br />
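A numerical summary can complement the histogram. The sketch below assumes that RANDOM.DAT has already been created by the generating step of this example, and uses PROC MEANS with a few of its standard statistic keywords. For a uniform random variable on [0,1] the theoretical mean is 1/2 and the theoretical variance is 1/12 (about 0.083), so the sample values should be in that neighbourhood, although with only 50 numbers some deviation is to be expected.<br />

```sas
DATA RANDOM;
   INFILE 'RANDOM.DAT';      /* file written by the generating step above */
   INPUT U;
RUN;

PROC MEANS DATA=RANDOM N MEAN VAR MIN MAX;
   VAR U;                    /* compare MEAN with 0.5 and VAR with 1/12 */
RUN;
```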

5.1.3 Exercises<br />
1. Generate 200 random numbers using the generator from the first example with<br />
an initial seed of 2018.<br />
2. Write a program (or modify the second program in the second example) which<br />
produces a histogram of the numbers produced in the previous exercise.<br />
3. Generate 200 random numbers using the SAS UNIFORM generator from example<br />
2 with an initial seed of 2018. Produce a histogram of this simulated data.<br />
4. Modify the generator of the first example so that it produces 200 random<br />
numbers from the generator<br />
\[ x_n = 172 x_{n-1} \pmod{30307} \]<br />
with initial seed $x_0 = 17218$.<br />
5. Generate 1000 pseudorandom numbers using the SAS function UNIFORM, and<br />
store them in a file called UNIF.DAT.<br />
6. Modify the above program to simulate the random variable Y = 1/(U +<br />
1) where U is a uniform random variable on the interval [0,1]. Specifically,<br />
generate 1000 values of this random variable and put them in a file called<br />
RANDOM.DAT.<br />
Also, plot the histogram of the random numbers $y_1, \ldots, y_{1000}$. Since Y is no<br />
longer a uniform random variable, the histogram will not be flat any longer;<br />
what is the shape of the distribution?<br />
7. Write a program which generates 100 independent observations on a uniformly<br />
distributed random variable on the interval [0, 100]. Estimate the mean, variance<br />
and standard deviation of such a uniform random variable.<br />
8. Use the FLOOR function together with UNIFORM to simulate 100 random integers<br />
between 0 and 99.<br />

5.2 Simulation of Bernoulli Trials<br />
A Bernoulli trial is an experiment in which there are 2 possible outcomes. For example, a<br />
light bulb may work or it may not work; these are the only possibilities. For another example,<br />
consider a student who guesses on a multiple choice test question which has 5 options; the<br />
student may guess correctly with probability 0.2 and incorrectly with probability 0.8.<br />
Suppose we would like to know how well such a student would do on a multiple choice<br />
test consisting of 100 questions. We can get an idea by using simulation:<br />
Each question corresponds to an independent Bernoulli trial with probability of success<br />
equal to 0.2. We can simulate the correctness of the student for each question by generating<br />
an independent uniform random number. If this number is less than .2, we say that the<br />
student guessed correctly; otherwise, we say that the student guessed incorrectly.<br />
This will work because the probability that a uniform random variable is less than .2 is<br />
exactly .2, which is the probability that the student guesses correctly, while the probability<br />
that a uniform random variable exceeds .2 is exactly .8,<br />
which is the same as the probability that the student guesses incorrectly. Thus, the uniform<br />
random number generator is simulating the student. The SAS version of this is as follows:<br />
DATA _NULL_;<br />
SEED = 12883;<br />
FILE 'STUDENT.ANS';<br />
PUT 'CORRECT U';<br />

DO QUESTION = 1 TO 100;<br />

U = UNIFORM(SEED);<br />

IF U < .2 THEN CORRECT = 1;<br />

ELSE CORRECT = 0;<br />

PUT CORRECT U;<br />

END;<br />

RUN;<br />

QUIT;<br />

The first column of the file STUDENT.ANS contains the results of the student's guesses. A 1<br />
is recorded each time the student correctly guesses the answer, while a 0 is recorded each<br />
time the student is wrong. The second column records the value of the variable U; note<br />
that whenever its value is less than .2, the value of CORRECT is 1, and when U takes a value<br />
of .2 or more, the value of CORRECT is 0.<br />
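To answer the original question of how well the student does overall, the individual results can be added up inside the loop. The following sketch is a variation on the program above, not part of the original notes: the accumulator SCORE is a name introduced here, and the total is written to the log rather than to STUDENT.ANS.<br />

```sas
DATA _NULL_;
   SEED = 12883;
   SCORE = 0;                             /* running count of correct guesses */
   DO QUESTION = 1 TO 100;
      U = UNIFORM(SEED);
      IF U < .2 THEN SCORE = SCORE + 1;   /* a correct guess has probability .2 */
   END;
   PUT 'Simulated score out of 100: ' SCORE;
RUN;
```

Since each question is answered correctly with probability 0.2, scores in the general vicinity of 20 are to be expected, with the exact value depending on the seed.<br />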

5.2.1 Exercises<br />
1. Write a SAS program which simulates a student guessing at a True-False test<br />
consisting of 40 questions.<br />
2. Write a SAS program which simulates 500 light bulbs, each of which has<br />
probability .99 of working.<br />
3. Write a SAS program which simulates a binomial random variable Y with<br />
parameters n = 25 and p = .4. (Y is the sum of 25 independent Bernoulli<br />
random variables with p = .4.)<br />
• Now, modify the program so that it generates 100 of these binomial random<br />
variables and writes them to a file called binom.dat. In order to do this,<br />
you will need to nest one DO group inside another.<br />
• Write another program which reads the data from binom.dat into a SAS<br />
data set and produces a histogram. Estimate the mean and variance using<br />
PROC MEANS. Compare these estimates with their theoretical counterparts.<br />
Recall that the theoretical mean of a binomial random variable is np and<br />
the theoretical variance is np(1 − p).<br />

5.3 The Logistic Model<br />
In many biostatistical applications, interest centers on a dose-response relationship. For<br />
example, what dosage of a carcinogenic substance will produce cancer in a given percentage<br />
of a population? One would expect that higher dosages of carcinogen will yield higher rates<br />
of cancer. A first attempt at modelling this kind of relationship might be<br />
\[ p = \alpha_0 + \alpha_1 x \]<br />
where p is the proportion of the population that would acquire cancer at dosage x; $\alpha_0$ and<br />
$\alpha_1$ are constants. This model is linear, and will almost have the correct behaviour if $\alpha_1$ is<br />
positive. However, it will give values of p outside the interval [0, 1] if x is too large or too<br />
small.<br />
The logistic model is often used as an alternative to handle this kind of problem. It<br />
is based on the logit transformation which maps values in (0, 1) to (−∞, ∞). The logit<br />
transformation is given by $l(p) = \log(p/(1 - p))$. Its inverse is given by the logistic function<br />
$p(l) = \exp(l)/(1 + \exp(l))$.<br />
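That the logistic function is the inverse of the logit can be seen by solving the logit equation for p, using only the two definitions just given:<br />

```latex
\[
l = \log\frac{p}{1-p}
\;\Longrightarrow\;
e^{l} = \frac{p}{1-p}
\;\Longrightarrow\;
e^{l}(1-p) = p
\;\Longrightarrow\;
p = \frac{e^{l}}{1+e^{l}}.
\]
```

As l decreases without bound p approaches 0, and as l increases without bound p approaches 1, so p always stays inside (0, 1).<br />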

We can then model the dose-response relationship with<br />
\[ l(p) = \beta_0 + \beta_1 x \]<br />
where $\beta_0$ and $\beta_1$ are constants. This model says that when the dosage is x, the proportion<br />
of the population acquiring cancer will be p, where<br />
\[ p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}. \]<br />
Example<br />
Write SAS code to simulate the responses of 20 subjects who have been exposed to<br />
varying amounts of carcinogen under the logistic model assumption with $\beta_0 = -1.5$<br />
and $\beta_1 = 0.7$. Assume that the dosages are given by x = 0.1, 0.2, . . . , 2.0. Output<br />
should be printed to a file called 'doseresponsesim.txt'.<br />

DATA _NULL_;<br />
SEED = 81818; B0 = -1.5; B1 = 0.7;<br />
FILE 'doseresponsesim.txt';<br />
PUT 'Response Dosage';<br />

DO X = 0.1 TO 2.0 BY 0.1;<br />

U = UNIFORM(SEED);<br />

TMP = EXP(B0 + B1*X);<br />

P = TMP/(1+TMP);<br />

IF U < P THEN CANCER = 1;<br />

ELSE CANCER = 0;<br />

PUT CANCER X;<br />

END;<br />

RUN;<br />

QUIT;<br />

Upon running the code, one should see that as x increases, the incidence of<br />
cancer tends to increase (i.e. the incidence of 1's in the first column of simulated data<br />
increases).<br />
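Because 0.1 has no exact binary representation, a loop that steps by 0.1 can, in principle, run one time more or fewer than intended once rounding errors accumulate. A cautious variation, sketched below, loops over an integer counter and derives the dosage from it. The counter I is a name introduced here, the probability is written in the algebraically equivalent form 1/(1 + exp(−(β0 + β1x))), and the output goes to the log with P added as a third column.<br />

```sas
DATA _NULL_;
   SEED = 81818; B0 = -1.5; B1 = 0.7;
   DO I = 1 TO 20;                       /* exactly 20 subjects */
      X = I/10;                          /* dosages 0.1, 0.2, ..., 2.0 */
      P = 1/(1 + EXP(-(B0 + B1*X)));     /* same value as TMP/(1+TMP) in the example */
      U = UNIFORM(SEED);
      IF U < P THEN CANCER = 1;
      ELSE CANCER = 0;
      PUT CANCER X P;
   END;
RUN;
```

Printing P shows why the pattern in the responses is only a tendency: over this range of dosages the probability of cancer rises from roughly 0.19 to roughly 0.48.<br />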

Exercises<br />
1. Run the code for the logistic model given in the above example. Then change the slope<br />
parameter $\beta_1$ to −0.7. How does this affect the pattern in the response?<br />
2. Modify the code given in the example so that dosages are given by 1.5, 1.7, 1.9, . . . , 3.5.<br />
3. Modify the example code so that the output enters a SAS dataset called 'DOSERESP'.<br />
Next, use the PLOT procedure to plot CANCER against X. Experiment with various<br />
values of $\beta_0$ and $\beta_1$ in order to see how these values affect the pattern of response.<br />

5.4 Binomial Random Numbers<br />
The RANBIN function can be used to automatically generate binomial random numbers.<br />
Syntax:<br />
Y = RANBIN(seed,n,p);<br />
The seed is any positive integer, while n and p are the binomial parameters. The function<br />
assigns a random binomial realization to the variable Y.<br />

5.4.1 Example<br />

Suppose 12% of a large population has recently been infected by a virus whose incubation period is 2 weeks long, but whose presence can be detected by a blood test. Suppose random testing for the virus is conducted, and 15 individuals are tested each hour. Simulate the number of positive test results for each hour over a 24-hour period. Record the simulated numbers of positive test results in a file called viruscounts.txt.<br />

Since 15 individuals are tested each hour and each individual has a 0.12 probability of being infected, independent of the state of the other individuals, the number of positive test results in one hour is a binomial random variable with n = 15 and p = 0.12. To simulate the numbers of positive test results for each hour in a 24-hour period, we need to generate 24 binomial random numbers:<br />

/* Simulation of infected individuals */<br />

DATA _NULL_;<br />

SEED = 3728;<br />

N = 15;<br />

P = .12;<br />

FILE 'viruscounts.txt';<br />

PUT 'HOUR NUMBER OF INFECTED';<br />

DO HOUR = 1 TO 24;<br />

INFECTED = RANBIN(SEED,N,P);<br />

PUT HOUR INFECTED;<br />

END;<br />

RUN;<br />

QUIT;<br />
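As a rough check on the simulated output, the standard binomial formulas give the theoretical mean and variance of the hourly count Y for n = 15 and p = 0.12. This is only a worked calculation for comparison, not part of the program:<br />

```latex
\begin{aligned}
E(Y) &= np = 15(0.12) = 1.8, \\
\operatorname{Var}(Y) &= np(1-p) = 15(0.12)(0.88) = 1.584, \\
\operatorname{SD}(Y) &= \sqrt{1.584} \approx 1.26 .
\end{aligned}
```

The 24 simulated counts should therefore tend to cluster around 1 or 2, with larger counts appearing only occasionally.<br />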

5.4.2 Exercises<br />

1. Generate 1000 binomial variates with n = 18 and p = .75 using RANBIN. Then use PROC MEANS to estimate the average and variance. Compare with the theoretical mean and variance. Repeat for binomial variates with n = 50 and p = .4.<br />

2. Generate 50 binomial variates B₁, B₂, . . . , B₅₀, having n = 20 and where p satisfies<br />

l(p) = −2.0 + 0.5x<br />

where x = 0.1, 0.2, 0.3, . . . , 5.0. Use the PLOT procedure to plot B against x and note the pattern of plotted points.<br />

3. Refer to the previous question. Calculate the expected value of Bᵢ, for i = 1, 2, . . . , 50. Plot these expected values against x.<br />

5.5 Poisson Random Numbers<br />

We can generate Poisson random numbers using SAS with the RANPOI function. It is similar to the RANBIN function, but there is only one parameter instead of two.<br />

Syntax:<br />

Y = RANPOI(seed, lambda);<br />

In this case, lambda is the mean of the Poisson random variable.<br />

5.5.1 Example<br />

Suppose traffic accidents occur at an intersection with a mean of 3.7 per year. Simulate the annual number of accidents for a 10-year period, assuming that the numbers occurring from year to year are independent.<br />

/* Example of Poisson variate generation -- Simulation of Traffic Accidents */<br />

DATA _NULL_;<br />

SEED = 497765;<br />

LAMBDA = 3.7;<br />

FILE 'ACCIDENT.DAT';<br />

PUT 'YEAR NUMBER OF ACCIDENTS';<br />

DO YEAR = 1 TO 10;<br />

ACCIDENT = RANPOI(SEED, LAMBDA);<br />

PUT YEAR ACCIDENT;<br />

END;<br />

RUN;<br />

QUIT;<br />
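The same simulation can also be kept inside SAS rather than written to a text file. The following is a minimal sketch, assuming a dataset name ACCIDENTS (a hypothetical name chosen for illustration); it replaces FILE and PUT with an OUTPUT statement so that each simulated year becomes one observation:<br />

```sas
/* Sketch: store the simulated accident counts in a SAS dataset */
DATA ACCIDENTS;                       /* ACCIDENTS is a hypothetical dataset name */
  SEED = 497765;
  LAMBDA = 3.7;
  DO YEAR = 1 TO 10;
    ACCIDENT = RANPOI(SEED, LAMBDA);  /* one Poisson count per year */
    OUTPUT;                           /* write the current YEAR and ACCIDENT as an observation */
  END;
  KEEP YEAR ACCIDENT;                 /* drop SEED and LAMBDA from the dataset */
RUN;

/* List the ten simulated observations */
PROC PRINT DATA=ACCIDENTS;
RUN;
```

Once the values are in a dataset, procedures such as PROC MEANS or PROC PLOT can be applied to them directly.<br />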

5.5.2 Exercises<br />

1. Modify the above program to simulate the number of accidents per year for 15 years, when the average rate is 2.8 accidents per year.<br />

2. Simulate the number of surface defects in the finish of a sports car for 20 cars, where the mean is 1.2 defects per car.<br />

3. Estimate the mean and variance of a Poisson random variable whose mean rate is 7.2 by simulating 1000 such variates and using PROC MEANS. Compare with the theoretical values, recalling that the variance and mean are equal for Poisson random variables.<br />

4. A commonly used model is the Poisson regression model<br />

log(λ) = β₀ + β₁x<br />

where β₀ and β₁ are constants. Take β₀ = −3 and β₁ = 0.5, and suppose x = 0.1, 0.2, 0.3, . . . , 4.0. Calculate the corresponding values of λ. (Store these values in a SAS variable called lambda.)<br />

5. Refer to the previous question. Simulate Poisson random variates which have these λ values. Plot the Poisson variates against the corresponding values of x.<br />

5.6 Exponential Random Numbers<br />

The exponential distribution can be used as a simple model for the time until a component fails, or until a light bulb burns out.<br />

A random variable T has an exponential distribution with mean λ if<br />

$$P(T \le t) = 1 - e^{-t/\lambda}$$<br />

for any non-negative t. The mean or expected value of T is λ and the variance of T is λ².<br />

The simplest way to simulate exponential random variables is to generate a uniform random variable U on [0,1], and set<br />

$$1 - e^{-T/\lambda} = U.$$<br />

Solving this for T, we have<br />

$$T = -\lambda \log(1 - U).$$<br />

It can be shown that T defined in this way has an exponential distribution with mean λ. The SAS function RANEXP can be used to generate random exponential variates with mean 1.<br />

Syntax:<br />

T = RANEXP(seed);<br />

This produces an exponential variate T having mean 1. To change the mean to lambda, we must use<br />

T = lambda * RANEXP(seed);<br />
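The claim that T = −λ log(1 − U) is exponentially distributed with mean λ can be checked directly. The sketch below uses only the fact that P(U ≤ u) = u for a uniform random variable U on [0,1] and 0 ≤ u ≤ 1:<br />

```latex
\begin{aligned}
P(T \le t) &= P\bigl(-\lambda \log(1 - U) \le t\bigr) \\
           &= P\bigl(\log(1 - U) \ge -t/\lambda\bigr) \\
           &= P\bigl(1 - U \ge e^{-t/\lambda}\bigr) \\
           &= P\bigl(U \le 1 - e^{-t/\lambda}\bigr) \\
           &= 1 - e^{-t/\lambda}, \qquad t \ge 0 .
\end{aligned}
```

The last line is the exponential distribution function with mean λ, so T has the required distribution.<br />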

5.6.1 Example<br />

/* SIMULATION OF N EXPONENTIAL LAMBDA RANDOM VARIATES */<br />

DATA _NULL_;<br />

SEED = 12238;<br />

LAMBDA = 2.5;<br />

N = 10;<br />

FILE 'EXPO.RVS';<br />

DO I = 1 TO N;<br />

T = RANEXP(SEED)*LAMBDA;<br />

PUT T;<br />

END;<br />

RUN;<br />

QUIT;<br />
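For comparison, the inversion formula T = −λ log(1 − U) can be coded by hand instead of calling RANEXP. This is a minimal sketch, assuming the UNIFORM function used in the earlier logistic example and an output file name EXPOINV.RVS, which is hypothetical:<br />

```sas
/* Sketch: exponential variates by inversion, T = -LAMBDA*LOG(1 - U) */
DATA _NULL_;
  SEED = 12238;
  LAMBDA = 2.5;
  N = 10;
  FILE 'EXPOINV.RVS';            /* hypothetical output file name */
  DO I = 1 TO N;
    U = UNIFORM(SEED);           /* uniform random variate */
    T = -LAMBDA*LOG(1 - U);      /* invert the exponential distribution function */
    PUT T;
  END;
RUN;
```

The values produced need not match those from the RANEXP program, even with the same seed, but both sets of variates follow an exponential distribution with mean 2.5.<br />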

5.6.2 Exercises<br />

1. Suppose that a certain type of battery has a lifetime which is exponentially distributed with mean 55 hours. Simulate 1000 such lifetimes to estimate the mean and variance of the lifetime for this type of battery. Compare with the theoretical mean and variance.<br />

2. The central limit theorem says that the sample mean for a random sample of size n from a population with mean µ and variance σ² is approximately normally distributed with mean µ and variance σ²/n, where the approximation improves as n increases.<br />

The following programs provide a demonstration for the case where the underlying population is exponentially distributed:<br />

/* PROGRAM 1: Computation of averages of samples of size N coming from exponential lambda populations */<br />

DATA _NULL_;<br />

SEED = 12238;<br />

LAMBDA = 2.5;<br />

NSAMPLES = 1000;<br />

N = 10;<br />

FILE 'EXPO.AVG';<br />

/* We are going to simulate NSAMPLES independent samples of size N, computing the average in each case. */<br />

DO NSAMPLE = 1 TO NSAMPLES;<br />

TSUM = 0;<br />

DO I = 1 TO N;<br />

T = RANEXP(SEED)*LAMBDA;<br />

TSUM = TSUM + T; /* Accumulating the sample values to form a sum */<br />

END;<br />

TAVG = TSUM/N; /* TAVG = average of the current sample. */<br />

PUT TAVG; /* Storing sample averages for use in next program where they will be plotted as a histogram. */<br />

END;<br />

RUN;<br />

QUIT;<br />

/* PROGRAM 2: Histogram of averages to demonstrate CLT */<br />

DATA EXPO_AVG;<br />

INFILE 'EXPO.AVG';<br />

INPUT TAVG;<br />

PROC CHART;<br />

VBAR TAVG;<br />

/* We have included this procedure to compare the mean and variance of the averages with what is expected by the theory */<br />

PROC MEANS MEAN VAR;<br />

VAR TAVG;<br />

RUN;<br />

QUIT;<br />

Run the above programs for N = 3, 6, 10, 20, 30, 40. Note how the histogram begins to resemble the familiar bell-shaped curve as N increases. How large would you say N should be in order for the normal approximation to be considered accurate, when the underlying population is exponential?<br />
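When comparing the PROC MEANS output with the theory, it may help to have the target values written out. For an exponential population with mean λ the variance is λ², so for the settings in Program 1 (λ = 2.5, N = 10) the sample average should satisfy approximately:<br />

```latex
\begin{aligned}
E(\bar{T}) &= \lambda = 2.5, \\
\operatorname{Var}(\bar{T}) &= \frac{\lambda^{2}}{N} = \frac{(2.5)^{2}}{10} = 0.625, \\
\operatorname{SD}(\bar{T}) &= \sqrt{0.625} \approx 0.79 .
\end{aligned}
```

For other values of N, only the divisor in the variance changes.<br />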


5.7 Normal Random Numbers<br />

Standard normal random variables can be generated using the RANNOR function in SAS.<br />

Syntax:<br />

Z = RANNOR(seed);<br />

This produces a value of a normal random variable Z which has mean 0 and variance 1.<br />

Recall that if X is normally distributed with mean µ and variance σ², then<br />

X = µ + σZ<br />

where Z is normally distributed with mean 0 and variance 1. Therefore, to simulate a normal random variable X having mean mu and standard deviation sigma, use<br />

X = mu + sigma*RANNOR(seed);<br />

5.7.1 Example<br />

Use simulation to estimate P(Z < 1.25) where Z is a standard normal random variable.<br />

Idea: Simulate a large number (say, 1000) of standard normal random variates and compute the proportion that lie below 1.25.<br />

DATA _NULL_;<br />

FILE 'NORMAL.PRB';<br />

SEED = 19218;<br />

N = 1000;<br />

VALUE = 1.25;<br />

COUNT = 0;<br />

DO I = 1 TO N;<br />

Z = RANNOR(SEED);<br />

IF Z < VALUE THEN COUNT = COUNT + 1;<br />

END;<br />

PROBEST = COUNT/N;<br />

PUT 'AN EMPIRICAL ESTIMATE OF P(Z < ' VALUE ') IS ' PROBEST;<br />

RUN;<br />

QUIT;<br />
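To judge how close the empirical estimate is, the exact probability can be computed with the PROBNORM function described in the reference chapter. The following is a minimal sketch; the variable names EXACT and SE are chosen for illustration, and the standard error formula is the usual one for a proportion based on N independent draws:<br />

```sas
/* Sketch: exact value of P(Z < 1.25) for comparison with PROBEST */
DATA _NULL_;
  VALUE = 1.25;
  N = 1000;
  EXACT = PROBNORM(VALUE);          /* standard normal distribution function at VALUE */
  SE = SQRT(EXACT*(1 - EXACT)/N);   /* approximate standard error of the simulated proportion */
  PUT 'EXACT VALUE OF P(Z < ' VALUE ') IS ' EXACT;
  PUT 'APPROXIMATE STANDARD ERROR OF THE ESTIMATE IS ' SE;
RUN;
```

Since there is no FILE statement, the PUT output goes to the SAS log. The exact value is about 0.8944 and the standard error is about 0.01, so an estimate based on 1000 variates would typically land within roughly 0.02 of the exact value.<br />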

5.7.2 Exercises<br />

1. Simulate 100 normal random variates having mean 51 and standard deviation 5.2. Compute the average and standard deviation of your simulated sample and compare with the theoretical values.<br />

2. Simulate 1000 standard normal random variates Z, and use your simulated sample to estimate<br />

(a) P(Z > 2.5).<br />

(b) P(0 < Z < 1.645).<br />

(c) P(1.2 < Z < 1.45).<br />

(d) P(−1.2 < Z < 1.3).<br />

Compare with the theoretical values (i.e. consult a normal table).<br />

3. Using the fact that a χ² random variable on 1 degree of freedom has the same distribution as the square of a standard normal random variable, simulate 100 independent values of such a χ² random variable, and estimate its mean and variance. (Compare with the theoretical values: 1, 2.)<br />

4. A χ² random variable on n degrees of freedom has the same distribution as the sum of the squares of n independent standard normal random variables. Simulate a χ² random variable on 8 degrees of freedom, and estimate its mean and variance. (Compare with the theoretical values: 8, 16.)<br />

5. A commonly used model is the simple regression model<br />

y = β₀ + β₁x + ε<br />

where β₀ and β₁ are constants. ε is a normal random variable with mean 0 and variance σ². Take β₀ = −3 and β₁ = 0.5, and suppose x = 0.1, 0.2, 0.3, . . . , 4.0.<br />

(a) Simulate 40 independent normal variates ε, supposing σ = 0.4. (Store these values in a SAS variable called epsilon.)<br />

(b) Simulate the corresponding values of y. (Store these values in a SAS variable called y.)<br />

(c) Plot the normal variates against the corresponding values of x. Note the pattern on the plot.<br />

6. Re-do the previous question using σ = 1.5.<br />

7. Repeat, using β₀ = 5 and β₁ = −2.<br />


Chapter 6<br />

REFERENCE: Other Data Step Functions<br />

A SAS DATASET<br />

X1 X2 X3 X4<br />

-1 3 2 2.3<br />

0.1 4 -1 2.1<br />

0.5 -1 -7 2.4<br />

1.9 -1.7 -4 1.9<br />

- used in some of the examples below.<br />

6.1 Arithmetic Functions<br />

• ABS(X) - returns the absolute value of X: |X|.<br />

EXAMPLE: Y=ABS(X1); (Y = 1 0.1 0.5 1.9).<br />

• MAX(X1,X2,...,XN) - returns the largest value among the values of the arguments.<br />

EXAMPLE: Y=MAX(X1,X2,X3,X4); (Y = 3 4 2.4 1.9).<br />

• MIN(X1,X2,...,XN) - returns the smallest value among the values of the arguments.<br />

EXAMPLE: Y=MIN(X1,X2,X3,X4); (Y = -1 -1 -7 -4).<br />

• MOD(N1,N2) - returns the remainder when N1 is divided by N2. The result has the same sign as N1.<br />

EXAMPLE: Y=MOD(X1,X2); (Y = -1 0.1 0.5 0.2).<br />

• SIGN(X) - returns the sign of X, or 0, if X is 0.<br />

EXAMPLE: Y=SIGN(X1); (Y = -1 1 1 1).<br />

• SQRT(X) - returns the square root of X: √X. When X is negative, it returns a missing value (.).<br />

EXAMPLE: Y=SQRT(X1); (Y = . 0.31623 0.70711 1.37840).<br />
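The examples above can be reproduced by reading the small dataset shown at the start of this chapter and applying each function in turn. This is a minimal sketch; the dataset name ARITH and the result variable names are hypothetical, and the square root is computed only for non-negative X1 so that the result is simply left missing otherwise:<br />

```sas
/* Sketch: the example dataset and the arithmetic functions of Section 6.1 */
DATA ARITH;                           /* ARITH is a hypothetical dataset name */
  INPUT X1 X2 X3 X4;
  YABS  = ABS(X1);
  YMAX  = MAX(X1,X2,X3,X4);
  YMIN  = MIN(X1,X2,X3,X4);
  YMOD  = MOD(X1,X2);
  YSIGN = SIGN(X1);
  IF X1 >= 0 THEN YSQRT = SQRT(X1);   /* YSQRT stays missing when X1 is negative */
  DATALINES;
-1 3 2 2.3
0.1 4 -1 2.1
0.5 -1 -7 2.4
1.9 -1.7 -4 1.9
;
RUN;

/* Print one row of results per observation */
PROC PRINT DATA=ARITH;
RUN;
```

Each function is evaluated once per observation, which is why every example above lists four values of Y.<br />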

6.2 Truncation Functions<br />

• CEIL(X) - returns the smallest integer greater than or equal to X.<br />

• FLOOR(X) - returns the largest integer less than or equal to X.<br />

• INT(X) - returns the integer part of X: the same value as FLOOR(X), if X is positive, and the same value as CEIL(X), if X is negative.<br />

• ROUND(X,Z) - returns the value of X rounded to the nearest multiple of Z.<br />
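The differences among these functions show up most clearly for negative and non-integer arguments. The following is a minimal sketch that writes the results to the SAS log; the test values 2.7, -2.7 and 3 and the rounding unit 0.5 are chosen only for illustration:<br />

```sas
/* Sketch: compare CEIL, FLOOR, INT and ROUND on a few values */
DATA _NULL_;
  DO X = 2.7, -2.7, 3;
    C = CEIL(X);          /* smallest integer >= X */
    F = FLOOR(X);         /* largest integer <= X */
    I = INT(X);           /* integer part of X (truncates toward zero) */
    R = ROUND(X, 0.5);    /* nearest multiple of 0.5 */
    PUT X= C= F= I= R=;   /* named output, one line per value of X */
  END;
RUN;
```

For X = -2.7, INT agrees with CEIL (both give -2) while FLOOR gives -3. For the integer X = 3, all four functions return 3.<br />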

6.3 Special Mathematical Functions<br />

• EXP(X): $e^{X}$.<br />

• GAMMA(X): the complete gamma function, $\int_0^\infty t^{X-1} e^{-t}\,dt$.<br />

• LOG(X): the natural logarithm of X.<br />

• LOG2(X): the logarithm to the base 2 of X.<br />

• LOG10(X): the logarithm to the base 10 of X.<br />

6.4 Trigonometric and Hyperbolic Functions<br />

• ARCOS(X): inverse cosine of X.<br />

• ARSIN(X): inverse sine of X.<br />

• ATAN(X): inverse tangent of X.<br />

• COS(X): cosine of X.<br />

• COSH(X): hyperbolic cosine of X.<br />

• SIN(X): sine of X.<br />

• SINH(X): hyperbolic sine of X.<br />

• TAN(X): tangent of X.<br />

• TANH(X): hyperbolic tangent of X.<br />

6.5 Statistical functions<br />

• CSS(X1,X2,...,XN): the corrected sum of squares $\sum_{i=1}^{N} X_i^2 - N\bar{X}^2$.<br />

• CV(X1,X2,...,XN): the coefficient of variation - the standard deviation of X₁, . . . , X_N divided by the mean of X₁, . . . , X_N, expressed as a percentage.<br />

• MEAN(X1,...,XN): $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$.<br />

EXAMPLE: Y = MEAN(X1,X2,X3,X4); (Y = 1.575 1.3 -1.275 -0.475).<br />

• N(X1,...,XN): number of nonmissing arguments.<br />

EXAMPLE: Y=N(.,4.1,.,3.7,5.7); (Y = 3).<br />

• NMISS(X1,...,XN): number of missing values.<br />

EXAMPLE: Y=NMISS(.,4.1,.,3.7,5.7); (Y = 2).<br />

• RANGE(X1,...,XN): maximum minus the minimum.<br />

EXAMPLE: Y=RANGE(X1,X2,X3,X4); (Y = 4 5 9.4 5.9).<br />

• STD(X1,...,XN): standard deviation.<br />

• STDERR(X1,...,XN): standard error (standard deviation divided by √N).<br />

• SUM(X1,...,XN): $\sum_{i=1}^{N} X_i$.<br />

• USS(X1,...,XN): uncorrected sum of squares $\sum_{i=1}^{N} X_i^2$.<br />

• VAR(X1,...,XN): variance.<br />
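These functions operate across the arguments within a single observation, not down a column. The following is a minimal sketch using the first row of the example dataset (-1, 3, 2, 2.3); the result variable names are hypothetical, and CHECK recomputes the corrected sum of squares from the formula given for CSS:<br />

```sas
/* Sketch: statistical functions applied to one row of values */
DATA _NULL_;
  X1 = -1; X2 = 3; X3 = 2; X4 = 2.3;     /* first row of the example dataset */
  XBAR   = MEAN(X1,X2,X3,X4);
  TOTAL  = SUM(X1,X2,X3,X4);
  SPREAD = RANGE(X1,X2,X3,X4);
  SSQ    = USS(X1,X2,X3,X4);             /* uncorrected sum of squares */
  CSSQ   = CSS(X1,X2,X3,X4);             /* corrected sum of squares */
  CHECK  = SSQ - 4*XBAR**2;              /* should agree with CSSQ up to rounding */
  S2     = VAR(X1,X2,X3,X4);
  S      = STD(X1,X2,X3,X4);
  NOBS   = N(., 4.1, ., 3.7, 5.7);       /* count of nonmissing arguments */
  NMISSING = NMISS(., 4.1, ., 3.7, 5.7); /* count of missing arguments */
  PUT XBAR= TOTAL= SPREAD= SSQ= CSSQ= CHECK= S2= S= NOBS= NMISSING=;
RUN;
```

The values of XBAR and SPREAD written to the log should match the first entries of the MEAN and RANGE examples above (1.575 and 4).<br />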

6.6 Probability functions<br />

The following functions can be used to determine various probabilities. The syntax is similar to that used for the random number generator functions.<br />

• GAMINV(P,eta): returns the value of x such that<br />

$$P = \frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$$<br />

(0 ≤ P < 1, and η > 0).<br />

• POISSON(lambda,N): returns the probability that an observation from a Poisson distribution is less than or equal to N. λ is the mean parameter.<br />

i.e. POISSON(lambda,N) = $\sum_{j=0}^{N} \frac{e^{-\lambda}\lambda^{j}}{j!}$.<br />

• PROBBNML(p,n,m): returns the probability that an observation from a binomial distribution with parameters p and n is less than or equal to m.<br />

i.e. PROBBNML(p,n,m) = $\sum_{j=0}^{m} \binom{n}{j} p^{j}(1-p)^{n-j}$.<br />

• PROBCHI(x,nu): returns the probability that a random variable with a chi-square distribution on ν degrees of freedom falls below x.<br />

• PROBF(x,ndf,ddf): returns the probability that a random variable with an F distribution on ndf numerator degrees of freedom and ddf denominator degrees of freedom falls below x.<br />

• PROBGAM(x,eta): returns the probability that a random variable with a gamma distribution with shape parameter η falls below x.<br />

i.e. PROBGAM(x,eta) = $\frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$.<br />


• PROBIT(x): returns the inverse of the standard normal cumulative distribution function.<br />

i.e. If X is a standard normal random variable, then x is the probability that X will take on a value less than PROBIT(x).<br />

• PROBNORM(x): returns the probability that a standard normal random variable will fall below x.<br />

• PROBT(x,nu): returns the probability that a random variable with Student's t distribution on ν degrees of freedom will fall below x.<br />

• TINV(p,nu): returns the pth quantile (that is, the 100p-th percentile) of the Student's t distribution on ν degrees of freedom.<br />

6.6.1 Example<br />

Find the probability that a random variable with a t distribution on 8 degrees of freedom is less than 1.4.<br />

i.e. P(T < 1.4) = ? where T is t-distributed on 8 d.f. The following program writes the probability into the file PROB.T.<br />

DATA _NULL_;<br />

FILE 'PROB.T';<br />

PROB = PROBT(1.4, 8);<br />

PUT PROB;<br />

RUN;<br />

QUIT;<br />
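Several of the probability functions can be tried together in one step. The following is a minimal sketch that writes its results to the SAS log; the parameter values are borrowed from the simulation examples in Chapter 5 (Poisson mean 3.7, binomial n = 15 and p = 0.12, normal cutoff 1.25), and the result variable names are hypothetical:<br />

```sas
/* Sketch: cumulative probabilities and quantiles from the functions above */
DATA _NULL_;
  P1 = POISSON(3.7, 2);         /* P(Y <= 2) for Y Poisson with mean 3.7 */
  P2 = PROBBNML(0.12, 15, 2);   /* P(Y <= 2) for Y binomial with p = 0.12, n = 15 */
  P3 = PROBNORM(1.25);          /* P(Z < 1.25) for Z standard normal */
  Q1 = PROBIT(0.975);           /* value z such that P(Z < z) = 0.975 */
  Q2 = TINV(0.975, 8);          /* 0.975 quantile of t on 8 d.f. */
  P4 = PROBT(Q2, 8);            /* should return 0.975, up to rounding */
  PUT P1= P2= P3= Q1= Q2= P4=;
RUN;
```

Note the argument order: POISSON and PROBBNML take the distribution parameters first and the cutoff last, while PROBT and PROBCHI take the cutoff first. P4 illustrates that PROBT and TINV are inverses of one another.<br />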

6.6.2 Exercises<br />

1. Compute the probability that a Poisson random variable with mean rate 11.4 takes on values less than<br />

(a) 1.<br />

(b) 2.<br />

(c) 5.<br />

(d) 11.<br />

(e) 15.<br />

(f) 21.<br />

2. Repeat the previous question for a binomial random variable with p = .45 and n = 24.<br />

3. The time that it takes a bus to arrive at the next stop is normally distributed with mean 10.4 minutes and standard deviation 1.2. Compute the probabilities that the bus will arrive in less than<br />

(a) 5 minutes.<br />

(b) 8 minutes.<br />

(c) 10.5 minutes.<br />

(d) 12.5 minutes.<br />

(e) 13.1 minutes.<br />

(f) 15.2 minutes.<br />
