Notes and Exercises on the SAS Data Step and Simulation


Notes on the SAS Data Step and an Introduction to Simulation

W. John Braun

University of Western Ontario

Department of Statistical and Actuarial Sciences

Chapter 1

Introduction

1.1 Introduction to Data Analysis and Simulation

Given a set of data, one wishes to analyze it appropriately in order to make a decision or to

acquire some new insights into the population from which the data was extracted.

A data set is a collection of letters (characters) and/or numbers each representing information

in the form of measurements, counts or labels. The data sets we will consider in

this course will usually be in case-by-variable format which is a rectangular array (or matrix)

of data, where each row represents a set of measurements taken on a single subject or

case. Each column of the data set refers to a specific variable, such as age, gender or annual

income.

There are many different types of analysis that are possible. A few of them should be

familiar from an earlier course, such as simple regression analysis or ANOVA. Other kinds

of analyses will be introduced in this course. In all cases, the analysis of a data set involves

one or more of the following:

• checking for errors, missing values, etc. (data cleaning)

• graphical displays

• estimation

• prediction

• control

• measuring uncertainty

• statistical testing

• interpreting results

In order to be able to analyze a data set satisfactorily, a computer package is usually necessary. Several are available, such as SPSS, Minitab, S-Plus and R. This course will focus mainly on the use of SAS, and the goal of this set of notes on the SAS Data Step is to teach you how to use SAS to simulate different kinds of data. Simulated data are generated by the computer according to a pre-specified probability model, such as a normal distribution or a t-distribution, or perhaps, something much more complicated. The way in which the simulated data are generated is designed to make the data appear to be random, though in fact, they are not truly random.


There are at least 2 reasons for learning how to simulate data: first, it gives you a way of 'making up' data for your own future exercises so that you can test out different SAS analysis procedures, and you will be able to find out what kinds of data are appropriate for a given procedure; second, knowing how to simulate a set of data is a step towards understanding what kind of structure underlies the data or the mathematical model which is being studied as an approximation to the real population. Thus, we will first be using SAS to create artificial data of different types. Later on, we will learn how to use SAS procedures to analyze real data; the artificial data can then be used for practice.

1.2 Introduction to SAS

You are about to be introduced to one of the most commonly used statistical packages: SAS

(Statistical Analysis System). Many companies use SAS, especially in the pharmaceutical

industry. Certain insurance companies and banks are also happy to have employees who can

use SAS to analyze data.

SAS is a software system for data analysis. SAS has been (and is continuing to be) developed at the SAS Institute in Research Triangle Park at Cary, North Carolina. We will be using SAS Version 9.3 in this course. It has been in development for over 30 years, and it now has capabilities to perform hundreds of kinds of data analyses. A number of extensions, such as IML, have also been developed which give SAS even more flexibility and power.

In this course, we will only learn the basics. What you learn here will give you the ability

to self-learn the rest of the system as needed.

In these notes, we will begin our introduction to the SAS system by showing how to get it started in the computing lab in WSC 256 and how to use the graphical user interface. Then, the Data Step will be considered in some detail. Matters of input/output and flow control will be discussed. The main application will be to the generation of random numbers and the creation of artificial data. The very important issue of documentation for SAS programs will be considered briefly.

1.3 Accessing SAS at Western

We will begin by learning how to run SAS jobs in the Windows environment. In practice, SAS is often run on Unix platforms, in which case the procedures for running SAS jobs differ from what will be described here, but the content of the SAS programs is almost identical.

To invoke SAS in the lab (Room 256 WSC), begin by logging into the network using your

UWO id and password. Proceed through the following steps as illustrated in Figure 1.1:

1. Click on the Windows icon and choose “All Programs”.

2. Scroll down to the “STATISTICS” folder and click on it.

3. Click on the SAS folder and choose “SAS 9.3”.

You will then see the Program Editor window and the Log Window. You should see

something similar to what is shown in Figure 1.2.

The Program Editor is ready for you to type in a SAS program or to open an existing

program (using the File Menu).


Figure 1.1: Locating the SAS program on the Lab’s Windows system.

Figure 1.2: What should appear on the computer screen after invoking SAS 9.3.

1.4 Main Components of a SAS program

1. DATA step - for reading and manipulating data. Sometimes programming is done in

this step.

2. PROC step - for analyzing data. A SAS procedure is used to conduct the analysis on

data that is contained in a SAS dataset prepared during the DATA step. Thus, the

PROC step usually follows a DATA step.
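As a rough preview of how the two components fit together, the sketch below pairs one DATA step with one PROC step. The data set name SIZES and the two data lines are illustrative only, and the individual statements are explained in Chapter 2.

```sas
/* DATA step: create a small SAS data set (illustrative values) */
DATA SIZES;
  INPUT HEIGHT WEIGHT;
  DATALINES;
149 54
151 60
;
RUN;

/* PROC step: analyze the data set created by the DATA step */
PROC MEANS DATA=SIZES;
RUN;
QUIT;
```

The PROC step names the data set explicitly with DATA=SIZES; if that option is left out, SAS uses the most recently created data set.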

Chapter 2

The Data Step

2.1 Some Definitions

1. Data Value - a single measurement. e.g. the height of a person (Joe).

2. Observation - a set of data values for the same individual. e.g. name, height, weight,

age and sex of Joe.

3. Variable - a set of data values for the same measurement. e.g. the heights of 10 different

people.

4. Data set - a collection of observations. We usually think of the observations as being

the rows of the data set, while the variables make up the columns of the data set.

2.1.1 Example

Consider the following data set which consists of 4 observations on 5 different

variables (NAME, HEIGHT, WEIGHT, AGE, SEX).

NAME HEIGHT WEIGHT AGE SEX

JOE 149 54 13 M

MARY 151 60 28 F

SUE 154 45 21 F

TOM 174 72 26 M

Here, we have 3 numeric variables (HEIGHT, WEIGHT, AGE) and 2 character variables

(NAME, SEX).

2.1.2 Exercise

Consider the following data set:

TEMPERATURE   PRESSURE   MINIMUM WIND SPEED   MAXIMUM WIND SPEED
32            101.5      21                   42
31            101.3      15                   28
30            101.8      7                    35
24            101.2      12                   23
21            100.8      4                    22
22            100.9      18                   27

1. How many variables are there?


2. How many observations on each variable?

The Data Step is the point in the SAS program at which one or more SAS data sets are

created. These data sets may be read in from external files or created from within the SAS

program itself. It should be noted that a single SAS program can consist of more than one

Data Step, though we shall find a single Data Step sufficient for present purposes.

The Data Step consists of a sequence of statements, each ending with a semi-colon. These

statements are primarily concerned with the construction of data sets and the management

of data.

2.2 Data

The first line of the Data Step consists of the Data statement. This statement indicates that

a data step is starting, and it tells SAS the name of the SAS data set which is being created.

Syntax:

DATA setname;

The data set name is a word which is somehow descriptive of the data set with which it

is associated. It must consist of at most 32 letters and/or numbers. The first character must

be a letter.

2.2.1 Examples

The following statement tells SAS that a SAS data set called WEATHER is going to

be created.

DATA WEATHER;

The following statement tells SAS that a SAS data set called GRADES98 is going to

be created.

DATA GRADES98;

Some programming applications do not involve a data set. The following statement

tells SAS to begin a data step without creating a data set.

DATA _NULL_;

This type of data statement frees up memory that would possibly be used unnecessarily.

We will use it when doing simulations.

2.3 Numeric Assignment

The Assignment statement is used for creating new variables and modifying existing variables.

Syntax:

varname = value;

Naming Variables in SAS: A variable name must begin with a letter or an underscore and may be 1 to 32 characters long (versions of SAS before Version 7 allowed at most 8 characters). For example: NAME HEIGHT WEIGHT AGE SEX. If we have two samples of heights, we could label the 2 height variables HEIGHT1 and HEIGHT2. 1HEIGHT and 2HEIGHT are not valid variable names.


2.3.1 Example

TEMP = -21.7;

The above statement assigns the value -21.7 to the variable TEMP.

2.3.2 Example

We can create a SAS data set called WEATHER consisting of one observation on each

of 4 variables using the following sequence of assignment statements. Figure 2.1

shows what this should look like on your computer screen.

DATA WEATHER;

DATE = 22;

PRESSURE= 100.55;

WIND = 19;

TEMP = -21.7;

RUN;

QUIT;

When the program has run (for example, after pressing the 'Runner' button shown in Figure 2.2), the resulting SAS data set is as follows:

WEATHER

DATE PRESSURE WIND TEMP

22 100.55 19 -21.7

Note that the data set is not actually visible in the output. In fact, no output is actually

available; clicking on the ‘Output’ button at the bottom of the screen opens the ‘Output’

window, but nothing appears there, as indicated in the bottom panel of Figure 2.2.

Figure 2.1: Entering commands into the Editor window to assign data values to a number of variables.


Figure 2.2: To execute lines of SAS code, press the ‘Runner’ button as shown in the top panel. What appears

on the screen after the lines of SAS code have been successfully executed: a record of what was done in the

log window. In this case, no errors were reported.

The problem is that we have simply created a ‘SAS dataset’ which is held internally by

the program. In order to see it, we would need to explicitly ask for it somehow. Later, we

will see how to do this.

A simpler way to read in data involves the Input and Datalines statements. The

following lines of code, upon execution, will produce the same SAS dataset as before.

DATA WEATHER;

INPUT DATE PRESSURE WIND TEMP;

DATALINES;

22 100.55 19 -21.7

;


RUN;

QUIT;

A major advantage of this approach is that it allows us to read in more than one observation

on the variables specified by the Input statement. This is accomplished by inserting

additional lines of data, noting that the data will be read into the resulting SAS dataset case

by case, where each case consists of observations on each of the Input variables.

2.3.3 Example

Create a SAS data set called GRADES98 containing the following data:

ID EXAM FINAL

3237332 58 61

4136229 71 68

2838823 43 49

2881266 62 58

The following lines of code will give the required SAS dataset.

DATA GRADES98;

INPUT ID EXAM FINAL;

DATALINES;

3237332 58 61

4136229 71 68

2838823 43 49

2881266 62 58

;

RUN;

QUIT;

2.4 INFILE and INPUT: Importing Data from an External File

Often, a data set has been entered into a text file, for example, from a spreadsheet or data

editor, or perhaps from another SAS program. The INFILE statement is used in the Data

Step to tell SAS where to find the data. Then, the INPUT statement specifies how to assign

the data values to specific variables in the newly created SAS dataset.

Syntax:

INFILE 'filename';

INPUT var1 var2 ... varn;

2.4.1 Example

Suppose the data set of the exercise in the previous section had been previously

entered into a file called weather.dat. We can produce a SAS data set called

WEATHER by executing the following program.


/* Example of reading data */

DATA WEATHER;

INFILE 'WEATHER.DAT';

INPUT TEMP PRESSURE MINWIND MAXWIND;

PROC PRINT NOOBS; /* This statement is NOT necessary, but it

allows one to see the contents of the SAS

data set in the Output window. */

RUN; /* This statement IS necessary. The program

will not run otherwise. */

QUIT;

The PROC PRINT statement invokes the ‘Print Procedure’ which prints the SAS dataset

to the Output window. In this case, it consists of a single case on the four given variables.

It is pictured in Figure 2.3.

Figure 2.3: Output from the SAS Print Procedure. In this case, the single case of the SAS dataset WEATHER

has been printed to the Output window.

2.5 Comments and Documentation

It is often important to add documentation to any computer programs which you create.

Comment statements should be used to describe program contents. Proper documentation

allows you or other users to read and understand your program more easily. This is

particularly useful if the program is to be updated later.

In SAS, there are two forms of comment statements:

1. /* comment */

e.g.

/* The variable RADIUS measures the

cross-sectional radius of each tree at a distance of 1 meter from

the ground. */

2. * comment;

e.g.


* The variable RADIUS measures the cross-sectional

radius of each tree at a distance of 1 meter from the

ground.;

A useful form of documentation includes a statement at the beginning of the program consisting of the title of the program, the name of the programmer, and the date (dates of later revisions are important as well). Sometimes variables are defined here. A brief description of the purpose of the program is useful as well. In the body of the program, it is often useful to explain any special commands used there.

2.5.1 Example

The following lines would make up a SAS file:

/* Descriptive Analysis of a Sample of Four Individuals

By P. Brooks

January 15, 2007

This program computes the mean and standard deviation for the

height, weight and age of a random sample of people.

Variables: HEIGHT = height in centimeters.

WEIGHT = weight in kilograms.

AGE = age in years. */

DATA SIZES; INFILE 'sizes.dat';

INPUT HEIGHT AGE WEIGHT;

PROC MEANS MEAN STD;

* The extra arguments produce only the sample mean and

sample standard deviation for each variable;

2.6 File and Put

• The FILE statement is used to specify an external output file.

Syntax:

FILE 'filename';

• The PUT statement causes SAS to print to the external file named in an earlier FILE

statement.

Syntax:

PUT varname1 varname2 ...;


2.6.1 Example

The following lines cause SAS to print the values 22, 100.55, 19, -21.7 to a file

called weather.txt.

DATA WEATHER;

FILE 'weather.txt';

INPUT DATE PRESSURE WIND TEMP;

PUT DATE PRESSURE WIND TEMP;

DATALINES;

22 100.55 19 -21.7

;

RUN;

QUIT;

Each occurrence of a Put statement causes the current value of the relevant variables to

be output to the file named in the File statement.

2.6.2 Example

DATA _NULL_;
FILE 'GRADES.08';
IF _N_=1 THEN PUT '2008 GRADES'; /* _N_ counts the observations
                                    as they are input to the dataset */
LENGTH NAME $ 8; /* This Length statement ensures that the
                    variable NAME can contain values up to
                    8 characters in length. */
INPUT NAME $ GRADE; /* The $ tells SAS that NAME is a character
                       variable. */
PUT NAME GRADE;
DATALINES;
JOE 57.5
MARY 83
JENNIFER 64.5
;
RUN;
QUIT;

This produces a file called GRADES.08 containing the lines

2008 GRADES

JOE 57.5

MARY 83

JENNIFER 64.5

Note that the use of DATA _NULL_ results in no SAS dataset being created.

2.6.3 Exercises

1. Write out the contents of the file epa.dat produced by the following:

DATA _NULL_;

FILE 'epa.dat';
PUT 'SOME MILEAGE MEASUREMENTS';
LENGTH CAR $ 13;
CAR = 'BUICK CENTURY';
DISTANCE = 540;
FUEL = 40;
PUT CAR DISTANCE FUEL;
CAR = 'HONDA CRX';

DISTANCE = 720;

FUEL = 30;

PUT CAR DISTANCE FUEL;

RUN;

QUIT;

2. Check your answer by executing the above lines on a computer.

3. Was a SAS data set created? Check this by adding the line PROC PRINT NOOBS;

(then look in the Output window and the Log file for more information.)

4. Reorganize the program so that it uses the Datalines statement.

2.7 Arithmetic

SAS can be used as a calculator to perform simple arithmetic.

1. Addition:

varname = varname1 + varname2;

2. Subtraction:

varname = varname1 - varname2;

3. Multiplication:

varname = varname1 * varname2;

4. Division:

varname = varname1 / varname2;

5. Power (varname1 raised to the power varname2):

varname = varname1 ** varname2;

6. Modular arithmetic:

varname = MOD(varname1, varname2);

this computes the remainder resulting from division of varname1 by varname2 and

assigns this value to varname.


2.7.1 Example

DATA _NULL_;

/* some examples of arithmetic calculations */

FILE 'arith.out';

X = 15; Y = 6;

SUM = X + Y;

DIFF = X - Y; /* DIFF = DIFFERENCE */

PRODUCT = X * Y;

QUOTIENT = X/Y;

POWER = X ** Y;

REMAIND = MOD(X,Y); /* REMAIND = REMAINDER */

PUT X Y SUM DIFF PRODUCT;

PUT QUOTIENT POWER REMAIND;

RUN;

QUIT;

Execution of the above SAS program produces a file called arith.out which contains

the following lines:

15 6 21 9 90

2.5 11390625 3

2.7.2 Exercises

1. What are the contents of the file convert.tmp produced by the following

program?

DATA _NULL_;

FILE 'convert.tmp';
TEMPC = 20;
TEMPF = TEMPC*1.8 + 32;
PUT TEMPC ' degrees Celsius = ' TEMPF ' degrees Fahrenheit.';

RUN;

QUIT;

2. Suppose X = 45, Y = 32, and Z = 7. Find the value of the variable ANSWER

in each of the following:

(a) ANSWER = X - Y;

(b) ANSWER = Z ** Z;

(d) ANSWER = MOD(Y,Z);

(e) ANSWER = MOD(X,Y)+ MOD(X,Z);

3. Using the fact that 1 mile = 1.6 kilometers, write a complete SAS program

which converts a distance of 26 miles into kilometer units, and which prints

the following into a file called convert.dst:

A distance of 26 miles

is the same as a distance of 41.6

kilometers.


The Floor Function

Syntax:

varname = FLOOR(varname1);

This statement assigns the greatest integer less than or equal to varname1 to the variable varname. For example, the greatest integer less than or equal to 27.34 is 27, and the greatest integer less than or equal to -16.4 is -17.
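A short sketch of these cases follows, including a whole number, which FLOOR leaves unchanged. The variable names A, B and C are hypothetical. Because no FILE statement is given, PUT writes to the SAS log.

```sas
DATA _NULL_;
  A = FLOOR(27.34);   /* 27 */
  B = FLOOR(-16.4);   /* -17: rounds down, away from zero for negatives */
  C = FLOOR(5);       /* 5: an integer is returned unchanged */
  PUT A= B= C=;       /* log shows: A=27 B=-17 C=5 */
RUN;
QUIT;
```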

2.7.3 Example

DATA _NULL_;

X = 47.39;

Y = FLOOR(X);

The value of Y is 47.

2.7.4 Exercises

1. Write out the contents of the file arith.dat produced by

DATA _NULL_;

FILE 'arith.dat';

X = -42.49;

Y = FLOOR(X);

PUT X Y;

RUN;

QUIT;

2. Modify the above program to compute the greatest integer less than

(a) 0.47.

(b) -0.47.

Chapter 3

If: Controlling Flow of Operations

The IF statement is very important in database management. It is used to control the flow

of operations which are applied to variables depending on the values of relevant variables.

In other words, if a certain variable takes on a certain value, a certain operation might be

performed; otherwise, the operation is not performed or a different operation is performed

in its place.

Syntax:

IF (condition) THEN (SAS statement);

ELSE (SAS statement);

SAS evaluates the condition to determine whether it is true or false. If the condition is true,

SAS proceeds to carry out the SAS statement. The ELSE statement is optional. It provides

an alternative action if the condition is false.

Possible conditions to test are

varname GE constant, varname LE constant

varname < constant, varname > constant

varname = constant, varname NE constant

Testing the first condition above amounts to testing whether the variable with name

varname is greater than or equal to the specified constant (another variable name could be

used here as well). The second condition listed concerns less than or equal, and the last

condition involves testing for inequality.
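The sketch below shows one of these comparisons inside a complete Data Step. The variable names GRADE and RESULT and the cutoff of 50 are hypothetical, chosen only for illustration.

```sas
DATA _NULL_;
  INPUT GRADE;
  /* GE tests "greater than or equal to" */
  IF GRADE GE 50 THEN RESULT = 'PASS';
  ELSE RESULT = 'FAIL';
  PUT GRADE RESULT;   /* no FILE statement, so output goes to the log */
  DATALINES;
49
50
83
;
RUN;
QUIT;
```

The log should show FAIL for 49 and PASS for both 50 and 83, since the boundary value 50 satisfies GE.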

3.0.5 Example – Coding

The variable SEX can take values 'M' and 'F'. It is sometimes more convenient to code this variable numerically using 1 for males and 0 for females. The IF statement can be used to do this as follows:

IF SEX = 'M' THEN SEXCODE = 1;

ELSE SEXCODE = 0;

In other words, if the variable SEX takes the value ’M’, then the new variable

SEXCODE takes the value 1. Otherwise, SEXCODE takes the value 0.


3.0.6 Example – Outlier Detection

Suppose X is a variable whose mean is MU and standard deviation is SIGMA. We may decide that the value of X is to be considered outlying if it is more than 3 standard deviations from MU. The following SAS lines determine if the value of X is outlying. The variable OUTLIER is assigned the value 1 if X is an outlier, and it is assigned the value 0 if X is not an outlier.

OUTLIER = 0;

Z = (X - MU)/SIGMA;

IF Z > 3 THEN OUTLIER = 1;

ELSE IF Z < -3 THEN OUTLIER = 1;
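An equivalent way to write the same test, sketched here as a complete Data Step, uses the SAS function ABS (absolute value) so that only one comparison is needed. The values MU = 100 and SIGMA = 15 and the three data lines are made up for illustration.

```sas
DATA _NULL_;
  MU = 100;      /* assumed mean (illustrative) */
  SIGMA = 15;    /* assumed standard deviation (illustrative) */
  INPUT X;
  Z = (X - MU)/SIGMA;
  /* |Z| > 3 covers both Z > 3 and Z < -3 */
  IF ABS(Z) > 3 THEN OUTLIER = 1;
  ELSE OUTLIER = 0;
  PUT X OUTLIER;   /* written to the log */
  DATALINES;
110
160
40
;
RUN;
QUIT;
```

With these values, 110 is not flagged (Z is about 0.67), while 160 and 40 are flagged (Z = 4 and Z = -4).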

3.0.7 Exercises

1. Execute the following program and view the contents of the file demog.dat.

DATA DEMOGRAP;
FILE 'demog.dat';
INPUT SEX $;
IF SEX = 'M' THEN SEXCODE = 1;

ELSE SEXCODE = 0;

PUT SEXCODE;

DATALINES;

M

F

M

F

;

RUN;

QUIT;

2. The following data has been recorded over a period of 5 hours at a switch: 0,1,1,1,0. The switch is off when the value of the above variable (called testcode) is 0, and on when the value is 1.

Write a SAS program which assigns the value 'on' to the variable test when the testcode value is 1 and 'off' when testcode is 0.

3. A random variable X has mean 14 and variance 49. Write a SAS program which determines which of the following values of X are outliers: 15, 23, -8, 31, 17. The results should be output to a file called 'outliers.ex'.

Chapter 4

DOing things repeatedly

The DO statement is often useful for simulation. It is also sometimes useful in other kinds

of data preparation and analysis.

4.1 Simple DO

The simple DO statement (which is usually used in association with an IF statement) tells

SAS to execute a set of SAS statements. This set of statements is usually referred to as a

DO group.

Syntax:

DO;

SAS statements

END;

4.1.1 Example

DATA _NULL_;

FILE 'do.eg';

INPUT X Y;

IF X > Y THEN DO;

Z1 = X+Y;

Z2 = X-Y;

END;

ELSE DO;

Z1 = X-Y;

Z2 = X+Y;

END;

PUT X Y Z1 Z2;

DATALINES;

3 4

5 4

;

RUN;


QUIT;

Executing the above program results in a file called ’do.eg’ which contains the

following:

3 4 -1 7

5 4 9 1

4.2 Iterative DO

The iterative DO statement tells SAS to perform a computation several times.

Syntax:

DO varname = constant1 TO constant2 BY constant3;

SAS statements

END;

4.2.1 Example

Suppose we wish to add up all the numbers from 1 to 100. The following SAS

program does this for us:

DATA _NULL_;
NUMSUM = 0; /* NUMSUM is the variable which will
               ultimately contain the sum we are
               interested in.*/
DO INDEX = 1 TO 100;
NUMSUM = NUMSUM + INDEX; /* At each iteration of the DO group,
                            the current value of INDEX is added to
                            the current value of NUMSUM. */
END;
FILE 'sum.100';
PUT NUMSUM;
RUN;
QUIT;

The file sum.100 will then contain the value 5050, which is the sum of the first 100

integers.
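The value 5050 can be checked by hand with the standard formula for the sum of the first n integers:

```latex
\[
\sum_{i=1}^{n} i = \frac{n(n+1)}{2},
\qquad
\sum_{i=1}^{100} i = \frac{100 \cdot 101}{2} = 5050 .
\]
```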

4.2.2 Example

Suppose we wish to add up all the even numbers between 1 and 101. The following

SAS program does this for us:

DATA _NULL_;

NUMSUM = 0;


DO INDEX = 2 TO 100 BY 2;

NUMSUM = NUMSUM + INDEX;

END;

FILE 'even.sum';

PUT NUMSUM;

RUN;

QUIT;

The file even.sum will then contain the value 2550, which is the sum of the first

50 even numbers.

4.2.3 Exercises

1. Write a SAS program which calculates the sum of all multiples of 3 between 1 and 121. Ans. 2460

2. Modify the above program so that it calculates the sum of all integers from 51 through 100. Ans. 3775

3. Modify the above program so that it calculates the sum of all squares from 1 to 100.

4. Modify the above program so that it calculates the sum of square roots of even numbers between 1 and 101.

5. Modify the above program so that it calculates 20! (the product of all integers between 1 and 20).

4.3 DO While (optional)

In order to use the iterative DO, one needs to know the number of times the computation is

to be performed. Often, this number is not known beforehand. Instead, one might require

that the computation is performed while a particular condition is satisfied.

Syntax:

DO WHILE (condition);

SAS statements

END;

The SAS statements in the DO group are executed as long as the condition is found to

be true. The condition is tested once before the beginning of each loop. The first time that

the condition is found to be false, the DO group statements are no longer executed and SAS

moves on beyond the END; statement.
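As a small sketch of this behaviour, the following Data Step doubles a number until it exceeds 1000 and counts how many doublings were needed. The variable names P and K and the limit 1000 are hypothetical.

```sas
DATA _NULL_;
  P = 1;   /* current power of 2 */
  K = 0;   /* number of doublings performed so far */
  /* The condition is checked before every pass through the loop */
  DO WHILE (P <= 1000);
    P = P * 2;
    K = K + 1;
  END;
  PUT K= P=;   /* log shows: K=10 P=1024 */
RUN;
QUIT;
```

The loop stops after the tenth pass because 1024 is the first power of 2 for which the condition P <= 1000 is false.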

4.3.1 Example

Suppose we want to determine the largest value of $n$ so that

$$\sum_{i=1}^{n} i^2 < 10000.$$


One approach to this problem is to successively add terms to the sum, while the

sum is less than 10000, and to stop accumulating as soon as the sum exceeds this

amount. The following statements accomplish this:

DATA _NULL_;

NUMSUM = 0;

INDEX=0;

DO WHILE (NUMSUM < 10000);

INDEX=INDEX+1;

NUMSUM = NUMSUM + INDEX**2;

END;

INDEX=INDEX-1;

FILE 'sum.out';

PUT INDEX;

RUN;

QUIT;

The final value of INDEX is the solution n. This single number should be contained

in the file ‘sum.out’ after executing the above lines of code.
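The answer can be checked by hand with the standard formula for a sum of squares, which suggests that the file should contain the value 30:

```latex
\[
\sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6},
\qquad
\frac{30 \cdot 31 \cdot 61}{6} = 9455 < 10000,
\qquad
\frac{31 \cdot 32 \cdot 63}{6} = 10416 > 10000 .
\]
```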

4.3.2 Exercises

1. Write a SAS program which finds the largest $n$ satisfying

$$\sum_{i=1}^{n} i^3 < 20000.$$

2. Write a SAS program which finds the largest n satisfying n! < 100000.

3. Write a SAS program which finds the smallest n satisfying n! > 100000.

Chapter 5

Simulation

5.1 Generation of Pseudorandom Numbers

We begin our discussion of simulation with a brief exploration of the mechanics of pseudorandom number generation. Pseudorandom numbers are useful in simulation studies.

We will briefly describe a common method for simulating independent uniform random variables on the interval [0,1]. A multiplicative congruential random number generator produces a sequence of pseudorandom numbers, $u_0, u_1, u_2, \ldots$, which are approximately independent uniform random variables on the interval [0,1]. We now describe how to construct such a generator.

Let $m$ be a large integer, and let $b$ be another integer which is smaller than $m$. $b$ is often somewhere around the square root of $m$. To begin, an integer $x_0$ is chosen between 1 and $m$. $x_0$ is called the seed. It is best chosen in some non-systematic manner.

Once the seed has been chosen, the generator proceeds as follows:

$$x_1 = b x_0 \pmod{m}$$

$$u_1 = x_1/m.$$

$u_1$ is the first pseudorandom number. Dividing by $m$ ensures that the number lies between 0 and 1. If $m$ and $b$ are chosen properly, it is difficult to predict the value of $u_1$, given the value of $x_0$ only. The second pseudorandom number is then obtained in the same manner:

$$x_2 = b x_1 \pmod{m}$$

$$u_2 = x_2/m.$$

$u_2$ is another pseudorandom number, which is approximately independent of $u_1$. The method continues using the following formulas:

$$x_n = b x_{n-1} \pmod{m}$$

$$u_n = x_n/m.$$

This method produces numbers which are in reality non-random, but if done properly, the numbers appear to be random (i.e. unpredictable).

Different values of $b$ and $m$ give rise to pseudorandom number generators of varying quality. If they are not chosen with some care, then the generator will produce numbers that do not appear to be random. A number of statistical tests have been developed for assessing the quality of a pseudorandom number generator.


5.1.1 Example

The following lines of SAS create a file called RANDOM.DAT which contains 5 pseudorandom numbers based on the multiplicative congruential generator

$$x_n = 171 x_{n-1} \pmod{30269}$$

$$u_n = x_n/30269$$

with initial seed $x_0 = 23121$.

/* Rudimentary Pseudorandom Number Generator */

DATA _NULL_;

FILE 'RANDOM.DAT';

B = 171;

M = 30269;

SEED = 23121;

X = SEED;

DO I = 1 TO 5;

X = MOD(B*X, M);

U = X/M;

PUT X U;

END;

RUN;

QUIT;

The results which are stored in the file RANDOM.DAT are as follows. The first column consists of the integers $x_1, x_2, \ldots, x_5$. The second column consists of numbers ranging between 0 and 1. These are the uniform pseudorandom numbers, $u_1, u_2, \ldots, u_5$.

18721 0.61849

23046 0.76137

5896 0.19479

9339 0.30853

22981 0.75923
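The first line of this output can be verified by hand from the recurrence and the seed $x_0 = 23121$:

```latex
\[
171 \times 23121 = 3953691 = 130 \times 30269 + 18721,
\qquad\text{so}\qquad
x_1 = 18721,
\quad
u_1 = \frac{18721}{30269} \approx 0.61849 .
\]
```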

A related operation is used internally by SAS to produce pseudorandom numbers automatically

with the function UNIFORM.

5.1.2 Example

The following lines of SAS create a file called RANDOM.DAT which contains 50 uniform pseudorandom numbers based on the SAS generator UNIFORM with initial seed $x_0 = 27218$.

/* Example demonstrating use of SAS RNG with fixed seed. */

DATA _NULL_;

SEED = 27218;

FILE 'RANDOM.DAT';

DO I = 1 TO 50;

U = UNIFORM(SEED);

PUT U;

END;

RUN;

QUIT;

It is often of interest to look at the distribution of a set of pseudorandom numbers.

For the numbers generated in the previous example, we would proceed as follows:

DATA RANDOM;

INFILE 'RANDOM.DAT';

INPUT U;

PROC CHART;

VBAR U;

RUN;

QUIT;

The bars of the histogram should all be roughly the same height, if the numbers

are really uniformly distributed.
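PROC CHART draws the histogram with text characters. As an alternative sketch, and assuming a SAS release that includes the SGPLOT procedure (SAS 9.2 or later) with graphical output enabled, the same data set RANDOM could be displayed as a graphical histogram:

```sas
/* Assumes the data set RANDOM created by the preceding Data Step */
PROC SGPLOT DATA=RANDOM;
  HISTOGRAM U;
RUN;
QUIT;
```

With only 50 values, the bar heights will vary noticeably from one seed to another even when the generator is behaving well.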

5.1.3 Exercises

1. Generate 200 random numbers using the generator from the first example with an initial seed of 2018.

2. Write a program (or modify the second program in the second example) which produces a histogram of the numbers produced in the previous exercise.

3. Generate 200 random numbers using the SAS UNIFORM generator from example 2 with an initial seed of 2018. Produce a histogram of this simulated data.

4. Modify the generator of the first example so that it produces 200 random numbers from the generator

$$x_n = 172 x_{n-1} \pmod{30307}$$

with initial seed $x_0 = 17218$.

5. Generate 1000 pseudorandom numbers using the SAS function UNIFORM, and store them in a file called UNIF.DAT.

6. Modify the above program to simulate the random variable $Y = 1/(U + 1)$ where $U$ is a uniform random variable on the interval [0,1]. Specifically, generate 1000 values of this random variable and put them in a file called RANDOM.DAT.

Also, plot the histogram of the random numbers $y_1, \ldots, y_{1000}$. Since $Y$ is no longer a uniform random variable, the histogram will not be flat any longer; what is the shape of the distribution?

7. Write a program which generates 100 independent observations on a uniformly
distributed random variable on the interval [0, 100]. Estimate the mean, variance
and standard deviation of such a uniform random variable.

8. Use the FLOOR function together with UNIFORM to simulate 100 random integers
between 0 and 99.

5.2 Simulation of Bernoulli Trials

A Bernoulli trial is an experiment in which there are 2 possible outcomes. For example, a
light bulb may work or it may not work; these are the only possibilities. For another example,
consider a student who guesses on a multiple choice test question which has 5 options; the
student may guess correctly with probability 0.2 and incorrectly with probability 0.8.

Suppose we would like to know how well such a student would do on a multiple choice

test consisting of 100 questions. We can get an idea by using simulation:

Each question corresponds to an independent Bernoulli trial with probability of success
equal to 0.2. We can simulate the correctness of the student for each question by generating
an independent uniform random number. If this number is less than .2, we say that the
student guessed correctly; otherwise, we say that the student guessed incorrectly.

This will work because the probability that a uniform random variable is less than .2 is
exactly .2, while the probability that a uniform random variable exceeds .2 is exactly .8,
which is the same as the probability that the student guesses incorrectly. Thus, the uniform
random number generator is simulating the student. The SAS version of this is as follows:

DATA _NULL_;

SEED = 12883;

FILE 'STUDENT.ANS';

PUT 'CORRECT U';

DO QUESTION = 1 TO 100;

U = UNIFORM(SEED);

IF U < .2 THEN CORRECT = 1;

ELSE CORRECT = 0;

PUT CORRECT U;

END;

RUN;

QUIT;

The first column of the file STUDENT.ANS contains the results of the student’s guesses. A 1
is recorded each time the student correctly guesses the answer, while a 0 is recorded each
time the student is wrong. The second column records the value of the variable U; note
that whenever its value is less than .2, the value of CORRECT is 1, and when U takes a value
exceeding .2, the value of CORRECT is 0.
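
To address the original question of how well such a student does overall, the individual results can be added up inside the same loop. The following is a minimal sketch rather than part of the original program: the variable name SCORE is an assumption introduced here, and because no FILE statement is given the PUT statement writes to the SAS log.

```sas
/* Sketch: total number of correct guesses on a 100-question test.
   SCORE is a hypothetical variable name introduced for illustration. */
DATA _NULL_;
  SEED = 12883;
  SCORE = 0;                            /* running count of correct guesses */
  DO QUESTION = 1 TO 100;
    U = UNIFORM(SEED);
    IF U < .2 THEN SCORE = SCORE + 1;   /* correct with probability .2 */
  END;
  PUT 'NUMBER CORRECT OUT OF 100: ' SCORE;
RUN;
```

Since each guess is correct with probability .2, scores in the neighbourhood of 20 should be typical.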

5.2.1 Exercises

1. Write a SAS program which simulates a student guessing at a True-False test

consisting of 40 questions.

2. Write a SAS program which simulates 500 light bulbs, each of which has

probability .99 of working.

3. Write a SAS program which simulates a binomial random variable Y with
parameters n = 25 and p = .4. (Y is the sum of 25 independent Bernoulli
random variables with p = .4.)

• Now, modify the program so that it generates 100 of these binomial random
variables and writes them to a file called binom.dat. In order to do this,
you will need to nest one DO group inside another.

• Write another program which reads the data from binom.dat into a SAS
data set and produces a histogram. Estimate the mean and variance using
PROC MEANS. Compare these estimates with their theoretical counterparts.
Recall that the theoretical mean of a binomial random variable is np and
the theoretical variance is np(1 − p).

5.3 The Logistic Model

In many biostatistical applications, interest centers on a dose-response relationship. For

example, what dosage of a carcinogenic substance will produce cancer in a given percentage

of a population? One would expect that higher dosages of carcinogen will yield higher rates

of cancer. A first attempt at modelling this kind of relationship might be

$$p = \alpha_0 + \alpha_1 x$$

where p is the proportion of the population that would acquire cancer at dosage x; $\alpha_0$ and
$\alpha_1$ are constants. This model is linear, and will almost have the correct behaviour if $\alpha_1$ is
positive. However, it will give values of p outside the interval [0, 1] if x is too large or too
small.

The logistic model is often used as an alternative to handle this kind of problem. It
is based on the logit transformation which maps values in (0, 1) to (−∞, ∞). The logit
transformation is given by $l(p) = \log(p/(1 - p))$. Its inverse is given by the logistic function
$p(l) = \exp(l)/(1 + \exp(l))$.
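
The inverse can be verified in a few steps by exponentiating the logit and solving for p. This is a short worked derivation added for illustration:

```latex
\begin{align*}
l = \log\frac{p}{1-p}
  \;&\Longrightarrow\; e^{l} = \frac{p}{1-p}
  \;\Longrightarrow\; e^{l} - p\,e^{l} = p \\
  &\Longrightarrow\; p = \frac{e^{l}}{1 + e^{l}}.
\end{align*}
```

Because $e^{l}$ is positive for every real l, the resulting p always lies strictly between 0 and 1.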

We can then model the dose-response relationship with

$$l(p) = \beta_0 + \beta_1 x$$

where $\beta_0$ and $\beta_1$ are constants. This model says that when the dosage is x, the proportion
of the population acquiring cancer will be p, where

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}.$$

Example

Write SAS code to simulate the responses of 20 subjects who have been exposed to
varying amounts of carcinogen under the logistic model assumption with $\beta_0 = -1.5$
and $\beta_1 = 0.7$. Assume that the dosages are given by x = 0.1, 0.2, . . . , 2.0. Output
should be printed to a file called 'doseresponsesim.txt'.

DATA _NULL_;

SEED = 81818; B0 = -1.5; B1 = 0.7;

FILE 'doseresponsesim.txt';

PUT 'Response Dosage';

DO X = 0.1 TO 2.0 BY 0.1;

U = UNIFORM(SEED);

TMP = EXP(B0 + B1*X);

P = TMP/(1+TMP);

IF U < P THEN CANCER = 1;

ELSE CANCER = 0;

PUT CANCER X;

END;

RUN;

QUIT;

Upon running the code, it should be clear that as x increases, the incidence of

cancer increases (i.e. the incidence of 1’s in the first column of simulated data

increases).
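
For reference, the response probabilities implied by $\beta_0 = -1.5$ and $\beta_1 = 0.7$ at the smallest and largest dosages in the example work out as follows (a worked illustration, with values rounded to three decimals):

```latex
\begin{align*}
x = 0.1:&\quad p = \frac{e^{-1.5 + 0.7(0.1)}}{1 + e^{-1.5 + 0.7(0.1)}}
          = \frac{e^{-1.43}}{1 + e^{-1.43}} \approx 0.193,\\
x = 2.0:&\quad p = \frac{e^{-1.5 + 0.7(2.0)}}{1 + e^{-1.5 + 0.7(2.0)}}
          = \frac{e^{-0.1}}{1 + e^{-0.1}} \approx 0.475.
\end{align*}
```

With only 20 simulated subjects, the upward trend in the 1’s may be only loosely visible in any single run.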

Exercises

1. Run the code for the logistic model given in the above example. Then change the slope
parameter $\beta_1$ to −0.7. How does this affect the pattern in the response?

2. Modify the code given in the example so that dosages are given by 1.5, 1.7, 1.9, . . . , 3.5.

3. Modify the example code so that the output enters a SAS dataset called 'DOSERESP'.
Next, use the PLOT procedure to plot CANCER against X. Experiment with various
values of $\beta_0$ and $\beta_1$ in order to see how these values affect the pattern of response.

5.4 Binomial Random Numbers

The RANBIN function can be used to automatically generate binomial random numbers.

Syntax:

Y = RANBIN(seed,n,p);

The seed is any positive integer, while n and p are the binomial parameters. The function
assigns a random binomial realization to the variable Y.

5.4.1 Example

Suppose 12% of a large population has recently been infected by a virus whose
incubation period is 2 weeks long, but whose presence can be detected by a blood
test. Suppose random testing for the virus is conducted, and 15 individuals are
tested each hour. Simulate the number of positive test results for each hour over
a 24-hour period. Record the simulated numbers of positive test results in a file
called viruscounts.txt.

Since 15 individuals are tested each hour and each individual has a 0.12 probability
of being infected, independent of the state of the other individuals, the number
of positive test results in one hour is a binomial random variable with n = 15
and p = 0.12. To simulate the numbers of positive test results for each hour in a
24-hour period, we need to generate 24 binomial random numbers:

/* Simulation of infected individuals */

DATA _NULL_;

SEED = 3728;

N = 15;

P = .12;

FILE 'viruscounts.txt';

PUT 'HOUR NUMBER OF INFECTED';

DO HOUR = 1 TO 24;

INFECTED = RANBIN(SEED,N,P);

PUT HOUR INFECTED;

END;

RUN;

QUIT;
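
The simulated counts can be read back and summarized. The sketch below is an illustration rather than part of the original example: the data set name VIRUS is an assumption, and FIRSTOBS=2 is used to skip the header line that the program above writes to viruscounts.txt.

```sas
/* Sketch: summarize the simulated hourly counts.
   VIRUS is a hypothetical data set name. */
DATA VIRUS;
  INFILE 'viruscounts.txt' FIRSTOBS=2;   /* skip the header line */
  INPUT HOUR INFECTED;
RUN;

PROC MEANS DATA=VIRUS MEAN VAR;
  VAR INFECTED;
RUN;
```

The sample mean and variance can be compared with the theoretical values np = 15(0.12) = 1.8 and np(1 − p) = 15(0.12)(0.88) = 1.584; with only 24 observations, close agreement should not be expected.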

5.4.2 Exercises

1. Generate 1000 binomial variates with n = 18 and p = .75 using RANBIN. Then use
PROC MEANS to estimate the average and variance. Compare with the theoretical mean
and variance. Repeat for binomial variates with n = 50 and p = .4.

2. Generate 50 binomial variates $B_1, B_2, \ldots, B_{50}$, having n = 20 and where p satisfies

$$l(p) = -2.0 + 0.5x$$

where x = 0.1, 0.2, 0.3, . . . , 5.0. Use the PLOT procedure to plot B against x and note
the pattern of plotted points.

3. Refer to the previous question. Calculate the expected value of $B_i$, for i = 1, 2, . . . , 50.
Plot these expected values against x.

5.5 Poisson Random Numbers

We can generate Poisson random numbers using SAS with the RANPOI function. It is similar
to the RANBIN function, but there is only one parameter instead of two.

Syntax:

Y = RANPOI(seed, lambda);

In this case, lambda is the mean of the Poisson random variable.

5.5.1 Example

Suppose traffic accidents occur at an intersection with a mean of 3.7 per year.

Simulate the annual number of accidents for a 10-year period, assuming that the

numbers occurring from year to year are independent.

/* Example of Poisson variate generation -- Simulation of Traffic

Accidents */

DATA _NULL_;

SEED = 497765;

LAMBDA = 3.7;

FILE 'ACCIDENT.DAT';

PUT 'YEAR NUMBER OF ACCIDENTS';

DO YEAR = 1 TO 10;

ACCIDENT = RANPOI(SEED, LAMBDA);

PUT YEAR ACCIDENT;

END;

RUN;

QUIT;
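
To get a feel for the counts this program should produce, recall that a Poisson random variable Y with mean λ has $P(Y = k) = e^{-\lambda}\lambda^k/k!$. With λ = 3.7, for instance (a worked illustration, rounded to three decimals):

```latex
\begin{align*}
P(Y = 0) &= e^{-3.7} \approx 0.025,\\
P(Y = 3) &= \frac{e^{-3.7}\,(3.7)^{3}}{3!} \approx 0.209.
\end{align*}
```

Accident-free years should therefore be rare in the simulated output, while counts near 3 or 4 should be common.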

5.5.2 Exercises

1. Modify the above program to simulate the number of accidents per year for

15 years, when the average rate is 2.8 accidents per year.

2. Simulate the number of surface defects in the finish of a sports car for 20 cars,

where the mean is 1.2 defects per car.

3. Estimate the mean and variance of a Poisson random variable whose mean
rate is 7.2 by simulating 1000 such variates and using PROC MEANS. Compare
with the theoretical values, recalling that the variance and mean are equal for
Poisson random variables.

4. A commonly used model is the Poisson regression model

$$\log(\lambda) = \beta_0 + \beta_1 x$$

where $\beta_0$ and $\beta_1$ are constants. Take $\beta_0 = -3$ and $\beta_1 = 0.5$, and suppose
x = 0.1, 0.2, 0.3, . . . , 4.0. Calculate the corresponding values of λ. (Store these
values in a SAS variable called lambda.)

5. Refer to the previous question. Simulate Poisson random variates which have
the λ values. Plot the Poisson variates against the corresponding values of x.

5.6 Exponential Random Numbers

The exponential distribution can be used as a simple model for the time until a component

fails, or until a light bulb burns out.

A random variable T has an exponential distribution with mean λ if

$$P(T \le t) = 1 - e^{-t/\lambda}$$

for any non-negative t. The mean or expected value of T is λ and the variance of T is
$\lambda^2$.

The simplest way to simulate exponential random variables is to generate a uniform
random variable U on [0,1], and set

$$1 - e^{-T/\lambda} = U.$$

Solving this for T, we have

$$T = -\lambda \log(1 - U).$$

It can be shown that T defined in this way has an exponential distribution with mean λ. The
SAS function RANEXP can be used to generate random exponential variates with mean 1.

Syntax:

T = RANEXP(seed);

This produces an exponential variate T having mean 1. To change the mean to lambda, we

must use

T = lambda * RANEXP(seed);
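
The inverse-transform formula T = −λ log(1 − U) can also be coded directly, with UNIFORM in place of RANEXP. The following is a minimal sketch for illustration only; the seed value and the choice of five variates are arbitrary assumptions, and the output goes to the SAS log since no FILE statement is used.

```sas
/* Sketch: exponential variates with mean LAMBDA, obtained by inverting
   the distribution function. Seed and sample size are arbitrary choices. */
DATA _NULL_;
  SEED = 12238;
  LAMBDA = 2.5;
  DO I = 1 TO 5;
    U = UNIFORM(SEED);
    T = -LAMBDA*LOG(1 - U);   /* solves 1 - EXP(-T/LAMBDA) = U for T */
    PUT T;
  END;
RUN;
```

LOG is the natural logarithm in SAS, which is what the formula requires.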

5.6.1 Example

/* SIMULATION OF N EXPONENTIAL LAMBDA RANDOM VARIATES */

DATA _NULL_;

SEED = 12238;

LAMBDA = 2.5;

N = 10;

FILE 'EXPO.RVS';

DO I = 1 TO N;

T = RANEXP(SEED)*LAMBDA;

PUT T;

END;

RUN;

QUIT;

5.6.2 Exercises

1. Suppose that a certain type of battery has a lifetime which is exponentially
distributed with mean 55 hours. Simulate 1000 such lifetimes to estimate the
mean and variance of the lifetime for this type of battery. Compare with the
theoretical mean and variance.

2. The central limit theorem says that the sample mean for a random sample
of size n from a population with mean µ and variance $\sigma^2$ is approximately
normally distributed with mean µ and variance $\sigma^2/n$, where the approximation
improves as n increases.

The following programs provide a demonstration for the case where the underlying
population is exponentially distributed:

/* PROGRAM 1: Computation of averages of samples of size N coming
   from exponential lambda populations */

DATA _NULL_;
SEED = 12238;
LAMBDA = 2.5;
NSAMPLES = 1000;
N = 10;
FILE 'EXPO.AVG';
DO NSAMPLE = 1 TO NSAMPLES;  /* We are going to simulate NSAMPLES
                                independent samples of size N, computing
                                the average in each case. */
  TSUM = 0;
  DO I = 1 TO N;
    T = RANEXP(SEED)*LAMBDA;
    TSUM = TSUM + T;         /* Accumulating the sample values to form a sum */
  END;
  TAVG = TSUM/N;             /* TAVG = average of the current sample. */
  PUT TAVG;                  /* Storing sample averages for use in next
                                program where they will be plotted as a
                                histogram. */
END;
RUN;
QUIT;

/* PROGRAM 2: Histogram of averages to demonstrate CLT */

DATA EXPO_AVG;
INFILE 'EXPO.AVG';
INPUT TAVG;
PROC CHART;
VBAR TAVG;
PROC MEANS MEAN VAR;  /* We have included this procedure to compare the
                         mean and variance of the averages with what is
                         expected by the theory */
VAR TAVG;
RUN;
QUIT;

Run the above programs for N = 3, 6, 10, 20, 30, 40. Note how the histogram

begins to resemble the familiar bell-shaped curve as N increases. How large

would you say N should be in order for the normal approximation to be considered

accurate, when the underlying population is exponential?
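
The theoretical values against which the PROC MEANS output of Program 2 can be compared follow from the central limit theorem statement in this exercise. For a population with $P(T \le t) = 1 - e^{-t/\lambda}$ the mean is λ and the variance is $\lambda^2$, so with the settings LAMBDA = 2.5 and N = 10 used in Program 1 (a worked illustration):

```latex
\begin{align*}
E(\bar{T}) &= \lambda = 2.5,\\
\operatorname{Var}(\bar{T}) &= \frac{\lambda^{2}}{N} = \frac{(2.5)^{2}}{10} = 0.625.
\end{align*}
```

For the other values of N only the variance changes; the mean of the averages stays at 2.5.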

5.7 Normal Random Numbers

Standard normal random variables can be generated using the RANNOR function in SAS.

Syntax:

Z = RANNOR(seed);

This produces a value of a normal random variable Z which has mean 0 and variance 1.
Recall that if X is normally distributed with mean µ and variance $\sigma^2$, then

$$X = \mu + \sigma Z$$

where Z is normal with mean 0 and variance 1. Therefore, to simulate a random variable X having
mean mu and standard deviation sigma, use

X = mu + sigma*RANNOR(seed);

5.7.1 Example

Use simulation to estimate P(Z < 1.25) where Z is a standard normal random
variable.

Idea: Simulate a large number (say, 1000) of standard normal random variates and
compute the proportion that lie below 1.25.

DATA _NULL_;

FILE 'NORMAL.PRB';

SEED = 19218;

N = 1000;

VALUE = 1.25;

COUNT = 0;

DO I = 1 TO N;

Z = RANNOR(SEED);

IF Z < VALUE THEN COUNT = COUNT + 1;

END;

PROBEST = COUNT/N;

PUT 'AN EMPIRICAL ESTIMATE OF P(Z < ' VALUE ') IS ' PROBEST;

RUN;

QUIT;
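
The simulated proportion can be checked against the exact probability, which the PROBNORM function (listed in Chapter 6) returns directly. A minimal sketch follows; the variable name EXACT is an assumption introduced here, and the result is written to the SAS log.

```sas
/* Sketch: exact value of P(Z < 1.25) for comparison with PROBEST.
   EXACT is a hypothetical variable name. */
DATA _NULL_;
  VALUE = 1.25;
  EXACT = PROBNORM(VALUE);    /* standard normal distribution function */
  PUT 'THE EXACT VALUE OF P(Z < ' VALUE ') IS ' EXACT;
RUN;
```

The exact value is about 0.894. With N = 1000 the standard error of the empirical estimate is roughly $\sqrt{0.894 \times 0.106/1000} \approx 0.01$, so simulated estimates within about 0.02 of the exact value are to be expected.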

5.7.2 Exercises

1. Simulate 100 normal random variates having mean 51 and standard deviation
5.2. Compute the average and standard deviation of your simulated sample
and compare with the theoretical values.

2. Simulate 1000 standard normal random variates Z, and use your simulated
sample to estimate

(a) P(Z > 2.5).

(b) P(0 < Z < 1.645).

(d) P(−1.2 < Z < 1.3).

Compare with the theoretical values (i.e. consult a normal table).

3. Using the fact that a $\chi^2$ random variable on 1 degree of freedom has the same
distribution as the square of a standard normal random variable, simulate 100
independent values of such a $\chi^2$ random variable, and estimate its mean and
variance. (Compare with the theoretical values: 1, 2.)

4. A $\chi^2$ random variable on n degrees of freedom has the same distribution as
the sum of the squares of n independent standard normal random variables. Simulate a $\chi^2$
random variable on 8 degrees of freedom, and estimate its mean and variance.
(Compare with the theoretical values: 8, 16.)

5. A commonly used model is the simple regression model

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $\beta_0$ and $\beta_1$ are constants and ε is a normal random variable with mean 0 and
variance $\sigma^2$. Take $\beta_0 = -3$ and $\beta_1 = 0.5$, and suppose x = 0.1, 0.2, 0.3, . . . , 4.0.

(a) Simulate 40 independent normal variates ε, supposing σ = 0.4. (Store
these values in a SAS variable called epsilon.)

(b) Simulate the corresponding values of y. (Store these values in a SAS variable
called y.)

(c) Plot the normal variates against the corresponding values of x. Note the
pattern on the plot.

6. Re-do the previous question using σ = 1.5.

7. Repeat, using $\beta_0 = 5$ and $\beta_1 = -2$.

Chapter 6

REFERENCE: Other Data Step Functions

A SAS DATASET (used in some of the examples below):

X1    X2    X3    X4
-1    3     2     2.3
0.1   4     -1    2.1
0.5   -1    -7    2.4
1.9   -1.7  -4    1.9

6.1 Arithmetic Functions

• ABS(X) - returns the absolute value of X: |X|.

EXAMPLE: Y=ABS(X1); (Y = 1 0.1 0.5 1.9).

• MAX(X1,X2,...,XN) - returns the largest value among the values of the arguments.

EXAMPLE: Y=MAX(X1,X2,X3,X4); (Y = 3 4 2.4 1.9).

• MIN(X1,X2,...,XN) - returns the smallest value among the values of the arguments.

EXAMPLE: Y=MIN(X1,X2,X3,X4); (Y = -1 -1 -7 -4).

• MOD(N1,N2) - returns the remainder when N1 is divided by N2. A nonzero result has the
same sign as N1.

EXAMPLE: Y=MOD(X1,X2); (Y = -1 0.1 0.5 0.2).

• SIGN(X) - returns the sign of X, or 0, if X is 0.

EXAMPLE: Y=SIGN(X1); (Y= -1 1 1 1)

• SQRT(X) - returns the square root of X: $\sqrt{X}$. When X is negative, it returns a missing
value (.).

EXAMPLE: Y=SQRT(X1); (Y = . 0.31622 0.70710 1.37840).

6.2 Truncation Functions

• CEIL(X) - returns the smallest integer greater than or equal to X.

• FLOOR(X) - returns the largest integer less than or equal to X.

• INT(X) - returns the same value as FLOOR(X), if X is positive, and returns the same
value as CEIL(X), if X is negative.

• ROUND(X,Z) - returns the value of X rounded to the nearest unit of Z.
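
The differences among these four functions show up most clearly on a negative argument. The following is a small sketch for illustration; the test values 2.7 and −2.7 and the variable names C, F, I and R are assumptions introduced here, and the results are written to the SAS log.

```sas
/* Sketch: truncation functions applied to a positive and a negative value.
   C, F, I and R are hypothetical variable names. */
DATA _NULL_;
  DO X = 2.7, -2.7;
    C = CEIL(X);       /* 3 for 2.7,  -2 for -2.7 */
    F = FLOOR(X);      /* 2 for 2.7,  -3 for -2.7 */
    I = INT(X);        /* 2 for 2.7,  -2 for -2.7 */
    R = ROUND(X, 1);   /* 3 for 2.7,  -3 for -2.7 */
    PUT X= C= F= I= R=;
  END;
RUN;
```

INT simply discards the fractional part, which is why it agrees with FLOOR for positive values and with CEIL for negative ones.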

6.3 Special Mathematical Functions

• EXP(X): $e^X$.

• GAMMA(X): the complete gamma function, $\int_0^\infty t^{X-1} e^{-t}\,dt$.

• LOG(X): the natural logarithm of X.

• LOG2(X): the logarithm to the base 2 of X.

• LOG10(X): the logarithm to the base 10 of X.

6.4 Trigonometric and Hyperbolic Functions

• ARCOS(X): inverse cosine of X.

• ARSIN(X): inverse sine of X.

• ATAN(X): inverse tangent of X.

• COS(X): cosine of X.

• COSH(X): hyperbolic cosine of X.

• SIN(X): sine of X.

• SINH(X): hyperbolic sine of X.

• TAN(X): tangent of X.

• TANH(X): hyperbolic tangent of X.

6.5 Statistical functions

• CSS(X1,X2,...,XN): the corrected sum of squares

$$\sum_{i=1}^{N} X_i^2 - N\bar{X}^2$$

• CV(X1,X2,...,XN): the coefficient of variation - the standard deviation of $X_1, \ldots, X_N$
divided by the mean of $X_1, \ldots, X_N$.

• MEAN(X1,...,XN):

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$

EXAMPLE: Y = MEAN(X1,X2,X3,X4); (Y = 1.575 1.3 -1.275 -0.475).

• N(X1,...,XN): number of nonmissing arguments.

EXAMPLE: Y=N(.,4.1,.,3.7,5.7); (Y = 3).

• NMISS(X1,...,XN): number of missing values.

EXAMPLE: Y=NMISS(.,4.1,.,3.7,5.7); (Y = 2).

• RANGE(X1,...,XN): maximum minus the minimum.

EXAMPLE: Y=RANGE(X1,X2,X3,X4); (Y = 4 5 9.4 5.9).

• STD(X1,...,XN): standard deviation.

• STDERR(X1,...,XN): standard error (standard deviation divided by $\sqrt{N}$).

• SUM(X1,...,XN): $\sum_{i=1}^{N} X_i$

• USS(X1,...,XN): uncorrected sum of squares $\sum_{i=1}^{N} X_i^2$

• VAR(X1,...,XN): variance
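
These functions operate across the arguments within a single observation, not down a column. The sketch below illustrates this on the example data set given at the start of the chapter; the data set name EXAMPLE and the variable names YMEAN, YRANGE and YSUM are assumptions introduced here.

```sas
/* Sketch: row-wise statistical functions on the example data set.
   EXAMPLE, YMEAN, YRANGE and YSUM are hypothetical names. */
DATA EXAMPLE;
  INPUT X1 X2 X3 X4;
  YMEAN  = MEAN(X1,X2,X3,X4);
  YRANGE = RANGE(X1,X2,X3,X4);
  YSUM   = SUM(X1,X2,X3,X4);
  DATALINES;
-1 3 2 2.3
0.1 4 -1 2.1
0.5 -1 -7 2.4
1.9 -1.7 -4 1.9
;
RUN;

PROC PRINT DATA=EXAMPLE;
RUN;
```

The YMEAN and YRANGE columns should reproduce the values quoted in the MEAN and RANGE examples above.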

6.6 Probability functions

The following functions can be used to determine various probabilities. The syntax is similar
to that used for the random number generator functions.

• GAMINV(P,eta): returns the value of x such that

$$P = \frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$$

($0 \le P < 1$, and $\eta > 0$).

• POISSON(lambda,N): returns the probability that an observation from a Poisson distribution
is less than or equal to N. λ is the mean parameter.

i.e. POISSON(lambda,N) = $\sum_{j=0}^{N} \frac{e^{-\lambda} \lambda^{j}}{j!}$

• PROBBNML(p,n,m): returns the probability that an observation from a binomial distribution
with parameters p and n is less than or equal to m.

i.e. PROBBNML(p,n,m) = $\sum_{j=0}^{m} \binom{n}{j} p^{j} (1 - p)^{n-j}$.

• PROBCHI(x,nu): returns the probability that a r<strong>and</strong>om variable with a chi-square distribution

on ν degrees of freedom falls below x.

• PROBF(x,ndf,ddf): returns the probability that a random variable with an F distribution
on ndf numerator degrees of freedom and ddf denominator degrees of freedom falls
below x.

• PROBGAM(x,eta): returns the probability that a random variable with a gamma distribution
with shape parameter η falls below x.

i.e. PROBGAM(x,eta) = $\frac{\int_0^x t^{\eta-1} e^{-t}\,dt}{\Gamma(\eta)}$.

• PROBIT(x): returns the inverse of the standard normal cumulative distribution function.
i.e. If X is a standard normal random variable, then x is the probability that X will
take on a value less than PROBIT(x).

• PROBNORM(x): returns the probability that a standard normal random variable will fall
below x.

• PROBT(x,nu): returns the probability that a random variable with Student’s t distribution
on ν degrees of freedom will fall below x.

• TINV(p,nu): returns the pth percentile of the student’s t distribution on ν degrees of

freedom.
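
Several of the functions above come in inverse pairs. As a small illustration, the sketch below applies PROBIT and then PROBNORM; the variable names Q and P and the probability 0.975 are assumptions chosen for the example, and the results are written to the SAS log.

```sas
/* Sketch: PROBIT and PROBNORM are inverses of one another.
   Q and P are hypothetical variable names. */
DATA _NULL_;
  Q = PROBIT(0.975);    /* 97.5th percentile of the standard normal, about 1.96 */
  P = PROBNORM(Q);      /* recovers the probability 0.975 */
  PUT Q= P=;
RUN;
```

TINV and PROBT are related in the same way for the t distribution.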

6.6.1 Example

Find the probability that a r<strong>and</strong>om variable with a t distribution on 8 degrees of freedom is

less than 1.4.

i.e. P (T < 1.4) =? where T is t-distributed on 8 d.f. The following program writes the

correct probability into the file PROB.T.

DATA _NULL_;

FILE 'PROB.T';

PROB = PROBT(1.4, 8);

PUT PROB;

RUN;

6.6.2 Exercises

1. Compute the probability that a Poisson r<strong>and</strong>om variable with mean rate 11.4

takes on values less than

(a) 1.

(b) 2.

(d) 11.

(e) 15.

(f) 21.

2. Repeat the previous question for a binomial random variable with p = .45 and
n = 24.

3. The time that it takes a bus to arrive at the next stop is normally distributed
with mean 10.4 minutes and standard deviation 1.2. Compute the probabilities
that the bus will arrive in less than

(a) 5 minutes.

(b) 8 minutes.

(d) 12.5 minutes.

(e) 13.1 minutes.

(f) 15.2 minutes.
