Personal tools
You are here: Home Software Packages SAS Introduction to the Data Step -- Part 2

Introduction to the Data Step -- Part 2

Contributors: benson, mictchell
— filed under:

or Now that My Data is in a SAS Data Set, What can I do with it?

All numbers in parentheses refer to page numbers in the Second Edition of The Little SAS Book.


In Part 1 of the Introduction to the Data Step, we dealt mainly with how to use the Data Step to read raw data into a SAS data set. Remember that a data set is a table consisting of a number of observations which are referred to as ROWS, and that each row consists of a number of variables referred to as COLUMNS.

Today we are going to discuss how the Data Step can be used to modify the data set to create new Variables and produce customized data sets by subsetting, sorting and merging existing data sets.


General Information

The general format of a Data Step is :


    DATA OutputDataSet (options);	
SET
InputDataSet (options);
Statement1;
Statement2;
...
Statementn;
RUN;
  • A SAS programs consists of DATA steps and PROC steps (procedures which perform a number of specialized tasks including many statistical functions).
  • A Data Step may not include a Proc Step and vice-versa.
  • A statement ends with a semi-colon (;).
  • Both Data steps and Proc steps end with either another data or Proc Step or with the RUN statement.
  • If your program does not seem to do anything when youexecute it, check that the last statement of your program is a " RUN;" statement.
  • If your LOG (which you are OF COURSE checking carefully)flags as errors statements that you are certain are syntactically correct, check for missing semi-colons proceeding the error messages.
  • A Data Step processes the entire input Data Set, oneobservation at a time. SAS begins by executing all statements in a Data Step using the first observation. Then it gets the second observation and proceeds through all the statements, again. And so on, until the last observation has been handled.

DATA SET OPTIONS (140-143, 191)

The following are options that can be included when working with SAS data Sets. Data Sets options are enclosed in parentheses. Multiple options can be specified by putting a blank between the options.

For example the following examples drops the variable age from the output and renames the variable " b01" as state:

(DROP = age data RENAME = (b01 = state));

Included in the list of Data Set options are:

DROP variables
Does not include the named variables in the output.

FIRSTOBS=n
Begins processing with the nth observation.

IN=variable
Creates a temporary variable that indicates if the current observation is included in the output.

KEEP=variables
Includes only the listed variables in the output.

OBS=n
Stops processing with the nth observation.

RENAME=(oldname=newnames...)
Changes the name of the variables in the list.

During the writing and debugging phase of a program, we strongly urge you to use the OBS= and FIRSTOBS= options,especially if you are working with a large number of input records. It will save you processing time and therefore debugging time if you limit the number of input observations. Once you believe the bugs are out of your code, remove these options and let your program fly.

Example:


        DATA newfile;
          SET oldfile (firstobs=5000 obs = 6000);
	    SAS statement...;
	    RUN;

This will only process 1000 observations, numbers 5000-6000 of the oldfile data set.

You used DROP, KEEP and RENAME options as part of your homework. RENAME= will become very useful when we discuss how to join (MERGE) two data sets.

The IN= option creates a temporary variable in a SET, MERGE or UPDATE statement, especially during MERGE statements. The variable is NOT written to the output file, and exists only during the current data step. It indicates which of your input data sets contributed to your output and can be used to decide whether to actually write that observation to your output Data Set.

Relational Operators

The following relational operators can be used in SAS programs, especially as part of IF-THEN-ELSE statements. These operators determine whether the specified relationship between two values is true. The operators return the value 1 if the relationship is true, 0 if it is false.

= EQ
EQUAL TO - the relationship holds if the two quantities are identical.

^= NE
NOT EQUAL TO - the relationship holds if the two quantities are different.

> GT
GREATER THAN - the quantity on the left-hand side of the operator is larger than the quantity on the right-hand side.

< LT
LESS THAN - the quantity on the left-hand side of the operator is less than the quantity on the right-hand side.

>= GE
GREATER THAN OR EQUAL TO - the quantity on the left-hand side of the operator is larger than, or equal to the quantity on the right-hand side.

<= LE
LESS THAN OR EQUAL TO - the quantity on the left-hand sideof the operator is less than, or equal to the quantity on the right-hand side.

IN
THE IN OPERATOR holds if the quantity is equal to any value in a list. This replaces multiple IF-then-else statements.

The form of the statement is:
expression IN(v1,v2,...,vn);

The following example groups states into regions, in this case, New England;

	If state =  ME  then region =  NEW ;      /*Maine         */	
else if state = VT then region = NEW ; /*Vermont */
else if state = NH then region = NEW ; /*New Hampshire */
else if state = MA then region = NEW ; /*Massachusetts */
else if state = RI then region = NEW ; /*Rhode Island */
else if state = CT then region = NEW ; /*Connecticut */

As you can see, to continue this for all 50 states would be tedious and error prone. It can be rewritten as:

If state IN (ME, VT, NH, MA, RI,CT) then region = NEW ;

LOGICAL OPERATORS (72-73, 92)

The following logical operators can be used in the SAS Data step, especially as part of if-then-else statements.

AND - THE LOGICAL AND OPERATOR.

If both conditions linked by the AND are true, then the expression is true.

If either condition linked by the AND is false then the entire statement is false.

For example:


       If (sex =  "f" ) AND (MaritalStatus =  "m") then
           output wives;

This will create an output data set containing all women who are married.

OR- THE LOGICAL OR OPERATOR

If either condition linked by the OR is true, then the entire expression is true.

If both conditions linked by the OR are false then the entire expression is false.

For example:


       If (MaritalStatus = "m" ) OR (MaritalStatus =  "c" ) then 
           output spouses;

This will create an output data set of all people (male or females) who are either married or cohabiting.

NOT - THE NEGATION OPERATOR

This reverses the logical value of the condition.

For example:


       If NOT (MaritalStatus EQ  "m" ) AND NOT (Marital Status EQ  "c" )
then output singles;

This will create an output data set only if both conditions are met. That is if both the person is NOT married and the person is NOT cohabiting. It is the same as:



      If (MaritalStatus NOT EQ  "m" ) OR (Marital Status NOT EQ  "c")
then output singles;

A note of caution: Be absolutely certain to check both conditions when using the NOT operator. It is easy to become confused to write backward code.

One important note concerning logical operators in SAS is that any numeric value other than 0 or missing is true. Thus numeric variables or functions which return numeric values can be used as a condition in a logical statement.

For example:


      If nbrchildren then parent =  "Y" ;

This is the same as:


      If nbrchildren > 0 then parent =  "Y" ;  

If multiple logical statements are being used, parentheses can be used to force the order of evaluation. Without parentheses, SAS evaluates expressions around the AND operator before the OR. In addition to verifying the order of operation, parentheses make the program easier to read.

For example the following code, which will generate an output Data Set containing observations of females who are less than 18 AND ALL people (male and female) who are over 65 years old.


      If AGE > 65 OR AGE < 18 AND SEX = "f"  then output;

Is the same as:


      If AGE > 65 OR (AGE < 18 AND SEX =  f)  then output;

But not the same as:


      If (AGE > 65 OR AGE < 18) AND SEX =  f  then output;

Which will generate an output data set containing observations of females who are either less than 18 or older than 65.

CONDITIONAL STATEMENTS (72-77)

Conditional statements allow actions to be performed only when specified conditions are met.

The most frequently encountered conditional statement is the IF-THEN-ELSE statement. Its general format is:


     IF condition THEN action;
ELSE action;
  • The ELSE portion of the statement can be omitted, if it is not needed.

  • Either a single statement, or a group of statements can be executed. If multiple statements are executed, they must be enclosed in a DO; . . . END; structure.

  • Multiple IF-THEN/ELSE statements can be used together in a data step.

  • Please notice the the semi-colon (;) at the end of both the "IF" statement and the one at the end of "ELSE" statement.

The following are the possible forms of the IF-THEN-ELSE statement:


     IF condition THEN statement;/* The "Else" is not needed */
 
     IF condition THEN DO;       /* The "Else" is still not  */
         statement1;             /* needed, a # of statements*/        
 
statement2; /* is executed when the */
... /* condition is true */ statementn; END; IF condition THEN statement; /* Normal IF-THEN-ELSE */
ELSE statement;

IF condition THEN statement; /* Single action if cond. */
ELSE DO; /* true, multiple actions */ statement1; /* if condition is false */
...
statementn;
END;

IF condition1 THEN statement; /* Multiple IF-THEN-ELSE */
ELSE IF condition2 THEN statement;
ELSE IF condition3 THEN statement;
ELSE statement;

Note that in this last case, when a condition is met, the statement is executed, and that none of the other conditions are checked.

Make certain that every DO has a matching END.

EXAMPLES of "IF" statements:

  • In the first example only people who ever attended school are selected for the new Data Set.
           DATA educated;
    SET survey;
    If AttendedSchool = "Y" then output educated;
    run;
  • Now we decide to break up the file of educated people into men and women.
           DATA educated;
    SET survey;
    If AttendedSchool = "Y" then output educated;

    DATA edu_men edu_women;
    SET educated;
    If sex = "m" then output edu_men;
    else output edu_women;
    run;
  • The last example was not efficient. We first created an intermediate Data Set educated whose only purpose was to allow us to eliminate all uneducated people before separating the people we wanted to analyze by sex. In addition, any value other than "m" would be considered "non male" and output to the edu_woman file including missing or mis-coded data. Remember input and output operations are time and resource hogs.
           DATA edu-men edu_women;
    SET survey;
    if (attendedSchool = "Y") AND (sex = "m") then
    output edu_men;
    else if (attnededSchool = "Y") and (sex = "f") then
    output edu_women;
  • This final example uses the DO-END structure to the IF statement. In this case we have simplified the logic by removing a second instance of the same comparison statement: If attendedSchool = "Y"
           DATA edu-men edu_women;
    SET survey;
    if attendedSchool = "Y" then DO;
    If sex = "m" then
    output edu_men;
    else if sex = "f" then
    output edu_women;
    end;

If there is a long series of mutually exclusive comparisons, the SELECT statement can be used. The format of the SELECT statement is:

    SELECT  ( select-expression);  
WHEN (expression-1)statement;
WHEN (expression-2)statement;
WHEN (expression-n)statement;
OTHERWISE statement;
END:

The optional select-expression is used with a variable that evaluates to a single value.

Document Actions

Copyright ©2014, The Pennsylvania State University | Privacy and Legal Statements
Contact the Help Site Administrator | Last modified Aug 13, 2008 | Weblion Partner