Personal tools
You are here: Home Software Packages SAS Recoding Variables using Formats, Functions and Macros

Recoding Variables using Formats, Functions and Macros

— filed under: ,

This document describes a number of ways you can work with the variables in your data set to modify their values or create new variables from them. A variety of techniques are discussed:

Arithmetic Recodes

Sometimes in survey data you will find that responses to questions have "reversed" scales, that is a question is stated so that the response has the opposite meaning to another question:
                        Workshop Evaluation

Agree/Disagree
1    2    3       Q1. I felt the instructor was well prepared.
1    2    3       Q2. I felt the workshop was useful.
1    2    3       Q3. I felt the handout were inadequate.
A "1" for question 3 would indicate the opposite of a "1" for questions 1 & 2 in terms of quality of instruction. To reverse the response, add the lowest value in the scale to the highest value (1+3) and subtract the value of the response.
  data recodes;
   set mydata;
   q3r = 4 - q3; 
  run;

Functions

A function is a built in routine that performs an operation on an individual observation. Functions are very handy and can save you considerable time and effort in programming. There are arithmetic functions such as ones to get the integer or absolute value of a number. There are date and time functions as well as character functions.

Function are used in the data step. All functions must have parentheses and any arguments to the function are separated by commas. You can use multiple functions in the same SAS statement such as:

 
    age = abs(int(yeardif)); 
This example uses both the absolute value and integer functions. The age variable would have the absolute value of the integer of the variable yeardif. So if yeardif = -14.678 then age = 14.

This is a quick review of some of the most common functions: There is a complete list of functions in the SAS documentation and some of the more common functions are mentioned in "The Little SAS Book".

ARITHMETIC FUNCTIONS:
INT - This function returns the integer part of a number.
    age = int(14.6789);

 

In this example age = 14.

ABS - This function returns the absolute value of a number.

 

   
    days = abs(-12678);

 

In this example days = 12678.

SUM - This function calculates the sum of the arguments. This is different than proc summary since in proc summary you are summarizing a variable in an entire data set while the function sum works on multiple variables in a single observation. Example:

    grade = sum(test1,test2,test3);

If test1 = 80, test2 = 85, and test3 = 90 then grade = 255. This is an important function when you have missing values since the function SUM ignores missing values.

Note the difference between using SUM and SAS arithmetic with the following data.

test1 test2 test3
85 90 95
70 . 90
   grademth  = test1 + test2 + test3;
   gradesum  = sum(test1,test2,test3);

 

test1 test2 test3 grademth gradesum
85 90 95 270 270
70 . 90 . 160

IF ANY OF THE VALUES ARE MISSING THEN YOUR CALCULATED VARIABLE WILL BE SET TO MISSING IF YOU ADD THEM TOGETHER!

The function SUM ignores the missing values.

Other common arithmetic functions include MIN, MAX, and MEAN which return the minimum, maximum or mean of the arguments. They also ignore the missing values but you must have at least 2 non missing values.

NMISS - This function returns the number of missing values.

    misstest = nmiss(test1,test2,test3);

 

If test1 = 70 and test2 = 40 and test3 = . then misstest = 1 since there is 1 variable with a missing value.

N - This function returns the number of non missing values.

    comptest = n(test1,test2,test3);

 

If test1 = 70 and test2 = 40 and test3 = . then comptest = 2 since there are 2 tests with non missing values.

 

DATE TIME FUNCTIONS:
A SAS date is the number of days since January 1, 1960. You can create a SAS date using the MDY function.

MDY - This function produces a SAS date from month, day and year variables

    dob  = mdy(birthmon,birthdy,birthyr);

 

Functions MONTH, DAY and YEAR will return their respective parts of the SAS date.

 


    birthyr = year(dob);  

CHARACTER FUNCTIONS:
SUBSTR - This function allows you to extract part of a string. You have to tell it where to begin to extract and how many characters to extract. If you have a variable testdate which has the value of 01231994 and you want to get the year, the following code:
    testyr = substr(testdate,5,4);

 

will extract the year by starting in positon 5 and extracting the next 4 characters.

INPUT - This function defines an informat for a value. One use is to convert variables from character to numeric by using numeric informats.

    varnum = input(varchar, 1.);

The 1. is a numeric informat.

PUT - This function defines an output format for a value. It is commonly used to convert from numeric to character using formats. The format for the PUT function must be the type that you are converting from (in this case numeric).

    idcar = put(idnum,z2.0);

If idnum = 1 then idcar would equal 01 since the z format will put in leading zeros when necessary.

The SUBSTR and PUT functions are also used to get the state and county codes from a fips (US Federal Information Processing Standards) code: Example:

    length state county 5.;

state = substr(put(fips,z5.)1,2);

county = substr(put(fips,z5.),3,3);

 

 

HANDY MACROS AVAILABLE ON THE POPNET:

There are several macros available on the PopNet that can significantly reduce your code.

%nvals MACRO:

The %nvals macro counts the number of numeric variables in a list that are equal to the first value in the list. You can use it in sevaral ways:

* If any of the variables = 1 then set a variable to 1.

 

    if %nvals (1, house1 house2 house3 house4 house5) then husband = 1;

In this example, house1-house5 are the relationships of the people in the household to the person taking the survey (fathers, mothers, husband ect.). Husbands have a value of 1, mothers have a value of 2, etc.. If any of the household roster variables have a value of 1 then we know there is someone in the household with a status of husband.

* Search for households in which there are 2 and only 2 sons. The household roster code for son is 3.

if %nvals (3, house1 house2 house3 house4 house5) = 2 then sons2 = 1;

If there are 2 and only 2 sons in the household then the variable sons2 would equal 1.

* Count the number of variables equal to the value in variable 'birth1'.

 

    sameborn = %nvals (birth1, birth2 birth3 birth4 birth5);

If birth1 - birth5 represent the sex of babies, then this macro will count the number of babies born that have the same sex as the first born (birth1).

%age MACRO:

The %age macro calculates a persons age in years given their birthdate and a reference date (BOTH as SAS Dates)

    age =   %age(birthdate, reference date);

In this example dob = the SAS date of '01AUG60' and when = the SAS date of '01 AUG98'.

 

   agenow = %age(dob,today());         /* returns age today */
   agethen = %age(dob,'22OCT65'd);     /* returns 5  */
   agewhen = %age(dob,when);           /* returns 38 */
   age21st = %age(dob, '02OCT2005'd);  /* returns 45 */
 

 

IN OPERATOR:
Most of you are familar with the Boolean operators of eq, ne , gt, and, or, etc. One handy operator that is frequently overlooked is IN. IN allows you to check if something is contained in a list. Example:
    If state in ('PA', 'NJ', 'NY') then tristate = 1; 

 

Test your knowledge of Functions & Pre-Defined Formats: Contents of PRACTICE.LABOR
state fips
AL 1001
AL 1003
    libname practice '/home/info/seminars/practice';
    data usdata;
    set practice.labor (rename= (state=cstate));
    length state county 5.;
    state=stfips(cstate);
    county = substr(put(fips,z5.),3,3);
    run;

Contents of USDATA (after):

cstate fips state county
AL 1001 1 1
AL 1003 1 3

Recoding Using User-defined Formats

Q: I never quite got the hang of recoding variables in SAS and don't often have to do it. However, now I do. Can someone explain briefly the SAS approach to recoding. I liked SPSS's recode statement. What is the equivalent in SAS? Here's an example of something I need to do, in SPSS syntax:
   RECODE nc (4 thru 9=1) (10,3=2) (2=3) (1=4).
How might I do the same in SAS?

Thanks,
"A new SAS User"
Children's Hospital, Boston

A: (by Sig Hermansen) SAS and SPSS take different approaches to the problem of recoding variables. SPSS changes the value stored and provides a simple method of doing that. SAS offers a convenient way to change the format of variables from one coding scheme to another for purposes of computing statistics, printing values, or comparing values. It takes some time and effort to get used to methods of applying formats to variables. Nonetheless, recoding using formats offers a number of advantages. The SAS equivalent for your example is accomplished by defining the recoding scheme in a PROC FORMAT

          
proc format;          
value rcode 4-9  = 1                         
            10,3 = 2                         
            2    = 3                         
            1    = 4
            ;
value dichot  10 = 1 
                        
           other = 0
           ;             
value topcat  0 = 'not in top category'                          
              1 = 'in top category'
              ;
run;

And then associating the scheme with a variable (or variables) on an "as needed" basis. With this format, you can now compute frequencies of the recoded variables using the original values...

         
libname practice '/home/info/seminars/practice';
proc freq data=practice.sasstate;
tables nc;
run;

Or using the 5 level recode of the values...

 

        
proc freq data=practice.sasstate;
tables nc;
format nc rcode.;         
run;

Or a dichotomous recode of the the value...

proc freq data=practice.sasstate;            
tables nc;       
format nc dichot.;        
run;
(*** Note the period in the numeric format names rcode. and dichot.)

The same format statement will work the same way in a proc print and some other SAS procedures. You do not need to run a DATA step to do the recoding.

You can also assign the formatted values to new variables by using the put statement, as in

              
x=put(nc,rcode.);
for x of type character, or
               
y=input(put(nc,rcode.),8.);
for y of type numeric.

Note that formats in SAS are used to provide "value labels" for variables as well as performing "recodes". The format topcat. will display the value labels rather than the values 0,1 in a frequency distribution of "y".

proc freq;            
tables y;            
format y topcat.;         
run;

Though a bit less direct, the formatting method makes it easy to switch from one coding scheme to another, to specify complex recoding schemes, and to differentiate coded values from the original values in the data sources. To assist, SAS provides a number of standard formats (and informats for input operations as well). All this serves to overriding principles of data management: Avoid when possible changing source data in a way that will cause a loss of information if it becomes necessary to restart a program. And avoid creating redundant data.

Remember that using formats is a two-step process:

  • Define the format scheme in a PROC FORMAT
  • Associate the format with the variable(s) as needed with a FORMAT statement

Click here for more information on formats.

Recoding on Condition

/* All of the following data steps use the CLASS dataset located in the SASUSER library. The dataset is included in base SAS. To acess the sample datasets, enter SAS ASSIST, click on the "setup" box, highlight the "Sample data set" line, then hit the return key, highlight "Create sample data sets" in the box, press the return key, and then select the data sets to create.

The first data step uses only IF-THEN-ELSE statements to recode variables. If there are many cases or complex relations to examine, this is probably not the method to use. In addition, debugging and maintenance can be a problem. */

title1 'IF-THEN-ELSE only recoding';  
data temp1;   
set sasuser.class; 
if (sex eq 'M') and (age eq 11) then sexage = 'M11_12';  
else if (sex eq 'M') and (age eq 12) then sexage = 'M11_12'; 
else if (sex eq 'M') and (age eq 13) then sexage = 'M13_14';  
else if (sex eq 'M') and (age eq 14) then sexage = 'M13_14';  
else if (sex eq 'M') and (age eq 15) then sexage = 'M15_16';  
else if (sex eq 'M') and (age eq 16) then sexage = 'M15_16';  
else if (sex eq 'F') and (age eq 11) then sexage = 'F11_12';  
else if (sex eq 'F') and (age eq 12) then sexage = 'F11_12';  
else if (sex eq 'F') and (age eq 13) then sexage = 'F13_14';  
else if (sex eq 'F') and (age eq 14) then sexage = 'F13_14';  
else if (sex eq 'F') and (age eq 15) then sexage = 'F15_16'; 
else if (sex eq 'F') and (age eq 16) then sexage = 'F15_16';
run;
proc freq;
table sexage; 
run;

The next data step also uses only IF-THEN-ELSE statements, but uses DO-END blocks to "organize" the code. Debugging and maintenance can still be a problem with this code if there are many cases.

title1 'IF-THEN with DO-END recoding';
data temp2;   
set sasuser.class;   
if (sex eq 'M') then do;      
    if age eq 11 then sexage = 'M11_12';      
    else if age eq 12 then sexage = 'M11_12';           
    else if age eq 13 then sexage = 'M13_14';                
    else if age eq 14 then sexage = 'M13_14';                     
    else sexage = 'M15_16';      
end;   
if (sex eq 'F') then do;
    if age eq 11 then sexage = 'F11_12';      
    else if age eq 12 then sexage = 'F11_12';           
    else if age eq 13 then sexage = 'F13_14';                
    else if age eq 14 then sexage = 'F13_14';                     
    else sexage = 'F15_16';
end;

proc freq; 
table sexage;  
run;

The following data step uses the IF-THEN statements along with the SELECT operator to recode the data. The "in" operator is also used.

title1 'IF-THEN with SELECT recoding';
data temp3;    
set sasuser.class;  
if (sex eq 'M') then select;       
    when (age in(11,12)) sexage = 'M11_12';      
    when (age in(13,14)) sexage = 'M13_14';       
    when (age in(15,16)) sexage = 'M15_16';       
    otherwise;       
end;   
else select;  /* If not male, hopefully female */     
    when (age in(11,12)) sexage = 'F11_12';      
    when (age in(13,14)) sexage = 'F13_14';       
    when (age in(15,16)) sexage = 'F15_16';       
    otherwise;       
end;  

The following data step used 2 SELECT statements to do the recoding.

proc freq; 
table sexage;   
run;

title1 'SELECT only recoding';
data temp4;  
set sasuser.class;   
select (sex);   
when ('M') 
     select;     
       when (age in(11,12)) sexage = 'M11_12';      
       when (age in(13,14)) sexage = 'M13_14';      
       when (age in(15,16)) sexage = 'M15_16';     
       otherwise;       
     end;    
when ('F') 
     select;       
       when (age in(11,12)) sexage = 'F11_12';      
       when (age in(13,14)) sexage = 'F13_14';      
       when (age in(15,16)) sexage = 'F15_16';       
       otherwise;      
     end;    
otherwise;   
end;

proc freq; 
table sexage;
run;

The next data step uses the IF-THEN-ELSE statement along with 2 user-defined formats to do the recoding.

proc format;
value mage 
  low-10 = 'Invalid'             
  11-12  = 'M11_12'               
  13-14  = 'M13_14'               
  15-16  = 'M15_16'               
  17-high= 'Invalid'
  ;    
value fage 
  low-10 = 'Invalid'
  11-12  = 'F11_12'               
  13-14  = 'F13_14'               
  15-16  = 'F15_16'               
  17-high= 'Invalid'
  ;
title1 'IF-THEN-ELSE recoding with user-defined formats'; 
data temp5;
set sasuser.class;  
if sex eq 'M' then agesex=put(age,mage.);     
else agesex=put(age,fage.);
run;
proc freq;    
table agesex;
run;

The output from all 5 data steps is identical.

Document Actions

Copyright ©2009, The Pennsylvania State University | Privacy and Legal Statements
Contact the Help Site Administrator | Last modified Aug 21, 2008 | Weblion Partner