Personal tools
You are here: Home Software Packages SAS Rollups

Rollups

— filed under: ,

Collapsing multiple observations into one observation

Rollups: Collapsing multiple observations into 1

Problem: I've got a data set with multiple observations per person. I'd like to collapse all of the info for one person into one observation. The file looks like this:

     OBS   PERSON   SCORE

1 101 65
2 101 39
3 102 53
4 102 83
5 103 38
6 104 94
7 104 78
8 104 84
9 104 56
10 104 97

Where person is a unique id and score represents the result of up to five test scores. But I want it to look like this:

     OBS   PERSON   NTIMES  S1  S2  S3  S4  S5 

1 101 2 65 39
2 102 2 53 83
3 103 1 38
4 104 5 94 78 84 56 97

Where ntimes represents the number of observations for a person in the original file and s1 - s5 represent the five (maximum) test scores.

Solution: This is a program shell that not only solves the problem above, but an understanding of the program reveals alot about how SAS processes data. Take a look at the code and try running the program /home/info/seminars/SASstuff/handy.sas.


Line  SAS statement               Explanation

0001  data perslvl;            name new person level file
0002  set sasstuff.perstest;   read 1 obs from person/test file
0003  by person;               note id of person being processed
0004  retain ntimes s1-s5;     don't automatically reset these variables
0005  array scores{5} s1-s5;   declare array to hold 5 scores
0006  drop score i;            don't output these variables to new file
0007  if first.person then do; do ONLY if it's person's first obs
0008     do i = 1 to 5;
0009        scores{i}=.;       clear scores array
0010     end;
0011     ntimes=0;             AND reset obs counter
0012  end;                     end if person's first observation  
0013  ntimes=ntimes+1;         count current obs
0014  scores{ntimes}=score;    place score into position in array
0015  if last.person then      ONLY if it's person's last obs
0016    output;                output an obs to new file
0017  run;  

The program assumes that you are familiar with arrays and do loops.

What may be new to you:

  •     'By' group processing

The lines:

0002  set sasstuff.perstest;
0003 by person;

tells SAS that we know that all observations with the same person id are together in the data set. If the data set was not in order by person, it would have to be sorted first. Setting the data set BY person tells SAS to automatically keep track of which person we are looking at by using it's two system variables: first.variable and last.variable. When the data is read, if SAS sees a value in the variable 'person' that it hasn't seen before, it sets first.person=1. We can then ask the question 'Is this the person's first record?' with statement 0007:

0007  if first.person then do; 

We can also know if we are on the person's last record. Because if it is the person's last observation, the system variable last.person will be set to 1. (Unfortunately, we can't know if we are on the 3rd record or the next to the last record with this technique.)

  •     Retain

The DATA step works like a loop, repeating the statements over and over with each input record. There's a work area used by the data step called the Program Data Vector. SAS copies your input record, builds your recodes, and makes notes to itself there. Each time through the loop the area is erased in preparation for the next incoming record.

If you want SAS to remember a value from the first record when you are reading the second record, you've got to say, 'Hey, don't erase that value'. In this program we want SAS to remember, as it reads say record 3, that it already counted records 1 & 2 in the variable 'ntimes' and to remember the two scores that we loaded into the 1st & 2nd elements ofthe array. So we say:

0004  retain ntimes s1-s5;

We can change the values, but SAS won't automatically erase them.

  •     Explicit output

If we don't type an output statement in the DATA step, SAS assumes that we are doing straighforward 'one in, one out' processing. If that's not what we want, we've got to say so. In our example, we only want as many observations in our new data set as there are persons. We want the new observation to include information from all the records for that person. So we've got to read through all records for a person without outputting anything until we get to their last record.

0015  if last.person then
0016 output;

You can do all kinds of fancy data reshaping with this code. Try figuring out how you could create an event history type file if you started with a data set of one observation per person containing an array of scores and wanted to end up with a data set having one observation per person/score.

Document Actions

Copyright ©2008, The Pennsylvania State University | Privacy and Legal Statements
Contact the Help Site Administrator | Last modified Aug 13, 2008 | Weblion Partner