Rollups
Collapsing multiple observations into one observation
Problem: I've got a data set with multiple observations per person. I'd like to collapse all of the info for one person into one observation. The file looks like this:
OBS PERSON SCORE
1 101 65
2 101 39
3 102 53
4 102 83
5 103 38
6 104 94
7 104 78
8 104 84
9 104 56
10 104 97
Where person is a unique id and score represents the result of up to five test scores. But I want it to look like this:
OBS PERSON NTIMES S1 S2 S3 S4 S5
1 101 2 65 39
2 102 2 53 83
3 103 1 38
4 104 5 94 78 84 56 97
Where ntimes represents the number of observations for a person in the original file and s1 - s5 represent the five (maximum) test scores.
Solution: This is a program shell that not only solves the problem above, but an understanding of the program reveals alot about how SAS processes data. Take a look at the code and try running the program /home/info/seminars/SASstuff/handy.sas.
Line SAS statement Explanation
0001 data perslvl; name new person level file
0002 set sasstuff.perstest; read 1 obs from person/test file
0003 by person; note id of person being processed
0004 retain ntimes s1-s5; don't automatically reset these variables
0005 array scores{5} s1-s5; declare array to hold 5 scores
0006 drop score i; don't output these variables to new file
0007 if first.person then do; do ONLY if it's person's first obs
0008 do i = 1 to 5;
0009 scores{i}=.; clear scores array
0010 end;
0011 ntimes=0; AND reset obs counter
0012 end; end if person's first observation
0013 ntimes=ntimes+1; count current obs
0014 scores{ntimes}=score; place score into position in array
0015 if last.person then ONLY if it's person's last obs
0016 output; output an obs to new file
0017 run;
The program assumes that you are familiar with arrays and do loops.
What may be new to you:
- 'By' group processing
The lines:
0002 set sasstuff.perstest;
0003 by person;
tells SAS that we know that all observations with the same person id are together in the data set. If the data set was not in order by person, it would have to be sorted first. Setting the data set BY person tells SAS to automatically keep track of which person we are looking at by using it's two system variables: first.variable and last.variable. When the data is read, if SAS sees a value in the variable 'person' that it hasn't seen before, it sets first.person=1. We can then ask the question 'Is this the person's first record?' with statement 0007:
0007 if first.person then do;
We can also know if we are on the person's last record. Because if it is the person's last observation, the system variable last.person will be set to 1. (Unfortunately, we can't know if we are on the 3rd record or the next to the last record with this technique.)
- Retain
The DATA step works like a loop, repeating the statements over and over with each input record. There's a work area used by the data step called the Program Data Vector. SAS copies your input record, builds your recodes, and makes notes to itself there. Each time through the loop the area is erased in preparation for the next incoming record.
If you want SAS to remember a value from the first record when you are reading the second record, you've got to say, 'Hey, don't erase that value'. In this program we want SAS to remember, as it reads say record 3, that it already counted records 1 & 2 in the variable 'ntimes' and to remember the two scores that we loaded into the 1st & 2nd elements ofthe array. So we say:
0004 retain ntimes s1-s5;
We can change the values, but SAS won't automatically erase them.
- Explicit output
If we don't type an output statement in the DATA step, SAS assumes that we are doing straighforward 'one in, one out' processing. If that's not what we want, we've got to say so. In our example, we only want as many observations in our new data set as there are persons. We want the new observation to include information from all the records for that person. So we've got to read through all records for a person without outputting anything until we get to their last record.
0015 if last.person then
0016 output;
You can do all kinds of fancy data reshaping with this code. Try figuring out how you could create an event history type file if you started with a data set of one observation per person containing an array of scores and wanted to end up with a data set having one observation per person/score.

