Merging
Combining data from 2 or more files
If you merge data sets and they have variables with the same name (other than the BY variables), the variable value from the second data set will overwrite the variable value from the first data set. You should rename one of the variables.
A variety of merge situations are discussed:
- 1 to 1 match merge with a by value
- non matches with a by value
- Using the IN option to limit the output records
- Creating more than 1 data set with a merge
- 1 to many merges
1 to 1 match merge with a BY value
Simple Merge Example:
This example merges a file of demographic data sorted by zip code and a file of health statistics sorted by zip code into a single file called pophlth. In this case both data sets have the exact same zip codes but have different variables. The files will be merged together by zip code. NOTE: You can merge on more than one variable.
data sastmp.pophlth; /* new data set */
merge sastmp.demog /* demography file */
sastmp.health; /* health stat. file */
by zip; /* merge by zip code */
run;
NOTE: Both files must have the variable zip and be in order by zip. If the files are not in zip order they must be sorted before merging. NOTE: if the BY variables have different names you can rename one as in the example below:.
data sastmp.pophlth; /* new data set */
merge sastmp.demog (rename = (oldzip = zip))
sastmp.health; /* health stat. file */
by zip; /* merge by zip code */
run;
If the demography file has the following observations:
oldzip zippop employ 16801 111111 123456 16802 222222 456678 16803 333333 678901
and the health file has the following observations:
zip cancer heart 16801 .01 .02 16802 .02 .01 16803 .03 .04
then the new file sastmp.pophlth will have the following observations:
zip zippop employ cancer heart 16801 111111 123456 .01 .02 16802 222222 456678 .02 .01 16803 333333 678901 .03 .04
Life usually isn't this simple, so the following sections explain variations of the merge statement.
non matches with a by value
The previous example used match merging with a by value to merge the 2 SAS files together that had the same zip codes in both files. What happens if one file has observations that don't match the other file's observations?
If the demography file has the following observations:
zip zippop employ 16801 111111 123456 16802 222222 456678 16803 333333 678901 16804 444444 123456
and the health file has the following observations:
zip cancer heart 16801 .01 .02 16802 .02 .01 16803 .03 .04
then the new file sastmp.pophlth has the following observations:
zip zippop employ cancer heart 16801 111111 123456 .01 .02 16802 222222 456678 .02 .01 16803 333333 678901 .03 .04 16804 444444 123456 . .
Since there isn't a matching health record for the demographic file's zip code 16804, the health variables were set to missing(.) for this record.
Using the IN option to limit the output records
In the previous examples, all of the records from both files ended up in the new SAS file. The following example will show you how to limit the records going to the output file. In this example you are working with a subset of all US zip codes (zip codes 16801-16804) located in the file demog. The health data file contains records for all US. zip codes. You want to get the health variables for only those zip codes in the demography file (16801-16804).
data pophlth;
merge demog (IN=indemog)
health (IN=inhealth);
by zip;
if indemog; /* limits observations */
run;
The output file from this program would contain an observation for the four zip codes in the demography file with any associated information from the health file. If there wasn't a matching zip code in the health file then the health variables would be missing.
If you wanted to output records only if there was a demography AND a health record for the same zip code you would use the following code:
data pophlth;
merge demog (in=indemog)
health (in=inhealth);
by zip;
if indemog and inhealth;
run;
Creating more than 1 data set with a merge
Sometimes it is handy to create more than one data set from the merge. This technique is commonly used in cleaning data. For example, you have a master file of all the people that you sent surveys to that contains their demographic information and another file with the returned survey information. Each file has the variable surveyid which is the survey id number. Unfortunately you have 100 survey subjects in your master file but only 75 observations in your survey file. How do you find out which survey subjects haven't returned their survey? The following code will create 3 data sets. Data set returned contains the demographic data from the master file and the returned survey information. Data set outstand contains the demographic information from the master file for the people that did NOT return their survey. Data set badid contains the returned survey information for people that have survey information but are not in the master file. You need to check their id's for typos.
data returned /* survey returned */
outstand /* survey outstanding */
badid; /* bad survey id number */
merge master (in=inmaster)
returned (in=inout);
by surveyid;
if inmaster and inout then output returned;
else if inmaster and not inout then output outstand;
else if inout and not inmaster then output badid;
run;
1 to many merge
Sometimes you want to merge several observations in one data set with 1 observation in another data set. For example, you have data by zip code and you want to get state or county information for those zip codes. Since there is 1 state for multiple zipcodes, you have a 1 to many merge. The code for a 1 to many merge is the same as for a 1 to 1 merge. Warning: make sure that you have a 1 to many merge and not a many to many merge (they are nasty and to be avoided).
zip zippop state 16801 111111 42 16802 222222 42 16803 333333 43
and the state file has the following observations:
state stpop hospital 40 1234567 123 42 2345678 234 43 5678900 678
then the new file has the following observations:
zip zippop state stpop hospital 16801 111111 42 2345678 234 16802 222222 42 2345678 234 16803 333333 43 5678900 678

