Personal tools
You are here: Home Software Packages SAS Merging

Merging

— filed under: ,

Combining data from 2 or more files

If things were easy, all the data that you would ever need would be in a single data set. Since this is the real world, you may need variables located in more than one data set. A merge is used to join observations from two or more data sets into a new data set. Since this is a common need, SAS has provided an easy method of combining SAS data sets with the MERGE statement. Warning:

If you merge data sets and they have variables with the same name (other than the BY variables), the variable value from the second data set will overwrite the variable value from the first data set. You should rename one of the variables.

 

A variety of merge situations are discussed:

 

1 to 1 match merge with a BY value

Simple Merge Example:

This example merges a file of demographic data sorted by zip code and a file of health statistics sorted by zip code into a single file called pophlth. In this case both data sets have the exact same zip codes but have different variables. The files will be merged together by zip code. NOTE: You can merge on more than one variable.

   data sastmp.pophlth;     /* new data set      */ 
   merge sastmp.demog       /* demography file   */
         sastmp.health;     /* health stat. file */ 
   by zip;                  /* merge by zip code */
   run;

NOTE: Both files must have the variable zip and be in order by zip. If the files are not in zip order they must be sorted before merging. NOTE: if the BY variables have different names you can rename one as in the example below:.

 
   data sastmp.pophlth;     /* new data set      */
   merge sastmp.demog (rename = (oldzip = zip)) 
         sastmp.health;     /* health stat. file */ 
   by zip;                  /* merge by zip code */
   run; 
 

 

If the demography file has the following observations:


  oldzip        zippop          employ
  16801	        111111		123456
  16802         222222		456678
  16803	        333333		678901

and the health file has the following observations:


  zip             cancer          heart
  16801           .01             .02
  16802           .02             .01
  16803           .03             .04

then the new file sastmp.pophlth will have the following observations:


  zip     zippop       employ      cancer  heart 
  16801   111111       123456       .01     .02
  16802   222222       456678       .02     .01
  16803   333333       678901       .03     .04

Life usually isn't this simple, so the following sections explain variations of the merge statement.

non matches with a by value

The previous example used match merging with a by value to merge the 2 SAS files together that had the same zip codes in both files. What happens if one file has observations that don't match the other file's observations?

If the demography file has the following observations:


  zip    zippop     employ
  16801  111111     123456
  16802  222222     456678
  16803  333333     678901
  16804  444444     123456

and the health file has the following observations:


  zip    cancer     heart
  16801	   .01       .02
  16802    .02       .01
  16803    .03       .04

then the new file sastmp.pophlth has the following observations:


  zip    zippop    employ    cancer   heart 
  16801  111111    123456     .01      .02
  16802  222222    456678     .02      .01
  16803  333333    678901     .03      .04
  16804  444444    123456      .        .

Since there isn't a matching health record for the demographic file's zip code 16804, the health variables were set to missing(.) for this record.

Using the IN option to limit the output records

In the previous examples, all of the records from both files ended up in the new SAS file. The following example will show you how to limit the records going to the output file. In this example you are working with a subset of all US zip codes (zip codes 16801-16804) located in the file demog. The health data file contains records for all US. zip codes. You want to get the health variables for only those zip codes in the demography file (16801-16804).

  data pophlth;
  merge demog (IN=indemog)
        health (IN=inhealth);
  by zip; 
  if indemog;           /* limits observations */ 
  run;

The output file from this program would contain an observation for the four zip codes in the demography file with any associated information from the health file. If there wasn't a matching zip code in the health file then the health variables would be missing.

If you wanted to output records only if there was a demography AND a health record for the same zip code you would use the following code:

  data pophlth;
  merge demog (in=indemog)
       health (in=inhealth);
  by zip;
  if indemog and inhealth;  
  run;

Creating more than 1 data set with a merge

Sometimes it is handy to create more than one data set from the merge. This technique is commonly used in cleaning data. For example, you have a master file of all the people that you sent surveys to that contains their demographic information and another file with the returned survey information. Each file has the variable surveyid which is the survey id number. Unfortunately you have 100 survey subjects in your master file but only 75 observations in your survey file. How do you find out which survey subjects haven't returned their survey? The following code will create 3 data sets. Data set returned contains the demographic data from the master file and the returned survey information. Data set outstand contains the demographic information from the master file for the people that did NOT return their survey. Data set badid contains the returned survey information for people that have survey information but are not in the master file. You need to check their id's for typos.


 data returned                /* survey returned      */ 
      outstand                /* survey outstanding   */ 
      badid;                  /* bad survey id number */ 
 merge master (in=inmaster)   
       returned (in=inout);
  by surveyid;
  if inmaster and inout then output returned;
  else if inmaster and not inout then output outstand;
  else if inout and not inmaster then output badid;
  run;

1 to many merge

Sometimes you want to merge several observations in one data set with 1 observation in another data set. For example, you have data by zip code and you want to get state or county information for those zip codes. Since there is 1 state for multiple zipcodes, you have a 1 to many merge. The code for a 1 to many merge is the same as for a 1 to 1 merge. Warning: make sure that you have a 1 to many merge and not a many to many merge (they are nasty and to be avoided).

 
   zip          zippop        state
  16801         111111          42
  16802         222222          42
  16803         333333          43
  

and the state file has the following observations:


  state         stpop            hospital
  40            1234567             123
  42            2345678             234
  43            5678900             678
 

then the new file has the following observations:


  zip     zippop   state    stpop  hospital 
  16801   111111    42     2345678    234
  16802   222222    42     2345678    234
  16803   333333    43     5678900    678
Document Actions

Copyright ©2009, The Pennsylvania State University | Privacy and Legal Statements
Contact the Help Site Administrator | Last modified Aug 21, 2008 | Weblion Partner