Data Cleaning Techniques
Garbage in -- Garbage out prevention techniques and coding examples
Garbage in -- Garbage out. In other words, you can write the most elegant code, do the most creative analysis and still end up with garbage if your data is not clean. Anyone that has collected data has horror stories of the data from hell. The following tips will get you started on data cleaning.
The following data was obtained from a survey which included the city and state of the person's place of birth. Data was entered directly into a laptop in the field with little or no data checking at the time of entry. It has several problems: 1) leading and trailing zeros , slashes and other characters, 2) upper, lower and mixed case characters, and 3) misspellings of place names. The following is the method that we used to clean the city and state fields.
CITY STATE
00000000000000000NYC 000000000///New York
0000000000new york c 000000000000NEW YOR/
00000000000000Boston 00000000Massachusett
000000000000Vineland 0000000000New Jersey
000000000000NEW JORK 000000000000new jork
00000000000///camdem 0000000000new yersey
0000000000filadelfia 000000000000000penn\
00000000000000004170 00000000000000000036
Getting rid of garbage characters and making everything upper case
The first step was to get rid of the 0's and other garbage characters using the INDEXC , TRANSLATE , SUBSTR and UPCASE functions. Indexc finds the first occurrence in the first argument of any character present in the other arguments. In this case it will search for the first occurence of a letter of the alphabet. Translate will replace specific characters with other characters. In this case it will replace the slashes and other characters with spaces. This is necessary to get rid of the trailing garbage characters. Substr will extract the "good" part of the place name using the results from indexc. Upcase will convert all the characters into upper case.
libname sastmp '/sastmp';
data sastmp.step1; set sastmp.survey1; x = indexc(city,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'); city = translate(city,' ','/\.'); if x gt 0 then city = upcase(substr(city,x)); y = indexc(state,'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'); state = translate(state,' ','/\.'); if y gt 0 then state = upcase(substr(state,y)); run;
The output appears below:
CITY STATE
NYC NEW YORK
NEW YORK C NEW YOR
BOSTON MASSACHUSETT
VINELAND NEW JERSEY
NEW JORK NEW JORK
CAMDEM NEW YERSEY
FILADELFIA PENN
0000000000000004170 00000000000000000036
Cleaning up the misspellings and abbreviations
Using proc sort nodup on the
city and state
variables, an
unduplicated list of cities and states was obtained.
AUSTIN NEW YORK
AUSTIN TEXAS
BRENTWOOD NEW YORK
BRIDGEPORT CONNECTICUT
BRIDGEPORT NEW YORK
BRIDGEPORT PENNSYLVANIA
CARNEYS POINT NEW JERSEY
CAROL CITY FLORIDA
HUNTINDON PENN
HUNTINGDON PENNSYLVANIA
HUNTINGDON WEST VIRGINIA
HUNTINGON WV
HUNTINGTON PA
HUNTINGTON WEST VIRGINIA
A series of IF statements was used to clean up the cities and states. It is important that you don't create new problems by changing a "good" spelling to a "bad" spelling in the process of correcting another record. For example you don't want to change all of the Huntingdons to Huntingtons since Huntingdon is the valid name in Pennsylvania and Huntington is the correct name in West Virginia. In this program the state variable was corrected first and then the city variable was corrected. That made it easier to deal with situations in which there were cities with the same or similar names in different states. A sample appears below:
libname sastmp '/sastmp';
data sastmp.step2;
set sastmp.step1;
if state in ('EW YORK', 'N Y', 'NEUEVA YORK', 'NEW ORK', 'NEW YOEK',
'NEW YORKI', 'NEW YORKJ', 'NEW YPRK', 'NEWYORK', 'NRE YORK', 'NY',
'NUEVA YORK', 'LONG ISLAND NEW YORK', 'NEW MYORK','NEW YOR',
'NEW YORK CITY', 'NUEVA YRK', 'NEW JORK', 'NEW TYORK', 'NEW YORJK',
'NEW YIRK', 'NW YRK', 'NYU', "NEW YORK, NEW YORK",
'NEW YRK')
then state = 'NEW YORK';
if state in ('PA', 'PENN', 'PENCILVANIA')
then state = 'PENNSYLVANIA';
if state in ('WV', 'W. VIRGINIA', 'WST VIRGINIA')
then state = 'WEST VIRGINIA';
if city in ('PHL','FILADELFIA','PHILA', 'PHILEDIAPHIA',
'PHILIADELPHIA', 'PHYLADELPHIA')
then city = 'PHILADELPHIA';
if city in ('HUNTINDON', 'HUNTINGTON') and state = 'PENNSYLVANIA'
then city = 'HUNTINGDON';
if city in ('HUNTINGDON','HUNTINGON') and state = 'WEST VIRGINIA'
then city = 'HUNTINGTON';
if city in ('FORT LADERLADE', 'FT LAUDERDALE', 'FT LAURADEL',
'FORT LADERDALE', 'FORT LAURELDALE', 'FT LAUTERDALE',
'FORTLAUDERDALE', 'FT LAUTERDALE')
then city = 'FORT LAUDERDALE';
run;
Several records did not have city or state names. In many of those cases it was filled with
all 0's or random numbers and in these cases the city and/or state names were changed to UNKNOWN.
However in some cases the interviewer put the city
and state geographic codes in place of
the city and state names. In the case of the last record in
the
example, the geographic code that we were using had New York state as 36 and the code
for the New York City area as
4170. We were able to salvage it with code such as:
if state = '0000000000000000036' and city = '00000000000000004170' then do;
city = 'NEW YORK CITY';
state = 'NEW YORK';
end;
run;

