Personal tools
You are here: Home Software Packages SAS SQL with SAS

SQL with SAS

— filed under:

Using PROC SQL, the SAS structured query language procedure to insert, delete, modify and retrieve information from SAS data tables

SAS does not allow SQL statements to be used in the DATA step. However, SAS provides PROC SQL which allows operations on SAS datasets with SQL. The SAS terms dataset, observation, and variable respectively correspond to the SQL terms table, row, and column.

The syntax of PROC SQL is:

 PROC SQL < option(s)> ;
ALTER alter-statement;
CREATE create-statement;
DELETE delete-statement;
DESCRIBE describe-statement;
DROP drop-statement;
INSERT insert statement;
RESET < option < option &gt...> ;
SELECT select-statement;
UPDATE update-statement;
VALIDATE validate-statement;

ALTER changes the attibutes of columns or adds or drops columns
CREATE creates tables
DELETE removes rows from a table
DESCRIBE display a view definition
DROP deletes the table
INSERT inserts a new row into the table
RESET allows options to be changed during execution
SELECT retrives data and outputs results
UPDATE modifies columns in existing rows
VALIDATE checks a query-expression for syntactic accuracy.

Note that the DELETE statement has exactly the same function as the
DELETE statement used in the DATA step. However, DROP in PROC SQL
will delete the entire table(dataset) rather the columns(variables).

Some of options which may be useful: INOBS=n restricts the number of rows processed from a source
NUMBER|NONUMBER includes a column with the row number
PRINT|NOPRINT turns printing for SELECT statements on or off




The Select Statement

Use the SELECT statement to select or create columns(variables) from tables(datasets)
PROC SQL;
SELECT name, team, league, no_hits, no_bb, no_atbat
FROM sasuser.baseball;
                                  League
at the Times at
Team at the end of Hits in Walks in Bat in
Player's Name end of 1986 1986 1986 1986 1986
-------------------------------------------------------------------------
Aldrete, Mike SanFrancisco National 54 33 216
Allanson, Andy Cleveland American 66 14 293
Almon, Bill Pittsburgh National 43 30 196
Anderson, Dave LosAngeles National 53 22 216
Armas, Tony Boston American 112 24 425
Ashby, Alan Houston National 81 39 315
Backman, Wally NewYork National 124 36 387
Baines, Harold Chicago American 169 38 570
...
The equivalent base SAS statements to generate this output are: proc print label data=sasuser.baseball noobs split='*'; var name team league no_hits no_bb no_atbat; run; The SELECT statement in SQL functions somewhat like a "SET" and
"KEEP" statement in base SAS. The SELECT in SQL should not be confused
with the SELECT code in base SAS. To select all players who play for Cleveland: PROC SQL; SELECT name, team, league, no_hits, no_bb, no_atbat FROM sasuser.baseball WHERE team eq 'Cleveland';
                                  League
at the Times at
Team at the end of Hits in Walks in Bat in
Player's Name end of 1986 1986 1986 1986 1986
------------------------------------------------------------------------
Allanson, Andy Cleveland American 66 14 293
Bando, Chris Cleveland American 68 22 254
Bernazard, Tony Cleveland American 169 53 562
Butler, Brett Cleveland American 163 70 587
Carter, Joe Cleveland American 200 32 663
Castillo, Carmen Cleveland American 57 9 205
Franco, Julio Cleveland American 183 32 599
...

Note that the WHERE statement is a continuation of the SELECT.
There is only one semicolon at the end of the entire statement.

To select players who drew between 80 and 100 base on balls:
PROC SQL; SELECT name, team, league, no_hits, no_bb FROM sasuser.baseball WHERE no_bb BETWEEN 80 AND 100;

                                  League
at the
Team at the end of Hits in Walks in
Player's Name end of 1986 1986 1986 1986
--------------------------------------------------------------
Brett, George KansasCity American 128 80
Davis, Chili SanFrancisco National 146 84
Doran, Bill Houston National 152 81
Downing, Brian California American 137 90
Evans, Darrell Detroit American 122 91
Evans, Dwight Boston American 137 97
...

To select Pittsburgh players who drew between 80 and 100 base on balls:
PROC SQL; SELECT name, team, league, no_hits, no_bb FROM sasuser.baseball WHERE (no_hits BETWEEN 80 AND 100) AND (team EQ 'Pittsburgh');

                                  League
at the
Team at the end of Hits in Walks in
Player's Name end of 1986 1986 1986 1986
--------------------------------------------------------------
Bonds, Barry Pittsburgh National 92 65
Orsulak, Joe Pittsburgh National 100 28
Other relations used in the WHERE statement in both PROC SQL and base SAS include: CONTAINS - will find all rows(observations) where the string is contained in the variable. Example: WHERE name CONTAINS 'Mike' will find all
players with "Mike" in their name.

LIKE - a pattern matching function where one or more "wild cards"
can be substituted in the search.

Example: WHERE team LIKE 'P%' will find all rows where the
team name starts with the letter P.
WHERE name LIKE 'D_n' would find all rows where the
player name started with 'D', ended with 'n' and
had one letter in between ("Dan" and "Don" would
match, but "Dean" would not).





SQL allows basic functions to be performed on columns.

Find the total number of hits, the mean number of base on balls, the number of non-missing at bats, and the number of non-missing salaries.

PROC SQL; SELECT SUM(no_hits) as tot_hits, MEAN(no_bb) as ave_bb, COUNT(no_atbat) as nm_ab, COUNT(salary) as nm_pay FROM sasuser.baseball;

TOT_HITS AVE_BB NM_AB NM_PAY -------------------------------------- 33294 39.85714 322 263

The base SAS code which produces the same results:

proc means data=sasuser.baseball sum mean n; var no_bb no_atbat salary no_hits; output out=mtemp n=nm_bb nm_ab nm_pay nm_hits mean=ave_bb ave_ab ave_pay ave_hits sum=tot_bb tot_ab tot_pay tot_hits; proc print data=mtemp; var tot_hits ave_bb nm_ab nm_pay; run;



The following statistics can be performed by the SQL procedure:

    AVE, MEAN means or average COUNT, FREQ, N number of nonmissing values CSS corrected sum of squares CV coefficient of variation (percent) MAX largest value MIN smallest value NMISS number of missing values PRT probability of a greater absolute value of Student's t STD standard deviation STDERR standard error of the mean SUM sum of values




Note that the functions used in SQL do not work the same at the functions
in base SAS. In base SAS, the above functions operate on variable in one
observation, i.e. multiple columns in one row. The functions in SQL
work on one column across multiple rows. The functions in SQL perform
the same task as PROCs in base SAS.


PROC SQL allows use of formats and labels.

PROC SQL; SELECT SUM(no_hits) as tot_hits format=comma7.0 label='Total Hits', MEAN(no_bb) as ave_bb format=7.2 label='Mean Base on Balls', COUNT(no_atbat) as nm_ab format=comma7.0 label='Frequency At Bat', COUNT(salary) as nm_pay format=comma7.0 label='Non-missing salary' FROM sasuser.baseball;
            Mean
Total Base on Frequency Non-missing
Hits Balls At Bat salary
----------------------------------------
33,294 39.86 322 263
To sum by a classification variable, add the GROUP BY clause. PROC SQL; SELECT team, league, SUM(no_hits) as tot_hits format=comma7.0 label='Total Hits', MEAN(no_bb) as ave_bb format=7.2 label='Mean Base on Balls', COUNT(no_atbat) as nm_ab format=comma7.0 label='Frequency At Bat', COUNT(salary) as nm_pay format=comma7.0 label='Non-missing salary' FROM sasuser.baseball GROUP BY team, league;
              
League
at the Mean
Team at the end of Total Base on Frequency Non-missing
end of 1986 1986 Hits Balls At Bat salary
----------------------------------------------------------------
Atlanta National 1,055 40.36 11 8
Baltimore American 1,336 35.33 15 10
Boston American 1,378 53.40 10 8
California American 1,324 49.85 13 10
Chicago American 1,257 34.38 13 11
Chicago National 1,188 39.82 11 11
Cincinnati National 1,203 42.08 12 10
Cleveland American 1,564 35.92 12 11
...
To generate this with base SAS:
proc means data=sasuser.baseball  nway noprint SUM MEAN N  ;
class team league;
var no_hits no_bb no_atbat salary;
output out=mtemp SUM=tot_hits tot_bb tot_ab tot_pay
MEAN=ave_hits ave_bb ave_ab ave_pay
N=nm_hits nm_bb nm_ab nm_pay;
run;

data temp;
keep team league tot_hits ave_bb nm_ab nm_pay;
set mtemp;
run;

proc sort data=temp;
by team league;
run;

proc print;
/* format and label statements omitted */
run;

SQL allows the merging of the summary statistics back into each row.

PROC SQL;

SELECT name, team, no_hits, no_bb,
SUM(no_hits) as tot_hits format=comma7.0 label='Total Hits',
MEAN(no_bb) as ave_bb format=7.2 label='Mean Base on Balls'
FROM sasuser.baseball
GROUP BY team, league;
                                                                  Mean
Team at the Hits in Walks in Total Base on
Player's Name end of 1986 1986 1986 Hits Balls
----------------------------------------------------------------------
Horner, Bob Atlanta 141 52 1,055 40.36
Oberkfell, Ken Atlanta 136 83 1,055 40.36
Moreno, Omar Atlanta 84 21 1,055 40.36
Murphy, Dale Atlanta 163 75 1,055 40.36
Hubbard, Glenn Atlanta 94 66 1,055 40.36
Harper, Terry Atlanta 68 29 1,055 40.36
Virgil, Ozzie Atlanta 80 63 1,055 40.36
Simmons, Ted Atlanta 32 12 1,055 40.36

The corresponding base SAS code

proc means data=sasuser.baseball  nway noprint SUM MEAN N  ;
class team league;
var no_hits no_bb no_atbat salary;
output out=mtemp SUM=tot_hits tot_bb tot_ab tot_pay
MEAN=ave_hits ave_bb ave_ab ave_pay
N=nm_hits nm_bb nm_ab nm_pay;
run;

data temp;
keep team league tot_hits ave_bb;
set mtemp;

proc sort data=temp; by team league;

data individ;
keep name team league no_hits no_bb;
set sasuser.baseball;

proc sort data=individ; by team league;

data temp2;
merge individ temp; by team league;

proc print;
/* format and label statements omitted */
run;

Variables can be created in the SELECT statement without using the
summary functions.
To calculate the "On Base Percentage" for an individual,
we need to add the base on balls to the number of hits, and divide this
total by the sum of the number of at bats plus the base on balls. To calculate
this number for the team, we need to use the SUM function in SQL to calculate
the numerator and denominator for each team. If the result of the calculation
was not stored in a variable with 'AS', the value would be calculated and
displayed, but with no column heading.
PROC SQL;

SELECT name, team, (no_hits+no_bb)/(no_atbat+no_bb) AS obp
format=4.3 label='Individual On Base Percentage',
SUM(no_hits+no_bb)/SUM(no_atbat+no_bb) AS team_obp
format=4.3 label='Team On Base Percentage'
FROM sasuser.baseball
GROUP BY team, league;
                                  Individual     Team On
Team at the On Base Base
Player's Name end of 1986 Percentage Percentage
--------------------------------------------------------
Moreno, Omar Atlanta .276 .325
Harper, Terry Atlanta .330 .325
Horner, Bob Atlanta .339 .325
Hubbard, Glenn Atlanta .338 .325
Thomas, Andres Atlanta .269 .325
...

The base SAS code:

data individ;
keep name team league no_hits no_bb no_atbat obp;
set sasuser.baseball;
obp=(no_hits+no_bb)/(no_atbat+no_bb);
proc summary data=sasuser.baseball nway;
class team league;
var no_hits no_bb no_atbat;
output out=team sum=no_hits no_bb no_atbat;
data teamobp;
keep team league team_obp;
set team;
team_obp = (no_hits+no_bb)/(no_atbat+no_bb);
proc sort data=individ; by team league;
proc sort data=teamobp; by team league;
data final;
keep name team league obp team_obp;
merge individ teamobp; by team league;
proc print;
Calculated variables can be used for selection.

To select players with an On Base Percentage greater than .380:

PROC SQL; SELECT name, team, (no_hits+no_bb)/(no_atbat+no_bb) AS obp format=4.3 label='Individual On Base Percentage', SUM(no_hits+no_bb)/SUM(no_atbat+no_bb) AS team_obp format=4.3 label='Team On Base Percentage' FROM sasuser.baseball WHERE CALCULATED obp > .380 GROUP BY team, league;
                                  Individual     Team On
Team at the On Base Base
Player's Name end of 1986 Percentage Percentage
--------------------------------------------------------
Murray, Eddie Baltimore .400 .400
Boggs, Wade Boston .455 .421
Rice, Jim Boston .385 .421
Daniels, Kal Cincinnati .394 .394
Grubb, Johnny Detroit .412 .412
...
*** This table contains an error. The team On Base Percentage has now been
calculated using only the individuals with obp > .380 The WHERE
clause should be replaced by HAVING. Notice also that the keyword
"CALCULATED" needs to preceed any variable name that was not originally
in the table.


PROC SQL; SELECT name, team, (no_hits+no_bb)/(no_atbat+no_bb) AS obp format=4.3 label='Individual On Base Percentage', SUM(no_hits+no_bb)/SUM(no_atbat+no_bb) AS team_obp format=4.3 label='Team On Base Percentage' FROM sasuser.baseball GROUP BY team, league HAVING CALCULATED obp > .380;

                                  Individual     Team On
Team at the On Base Base
Player's Name end of 1986 Percentage Percentage
--------------------------------------------------------
Murray, Eddie Baltimore .400 .333
Boggs, Wade Boston .455 .346
Rice, Jim Boston .385 .346
Daniels, Kal Cincinnati .394 .336
Grubb, Johnny Detroit .412 .340
Brett, George KansasCity .399 .313
...
The SELECT statement does not create tables(datasets). It only extracts
information from tables. To create a dataset from a PROC SQL query, use
the CREATE TABLE statement. The dataset can be a temporary dataset or
a permanent dataset. The name must adhere to the SAS naming conventions.

To create a temporary dataset "example" from the results of the previous
query:
PROC SQL; CREATE TABLE example AS SELECT name, team, (no_hits+no_bb)/(no_atbat+no_bb) AS obp format=4.3 label='Individual On Base Percentage', SUM(no_hits+no_bb)/SUM(no_atbat+no_bb) AS team_obp format=4.3 label='Team On Base Percentage' FROM sasuser.baseball GROUP BY team, league HAVING CALCULATED obp > .380; PROC PRINT DATA=example (obs=5); run;
OBS        NAME         TEAM           OBP    TEAM_OBP

1 Murray, Eddie Baltimore .400 .333
2 Boggs, Wade Boston .455 .346
3 Rice, Jim Boston .385 .346
4 Daniels, Kal Cincinnati .394 .336
5 Grubb, Johnny Detroit .412 .340
Please note than when using the SELECT statement with no CREATE, the results of the query appear in the output window (interactive SAS) or the .lst file (batch SAS). If a dataset is output using the CREATE TABLE statement, the printing of the results is supressed.

Another nice function of SQL is the CASE expression. It can be used
for classification of variables. If we wanted to classify the number
of hits into 4 groups, we could use the case statement. The ORDER BYstatement sorts the results of the query.

PROC SQL; SELECT name, team, no_hits, CASE WHEN no_hits < 50 THEN 'poor' WHEN no_hits BETWEEN 51 AND 100 THEN 'mediocre' WHEN no_hits BETWEEN 101 AND 150 THEN 'good' WHEN no_hits BETWEEN 151 and 200 THEN 'very good' ELSE 'excellent' END AS hit_qnty label 'Quality Hitter' FROM sasuser.baseball ORDER BY team;
                    Team at the    Hits in  Quality
Player's Name end of 1986 1986 Hitter
-----------------------------------------------------
Hubbard, Glenn Atlanta 94 mediocre
Murphy, Dale Atlanta 163 very good
Sample, Billy Atlanta 57 mediocre
Moreno, Omar Atlanta 84 mediocre
Horner, Bob Atlanta 141 good
Ramirez, Rafael Atlanta 119 good
Oberkfell, Ken Atlanta 136 good
Harper, Terry Atlanta 68 mediocre
Simmons, Ted Atlanta 32 poor

Subqueries - A select may be used within another select statement to
extract information from a second table.
For example, we will use
a subquery to find all values of v2 that are in the same row as id=2.
We will then select all rows in data1 that have var4 equal to these
values.

PROC SQL; SELECT key_val, var1, var2, var3, var4 FROM data1 WHERE var4 IN (SELECT v2 FROM data2 WHERE id = 2);
 KEY_VAL      VAR1      VAR2      VAR3  VAR4
-----------------------------------------------
2 4 2 3 B
5 6 2 1 C
1 1 1 9 C

Tables can be merged or joined in several ways. The following code
creates every possible combination merging on a variable. The ORDER BY
statement sorts the output.
PROC SQL; SELECT id, v1, var3, var4 FROM data1 AS first, data2 AS second WHERE first.key_val = second.id ORDER BY id, v1, var3;
      ID        V1      VAR3  VAR4
----------------------------------
1 2 4 A
1 2 5 E
1 2 9 C
1 3 4 A
1 3 5 E
1 3 9 C
1 5 4 A
1 5 5 E
1 5 9 C
2 1 3 B
2 1 3 B
2 1 9 A
2 1 9 A
2 2 3 B
2 2 9 A
3 1 8 E
3 2 8 E
3 3 8 E
4 1 5 A
4 2 5 A
IMPORTANT If SQL is being used to merge or join tables,
please be aware that if the tables are very large, SQL will not work,
or may be slower that the datastep. Also, only one table can be created
per SQL CREATE statement. Base SAS allows multiple
datasets to be created in on DATA step.






JOIN statements can be used to combine tables.


The following code produces output identical to the preceeding code: PROC SQL; SELECT id, v1, var3, var4 FROM data1 as first INNER JOIN data2 as second ON first.key_val = second.id ORDER BY id, v1, var3;
      ID        V1      VAR3  VAR4
----------------------------------
1 2 4 A
1 2 5 E
1 2 9 C
1 3 4 A
1 3 5 E
1 3 9 C
1 5 4 A
1 5 5 E
1 5 9 C
2 1 3 B
2 1 3 B
2 1 9 A
2 1 9 A
2 2 3 B
2 2 9 A
3 1 8 E
3 2 8 E
3 3 8 E
4 1 5 A
4 2 5 A

The LEFT JOIN combines two tables on key values that are equal, plus
any unmatched values from the first table.

PROC SQL; SELECT id, key_val, v1, var1, var2, var3, var4, v2 FROM data1 as first LEFT JOIN data2 as second ON first.key_val = second.id ORDER BY id, v1, var3;
      ID   KEY_VAL        V1      VAR1      VAR2      VAR3  VAR4      V2
--------------------------------------------------------------------------
. 5 . 6 2 1 C
. 5 . 2 2 2 A
1 1 2 2 3 4 A A
1 1 2 3 4 5 E A
1 1 2 1 1 9 C A
1 1 3 2 3 4 A D
1 1 3 3 4 5 E D
1 1 3 1 1 9 C D
1 1 5 2 3 4 A A
1 1 5 3 4 5 E A
1 1 5 1 1 9 C A
2 2 1 4 2 3 B C
2 2 1 4 2 3 B B
2 2 1 4 1 9 A C
2 2 1 4 1 9 A B
2 2 2 4 2 3 B D
2 2 2 4 1 9 A D
3 3 1 5 7 8 E B
3 3 2 5 7 8 E C
3 3 3 5 7 8 E D
4 4 1 5 5 5 A D
4 4 2 5 5 5 A C

The first two rows with the missing ID correspond to key_val = 5 in the first table.

The RIGHT JOIN combines the tables by selecting the rows where the
key values match, plus any matched keys from the second table.

PROC SQL; SELECT id, key_val, v1, var1, var2, var3, var4, v2 FROM data1 as first RIGHT JOIN data2 as second ON first.key_val = second.id ORDER BY id, v1, var3;
      
ID KEY_VAL V1 VAR1 VAR2 VAR3 VAR4 V2
------------------------------------------------------------------------------
1 1 2 2 3 4 A A
1 1 2 3 4 5 E A
1 1 2 1 1 9 C A
1 1 3 2 3 4 A D
1 1 3 3 4 5 E D
1 1 3 1 1 9 C D
1 1 5 2 3 4 A A
1 1 5 3 4 5 E A
1 1 5 1 1 9 C A
2 2 1 4 2 3 B C
2 2 1 4 2 3 B B
2 2 1 4 1 9 A B
2 2 1 4 1 9 A C
2 2 2 4 2 3 B D
2 2 2 4 1 9 A D
3 3 1 5 7 8 E B
3 3 2 5 7 8 E C
3 3 3 5 7 8 E D
4 4 1 5 5 5 A D
4 4 2 5 5 5 A C
9 . 3 . . . B
9 . 3 . . . A

The FULL JOIN creates every possible combination from the tables:

PROC SQL; SELECT id, key_val, v1, var1, var2, var3, var4, v2 FROM data1 as first FULL JOIN data2 as second ON first.key_val = second.id ORDER BY id, v1, var3;
      ID   KEY_VAL        V1      VAR1      VAR2      VAR3  VAR4      V2
------------------------------------------------------------------------------
. 5 . 6 2 1 C
. 5 . 2 2 2 A
1 1 2 2 3 4 A A
1 1 2 3 4 5 E A
1 1 2 1 1 9 C A
1 1 3 2 3 4 A D
1 1 3 3 4 5 E D
1 1 3 1 1 9 C D
1 1 5 2 3 4 A A
1 1 5 3 4 5 E A
1 1 5 1 1 9 C A
2 2 1 4 2 3 B C
2 2 1 4 2 3 B B
2 2 1 4 1 9 A B
2 2 1 4 1 9 A C
2 2 2 4 2 3 B D
2 2 2 4 1 9 A D
3 3 1 5 7 8 E B
3 3 2 5 7 8 E C
3 3 3 5 7 8 E D
4 4 1 5 5 5 A D
4 4 2 5 5 5 A C
9 . 3 . . . B
9 . 3 . . . A
The corresponding dataset merge statements corresponding to the various
joins are:

INNER JOIN merge data1 (in=a) data2 (in=b); by id; if a and b; LEFT JOIN merge data1 (in=a) data2 (in=b); by id; if a; RIGHT JOIN merge data1 (in=a) data2 (in=b); by id; if b; FULL JOIN merge data1 (in=a) data2 (in=b); by id;
Datasets used in examples
Selected variables from SASUSER.BASEBALL
                                         N              C
O N R C
L _ O _ R S
E A _ N A _ C A
N T A T H O T H R L
A E G B I _ B I _ A
M A U A T B A T B R
E M E T S B T S B Y

Aldrete, Mike SanFrancisco National 216 54 33 216 54 33 75
Allanson, Andy Cleveland American 293 66 14 293 66 14 .
Almon, Bill Pittsburgh National 196 43 30 3231 825 238 240
Anderson, Dave LosAngeles National 216 53 22 926 210 114 225
Armas, Tony Boston American 425 112 24 4513 1134 230 .
Ashby, Alan Houston National 315 81 39 3449 835 375 475
Backman, Wally NewYork National 387 124 36 1775 506 194 550
Baines, Harold Chicago American 570 169 38 3754 1077 263 950
Baker, Dusty Oakland American 242 58 27 7117 1981 762 .
Balboni, Steve KansasCity American 512 117 43 1750 412 155 100
Sample dataset "data1"
KEY_VAL    VAR1    VAR2    VAR3    VAR4

1 2 3 4 A
1 3 4 5 E
2 4 2 3 B
3 5 7 8 E
4 5 5 5 A
5 6 2 1 C
2 4 1 9 A
5 2 2 2 A
1 1 1 9 C
Sample dataset "data2"
ID    V1    V2

1 2 A
1 3 D
1 5 A
2 1 B
2 1 C
2 2 D
3 1 B
3 2 C
3 3 D
4 1 D
4 2 C
9 3 B
9 3 A
Document Actions

Copyright ©2009, The Pennsylvania State University | Privacy and Legal Statements
Contact the Help Site Administrator | Last modified Aug 13, 2008 | Weblion Partner