Introduction to the Data Step -- Part 1

All numbers in parentheses refer to page numbers in the Second Edition of The Little SAS Book.

General Syntax    Libname and Filename Statements    SAS datasets    Variables   

Reading Raw Data    Arithmetic Operators    Assignment Statements    Functions

 

 

The SAS DATA step can be used to:

  • Input data into a SAS data set
  • Create new variables
  • Check for errors in the data and correct those errors
  • Produce new data sets by subsetting, merging, and/or modifying existing data sets
  • Prepare data for analysis using SAS procedures

 

General SAS syntax rules

 

  • SAS programming statements end with a semicolon.
  • SAS programming statements can start in any column.
  • SAS programming statements can span multiple lines.
  • SAS programming statements are not case sensitive.
  • SAS names can consist of letters, numbers, and the underscore '_'.
  • SAS names must start with an underscore or letter - no numbers!

     

  • Comments in SAS programs can be written in two forms:
    1. Start with the characters "/*" and end with the characters "*/"
    2. Start with an asterisk "*" and end with a semicolon ";"
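
The rules above can be seen together in a short sketch. The variable name Total_1 is made up for illustration; DATA _NULL_ runs the step without creating a dataset.

```sas
* This comment starts with an asterisk and ends with a semicolon;
/* This comment is enclosed in slash-asterisk pairs */
data _null_;
   Total_1 = 2
           + 3;       /* one statement spanning two lines */
   PUT total_1=;      /* keywords and names are not case sensitive */
run;
```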

 

 

LIBNAME and FILENAME

There are two important SAS statements that tell SAS where data are stored on the system you are using: the FILENAME and LIBNAME statements (26,27,42,43).

 

The LIBNAME statement refers to a directory or folder. The syntax of the libname statement is:

    LIBNAME reference name 'complete path name of directory or folder';

For example:

libname abcdef '~/project/data';      /* Unix system */

libname wxyz_1 'C:\mydata\project';   /* Windows */

 

The FILENAME statement refers to a file which can be used for input or output. The basic syntax of the filename statement is:

    FILENAME reference name 'complete filename';

Examples:

filename rawdata '~/project/rawdata/wave1.data';     /* Unix */

filename rawdata 'D:\wave1.dat';                     /* Windows */

When using the FILENAME statement on a Unix machine, the PIPE keyword can be used to allow a Unix pipe as the device type. This is very useful when reading a compressed or zipped data file.

FILENAME indata PIPE 'zcat ~/project/rawdata/wave2.data.gz';

The preceding example allows SAS to use the output from the UNIX zcat command as input data. zcat is a command that uncompresses or unzips a file. This eliminates the need to unzip the file before running the SAS program.

 

SAS data sets (4,5)

 

  • SAS data sets are tables.
  • They consist of observations (rows) and variables (columns).
  • Data sets can be permanent or temporary (40,41).

Temporary datasets are located in a SAS library named WORK. Temporary datasets exist only for the duration of the SAS session or job. As long as you are using SAS, these datasets will be available. When referring to a temporary dataset, the libname "WORK" does not have to be explicitly specified.

Permanent datasets are located in a SAS library pointed to by a LIBNAME statement. Permanent datasets exist after the termination of the SAS session or job. Multiple datasets can be contained within a library.

The DATA step always starts with the word DATA, usually followed by one or more dataset names.

Examples:
    DATA family;
    DATA abcdef.individ;
The first example is a temporary dataset. The second is a permanent dataset located in the directory/folder referred to as "abcdef" in a LIBNAME statement.
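
A minimal sketch of the difference, assuming the directory in the LIBNAME statement already exists on your system. The dataset and variable names (family, famid, size) are made up for illustration.

```sas
libname abcdef '~/project/data';   /* assumed to be an existing directory */

data family;                       /* temporary: stored as WORK.family */
   input famid size;
   datalines;
1 4
2 3
;
run;

data abcdef.family;                /* permanent copy in the abcdef library */
   set work.family;                /* WORK may be written out explicitly */
run;
```

After the session ends, WORK.family is gone, while the copy in the abcdef library remains on disk.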

 

Variables

 

  • Can either be character or numeric (4,5).
  • Variable names must start with a letter or underscore (_)
  • Variable names can consist of letters, numbers, and underscores
  • Variable names can be up to 32 characters long (Version 7 and later; Version 6 allowed only 8 characters)
  • Can be used with formats to change appearance (98-101).
  • Can be assigned maximum lengths to save disk space (214, 218, 219)

Unless specified otherwise, SAS considers variables to be numeric. Character variables can be created by reading the variable as character, using a length statement, or using other methods to designate the variable as character. Note that in some cases, variables that are all numeric are better represented as character (zip codes, phone numbers, FIPS codes).

Because variables can be only character or numeric, SAS handles dates in a special manner. Internally, SAS represents dates as the number of days since January 1, 1960.
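
A short sketch of this internal representation, using a date literal and the MDY function. The variable names d0 and d1 are arbitrary.

```sas
data _null_;
   d0 = '01JAN1960'd;        /* date literal: day zero */
   d1 = mdy(1, 2, 1960);     /* month, day, year: one day later */
   put d0= d1=;              /* writes d0=0 d1=1 to the log */
   put d1 date9.;            /* same value displayed as 02JAN1960 */
run;
```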

Missing character values in SAS are represented as a blank space. Missing numeric values are usually represented as a period ("."). SAS will allow multiple representations of missing numeric values using a capital letter or underscore along with the period (._, .A, .B, ... , .Z). Missing values are smaller than any numeric value, so when sorting in ascending order, the missing values will appear at the beginning of the list.
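
As an illustration, the following sketch tests for ordinary and special missing values and shows that a missing value compares as smaller than a number. The variable names and messages are made up.

```sas
data _null_;
   x = .;                    /* ordinary missing value */
   y = .A;                   /* special missing value */
   if x = . then put 'x is missing';
   if missing(y) then put 'y is also treated as missing';
   if x < 0 then put 'a missing value compares as less than any number';
run;
```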

The FORMAT statement can be used to change the appearance of a variable, even though the actual value in the SAS dataset is not changed. The syntax of this statement is:

   FORMAT variable formatname;

For example:  FORMAT income DOLLAR15.2;

Here is a variable displayed with no format and with two different formats.
57890022.79        No format
$57,890,022.79    DOLLAR15.2 format
57,890,022.79      COMMA15.2 format
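
One way to try this yourself is to apply each format in a PUT statement, which writes to the SAS log. This is only a sketch; the stored value of income is the same on both lines.

```sas
data _null_;
   income = 57890022.79;
   put income dollar15.2;    /* dollar sign, commas, two decimals */
   put income comma15.2;     /* commas, two decimals */
run;
```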

The LENGTH statement sets the maximum amount of storage a variable will use. This is important because the default length is 8 bytes. For numeric variables that hold integer values, such as age or years of education, the LENGTH statement can reduce the space used. For character variables, such as city names, the LENGTH statement can increase the length so that information is not truncated.

The syntax of the length statement is:

   LENGTH variable length variable length ... ;

For example:   LENGTH age 3 city $20;
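
A small sketch of the LENGTH statement in a complete DATA step. The dataset name and data values are invented. Without the LENGTH statement, a character value read this way would be cut off after 8 characters.

```sas
data people;
   length age 3 city $20;    /* set lengths before the variables are read */
   input age city $;
   datalines;
34 Philadelphia
51 Boston
;
run;
```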

This table gives the length (in bytes) and the largest integer that can be stored exactly on a UNIX machine.

Length in bytes    Largest integer
  3                8,192
  4                2,097,152
  5                536,870,912
  6                137,438,953,472
  7                35,184,372,088,832
  8                9,007,199,254,740,992

If your numeric data are real numbers (contain decimal points), do NOT use the LENGTH statement. Leave the length at the default of 8.

The INFORMAT statement is used to input data which may contain "special" characters (dollar signs, commas, etc.), to specify the number of decimal places for real numbers, or to read dates.

    INFORMAT income dollar10.2;

This statement would read the values for the variable "income" even if they contained dollar signs and commas. The values would be stored as numeric values.
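
A minimal sketch of the INFORMAT statement at work, with invented data. Every value here includes a decimal point; if a value had none, the ".2" in the informat would cause SAS to treat the last two digits as decimal places.

```sas
data pay;
   informat income dollar10.2;
   input id income;
   datalines;
1 $45,000.50
2 $1,200.00
;
run;

proc print data=pay;         /* income is numeric: 45000.5 and 1200 */
run;
```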

The LABEL statement can change the way the variable name is displayed when output.

    LABEL income = '1998 Household Income';

This statement would print '1998 Household Income' on the specified output. With no LABEL statement, the variable name 'income' would print.

The RENAME statement can be used to change the name of a variable.

     RENAME V1298A1 = state V1298A2 = county;

The KEEP statement tells the data step which variables to keep in the dataset. The DROP statement tells the data step to delete the listed variables.

    KEEP variable1 variable2 ;
    DROP variable8 variable9;
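
The LABEL, RENAME, and DROP statements can be combined in one DATA step, as in this sketch. The dataset name, the variable "extra", and the data values are made up. Note that within the same DATA step the other statements still refer to the old variable names, because the new names take effect only in the output dataset.

```sas
data survey;
   input V1298A1 V1298A2 income extra;
   label income = '1998 Household Income';
   rename V1298A1 = state V1298A2 = county;
   drop extra;               /* extra is not written to the dataset */
   datalines;
42 27 51000 9
42 3 38500 9
;
run;
```

The resulting dataset contains the variables state, county, and income.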

Reading raw (ASCII) data

Two essential statements for reading raw data into SAS are the INFILE and INPUT statements. The INFILE statement specifies where the raw data is located. The INPUT statement specifies the variables to be read, and provides instructions for reading these variables.

The INFILE statement can be used with the FILENAME statement, or can operate by itself (26,27).

filename indata '/home/data/some-directory/rawdata.toread';
infile indata;

infile '/home/data/some-directory/rawdata.toread';

In both cases, the program will read the data in the file /home/data/some-directory/rawdata.toread.

There are various options for the INFILE statement that you may encounter. Some that are useful are:

  • FIRSTOBS= specifies that the program should start reading at the line specified.
  • LRECL= gives the length of the input records. This must be used when the line length is greater than 256.
  • N= defines number of lines available to input pointer. Used when data requires more than one line.
  • OBS= gives the last line that should be read. Useful when testing code.
  • MISSOVER tells SAS to assign missing values to any remaining variables when the end of the input line is reached, rather than moving on to the next line.
  • TRUNCOVER tells SAS to read whatever part of a value is available when a line ends before the full field has been read.

Most of the data in the data archive at PRI must be read by columns. There are no spaces between the variables, although there may be embedded spaces in character variables. There are several advantages to reading data in columns:

  • Variables can be read in any order.
  • Character variables can have embedded blanks.
  • Variables or parts of variables can be reread.
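
The sketch below combines a few INFILE options with column input. It reads in-stream data (DATALINES) so that it can be run as is, on the assumption that FIRSTOBS=, OBS=, and TRUNCOVER are accepted for in-stream data just as for an external file. The layout and values are invented.

```sas
data sample;
   /* start at line 2, stop after line 3, tolerate short lines */
   infile datalines firstobs=2 obs=3 truncover;
   input id 1-3 name $ 4-13 age 14-15;
   datalines;
IDNAME       AGE
001Ann Lee   34
002Bob       51
003Cy Young  47
;
run;
```

Only lines 2 and 3 are read, so the dataset has two observations. The name "Ann Lee" keeps its embedded blank because the whole column range 4-13 is read as one character value.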

There may be multiple observations/rows per line of input data, or it may require multiple lines of raw data to construct one observation in the dataset. SAS can read data in either layout.

Some of the control characters that may appear in input statements:

  • @n starts reading data in column n
  • +n moves the pointer n columns
  • / moves the pointer to the beginning of the next line
  • @ at the end of an INPUT statement holds the line for the next INPUT statement.
  • @@ holds the input line for further executions of the DATA step.
  • #n specifies a line in the input buffer (used with INFILE N= option)
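
Two short sketches of these pointer controls, using invented data. The first uses @@ to read several observations from one line; the second uses @n, +n, and / to read one observation from two lines.

```sas
/* @@ : three observations on a single input line */
data scores;
   input name $ score @@;
   datalines;
Ann 90 Bob 85 Cy 77
;
run;

/* @n, +n and / : each observation spans two input lines */
data people;
   input @1 id 3.            /* columns 1-3 */
         +1 name $5.         /* skip one column, then read 5 characters */
       / @1 age 2.;          /* next line, columns 1-2 */
   datalines;
101 Alice
34
102 Bruno
51
;
run;
```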

 

The following examples of the input statement are taken from code in the data archives.

 


INPUT
   V14801 1-3             V14802 4-7             V14803 8-9
   V14804 10-12           V14805 13-17           V14806 18
   V14807 19              V14808 20              V14809 21
;

This example uses column numbers to input numeric variables.


 


DATA ONE;
INFILE TEST3A LRECL=7925 N=4;
INPUT
   FILEID $ 1-8
   STUSAB $ 9-10
   SUMLEV 11-13
;

This INPUT statement also uses column numbers to read the variables. Note that the variables FILEID and STUSAB are character, and SUMLEV is numeric. In addition, the INFILE statement needs the LRECL= option because each line of raw data is longer than 256 characters. N=4 is used because 4 lines of raw data are needed for each observation (row) in the dataset.

INPUT
   @2 REG $1.        @2 STCD $2.
   @4 HOSPNO $4.
   @1 ID $7.
   @8 DBEGM 2. @10 DBEGD 2. @12 DBEGY 2.
   @8 DTBEG MMDDYY6. ;

This INPUT statement uses the column pointer (@n) to tell the program where to start reading the data. Rather than specifying the column numbers, an informat is used to specify the length. Several columns are read multiple times to create different variables. STCD is composed of REG and the character in column 3. ID is composed of column 1, STCD, and HOSPNO. The month, day, and year are read as separate numeric values. The same columns are also read with the MMDDYY6. informat, which converts the mmddyy value to a SAS date value.

Always check your SAS log! The SAS log will tell you the number of lines of raw data that were input and the number of observations output. Make sure the number of lines read and the number of observations correspond. The SAS log will also tell you the number of variables in the dataset.

 


Arithmetic Operators

Numeric variables can be created or modified using arithmetic operators. They are:

  • + Addition
  • - Subtraction
  • * Multiplication
  • / Division
  • ** Exponentiation

For example:

 

a = b + c;         a = b + 1;      
x1 = a - 8;        males = total - females;
d = z * 5;         f = d * e;
h = i / j;         average = total / number;

Performing an arithmetic operation on a variable with a missing value results in a missing value.
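
A brief sketch of exponentiation and of how a missing value propagates through an arithmetic expression. The variable names are arbitrary.

```sas
data _null_;
   b = 4;
   c = .;
   square = b ** 2;          /* 16 */
   total  = b + c;           /* missing, because c is missing */
   put square= total=;       /* writes square=16 total=. to the log */
run;
```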

 

 


Assignment Statements

Variables can be created or modified by using an assignment statement. Note that no special keywords such as "ASSIGN" or "COMPUTE" are needed.

 

year = 1996;       percent = 105.72;
city = 'Boston';   zipcode = '00101';	

Functions

SAS has hundreds of built-in functions that modify data. There are functions for arithmetic operations, character manipulation, elementary statistics, and date and time manipulation, among others.

Note that the arithmetic functions only use non-missing values. Missing values are ignored. In some cases, the functions should be used in place of the arithmetic operators.

If we have the following variables:
   first = 7;
   second = 2;
   third = .;

The value of result will be dramatically different depending on how the variables are summed.

   result = first + second + third;     /* result = . (missing) */
   result = sum(first,second,third);    /* result = 9 */

The SAS log will print a message when missing values are generated as a result of performing an operation on missing values.
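
A few more functions, sketched with the same three values. MEAN, N, and NMISS are statistical functions that skip missing arguments; UPCASE and YEAR are examples of a character function and a date function. The result variable names are made up.

```sas
data _null_;
   first = 7;  second = 2;  third = .;
   avg     = mean(first, second, third);    /* 4.5, from the two non-missing values */
   n_valid = n(first, second, third);       /* 2 non-missing arguments */
   n_miss  = nmiss(first, second, third);   /* 1 missing argument */
   city    = upcase('Boston');              /* BOSTON */
   yr      = year('15MAR1998'd);            /* 1998 */
   put avg= n_valid= n_miss= city= yr=;
run;
```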

 


Copyright ©2009, The Pennsylvania State University | Last modified Aug 21, 2008