Accessing Data: Center of Excellence Data Warehousing


Accessing Data

Last Updated : 29 June, 2004

Center of Excellence Data Warehousing

Agenda
Creating datasets using the DATA step
The INFILE statement
Different input styles
Combining datasets using the DATA step

What Is the SAS System?


The SAS System is an integrated system of software products that enables you to perform
data entry, retrieval, and management
report writing and graphics
statistical and mathematical analysis
business planning, forecasting, and decision support
operations research and project management
quality improvement
applications development.

What Is the SAS System?


In addition, you can integrate with SAS many SAS business solutions that enable you to perform large-scale business functions, such as
Data Warehousing and Data Mining
Human Resources Management and Decision Support
Financial Management and Decision Support

Overview of Base SAS Software


The core of the SAS System is base SAS software, which consists of the following:

SAS language
a programming language that you use to manage your data.

SAS procedures
software tools for data analysis and reporting.

Macro facility
a tool for extending and customizing SAS software programs and for reducing text in your programs.

DATA step debugger


a programming tool that helps you find logic problems in DATA step programs.

Output Delivery System (ODS)


a system that delivers output in a variety of easy-to-access formats, such as SAS data sets, listing files, or Hypertext Markup Language (HTML).

Base SAS covers data access, data management, data analysis, and data presentation.

Components of the SAS Language


SAS Files
Files with formats or structures known to SAS. All SAS files reside in a SAS data library.

SAS data set
is structured in a format that SAS can process.

SAS catalog
stores many different kinds of information that are used in a SAS job, such as instructions for reading and printing data values, or function key settings that you use in the SAS windowing environment.

SAS stored program


contains compiled code that you create and save for repeated use.

Components of the SAS Language


SAS Data Sets

SAS data file
both describes and physically stores data values.

Descriptor portion
describes the contents of the SAS data set to SAS.

Data portion
contains data that has been collected or calculated. An observation is a collection of data values that usually relate to a single object. A variable is the set of data values that describe a given characteristic.

SAS data view
does not actually store values but creates a logical SAS data set without using the storage space required by SAS data files.

Structure of SAS Data Sets


Descriptor Portion
General data set information: Name, Number of Obs., *Label, Number of Variables, *Date/Time Created, Storage Information
Information for each variable: Name, Type, Length, Position, *Label, *Format, *Informat

Data Portion
One row per observation, with the variables IDNUM, NAME, WAGECAT, and WAGERATE. Sample values include NAME "Farr, Sue", WAGECAT S, and WAGERATE 3392.50; a period (.) marks a missing WAGERATE.

Components of the SAS Language


External Files
Data files that you use to read and write data, but which are in a structure unknown to SAS. External files can be used for storing
raw data that you want to read into a SAS data file
SAS program statements
procedure output.

Database Management System Files

SAS software is able to read and write data to and from other vendors' software, such as many common database management system (DBMS) files. In addition to base SAS software, you must license the SAS/ACCESS software for your DBMS and operating environment.

Components of the SAS Language


SAS Language Elements
DATA step
consists of a group of statements in the SAS language that reads raw data or existing SAS data sets to create a SAS data set.

PROC step
consists of a group of procedure statements used to analyze data in SAS data sets to produce statistics, tables, reports, charts, and plots, to create SQL queries, and to perform other analyses and operations on your data. Procedures also provide ways to manage and print SAS files.

SAS Macro Facility
a powerful programming tool for extending and customizing your SAS programs, and for reducing the amount of code that you must enter to do common tasks. Macros are SAS files that contain compiled macro program statements and stored text.

What Can the DATA Step Do?


You can use the DATA step in the following ways to transform your information:

Read from a raw data file into the SAS System.
Raw Data File -> DATA Step -> SAS Data Set

Create multiple SAS data sets in one DATA step.

Combine existing data sets.
SAS Data Set 1 + SAS Data Set 2 -> DATA Step -> SAS Data Set

You can also add or augment information in a variety of ways.

Create accumulating totals.

SaleDate     SaleAmt    Mth2Dte
01APR2001     498.49     498.49
02APR2001     946.50    1444.99
03APR2001     994.97    2439.96
04APR2001     564.59    3004.55
05APR2001     783.01    3787.56
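
A running total such as Mth2Dte is commonly built with a sum statement, which keeps the accumulator's value from one iteration to the next. The following is only a sketch, not the program behind the slide: the data set name work.sales and the variable name SaleAmt are assumptions, and the five sales are supplied as in-stream data lines.

```sas
/* Sketch: accumulate a month-to-date total with a sum statement. */
/* work.sales and SaleAmt are illustrative names.                  */
data work.sales;
   input SaleDate : date9. SaleAmt;
   /* Sum statement: Mth2Dte starts at 0 and is retained across iterations. */
   Mth2Dte + SaleAmt;
   format SaleDate date9. SaleAmt Mth2Dte 8.2;
   datalines;
01APR2001 498.49
02APR2001 946.50
03APR2001 994.97
04APR2001 564.59
05APR2001 783.01
;
run;

proc print data=work.sales noobs;
run;
```

The last observation should carry a Mth2Dte of 3787.56, the sum of the five sale amounts.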

What Can the DATA Step Do?

Manipulate numeric values.
BirthDay 4253 -> SAS function -> Age 30

Summarize data sets.

Salary   Div
42000    HUMRES
34000    FINACE
27000    FLTOPS
20000    FINACE
19000    FINACE
19000    FLTOPS

-> DATA Step ->

Div      DivSal
FINACE   73000
FLTOPS   46000
HUMRES   42000

Present your data.
Manage files.
And much, much more.

Words in the SAS Language


A word or token in the SAS language is a collection of characters that communicates a meaning to SAS. A word or token ends when SAS encounters one of the following:
the beginning of a new token
a blank after a name or a number token
the ending quotation mark of a literal token.

Each word or token in the SAS language is classified into one of four categories.

Names - a series of characters that begins with a letter or an underscore.
Ex.: data, _old, yearcutoff, _n_, year_04, descending

Literal - consists of 1 to 32,767 characters enclosed in single or double quotation marks.
Ex.: 'Bangalore', '2003-04', "Wipro's Plan", "Report for the Third Quarter"

Number - in general is composed entirely of numeric digits, with an optional decimal point and a leading plus or minus sign. SAS also recognizes numeric values in the following forms as number tokens: scientific (E-) notation, hexadecimal notation, missing value symbols, and date and time literals.
Ex.: 1234, -2004, 1.25, 5.4E-1, '30jun04'd

Special character - is usually any single keyboard character other than letters, numbers, the underscore, and the blank. In general, each special character is a single token, although some two-character operators, such as ** and <=, form single tokens.
Ex.: =, :, @, +, /

Names in the SAS Language


A SAS name is a name token that represents
variables
SAS data sets
formats or informats
SAS procedures
options
arrays
statement labels
SAS macros or macro variables
SAS catalog entries
librefs or filerefs.

There are two kinds of names in SAS:
names of elements of the SAS language
names supplied by SAS users.

Names in the SAS Language


Rules for User-Supplied SAS Names
Maximum length, in characters, of each kind of user-supplied name:

Members of SAS data libraries (SAS data sets, views, catalogs, indexes)    32
Generation data sets                                                       28
Catalog entries                                                            32
Engines, Librefs, Filerefs, Passwords                                       8
DATA step variables                                                        32
DATA step variable labels                                                 256
DATA step statement labels                                                 32
Arrays                                                                     32
Functions                                                                  16
Formats                                                                     8
Informats                                                                   7
Macros, Macro variables                                                    32

Names in the SAS Language


Rules for User-Supplied SAS Names
The first character must be a letter (A, B, C, . . ., Z) or underscore (_). Subsequent characters can be letters, numeric digits (0, 1, . . ., 9), or underscores.
You can use upper or lowercase letters. SAS processes names as uppercase regardless of how you type them.
Blanks cannot appear in SAS names.
SAS reserves a few names for automatic variables and variable lists, SAS data sets, and librefs.
When creating variables, do not use the names of special SAS automatic variables (for example, _N_ and _ERROR_) or special variable list names (for example, _CHARACTER_, _NUMERIC_, and _ALL_).
When associating a libref with a SAS data library, do not use SASHELP, SASMSG, SASUSER, or WORK.
When you create SAS data sets, do not use _NULL_, _DATA_, or _LAST_.

Names in the SAS Language


Rules for User-Supplied SAS Names
Special characters, except for the underscore, are not allowed. In filerefs only, you can use the dollar sign ($), pound sign (#), and at sign (@).
When assigning a fileref to an external file, do not use SASCAT.
When you create a macro variable, do not use names that begin with SYS.

SAS Dates
SAS dates are special numeric values representing the number of days between January 1, 1960 and a specified date.
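
Date         SAS date value
01jan1959    -365
01jan1960    0
01jan1961    366
01jan2000    14610

An informat such as DATE9. converts a date written as text into a SAS date value. A format such as MMDDYY10. displays a SAS date value as a calendar date.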
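
The relationship between date strings, SAS date values, informats, and formats can be checked in the log. The sketch below uses a DATA _NULL_ step with date literals and the INPUT function; the variable names are illustrative and nothing here is taken from the original slides beyond the four dates.

```sas
/* Sketch: SAS date values are day counts relative to 01JAN1960. */
data _null_;
   d1959 = '01jan1959'd;                 /* date literal */
   d1960 = '01jan1960'd;
   d1961 = '01jan1961'd;
   /* The INPUT function with the DATE9. informat converts text to a date value. */
   d2000 = input('01jan2000', date9.);
   put d1959= d1960= d1961= d2000=;      /* unformatted: raw day counts */
   /* The MMDDYY10. format displays the stored number as a calendar date. */
   put d2000= mmddyy10.;
run;
```

The first PUT statement should write the values -365, 0, 366, and 14610 to the log, and the second should write the same stored value for d2000 as 01/01/2000.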

Standard Data
The term standard data refers to character and numeric data that SAS recognizes automatically. Some examples of standard numeric data include
35469.93
3E5 (exponential notation)
-46859

Standard character data is any character you can type on your keyboard. Standard character values are always left-justified by SAS.

Nonstandard Data
The term nonstandard data refers to character and numeric data that SAS does not recognize automatically. Examples of nonstandard numeric data include
12/12/2012
29FEB2000
4,242
$89,000

Create a SAS Data Set from a Raw Data File


A raw data file contains employee information for the level 1 flight attendants. Use the raw data file to create the work.fltat1 SAS data set.

E1232 15OCT1999 61065
E2341 01JUN1997 91688
E3452 26OCT1993 32639
E6781 16SEP1992 28305
E8321 26NOV1996 40440
E1052 27FEB1997 39461
E1062 10MAY1987 41463
E8172 06JAN2000 40650
E1091 20AUG1991 40950

Desired Output
Obs   EmpID   HireDate   Salary     Bonus
 1    E1232    14532      61065    3053.25
 2    E2341    13666      91688    4584.40
 3    E3452    12352      32639    1631.95
 4    E6781    11947      28305    1415.25
 5    E8321    13479      40440    2022.00
 6    E1052    13572      39461    1973.05
 7    E1062     9991      41463    2073.15
 8    E8172    14615      40650    2032.50
 9    E1091    11554      40950    2047.50

The DATA Statement


A DATA step always begins with a DATA statement. General form of a DATA statement:

DATA SAS-data-set;

The DATA statement starts the DATA step and names the SAS data set being created.

The INFILE Statement


If you are reading data from a raw data file, you need an INFILE statement. General form of an INFILE statement:
INFILE 'raw-data-file' <options>;

The INFILE statement points to the raw data file being read. Options in the INFILE statement affect how SAS reads the raw data file.

The INPUT Statement


When you are reading from a raw data file, the INPUT statement follows the INFILE statement. General form of an INPUT statement:

INPUT variable-specification ;

The INPUT statement describes the raw data fields and specifies how you want them converted into SAS variables.

Formatted Input
The input style tells SAS where to find the fields and how to read them into SAS.
INPUT @n variable-name informat. ...;

@n - moves the pointer to the starting point of the field.
variable-name - names the SAS variable being created.
informat - specifies how many positions to read and how to convert the raw data into a SAS value.

The INPUT Statement


Common SAS informats:
$w. - reads a standard character field, where w specifies the width of the field in bytes.
w. - reads a standard numeric field, where w specifies the width of the field in bytes.
DATE9. - reads dates in the form 31DEC2012.

The Assignment Statement


To create a new variable in the DATA step, use an assignment statement:

variable-name=expression;

The assignment statement creates a SAS variable and specifies how to calculate that variable's value.

Create a SAS Data Set from a Raw Data File


data work.fltat1;
   infile 'raw-data-file';
   input @1 EmpID $5.
         @7 HireDate date9.
         @17 Salary 5.;
   Bonus=.05*Salary;
run;

Create a SAS Data Set from a Raw Data File


Partial Log

NOTE: 9 records were read from the infile 'fltat1.dat'.
      The minimum record length was 21.
      The maximum record length was 21.
NOTE: The data set WORK.FLTAT1 has 9 observations and 4 variables.
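
To try the step without an external file, the same formatted input can read in-stream data lines. This is a sketch under that assumption: the DATALINES statement replaces the INFILE statement, and the PROC PRINT step is added only to display the result.

```sas
/* Sketch: the fltat1 step reading in-stream data instead of a file. */
data work.fltat1;
   input @1  EmpID    $5.      /* columns 1-5   */
         @7  HireDate date9.   /* columns 7-15  */
         @17 Salary   5.;      /* columns 17-21 */
   Bonus = .05 * Salary;       /* assignment statement creates Bonus */
   datalines;
E1232 15OCT1999 61065
E2341 01JUN1997 91688
E3452 26OCT1993 32639
E6781 16SEP1992 28305
E8321 26NOV1996 40440
E1052 27FEB1997 39461
E1062 10MAY1987 41463
E8172 06JAN2000 40650
E1091 20AUG1991 40950
;
run;

proc print data=work.fltat1;
run;
```

HireDate prints as a raw SAS date value (for example 14532 for 15OCT1999) because no format is attached to it.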

Overview of DATA Step Processing

Processing the DATA Step


The SAS System processes the DATA step in two phases:
compilation
execution.

When you submit a DATA step for execution, SAS checks the syntax of the SAS statements and compiles them. During the compile phase, SAS creates the following three items:

input buffer
is a logical area in memory into which SAS reads each record of raw data when SAS executes an INPUT statement.

program data vector (PDV)
is a logical area in memory where SAS builds a data set, one observation at a time. When a program executes, SAS reads data values from the input buffer or creates them by executing SAS language statements. The data values are assigned to the appropriate variables in the program data vector. From here, SAS writes the values to a SAS data set as a single observation. Along with data set variables and computed variables, the PDV contains two automatic variables, _N_ and _ERROR_. The _N_ variable counts the number of times the DATA step begins to iterate. The _ERROR_ variable signals the occurrence of an error caused by the data during execution.

descriptor information
is information that SAS creates and maintains about each SAS data set, including data set attributes and variable attributes. It contains, for example, the name of the data set and its member type, the date and time that the data set was created, and the number, names, and data types (character or numeric) of the variables.

DATA Step Processing


The flow of action in the Execution Phase of a simple DATA step
The DATA step begins with a DATA statement. Each time the DATA statement executes, a new iteration of the DATA step begins, and the _N_ automatic variable is incremented by 1.
SAS sets the newly created program variables to missing in the program data vector (PDV).
SAS reads a data record from a raw data file into the input buffer, or it reads an observation from a SAS data set directly into the program data vector. You can use an INPUT, MERGE, SET, MODIFY, or UPDATE statement to read a record.
SAS executes any subsequent programming statements for the current record.

DATA Step Processing


At the end of the statements, an output, return, and reset occur automatically. SAS writes an observation to the SAS data set, the system automatically returns to the top of the DATA step, and the values of variables created by INPUT and assignment statements are reset to missing in the program data vector. Note that variables that you read with a SET, MERGE, MODIFY, or UPDATE statement are not reset to missing here.
SAS counts another iteration, reads the next record or observation, and executes the subsequent programming statements for the current observation.
The DATA step terminates when SAS encounters the end-of-file in a SAS data set or a raw data file.
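
The iteration counter and the error flag described above can be watched in the log with a PUT statement. This is an illustrative sketch only: the data set work.demo, the variables x and y, and the deliberately invalid second data line are all assumptions made for the demonstration.

```sas
/* Sketch: watch _N_ and _ERROR_ change from one iteration to the next. */
data work.demo;
   input x;                     /* numeric variable read with list input        */
   y = x * 2;                   /* computed variable, reset to missing each pass */
   put _n_= x= y= _error_=;     /* write the current PDV values to the log       */
   datalines;
10
abc
30
;
run;
```

The second line is not a valid number, so on that iteration x and y should be missing and _ERROR_ should be 1. On the third iteration _ERROR_ should be back to 0, which shows the automatic reset at the top of the step.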

DATA step
Reading External File Data
data bonus_04;                      /* 1 */
   infile 'your-input-file';        /* 2 */
   input IDnumber name $ salary;    /* 3 */
   bonus=salary * 0.25;             /* 4 */
run;                                /* 5 */

1 - Begin the DATA step and create a SAS data set called bonus_04.
2 - Specify the external file that contains your data.
3 - Read a record and assign values to three variables.
4 - Calculate a value for variable bonus.
5 - Execute the DATA step.

DATA step Input Styles


The INPUT statement reads raw data from instream data lines or external files into a SAS data set. Choose an input style according to the layout of the data values in the records.
INPUT, Formatted - Reads input values with specified informats and assigns them to the corresponding SAS variables.
INPUT, Column - Reads input values from specified columns and assigns them to the corresponding SAS variables.
INPUT, List - Scans the input data record for input values and assigns them to the corresponding SAS variables.
INPUT, Named - Reads data values that appear after a variable name that is followed by an equal sign and assigns them to corresponding SAS variables.

DATA step
An informat is an instruction that SAS uses to read data values into a variable.
The INPUT statement with an informat after a variable name is the simplest way to read values into a variable.
$w.       - Reads standard character data
DATEw.    - Reads date values in the form ddmmmyy or ddmmmyyyy
MMDDYYw.  - Reads date values in the form mmddyy or mmddyyyy
w.d       - Reads standard numeric data
COMMAw.d  - Removes embedded characters

Accessing Data List input


List input uses a scanning method for locating data values. Data values must be separated by at least one blank (or other defined delimiter). List input requires only that you specify the variable names and a dollar sign ($), if defining a character variable.

libname new '/wipro/dw/data';
data new.scores;
   length name $ 12;
   input name $ score1 score2;
   datalines;
aaaaaa 1132 1187
bbbbbbbb 1015 1102
cccc 246 357
;
run;

Modified List Input


data scores;
   infile datalines dsd;
   input Name : $9. Score1-Score3 Team ~ $25. Div $;
   datalines;
Smith,12,22,46,"Green Hornets, Atlanta",AAA
Mitchel,23,19,25,"High Volts, Portland",AAA
Jones,09,17,54,"Vulcans, Las Vegas",AA
;
run;

Output

Name      Score1   Score2   Score3   Team                        Div
Smith       12       22       46     "Green Hornets, Atlanta"    AAA
Mitchel     23       19       25     "High Volts, Portland"      AAA
Jones        9       17       54     "Vulcans, Las Vegas"        AA

Modified List Input


The : (colon) format modifier enables you to use list input but also to specify an informat after a variable name, whether character or numeric. SAS reads until it encounters a blank column.
The ~ (tilde) format modifier enables you to read and retain single quotation marks, double quotation marks, and delimiters within character values.
If you want SAS to read consecutive delimiters as though there is a missing value between them, specify the DSD option in the INFILE statement.
To read and store a character input value longer than 8 bytes, define the variable's length by using a LENGTH or INFORMAT statement.
Character values cannot contain embedded blanks when the file is delimited by blanks.
Fields must be read in order.
Data must be in standard numeric or character format.

Data Accessing - Column Input


Column input enables you to read standard data values that are aligned in columns in the data records. Specify the variable name, followed by a dollar sign ($) if it is a character variable, and specify the columns in which the data values are located in each record:

data scores;
   infile datalines truncover;
   /* name is in columns 1-12, score1 in 17-20, score2 in 27-30 */
   input name $ 1-12 score1 17-20 score2 27-30;
   datalines;
Riley           1132      987
Henderson       1015      1102
;
run;

Data Accessing - Column Input


To use column input, data values must be
in the same field on all the input lines
in standard numeric or character form.

Features of column input include the following:
Character values can contain embedded blanks.
Character values can be from 1 to 32,767 characters long.
Placeholders, such as a single period (.), are not required for missing data.
Input values can be read in any order, regardless of their position in the record.
Values or parts of values can be reread.
Both leading and trailing blanks within the field are ignored.
Values do not need to be separated by blanks or other delimiters.
Use the TRUNCOVER option on the INFILE statement to ensure that SAS handles data values of varying lengths appropriately.

Data Accessing - Formatted Input


Formatted input combines the flexibility of using informats with many of the features of column input. By using formatted input, you can read nonstandard data for which SAS requires additional instructions. Formatted input is typically used with pointer controls that enable you to control the position of the input pointer in the input buffer when you read data.

data scores;
   input name $12. +4 score1 comma5. +6 score2 comma5.;
   datalines;
Riley           1,132      1,187
Henderson       1,015      1,102
;
run;

Data Accessing - Formatted Input


Important points about formatted input are
Character values can contain embedded blanks.
Character values can be from 1 to 32,767 characters long.
Placeholders, such as a single period (.), are not required for missing data.
With the use of pointer controls to position the pointer, input values can be read in any order, regardless of their positions in the record.
Values or parts of values can be reread.
Formatted input enables you to read data stored in nonstandard form, such as packed decimal or numbers with commas.

Data Accessing - Named Input


You can use named input to read records in which data values are preceded by the name of the variable and an equal sign (=). The following INPUT statement reads the data lines containing equal signs.

data games;
   input name=$ score1= score2=;
   datalines;
name=abc score1=1132 score2=1187
;
run;

The MISSOVER Option


The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. General form of the INFILE statement with the MISSOVER option:

INFILE 'raw-data-file' MISSOVER;

If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing.

Using the MISSOVER Option


data airplanes3;
   length ID $ 5;
   infile 'raw-data-file' dlm=',' missover;
   input ID $ InService : date9. PassCap CargoCap;
run;

Raw Data File
50001 ,25feb1989,132, 530
50002, 11nov1989,152
50003, 22oct1991,168, 530
50004, 4feb1993,172
50005, 24jun1993, 170, 510
50006, 20dec1994, 180, 520

Using the MISSOVER Option


Partial SAS Log

NOTE: 6 records were read from the infile 'aircraft3.dat'.
      The minimum record length was 19.
      The maximum record length was 26.
NOTE: The data set WORK.AIRPLANES3 has 6 observations and 4 variables.
NOTE: DATA statement used:
      real time 0.42 seconds
      cpu time 0.07 seconds

Missing Values without Placeholders


There is missing data represented by two consecutive delimiters.

50001 ,25feb1989,, 540
50002, 11nov1989,132, 530
50003, 22oct1991,168, 530
50004, 4feb1993,172, 550
50005, 24jun1993,, 510
50006, 20dec1994, 180,520

Missing Values without Placeholders


By default, SAS treats two consecutive delimiters as one. Missing data should be represented by a placeholder.

50001 ,25feb1989, . , 530

The DSD Option


General form of the DSD option in the INFILE statement:

INFILE 'file-name' DSD;

Missing Values without Placeholders


The DSD option
sets the default delimiter to a comma
treats consecutive delimiters as missing values
enables SAS to read values with embedded delimiters if the value is surrounded by double quotes.

Using the DSD Option


data airplanes4;
   length ID $ 5;
   infile 'raw-data-file' dsd;
   input ID $ InService : date9. PassCap CargoCap;
run;

50001 ,25feb1989,, 540
50002, 11nov1989,132, 530
50003, 22oct1991,168, 530
50004, 4feb1993,172, 550
50005, 24jun1993,, 510
50006, 20dec1994, 180,520
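
The effect of DSD on consecutive delimiters is easiest to see by reading the same lines twice. The sketch below is not from the slides: the data set names work.no_dsd and work.with_dsd are made up, in-stream data replaces the external file, and MISSOVER is added to the first step only so that a short record does not pull in the next line.

```sas
/* Sketch: default delimiter handling, where ",," counts as ONE delimiter. */
data work.no_dsd;
   length ID $ 5;
   infile datalines dlm=',' missover;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001,25feb1989,,540
50002,11nov1989,132,530
;
run;

/* Same lines with DSD: ",," now means a missing value between the commas. */
data work.with_dsd;
   length ID $ 5;
   infile datalines dsd;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001,25feb1989,,540
50002,11nov1989,132,530
;
run;

proc print data=work.no_dsd;
run;
proc print data=work.with_dsd;
run;
```

For ID 50001, the first step should place 540 in PassCap and leave CargoCap missing, while the DSD step should leave PassCap missing and place 540 in CargoCap.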

INFILE Statement Options


Problem                                                        Option
Non-blank delimiters                                           DLM='delimiter(s)'
Missing data at end of row                                     MISSOVER
Missing data represented by consecutive delimiters,            DSD
or embedded delimiters where values are surrounded
by double quotes

These options can be used separately or together in the INFILE statement.

Multiple Records Per Observation


A raw data file has three records per employee. Record 1 contains the first and last names, record 2 contains the city and state of residence, and record 3 contains the employee's phone number.

Farr, Sue
Anaheim, CA
869-7008
Anderson, Kay B.
Chicago, IL
483-3321
Tennenbaum, Mary Ann
Jefferson, MO
589-9030

Desired Output
The SAS data set should have one observation per employee.

LName        FName      City        State   Phone
Farr         Sue        Anaheim     CA      869-7008
Anderson     Kay B.     Chicago     IL      483-3321
Tennenbaum   Mary Ann   Jefferson   MO      589-9030

Multiple INPUT Statements


data address;
   length LName FName $ 20 City $ 25 State $ 2 Phone $ 8;
   infile 'raw-data-file' dlm=',';
   input LName $ FName $;    /* load record */
   input City $ State $;     /* load record */
   input Phone $;            /* load record */
run;

Line Pointer Controls


You can also use line pointer controls to control when SAS loads a new record.
DATA SAS-data-set;
   INPUT var-1 var-2 var-3 / var-4 var-5;
   additional SAS statements

SAS loads the next record when it encounters a forward slash.

Reading Multiple Records Per Observation


data address;
   length LName FName $ 20 City $ 25 State $ 2 Phone $ 8;
   infile 'raw-data-file' dlm=',';
   input LName $ FName $ /   /* load record */
         City $ State $ /    /* load record */
         Phone $;            /* load record */
run;

Reading Multiple Records Per Observation


Partial Log

NOTE: 9 records were read from the infile 'addresses.dat'.
      The minimum record length was 8.
      The maximum record length was 20.
NOTE: The data set WORK.ADDRESS has 3 observations and 5 variables.

Reading Raw Data Files with Multiple Records Per Observation


Mixed Record Types
101 USA 1-20-1999 3295.50
3034 EUR 30JAN1999 1876,30
101 USA 1-30-1999 2938.00
128 USA 2-5-1999 2908.74
1345 EUR 6FEB1999 3145,60
109 USA 3-17-1999 2789.10

Reading Raw Data Files with Multiple Records Per Observation


Desired Output
SalesID   Location   SaleDate   Amount
101       USA        14264      3295.50
3034      EUR        14274      1876.30
101       USA        14274      2938.00
128       USA        14280      2908.74
1345      EUR        14281      3145.60
109       USA        14320      2789.10

Reading Raw Data Files with Multiple Records Per Observation


The Single Trailing @
The single trailing @ option holds a raw data record in the input buffer until SAS
executes an INPUT statement with no trailing @, or
reaches the bottom of the DATA step.
General form of an INPUT statement with the single trailing @:

INPUT var1 var2 var3 @;

Reading Raw Data Files with Multiple Records Per Observation


Processing the Trailing @
input SalesID $ Location $ @;                   /* hold record for next INPUT statement */
if Location='USA' then
   input SaleDate : mmddyy10. Amount;           /* no trailing @: the next INPUT loads the next record */
else if Location='EUR' then
   input SaleDate : date9. Amount : commax8.;   /* no trailing @: the next INPUT loads the next record */
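
Placed in a complete DATA step, the statements above can be run against the six mixed records. This is a sketch with assumptions: the data set name work.sales and the LENGTH statement are additions, and in-stream data lines stand in for the raw data file.

```sas
/* Sketch: read mixed record types by holding each record with a trailing @. */
data work.sales;
   length SalesID $ 4 Location $ 3;   /* assumed lengths for the two character fields */
   input SalesID $ Location $ @;      /* read the first two fields and hold the record */
   if Location = 'USA' then
      input SaleDate : mmddyy10. Amount;           /* U.S. layout: 1-20-1999 3295.50  */
   else if Location = 'EUR' then
      input SaleDate : date9. Amount : commax8.;   /* European layout: 30JAN1999 1876,30 */
   datalines;
101 USA 1-20-1999 3295.50
3034 EUR 30JAN1999 1876,30
101 USA 1-30-1999 2938.00
128 USA 2-5-1999 2908.74
1345 EUR 6FEB1999 3145,60
109 USA 3-17-1999 2789.10
;
run;

proc print data=work.sales noobs;
run;
```

The COMMAX8. informat treats the comma as the decimal point, so 1876,30 should be stored as 1876.30.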

Multiple Observations Per Record


The raw data file RETIRE contains each employee's identification number and this year's contribution to his or her retirement plan. Each record contains information for three employees.

E00973 1400 E09872 2003 E73150 2400
E45671 4500 E34805 1980 E47200 4371

Desired Output

EmpID    Contrib
E00973   1400
E09872   2003
E73150   2400
E45671   4500
E34805   1980
E47200   4371

Multiple Observations Per Record


Processing: What Is Required?
E00973 1400       E09872 2003       E73150 2400
Read for Obs. 1   Read for Obs. 2   Read for Obs. 3

For each of the three pairs of values in a record, SAS must read the pair, process the other statements, and output an observation.

Multiple Observations Per Record


The Double Trailing @
The double trailing @ holds the raw data record across iterations of the DATA step until the line pointer moves past the end of the line.

INPUT var1 var2 var3 @@;

data work.retire;
   length EmpID $ 6;
   infile 'raw-data-file';
   input EmpID $ Contrib @@;   /* hold until end of record */
run;

Multiple Observations Per Record


Partial Log
NOTE: 2 records were read from the infile 'retire.dat'.
      The minimum record length was 35.
      The maximum record length was 36.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.RETIRE has 6 observations and 2 variables.

The "SAS went to a new line" message is expected because the @@ option indicates that SAS should read until the end of each record.
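
The same step can be tried with in-stream data. The following is a sketch that assumes the two RETIRE records are supplied through a DATALINES statement instead of an external file; the PROC PRINT step is added only to show the result.

```sas
/* Sketch: three observations per record, read with the double trailing @. */
data work.retire;
   length EmpID $ 6;
   input EmpID $ Contrib @@;   /* hold the record across DATA step iterations */
   datalines;
E00973 1400 E09872 2003 E73150 2400
E45671 4500 E34805 1980 E47200 4371
;
run;

proc print data=work.retire noobs;
run;
```

Each iteration reads one EmpID and Contrib pair, so the two data lines should produce six observations.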

Multiple Observations Per Record


Trailing @ Versus Double Trailing @
Option: Trailing @
   INPUT var-1 ... @;
Effect: Holds the raw data record until
   1) an INPUT statement with no trailing @ executes, or
   2) the bottom of the DATA step is reached.

Option: Double trailing @
   INPUT var-1 ... @@;
Effect: Holds raw data records in the input buffer until SAS reads past the end of the line.

Reading Hierarchical Raw Data Files


Processing Hierarchical Files
Many files are hierarchical in structure, consisting of
a header record
one or more related detail records.
Typically, each record contains a field that identifies whether it is a header record or a detail record.

Header
Detail
Detail
Header
Header
Detail
Header
Detail
Detail

Reading Hierarchical Raw Data Files


Processing Hierarchical Files
You can read a hierarchical file into a SAS data set by creating one observation per detail record and storing the header information as part of each observation.

Hierarchical File
Header 1
Detail 1
Detail 2
Detail 3
Header 2
Detail 1
Header 3
Detail 1
Detail 2

SAS Data Set
Header Variables   Detail Variables
Header 1           Detail 1
Header 1           Detail 2
Header 1           Detail 3
Header 2           Detail 1
Header 3           Detail 1
Header 3           Detail 2

Reading Hierarchical Raw Data Files


Creating One Observation Per Detail
The raw data file DEPENDANTS has a header record containing the name of the employee and a detail record for each dependant on the employee's health insurance.

E:Adams:Susan
D:Michael:C
D:Lindsay:C
E:Porter:David
D:Susan:S
E:Lewis:Dorian D.
D:Richard:C
E:Dansky:Ian
E:Nicholls:James
D:Roberta:C
E:Slaydon:Marla
D:John:S

Reading Hierarchical Raw Data Files


Desired Output
Personnel would like a list of all the dependants and the name of the associated employee.

EmpLName   EmpFName    DepName   Relation
Adams      Susan       Michael   C
Adams      Susan       Lindsay   C
Porter     David       Susan     S
Lewis      Dorian D.   Richard   C
Nicholls   James       Roberta   C
Slaydon    Marla       John      S

Reading Hierarchical Raw Data Files


The RETAIN Statement
General form of the RETAIN statement:

RETAIN variable-name <initial-value>;

The RETAIN statement prevents SAS from reinitializing the values of new variables at the top of the DATA step. This means that values from previous records are available for processing.

Reading Hierarchical Raw Data Files


Hold EmpLName and EmpFName
data dependants(drop=Type);
   length EmpLName EmpFName DepName $ 20 Relation $ 1;
   retain EmpLName EmpFName;
   infile 'raw-data-file' dlm=':';
   input Type $ @;
   if Type='E' then
      input EmpLName $ EmpFName $;
   else do;
      input DepName $ Relation $;
      output;
   end;
run;
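
A self-contained variant of this step can be run with in-stream data. It is a sketch under one assumption: the INFILE statement points at DATALINES rather than the external DEPENDANTS file, with the twelve records typed in directly.

```sas
/* Sketch: one observation per detail record, header values held by RETAIN. */
data dependants(drop=Type);
   length EmpLName EmpFName DepName $ 20 Relation $ 1;
   retain EmpLName EmpFName;        /* keep employee names across iterations */
   infile datalines dlm=':';
   input Type $ @;                  /* read the record type and hold the record */
   if Type='E' then
      input EmpLName $ EmpFName $;  /* header: remember the employee, no output */
   else do;
      input DepName $ Relation $;   /* detail: read the dependant */
      output;                       /* write one observation per detail record */
   end;
   datalines;
E:Adams:Susan
D:Michael:C
D:Lindsay:C
E:Porter:David
D:Susan:S
E:Lewis:Dorian D.
D:Richard:C
E:Dansky:Ian
E:Nicholls:James
D:Roberta:C
E:Slaydon:Marla
D:John:S
;
run;

proc print data=dependants noobs;
run;
```

Because OUTPUT executes only for detail records, an employee with no dependants (Dansky here) contributes no observation, which gives six observations in total.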

Example Modified List Input


This example walks through the compilation phase and the execution phase.

Example Modified List Input


Raw Data File
50001 50002 50003 50004 50005 50006

Compile

4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520

data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;

Input Buffer

ID $ 5

PDV

Raw Data File


50001 50002 50003 50004 50005 50006

Compile

4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520

data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;

Input Buffer

ID $ 5

PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8

...

Raw Data File


50001 50002 50003 50004 50005 50006

Execute

4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520

data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;

Input Buffer

ID $ 5

PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8

.
...

Raw Data File


50001 50002 50003 50004 50005 50006 4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520

data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;

Input Buffer
5 0 0 0 1 4 f e b 1 9 8 9 1 3 2 5 3 0

ID $ 5

PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8

.
...

Raw Data File


50001 50002 50003 50004 50005 50006 4feb1989 132 530 11nov1989 152 540 22oct1991 90 530 4feb1993 172 550 24jun1993 170 510 20dec1994 180 520

data airplanes; length ID $ 5; infile 'raw-data-file'; input ID $ InService : date9. PassCap CargoCap; run;

Input Buffer
5 0 0 0 1 4 f eb 1 9 8 9 1 3 2 5 3 0

ID $ 5

PDV INSERVICE PASSCAP CARGOCAP N N N 8 8 8

50001

. 10627

. 132

530 .
...

Implicit output
Implicit return

Input Buffer
50001 4feb1989 132 530

PDV
ID     INSERVICE  PASSCAP  CARGOCAP
$ 5    N 8        N 8      N 8
50001  10627      132      530

Write out observation to airplanes.

Input Buffer
50001 4feb1989 132 530

PDV
ID     INSERVICE  PASSCAP  CARGOCAP
$ 5    N 8        N 8      N 8
       .          .        .

Input Buffer
50002 11nov1989 152 540

PDV
ID     INSERVICE  PASSCAP  CARGOCAP
$ 5    N 8        N 8      N 8
50002  10907      152      540

Implicit output
Implicit return

Input Buffer
50002 11nov1989 152 540

PDV
ID     INSERVICE  PASSCAP  CARGOCAP
$ 5    N 8        N 8      N 8
50002  10907      152      540

Write out observation to airplanes.

Input Buffer
50002 11nov1989 152 540

PDV
ID     INSERVICE  PASSCAP  CARGOCAP
$ 5    N 8        N 8      N 8

Continue processing until end of the raw data file.
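
One way to watch the PDV change from one iteration to the next is to write its contents to the SAS log. The sketch below is not part of the original example: it reads two of the records in-stream with DATALINES instead of an external file, and it adds two PUT statements (an assumption made for illustration) around the INPUT statement.

```sas
data airplanes;
   length ID $ 5;
   /* Top of the iteration: variables read by INPUT have been
      reset to missing */
   put 'Before INPUT: ' ID= InService= PassCap= CargoCap=;
   input ID $ InService : date9. PassCap CargoCap;
   /* The record has been read from the input buffer into the PDV */
   put 'After INPUT:  ' ID= InService= PassCap= CargoCap=;
   datalines;
50001 4feb1989 132 530
50002 11nov1989 152 540
;
run;
```

Because no FILE statement is used, the PUT statements write to the log, so the data set itself is unchanged.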

Output of Dataset
proc print data=airplanes noobs;
run;

ID     InService  PassCap  CargoCap
50001  10627      132      530
50002  10907      152      540
50003  11617      168      530
50004  12088      172      550
50005  12228      170      510
50006  12772      180      520
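
The in-service dates print as numbers because SAS stores a date as the number of days since 1 January 1960. A minimal sketch, assuming readable dates are wanted in the listing: attach the DATE9. format in PROC PRINT. The FORMAT statement and the in-stream DATALINES are additions for illustration, not part of the original program.

```sas
data airplanes;
   length ID $ 5;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001 4feb1989 132 530
50002 11nov1989 152 540
;
run;

proc print data=airplanes noobs;
   /* Display the stored day count as a date, e.g. 10627 as 04FEB1989 */
   format InService date9.;
run;
```

The format changes only how the value is displayed; the stored value is still the numeric day count.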

Combining SAS Data Sets


Create a data set from two or more existing data sets, either by joining observations side by side or by appending the observations from one data set to another.

Methods to combine SAS data sets
concatenating
interleaving
one-to-one reading
one-to-one merging
match merging
updating

Combine SAS data sets

Concatenating
Concatenating the data sets appends the observations from one data set to another data set. The DATA step reads DATA1 sequentially until all observations have been processed, and then reads DATA2. Data set COMBINED contains the results of the concatenation.
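
A minimal sketch of concatenation, using the data set names DATA1, DATA2 and COMBINED from the description above. The variables Year and Sales and their values are hypothetical, chosen only to make the example runnable.

```sas
/* Two small input data sets with the same variables */
data data1;
   input Year Sales;
   datalines;
1991 100
1992 110
;
run;

data data2;
   input Year Sales;
   datalines;
1991 200
1993 210
;
run;

/* SET with two data sets and no BY statement: all of DATA1 is
   read first, then all of DATA2 */
data combined;
   set data1 data2;
run;

proc print data=combined noobs;
run;
```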

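Interleaving
Interleaving intersperses observations from two or more data sets, based on one or more common variables. The result is ordered by those variables, so each input data set must already be sorted (or indexed) by them.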
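
Interleaving can be sketched as a SET statement followed by a BY statement. The data set and variable names below (data1, data2, Year, Sales) are hypothetical, and both inputs are entered already sorted by Year.

```sas
data data1;
   input Year Sales;
   datalines;
1991 100
1993 120
;
run;

data data2;
   input Year Sales;
   datalines;
1992 200
1994 220
;
run;

/* SET plus BY: observations are taken from both data sets in
   Year order (1991, 1992, 1993, 1994) */
data interleaved;
   set data1 data2;
   by Year;
run;

proc print data=interleaved noobs;
run;
```

If the inputs were not already in BY order, a PROC SORT step on each would be needed first.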

One-to-One Reading and One-to-One Merging
One-to-one reading combines observations from two or more SAS data sets by creating observations that contain all of the variables from each contributing data set. Observations are combined based on their relative position in each data set. The DATA step stops after it has read the last observation from the smallest data set.

One-to-one merging is similar to one-to-one reading, with two exceptions: you use the MERGE statement instead of multiple SET statements, and the DATA step reads all observations from all data sets.
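
The difference between the two methods shows up when the inputs have different numbers of observations. The sketch below uses hypothetical data sets names (3 observations) and ages (2 observations).

```sas
data names;
   input Name $;
   datalines;
Ann
Bob
Cara
;
run;

data ages;
   input Age;
   datalines;
31
45
;
run;

/* One-to-one reading: multiple SET statements. The step stops
   when the smaller data set (ages) runs out: 2 observations. */
data read_1to1;
   set names;
   set ages;
run;

/* One-to-one merging: MERGE without BY. All observations are
   read: 3 observations, with Age missing for the third. */
data merge_1to1;
   merge names ages;
run;
```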

Match merging
Match merging combines observations from two or more SAS data sets into a single observation in a new data set based on the values of one or more common variables.
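
A minimal sketch of a match merge, assuming two hypothetical data sets (employees and salaries) that share the variable EmpID and are entered already sorted by it.

```sas
data employees;
   input EmpID Name $;
   datalines;
1 Ann
2 Bob
3 Cara
;
run;

data salaries;
   input EmpID Salary;
   datalines;
1 50000
3 62000
4 48000
;
run;

/* MERGE with BY: observations with the same EmpID are joined.
   EmpID 2 has no salary and EmpID 4 has no name, so those
   variables are missing in the result. */
data merged;
   merge employees salaries;
   by EmpID;
run;

proc print data=merged noobs;
run;
```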

Identifying Data Set Contributors
When you read multiple SAS data sets in one DATA step, you can use the IN= data set option to detect which data set contributed to an observation.

General form of the IN= data set option:

SAS-data-set(IN=variable)

where variable is any valid SAS variable name. The variable is a temporary numeric variable with a value of:
0 to indicate false; the data set did not contribute to the current observation
1 to indicate true; the data set did contribute to the current observation
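
As a small illustration of the IN= option, the sketch below keeps only the observations that appear in both inputs. The data set names (left, right), the variable Key, and the flag names InLeft and InRight are hypothetical.

```sas
data left;
   input Key LeftVal $;
   datalines;
1 a
2 b
3 c
;
run;

data right;
   input Key RightVal $;
   datalines;
1 x
3 y
4 z
;
run;

data both;
   merge left(in=InLeft) right(in=InRight);
   by Key;
   /* Subsetting IF: keep the observation only when both data
      sets contributed (Key 1 and Key 3) */
   if InLeft and InRight;
run;
```

Because the IN= variables are temporary, InLeft and InRight are not written to the output data set.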

IN= Data Set Option

Transact
Num  Trans  Amnt
111  D      126.32
111  C      560
113  C      235
114  D      14.56
116  C      371.69

Branch
Num  Branch
111  M.G.Road
112  Sivaji Nagar
114  Madiwala
115  Koramangala
116  BTM

A data set named newtrans shows this week's transactions.
A data set named noactiv shows accounts with no transactions this week.
A data set named noacct shows transactions with no matching account number.

data newtrans
     noactiv(drop=Trans Amnt)
     noacct(drop=Branch);
   /* InTrans and InBanks flag which data set contributed */
   merge prog2.transact(in=InTrans)
         prog2.branch(in=InBanks);
   by ActNum;
   /* Route each observation to exactly one output data set */
   if InTrans and InBanks then output newtrans;
   else if InBanks and not InTrans then output noactiv;
   else if InTrans and not InBanks then output noacct;
run;

A data set named newtrans shows this week's transactions.
Num  Trans  Amnt    Branch
111  D      126.32  M.G.Road
111  C      560     M.G.Road
114  D      14.56   Madiwala
116  C      371.69  BTM

A data set named noactiv shows accounts with no transactions this week.
Num  Branch
112  Sivaji Nagar
115  Koramangala

A data set named noacct shows transactions with no matching account number.
Num  Trans  Amnt
113  C      235

Questions
