SAS Programming Resource Guide
PROGRAMMING
COURSE CREATED AND INSTRUCTED BY:
ASLAM KHAN
HTTPS://WWW.LINKEDIN.COM/IN/ASLAMKHAN-PGMP-PMP/
WWW.MADE2STICKLEARNING.COM
Legal Disclaimer:
This online course and resource guide are not official content from SAS Institute, nor are they affiliated with SAS Institute in
any way. The course content is intended solely for educational purposes and is not to be reproduced or resold for commercial purposes.
The information contained in the course is provided "as is" without warranty of any kind, either express or implied, including but not
limited to the implied warranties of merchantability and fitness for a particular purpose.
The course instructor shall not be liable for any direct, indirect, incidental, special, or consequential damages arising out of or relating to
the use of or inability to use the course content or materials.
By accessing and using this resource guide, you acknowledge and agree to these terms and conditions.
COURSE SECTIONS
1. DATA PREPARATION will teach you how to import data from multiple sources, create new variables, use SAS functions, and understand what goes on behind
the scenes in SAS datasets
2. DATA STRUCTURING will take your data transformation to a new level by merging and joining multiple datasets together, or turning them upside-down
(sorting), and sideways (transposing)
3. DATA VISUALIZATION will propel you further into the world of analytics and drawing insightful inferences from what is inside your data
4. OPTIMIZING CODE will take you into the world of macro programming, which teaches you how to write your code professionally and elegantly
This lecture introduces the main windows in SAS: the Editor window, where programming code is written; the Log window, where SAS
records code execution along with error, warning, and note messages; and the Explorer window, for navigating data and libraries.
The Submit button, located in the Editor window, submits code for execution. In the Log window, error messages appear in
red, warning messages appear in green, and notes appear in blue.
The Explorer window includes pre-installed libraries such as SASUSER and SASHELP, as well as the temporary Work library.
Data sets can be opened by double-clicking them in the Explorer window.
The Output window displays results from code execution.
SAS Dataset, Variables and observations Keep, Drop and Rename variables
In SAS programming, data is organized into tables called datasets that have rows and columns.
The columns are called variables and the rows are called observations.
There are two types of variables in SAS: character variables, which contain text-based data, and numeric variables, which contain numbers.
Character variables are left-aligned and numeric variables are right-aligned.
Missing values
SAS represents a missing value in a character variable with a blank (a space), while a missing value in a numeric variable is
represented with a period or dot.
There are three rules for naming conventions in SAS, and they apply to both libraries and data sets.
A library name must be between 1 and 8 characters in length; a data set name must be between 1 and 32 characters in length.
Both must begin with a letter (A-Z, either uppercase or lowercase) or an underscore (_).
Both can continue with any combination of numbers, letters, or underscores after the first character.
It is not allowed to use special characters like percent signs, ampersands, plus signs, minus signs, asterisks, parentheses, dollar signs, or exclamation
points in the names of libraries or data sets.
If a name does not follow these rules, it is not a valid SAS library or data set name, so it is important to follow these conventions
when creating your own libraries and data sets.
SAS code is free text format, case insensitive, and can begin and end anywhere
SAS is a programming language that uses simple English words to write code.
It has a specific syntax and structure, including using specific keywords and ending statements with semicolons.
SAS is case insensitive and can be written in multiple lines.
It also allows for spaces and line breaks in the code. Keywords are often colored in dark blue to make the code more readable.
SAS statements can begin and end anywhere in the programming editor.
In SAS programming, the source code is divided into two main modules: the data step and the proc step.
The data step is a block of code that pertains to modifying a data set, such as its variables and observations, or transferring data
from one place to another.
The proc step is a block of code that performs an action with the data, such as printing, reporting, or sorting.
A data step begins with the keyword "data" and ends with the keyword "run", while a proc step begins with the keyword "proc" and
ends with the keyword "run".
Single level naming -> Library is Work by default if not explicitly mentioned
Just like finding books in a library through a catalog, data can be called upon from somewhere inside a library in two ways.
In SAS, there are two ways to reference data sets in libraries: two-level naming and single-level naming.
In two-level naming, the data set name is preceded by the library name and a period, for example: "library.data set."
Single-level naming does not include the library name, but it can only be used for data sets in the temporary library (called "work").
To reference a data set in a permanent library, two-level naming must be used.
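The two naming styles can be sketched as follows. This is a minimal illustration that assumes the SASHELP.CLASS sample data set shipped with SAS is available; the data set name "class" is chosen only for the example.

```sas
/* Two-level name: library SASHELP, data set CLASS */
proc print data=sashelp.class;
run;

/* Single-level name: SAS stores this data set as WORK.CLASS */
data class;
    set sashelp.class;
run;

/* Both references below point at the same temporary data set */
proc print data=class;
run;

proc print data=work.class;
run;
```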
There are three methods of bringing data into SAS: using existing data within SAS, such as data in the permanent libraries; creating
data within the SAS programming window; and importing data from external sources, such as Excel spreadsheets or other databases.
The first method is using existing data in SAS, which is stored in permanent libraries. This can be done through a data step in SAS, which
involves specifying the data set name and the library it is coming from.
The second method is creating data within the SAS programming window and storing it in either permanent or temporary libraries. This is also
done through a data step, using the input statement to specify the variable names and using the datalines statement to input the data.
The third method is importing data from external sources, such as Excel sheets or databases. This can be done using the proc import procedure,
which allows you to specify the file path and file type of the data. External data can also be read in a data step using the infile and
input statements.
In SAS, a libname is a reference to a SAS library, which is a collection of SAS data sets and/or other files.
The libname statement assigns a SAS library reference to a physical location on a storage device, such as a disk drive or a
directory on a server.
Once you have defined a libname for a SAS library, you can use the libname as a prefix to refer to the data sets and files stored in
that library.
In this example, the libname statement assigns the SAS library reference mylib to the directory in between the quotes.
The data statement then creates a new SAS data set called class in the work library, and the set statement reads in data from the
class data set in the mylib library.
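A sketch of what such a libname example might look like is given below. The folder C:\sasdata is hypothetical and must be replaced with a directory that exists on your system; the first data step is added only so that mylib.class exists before it is read.

```sas
/* Hypothetical folder: replace with a directory that exists on your system */
libname mylib "C:\sasdata";

/* Store a permanent copy of the sample data so MYLIB.CLASS exists */
data mylib.class;
    set sashelp.class;
run;

/* Create WORK.CLASS by reading MYLIB.CLASS */
data class;
    set mylib.class;
run;
```

The data set in mylib remains on disk after the SAS session ends, while the copy in work is deleted.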
The filename statement is used to assign a fileref (short for file reference) to a physical file on a storage device.
A fileref is a symbolic name that represents the location of a file, and it can be used as a shorthand way of referring to the file in SAS
programming statements.
In this example, the filename statement assigns the fileref myclass to the file ‘…./class.dat’.
The infile statement in the data step then reads data from the file using the myclass fileref.
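The fileref mechanism can be sketched as below. Because the real location of class.dat is not given, this sketch assumes a scratch file instead: the TEMP device type asks SAS to create a temporary file, and the sample records are invented for illustration.

```sas
/* Assumption: a temporary scratch file stands in for the real class.dat */
filename myclass temp;

/* Write two sample records so the example has something to read */
data _null_;
    file myclass;
    put "Alice F 13";
    put "Bob M 14";
run;

/* Read the file through the fileref */
data class;
    infile myclass;
    input name $ sex $ age;
run;

proc print data=class;
run;
```

To read a real file, the filename statement would instead name its path in quotes.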
To bring data into SAS using the delimiter option, you will need to use the infile statement in a data step to read the data.
The infile statement has a delimiter option that allows you to specify a character or symbol that is used to separate columns in the data.
In the first example, the infile statement in the data step reads data from the cards statement, and the delimiter option specifies that the
data is separated by commas.
The input statement reads four variables (name, gender, age, and weight) from the data, and the cards statement provides the data
values.
In the second example, the infile statement again uses the delimiter option to specify that the data is separated by commas, and adds
the dsd option. With dsd, SAS treats two consecutive delimiters as a missing value rather than as a single delimiter, and removes quotation marks from quoted values.
The input statement reads four variables from the data, and the cards statement provides the data values.
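A minimal sketch of the delimiter and dsd options together is shown below. The data values are invented for illustration, and datalines is used in place of its alias cards.

```sas
data class;
    /* dlm=',' sets the separator; dsd makes ",," read as a missing value */
    infile datalines dlm=',' dsd;
    input name $ gender $ age weight;
    datalines;
Alice,F,13,84
Bob,M,,99.5
;
run;

proc print data=class;
run;
```

In the second record the two consecutive commas leave age missing, and weight is still read as 99.5. Without dsd, the two commas would be treated as one delimiter and 99.5 would be read into age.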
To bring data into SAS from an Excel file using the proc import procedure, you can use the following syntax:
proc import datafile="…\class.xlsx" out=class dbms=excel; getnames=yes; run;
In this example, the "datafile" option specifies the path to the Excel file that is being imported.
The "out" option specifies the name of the SAS dataset where the imported data will be stored.
The "dbms" option specifies the type of data being imported, in this case Excel data. The "getnames" statement, written after the
semicolon that closes the proc import statement, specifies that the first row of the Excel file should be treated as variable names.
After running this code, a SAS dataset called "class" will be created, containing the data from the specified Excel file. The variable
names for the dataset will be taken from the first row of the Excel file.
The proc import procedure in SAS can be used to import data from text files that are formatted in different ways. For example, data can be
separated by commas or by spaces. The proc import syntax remains the same, but the "delimiter" option can be used to specify the character that
separates the data points in the text file.
To bring data into SAS from a text file using the proc import procedure, you can use the following syntax (here assuming the values are separated by commas):
proc import datafile="…\class.txt" out=class dbms=dlm; delimiter=','; getnames=yes; run;
In this example, the "datafile" option specifies the path to the TXT file that is being imported. The "out" option specifies the name of
the SAS dataset where the imported data will be stored. The "dbms" option specifies the type of data being imported, in this case a delimited
text file (dlm). The "delimiter" statement names the character that separates the values, and the "getnames" statement specifies that the first row of the TXT file should be treated as variable names.
After running this code, a SAS dataset called "class" will be created, containing the data from the specified TXT file. The variable names for the
dataset will be taken from the first row of the TXT file.
To create a new variable in SAS, you can use the DATA step.
First, you will need to define the dataset that you want to create the new variable in.
Then, you can create the new variable by specifying its name followed by an equal sign and the definition for the new variable.
The definition can be based on existing variables in the dataset or can be a completely new and independent value.
You can also use various functions and operators to modify the values of the new variable.
Finally, you can use the RUN statement to execute the DATA step and create the new variable in the dataset.
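The steps above can be sketched as follows. The example assumes the SASHELP.CLASS sample data set, whose weight variable is in pounds; the variable names weightkg and source are chosen for illustration.

```sas
data class;
    set sashelp.class;
    /* New variable derived from an existing one (pounds to kilograms) */
    weightkg = weight * 0.4536;
    /* New variable with a constant value, independent of other variables */
    source = "SASHELP";
run;

proc print data=class;
run;
```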
The KEEP statement in SAS specifies the variables that you want to include in your final dataset.
Any variable that is not listed is excluded from the final dataset.
The KEEP statement is written before the RUN statement and is followed by the list of variables that you want to include in the final
dataset.
For example, if you have a dataset called class that contains the variables name, sex, age, weight, and weightkg, and you only want to
keep name, sex, age, and weightkg in your final dataset, you would write the following code:
keep name sex age weightkg;
This creates a new dataset with only the variables name, sex, age, and weightkg, and excludes the variable weight.
The rename statement in SAS is used to change the name of a variable in a dataset.
In this case, the rename statement is being used to change the name of the variable "sex" to "gender".
The syntax for using the rename statement is: rename sex = gender;
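A minimal sketch combining KEEP and RENAME in one data step is shown below. It assumes the SASHELP.CLASS sample data set; the output name class_kg is hypothetical.

```sas
data class_kg;
    set sashelp.class;
    weightkg = weight * 0.4536;
    /* Only these variables are written; weight and height are excluded */
    keep name sex age weightkg;
    /* The new name is applied when the data set is written, */
    /* so KEEP still refers to the old name sex              */
    rename sex = gender;
run;

proc print data=class_kg;
run;
```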
To use if-else conditional logic in SAS, you can use the following syntax:
if (condition) then action1; else if (condition) then action2; else action3;
For example, to calculate a status variable based on BMI values where the status results in "Healthy weight" if the BMI is less than 18,
"Overweight" when the BMI is between 18 and 21, and "Obese" when the BMI is greater than 21, you can use the following code:
if (bmi < 18) then status = "Healthy weight"; else if (bmi >= 18 and bmi <= 21) then status = "Overweight"; else status = "Obese";
This code will evaluate the condition in the "if" statement first. If the condition is true, the action following the "then" statement will be
executed (in this case, setting the value of the "status" variable to "Healthy weight"). If the condition in the "if" statement is false, the
code will move on to the "else if" statement and evaluate the second condition. If the second condition is true, the action following the
"then" statement will be executed (in this case, setting the value of the "status" variable to "Overweight"). If both the "if" and "else if"
conditions are false, the code will execute the action following the "else" statement (in this case, setting the value of the "status" variable
to "Obese").
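The logic described above can be sketched as a complete data step. The names and BMI values are invented for illustration, and the cut-offs are the ones used in this guide's example.

```sas
data people;
    input name $ bmi;
    /* Declare the length first so the longest label is not truncated */
    length status $ 14;
    if bmi < 18 then status = "Healthy weight";
    else if bmi >= 18 and bmi <= 21 then status = "Overweight";
    else status = "Obese";
    datalines;
Ann 17.2
Ben 19.5
Cal 24.0
;
run;

proc print data=people;
run;
```

One detail worth knowing: SAS treats a missing numeric value as smaller than any number, so an observation with a missing bmi would satisfy the first condition unless missing values are handled separately.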
Filtering in SAS refers to the process of selecting a subset of observations from a dataset based on specified criteria.
It can be applied to both a data step and a proc step, but the way it works is different in each case. In a data step, filtering reduces
the number of observations in the resulting dataset. In a proc step, it reduces the number of observations in the report output but
does not affect the underlying dataset.
Filtering is done using the WHERE statement, which specifies the criteria for selecting the observations.
The syntax for filtering in a data step or proc step is the same and involves using the WHERE statement followed by the criteria for
selection.
For example, the WHERE statement "WHERE sex='F'" would select only observations with a value of 'F' in the sex variable.
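The difference between the two cases can be sketched as follows, assuming the SASHELP.CLASS sample data set; the output name girls is hypothetical.

```sas
/* DATA step: the new data set contains only the selected observations */
data girls;
    set sashelp.class;
    where sex = 'F';
run;

/* PROC step: only the report is filtered; SASHELP.CLASS itself is unchanged */
proc print data=sashelp.class;
    where sex = 'F';
run;
```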
A date format is a way to specify how a date value should be displayed or written as a character string. SAS supports a wide range of
date formats, which can be used to display dates in different ways, such as with the month written as a name or as a number, or with the
year written with four digits or two digits.
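A small sketch of how the same date looks under different formats is given below. The date is arbitrary, and the values are written to the Log window with the PUT statement.

```sas
data _null_;
    d = '15MAR2024'd;      /* SAS date constant, stored as a number */
    put d=;                /* unformatted: number of days since 01JAN1960 */
    put d= date9.;         /* 15MAR2024 */
    put d= mmddyy10.;      /* 03/15/2024 */
    put d= mmddyy8.;       /* 03/15/24 */
    put d= worddate.;      /* March 15, 2024 */
run;
```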
Character Functions
Upcase
Lowcase
Propcase
Character functions are functions that perform some kind of modification on a character variable and usually result in a new character
variable with modified values. There are various character functions available in SAS, including functions for manipulating strings,
extracting substrings, and converting between character and numeric data types.
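The three case functions listed above can be sketched as follows; the sample value is invented for illustration.

```sas
data _null_;
    name   = "jOHN smith";
    upper  = upcase(name);     /* JOHN SMITH */
    lower  = lowcase(name);    /* john smith */
    proper = propcase(name);   /* John Smith */
    put upper= lower= proper=;
run;
```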
Character Functions
Length
LENGTH(string)
Where "string" is the character value for which you want to find the length.
The length function in SAS is used to find the number of characters in a particular character variable.
It takes a character variable as an input and returns an integer value representing the length of that variable.
For example, if a character variable called "name" has the value "John", the length function applied to "name" would return 4.
Spaces inside the value and special characters are included in the length calculation, but trailing blanks at the end of the value are not counted.
The length function can be useful in situations where you need to know the size or length of a character variable for comparison or
other purposes.
Character Functions
Cat
The CAT function in SAS is a character function that concatenates or combines the values of two or more character variables and returns a
new character value. The syntax for the CAT function is:
new_variable = CAT(char1, char2, ...);
where "new_variable" is the name of the new character variable that will be created, and "char1", "char2", etc. are the names of the character
variables that you want to combine.
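The values of these character variables will be combined in the order that they are listed, with no separators added between them. For
example, if char1 has a value of "John" and char2 has a value of "Doe", the new variable created by the CAT function will have a value of
"JohnDoe". CAT does not remove leading or trailing blanks, so if char1 is padded with trailing blanks, those blanks remain between the two values; the related CATS function removes them.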
Character Functions
SUBSTR
The SUBSTR function in SAS extracts a substring from a character string. The function has the following syntax:
SUBSTR(string, start, length)
"string" is the character string from which the substring will be extracted.
"start" is the position of the first character in the substring. The position is specified as an integer and starts at 1 for the first character in the string.
"length" is the number of characters to include in the substring. If "length" is not specified, the function extracts all characters from the start
position to the end of the string.
For example, if "string" is "abcdef", the following calls to the SUBSTR function would produce the following results:
SUBSTR("abcdef", 1, 2) returns "ab"
SUBSTR("abcdef", 3, 3) returns "cde"
SUBSTR("abcdef", 4) returns "def"
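The LENGTH, CAT, and SUBSTR examples above can be tried in one short step. This sketch writes the results to the Log window; the variable names are chosen for illustration.

```sas
data _null_;
    first = "John";
    last  = "Doe";
    n     = length(first);            /* 4 */
    full  = cat(first, last);         /* JohnDoe */
    mid   = substr("abcdef", 3, 3);   /* cde */
    tail  = substr("abcdef", 4);      /* def */
    put n= full= mid= tail=;
run;
```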
Character Functions
TRIM
TRIM(string)
where "string" is the character value or variable that you want to trim.
The TRIM function in SAS is a character function that removes trailing spaces from a character value. It takes a
character value as an input and returns a modified character value with the trailing spaces removed. Leading spaces are not removed.
For example, if you have a character value ' Hello ' with leading and trailing spaces, you can use the TRIM function like this:
result = TRIM(' Hello ');
The value returned by TRIM is ' Hello': the trailing space is removed and the leading space is kept. To remove both leading and trailing
spaces, use the STRIP function. The TRIM function is often used to clean up
data that has been imported from external sources, where values may have extra spaces due to formatting or other issues.
Character Functions
LEFT
LEFT(string)
where 'string' is the character string that you want to left-align.
The LEFT function in SAS is a character function that left-aligns a character string: leading blanks are moved to the end of the value,
so the length of the result is unchanged.
For example, LEFT('   Hello') returns 'Hello   '.
LEFT takes a single argument and does not extract characters. To extract the first 3 characters of 'Hello World', which gives 'Hel',
use the SUBSTR function: SUBSTR('Hello World', 1, 3).
Character Functions
STRIP
new_variable = STRIP(original_variable);
Here, new_variable is the name of the new variable that will contain the modified values from original_variable, and original_variable is the name of
the original character variable that you want to modify.
The STRIP function in SAS is used to remove leading and trailing blanks from a character string. It takes a character string
as an input and returns a modified character string with the leading and trailing blanks removed. For example, if you have a
character string ' Hello World ' and you apply the STRIP function to it, it will return 'Hello World'. Blanks between words are kept.
STRIP takes a single argument and removes only blanks; to remove other characters, such as '*', use the COMPRESS function. The syntax for the STRIP function is:
STRIP(string)
Where:
string: is the character string to modify
Character Functions
COMPRESS
COMPRESS(source, set)
Where "source" is the character string that you want to modify and "set" is a character string that specifies the characters that you want to
remove from the source string. If "set" is omitted, COMPRESS removes all blanks from the source string, including blanks between words.
Character Functions
COMPBL
compbl(string)
The argument "string" is the character value in which you want to compress the blanks. The function returns a character
value in which every run of two or more blanks is replaced by a single blank.
The COMPBL function in SAS is a character function that removes consecutive spaces and reduces them to 1 space; single blanks between words are kept.
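The blank-handling functions in this part of the guide can be compared side by side. The sketch below wraps each result in square brackets so that the remaining blanks are visible in the Log window; the sample text is invented for illustration.

```sas
data _null_;
    text = "  Hello   World  ";
    t = "[" || trim(text)     || "]";   /* trailing blanks removed */
    l = "[" || left(text)     || "]";   /* leading blanks moved to the end */
    s = "[" || strip(text)    || "]";   /* leading and trailing blanks removed */
    c = "[" || compress(text) || "]";   /* all blanks removed */
    b = "[" || compbl(text)   || "]";   /* runs of blanks reduced to one blank */
    put t= / l= / s= / c= / b=;
run;
```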
Character Functions
SCAN
The SCAN function in SAS is a character function that returns a single word (substring) from a character string, chosen by its position
and based on a specified delimiter.
For example, if the input string is "United States of America" and the position argument is 2, the SCAN function will return the second
substring from the input string, which is "States".
The delimiter in this case is a space character, and the position argument specifies which substring to return from the input string.
Here is an example of how you might use the SCAN function in SAS to extract the second substring from the input string "United States
of America":
second = scan("United States of America", 2, " ");
Character Functions
INDEXC
The INDEXC function in SAS is a character function that returns the position of the first occurrence of any character from a set of characters within a character string. The search is case sensitive. For
example, if you want to find the position of the letter "a" in the string "United States of America", you can use the INDEXC function like this:
INDEXC("United States of America", "a")
This would return the position of the first occurrence of the lowercase letter "a", which is 10. If the second argument contains several characters, INDEXC
looks for whichever of those characters appears first. For example:
INDEXC("United States of America", "States")
This would return 4, the position of the first "t" (in "United"), not the position of the word "States". To find the position of a whole substring, use the INDEX function instead:
INDEX("United States of America", "States")
This would return 8, the position where "States" begins. If the characters or substring you are searching for are not found in
the string, both functions return a value of 0.
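A short sketch contrasting a character search with a substring search is given below. INDEX is the standard SAS function for substring searches; the variable names are chosen for illustration.

```sas
data _null_;
    s  = "United States of America";
    p1 = indexc(s, "a");         /* 10: first lowercase a */
    p2 = indexc(s, "States");    /* 4: first of the characters S, t, a, e, s */
    p3 = index(s, "States");     /* 8: start of the substring States */
    p4 = indexc(s, "z");         /* 0: not found */
    put p1= p2= p3= p4=;
run;
```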
Character Functions
INDEXW
The INDEXW function in SAS is a character function that searches a character string for a word and returns the position of the first character of that word. It
takes two arguments: the character string to search and the word to search for. The position is counted in characters, with the first character
of the string at position 1. If the word is not found, the function returns a 0. For example:
data test;
    string = 'The quick brown fox jumps over the lazy dog';
    pos = indexw(string, 'fox');
run;
In this example, the value of the variable 'pos' will be 17, because the word 'fox' begins at the 17th character of the string.
The INDEXW function is useful when you want to find the position of a specific word within a character string, rather than just the position
of a specific character. It is particularly useful when working with text data, as it allows you to search for specific words rather than
individual characters.
Numeric Functions
Sum
Abs
Ceil
Floor
Int
Min
Max
Numeric Functions
Sum
The SUM function calculates the sum of the non-missing values of its arguments. Here is the basic syntax for using the SUM
function:
SUM(argument-1, argument-2, ...)
Each argument can be a numeric variable, a constant, or a numeric expression. The SUM
function returns a numeric value that is the sum of all the non-missing arguments, calculated separately for each observation.
In this example, the dataset contains two numeric variables: "salary" and "bonus".
The SUM function is used to add these two variables together to create a new variable called "netsal", which represents the total
salary for each employee.
The SUM function ignores missing values when calculating the sum.
This means that if an argument is missing (denoted by a period), the SUM function leaves it out of the final sum, whereas the
+ operator would return a missing result.
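A sketch of the salary and bonus example is shown below. The employee names and amounts are invented, and the variable plus_result is added only to contrast the SUM function with the + operator.

```sas
data pay;
    input name $ salary bonus;
    netsal      = sum(salary, bonus);   /* a missing bonus is ignored */
    plus_result = salary + bonus;       /* missing if either value is missing */
    datalines;
Ann 50000 5000
Ben 42000 .
;
run;

proc print data=pay;
run;
```

For Ben, netsal is 42000 while plus_result is missing.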
Null dataset
A null dataset in SAS is a special type of dataset that is not physically created or stored in the work library, but can still be used to create
variables and perform calculations. The syntax for creating a null dataset is similar to that of a regular dataset, but the dataset name is
“_NULL_”.
Unlike regular datasets, a null dataset does not have a physical presence and its contents are not accessible through the Explorer
window.
Instead, the contents of a null dataset can only be viewed using the PUT statement, which displays the values of variables in the log
window.
Null datasets are often used as a trick to perform calculations or demonstrate functions without creating and storing unnecessary
datasets.
Numeric Functions
ABS
The ABS function in SAS is a function that returns the absolute value of a numeric variable.
ABS(expression)
The "expression" argument can be a single numeric variable, or it can be a combination of numeric variables and constants. The ABS
function will return a numeric value that is the absolute value of the expression.
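A minimal sketch of the ABS function inside a null dataset is shown below. No dataset is created; the PUT statement writes the values to the Log window. The variable names are chosen for illustration.

```sas
data _null_;
    change    = -7.5;
    magnitude = abs(change);
    /* Writes change=-7.5 magnitude=7.5 to the log */
    put change= magnitude=;
run;
```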
Numeric Functions
CEIL
FLOOR
INT
The CEIL, FLOOR, and INT functions in SAS are functions that return the smallest integer greater than or equal to a numeric
expression, the largest integer less than or equal to a numeric expression, and the integer portion of a numeric expression, respectively.
The CEIL function always returns the smallest integer that is greater than or equal to a numeric expression, while the FLOOR function
always returns the largest integer that is less than or equal to a numeric expression. For example, if the input is 5.7, the CEIL function
would return 6 and the FLOOR function would return 5.
The INT function simply returns the integer portion of a numeric expression, ignoring any decimal places. For example, if the input is
5.7, the INT function would return 5.
To use these functions in SAS, you simply specify the numeric expression within parentheses after the function keyword. For example:
ceil_val = ceil(x); floor_val = floor(x); int_val = int(x);
This would assign the smallest integer greater than or equal to the value of the variable "x" to a new variable called "ceil_val", the largest
integer less than or equal to the value of the variable "x" to a new variable called "floor_val", and the integer portion of the value of the
variable "x" to a new variable called "int_val".
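The three functions behave the same for positive values of FLOOR and INT but differ for negative values, which the following sketch shows by running them on 5.7 and on -5.7.

```sas
data _null_;
    do x = 5.7, -5.7;
        ceil_val  = ceil(x);    /*  5.7 gives 6,  -5.7 gives -5 */
        floor_val = floor(x);   /*  5.7 gives 5,  -5.7 gives -6 */
        int_val   = int(x);     /*  5.7 gives 5,  -5.7 gives -5 */
        put x= ceil_val= floor_val= int_val=;
    end;
run;
```

INT drops the decimal part, so for negative numbers it moves toward zero, while FLOOR moves down to the next lower integer.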
Numeric Functions
MIN
MAX
The SUM, MIN, and MAX functions in SAS are functions that return the sum, minimum, and maximum value, respectively, of a set of
numeric variables. These functions ignore missing values when calculating the result, which means that if any of the variables in the set
contain missing values, they will be excluded from the calculation.
To use these functions in SAS, you simply specify the names of the variables within parentheses after the function keyword. For
example:
sum_val = sum(x1, x2, x3); min_val = min(x1, x2, x3); max_val = max(x1, x2, x3);
This would assign the sum of the variables "x1", "x2", and "x3" to a new variable called "sum_val", the minimum value of the variables
"x1", "x2", and "x3" to a new variable called "min_val", and the maximum value of the variables "x1", "x2", and "x3" to a new variable
called "max_val".
You can use these functions in a data step to create new variables that contain the result of these operations applied to existing numeric
variables. You can also use these functions in a proc step or in the SELECT clause of a proc SQL statement to perform calculations
with the sum, minimum, or maximum value of a set of numeric variables.
SAS formats are rules or patterns that specify how to display data values in a certain way.
They can be used to change the appearance of data values without changing the underlying data itself.
For example, a SAS format might be used to display a numeric value as a dollar amount, with a dollar sign and a comma for
thousand separators.
SAS formats can be applied to both character and numeric variables.
They are useful for making data values more meaningful and easier to read, especially when working with large datasets.
There are many built-in SAS formats available, such as dollar formats, date and time formats, and formats for identifying values as
missing or non-missing.
You can also create custom SAS formats using format definitions.
Formats are applied using the format statement or the put function in a SAS program.
There are two methods of formatting data in SAS: using the format statement and using the put function.
The format statement allows you to apply a SAS format to a numeric variable and display it in a certain way, while retaining the original raw
values in the final dataset.
The put function allows you to convert a numeric variable into a character variable, and apply a SAS format to it in the final dataset.
Both methods result in the original raw values being displayed in a formatted way with additional characters, but the difference is in the type of
variable that the final values are stored in.
The format statement associates a particular format with a specific numeric variable, while the put function converts the numeric variable into a
character variable and applies the format.
The PUT function can be used to convert a numeric variable to a character variable and apply a format to it. The PUT function takes as input
parameters the name of the variable to be formatted and the name of the format to be applied, and returns a character value that is the formatted
version of the original numeric value.
In this code, the PUT function is used to apply the dollar10.2 format to the numeric variable salary, and create a new character variable called
salarytxt that contains the formatted value. The dollar10.2 format displays the value with a dollar sign, a comma for thousand separators, and two
decimal places.
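A sketch of the two formatting methods side by side is given below. The salary value is invented, and proc contents is included only to show that salary stays numeric while salarytxt is character.

```sas
data pay;
    salary = 52340.5;
    /* PUT function: creates a character variable holding $52,340.50 */
    salarytxt = put(salary, dollar10.2);
    /* FORMAT statement: salary stays numeric and is displayed as $52,340.50 */
    format salary dollar10.2;
run;

proc print data=pay;
run;

proc contents data=pay;
run;
```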
SAS informats are used to convert character data into numeric or date values.
They are used in the opposite direction to formats, which take a numeric or date value and convert it into a character value with added formatting
such as currency symbols or date separators.
Informats are used in the INPUT statement to read data from external sources into SAS variables. They allow SAS to recognize and interpret
character data in a specific way and convert it into a numeric or date value that can be used in SAS.
The naming convention for SAS informats is similar to that of formats, where the informat name is followed by the X and Y parameters.
The X parameter indicates the maximum length of the incoming character value, while the Y parameter indicates the number of decimal places
in the value, if applicable.
Some common SAS informats include MMDDYY., which recognizes and converts dates in the MMDDYY format, and DOLLAR., which
recognizes and converts character data in the form of currency.
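A minimal sketch of reading raw text with informats is shown here. The dataset name payments, the variable names and the data values are assumptions made for illustration; the colon modifier lets each informat read a space-delimited field.

```sas
data payments;
    /* Informats tell the INPUT statement how to interpret each text field */
    input paydate :mmddyy10. amount :dollar10.2;
    /* Formats control how the stored numeric values are displayed */
    format paydate date9. amount dollar10.2;
    datalines;
01/15/2023 $1,250.75
03/02/2023 $980.00
;
run;

proc print data=payments;
run;
```

After the step runs, paydate and amount are ordinary numeric variables, so they can be sorted, compared and used in calculations.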
Stacking data refers to combining two or more datasets together that contain the same set of variables.
This can be done using a data step in SAS.
In this example, there is a dataset A with the variables name, gender, age and weight, and a new dataset B with two additional observations for
the same variables.
The goal is to append B to A to create a combined dataset C with all observations from A and B.
To accomplish this in SAS, the usual data step statement is used with the target dataset C and the set statement is extended to include A and B.
It is important to note that the variable types in the datasets being stacked together should match.
The example includes code that generates datasets A and B and then stacks them together to create dataset C.
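A sketch of such an example might look like the following. The names and values in the datalines are hypothetical and stand in for the course data.

```sas
/* Dataset A: two observations */
data a;
    input name $ gender $ age weight;
    datalines;
Pat F 34 130
Mike M 41 180
;
run;

/* Dataset B: two more observations with the same variables */
data b;
    input name $ gender $ age weight;
    datalines;
Tom M 29 165
Ann F 37 140
;
run;

/* Listing both datasets in SET stacks B below A */
data c;
    set a b;
run;
```

Dataset C then holds four observations: the two from A followed by the two from B.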
Stacking data refers to pulling data from multiple datasets to create one unified dataset with all observations from individual datasets
SAS also has procedures, such as PROC APPEND, for stacking data in a more efficient way
The simplest method of stacking data is through a data step, where you provide the individual dataset names and the name of the
final dataset containing all observations
Sorting can be done on one single level (variable) or multiple levels (variables)
The order of sorting can be specified in the "by" statement
The first level of sorting takes place first, and then subsequent levels of sorting follow
SAS retains the original order of observations within the groups specified in the first level of sorting
Proc sort can be used for multilevel sorting and filtering
Filtering can be accomplished by using the "where" statement.
The value of the filtering criteria must match the exact value in the data set, including letter case.
When observations differ in some variables, the NODUP option cannot remove them, because it only drops observations that are exact copies of one another.
SAS provides the NODUPKEY option to handle such cases; it checks for duplicates only on the variables listed in the BY statement.
NODUPKEY removes duplicate observations, keeping the first one in each BY group and removing the rest.
To implement NODUPKEY, write the "proc sort" statement with the input dataset (data=), the output dataset (out=) and the NODUPKEY option,
followed by a BY statement listing the variables to check for duplicates.
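The sorting, filtering and duplicate-removal points above can be sketched with sashelp.class, a sample dataset that ships with SAS. The output dataset names are hypothetical.

```sas
/* Multilevel sort with filtering: keep only females,          */
/* sort by age, then by height (largest first) within each age */
proc sort data=sashelp.class out=class_sorted;
    where sex = 'F';   /* must match the stored value exactly */
    by age descending height;
run;

/* NODUPKEY keeps the first observation for each sex/age combination */
proc sort data=sashelp.class out=class_nodup nodupkey;
    by sex age;
run;
```

Writing to a separate dataset with out= leaves the input dataset unchanged.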
Stacking of data refers to appending or putting data together, one below the other, with increased number of observations and same
number of variables in the final data set.
Merging of data refers to bringing together two or more data sets that contain a new set of data and new variables, with a common
variable as the reference point for the merging.
Merging of data is important when data from multiple sources need to be available in a unified data repository.
The merging of data sets can be done using the "merge" statement in a data step in SAS, followed by the names of the data sets
to be merged and the "by" statement for the reference variable.
To properly run the code for merging, the data sets need to be sorted by the reference variable and matching "by" statements
should be used in both the merging and sorting statements.
Inner join is a way to merge two data sets (A and B) to reveal the common observations between them.
The result of an inner join between data sets A and B will contain the variables from both datasets (name, gender, age, weight, and
height).
Only the observations that belong to both data sets will be part of the inner join result.
Inner join can be performed with multiple data sets as well.
The syntax for performing an inner join includes a data step and a merge statement.
The data step starts with the keyword "data" followed by the target data set.
The merge statement lists the names of the data sets to be merged and specifies the common variable (name in this case) used to
merge the datasets.
The type of join performed is determined by the criteria specified in the if statement that follows the merge statement.
The if statement starts with the keyword "if" and determines the observations that will be included in the inner join result.
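A minimal sketch of an inner join follows. Pat, Mike and Tom appear in the lecture; Ann and Joe, all data values, and the output name inner_ab are hypothetical. The in= data set option creates the temporary flags a and b that the if statement tests.

```sas
data a;
    input name $ gender $ age weight;
    datalines;
Ann F 34 130
Joe M 29 165
Mike M 41 180
Pat F 37 140
;
run;

data b;
    input name $ height;
    datalines;
Ann 64
Joe 70
Tom 68
;
run;

/* Both inputs must be sorted by the common variable */
proc sort data=a; by name; run;
proc sort data=b; by name; run;

data inner_ab;
    merge a(in=a) b(in=b);
    by name;
    if a and b;   /* keep only names found in both A and B */
run;
```

With this data, only Ann and Joe are kept, because they are the only names present in both data sets.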
Full join is a type of join that combines all observations from two datasets, A and B.
The full join includes the intersection of datasets A and B and all observations from each dataset.
The full join in SAS can be performed using the data step with an if statement that uses the keyword "or" (for example, "if a or b").
The resulting full join includes all the variables from both datasets, but observations that appear in only one dataset have missing
values for the variables that come from the other dataset.
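A full join might be sketched as follows. Ann and Joe, the data values and the output name full_ab are hypothetical; a and b are the flags created by the in= option.

```sas
data a;
    input name $ gender $ age weight;
    datalines;
Ann F 34 130
Joe M 29 165
Mike M 41 180
Pat F 37 140
;
run;

data b;
    input name $ height;
    datalines;
Ann 64
Joe 70
Tom 68
;
run;

proc sort data=a; by name; run;
proc sort data=b; by name; run;

data full_ab;
    merge a(in=a) b(in=b);
    by name;
    if a or b;   /* keep every name found in either data set */
run;
```

Here Mike and Pat get a missing height, and Tom gets missing gender, age and weight.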
Left join in SAS merges datasets by including observations from the first (left) data set, regardless of whether they appear in the
other data set or not.
The determining factor for the observations in the final left join is the data set on the left, in this case data set A.
The observations from the right data set (data set B) are merged as another column in the final data set, but those observations
missing in data set B will result in missing values in the final left join.
The left join is performed in SAS using a data step: the data statement with the name of the target data set, the merge statement
listing data sets A and B, the by statement, and the "if a" statement to specify the type of join.
Left joins are commonly used in real-life settings, where the left data set serves as the reference to which other data sets are
joined to build a unified data set for analysis or reporting.
The left join is accomplished by writing "if a" statement in the data merge step.
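A sketch of the left join is shown here, under the same assumptions as before: Ann and Joe, the data values and the output name left_ab are hypothetical.

```sas
data a;
    input name $ gender $ age weight;
    datalines;
Ann F 34 130
Joe M 29 165
Mike M 41 180
Pat F 37 140
;
run;

data b;
    input name $ height;
    datalines;
Ann 64
Joe 70
Tom 68
;
run;

proc sort data=a; by name; run;
proc sort data=b; by name; run;

data left_ab;
    merge a(in=a) b(in=b);
    by name;
    if a;   /* keep every name found in A, matched or not */
run;
```

All four names from A are kept; Tom is dropped because he appears only in B.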
Different types of joins: inner, full, left, right, far left, and far right join
Far left join is a subset of left join that only includes observations that belong to data set A and not B
Far right join is a subset of right join that only includes observations that belong to data set B and not A
Inner join: common region between two data sets
Full join: list of all observations from all datasets involved in the join
Outer join: regions that belong to either left or right side of the merging, including far left and far right joins
Final data set in far left join includes only observations from data set A (Pat and Mike)
Final data set in far right join includes only observation from data set B (Tom)
Missing values in final data sets are due to lack of corresponding observations in the other data set.
Merging of data using SAS data step involves choosing the appropriate join to get the desired result
The end result of merging data from multiple datasets is a single combined dataset for further reporting and analytics.
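The far left and far right joins can be sketched in a single data step that writes two output data sets. Ann and Joe, the data values and the names far_left and far_right are hypothetical.

```sas
data a;
    input name $ gender $ age weight;
    datalines;
Ann F 34 130
Joe M 29 165
Mike M 41 180
Pat F 37 140
;
run;

data b;
    input name $ height;
    datalines;
Ann 64
Joe 70
Tom 68
;
run;

proc sort data=a; by name; run;
proc sort data=b; by name; run;

data far_left far_right;
    merge a(in=a) b(in=b);
    by name;
    if a and not b then output far_left;        /* only in A */
    else if b and not a then output far_right;  /* only in B */
run;
```

With this data, far_left contains Mike and Pat, and far_right contains Tom.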
Proc SQL is an advanced topic in SAS programming, an extension of the concepts discussed so far.
It has multiple purposes in SAS programming, including dataset modifications, creation of new variables, restructuring of data, and
merging datasets.
SQL (Structured Query Language) is a language used in database programming; Proc SQL is the SAS procedure that implements it.
SAS programmers can use Proc SQL to modify data sets, as SAS has incorporated the power and elegance of SQL into this procedure.
Proc SQL begins with the statement "proc sql;" and ends with a "quit;" statement instead of a "run;" statement.
Proc SQL can accomplish much of what other programming steps, like the data step, can do.
To copy data from one dataset to another in Proc SQL, one writes "create table B as select * from A;"
The number of statements involved in Proc SQL and the data step is pretty much the same.
In Proc SQL, to create a dataset, one says "create table (name of table)" followed by the necessary manipulation techniques.
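As a small sketch, the same copy can be written both ways. It uses sashelp.class, a sample dataset that ships with SAS; the output names class_copy and class_copy2 are hypothetical.

```sas
/* Proc SQL version: copy every row and column */
proc sql;
    create table class_copy as
    select *
    from sashelp.class;
quit;

/* Equivalent data step version */
data class_copy2;
    set sashelp.class;
run;
```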
Proc SQL can be used for sorting data in a similar way to Proc Sort.
In Proc SQL, you start with the "create table" statement to specify the output dataset.
The "as" keyword is followed by a "select" query, and the "from" clause within it specifies the input dataset.
The "order by" clause is used to specify the variables to sort by.
Commas are used to separate multiple variables when sorting in Proc SQL.
The example uses the "sashelp.class" dataset and sorts by the "sex" variable.
Then the "age" and "height" and "weight" variables are added to the sorting order.
The final output shows that the dataset is sorted by all the specified variables in the "order by" statement.
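The sorting steps above might be written as follows. The output name class_sorted is hypothetical; sashelp.class ships with SAS.

```sas
proc sql;
    create table class_sorted as
    select *
    from sashelp.class
    order by sex, age, height, weight;   /* commas separate sort levels */
quit;
```

Adding the keyword desc after a variable name in the order by clause sorts that level in descending order.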
Applications of transposing of data include converting data collected in a horizontal fashion into a single variable in the final dataset.
PROC TRANSPOSE in SAS is used to convert the columns of a SAS data set into rows, and rows into columns. In this example, the
input data set is named vitals, and the output data set is named t_vitals.
The ID statement specifies the variable(s) that will become the new variables in the transposed data set. In this case, the ID variable
is test. This means that the values of the test variable will become the new variables in the transposed data set.
The BY statement specifies the variables that will be used to group the data. In this case, the BY statement variables are Sid and
name. This means that the values of the vitals data set will be grouped by the Sid and name variables before they are transposed.
The VAR statement specifies the variable(s) that will be transposed into the rows of the output data set. In this case, the VAR
variable is value. This means that the values of the value variable will be transposed into the rows of the t_vitals data set.
The PROC TRANSPOSE statement in this example will convert the columns of the vitals data set into rows in the t_vitals data set,
with the values of the test variable becoming the new variables in the t_vitals data set, and the values of the value variable being
transposed into the rows of the t_vitals data set. The Sid and name variables will be used to group the data.
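A sketch of the step described above follows. The variable and data set names come from the lecture; the test names (HR, SBP, DBP) and the values are hypothetical.

```sas
/* One row per subject and test (long layout) */
data vitals;
    input sid name $ test $ value;
    datalines;
1 Pat HR 72
1 Pat SBP 120
1 Pat DBP 80
2 Mike HR 65
2 Mike SBP 130
2 Mike DBP 85
;
run;

/* BY-group processing needs the data sorted by the BY variables */
proc sort data=vitals;
    by sid name;
run;

proc transpose data=vitals out=t_vitals;
    by sid name;   /* one output row per subject */
    id test;       /* values of test become the new variable names */
    var value;     /* values of value fill the new variables */
run;
```

The result has one observation per subject with the variables HR, SBP and DBP, plus the automatic _NAME_ variable.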
The retain statement in SAS is used to retain or keep the value of a variable from one iteration to the next.
It allows SAS to remember the value of a specific variable in the previous observation and use it as the initial value in the current
observation. This is useful when creating cumulative variables or when the value of a variable needs to be carried forward to
subsequent observations.
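The retain statement is applicable only within the context of a data step and is conventionally placed near the top of that step. A
retained variable is not reset to missing when a new observation is read; to restart it when the value of an identifying variable
changes, it must be reset explicitly, for example with a first.variable check under BY-group processing.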
The retain statement in SAS allows you to prefill a specific value of a variable in the subsequent observation of the dataset, even
before the entire observation is complete or created.
This process of retaining a particular value into the next observation is called a retain process, done through the retain statement.
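A cumulative total is a common use of retain. This is a minimal sketch; the dataset names sales and running and the values are hypothetical.

```sas
data sales;
    input month $ amount;
    datalines;
Jan 100
Feb 150
Mar 125
;
run;

data running;
    set sales;
    retain total 0;           /* total starts at 0 and is carried forward */
    total = total + amount;   /* add the current amount to the retained value */
run;

proc print data=running;
run;
```

The total column reads 100, 250 and 375. Without the retain statement, total would be reset to missing on every iteration and the sum would be missing.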
Charts
Plots
Proc Print
Statistical procs
Proc Report
Proc Chart with VBAR is a SAS procedure used to generate a vertical bar chart from a structured dataset.
It is a simple but powerful way to visualize the data quickly and efficiently, especially for large datasets.
The statement includes the keyword "vbar" denoting vertical bar chart and the variable of interest specified.
The output appears in the results window, with the Y-axis showing frequency count and the X-axis showing categories.
This procedure is useful for gaining insight into the data and identifying trends and patterns.
The Proc Chart with HBAR is a SAS procedure used to generate a horizontal bar chart from a structured dataset.
It is similar to the vertical bar chart but displays the data horizontally, which may be useful in some cases.
The statement includes the keyword "hbar" denoting horizontal bar chart and the variable of interest specified.
The output appears in the results window, with the X-axis showing frequency count and the Y-axis showing categories.
The Proc Chart with PIE option is a SAS procedure used to generate a pie chart from a structured dataset.
It is a circular chart divided into slices that represent different categories or proportions of a whole.
The statement includes the keyword "pie" denoting pie chart and the variable of interest specified.
The output appears in the results window, with the slices representing the categories and their proportions.
This procedure is useful for comparing the proportions of different categories or variables and identifying patterns or relationships.
Similar to Proc Chart using vbar, the "hbar" statement creates a horizontal bar chart.
The "discrete" option used with the "hbar" statement is used when the variable of interest is categorical or nominal
This SAS code generates a vertical bar chart using PROC CHART with several options:
The input dataset is named "class" and is specified with the data= option on the PROC CHART statement.
The VBAR statement is used to create a vertical bar chart.
The variable of interest is "course" and is specified after the VBAR statement. The DISCRETE option is used to indicate that the
variable is categorical.
The GROUP option is used to create groups within the chart. The grouping variable is "major", which means that the chart will
display separate bars for each value of "major".
The SUBGROUP option is used to create subgroups within each group. The subgrouping variable is "gender", which means that the
chart will display separate segments for each value of "gender" within each group of "major".
The SUMVAR option is used to specify that the height of each bar will be the sum of the values of the "age" variable.
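The chart described above might be coded as follows. The variable names come from the lecture; the data values are hypothetical, and course is assumed to be a numeric course code so that DISCRETE has an effect.

```sas
/* Hypothetical class dataset for illustration */
data class;
    input name $ course major $ gender $ age;
    datalines;
Ann 101 Math F 19
Joe 101 Math M 20
Pat 102 Math F 21
Mike 101 Bio M 22
Tom 102 Bio M 20
Sue 102 Bio F 19
;
run;

proc chart data=class;
    /* one bar per course value, grouped by major, split by gender, */
    /* with bar height equal to the sum of age                      */
    vbar course / discrete group=major subgroup=gender sumvar=age;
run;
```

Replacing vbar with hbar produces the horizontal version of the same chart.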
The SAS code proc plot produces a scatter plot of the age variable against the weight variable in the class dataset.
proc plot is a SAS procedure that creates scatter plots drawn with text characters.
plot age*weight is the plot statement; the variable before the asterisk (age) is plotted on the vertical (y) axis and the variable after it (weight) on the horizontal (x) axis.
This plot can be used to visualize the relationship between the two variables and to identify any potential patterns or outliers in the data.
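A minimal sketch, assuming the class dataset is sashelp.class, the sample dataset that ships with SAS:

```sas
proc plot data=sashelp.class;
    /* vertical*horizontal: age on the y-axis, weight on the x-axis */
    plot age*weight;
run;
quit;
```

The quit statement ends the procedure, since proc plot stays active after run.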
Reports of data sets can be printed out in various formats using SAS's report output feature
The process of generating report output is done through the proc print procedure
Variables in the original data set and their values in the final report output contain the same data points
The Obs column generated in the report output gives the row number of each observation
The var statement is used to list the variables to be included in the report output
The where statement can be added to filter observations to be included in the report output
The proc print procedure is useful for viewing and printing out data sets in various formats.
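The var and where statements can be sketched as follows, using sashelp.class as a stand-in for the course dataset. The chosen variables and the age cutoff are arbitrary.

```sas
proc print data=sashelp.class;
    var name sex age;   /* only these variables are printed */
    where age > 13;     /* only observations meeting this condition */
run;
```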
The proc means procedure is used to obtain summary reports or simple statistics for numeric variables.
The output of the proc means procedure includes information about the number of observations, mean value, standard deviation, minimum
value, and maximum value of the analysis variable.
The Proc Means procedure works best with numeric analysis variables because it can obtain various statistical parameters meaningfully.
The output of the proc means procedure for the "age" variable in the "class" data set shows the number of observations, mean age, standard
deviation, minimum age, and maximum age of students.
Proc univariate is another statistical procedure that can be used to obtain detailed statistics for numeric variables.
It provides more complex and detailed statistical parameters in addition to the basic ones obtained from proc means.
The syntax for univariate is very similar to proc means, except for the change in the name of the procedure.
The output from univariate consists of multiple sections of reports that can be used based on statistical needs.
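The two procedures can be compared side by side. This sketch assumes the class dataset is sashelp.class.

```sas
/* N, mean, standard deviation, minimum and maximum of age */
proc means data=sashelp.class;
    var age;
run;

/* Same syntax, more detailed output (moments, quantiles, extremes) */
proc univariate data=sashelp.class;
    var age;
run;
```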
SAS provides a powerful capability called the Output Delivery System (ODS) to customize the appearance of output for final
deliverables.
The ODS can be used for many applications, such as creating output in different file formats, embedding graphics and colors, and
creating presentations.
ODS supports several output formats, including PDF, HTML, and RTF.
A dataset can be used with the PROC PRINT procedure to obtain the default report output in SAS.
Different formats can be downloaded and viewed in their respective applications, such as Microsoft Word or a web browser.
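Sending a report to a PDF file might look like this. The file name class_report.pdf is hypothetical, and some environments require a full path to a writable folder.

```sas
/* Open the PDF destination; output produced from here on goes to the file */
ods pdf file="class_report.pdf";

proc print data=sashelp.class;
run;

/* Close the destination so the file is completed */
ods pdf close;
```

Replacing pdf with rtf or html in both ODS statements, and changing the file extension to match, produces the other formats.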
Macro variables
Ampersand resolutions
SAS Macros
Macro functions
Macros are used to optimize code and make it more efficient by removing repetitions.
SAS can be divided into two worlds: the data world and the macro world.
The data world deals with variables and observations within data sets, while the macro world treats everything as text.
In the macro world, there are no data sets or numeric/character variables, only macro variables.
Macro variables are created using the %let keyword and are assigned values using the = symbol.
Everything in the macro world is treated as text, and macro variables can be assigned either numeric or character values.
Nested macro variables contain a macro variable within another macro variable.
To retrieve the values held in nested macro variables, consecutive ampersands are placed before the macro variable name.
Each additional pass over the ampersands resolves one more level, until the innermost value of the nested macro variable is obtained.
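A minimal sketch of macro variables and ampersand resolution follows. The macro variable names dsn, city and place and their values are hypothetical.

```sas
/* A simple macro variable: its value is stored as text */
%let dsn = class;
%put The dataset is &dsn;

/* A nested macro variable: place holds the name of another macro variable */
%let city = Boston;
%let place = city;

%put &place;     /* resolves to: city */
%put &&&place;   /* resolves to: Boston */
```

In the last line, the first two ampersands collapse into one and the remaining reference resolves to city, leaving a reference to city that resolves to Boston on the next pass.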
SAS macros are useful in identifying repetitive code with patterns that can be replicated.
They can be used to make SAS programming more efficient and productive.
Macros are abstracted blocks of code that can be replaced and reused.
A specific macro name can be used to identify a block of code that is repetitive in nature.
By macroizing code (turning it into a macro), you can convert it into a single module of code that can be used to replicate all
repetitive code by replacing only the changing parts.
The changing parts are replaced by macro variables. A macro is defined using the %macro keyword followed by the name of the macro and,
in parentheses, the macro variables used in the repetitive code; the definition ends with the %mend keyword.
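A macro definition and two calls might look like the following. The macro name summarize and its parameters dsn and var are hypothetical; sashelp.class ships with SAS.

```sas
/* The repeated block is written once, with the changing parts as parameters */
%macro summarize(dsn, var);
    proc means data=&dsn;
        var &var;
    run;
%mend summarize;

/* Each call generates the block with the parameters filled in */
%summarize(sashelp.class, age)
%summarize(sashelp.class, height)
```

Only the arguments change between calls, so a fix made to the macro body applies everywhere it is used.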
Macro functions operate on macro variables, which exist in the exclusive macro world
Macros interact extensively with the data world, because the text they generate becomes ordinary SAS code
Macro functions have a slightly different syntax than data step functions
Macro functions are denoted by a percent sign before the function name, e.g. %upcase
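A few macro functions applied to a macro variable are sketched here. Only %upcase is named in the lecture; %length and %scan are other standard macro functions, and the macro variable name is hypothetical.

```sas
%let name = sashelp.class;

%put %upcase(&name);       /* SASHELP.CLASS */
%put %length(&name);       /* 13 */
%put %scan(&name, 2, .);   /* class */
```

Because everything in the macro world is text, the arguments are written without quotation marks.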