Paper 107-29
SAS to DB2
INTRODUCTION
Service level agreements, batch processing windows, or a boss breathing down your neck are all good reasons to
improve the performance of your SAS/DB2 solution. With data growing faster than processor speed, we can no
longer rely on the latest hardware to solve a performance problem. Developing high-performance solutions
involves looking at all areas of data processing. This paper focuses on getting the best data access
performance for a SAS application that accesses DB2 data. There are three main methods of improving
solution performance:
1. Reduce the data transferred between SAS and DB2.
2. Tune each processing step.
3. Let the database do the work whenever SAS does not need to process the data.
When optimizing a solution using SAS and DB2, all three of these areas are important. This paper provides
examples in each area and explores the impact of SAS and DB2 configuration options by outlining
different methods of accessing your DB2 database. Examples are given to highlight the performance trade-offs
of choosing different access methods, SAS 9 application parameters, and DB2 V8.1 configuration options. It also
highlights new features in SAS 9 and DB2 V8.1.
Throughout this paper, examples from a test system are used to demonstrate the performance impact of
SAS/ACCESS tuning options. The test environment consists of SAS 9 and DB2 V8.1 running on a single four-processor
UNIX system with 4GB of memory and 40 fibre channel disks.
The performance results in this paper demonstrate the impact of various configuration options. They are not
intended as a method of comparing dissimilar configurations. Different configurations may yield different results
(i.e., your mileage may vary).
Required software:
- DB2 V7.1 or higher
- SAS software:
  - Base SAS
  - SAS/ACCESS for DB2
WHEN WOULD YOU USE THE LIBNAME ENGINE?
The libname engine should be used when:
- You want to use SAS data access functionality (threaded reads, for example)
- The procedure or DATA step requires it (e.g., proc freq, proc summary)
WHEN WOULD YOU USE EXPLICIT SQL PASS-THROUGH?
Explicit SQL pass-through should be used when SAS does not need to process the data and the database can
do the work itself (see the rule of thumb later in this paper).
First, consider the libname engine. For example, to run frequency statistics against a SAS data set named
data1 (a minimal sketch; the library path and database name are illustrative):
libname mylib '/data/sas';
proc freq data=mylib.data1;
tables state tenure yrbuilt yrmoved msapmsa;
where state = 01;
run;
To run frequency statistics against the data1 table in your DB2 database, simply change the libname statement
to access the database table instead of the SAS data set:
libname mylib db2 db=census user=me using=password;
Now, when this procedure is executed, SAS/ACCESS generates the proper SQL to retrieve the data from the
database.
SELECT "STATE", "TENURE", "YRBUILT", "YRMOVED", "MSAPMSA"
FROM data1
WHERE state = 01;
Using the libname engine is the easiest way to access DB2 database tables from your SAS application
because SAS generates the appropriate SQL for you. To better understand SAS SQL translation, let us take a
look at how a SAS DATA step is processed by SAS.
In this example the DATA step reads and filters the data from the source table source1 and writes the
results to the database table results1. The result table contains all the housing records for state id 01.
(A coding tip on accessing long or case-sensitive table names follows the DATA step.)
libname census db2 db=census user=me using=password;
data census.results1;
set census.source1;
where state = 01;
run;
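Coding Tip: Accessing long or case-sensitive table names. One way to reach a case-sensitive DB2 table
name from SAS is the PRESERVE_TAB_NAMES= libname option combined with a SAS name literal. The sketch below
is illustrative (the table name is hypothetical); names longer than the 32-character SAS limit are best
reached through explicit SQL pass-through or a DB2 view:

libname census db2 db=census user=me using=password
        preserve_tab_names=yes;

data work.mixed;
   /* the name literal ('...'n) preserves the exact case of the DB2 table name */
   set census.'MixedCaseTable'n;
run;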
SAS begins by opening a connection to the database to retrieve metadata and create the results1 table.
That connection is then used to read all the rows for state=01 from source1. SAS opens a second
connection to write the rows to the newly created results1 table. If this were a multi-tier configuration the
data would travel over the network twice: first, the data is sent from the database server to the SAS server,
then from the SAS server back to the database server. SAS processes the data in this manner to support
moving data between various sources. This allows you to read data from one database server and write the
results to a different database. In this example, since a single data source is used and SAS does not need to
process the data, this operation can be made more efficient.
To make this operation more efficient, you can use the explicit SQL pass-through facility of the sql
procedure. To make this statement explicit, create a sql procedure with the necessary SQL statements.
Connect to the database using the connect statement and wrap the SQL using the execute statement.
proc sql;
connect to db2 (database=census);
execute(Create Table results1
like source1) by db2;
execute(Insert Into results1
Select *
From source1
Where state = 01) by db2;
disconnect from db2;
quit;
On the test system the original DATA step executed in 33 seconds; the explicit proc sql version executed in
15 seconds. Changing to explicit processing improved the performance of the operation by 55%. Since all the
work here can be done by the DB2 database itself, explicit SQL is the most efficient way to process the
statement.
EXPLICIT PASS-THROUGH VS. SQL TRANSLATION
So why use explicit SQL instead of SQL translation (also called implicit SQL pass-through) when using proc
sql? Explicit SQL is used in this case because an implicit proc sql statement is processed using the same
SQL translation engine as a DATA step.
If we pass this same statement as implicit SQL, SAS
breaks the statement into separate select and insert
statements. In this example, the performance gain realized
using explicit SQL resulted from all the processing being
handled by the database. A general rule of thumb would be:
If SAS does not need to process it, let the database do the
work. Explicit SQL is the best way to force this to happen.
When SAS is executing a procedure it is important to understand which operations are done in DB2 and which
operations the SAS server is processing.
LOADING AND ADDING DATA
SAS provides powerful extraction, transformation, and load (ETL) capabilities. Therefore, it is often used to load
data into the database. SAS supports three methods of loading data into DB2: Import, Load and CLI
Load. The load options are accessed through the bulk load interface.
Consider a procedure or DATA step that creates a DB2 table from flat file data. When you use the
bulkload engine, the default load type is IMPORT. This option is best for small loads because it is easy to use
and the user requires only insert and select privileges on the table. To enable bulk load using import, set the
option BULKLOAD=YES. BULKLOAD is available as both a DATA step and a libname option; which one you choose
depends on your application. For example, if you want all load operations against a libref to use import, you
would use the libname option, as the sketch below shows.
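A minimal sketch of the libname form (connection values are illustrative):

libname census db2 db=census user=me using=password bulkload=yes;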
To load large amounts of data quickly you should use the LOAD or CLILOAD options.
To use the DB2 Load feature, add the BL_REMOTE_FILE DATA step option. The BL_REMOTE_FILE option
defines a directory for SAS to use as temporary file storage for the load operation. To process a load, SAS
reads and processes the input data and writes it to a DB2 information exchange format (IXF) file. The
BULKLOAD facility then loads the IXF file into the database using DB2 Load. Using the Load option requires
the BL_REMOTE_FILE directory to have enough space to store the entire load data set. It also requires that the
directory defined by BL_REMOTE_FILE be accessible to the DB2 server instance. This means it is on the same
machine as DB2, NFS mounted, or otherwise accessible as a file system.
New in DB2 V8.1 and SAS 9 is support for DB2 CLI Load. CLI Load uses the same high performance Load
interface but allows applications to send the data directly to the database without having to create a temporary
load file. CLI Load saves processing time because it does not create a temporary file and eliminates the need
for temporary file system space. CLI Load also allows data to be loaded from a remote system. To enable the
CLI Load feature using SAS, set the BL_METHOD=CLILOAD DATA step option instead of BL_REMOTE_FILE.
Tests were run comparing the different load options to give you an idea of the performance differences. This test
executes a DATA step that loads 223,000 rows into a single database table. The following three code examples
illustrate the DATA step bulk load options for each load method.
/* Method: Import */
data HSET(BULKLOAD=YES);
<DATA step processing >
run;
/* Method: Load */
data HSET(BULKLOAD=YES
BL_REMOTE_FILE="/tmp);
<DATA step processing >
run;
/* Method: CLI Load */
data HSET( BULKLOAD=YES
BL_METHOD=CLILOAD );
<DATA step processing >
run;
Load Method    Time (seconds)
Import         76.69
Load           55.93
CLI Load       49.04
The table shows that by using CLI Load there is a 36 percent performance gain over import in this test. All load
options require that the table not exist before the load.
READING DATA
SAS/ACCESS provides several options for tuning how data is read from DB2. Each option maps to a DB2 function:

SAS Option     DB2 Function
READBUFF       Multi-row fetch
DBSLICEPARM    mod()
DBSLICE        Custom where-clause predicates

To examine the performance differences between these options, frequency statistics were run against a
database table. To generate frequency information SAS retrieves all the rows in the table. Since the math is not
complex, this is a good test of I/O performance between SAS and DB2.
The first test was run using the default read options: single-row fetch and non-threaded read.
libname census db2 db=census user=db2inst1 using=password;
proc freq data=census.hrecs_db;
tables state tenure yrbuilt yrmoved msapmsa;
run;
This test ran in 72.02 seconds.
READBUFF
Using the default options SAS executes a single thread that reads one row at a time through the DB2 CLI
interface. Sending one row at a time is not an efficient way to process large result sets. Transfer speed can be
greatly improved by sending multiple rows in each request. DB2 supports this type of block request in CLI
using multi-row fetch. SAS/ACCESS supports the DB2 multi-row fetch feature via the libname READBUFF
option. READBUFF=100 was added to the libname statement and the test run again.
libname census db2 db=census user=db2inst1 using=password READBUFF=100;
This time the frequency procedure took only 43.33 seconds to process. That is a 40% performance
improvement over a single row fetch. This procedure was tested with other values of READBUFF. The testing
indicated that the optimal value for this configuration is somewhere between 200 and 250, which allowed the
query to run in 40.97 seconds. So from here on READBUFF is left at 200 and the new multi-threaded read
options are tested using the same procedure.
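A minimal sketch of the adjusted libname statement:

libname census db2 db=census user=db2inst1 using=password READBUFF=200;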
[Figure: READBUFF testing. Throughput (rows/sec, 3000-6000) plotted against READBUFF values of none, 100, 200, and 300.]
THREADED READ
SAS 9 introduces a new data retrieval performance option called threaded read. The threaded read option
works on a divide-and-conquer principle. It breaks up a single select statement into multiple statements
allowing parallel fetches of the data from DB2 into SAS. SAS/ACCESS for DB2 supports the DBSLICEPARM
and DBSLICE threaded read modes.
On a single-partition DB2 system you can use the DBSLICE or DBSLICEPARM option. Testing started using the
automatic threaded read mode by setting DBSLICEPARM. When you use this option, SAS/ACCESS
automatically determines a partitioning scheme for reading the data using the mod() database function. The
freq procedure is tested using dbsliceparm=(all,2) which creates two threads that read data from the
database.
proc freq data=census.hrecs_db (dbsliceparm=(all,2));
tables state tenure yrbuilt yrmoved msapmsa;
run;
When this statement is executed SAS/ACCESS automatically generates two queries: The first with the
mod(serialno,2)=0 predicate and the second with the mod(serialno,2)=1 predicate. These queries are
executed in parallel. Using the DBSLICEPARM option the same statement ran in 36.56 seconds, 10% faster
than using READBUFF alone.
/* SQL Generated By SAS/Access */
SELECT "STATE", "TENURE", "YRBUILT", "YRMOVED", "MSAPMSA"
FROM HRECS_DB
WHERE ({FN MOD({FN ABS("SERIALNO")},2)}=0
OR "SERIALNO" IS NULL ) FOR READ ONLY

SELECT "STATE", "TENURE", "YRBUILT", "YRMOVED", "MSAPMSA"
FROM HRECS_DB
WHERE ({FN MOD({FN ABS("SERIALNO")},2)}=1
OR "SERIALNO" IS NULL ) FOR READ ONLY
Using DBSLICEPARM works well in a single-partition database. In a database with multiple partitions, the most
efficient way to retrieve data is directly from each partition.
Using a two-logical-partition DB2 database on the test server, the DBSLICEPARM option was replaced with the
DBSLICE syntax in the freq procedure script. Since the test database consists of multiple partitions on a
single system, we only need to catalog a single node:

-- Catalog a node
CATALOG TCPIP NODE dbnode1 REMOTE myserver SERVER 70000
-- Catalog a database
CATALOG DATABASE dbname AT NODE dbnode1

The freq procedure was then executed using the DBSLICE option along with the NODENUMBER syntax.
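A minimal sketch of the DBSLICE syntax, with one where-clause predicate per partition (table and column
names are from the earlier tests):

proc freq data=census.hrecs_db(DBSLICE=("NODENUMBER(SERIALNO)=0"
                                        "NODENUMBER(SERIALNO)=1"));
   tables state tenure yrbuilt yrmoved msapmsa;
run;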
To execute the request SAS opens two connections to the database server and executes the SQL necessary
to retrieve data from each partition.
/* SQL Generated By SAS/Access */
SELECT "STATE", "TENURE", "YRBUILT", "YRMOVED", "MSAPMSA"
FROM HRECS_DB
WHERE NODENUMBER(serialno)=0
FOR READ ONLY

SELECT "STATE", "TENURE", "YRBUILT", "YRMOVED", "MSAPMSA"
FROM HRECS_DB
WHERE NODENUMBER(serialno)=1
FOR READ ONLY
Note: threaded reads may not be used when a BY statement is used, or when all valid columns for the mod()
function already appear in the where-clause (this affects DBSLICEPARM).
DBCOMMIT
DBCOMMIT sets the number of rows inserted into the database between transaction commits. To understand
how this works, let's look at what is happening behind the scenes when inserting a row into the database.
To insert one row into a database table, many operations must take place to complete the transaction. For this
example we will focus on the transaction logging requirements of an SQL insert operation to demonstrate the
impact of the SAS DBCOMMIT option.
The DB2 database transaction log records all modifications to the database to ensure data integrity. During an
insert operation there are multiple records recorded to the database transaction log. For an insert, the first
record is the insert itself, followed by the commit record that tells the database the transaction is complete.
Both of these actions are recorded in the database transaction log in separate log records. For example, if you
were to insert 1000 rows with each row in its own transaction it would require 2000 (1000 insert and 1000
commit) transaction log records. If all these inserts were in a single transaction you could insert all the rows
with 1001 transaction log records (1000 insert and 1 commit).
You can see that there is a considerable difference in the amount of work required to insert the same 1000
rows depending on how the transaction is structured. By this logic, if you are doing an insert, you want to set
DBCOMMIT to the total number of rows you need to insert. This requires the least amount of work, right? Not
quite; as with any performance tuning there are tradeoffs.
For example, if you were to insert 1 million rows in a single transaction, this will work, but it requires a lock to be
held for each row. As the number of locks increases, the lock management overhead increases. With
this in mind you need to tune the value of DBCOMMIT to be large enough to limit commit processing but not so
large that you encounter long transaction issues (locking, running out of log space, etc). To test insert
performance a DATA step is used that processes the hrecs table and creates a new hrecs_temp table
containing all the rows where state = 01.
libname census db2 db=census user=db2inst1 using=password;
data census.hrecs_temp (DBCOMMIT=10000);
set census.hrecs;
where state = '01';
run;
The default for DBCOMMIT is 1000. Testing started at 10 just to see the impact.
DBCOMMIT         Time (sec)
10               397.10
100              101.13
1,000 (default)  67.03
5,000            61.98
10,000           61.77
[Figure: DBCOMMIT testing. Throughput (rows/sec, 0-1600) plotted against DBCOMMIT values of 10, 100, 1,000, 5,000, and 10,000.]
As you can see, the default works pretty well. In some situations, like this one, larger values of DBCOMMIT can
yield up to an 8% performance improvement over the default. This test shows that the best value was between
1,000 and 5,000 rows per commit. You can also see that at some point, increasing the value of DBCOMMIT no
longer improves performance (compare 5,000 to 10,000).
INSERTBUFF
INSERTBUFF is another tunable parameter that affects the performance of SAS inserting rows into a DB2
table. INSERTBUFF enables CLI insert buffering, similar to read buffering (using READBUFF) but for inserts. It
tells the CLI client how many rows to send to the DB2 server in each request. To enable insert buffering, set
the INSERTBUFF option (available, like DBCOMMIT, as a libname or data set option).
libname census db2 db=census user=db2inst1 using=password;
data census.hrecs_temp (INSERTBUFF=10 DBCOMMIT=5000);
set census.hrecs;
where state = '01';
run;
INSERTBUFF is an integer ranging from 1 to 2,147,483,647. Different values of INSERTBUFF were tested to
see what impact it would have on the preceding DATA step.
INSERTBUFF  Time (sec)
1           61.77
5           60.64
10          60.22
25          60.36
50          60.36
100         60.86
Increasing the value of INSERTBUFF from 1 to 25 improved the performance only 2.7% over the tuned
DBCOMMIT environment. Increasing the value over 25 did not have a significant impact on performance. So
why use INSERTBUFF? As mentioned earlier, parameters affect various environments differently. To show
these differences, the same test was run with SAS and DB2 running on separate servers using a TCP/IP
connection instead of a local shared memory connection.
INSERTBUFF  Time (sec)
1           150.75
5           109.05
10          104.52
25          99.35
50          98.23
100         98.11
In this case increasing the value of INSERTBUFF from 1 to 25 had a significant impact, improving performance
51% over the default. These results show that when you are running in a local (shared memory) configuration
INSERTBUFF has little impact on performance and you may want to tune DBCOMMIT. On the other hand
when you are connecting from a separate server INSERTBUFF can have a significant impact on performance.
Notice that good values of INSERTBUFF are about 10 times smaller than good values for READBUFF. You
should keep this in mind when you are tuning your system.
FUNCTION PUSHDOWN
You have seen that SAS/ACCESS can push down where-clause processing and join operations to DB2.
SAS/ACCESS can also push down function processing. If the database supports the function you request,
SAS can pass that function processing to the database. To enable the most functions to be pushed down to
DB2 set the SQL_FUNCTIONS=all libname option.
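A minimal sketch (connection values are illustrative):

libname census db2 db=census user=db2inst1 using=password SQL_FUNCTIONS=ALL;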
Applying these functions in the database can improve analysis performance. Each aggregate (vector) function
(for example, AVG or SUM) that DB2 processes means fewer rows of data are passed to the SAS server.
Processing non-aggregate (scalar) functions (ABS, UPCASE, etc.) can take advantage of DB2 parallel
processing.
Functions pushed down to DB2 (DB2 name in parentheses where it differs):

ABS            FLOOR    LOWCASE (LCASE)
ARCOS (ACOS)   LOG      UPCASE (UCASE)
ARSIN (ASIN)   LOG10    SUM
ATAN           SIGN     COUNT
CEILING        SIN      AVG
COS            SQRT     MIN
EXP            TAN      MAX
SAS will only push down function processing to the database when doing so limits the number of rows
returned. For example, consider a simple query that does a select with an absolute value (abs) function: the
function will be applied in SAS because applying it at the database does not limit the number of rows returned
to SAS.
Proc SQL Statement:
Create table census.FUNCTABLE as
select
abs(RHHINC) as sumcol
from census.hrecs_test2;
Generated SQL:
SELECT "RHHINC"
FROM "HRECS_TEST2" FOR READ ONLY
On the other hand, if the query includes a limiting function (in this case a distinct clause was added), the
operation will be pushed down to the database.
Proc SQL Statement:
create table census.FUNCTABLE as
select
distinct abs(RHHINC) as sumcol
from census.hrecs_test2;
Generated SQL:
select
distinct {fn ABS("HRECS_TEST2"."RHHINC")} as sumcol
from "HRECS_TEST2"
Note that when you create a table containing a derived column (i.e., calling a function to generate a value) in
the database, you must name the derived column using a column alias.
To take further advantage of single-pass processing you can add a view to your data in DB2. Views in DB2
allow you to join tables, transform variables (generate ranks for a variable, for example) and group results in a
single pass of the data. You can further prepare the data before SAS processing by creating a Materialized
Query Table (MQT) or by applying Multidimensional Clustering (MDC) to your table.
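A minimal sketch of a materialized query table that precomputes state-level counts (table and column names
are illustrative, drawn from the census examples):

-- Precompute state-level counts so SAS can read a small summary table
CREATE TABLE census.state_summary AS
  (SELECT state, COUNT(*) AS hh_count
   FROM census.hrecs_db
   GROUP BY state)
  DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- Populate the MQT
REFRESH TABLE census.state_summary;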
CONNECTION MODES
The libname CONNECTION= option controls how SAS shares database connections across operations. The
table shows the elapsed time for the test query under three connection modes.

Connection Mode  Time (seconds)
SHAREDREAD       38.35
SHARED           130.90
UNIQUE           39.24
RESOURCE CONSUMPTION
DBAs are always interested in understanding what impact an application has on the database. SAS workloads
can vary greatly depending on your environment, but here are a few places to start evaluating your situation:
- Each SAS user is the equivalent of a single database Decision Support (DS) user. Tune the same for
SAS as you would for an equivalent number of generic DS users.
- Tune to the workload. Just like any other DS application, understanding the customer requirements
can help you to improve system performance. For example, if there is a demand for quarterly or
monthly data, using a Multidimensional Clustering (MDC) table for the data may be appropriate.
- SAS is a decision support tool; if you need data from an operational data store, consider the impact
on your other applications. To offload some of the work you may consider creating a data mart to
provide data to your SAS customers.
- In most environments the SAS server is located on a separate system from your database. Business
analysis often requires many rows to be retrieved from the database. Plan to provide the fastest
network connection possible between these systems to provide the greatest throughput.
- As with any workload, keep your database statistics up to date using runstats (a sketch follows this
list). This enables the optimizer to generate the optimal access plan.
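A minimal sketch of a runstats invocation (the schema and table names are illustrative):

db2 RUNSTATS ON TABLE db2inst1.hrecs_db WITH DISTRIBUTION AND INDEXES ALL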
RULES OF THUMB
- Try to pass as much where-clause and join processing as you can to DB2.
- Return only the rows and columns you need. Whenever possible, do not use a select * in your
SAS application. Provide a list of the necessary columns using the keep= data set option, as shown in
the sketch after this list. To limit the number of rows returned, include any appropriate filters in the
where-clause.
- Provide Multidimensional Clustering (MDC) tables or Materialized Query Tables (MQT) where
appropriate to provide precompiled results for faster access to data.
- Use SAS threaded read (DBSLICEPARM, DBSLICE) and multi-row fetch (READBUFF) operations
whenever possible.
- When loading data with SAS, use the bulk load method CLI Load.
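A minimal sketch of limiting rows and columns (table and column names are from the census examples):

data work.subset;
   /* request only two columns and one state's rows from DB2 */
   set census.hrecs_db(keep=STATE TENURE);
   where state = 1;
run;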
DEBUGGING
To see what SQL commands SAS is passing to the database, enable the sastrace option. For example, here
is the syntax to trace SAS/ACCESS SQL calls to DB2:
options sastrace=',,,d' sastraceloc=saslog;
Applying a d in the fourth column of the sastrace option tells SAS to report SQL sent to the database. For
example, this SAS DATA step:
data census.hrecs_temp (keep=YEAR RECTYPE SERIALNO STATE);
keep YEAR RECTYPE SERIALNO STATE;
set census.hrecs_db;
where state=1;
run;
is logged as follows (note the SQL statements in the trace):
TRACE: Using EXTENDED FETCH for file HRECS_DB on connection 0 476 1372784097
rtmdoit 0 DATASTEP
TRACE: SQL stmt prepared on statement 0, connection 0 is:
SELECT *
FROM HRECS_DB FOR READ ONLY 477 1372784097 rtmdoit 0 DATASTEP
TRACE: DESCRIBE on statement 0, connection 0. 478 1372784097 rtmdoit 0
DATASTEP
622  data census.hrecs_temp (keep=YEAR RECTYPE SERIALNO STATE);
623     keep YEAR RECTYPE SERIALNO STATE;
624     set census.hrecs_db;
625     where state=1;
626  run;
TRACE: Using FETCH for file HRECS_TEMP on connection 1 479 1372784097
rtmdoit 0 DATASTEP
TRACE: Successful connection made, connection id 2 480 1372784097
rtmdoit 0 DATASTEP
TRACE: Database/data source: census 481 1372784098 rtmdoit 0 DATASTEP
TRACE: USER=DB2INST1, PASS=XXXXXXX 482 1372784098 rtmdoit 0 DATASTEP
TRACE: AUTOCOMMIT is NO for connection 2 483 1372784098 rtmdoit 0
DATASTEP
TRACE: Using FETCH for file HRECS_TEMP on connection 2 484 1372784098
rtmdoit 0 DATASTEP
NOTE: SAS variable labels, formats, and lengths are not written to DBMS
tables.
TRACE: SQL stmt execute on connection 2:
CREATE TABLE HRECS_TEMP
(YEAR DATE,RECTYPE VARCHAR(1),SERIALNO
Sastrace displays the processing details of the SAS script. The log includes the exact SQL commands that are
submitted to the database. The example above executes a DATA step that SAS translates into the following
SQL statements.
SELECT *
FROM HRECS_DB;
You can also trace activity at the DB2 CLI layer. To set the CLI trace file name from the DB2 command line:
db2 UPDATE CLI CFG FOR SECTION common USING TraceFileName /tmp/mytracefile
If you choose to edit the db2cli.ini file directly, add the Trace and TraceFileName keywords to the [COMMON] section:
[COMMON]
trace=1
TraceFileName=/tmp/mytracefile
TraceFlush=1
When you enable CLI tracing DB2 begins tracing all CLI statements executed on the server. DB2 will continue
to trace CLI commands until you disable CLI tracing by setting trace to 0. Be careful: if this is a busy server
you could collect huge amounts of output. It is best to run CLI trace on a test server running a single SAS
session, if possible.
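To turn tracing back off, reset the Trace keyword to 0; a minimal sketch:

db2 UPDATE CLI CFG FOR SECTION common USING Trace 0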
The saslog and the DB2 diagnostic log (sqllib/db2dump/db2diag.log) are also useful places to look for
information when you are troubleshooting.
CONCLUSION
The following is a brief summary of how the topics covered fit into the three main tuning categories:
1. Reduce the data transferred: return only the rows and columns you need, and precompile some
information using database views or Materialized Query Tables (MQT).
2. Tune each processing step using the options discussed: READBUFF, DBSLICEPARM, and DBSLICE for
reads; DBCOMMIT and INSERTBUFF for writes; and bulk load (CLI Load) for loading data.
3. Let DB2 do the work whenever SAS does not need to process the data (explicit SQL pass-through and
function pushdown).
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:
Scott Fadden
IBM
DB2 Data Management Software
[email protected]
Additional information is available at ibm.com/db2 and www.sas.com.
SAS, SAS/ACCESS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
DB2 is a registered trademark of IBM in the USA and other countries.
Other brand and product names are trademarks of their respective companies.