CFF - Cobol File Definition w/ Sequential File or DB2 API

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

CFF - Cobol File Definition w/ Sequential File or DB2 API

Post by Bryceson »

Folks,

I hope someone has run into this before . . . .

I am receiving a data file from the mainframe along with a copybook. I loaded the metadata from the copybook into DataStage (Complex Flat File stage); as a result I have 192 elements to load into a flat file (using a Sequential File stage) or DB2 (using the DB2 stage).

This policy file has about 10 million records and I am getting poor performance:

1. CFF ----->Transformer-------->DB2 Stage (329 Rows/sec)


2. CFF ----->Transformer-------->Sequential File (773 Rows/sec)

What could I do to get better row output in DataStage?
Is there anything I need to be aware of in the CFF stage that could cause this issue?

Any Ideas or Knowledge is much appreciated.

Thanks . . . . Bryceson
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Unfortunately, sequential read speed does decrease significantly as the number of columns increases. If you declare each row as just one column in a test job and write it out to a Sequential File stage, what speed does it run at? That will give you a baseline for the expected best-case throughput.
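
For what it's worth, a rough illustration of that baseline test outside DataStage (a Python sketch; the record length and file name are assumptions) would just time how fast the file can be read as whole records, with no column parsing at all:

    # Rough baseline: read the policy file as whole fixed-length records,
    # doing no column parsing, and report the raw rows/sec.
    # RECORD_LENGTH and PATH are assumptions for illustration.
    import time

    RECORD_LENGTH = 800              # hypothetical fixed record length in bytes
    PATH = "policy_file.dat"         # hypothetical input file

    rows = 0
    start = time.time()
    with open(PATH, "rb") as f:
        while True:
            record = f.read(RECORD_LENGTH)
            if len(record) < RECORD_LENGTH:
                break
            rows += 1
    elapsed = max(time.time() - start, 1e-6)
    print(f"{rows} rows in {elapsed:.1f}s = {rows / elapsed:.0f} rows/sec")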

If most of your columns are simple PIC X or PIC 9 then you would benefit from using a Sequential File stage to read the data and using the SDK routines to convert the binary columns.
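
For reference, this is roughly the work a packed-decimal (COMP-3) conversion has to do per field; a minimal Python sketch of the idea rather than one of the actual SDK routines, with the field layout assumed:

    def unpack_comp3(raw: bytes, scale: int = 0) -> float:
        # Decode a COBOL COMP-3 (packed decimal) field. Each byte holds two
        # decimal digits, except the last byte, whose low nibble is the sign
        # (0xD = negative, 0xC or 0xF = positive).
        digits = []
        sign = 1
        for i, byte in enumerate(raw):
            high, low = byte >> 4, byte & 0x0F
            digits.append(high)
            if i < len(raw) - 1:
                digits.append(low)
            elif low == 0x0D:
                sign = -1
        value = 0
        for d in digits:
            value = value * 10 + d
        return sign * value / (10 ** scale)

    # Example: a PIC S9(5)V99 COMP-3 field holding -12345.67 (4 bytes)
    print(unpack_comp3(bytes([0x12, 0x34, 0x56, 0x7D]), scale=2))   # -12345.67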
Bryceson
Charter Member
Posts: 88
Joined: Wed Aug 03, 2005 1:11 pm
Location: Madison, WI

Post by Bryceson »

ArndW,

I will put together a test job with each row as one column and see what happens.

I also have another file that has fewer columns (about 70) with 13 million rows, and the throughput is 5530 rows/sec doing:

CFF ------->Transformer------------>DB2 Stage (Good performance)

Is it typical of the CFF stage that it slows down reading the source file when it has more columns? I am very much confused . . . . What could be an alternative way of handling this CFF?

Thanks . . . . Bryceson
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

More tuning work has gone into the Sequential File stage than the CFF stage. I think the limiter is the amount of CPU used in the CFF stage, since it needs to parse out all the columns and then interpret them. If that is the case and you have a system with more than one CPU, you can increase throughput by splitting your file into {n} part files that can be processed concurrently by different instances of the CFF stage. You can either physically split the file and run it concurrently through a multi-instance job, or use one job that splits the stream and writes to named pipes, which are then read by CFF stages in the same job.
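
As a rough sketch of the physical-split option (Python rather than a DataStage job; the record length, number of parts and file names are all assumptions), splitting a fixed-length file round-robin into {n} part files could look like this:

    # Split a fixed-length-record file into N_PARTS part files so that
    # N_PARTS job instances can each read one part concurrently.
    # RECORD_LENGTH, N_PARTS and the file names are assumptions.
    RECORD_LENGTH = 800
    N_PARTS = 4
    SOURCE = "policy_file.dat"

    outputs = [open(f"policy_part{i}.dat", "wb") for i in range(N_PARTS)]
    try:
        with open(SOURCE, "rb") as src:
            i = 0
            while True:
                record = src.read(RECORD_LENGTH)
                if len(record) < RECORD_LENGTH:
                    break
                outputs[i % N_PARTS].write(record)   # round-robin distribution
                i += 1
    finally:
        for out in outputs:
            out.close()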
I recall one case a couple of years back where I used a Sequential File stage with one column defined to read the data, transformed a couple of COMP-3 columns into normal representation, wrote out to a named pipe and then read that in with the couple of hundred columns as per the metadata, as that was the quickest way to process the data (this was in a POC).
Another option that might be open to you is using PX, as that will implicitly split the CPU load across the multiple readers that you can specify.


You could write a job that reads the data via a Sequential File stage with just one long record, then split that data stream in a Transformer stage into {n} streams. Each of those streams leads to a sequential file or named pipe that a separate CFF stage instance then reads concurrently.
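
If you take the named-pipe route, the pipes themselves are just FIFOs created up front; a small Unix/Python sketch of that setup (the pipe names are assumptions) would be:

    # Create the named pipes (FIFOs) that the split streams will write to
    # and that the CFF stage instances will read from. Pipe names are
    # assumptions for illustration.
    import os

    PIPES = ["/tmp/policy_pipe0", "/tmp/policy_pipe1"]

    for name in PIPES:
        if not os.path.exists(name):
            os.mkfifo(name)   # looks like a file, but buffers in memory

    # Note: opening a FIFO for writing blocks until a reader (here, a CFF
    # stage instance) opens the same pipe for reading, so the reading job
    # must already be running, or be part of the same job as the writer.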