performance difference when reading data from dataset/table

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


ssunda6
Participant
Posts: 91
Joined: Tue Sep 19, 2006 9:32 pm


Post by ssunda6 »

Hi,

The requirement of my job is to load insert/update records into a target table.

I have a sequential file (a full extract every time) as input. The target table has 4 columns: 3 columns form the key, and the fourth column is checked for updates.

The input file currently has 30 million records, and of those, around 10 million might be new each time.

Currently we are using the following approach

1) Read the source sequential file and use a Change Capture stage, with the file as input and the table as reference, to get the insert/update records, then load them into the target table. This is taking a huge amount of time.
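Roughly, the current job design looks like this (stage names below are just placeholders):

Code: Select all

SeqFile ---------> Change_Capture ---------> Target_Table (insert/update)
                        ^
                        |
                  Target_Table (reference)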

The other options available are:
2) Load the data from the table into a dataset, and read the reference data from the dataset instead of from the table.
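A rough sketch of what this option would look like (again, placeholder names; the unload could be a separate job or an earlier step in the same job):

Code: Select all

Job 1:  Target_Table ---> Data_Set  (unload reference data)

Job 2:  SeqFile ---------> Change_Capture ---------> Target_Table (insert/update)
                                ^
                                |
                           Data_Set (reference)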

Please provide some pointers. Will this improve performance?
One more concern: since we get new data every day, the table will keep growing. Can a dataset accommodate such a huge volume? Even with an incremental load of, say, 1 million records a day, after a year the data unloaded from the table into the dataset will be very large.

3) The third option is to use a Join stage instead of Change Capture. After the Join stage, a Transformer with constraints would filter the insert/update records, and finally the data would be loaded to the table.
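Something like this, if I understand the approach correctly (placeholder names; the Transformer would have one constraint for inserts and one for updates):

Code: Select all

SeqFile ---------> Join ---------> Transformer ---------> Target_Table
                    ^              (constraints split
                    |               insert vs update rows)
              Target_Table (reference)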

4) Instead of loading the data directly into the table, if we load it to a dataset and then load the table in the next job, will that be faster?
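That is, split the load into two jobs, roughly:

Code: Select all

Job 1:  SeqFile ---> (change detection) ---> Data_Set  (staged insert/update rows)

Job 2:  Data_Set ---> Target_Table  (load)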

Please provide your inputs.

Thank you,
ssunda.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Define "performance" in an ETL context.

Unless you have a definition of performance, you cannot define metrics with which to measure it, and therefore cannot claim to have improved or worsened performance (whatever that is).

Get parallelism happening as early as possible. When reading the sequential file, allocate more than one reader per node.
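For example, on the Sequential File stage this is set through an output property; something like the following (the exact property name and its location may differ slightly between versions, so treat this as a pointer rather than a recipe):

Code: Select all

Sequential_File stage -> Output -> Properties -> Options
    Number Of Readers Per Node = 4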
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ssunda6
Participant
Posts: 91
Joined: Tue Sep 19, 2006 9:32 pm

Post by ssunda6 »

Hi Ray,

By performance, I meant an improvement in elapsed time.
The job is taking a very long time to complete, and I want to reduce the completion time.

Regards,
ssunda.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Build some benchmark tests. For example,

Code: Select all

SeqFile ---> Copy

Use this to time how long you need to read the file, with and without multiple readers per node.
Then experiment with Data Set. Experiment with Lookup vs Join vs Change Capture stages.
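For instance, the comparison runs might look something like this (purely illustrative; the Copy stage acts as a throw-away sink so you measure the read and compare cost rather than the database write):

Code: Select all

SeqFile ---------> Copy                      (baseline read)

SeqFile ---------> Join -----------> Copy    (join on the keys)
                    ^
                    |
                Data_Set

SeqFile ---------> Change_Capture ---> Copy  (change capture)
                        ^
                        |
                    Data_Set

Run each variant with the same input volumes and compare the figures from the job monitor.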
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.