performance difference when reading data from dataset/table

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


ssunda6
Participant
Posts: 91
Joined: Tue Sep 19, 2006 9:32 pm


Post by ssunda6 »

Hi,

The requirement of my job is to load insert/update records into a target table.

I have a sequential file (a full extract every time) as input. The target table has 4 columns: 3 columns form the key, and the fourth column is checked for updates.

The input file currently has 30 million records, and of those, around 10 million might be new each time.

Currently we are using the following approach

1) Read the source sequential file and use a Change Capture stage, with the file as input and the table as reference, to get the insert/update records, then load them into the target table. This is taking a huge amount of time.
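Roughly, the current job design looks like this (stage names below are just placeholders):

Code: Select all

SeqFile ---------> Change_Capture ---------> Target_Table (insert/update)
                        ^
                        |
                  Target_Table (reference)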

The other options available are:
2) Load the data from the table into a dataset, and read the reference data from the dataset instead of from the table.
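A rough sketch of what this option would look like (again, placeholder names; the unload could be a separate job or an earlier step in the same job):

Code: Select all

Job 1:  Target_Table ---> Data_Set  (unload reference data)

Job 2:  SeqFile ---------> Change_Capture ---------> Target_Table (insert/update)
                                ^
                                |
                           Data_Set (reference)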

Please provide some pointers. Will this improve performance?
One more concern: since we get new data every day, the table will keep growing. Can a dataset accommodate such a huge volume? Even with an incremental load of, say, 1 million records a day, after a year the data unloaded from the table into the dataset will be very large.

3) The third option is to use a Join stage instead of Change Capture. After the Join stage, a Transformer with constraints would filter the insert/update records, and finally the data would be loaded to the table.
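Something like this, if I understand the approach correctly (placeholder names; the Transformer would have one constraint for inserts and one for updates):

Code: Select all

SeqFile ---------> Join ---------> Transformer ---------> Target_Table
                    ^              (constraints split
                    |               insert vs update rows)
              Target_Table (reference)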

4) Instead of loading the data directly into the table, if we load it to a dataset and then load the table in the next job, will that be faster?
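That is, split the load into two jobs, roughly:

Code: Select all

Job 1:  SeqFile ---> (change detection) ---> Data_Set  (staged insert/update rows)

Job 2:  Data_Set ---> Target_Table  (load)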

Please provide your inputs.

Thank you,
ssunda.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Define "performance" in an ETL context.

Unless you have a definition of performance, you cannot define metrics with which to measure it, and therefore cannot claim to have improved or worsened performance (whatever that is).

Get parallelism happening as early as possible. When reading the sequential file, allocate more than one reader per node.
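For example, on the Sequential File stage this is set through an output property; something like the following (the exact property name and its location may differ slightly between versions, so treat this as a pointer rather than a recipe):

Code: Select all

Sequential_File stage -> Output -> Properties -> Options
    Number Of Readers Per Node = 4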
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ssunda6
Participant
Posts: 91
Joined: Tue Sep 19, 2006 9:32 pm

Post by ssunda6 »

Hi Ray,

By performance, I meant an improvement in elapsed time.
The job is taking a very long time to complete, and I want to reduce the completion time.

Regards,
ssunda.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Build some benchmark tests. For example,

Code: Select all

SeqFile ---> Copy

Use this to time how long you need to read the file, with and without multiple readers per node.
Then experiment with Data Set. Experiment with Lookup vs Join vs Change Capture stages.
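For instance, the comparison runs might look something like this (purely illustrative; the Copy stage acts as a throw-away sink so you measure the read and compare cost rather than the database write):

Code: Select all

SeqFile ---------> Copy                      (baseline read)

SeqFile ---------> Join -----------> Copy    (join on the keys)
                    ^
                    |
                Data_Set

SeqFile ---------> Change_Capture ---> Copy  (change capture)
                        ^
                        |
                    Data_Set

Run each variant with the same input volumes and compare the figures from the job monitor.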
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.