Performance Issue - identity column??

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


vigneshra
Participant
Posts: 86
Joined: Wed Jun 09, 2004 6:07 am
Location: Chennai

Performance Issue - identity column??

Post by vigneshra »

Hi

We have a performance issue with a job. The design is as follows: the job reads a sequential file of 30 columns, and a transformer assigns the system date and time to two columns and a string constant to a third. A lookup is then done against TABLE1 (the table the records are finally inserted into) to check whether each record is new or existing. New records are inserted into TABLE1; existing records are rejected. That is the entire functionality of the job.

Extraction from the sequential file runs at 60,000 rows/sec, but the insert into the target table is happening at only 25-30 rows/sec. Both the insert into and the select from TABLE1 are done with DB2 API stages. TABLE1 also has an identity column that generates a surrogate key for each record internally within the database, and partitioning in the insert DB2 API stage is set to AUTO.

Can anyone suggest why the job performance is so poor? Is the identity column creating the bottleneck?
Vignesh.

"A conclusion is simply the place where you got tired of thinking."
tcj
Premium Member
Posts: 98
Joined: Tue Sep 07, 2004 6:57 pm
Location: QLD, Australia

Post by tcj »

The main bottleneck in the job is the lookup done directly against the database. For every row that passes through the job, it has to go off and run a select against the database, which carries a big overhead.

I suggest changing the DB2 API lookup to a hashed file stage, as well as generating the surrogate key within a transformer.
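
As a minimal sketch of the transformer side, and assuming the current maximum key from TABLE1 is captured into a job parameter before the run (called MaxExistingKey here purely for illustration), a stage variable can hand out the keys; the SDK routine KeyMgtGetNextValue is another commonly used option:

Code: Select all

   * Transformer stage variable, evaluated once for every input row.
   * MaxExistingKey is an assumed job parameter loaded with the current
   * maximum key of TABLE1 before the run; all names are illustrative.
   svNextKey:   If @INROWNUM = 1 Then MaxExistingKey + 1 Else svNextKey + 1

   * Derivation of the key column on the insert link:
   ToTABLE1.SURROGATE_KEY:   svNextKey

Whether TABLE1 will accept a key supplied by the job depends on how its identity column was defined, so that is worth checking before changing the design.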
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

The main issue here is that you cannot (easily) retrieve the value of an identity column in a job processing updates and inserts without either building a hash lookup of what's already there or doing as you are - using a relational lookup.

If your scenario is such that rows will not be duplicated within the same run, then you might consider extracting all rows prior to the run and building a hashed file for lookup purposes.

If it is likely that a given run will contain duplicate rows, and you have to insert and then update those rows, then you may consider a two-pass approach: filter off the duplicates into another file and process the original (which would then contain no duplicates) with the method I outlined above.
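
For the duplicate split itself, a minimal sketch (assuming the input is sorted on the natural key; the link, column and variable names are hypothetical) is a pair of transformer stage variables that flag repeat keys, with link constraints routing them apart:

Code: Select all

   * Stage variables, evaluated top to bottom for each row, so svIsDup
   * still sees the previous row's key before svLastKey is updated.
   svIsDup:    If In.NATURAL_KEY = svLastKey Then 1 Else 0
   svLastKey:  In.NATURAL_KEY

   * Link constraints:
   *   ToFirstPass :  svIsDup = 0   (first occurrence, processed now)
   *   ToDupsFile  :  svIsDup = 1   (held back for the second pass)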

Identity columns pose their own difficulties when trying to do this type of processing, but some creativity can solve the problem.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

You posted in the server forum but set the job category to parallel. Which is it? You also mention that DB2 partitioning is set to AUTO, yet you are using the API stage and not the Enterprise stage.