How to improve performance when using a user-defined update statement

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Madhu1981
Participant
Posts: 69
Joined: Wed Feb 22, 2006 7:49 am

How to improve performance when using a user-defined update statement

Post by Madhu1981 »

Hi,

I have a job like this:

Dataset ---> Copy Stage ---> DB2 Enterprise stage

Write Method: Upsert
Upsert mode: User-defined Update
Configuration file: 4 nodes
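
For reference, the user-defined update statement in the DB2 Enterprise stage is roughly of this shape (the table and column names below are only placeholders, not the real statement):

    -- placeholder table and column names; the real statement uses your own keys
    update schema.target_table
    set col1 = ORCHESTRATE.col1,
        col2 = ORCHESTRATE.col2
    where account_nr = ORCHESTRATE.account_nr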


Now I am getting 1 million records from input, and it is taking 3 hours to update them.

Is there any way to get better performance?

thanks in advance
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Madhu, all timings are relative, so nobody here can tell you what to do or even if these values are good or bad.
What is your effective speed if you do only inserts? Is your database partitioned? Do you have multiple constraints?
madhukar
Participant
Posts: 86
Joined: Fri May 20, 2005 4:05 pm

Re: How to improve performance when using a user-defined update statement

Post by madhukar »

Since it is taking so much time, you can try the following option:

Do the upsert processing in DataStage and then insert all of the records into the table.
pigpen
Participant
Posts: 38
Joined: Thu Jul 13, 2006 2:51 am

Post by pigpen »

I can think of two things to try; see if either helps.

1. Increase the commit count and the DB transaction log buffer (a sketch of the DB2 side is below).
2. Separate the upsert into one update process and one load process.
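
A minimal sketch of the database side of option 1, assuming a DB2 LUW database named MYDB (the database name and buffer size are examples only; on the DataStage side, the commit count is normally set in the stage properties or, in some releases, through an environment variable such as APT_RDBMS_COMMIT_ROWS):

    -- show the current database configuration, including LOGBUFSZ (in 4 KB pages)
    db2 get db cfg for MYDB

    -- example value only: raise the log buffer; takes effect when the database is reactivated
    db2 update db cfg for MYDB using LOGBUFSZ 2048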
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Another possibility is to take advantage of DB2's inherent parallel processing capabilities - write all of your data to a work table, then use normal DB2 sql to handle your inserts and updates (or use DB2 V8's merge syntax to do an upsert). That will allow you to take advantage of indexes that DataStage can't use.
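
As a rough sketch of the MERGE route, assuming a work table as described in the steps below (table and column names are placeholders):

    -- placeholder table and column names
    merge into schema.target_table t
    using schema.work_table w
        on (t.account_nr = w.account_nr)
    when matched then
        update set t.col1 = w.col1, t.col2 = w.col2
    when not matched then
        insert (account_nr, col1, col2)
        values (w.account_nr, w.col1, w.col2);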

Here's what we often do:

1) In your DataStage job, left outer join your new data (left) to the existing table's keys (right). The table read pulls only the keys and an indicator field, no data. Example:

    select account_nr, cast('UPDATE' as char(6)) as ld_action
    from schema.tablename

2) Use the indicator field to show whether a record is an 'INSERT' or an 'UPDATE'. Because you used a left outer join, ld_action will be NULL if the record is new, and 'UPDATE' if not. You can use a Modify stage to change NULL to 'INSERT'. Now your output record layout is input_data + ld_action.

3) Write all records to a work table. Add an index on your keys and on the ld_action field. You could consider using the 'load' mode instead of 'write' to avoid logging.

4) Use SQL to insert where ld_action = 'INSERT' (or NULL if you didn't modify it before) and update where ld_action = 'UPDATE'; a sketch of this SQL is below. Make sure your work table and target table are indexed the same, and do a runstats on your work table after it is loaded and indexed.

Whether it is 1 million records or 50 million, you should be able to get the time down to minutes instead of hours.
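
A sketch of the step 4 SQL, with placeholder table and column names (the update could equally be written with the MERGE syntax mentioned above):

    -- placeholder names throughout
    insert into schema.target_table (account_nr, col1, col2)
    select account_nr, col1, col2
    from schema.work_table
    where ld_action = 'INSERT';

    update schema.target_table t
    set (col1, col2) =
        (select w.col1, w.col2
         from schema.work_table w
         where w.account_nr = t.account_nr)
    where exists
        (select 1
         from schema.work_table w
         where w.account_nr = t.account_nr
           and w.ld_action = 'UPDATE');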

Hope this helps,

Brad.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

By the way, don't get me wrong. I am a staunch supporter of DataStage and use it wherever possible. However, you should always use the best tool for the job at hand. If DataStage does it best, then go for it; if your database can do it faster, then use the database.

Brad.