Performance Issue

Madhu1981
Participant
Posts: 69
Joined: Wed Feb 22, 2006 7:49 am

Performance Issue

Post by Madhu1981 »

Hi,

I have a PX job containing 3 stages:

DB2 stage --> Copy stage --> DB2 stage

I am loading data from one table to another table. My source table contains 12 million records and my configuration file contains 4 processing nodes.

The job is taking a very long time to run (0.5 million rows in 2 hours). Could anyone suggest how I can improve the performance of this job?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Nail down the bottleneck.
Are your ETL and database servers on different machines?
Have you tried a bulk load?
What if you try a manual load?
Try unloading the data from the source table into a sequential file and loading it manually from the command prompt, as sketched below.
What is the length of each record?
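
For example, a rough sketch of a manual unload/load from the DB2 command line (the schema, table and file names here are only placeholders, substitute your own):

Code: Select all

db2 "EXPORT TO /tmp/mysource.ixf OF IXF SELECT * FROM myschema.mysourcetable"
db2 "LOAD FROM /tmp/mysource.ixf OF IXF REPLACE INTO myschema.mytargettable"

Timing these two commands gives you a rough idea of how much of the 2 hours is spent in the database rather than in the job itself.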
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

Reading your table

I am assuming your first stage is a read. How have you scripted the read? Is it user-defined SQL or a table read? If you are pulling the entire contents of a table, I'd recommend a table read: DataStage will automatically pull the data in parallel. With user-defined SQL, you need to specify yourself how to pull it in parallel.

Assume you have a table myschema.mybigtable with 15 million records, partitioned on account_id.

Code: Select all

select *
from myschema.mybigtable
Using this user-defined SQL, DataStage will correctly pull all of the data, but DB2 will have to pool all the data into one stream to return it to DataStage.

On the other hand:

Code: Select all

select *
from myschema.mybigtable
where nodenumber(account_id) = current node
or

Code: Select all

select *
from myschema.mybigtable
where dbpartitionnum(account_id) = current dbpartitionnum for fetch only
This user-defined SQL, combined with setting the partitioning table to 'myschema.mybigtable', will pull all of the data in parallel. If your database has 4 partitions, then DataStage will pull the data in 4 streams. Both formats work, but it was recommended to me to use the latter (dbpartitionnum rather than nodenumber).

Note: when you do a table read, this is how the data is pulled. So a table read and the latter SQL are equivalent. I had a similar discussion in another posting recently:

viewtopic.php?t=99616&start=0&postdays= ... highlight=

Loading your table

How are you populating your target table? Is it a load or a write? Are you appending, creating or replacing?

A load will be much faster, especially with high volumes. If you choose to do a write, then you will probably want to set APT_RDBMS_COMMIT_ROWS to some high value. My memory is that it defaults to 1,000 rows (i.e. a commit every 1,000 rows), but I am not sure. We set ours to 16384 or higher (we use multiples of 1024, not sure why...).
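
For example (just to show where the number goes), the variable added as a job or project environment variable would simply be:

Code: Select all

APT_RDBMS_COMMIT_ROWS=16384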

If you are using create or replace as your mode (instead of truncate or append), then make sure you set a partitioning key for your table. These modes create a new table based on your record schema. If you do not specify a partitioning key, DataStage will use the first field in your record schema. If that field does not contain very many distinct values, the partitioning gets all screwed up and you end up with unbalanced data. I would suggest always specifying a partitioning key if you use one of these modes, even if the first field would work okay. You set the partitioning key in the DB2 stage with dboptions:

Code: Select all

{tablespace='mytablespace', key=key1}
or if you have multiple fields for your partitioning key:

Code: Select all

{tablespace='mytablespace', key=key1, key=key2, etc...}


Well, hope this helps! Sorry I got so long-winded!
Brad.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Also find out whether all the processing nodes are actually being used. Do any stages specify an explicit node pool? Are any nodes in the configuration file NOT in the default node pool?
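
For illustration, a minimal configuration file sketch (the host name and paths are placeholders). Only nodes whose pools list includes the default pool "" are used by stages that do not name a pool explicitly; "node2" below belongs only to an "extra" pool and would sit idle for such stages:

Code: Select all

{
   node "node1"
   {
      fastname "etl_host"
      pools ""
      resource disk "/data/ds/node1" {pools ""}
      resource scratchdisk "/scratch/ds/node1" {pools ""}
   }
   node "node2"
   {
      fastname "etl_host"
      pools "extra"
      resource disk "/data/ds/node2" {pools ""}
      resource scratchdisk "/scratch/ds/node2" {pools ""}
   }
}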
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.