Hi,
I have a PX job containing 3 stages:
DB2 stage --> Copy stage --> DB2 stage
I am loading data from one table to another. The source table contains 12 million records and my configuration file contains 4 processing nodes.
The job is taking a very long time to run (0.5 million rows in 2 hours). Could anyone suggest how to improve the performance of this job?
Performance Issue
Moderators: chulett, rschirm, roy
Nail down the bottleneck.
Are your ETL server and database server on different machines?
Have you tried a bulk load?
What if you try a manual load?
Try unloading the data from the source table into a sequential file and doing a manual load from the command prompt.
What is the length of each record?
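The unload-and-reload test above can be sketched with the DB2 command-line processor. This is only a sketch: the file path, delimiter format, and table names below are placeholders borrowed from later in the thread, not anything the poster specified.

```
-- Assumption: run from a DB2 CLP session already connected to the database.
-- Unload the source table to a delimited sequential file:
EXPORT TO /tmp/mybigtable.del OF DEL SELECT * FROM myschema.mybigtable;
-- Reload it into the target table with the LOAD utility:
LOAD FROM /tmp/mybigtable.del OF DEL INSERT INTO myschema.mytargettable;
```

Timing the LOAD step on its own tells you whether the database, rather than the DataStage job design, is the bottleneck.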
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Reading your table
I am assuming your first stage is a read? How have you scripted the read? Is it a user-defined SQL or a table read? If you are pulling the entire contents of a table, I'd recommend a table read. DataStage will automatically pull the data in parallel with a table read. For a user-defined SQL, you need to specify how to pull it in parallel.
Assume you have a table myschema.mybigtable with 15 million records and partitioned on account_id.
Code:
select *
from myschema.mybigtable
Using this user-defined SQL, DataStage will correctly pull all of the data, but DB2 will have to pool all the data into one stream to return it to DataStage.
On the other hand:
Code:
select *
from myschema.mybigtable
where nodenumber(account_id) = current node
or
Code:
select *
from myschema.mybigtable
where dbpartitionnum(account_id) = current dbpartitionnum for fetch only
This user-defined SQL combined with setting the partitioning table to 'myschema.mybigtable' will pull all of the data in parallel. If your database has 4 partitions, then DataStage will pull in 4 streams. Both formats work, but it was recommended to me to use the latter format (using dbpartitionnum vs. nodenumber).
Note: when you do a table read, this is how the data is pulled. So a table read and the latter SQL are equivalent. I had a similar discussion in another posting recently:
viewtopic.php?t=99616&start=0&postdays= ... highlight=
Loading your table
How are you populating your target table? Is it a load or a write? Are you appending, creating or replacing?
A load will be much faster, especially with high volumes. If you choose to do a write, then you will probably want to set the APT_RDBMS_COMMIT_ROWS to some high value. My memory is that it defaults to 1,000 rows (i.e. do a commit every 1000 rows), but I am not sure. We set ours to 16384 or higher (we use multiples of 1024, not sure why...).
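If you do go the write route, the commit interval can be set as a job-level environment variable or in the shell before the job is invoked. A minimal sketch, assuming your setup reads the variable from the environment; the 16384 value just mirrors the multiple-of-1024 convention mentioned above:

```shell
# Hypothetical: raise the commit interval before running the job.
# 16384 = 16 * 1024 rows between commits.
export APT_RDBMS_COMMIT_ROWS=16384
echo "$APT_RDBMS_COMMIT_ROWS"
```

Fewer, larger commits generally mean less logging overhead per row, at the cost of a longer rollback if the job fails mid-run.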
If you are using create or replace as your mode (instead of truncate or append), then make sure you are setting a partitioning key for your table. These modes will create a new table based on your record schema. If you do not specify a partitioning key, DataStage will use the first field in your record schema. If that does not contain very many values, then the partitioning gets all screwed up and you end up with data that is unbalanced. I would suggest always specifying a partitioning key if you use one of these modes, even if the first field would work okay. You set the partitioning key in the DB2 stage with the dboptions:
Code:
{tablespace='mytablespace', key=key1}
or if you have multiple fields for your partitioning key:
Code:
{tablespace='mytablespace', key=key1, key=key2, etc...}
Well, hope this helps! Sorry I got so long winded!
Brad.
Also find out whether all the processing nodes are actually being used. Do any stages specify an explicit node pool? Are any nodes in the configuration file NOT in the default node pool?
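To make that check concrete, here is a sketch of a 4-node configuration file; the fastname and resource paths are placeholders, not taken from the thread:

```
{
  node "node1" { fastname "etlhost" pools "" resource disk "/ds/d1" {pools ""} resource scratchdisk "/ds/s1" {pools ""} }
  node "node2" { fastname "etlhost" pools "" resource disk "/ds/d2" {pools ""} resource scratchdisk "/ds/s2" {pools ""} }
  node "node3" { fastname "etlhost" pools "" resource disk "/ds/d3" {pools ""} resource scratchdisk "/ds/s3" {pools ""} }
  node "node4" { fastname "etlhost" pools "db2" resource disk "/ds/d4" {pools ""} resource scratchdisk "/ds/s4" {pools ""} }
}
```

In this sketch node4 belongs only to the "db2" pool, not the default pool "", so any stage constrained to the default pool would run 3-way rather than 4-way.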
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.