Hi,
I have a PX job containing 3 stages:
DB2 stage --> Copy stage --> DB2 stage
I am loading data from one table to another. The source table contains 12 million records and my configuration file contains 4 processing nodes.
The job is taking a very long time to run (0.5 million rows in 2 hours). Could anyone suggest how to improve the performance of this job?
Performance Issue
Moderators: chulett, rschirm, roy
Nail down the bottleneck.
Are your ETL server and database server on different machines?
Have you tried a bulk load?
What if you try a manual load?
Try unloading the data from the source table into a sequential file and doing a manual load from the command prompt.
What is the length of each record?
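The unload-and-reload test above can be sketched with the DB2 command-line processor. This is only a sketch: the file path, delimiter format, and table names below are placeholders borrowed from later in the thread, not anything the poster specified.

```
-- Assumption: run from a DB2 CLP session already connected to the database.
-- Unload the source table to a delimited sequential file:
EXPORT TO /tmp/mybigtable.del OF DEL SELECT * FROM myschema.mybigtable;
-- Reload it into the target table with the LOAD utility:
LOAD FROM /tmp/mybigtable.del OF DEL INSERT INTO myschema.mytargettable;
```

Timing the LOAD step on its own tells you whether the database, rather than the DataStage job design, is the bottleneck.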
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
Reading your table
I am assuming your first stage is a read? How have you scripted the read? Is it a user-defined SQL or a table read? If you are pulling the entire contents of a table, I'd recommend a table read. DataStage will automatically pull the data in parallel with a table read. For a user-defined SQL, you need to specify how to pull it in parallel.
Assume you have a table myschema.mybigtable with 15 million records and partitioned on account_id.
Code:
select *
from myschema.mybigtable
Using this user-defined SQL, DataStage will correctly pull all of the data, but DB2 will have to pool all the data into one stream to return it to DataStage.
On the other hand:
Code:
select *
from myschema.mybigtable
where nodenumber(account_id) = current node
or
Code:
select *
from myschema.mybigtable
where dbpartitionnum(account_id) = current dbpartitionnum for fetch only
This user-defined SQL combined with setting the partitioning table to 'myschema.mybigtable' will pull all of the data in parallel. If your database has 4 partitions, then DataStage will pull in 4 streams. Both formats work, but it was recommended to me to use the latter format (using dbpartitionnum vs. nodenumber).
Note: when you do a table read, this is how the data is pulled. So a table read and the latter SQL are equivalent. I had a similar discussion in another posting recently:
viewtopic.php?t=99616&start=0&postdays= ... highlight=
Loading your table
How are you populating your target table? Is it a load or a write? Are you appending, creating or replacing?
A load will be much faster, especially with high volumes. If you choose to do a write, then you will probably want to set the APT_RDBMS_COMMIT_ROWS to some high value. My memory is that it defaults to 1,000 rows (i.e. do a commit every 1000 rows), but I am not sure. We set ours to 16384 or higher (we use multiples of 1024, not sure why...).
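If you do go the write route, the commit interval can be set as a job-level environment variable or in the shell before the job is invoked. A minimal sketch, assuming your setup reads the variable from the environment; the 16384 value just mirrors the multiple-of-1024 convention mentioned above:

```shell
# Hypothetical: raise the commit interval before running the job.
# 16384 = 16 * 1024 rows between commits.
export APT_RDBMS_COMMIT_ROWS=16384
echo "$APT_RDBMS_COMMIT_ROWS"
```

Fewer, larger commits generally mean less logging overhead per row, at the cost of a longer rollback if the job fails mid-run.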
If you are using create or replace as your mode (instead of truncate or append), then make sure you are setting a partitioning key for your table. These modes will create a new table based on your record schema. If you do not specify a partitioning key, DataStage will use the first field in your record schema. If that does not contain very many values, then the partitioning gets all screwed up and you end up with data that is unbalanced. I would suggest always specifying a partitioning key if you use one of these modes, even if the first field would work okay. You set the partitioning key in the DB2 stage with the dboptions:
Code:
{tablespace='mytablespace', key=key1}
or if you have multiple fields for your partitioning key:
Code:
{tablespace='mytablespace', key=key1, key=key2, etc...}
Well, hope this helps! Sorry I got so long winded!
Brad.
Also find out whether all the processing nodes are actually being used. Do any stages specify an explicit node pool? Are any nodes in the configuration file NOT in the default node pool?
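To make that check concrete, here is a sketch of a 4-node configuration file; the fastname and resource paths are placeholders, not taken from the thread:

```
{
  node "node1" { fastname "etlhost" pools "" resource disk "/ds/d1" {pools ""} resource scratchdisk "/ds/s1" {pools ""} }
  node "node2" { fastname "etlhost" pools "" resource disk "/ds/d2" {pools ""} resource scratchdisk "/ds/s2" {pools ""} }
  node "node3" { fastname "etlhost" pools "" resource disk "/ds/d3" {pools ""} resource scratchdisk "/ds/s3" {pools ""} }
  node "node4" { fastname "etlhost" pools "db2" resource disk "/ds/d4" {pools ""} resource scratchdisk "/ds/s4" {pools ""} }
}
```

In this sketch node4 belongs only to the "db2" pool, not the default pool "", so any stage constrained to the default pool would run 3-way rather than 4-way.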
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.