Performance Tuning
Hi All,
We have a scenario in one of our jobs where we remove duplicates using the Sort stage, with the "Allow Duplicates" option set to False.
The Sort stage hash-partitions on the keys being deduplicated, and the Stable Sort option is set to True.
This job will run on an 8-node configuration file in production.
We know that the number of records to be deduplicated in production is close to 216 million.
Given this volume, we can be pretty sure the job will consume a lot of resources and time.
We have the following options:
1. Use the "Restrict Memory Usage" option, with a value suggested by our infrastructure team.
2. Use another configuration file with node pools dedicated to sorting in the Sort stage.
3. Run the job when more memory is available.
Any suggestions on improving the performance?
Thanks
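For option 2, a minimal sketch of what a configuration file with a dedicated sort node pool might look like (the hostname and paths here are placeholders, not your actual environment; nodes 2-8 would be defined the same way):

```
{
    node "node1"
    {
        fastname "prod_host"
        pools "" "sort"
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/sort/node1" {pools "sort"}
        resource scratchdisk "/scratch/ds/node1" {pools ""}
    }
}
```

The Sort stage would then be constrained to the "sort" pool on its Advanced tab, so its scratch I/O lands on the dedicated scratch disks rather than competing with the rest of the job.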
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Get rid of Stable Sort unless you have a reason to keep it; it requires more resources than not using it.
Unless your other configuration file has more than eight nodes, it cannot give better sort performance than eight nodes. You're only processing, on average, 27 million rows per node - this is not a large load.
Reading between the lines of your question, your machine is already overloaded. Try to run the job when the total load on the machine is lower.
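Whether Stable Sort can be dropped depends only on whether you care which duplicate survives. A small Python sketch (not DataStage, just an illustration of the semantics) of deduplication after a sort, where stability determines which record is kept per key:

```python
# Records as (dedup_key, payload) pairs; "Allow Duplicates" = False keeps
# the first record of each key group after sorting.
records = [("A", "row1"), ("B", "row2"), ("A", "row3")]

# Python's sorted() is always stable: equal keys keep their input order.
ordered = sorted(records, key=lambda r: r[0])

def dedupe_keep_first(rows):
    """Keep the first record seen for each key."""
    seen, kept = set(), []
    for key, payload in rows:
        if key not in seen:
            seen.add(key)
            kept.append((key, payload))
    return kept

# With a stable sort, ("A", "row1") is the guaranteed survivor for key "A".
# An unstable sort could legitimately keep ("A", "row3") instead; if either
# outcome is acceptable to the business, stability buys nothing.
print(dedupe_keep_first(ordered))  # [('A', 'row1'), ('B', 'row2')]
```

If any record per key is acceptable downstream, that is the argument for switching Stable Sort off.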
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Do you have any particular definition in mind of what "performance" means in an ETL context?
There are probably many other possibilities, but if I suggest some of them you will reject them (for example bigger hardware, more nodes - though you are not processing a large amount of data really).
Every case must be analyzed on its own merits, by monitoring over time and, if you're on version 8, using the resource estimation tool.
The main secret is not to do anything you don't need to do. This includes such things as processing unnecessary rows and columns.
We now have a 16-node configuration file in production. I'm really not sure of the scratch disk space; maybe around 100 MB, with no node pools.
It is a real relief to hear you say that the space will be enough, actually plenty. Thanks...
So a record count of 216 million will work? The only change will be to set the Stable Sort option to False. Right?
Thanks
The 20 MB per node in the Sort stage is memory, not scratch disk. The process will spill to scratch disk if that amount of real memory cannot be provided on demand. Assuming nothing else is happening and an average record size of 20 bytes, you need 216 million × 20 bytes (approximately 4.3 GB) of scratch disk space to handle the worst-case scenario for this Sort stage.
In practice other operators may also be using scratch disk, so the formula gets more complex. The resource estimator in version 8 gives per-operator figures which should be added to get the worst case figures.
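A quick back-of-the-envelope check of that worst case, using the figures from this thread (216 million records at an assumed average of 20 bytes each, spread across the 8-node configuration):

```python
# Worst-case scratch disk for the Sort stage, per the figures above.
records = 216_000_000
bytes_per_record = 20   # assumed average record size from this thread
nodes = 8

worst_case_bytes = records * bytes_per_record
print(f"total scratch: {worst_case_bytes / 1e9:.2f} GB")           # 4.32 GB
print(f"per node     : {worst_case_bytes / nodes / 1e6:.0f} MB")   # 540 MB
```

Note that the product comes to about 4.3 GB in total, or roughly 540 MB per node on eight nodes - well above the ~100 MB of scratch mentioned earlier in the thread, which is worth verifying with the infrastructure team.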
Ray,
I was reading this thread from a while ago. Can you shed some more light on why Stable Sort is slower than not using it?
I take your word on this, but at the same time I want to know, architecturally, why it is designed that way or why it behaves that way. I need to take this to my boss to argue for disabling it in one of the jobs I am redesigning.
ray.wurlod wrote: It requires more resources than not using it.
wfis wrote: Need more insight on why not to go for Stable sort.
Which part of that is unclear?
I never claimed it was slower. I asserted that it needs more resources (mainly memory but also CPU). This is because each sort group must be kept sorted in memory. If you have sufficient spare resources, then overall processing speed may be unaffected. But best practice in all computing is never to do anything unnecessary.
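As a generic illustration of that cost (this is the classic way stability is obtained from an unstable sort, not necessarily DataStage's internal mechanism): each record is tagged with its input sequence number and the sort key is widened to include it, which is extra memory per record and an extra comparison component per compare:

```python
import random

# All records share one key, so only stability decides their output order.
records = [("A", f"row{i}") for i in range(5)]
random.shuffle(records)

# Stability via a tie-breaker: tag each record with its input position and
# sort on (key, position). The tag is the extra per-record memory; the
# second comparison component is the extra CPU.
tagged = list(enumerate(records))            # (input_position, record)
tagged.sort(key=lambda t: (t[1][0], t[0]))
stable_order = [rec for _pos, rec in tagged]

# Equal-key records come out in exactly their input order.
assert stable_order == records
```

A plain unstable sort carries neither the tag nor the extra comparison, which is the sense in which stability is never free.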