Performance Tuning

Post questions here related to DataStage Enterprise/PX Edition for such areas as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Performance Tuning

Post by wfis »

Hi All,

We have a scenario in one of our jobs where we remove duplicates using the Sort stage, with the "Allow Duplicates" option set to False.

The Sort stage hash-partitions on the keys on which duplicates are to be removed. The Stable Sort option in the stage is set to True.

This job will run with an 8-node configuration file in production.

We know the number of records to be de-duplicated in production: it is close to 216 million.
Given that volume, we can be fairly sure this job will consume a lot of resources and time.

We have the following options:
1. Use the "Restrict Memory Usage" option, with a value based on our infrastructure team's suggestion.
2. Use another configuration file with node pools that the Sort stage can be constrained to (a sketch of what we have in mind is below).
3. Run this job at a time when more memory is available.
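
For option 2, the configuration file we have in mind would look roughly like the sketch below (the host name and paths are only placeholders for our environment); we would then constrain the Sort stage to the "sort" pool via the node pool constraint on the stage's Advanced tab. As we understand it, scratch disks placed in the "sort" pool are the ones a sort spills to first:

    {
      node "node1"
      {
        fastname "prodhost"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "prodhost"
        pools "" "sort"
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch_sort" {pools "" "sort"}
      }
    }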

Any suggestions on improving the performance?


Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Get rid of stable sort unless you have reason to keep it. It requires more resources than not using it.

Unless your other configuration file has more nodes than eight, it cannot give better sort performance than eight nodes. You are only processing, on average, 27 million rows per node - this is not a large load.

Reading between the lines of your question, your machine is already overloaded. Try to run the job when the total load on the machine is lower.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Post by wfis »

Thank you for the reply.
Yes, we have a constraint on the space allocated to us. We are looking at a new configuration file, but will a node pool for this Sort stage help?

Need more insight on why not to go for Stable sort.

Any other suggestions?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

wfis wrote:Need more insight on why not to go for Stable sort.
It requires more resources than not using it.

Which part of that is unclear?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Post by wfis »

Thanks for the reply.
OK... I wanted to know whether any other performance tuning steps can be applied, along with disabling stable sort and using a new configuration file.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Do you have any particular definition in mind of what "performance" means in an ETL context?

There are probably many other possibilities, but if I suggest some of them you will reject them (for example bigger hardware, more nodes - though you are not processing a large amount of data really).

Every case must be analyzed on its own merits, by monitoring over time and, if you're on version 8, using the resource estimation tool.

The main secret is not to do anything you don't need to do. This includes such things as processing unnecessary rows and columns.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Post by wfis »

Hi Ray,

You mentioned 27 million rows per node - is this a standard? In that case, will 20 MB per node for the Sort stage work?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Who knows? Do you have row sizes in bytes, or in the millions of bytes?
How long is a piece of string?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Post by wfis »

If it is the keys, then the key columns together are 19 characters.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Trivial. 20MB per node will be plenty. It may not even use all of that.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
wfis
Premium Member
Posts: 70
Joined: Wed Feb 28, 2007 2:38 am
Location: India

Post by wfis »

We now have a 16-node configuration file in production. I am really not sure of the scratch disk space - maybe around 100 MB, with no node pools.

It is a real relief to hear that this space will be enough - plenty, actually. Thanks...

So a record count of 216 million will work? The only change will be to set the Stable Sort option to False - right?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The 20 MB per node in the Sort stage is memory, not scratch disk. The process will go to scratch disk if that amount of real memory cannot be provided on demand. Assuming nothing else is happening, you need 216 million x 20 bytes (approximately 4.3 GB) of scratch disk space to handle the worst-case scenario for this Sort stage.

In practice other operators may also be using scratch disk, so the formula gets more complex. The resource estimator in version 8 gives per-operator figures which should be added to get the worst case figures.
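
To make the arithmetic explicit, here is a rough worked example, assuming roughly 20 bytes per record (about the size of your 19-character key; real records with more columns will be proportionally wider):

    worst-case scratch space  =  row count x record width
                              =  216,000,000 x 20 bytes
                              =  about 4.3 GB in total, spread across the scratch disks of all nodes
                              =  about 270 MB per node on a 16-node configuration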
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Nagin
Charter Member
Posts: 89
Joined: Thu Jan 26, 2006 12:37 pm

Post by Nagin »

Ray,

I was reading this thread from a while ago. Can you throw some more light on why stable sort is slower than not using it?

I take your word on this, but at the same time I would like more detail on why, architecturally, it is designed or behaves that way. I need to take this to my boss to argue for disabling it in one of the jobs I am redesigning.

ray.wurlod wrote:
wfis wrote:Need more insight on why not to go for Stable sort.
It requires more resources than not using it.

Which part of that is unclear?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I never claimed it was slower. I asserted that it needs more resources (mainly memory but also CPU). This is because each sort group must be kept sorted in memory. If you have sufficient spare resources, then overall processing speed may be unaffected. But best practice in all computing is never to do anything unnecessary.
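
As a made-up illustration of the distinction (the letter is the sort key; the row numbers show arrival order):

    Input rows          :  (A, row1)  (B, row2)  (A, row3)
    Stable sort on key  :  (A, row1)  (A, row3)  (B, row2)   <- equal keys keep their arrival order
    Non-stable sort     :  (A, row3)  (A, row1)  (B, row2)   <- either order of the A rows is acceptable

Preserving that arrival order is the extra bookkeeping. If it does not matter to you which of the equal-keyed rows comes out first (and therefore which one survives a de-duplication), you are paying for a guarantee you never use.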
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.