Performance Tuning
Hi All,
We have a scenario in one of our jobs where we remove duplicates using the Sort stage, with the "Allow Duplicates" option set to False.
The Sort stage hash-partitions on the keys being deduplicated, and the Stable Sort option is set to True.
This job will run on an 8-node configuration file in production.
We know that the number of records to be deduplicated in production is close to 216 million.
Given this volume, we can be pretty sure the job will consume a lot of resources and time.
We have the following options:
1. Use the "Restrict Memory Usage" option, with a value suggested by our infrastructure team.
2. Use another configuration file with node pools dedicated to sorting in the Sort stage.
3. Run the job when more memory is available.
Any suggestions on improving the performance?
Thanks
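For option 2, a minimal sketch of what a configuration file with a dedicated sort node pool might look like (the hostname and paths here are placeholders, not your actual environment; nodes 2-8 would be defined the same way):

```
{
    node "node1"
    {
        fastname "prod_host"
        pools "" "sort"
        resource disk "/data/ds/node1" {pools ""}
        resource scratchdisk "/scratch/sort/node1" {pools "sort"}
        resource scratchdisk "/scratch/ds/node1" {pools ""}
    }
}
```

The Sort stage would then be constrained to the "sort" pool on its Advanced tab, so its scratch I/O lands on the dedicated scratch disks rather than competing with the rest of the job.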
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Get rid of Stable Sort unless you have a reason to keep it; it requires more resources than not using it.
Unless your other configuration file has more than eight nodes, it cannot give better sort performance than eight nodes. You're only processing, on average, 27 million rows per node - this is not a large load.
Reading between the lines of your question, your machine is already overloaded. Try to run the job when the total load on the machine is lower.
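Whether Stable Sort can be dropped depends only on whether you care which duplicate survives. A small Python sketch (not DataStage, just an illustration of the semantics) of deduplication after a sort, where stability determines which record is kept per key:

```python
# Records as (dedup_key, payload) pairs; "Allow Duplicates" = False keeps
# the first record of each key group after sorting.
records = [("A", "row1"), ("B", "row2"), ("A", "row3")]

# Python's sorted() is always stable: equal keys keep their input order.
ordered = sorted(records, key=lambda r: r[0])

def dedupe_keep_first(rows):
    """Keep the first record seen for each key."""
    seen, kept = set(), []
    for key, payload in rows:
        if key not in seen:
            seen.add(key)
            kept.append((key, payload))
    return kept

# With a stable sort, ("A", "row1") is the guaranteed survivor for key "A".
# An unstable sort could legitimately keep ("A", "row3") instead; if either
# outcome is acceptable to the business, stability buys nothing.
print(dedupe_keep_first(ordered))  # [('A', 'row1'), ('B', 'row2')]
```

If any record per key is acceptable downstream, that is the argument for switching Stable Sort off.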
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Do you have any particular definition in mind of what "performance" means in an ETL context?
There are probably many other possibilities, but if I suggest some of them you will reject them (for example bigger hardware, more nodes - though you are not processing a large amount of data really).
Every case must be analyzed on its own merits, by monitoring over time and, if you're on version 8, using the resource estimation tool.
The main secret is not to do anything you don't need to do. This includes such things as processing unnecessary rows and columns.
We now have a 16-node configuration file in production. I'm really not sure of the scratch disk space; maybe around 100 MB, with no node pools.
It is a real relief to hear you say that the space will be enough, actually plenty. Thanks...
So a record count of 216 million will work? The only change will be to set the Stable Sort option to False. Right?
Thanks
The 20 MB per node in the Sort stage is memory, not scratch disk. The process will spill to scratch disk if that amount of real memory cannot be provided on demand. Assuming nothing else is happening and an average record size of 20 bytes, you need 216 million × 20 bytes (approximately 4.3 GB) of scratch disk space to handle the worst-case scenario for this Sort stage.
In practice other operators may also be using scratch disk, so the formula gets more complex. The resource estimator in version 8 gives per-operator figures which should be added to get the worst case figures.
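A quick back-of-the-envelope check of that worst case, using the figures from this thread (216 million records at an assumed average of 20 bytes each, spread across the 8-node configuration):

```python
# Worst-case scratch disk for the Sort stage, per the figures above.
records = 216_000_000
bytes_per_record = 20   # assumed average record size from this thread
nodes = 8

worst_case_bytes = records * bytes_per_record
print(f"total scratch: {worst_case_bytes / 1e9:.2f} GB")           # 4.32 GB
print(f"per node     : {worst_case_bytes / nodes / 1e6:.0f} MB")   # 540 MB
```

Note that the product comes to about 4.3 GB in total, or roughly 540 MB per node on eight nodes - well above the ~100 MB of scratch mentioned earlier in the thread, which is worth verifying with the infrastructure team.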
Ray,
I was reading this thread from a while ago. Can you shed some more light on why Stable Sort is slower than not using it?
I take your word on this, but at the same time I want to know, architecturally, why it is designed that way or why it behaves that way. I need to take this to my boss to argue for disabling it in one of the jobs I am redesigning.
ray.wurlod wrote: It requires more resources than not using it.
wfis wrote: Need more insight on why not to go for Stable sort.
Which part of that is unclear?
I never claimed it was slower. I asserted that it needs more resources (mainly memory but also CPU). This is because each sort group must be kept sorted in memory. If you have sufficient spare resources, then overall processing speed may be unaffected. But best practice in all computing is never to do anything unnecessary.
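As a generic illustration of that cost (this is the classic way stability is obtained from an unstable sort, not necessarily DataStage's internal mechanism): each record is tagged with its input sequence number and the sort key is widened to include it, which is extra memory per record and an extra comparison component per compare:

```python
import random

# All records share one key, so only stability decides their output order.
records = [("A", f"row{i}") for i in range(5)]
random.shuffle(records)

# Stability via a tie-breaker: tag each record with its input position and
# sort on (key, position). The tag is the extra per-record memory; the
# second comparison component is the extra CPU.
tagged = list(enumerate(records))            # (input_position, record)
tagged.sort(key=lambda t: (t[1][0], t[0]))
stable_order = [rec for _pos, rec in tagged]

# Equal-key records come out in exactly their input order.
assert stable_order == records
```

A plain unstable sort carries neither the tag nor the extra comparison, which is the sense in which stability is never free.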