Partitioning for Different Stages

pavankvk · Post by **pavankvk** » Tue Feb 19, 2008 8:35 am

Hi,

Is there a rule of thumb for which stages need the data partitioned/sorted in order to work properly, and which do not?

To my understanding, some stages (Join, Merge, Aggregator, Remove Duplicates, etc.) need the data to be partitioned and sorted to produce the expected results, and just leaving Auto partitioning on these stages will not produce correct results. Is this true?

Also, assuming you have a 4-node config file where the resource disks of all the nodes point to the same directory, will Auto partitioning work for all stages, including those where it is mandatory to partition and sort the data? Is that because the different nodes point to the same physical location, so the records are read in such a way that they end up in only one partition?

sajarman · Post by **sajarman** » Tue Feb 19, 2008 11:38 am

Here goes my five cents:

I think partitioning logic on a stage (or on a link, to be more precise) is required when you have to match data between links (Join/Merge/Lookup, etc.) or compare rows within a stage (Aggregator, Remove Duplicates, etc.). This improves both accuracy and performance.

As for Auto partitioning, I also saw incorrect results in my early experience, but I think that is an old story now. Either way, I don't leave it to DataStage to decide things when I can make the decision myself. It gives me better control over, and awareness of, what's going on.
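To illustrate the "compare rows within a stage" case, here is a rough sketch in plain Python. It is not DataStage code and says nothing about what Auto actually picks. The function names, the 4 partitions and the sample rows are all made up for the example. It only shows why a per-partition Remove Duplicates needs rows with the same key in the same partition.

```python
import zlib

NUM_PARTITIONS = 4  # assumption: stands in for a 4-node configuration file


def round_robin_partition(rows, n=NUM_PARTITIONS):
    """Deal rows out to partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, key, n=NUM_PARTITIONS):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is used only because it gives the same value on every run
        idx = zlib.crc32(str(row[key]).encode()) % n
        parts[idx].append(row)
    return parts


def remove_duplicates_per_partition(parts, key):
    """Each partition sorts and dedups on its own, seeing only its own rows."""
    out = []
    for part in parts:
        seen = set()
        for row in sorted(part, key=lambda r: r[key]):
            if row[key] not in seen:
                seen.add(row[key])
                out.append(row)
    return out


# Hypothetical input: 6 rows, 3 distinct customers
rows = [{"cust": c, "amt": a} for c, a in
        [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("A", 6)]]

rr = remove_duplicates_per_partition(round_robin_partition(rows), "cust")
hp = remove_duplicates_per_partition(hash_partition(rows, "cust"), "cust")

print(len(rr), len(hp))  # prints "6 3"
```

With round robin, the rows for one customer are spread over several partitions, so duplicates survive. With hash partitioning on the key, each customer lands in exactly one partition and the result has one row per customer.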

pavankvk · Post by **pavankvk** » Tue Feb 19, 2008 11:41 am

sajarman wrote: Here goes my five cents:

I think partitioning logic on a stage (or on a link, to be more precise) is required when you have to match data between links (Join/Merge/Lookup, etc.) or compare rows within a stage (Aggregator, Remove Duplicates, etc.). This improves both accuracy and performance.

As for Auto partitioning, I also saw incorrect results in my early experience, but I think that is an old story now. Either way, I don't leave it to DataStage to decide things when I can make the decision myself. It gives me better control over, and awareness of, what's going on.

When you say it's an old story now, do you mean it was a bug that has since been fixed?

sajarman · Post by **sajarman** » Thu Feb 21, 2008 3:51 pm

I do not know whether it was a bug or not. I did not run into that issue when I experimented with Auto partitioning some time back. These days I use Hash partitioning (and the like) as one of my best practices, and I do not leave Auto where it matters.
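For the "match data between links" case, here is a minimal sketch in plain Python of why both input links need to be hash partitioned on the same key. Again, this is not DataStage code. The names and sample data are hypothetical, and it uses a dictionary lookup inside each partition instead of the sorted merge a Join stage performs, so it only illustrates the partitioning side.

```python
import zlib


def round_robin_partition(rows, n):
    """Deal rows out to partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, key, n):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is used only because it gives the same value on every run
        parts[zlib.crc32(str(row[key]).encode()) % n].append(row)
    return parts


def join_per_partition(left_parts, right_parts, key):
    """Inner join: partition i of the left link only sees partition i of the right link."""
    out = []
    for lpart, rpart in zip(left_parts, right_parts):
        lookup = {}
        for r in rpart:
            lookup.setdefault(r[key], []).append(r)
        for l in lpart:
            for r in lookup.get(l[key], []):
                out.append({**l, **r})
    return out


# Hypothetical input links
orders = [{"cust": "A", "order": 1}, {"cust": "B", "order": 2},
          {"cust": "C", "order": 3}, {"cust": "A", "order": 4}]
customers = [{"cust": "C", "name": "Carol"}, {"cust": "A", "name": "Alice"},
             {"cust": "B", "name": "Bob"}]

n = 4  # assumption: 4 partitions, as with a 4-node config file
rr_join = join_per_partition(round_robin_partition(orders, n),
                             round_robin_partition(customers, n), "cust")
hash_join = join_per_partition(hash_partition(orders, "cust", n),
                               hash_partition(customers, "cust", n), "cust")

print(len(rr_join), len(hash_join))  # prints "0 4"
```

With this particular data, round robin puts matching keys in different partitions, so the join finds nothing. Hash partitioning both links on `cust` brings the matching rows together and all four orders get their customer.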