Hi,
Is there a rule of thumb for which stages need their data partitioned/sorted to work properly, and which do not?
To my understanding, some stages such as Join, Merge, Aggregator, Remove Duplicates etc. need the data to be partitioned and sorted to produce the expected results, and just leaving Auto partitioning on these stages is not going to produce correct results. Is this true?
Also, assuming you have a 4-node config file and the resource disks of all the nodes point to the same directory, will Auto partitioning work for all stages, including those where partitioning and sorting the data is mandatory? That is, because the different nodes point to the same physical location, are the records read such that they end up in only one partition?
Partitioning for Different stages
Here go my five cents:
I think explicit partitioning in a stage (or on a link, to be more precise) is required when you have to match data between links (Join/Merge/Lookup etc.) or compare rows within a stage (Aggregator, Remove Duplicates etc.). Partitioning on the key columns puts rows with the same key in the same partition, which helps both accuracy and performance.
As far as Auto partitioning is concerned, I have also seen incorrect results in my early experience, but I think that is an old story now. Anyway, I don't leave it to DataStage to decide things when I can make the decisions myself. It gives me better control and awareness of what's going on.
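To illustrate the point about comparing rows within a stage, here is a small Python sketch. It is not DataStage code: the helper names (round_robin, hash_partition, dedup_per_partition) and the sample rows are made up for the example, and round robin only stands in for any partitioning that ignores the key. It is not a claim about what Auto actually picks. Each partition removes duplicates on its own, the way a parallel stage works on its own slice of the data.

```python
from zlib import crc32


def round_robin(rows, n):
    """Deal rows to n partitions in turn, ignoring the key."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send every row with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[crc32(str(row[key]).encode()) % n].append(row)
    return parts


def dedup_per_partition(parts, key):
    """Keep the first row per key, but only within each partition."""
    out = []
    for part in parts:
        seen = set()
        for row in part:
            if row[key] not in seen:
                seen.add(row[key])
                out.append(row)
    return out


# Hypothetical sample data: three distinct ids, six rows.
rows = [
    {"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"},
    {"id": 2, "v": "d"}, {"id": 3, "v": "e"}, {"id": 1, "v": "f"},
]

rr_result = dedup_per_partition(round_robin(rows, 4), "id")
hash_result = dedup_per_partition(hash_partition(rows, 4, "id"), "id")

print(len(rr_result))    # 6: duplicates sit in different partitions and survive
print(len(hash_result))  # 3: one row per id
```

With four partitions, the key-blind split leaves the repeated ids spread across partitions, so no partition ever sees both copies. Partitioning on the key is what makes the per-partition result match the result you would get on a single node.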
sajarman wrote: "As far as Auto partitioning is concerned, I have also observed incorrect results during my early experiences. But I think that is an old story now."
When you say it is an old story now, do you mean it was a bug that has since been fixed?