Data claims to already be sorted

patelamit009
Participant
Posts: 20
Joined: Fri Jan 27, 2006 12:17 am

Post by patelamit009 »

I have a job as shown below.


DB2------>Aggregator------>Join Stage.......>Dataset

Keys in Aggregator are - Key1, Key2, Key3 and Key4

Key in Join Stage is - Key1

I am using the Sort method to aggregate the data, so I hash partition and sort on the key columns on the input link of the Aggregator stage.

Since the Join stage uses only one key (Key1), I hash partition and sort again on Key1 on the input link of the Join stage.

I am getting the following warning in the Director.

"When checking operator: Data claims to already be sorted on the specified keys the 'sorted' option can be used to confirm this. Data will be resorted as necessary. Performance may improve if this sort is removed from the flow"

When I remove the Hash/Sort on the link from the Aggregator to the Join stage and set the partitioning to Same, the warning is no longer thrown.

I am confused about the concept of Hash/Sort.

As per my understanding, since the keys are different, we should partition the data again on the new key.

Please point out any misconception, and please explain what the warning means by "the 'sorted' option can be used to confirm this".

Thanks in Advance
Regards,
Patel
OddJob
Participant
Posts: 163
Joined: Tue Feb 28, 2006 5:00 am
Location: Sheffield, UK

Post by OddJob »

Firstly, I would suggest you read up on sorting and partitioning in the DS user guides as you seem a bit unsure of the requirements to do either.

Briefly, sorting and partitioning are different things and shouldn't be confused.
Partitioning is the process of splitting your data into smaller sets that can then be processed by multiple nodes on your server.
Sorting concerns the ordering of the records within each partition once they have been partitioned.

In your scenario, you must sort by Key1,Key2,Key3,Key4 to satisfy your aggregator. By virtue of the 'leading edge' of the sort being Key1, this will also satisfy your join.
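
As a rough illustration of the 'leading edge' point (plain Python, not DataStage; the rows and the helper is_sorted_on are made up for this sketch): data sorted on Key1, Key2, Key3, Key4 is necessarily sorted on Key1 as well, but not on a later key such as Key2 taken on its own.

```python
from operator import itemgetter


def is_sorted_on(rows, keys):
    """Return True if rows are in non-decreasing order on the given key columns."""
    get = itemgetter(*keys)
    values = [get(row) for row in rows]
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical sample rows, for illustration only.
rows = [
    {"Key1": 2, "Key2": "b", "Key3": 1, "Key4": 0},
    {"Key1": 1, "Key2": "z", "Key3": 5, "Key4": 9},
    {"Key1": 1, "Key2": "a", "Key3": 7, "Key4": 3},
    {"Key1": 2, "Key2": "a", "Key3": 4, "Key4": 2},
]

agg_keys = ["Key1", "Key2", "Key3", "Key4"]
sorted_rows = sorted(rows, key=itemgetter(*agg_keys))

print(is_sorted_on(sorted_rows, agg_keys))   # True: the order the aggregator needs
print(is_sorted_on(sorted_rows, ["Key1"]))   # True: the leading key is sorted too
print(is_sorted_on(sorted_rows, ["Key2"]))   # False: a non-leading key alone is not
```

This is why a second sort on Key1 in front of the join adds nothing: the data already arrives in Key1 order.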

Your partitioning need only use enough keys to get a good spread across your nodes whilst still satisfying the requirements of the stages in your job. You need to use them in the same order as the sort, but you may only need to hash (or modulus if it's a numeric field) partition on Key1.

Again, partitioning by just Key1 will satisfy both your aggregator and your join. In fact, because the join only uses Key1, you must partition on Key1 alone: that is what guarantees that all rows with the same value of Key1 end up in the same partition and can therefore be joined correctly.
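
A minimal sketch of why this works, again in plain Python rather than DataStage. The functions hash_partition and partitions_per_group are hypothetical helpers, the sample rows are invented, and zlib.crc32 is only a stand-in for whatever hash function the engine really uses. Hashing on Key1 alone keeps every Key1 value in one partition, and therefore every Key1, Key2, Key3, Key4 combination as well.

```python
import zlib
from collections import defaultdict
from itertools import product


def hash_partition(rows, keys, num_partitions):
    """Assign each row to a partition based on a hash of its key columns."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key_bytes = repr(tuple(row[k] for k in keys)).encode("utf-8")
        # crc32 is an assumption used for a deterministic demo hash.
        partitions[zlib.crc32(key_bytes) % num_partitions].append(row)
    return partitions


def partitions_per_group(partitions, keys):
    """Map each distinct key combination to the partition indexes that hold it."""
    seen = defaultdict(set)
    for index, partition in enumerate(partitions):
        for row in partition:
            seen[tuple(row[k] for k in keys)].add(index)
    return seen


# Hypothetical sample data: 6 Key1 values, several Key2/Key3 combinations each.
rows = [
    {"Key1": k1, "Key2": k2, "Key3": k3, "Key4": 0}
    for k1, k2, k3 in product(range(6), "ab", range(2))
]

parts = hash_partition(rows, ["Key1"], num_partitions=4)

# Each Key1 value lives in exactly one partition (what the join needs) ...
join_ok = all(len(p) == 1 for p in partitions_per_group(parts, ["Key1"]).values())
# ... and so does each full aggregation group (what the aggregator needs).
agg_keys = ["Key1", "Key2", "Key3", "Key4"]
agg_ok = all(len(p) == 1 for p in partitions_per_group(parts, agg_keys).values())
print(join_ok, agg_ok)
```

The reverse does not hold in general: hashing on all four keys can scatter rows that share a Key1 value across partitions, which is what would break the join.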

As a change to your job, I would suggest including a Sort stage prior to the aggregator. It's better coding practice to use a distinct step rather than the link sorts, and you get better control over the sort if any problems arise later.

In answer to the 'sorted' option bit, I can only think this refers to using a Sort stage and specifying 'Don't Sort (Previously Sorted)' as the sort key mode for the keys in question.

Hope this is of help.