Which partition do I need to use?

css.raghu · Post by **css.raghu** » Thu Dec 23, 2010 4:19 am

I am not getting the target data in sorted order.
I have tried all the partition types, but to no avail.

The scenario is very simple, as follows.

ROW GENERATOR ----->SORT STAGE------->Data Set.

The source has only one column; its data type is integer.

Source: Row Generator
COLUMN1
0
1
2
3
4
5
6
7
8
9

Target
1
2
3
4
5
0
6
7
8
9

ray.wurlod · Post by **ray.wurlod** » Thu Dec 23, 2010 4:40 am

These are being sorted correctly, on two nodes. You will notice two sorted sub-lists. If you need to sort across the entire data, run the whole thing on one node or in sequential mode.
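To check the "two sorted sub-lists" reading against the Target listing, here is a minimal Python sketch. It assumes the two partitions split where the order breaks (after 5); the names `partition_0`, `partition_1` and `is_sorted` are made up for illustration and are not DataStage objects.

```python
# Assumed split of the Target listing above into its two partitions.
partition_0 = [1, 2, 3, 4, 5]
partition_1 = [0, 6, 7, 8, 9]


def is_sorted(rows):
    """Return True when every row is <= the row that follows it."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


# A Data Set view shows one partition's rows, then the next partition's rows.
viewed = partition_0 + partition_1

print(is_sorted(partition_0))  # True
print(is_sorted(partition_1))  # True
print(is_sorted(viewed))       # False
```

Each partition is in order on its own; only the block-by-block view of both partitions together is not.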

css.raghu · Post by **css.raghu** » Thu Dec 23, 2010 7:12 am

Yes,
we can achieve it by running in sequential mode or on a single node.

My job uses a two-node configuration.

Can you please let me know whether it is possible with any partition settings?

I feel we can, but I do not know how. With Entire partitioning I get the sorted order, but every row appears twice.

DSguru2B · Post by **DSguru2B** » Thu Dec 23, 2010 8:09 am

Use Entire partitioning and then remove duplicates. You will be doing twice the work, and then some more to negate the double work. Follow Ray's suggestion.
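A rough Python simulation of why Entire partitioning doubles the rows on two nodes, and what removing the duplicates costs. This is only an analogy under the assumption that Entire hands every node a full copy of the input; the variable names are hypothetical.

```python
rows = list(range(10))   # the ten generated rows, 0..9
num_nodes = 2            # two-node configuration, as in this thread

# Entire partitioning (simulated): each node receives a full copy of the input.
partitions = [list(rows) for _ in range(num_nodes)]

# Each node sorts its own copy.
sorted_partitions = [sorted(p) for p in partitions]

# Viewing the result partition by partition shows every row twice.
viewed = [row for part in sorted_partitions for row in part]
print(len(viewed))  # 20

# Removing duplicates afterwards restores ten rows, at the cost of extra work.
deduplicated = sorted(set(viewed))
print(deduplicated == rows)  # True
```

Every node sorts all ten rows, so the sort work scales with the number of nodes before the duplicate removal even starts.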

css.raghu · Post by **css.raghu** » Thu Dec 23, 2010 8:36 am

I just want to know whether it is possible or not using the partition settings,
other than Entire.
If it is not possible, please confirm.

chulett · Post by **chulett** » Thu Dec 23, 2010 8:39 am

The first half in one partition, the second in the other, just by 'partition settings'? No.

css.raghu · Post by **css.raghu** » Thu Dec 23, 2010 9:17 am

In that case we are losing the power of partitioning, right?

jwiles · Post by **jwiles** » Thu Dec 23, 2010 11:34 am

What is your desired outcome?

As Ray has mentioned, the data IS sorted correctly. You are running a two-node configuration, so your dataset by default contains two partitions. The Row Generator stage runs sequentially by default, but the data is partitioned going into the Sort stage, most likely with Hash on the sort key. The data is then sorted WITHIN the partitions, not across them.
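The partition-then-sort behaviour can be sketched in Python. Note the assumptions: `partition_rows` is a hypothetical helper, the modulus is only a stand-in for whatever hash function the engine really uses (so the split differs from the one in the original post), and the sample input is invented.

```python
def partition_rows(rows, num_partitions):
    """Route each row to a partition by key (modulus as a stand-in for Hash)."""
    partitions = [[] for _ in range(num_partitions)]
    for key in rows:
        partitions[key % num_partitions].append(key)
    return partitions


rows = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]  # hypothetical unsorted input

# Partition first, then sort each partition independently of the other.
partitions = partition_rows(rows, 2)
sorted_partitions = [sorted(p) for p in partitions]

print(sorted_partitions)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

Neither partition knows about the other's rows, which is why no ordering exists across them.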

When you view a partitioned dataset, you will typically see a block of records from one partition, then a block from another partition, and so on. The view (and peek in a DS job if running parallel) will not mingle the records together (they have no concept of how the data is supposed to be ordered). This is why your output appears as it does...it is showing you the records in one partition then the records in the other partition.

If you strictly want to see the data in order (0 1 2 3 4 5 6 7 8 9), either write the dataset in sequential mode or write to a sequential file, in either case using a sort collection on the input link. If you will be performing other logic behind this, rest assured that the data IS sorted correctly within each partition (you might think of each partition as an independent stream).

If you're concerned with exactly which rows go to which partition, read up on the various partitioning options and how to implement them in the IS manuals. You'll likely find that, for just viewing the data, it's not worth the extra effort that some of the partition types would require.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 23, 2010 3:38 pm

For large volumes of data you can use two adjacent Sort stages: one that sorts in parallel, and a second that executes sequentially, uses a Sort Merge collector and does not otherwise sort at all.
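As an analogy for the collection step, a sorted merge of already-sorted partitions can be sketched with Python's `heapq.merge`. This is not DataStage's implementation; the partition contents are taken from the Target listing in the first post, assuming the split falls after 5.

```python
import heapq

# Output of the parallel sort: each partition is already in order.
sorted_partitions = [
    [1, 2, 3, 4, 5],
    [0, 6, 7, 8, 9],
]

# Sequential step: repeatedly take the smallest head row across partitions.
# No partition is re-sorted; the rows are only interleaved.
collected = list(heapq.merge(*sorted_partitions))

print(collected)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The merge only compares the current head row of each partition, so the expensive sorting still happens in parallel and the sequential step stays cheap.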