Need to maintain sort order

pdntsap · Post by **pdntsap** » Wed Apr 25, 2012 9:29 am

Hello,

We have a requirement where we need to find the first record and last record in a group based on a certain number of key columns. We have a column generator stage to generate a column with a unique value for each row. We feed the data into two sorters in parallel. One of the sorters sorts the data in ascending order based on the key columns and the other sorter sorts the data in descending order based on the key columns. We then feed each sort output to a transformer stage where we set a flag to indicate the first and last record in each group.We use the common column to join the data out of the two transformers. We then have a transformer stage to filter some records from each group based on business requirements. My question is: how do I maintain the sort order and partition after the sorter stages? My results are right when I make the the two Transformer , Joiner and Transforme stages as sequential. But, when the two Transformer , Joiner and Transformer stages are made parallel, the results are not right. Any help is greatly appreciated.

Thanks.

jwiles · Post by **jwiles** » Wed Apr 25, 2012 12:57 pm

(sort and join stages, not sorter and joiner stages...different product)

How have you partitioned the data? The fact that it is not working correctly when you run in more than one partition indicates that you are not partitioning correctly. Are you joining on ONLY the "common column", or the key columns AND the common column?

Your job sounds overly complex. You should be able to identify first and last records in a group without resorting to a "fork-join" design as you have. If you are using IS 8.5 or 8.7, it should be very simple to do within a single sort and transformer by using LastRowInGroup() to identify the last records.

Regards,

pdntsap · Post by **pdntsap** » Wed Apr 25, 2012 1:38 pm

jwiles wrote:(sort and join stages, not sorter and joiner stages...different product)

Yes you are right.

How have you partitioned the data?

All the stages have partition set to Auto and I am not explicitly setting partition in any stage.

Are you joining on ONLY the "common column", or the key columns AND the common column?

Joining only on the common column. The two sort stages are sorting bassed on about 70 key columns.

Your job sounds overly complex.

I believe my requirements are quite complex. We have about 80 columns and we need to group the records based on different number of columns and identify the first and last record in each group. So, the two sorts sort on 70 columns in ascending and descending order and the transformer after the sort identifies the first and last record in each group. We have groups based on the first 5 columns, then the first 10 columns and so on and the last group is based on the first 70 columns.

Thanks.

npsandeep · Post by **npsandeep** » Wed Apr 25, 2012 1:57 pm

Dont sort in parallel or Keep everything in sequential if the # of records is less than 1m. Just designate ids for all the records and get the first and last records as per the requirements. you can use src stage, transformer, agg (find min & max), move those 2 records to the tgt stg.

I hope it helps.

jwiles · Post by **jwiles** » Wed Apr 25, 2012 2:31 pm

the two sorts sort on 70 columns

As you have partitioning set to Auto, it is likely partitioning on all 70 columns. If I understand your description correctly (grouping on first 5, first 10, and so on to first 70), auto partitioning will not work for your logic in the transformers, which is why your results are incorrect when you run the job in parallel.

If all of your groups have the first 5 key columns in common (always the same first 5), set your partitioning to Hash (on the two sorts) and select only those 5 columns as your partition keys. Set partitioning to Same on the two transformers. You should be ok to leave partitioning set to Auto elsewhere. Try running the job in parallel after this.

Worst case will be to run the job in a single partition, which you already know is working.

Regards,

pdntsap · Post by **pdntsap** » Wed Apr 25, 2012 3:25 pm

If all of your groups have the first 5 key columns in common (always the same first 5), set your partitioning to Hash (on the two sorts) and select only those 5 columns as your partition keys. Set partitioning to Same on the two transformers. You should be ok to leave partitioning set to Auto elsewhere. Try running the job in parallel after this.

I will try this method and thanks for the suggestion.

Worst case will be to run the job in a single partition, which you already know is working.

Dont sort in parallel or Keep everything in sequential if the # of records is less than 1m

We have to deal with millions of records and I hope I can make the solution suggested above work and run the whole job in parallel.