Sorting and choosing first record

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

pdntsap
Premium Member
Posts: 107
Joined: Mon Jul 04, 2011 5:38 pm

Sorting and choosing first record

Post by pdntsap »

Hello,

We need to sort a file on, say, n columns, and then select the first record from each group formed on n-2 of those columns. So we initially sort the file on 20 columns (using a Sort stage), and then use either another Sort stage or a Remove Duplicates stage to select the first record from each group formed on 18 key columns. I am looking for any other approaches, as using two Sort stages, or a Sort stage and a Remove Duplicates stage, might not be recommended from a performance viewpoint.
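For illustration, the intended result can be sketched outside DataStage in plain Python (a minimal sketch; 3 columns stand in for the 20, and the first 2 stand in for the 18 grouping columns):

```python
from itertools import groupby
from operator import itemgetter

# Illustration only (not DataStage syntax): small stand-in dataset.
rows = [
    ("a", "x", 3), ("a", "x", 1), ("a", "y", 2), ("b", "x", 5),
]

sort_keys = itemgetter(0, 1, 2)   # all n sort columns
group_keys = itemgetter(0, 1)     # the n-2 grouping columns

rows.sort(key=sort_keys)          # the first Sort stage
# Remove Duplicates (retain First): keep the first record per group.
firsts = [next(grp) for _, grp in groupby(rows, key=group_keys)]
```

After the sort, the first record of each `group_keys` group is exactly the record with the lowest values in the remaining sort columns, which is what the Sort + Remove Duplicates pattern delivers.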

Thanks for the help.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Actually, what you describe using Sort and Remove Duplicates is probably the most efficient method to use. However, I would make the following recommendation:

Partition your data on input to sort on at MOST the n-2 columns. This will ensure that your data is grouped together correctly for the Remove Duplicates stage.

Sort followed by Sort is also acceptable provided that you use the Don't Sort, Previously Sorted option on the n-2 columns. Using a transformer is another method, but you would need to specifically code the logic to recognize the change in value of your n-2 columns. This can be tedious work for 18 columns!
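The transformer approach amounts to comparing each record's n-2 key columns against the previous record's and emitting only on a change. A rough Python sketch of that logic (assuming the input is already fully sorted; 2 grouping columns stand in for the 18):

```python
# Transformer-style logic: emit a record only when the n-2 grouping
# columns change from the previous record. Assumes the input is
# already sorted on all n keys.
def first_per_group(sorted_rows, num_group_cols):
    prev_key = object()            # sentinel that never equals a real key
    for row in sorted_rows:
        key = row[:num_group_cols]
        if key != prev_key:        # first record of a new group
            yield row
        prev_key = key

rows = [("a", "x", 1), ("a", "x", 3), ("a", "y", 2)]
result = list(first_per_group(rows, 2))
```

In a real transformer you would hold the previous key values in stage variables; with 18 columns that is 18 comparisons per record, which is the tedium referred to above.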

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.

Post by pdntsap »

Thanks James.

We had a Sort stage that sorted the data on 20 columns. We then used a second Sort stage immediately downstream of the first, sorting on the first 18 of those 20 columns with the Allow Duplicates option set to False. We get the first record from each group (based on the 18 key columns) after the second Sort stage, but the values of the 19th and 20th columns suggest that the output is not the first record of each group formed by the first Sort stage. We set Preserve Partitioning = Clear in both Sort stages and still faced the same issue.

Also, on the second Sort stage, when we change the Sort Key Mode to Don't Sort, Previously Sorted (instead of Sort), we get a fatal error of "Record not sorted at input: record number". But we have already sorted on 20 columns in the first Sort stage.

We then replaced the second Sort stage with a Remove Duplicates stage and retained the first record from each group (based on the 18 key columns). This caused duplicate records to appear in the output, and the record count was greater than what we were supposed to get. (The record count was right when we used two Sort stages.)

Please let us know what we might be doing wrong.

Thanks.

Post by jwiles »

So, what columns did you partition your data on for the first sort? All 20 or the first 18 or less? It needs to be the first 18 or less (see my earlier post).

Have you disabled sort insertion and partition insertion for the job (APT_NO_SORT_INSERTION=1 and APT_NO_PART_INSERTION=1)?

Set the partitioning option on the second stage (Sort or RemDup) to SAME.

Make sure that the defined sort order (ascending/descending) is the same for each of the common keys between the first sort and the second sort/remdup stage. The order of the keys (a,b,c,d,e,f) should be the same as well up to key 18.
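The partitioning point in this checklist can be illustrated with a small Python sketch (a hypothetical hash partitioner, 2 grouping columns standing in for the 18): records that agree on the grouping columns but differ on later sort columns can land on different partitions if you hash on all the columns, so each partition keeps its own "first" record and duplicates survive.

```python
# Hypothetical hash partitioner: assigns a row to one of nparts
# partitions based on its first key_cols columns.
def partition_of(row, key_cols, nparts):
    return hash(row[:key_cols]) % nparts

# Two records with identical grouping columns but different
# sort-only columns (stand-ins for columns 19-20 in the thread).
r1 = ("a", "x", 1)
r2 = ("a", "x", 2)

# Hashing on only the grouping columns always co-locates the pair,
# so Remove Duplicates sees the whole group together:
assert partition_of(r1, 2, 4) == partition_of(r2, 2, 4)
# Hashing on all 3 columns may split them across partitions
# (it depends on the hash values), which is the failure mode
# described in the checklist above.
```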

Regards,
- james wiles



Post by pdntsap »

jwiles wrote: So, what columns did you partition your data on for the first sort?
Partitioning on the first sort was set to Auto.
jwiles wrote: Have you disabled sort insertion and partition insertion for the job (APT_NO_SORT_INSERTION=1 and APT_NO_PART_INSERTION=1)?
No. I will set these parameters.
jwiles wrote: Set the partitioning option on the second stage (Sort or RemDup) to SAME.
I will do this also. It was set to Auto.
jwiles wrote: Make sure that the defined sort order (ascending/descending) is the same for each of the common keys between the first sort and the second sort/remdup stage.
Yes, the sort order is the same in both stages.

I changed all the stages to run sequentially, which produced the right results. This, of course, defeats the purpose of using a parallel job. I need to take care of partitioning and sorting in a parallel job.
jwiles wrote: Partition your data on input to sort on at MOST the n-2 columns.
To do this, should I choose Hash partition on the 18 or less columns in the Input tab of the first Sort stage?


Thanks.

Post by jwiles »

pdntsap wrote: To do this, should I choose Hash partition on the 18 or less columns in the Input tab of the first sort stage?
Yes...you are partitioning the data in a parallel processing environment in order to meet the requirements of your business rules: Remove Duplicates while keeping a predetermined preferred record.
pdntsap wrote: I changed all the stages to run in sequential which produced the right results
Naturally! One partition instead of several, so all of your records for a given key combination are automatically in the same partition (the goal of partitioning to meet these business rules). By choosing the proper partitioning strategy in your job, you ensure that this is met when running in parallel.

Auto partitioning works well in many cases, but more complex rules such as this require that you specify the partitioning...the engine can't always read your mind :)

Regards,
- james wiles



Post by pdntsap »

Thanks James.

I have made the changes suggested and will provide an update.

So, one method of preserving partitioning is to hash partition on the required key columns, set Partitioning in subsequent stages to Same, and also set the Preserve Partitioning option (under the Stage tab) to Set. Am I right?

Thanks.
Last edited by pdntsap on Tue Dec 13, 2011 7:39 am, edited 1 time in total.
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

pdntsap wrote:To do this, should I choose Hash partition on the 18 or less columns in the Input tab of the first sort stage?
Hash partitioning is just an attempt to evenly spread data across all partitions AND ensure that processes that need to operate on data have all of the data that they need on the same partition.

You should be able to partition on far fewer than 18 columns. Pick a column that the two sorts share in common and that has many distinct values (high cardinality), and partition on that one column. You need to sort on all 18, but you do not need to partition on all 18. The goal in partitioning is to spread the data evenly and repartition as few times as possible. When you choose all of the columns in a Sort stage, you are more likely to have to repartition at a later point in the job.
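This point can be sketched in Python (hypothetical data, not DataStage): hashing on a single high-cardinality leading column spreads rows across partitions while still co-locating every group that shares that column's value.

```python
from collections import Counter

# Hypothetical data: 1000 rows, 50 distinct values in the leading
# key column, 4 partitions.
rows = [(i % 50, "other", "columns") for i in range(1000)]
nparts = 4

# Hash-partition on the single leading column only.
counts = Counter(hash(row[0]) % nparts for row in rows)

# Every row sharing a leading-key value lands on one partition, so
# any group defined by more columns (a subset of that value's rows)
# is never split; counts shows how the 50 values spread over 4.
```

Note jwiles' caveat in the next post: the partitioning column must be the leading sort column, not an arbitrary shared column, for the prefix groups to stay intact.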

Post by jwiles »

pdntsap: That is one method to preserve the partitioning, yes.

Regarding selecting fewer columns to partition with than you are sorting with: Don't pick just any columns...you need to start with the primary sort column (the first column), in the same manner as you are sorting on 20 but deduping on only the first 18.

Regards,
- james wiles



Post by pdntsap »

Two Sort Stages were implemented. The first sort stage used 20 columns and the input to this Sort Stage was partitioned on the first 15 columns of the 20 columns. The partition was preserved for the second Sort Stage where the first 18 columns (out of 20) were used as the Sort keys. The output seems to be right now. Thanks for the help.

What are the other methods for preserving partitioning across stages?

Now the next step will be performance improvement (using a Remove Duplicates stage in place of the second Sort). Also playing around with the partition keys, I guess.

Thanks.

Post by jwiles »

On your second Sort, are you using the "Don't Sort, Previously Sorted" option on the 18 keys? If not, then do so as suggested previously to help the performance of back-to-back Sorts (no need to re-sort data that's already sorted!).

Let's see - the three methods:

1) SAME partitioning
2) Set preserve partitioning (not Propagate)
3) $APT_NO_PART_INSERTION=1

Regards,
- james wiles

