Sort Stage clarification

kennyapril · Post by **kennyapril** » Wed Aug 17, 2011 2:35 pm

In an input of 10 fields the ID and updatedate are two fields which have been already sorted and also removed duplicates having same ID and updatedate.
After they are sorted the records are passed to another sort stage and there I sorted all the fields except the date field i.e 9 fields but used option do not sort previously sorted for ID field.
In all the other fields I used sort option with ascending order. What does this do?

The requirement I need is in a consecutive order the records should be compared to the next one and if any change is there in the records then let it be or else ignore or drop it as duplicate.(used allow duplicates false)
It should form a group for all the 9 fields and check one after the other record.

So does the sort order used for all the fields make this or do I need to change something in the logic?

Thanks

jwiles · Post by **jwiles** » Wed Aug 17, 2011 3:19 pm

At a simple level: It will read a group of records by ID value, sort the records belonging to that ID, write out that group then go to the next one. Requires, as you have done, previously sorting the data on ID.

If I understand your second description and question correctly, you're basically needing to deduplicate records based on the on the 9 fields, which sort itself can do as shown with your first sort. Sort will keep the first record of a group of duplicates. If you want to keep the last, you can either sort the columns in descending order OR use the remove duplicates stage instead, select the same nine columns as the keys and select Keep Last.

Regards,

jwiles · Post by **jwiles** » Wed Aug 17, 2011 7:56 pm

The output of the second sort is will be 123,X | 123,Y. All of the 123,X rows will be sorted together and deduped, assuming you are not keeping duplicates. Date is not longer considered because you have not included it as a key. You can use the stable sort option to at least keep the order of like-keyed rows the same, but those rows will be grouped together.

If you want to keep the order you have after the first sort (which includes the date), use a transformer after the first sort to check for the changes in data. You will need to include logic to detect the change in keys (ID).

Regards,

kennyapril · Post by **kennyapril** » Thu Aug 18, 2011 8:21 am

So instead of using second sort stage I need to use transformer to find out the changes in the fields of each key till the second group occurs with different key, something like
svprev--svcurr
svcurr-
svvalue--If svprev=svcurr then 0 else 1
So when value is 1 then I will pass the records or else no passing of records. In the derivation of svcurr how can I give all the fields which are to be compared with next record.
I cannot compare only with ID as ID is same for each person with several changes.

Please suggest any idea about the svcurr derivation

Thanks

jwiles · Post by **jwiles** » Thu Aug 18, 2011 9:02 am

I would suggest using a separate set of variables to determine whether or not there was a change in data. Keep your key changes (ID) separate from your data changes.

svNewID: Your key change flag: 1 if ID has changed, 0 if not

For each of the 9 columns a pair of variables, something like this:

svCol1Change: if svNewID = 1 then 0 else if input.col1 = svPrevCol1 then 0 else 1
svPrevCol1: input.col1

After checking all the columns:

svDataChange: svCol1Change + svCol2Change + ..... + svCol9Change

If the value of svDataChange is non-zero, your data has changed from one record to the next within the ID. For the first record of a new ID, svDataChange will be 0 (i.e. no change).

A constraint for this might be:

svNewID = 1 or svDataChange = 1

Which would keep the first record of an ID and any changed records (changes from one record to the next in this example).

If you want to flag changes against the first record for an ID (any record that doesn't match the original/earliest), then update the svPrevCol# variables only when the key changes. Adjust when you update these variables to meet the particular rules you need to implement.

Regards,

kennyapril · Post by **kennyapril** » Thu Aug 18, 2011 12:20 pm

Thanks Jwiles that worked.

Instead of using different number of stage variable I tried using

Svcurr- col1:col2:col3:col4:col4:col5

i.e concatenating the fields in an order, Is this change OK to use.

Thanks

jwiles · Post by **jwiles** » Thu Aug 18, 2011 2:41 pm

If the results meet your requirements then it is ok to use. The method I suggested is just one of several ways to accomplish it.

Regards,

chulett · Post by **chulett** » Thu Aug 18, 2011 2:59 pm

Hopefully there are no nulls in any of the values you are concatenating together...