Hi
Requirement: I need to get the latest record for each key, based on a timestamp field.
Scenario 1: I used a Remove Duplicates stage for this.
On the input link of this stage I hash partitioned on the key column and the timestamp field and selected the "Perform sort" option.
In the Remove Duplicates stage properties the key is the key column only [excluding the timestamp field]. But the output has duplicate rows for the key.
When I checked, I found that the records for the same key are not in the same partition. So hash partitioning does not seem to be working correctly. Is anything
wrong in the logic?
Scenario 2: If I use a Sort stage, with hash partitioning on the key column in the input link and a sort on the key column and timestamp field in the stage properties,
and then use a Remove Duplicates stage on the key column, the output has no duplicate records. It is working correctly, but
could anyone please tell me why the hash partitioning is not working in scenario 1? I thought it was due to duplicates but even
hash partitioning problem in remove duplicate stage
-
- Participant
- Posts: 9
- Joined: Sun Oct 19, 2008 7:09 am
I believe that in scenario 1 it is working exactly as you have coded it. You are partitioning on key and timestamp which will not produce the same data distribution as simply partitioning on the key as in your second scenario.
Mike Hester
mhester@petra-ps.com
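To illustrate the point in the reply above, here is a minimal Python sketch (not DataStage code) that simulates hash partitioning followed by a per-partition sort and remove-duplicates. Everything in it is an assumption made for illustration: the function names, the CRC32-based hash (DataStage's real hash function is different), the 4 partitions, and the sample rows.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed number of partitions, for illustration only


def partition_of(values, num_partitions=NUM_PARTITIONS):
    # Deterministic stand-in for a hash partitioner (NOT DataStage's real hash):
    # identical column values always map to the same partition.
    text = "|".join(str(v) for v in values)
    return zlib.crc32(text.encode("utf-8")) % num_partitions


def hash_partition(rows, hash_cols):
    # Distribute rows across partitions using only the chosen hash columns.
    parts = defaultdict(list)
    for row in rows:
        parts[partition_of([row[c] for c in hash_cols])].append(row)
    return parts


def sort_and_remove_duplicates(parts, key_col, ts_col):
    # Each partition is processed on its own, with no view of the others:
    # sort by key and timestamp, then retain the last (latest) row per key.
    result = []
    for rows in parts.values():
        latest = {}
        for row in sorted(rows, key=lambda r: (r[key_col], r[ts_col])):
            latest[row[key_col]] = row
        result.extend(latest.values())
    return result


# Made-up sample data: two versions of K1, one version of K2.
rows = [
    {"key": "K1", "ts": "2024-01-01 10:00:00"},
    {"key": "K1", "ts": "2024-01-01 10:00:01"},
    {"key": "K2", "ts": "2024-01-01 10:00:00"},
]

# Scenario 1: hash on key + timestamp, dedupe on key.
scenario_1 = sort_and_remove_duplicates(hash_partition(rows, ["key", "ts"]), "key", "ts")
# Scenario 2: hash on key only, sort on key + timestamp, dedupe on key.
scenario_2 = sort_and_remove_duplicates(hash_partition(rows, ["key"]), "key", "ts")

print("hash on key + ts:", sorted(r["ts"] for r in scenario_1 if r["key"] == "K1"))
print("hash on key only:", sorted(r["ts"] for r in scenario_2 if r["key"] == "K1"))
```

In this sketch the hash value depends on every hashed column, so two K1 rows with different timestamps can land in different partitions. Each partition then keeps its own "latest" K1 row, and K1 shows up more than once in the combined output. Hashing on the key alone keeps all K1 rows in one partition, so only the row with the highest timestamp survives.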
-
- Participant
- Posts: 342
- Joined: Tue Nov 04, 2008 10:38 am
- Location: Chennai, India