transformation problem
Well, I am given a job where I have a source in a sequential file, and it has lots of duplicate records.
Please suggest a transformation so that the unique records are stored in the target and the duplicate records in a reject file.
It would be of great help.
Welcome aboard
To remove duplicates, you can just pass the data through a hashed file with the keys defined. But if you want to capture the duplicates, sort your input by the key, compare the current record with the previous one, and pass it to the target only if they differ; otherwise send it down another link.
Basically, use two stage variables:
Code:
var2 : if in.key <> var1 or @INROWNUM = 1 then 'PASS' else 'REJ'
var1 : in.key
Note that stage variables are evaluated top to bottom, so var2 must be derived before var1 is updated to the current key.
In your two output links, give the one going to the target database the constraint var2 = 'PASS', and give the link where you want to capture the duplicates the constraint var2 = 'REJ'.
That should do the trick.
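The stage-variable logic can be sketched in Python (a rough illustration, not DataStage code; the function and field names are made up): assuming the input is already sorted by the key, compare each row's key with the previous one and route the row down the pass or reject link accordingly.

```python
def split_duplicates(sorted_rows, key):
    """Route rows from a key-sorted stream: the first occurrence of each
    key goes to the 'pass' link, repeats go to the 'reject' link."""
    passed, rejected = [], []
    prev_key = object()  # sentinel: matches nothing on the first row
    for row in sorted_rows:
        if row[key] != prev_key:   # like: if in.key <> var1 then 'PASS'
            passed.append(row)
        else:
            rejected.append(row)
        prev_key = row[key]        # like: var1 : in.key (updated after the test)
    return passed, rejected

rows = [{"id": 1}, {"id": 1}, {"id": 2}, {"id": 3}, {"id": 3}]
uniq, dups = split_duplicates(rows, "id")
# uniq holds ids 1, 2, 3; dups holds the repeated 1 and 3
```

As in the stage-variable version, the comparison must happen before the "previous key" is updated, which is why `prev_key` is assigned after the test.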
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It's not all that different from what you posted.
Write to a hashed file keyed to remove duplicates in your data. Create a reference link to that same hashed file so that you are writing to and looking up from the same hashed file. Make sure the reference stage has the 'Preload' option either set to 'Disabled' or 'Enabled, locked for updates' so that writes to the hashed file become immediately available for lookup.
Any hits on the hashed file are duplicates and can be sent down an appropriate 'capture' link. The ending hashed file can be sourced for a duplicate-free version of the data.
One nice thing about the approach is that presorting is not necessarily required. You'd only have to do so if you wanted control over the first record written to the hashed file, but if they are truly duplicates it shouldn't matter.
That's what I assume Ray meant, anywho...
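A rough Python sketch of the write-and-lookup-on-the-same-file idea (illustrative only; a dict stands in for the hashed file, and the names are made up): because each write is immediately visible to the lookup, a lookup hit means the key was already seen, so the row is a duplicate.

```python
def dedupe_with_lookup(rows, key):
    """Simulate writing to and looking up from the same hashed file:
    a hit on the lookup means the key was already written -> duplicate."""
    hashed = {}    # stands in for the hashed file, keyed on `key`
    captured = []  # the duplicate 'capture' link
    for row in rows:
        k = row[key]
        if k in hashed:        # lookup hit: we already wrote this key
            captured.append(row)
        else:
            hashed[k] = row    # write is immediately available for lookup
    return list(hashed.values()), captured

rows = [{"k": "a"}, {"k": "b"}, {"k": "a"}]
kept, caps = dedupe_with_lookup(rows, "k")
# kept holds keys "a" and "b"; caps holds the second "a"
```

Note that, as described above, no presorting is needed: the first record with each key is the one kept, whatever order the input arrives in.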
-craig
"You can never have too many knives" -- Logan Nine Fingers
That's really interesting. I'd never thought of that.
I think if a data set contains millions of records, and duplicates are to be removed and captured, the hashed file method would be a much better approach, keeping performance in mind. As Craig mentioned, it also eliminates the need to sort the data.
Very well. Thanks, guys.
Hi Ray,
Are there any known limitations of this method:
- number of rows,
- number of keys,
etc.?
I'm using this technique frequently to find (logical) duplicates in the database.
ray.wurlod wrote:Another technique is to use a hashed file updated in real time. This technique is taught on IBM's Advanced DataStage (Server) class.
With a 64-bit hashed file it's of the order of the smaller of 2 PB (two thousand TB) and the largest file your operating system will permit. Best performance will be had by keeping the hashed files as small as possible; for example, storing the key values and a row number only. It depends on what you actually need to get. You might also consider pre-sizing your hashed file (use the Hashed File Calculator to get a value for MINIMUM.MODULUS).
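The "keys and row number only" advice can be sketched in Python (illustrative only; names are made up): instead of storing whole rows in the lookup structure, hold just the key and the row number of its first occurrence, which keeps the structure as small as possible.

```python
def find_duplicate_keys(rows, key):
    """Hold only key -> first row number in the lookup structure,
    mirroring the advice to store the key values and row number only."""
    seen = {}        # minimal payload: key -> row number of first occurrence
    duplicates = []  # (row number, key) of each duplicate encountered
    for rownum, row in enumerate(rows, start=1):
        k = row[key]
        if k in seen:
            duplicates.append((rownum, k))
        else:
            seen[k] = rownum
    return duplicates

# rows 1 and 2 share id 5, so row 2 is reported as the duplicate
dups = find_duplicate_keys([{"id": 5}, {"id": 5}, {"id": 6}], "id")
```

The row number is enough to go back to the source for the full record later, so the lookup structure never has to carry the whole row.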
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.