
transformation problem

Posted: Fri May 26, 2006 11:17 am
by ganges99
Well, I have been given a job where the source is a sequential file and it has lots of duplicate records.
Please suggest a transformation so that the unique records are stored in the target and the duplicate records go to a reject file.
It would be of great help.

Posted: Fri May 26, 2006 11:24 am
by DSguru2B
Welcome aboard :)
To remove duplicates, you can just pass the data through a hashed file with the keys defined. But if you want to capture the duplicates as well, sort your input by the key, compare the current record with the previous one and, only if the keys differ, pass it to the target; otherwise send it down another link.
Basically, use two stage variables. Stage variables are evaluated top to bottom, so the comparison (var2) must sit above var1, which saves the current key for the next row.

Code:

var2 : if in.key <> var1 or @INROWNUM = 1 then 'PASS' else 'REJ'
var1 : in.key
In your two output links, set the constraint on the link going to the target database to var2 = 'PASS', and set the constraint on the link that captures the duplicates to var2 = 'REJ'.
That should do the trick.
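If it helps to see the logic outside DataStage, here is a minimal Python sketch of the same compare-with-previous routing. It assumes the rows arrive already sorted by the key; the field name "key" and the PASS/REJ labels are just illustrative.

Code:

# Minimal sketch of the compare-with-previous dedup, assuming the input
# rows arrive already sorted by the key field.
def route_rows(sorted_rows, key_field="key"):
    """Yield (row, 'PASS' or 'REJ') for rows sorted by key_field."""
    prev_key = object()                  # sentinel: never equal to a real key
    for row in sorted_rows:
        if row[key_field] != prev_key:
            yield row, "PASS"            # first occurrence goes to the target
        else:
            yield row, "REJ"             # duplicate key goes to the reject link
        prev_key = row[key_field]        # remember the key for the next row

rows = [{"key": 1, "v": "a"}, {"key": 1, "v": "b"}, {"key": 2, "v": "c"}]
for row, link in route_rows(rows):
    print(link, row)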

Posted: Fri May 26, 2006 4:57 pm
by ray.wurlod
Another technique is to use a hashed file updated in real time. This technique is taught in IBM's Advanced DataStage (Server) class.

Posted: Sat May 27, 2006 5:22 pm
by DSguru2B
Ray, whenever you get a chance, could you please elaborate on the methodology you outlined? How exactly can you design a job, using a hashed file, to capture duplicates?
Thanks, Ray.

Posted: Sat May 27, 2006 5:56 pm
by chulett
It's not all that different from what you posted.

Write to a hashed file keyed to remove duplicates in your data. Create a reference link to that same hashed file so that you are writing to and looking up from the same hashed file. Make sure the reference stage has the 'Preload' option either set to 'Disabled' or 'Enabled, locked for updates' so that writes to the hashed file become immediately available for lookup.

Any hits on the hashed file are duplicates and can be sent down an appropriate 'capture' link. The ending hashed file can be sourced for a duplicate free version of the data.

One nice thing about the approach is that presorting is not necessarily required. You'd only have to do so if you wanted control over the first record written to the hashed file, but if they are truly duplicates it shouldn't matter.
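If it helps to see the idea outside DataStage, here is a rough Python sketch of that write-and-look-up-in-the-same-pass logic. The dict stands in for the hashed file, and the field name "key" is just illustrative.

Code:

# Rough sketch of the hashed-file approach: a dict plays the role of the
# hashed file keyed on the dedup columns, so no pre-sort is needed.
def capture_duplicates(rows, key_field="key"):
    seen = {}                       # stands in for the hashed file
    target, rejects = [], []
    for row in rows:
        k = row[key_field]
        if k in seen:               # lookup hit: already written, so a duplicate
            rejects.append(row)
        else:
            seen[k] = row           # the "write" is immediately visible to later lookups
            target.append(row)
    # seen.values() is the duplicate-free version of the data,
    # like sourcing the ending hashed file.
    return target, rejects

The dict behaving this way mirrors the preload setting Craig describes: each write is visible to every subsequent lookup in the same pass.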

That's what I assume Ray meant, anywho... :wink:

Posted: Sun May 28, 2006 1:31 am
by ray.wurlod
That's exactly the method I alluded to, and quite clearly described. You should be able to design it from that specification.

Posted: Sun May 28, 2006 2:30 am
by DSguru2B
That's really interesting. Never thought of that :?
I think if a data set contains millions of records, and duplicates are to be removed and captured, the hashed file method would be a much better approach, keeping performance in mind. As Craig mentioned, it also eliminates the need to sort the data.
Very well, thanks guys.

Posted: Sun May 28, 2006 3:18 am
by mdan
hi Ray,
any known limitations of this method:
- number of rows,
- number of keys,
etc.?

I'm using this technique frequently to find (logical) duplicates in the database.
ray.wurlod wrote:Another technique is to use a hashed file updated in real time. This technique is taught in IBM's Advanced DataStage (Server) class.

Posted: Sun May 28, 2006 9:32 am
by DSguru2B
Just make sure your data doesn't exceed 2.2 GB. If you think it will, create the hashed file as a 64-bit file. That's the only limitation I can think of. I am sure Ray can think of some more.

Posted: Sun May 28, 2006 3:14 pm
by ray.wurlod
With a 64-bit hashed file, the limit is the smaller of roughly 2 PB (about two thousand TB) and the largest file your operating system will permit. Best performance will be had by keeping the hashed file as small as possible, for example by storing the key values and row number only; it depends on what you actually need to retrieve. You might also consider pre-sizing your hashed file (use the Hashed File Calculator to get a value for MINIMUM.MODULUS).
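To illustrate the point about keeping the structure small (in Python rather than DataStage), the lookup only needs the key value, and perhaps the row number of its first occurrence, not the whole row. The field name "key" is assumed for the example.

Code:

# Sketch of a lean lookup structure: keep only the key value and the row
# number of its first occurrence, rather than whole rows.
def first_occurrences(rows, key_field="key"):
    first_seen = {}                              # key value -> row number only
    for rownum, row in enumerate(rows, start=1):
        first_seen.setdefault(row[key_field], rownum)
    return first_seen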