transformation problem

Post questions here relating to DataStage Server Edition, in areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ganges99
Participant
Posts: 14
Joined: Wed Oct 12, 2005 12:47 pm

transformation problem

Post by ganges99 »

Well, I have been given a job where the source is a sequential file and it has lots of duplicate records.
Please suggest a transformation so that the unique records are stored in the target and the duplicate records go to a reject file.
It would be of great help.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Welcome aboard :)
To remove duplicates, you can just pass the data through a hashed file with the keys defined. But if you want to capture the duplicates, sort your input by the key, compare the current record's key with the previous one, and pass the record to the target only if they differ; otherwise send it down another link.
Basically, use two stage variables (in this order, since stage variables are evaluated top to bottom):

Code:

* Derive var2 first (var1 still holds the previous row's key), then update var1.
var2 : if in.key <> var1 or @INROWNUM = 1 then 'PASS' else 'REJ'
var1 : in.key
In your two output links, give the link going to the target database the constraint var2 = 'PASS', and give the other link, where you want to capture the duplicates, the constraint var2 = 'REJ'.
That should do the trick.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Another technique is to use a hashed file updated in real time. This technique is taught on IBM's Advanced DataStage (Server) class.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Ray, whenever you get a chance, could you please elaborate on the methodology you outlined? How exactly can you design a job, using a hashed file, to capture duplicates?
Thanks, Ray.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

It's not all that different from what you posted.

Write to a hashed file keyed to remove duplicates in your data. Create a reference link to that same hashed file so that you are writing to and looking up from the same hashed file. Make sure the reference stage has the 'Preload' option either set to 'Disabled' or 'Enabled, locked for updates' so that writes to the hashed file become immediately available for lookup.

Any hits on the hashed file are duplicates and can be sent down an appropriate 'capture' link. The ending hashed file can then be sourced for a duplicate-free version of the data.

One nice thing about the approach is that presorting is not necessarily required. You'd only have to do so if you wanted control over the first record written to the hashed file, but if they are truly duplicates it shouldn't matter.
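
A minimal sketch of the Transformer constraints this design implies, assuming the reference link from the hashed file is named LkpDups (that link name is illustrative, not from an actual job):

Code:

* Constraint on the link to the target - the key was not found in the
* hashed file, so this is its first occurrence:
LkpDups.NOTFOUND

* Constraint on the duplicate capture link - the key is already there:
Not(LkpDups.NOTFOUND)

Every row also travels down the link that writes its key back to the same hashed file, so later rows with the same key find a match on the lookup.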

That's what I assume Ray meant, anywho... :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's exactly the method I alluded to, and quite clearly described. You should be able to design it from that specification.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

That's really interesting. Never thought of that. :?
I think if a data set contains millions of records and the duplicates have to be removed and captured, the hashed file method would be the better approach performance-wise. As Craig mentioned, it also eliminates the need to sort the data.
Very well, thanks guys.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
mdan
Charter Member
Charter Member
Posts: 46
Joined: Mon Apr 28, 2003 4:21 am
Location: Brussels
Contact:

Post by mdan »

Hi Ray,
Are there any known limitations of this method:
- number of rows,
- number of keys,
etc.?

I'm using this technique frequently to find (logical) duplicates in the database.
ray.wurlod wrote:Another technique is to use a hashed file updated in real time. This technique is taught on IBM's Advanced DataStage (Server) class.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Just make sure your data doesn't pass 2.2 GB. If you think it will, make it a 64-bit file. That's the only limitation I can think of; I am sure Ray can think of some more.
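
For example, an existing hashed file can, if I remember the syntax right, be converted to 64-bit from the TCL / Administrator command window (the file name here is just a placeholder; check the exact keywords for your release):

Code:

RESIZE MyHashedFile * * * 64BIT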
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

With a 64-bit hashed file, the limit is on the order of the smaller of 2 PB (two million TB) and the largest file your operating system will permit. Best performance will be had by keeping the hashed file as small as possible, for example the key values and row number only; it depends on what you actually need to get out of it. You might also consider pre-sizing your hashed file (use the Hashed File Calculator to get a value for MINIMUM.MODULUS).
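
For instance, a pre-sized dynamic hashed file could be created from TCL along these lines (the file name and modulus are purely illustrative; take the modulus from the Hashed File Calculator and verify the exact options for your release):

Code:

CREATE.FILE KeyTrackHash DYNAMIC MINIMUM.MODULUS 9973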
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.