Remove duplicate in Flat file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

babbu9
Premium Member
Posts: 75
Joined: Tue Jun 01, 2004 9:44 am

Remove duplicate in Flat file

Post by babbu9 »

Hi

I was wondering whether there is a way to remove duplicate records from a flat file. The logic should be to reject all the duplicate records based on a key and pass only the unique records to Oracle.

Logic

Flat file ------> identify duplicates ------> unique records to Oracle
                         |
                         |
                         V
          hash file with duplicate records


Please advise.

Thanks
ketfos
Participant
Posts: 562
Joined: Mon May 03, 2004 8:58 pm
Location: san francisco
Contact:

Post by ketfos »

Hi,

1. You can load the records from the flat file into a hash file.
In the hash file, define the key columns.
When the data is loaded, any row with a duplicate key overwrites the row already there, so only one row per key survives (the last occurrence, not the first).
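For example (illustrative data only), with the first column as the hash file key:

    input rows : (1, A) (2, B) (1, C)
    hash file  : (1, C) (2, B)

Row (1, C) silently replaces (1, A), so the duplicates are absorbed rather than captured.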

2. Or you may load from the flat file directly into an Oracle table in which the key columns have been defined.
The duplicate records will not be loaded, due to the integrity constraint on the target Oracle table.
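For example (table and column names are illustrative), a primary key on the target makes Oracle reject any repeated key with an ORA-00001 unique constraint violation:

    CREATE TABLE target_table (
        key_col   VARCHAR2(20) PRIMARY KEY,
        other_col VARCHAR2(50)
    );

Depending on how you load, it may be possible to route the rejected rows down a reject link instead of failing the job.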


Ketfos
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could pass the data through a QualityStage UNDUP operation.

A hashed file is a good way to remove duplicates based on key value, but is not appropriate otherwise.

If you are contemplating the need to remove duplicate rows, sort the input (use sort as a filter in the Sequential File stage) and use stage variables in a Transformer stage to detect whether the current row's key matches the previous row's.
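For example (link, column and variable names here are illustrative), with the input sorted on KeyCol, define two stage variables in this order, and initialize svPrevKey to a value that can never occur as a real key:

    svIsDup   : If InLink.KeyCol = svPrevKey Then @TRUE Else @FALSE
    svPrevKey : InLink.KeyCol

Stage variables are evaluated top to bottom, so svIsDup compares against the previous row's key before svPrevKey is updated to the current one. Put Not(svIsDup) as the constraint on the output link to Oracle.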
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Or use 'sort' as a filter and leverage the '-u' option to sort out only unique records based on your sort keys. :wink:
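For example (assuming a comma-delimited file keyed on the first column), the filter command in the Sequential File stage could be:

    sort -t ',' -k 1,1 -u

Bear in mind that -u silently drops the duplicates, so on its own it won't capture them to an error file.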
-craig

"You can never have too many knives" -- Logan Nine Fingers
babbu9
Premium Member
Posts: 75
Joined: Tue Jun 01, 2004 9:44 am

Thank you; how to capture duplicate records?

Post by babbu9 »

Ray
Thank you for the reply. I was wondering if there is a way to capture the duplicate records when both the source and the target are flat files.

Thanks
Bob
ray.wurlod wrote: You could pass the data through a QualityStage UNDUP operation.

A hashed file is a good way to remove duplicates based on key value, but is not appropriate otherwise.

If you are contemplating the need to remove duplicate rows, sort the input (use sort as a filter in the Sequential File stage) and use stage variables in a Transformer stage to detect whether the current row's key matches the previous row's.
PilotBaha
Premium Member
Posts: 202
Joined: Mon Jan 12, 2004 8:05 pm

Post by PilotBaha »

Babbu,
QualityStage will work well for your requirement, as Ray mentioned. You could even make this a QualityStage-only job, unless you need to stream the data.
babbu9
Premium Member
Posts: 75
Joined: Tue Jun 01, 2004 9:44 am

no QS

Post by babbu9 »

Hi
Thanks for the reply. The problem is we do not have QualityStage; I have to do this with just DataStage 6. Any suggestions in this regard are welcome.

Babu

PilotBaha wrote: Babbu,
QualityStage will work well for your requirement, as Ray mentioned. You could even make this a QualityStage-only job, unless you need to stream the data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The Transformer stage with stage variables approach will work irrespective of source and target. The only requirement is that the source is sorted.
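To capture the duplicates when both source and target are flat files (continuing the illustrative names from my earlier sketch), give the Transformer two output links, each writing to a Sequential File stage, with these constraints:

    unique link     : Not(svIsDup)
    duplicates link : svIsDup

Rows whose key matches the previous row's land in the duplicates file; everything else goes to the main output file.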
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
babbu9
Premium Member
Posts: 75
Joined: Tue Jun 01, 2004 9:44 am

thanks

Post by babbu9 »

Gentlemen,
Thank you for the advice. I appreciate it.

Babu