Removing duplicate rows from a flat file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


ranga1970
Participant
Posts: 141
Joined: Thu Nov 04, 2004 3:29 pm
Location: Hyderabad

Removing duplicate rows from a flat file

Post by ranga1970 »

Folks,
I have a job design like this:

Seq ------ Xfm ------ Ora

I am getting duplicate records in the sequential file. I have two keys, providerkey and addresskey, and 8 non-key fields: STREET_LINE_1, STREET_LINE_2, ZIP_CODE, City, state, ADDRESS_TYPE, ZIP_CODE_EXTENSION_2, ZIP_CODE_EXTENSION_1. addresskey is a surrogate key; providerkey and these 8 non-key fields come from the source.
I can have duplicate providers with different addresses, and those I have to load, but sometimes I get a record that is exactly the same across all 8 fields, and those records I have to reject.

Hope I am clear:
the same provider can have different addresses, but in the source I sometimes get the same provider with the same address, and that record I have to reject,
so I cannot check by providerkey alone; I have to check across all the fields.
Is this possible in the Aggregator by selecting First or Last on all fields?
Lookups are not possible; it is a simple load that just has to eliminate the real duplicates.
One option is stage variables, but I think that logic is too complicated.
If it is possible in the Aggregator it would be simple, but if I select First or Last on the key fields alone, then I lose the rows with the same key and a different address, which I do not want.
Any suggestions from seniors and experts would be appreciated.

thanks


RRCHINTALA
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Use a filter on the Sequential File stage. As the filter command, use sort -u to remove the duplicates. Specify the sort keys using the appropriate command-line arguments (for example, -k1,2).
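
A minimal sketch of what that filter might look like, assuming a comma-delimited file with providerkey in field 1 followed by the eight address fields in fields 2 through 9 (the delimiter and field positions are assumptions; adjust -t and -k to the real layout):

    # Filter command on the Sequential File stage (field positions assumed).
    # Keying on fields 1-9 makes sort -u compare providerkey plus all eight
    # address fields, so exact duplicates collapse to one row while the same
    # provider with a different address is kept.
    sort -t ',' -u -k 1,9

With -u and an explicit -k, sort compares only the named fields when deciding what counts as a duplicate.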
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
talk2shaanc
Charter Member
Posts: 199
Joined: Tue Jan 18, 2005 2:50 am
Location: India

Post by talk2shaanc »

A question:
say today you inserted/updated a record with some information. If tomorrow that same account comes in with the same information as yesterday's (duplicate records again), would you process it? (This can be the case when your source system sends you a full-refresh file.)


If your answer is YES, then you should implement CDC (Change Data Capture), meaning that on any day you process an account's information only if it has changed; otherwise you ignore/reject the data. It not only removes duplicates but also improves performance by keeping duplicates from being re-processed.
For this you need to use the checksum function of DataStage to compare a source record with the existing record for an account.
Using a hashed file in the design would improve your performance further.
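
A rough sketch of how such a routine could look in DataStage BASIC, assuming a hashed file called CKSUM_HASH keyed on the provider key that stores the last checksum seen; the file name, the argument names and the open-failure handling are all hypothetical:

    * Sketch only: a transform routine taking ProviderKey (the business
    * key) and AddressData (the eight address fields concatenated).
    * Returns @TRUE when the record is new or changed, @FALSE when it
    * duplicates the stored image.
    COMMON /CkSumCache/ Initialized, CkFile
    If UNASSIGNED(Initialized) Then
       Initialized = @TRUE
       Open 'CKSUM_HASH' To CkFile Else Abort   ;* hypothetical hashed file
    End
    NewSum = CHECKSUM(AddressData)
    Read OldSum From CkFile, ProviderKey Else OldSum = ''
    If NewSum = OldSum Then
       Ans = @FALSE    ;* unchanged duplicate - reject
    End Else
       Write NewSum To CkFile, ProviderKey
       Ans = @TRUE     ;* new or changed - process it
    End

Called from a Transformer constraint, only rows returning @TRUE would pass on to the database.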
Shantanu Choudhary
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Technically, in the "If your answer is YES" paragraph, what is being described is CDD - Change Data Detection - not CDC.

CDC is tightly coupled to the source system and is a process that can capture both before and after images of changed records, inserted records and even deleted records, sending you only those records that have been touched in some way since the last CDC run.

CDD is handled outside the source system where it's your responsibility to check for new and changed records. Unfortunately, you lose the ability to know when a source record has been deleted. Well, not really lose - more like don't have. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ranga1970
Participant
Posts: 141
Joined: Thu Nov 04, 2004 3:29 pm
Location: Hyderabad

Post by ranga1970 »

talk2shaanc, chulett,
Mine is a daily load, but we truncate the tables and reload, so there is no question of changed data and no lookup.
You can treat it as a one-time load.

Ray,
I will work on your suggestion and come back if I have any questions.

thanks
RRCHINTALA