
Removing duplicate rows from a flat file

Posted: Wed May 11, 2005 3:34 pm
by ranga1970
Folks;
I have a simple job:

Seq ------ xfm ------ ora (Sequential File stage -> Transformer -> Oracle stage)
I am getting duplicate records in the sequential file. I have two keys, providerkey and addresskey, and 8 non-key fields: STREET_LINE_1, STREET_LINE_2, ZIP_CODE, City, state, ADDRESS_TYPE, ZIP_CODE_EXTENSION_2, ZIP_CODE_EXTENSION_1. My addresskey is a surrogate key; providerkey and these 8 non-key fields come from the source.
The same provider can appear with a different address, and those records I have to load, but sometimes I get a record that is exactly the same on all 8 fields, and those records I have to reject.

Hope I am clear: the same provider can have different addresses, but in the source records I sometimes get the same provider with the same address, and then I have to reject it.
So I cannot check by provider ID only; I have to check on all the fields.
Is this possible in the Aggregator by selecting last or first on all fields?
Lookups are not an option; it is a simple load that just needs the real duplicates eliminated.
One option is stage variables, but I think that logic is too complicated.
If it were possible in the Aggregator it would be simple, but if I select first or last in the Aggregator on the key fields alone, I lose data for the same key with a different address, which I do not want.
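(To show what I mean by checking all the fields: outside DataStage, the compare-with-the-previous-row logic behind the stage-variable route boils down to something like this plain Unix sketch, where providers.txt is a hypothetical file name.)

    # sort, then keep a row only when it differs from the previous row on every field
    sort providers.txt | awk 'prev != $0 { print } { prev = $0 }'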
Any suggestions from seniors and experts will be appreciated.

thanks



Posted: Wed May 11, 2005 5:41 pm
by ray.wurlod
Use a filter on the Sequential File stage. As the filter command, use sort -u to remove the duplicates, and specify the sort keys with the appropriate command line arguments (for example -k1,2).
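For instance, since a real duplicate here means a row identical on every field, the filter could be as simple as the first form below; the -t delimiter and -k key positions in the second form are only an illustrative layout, so substitute your actual column positions.

    # collapse rows that are identical on every field
    sort -u
    # or compare only particular fields of a pipe-delimited file
    # (the key positions below are purely illustrative)
    sort -t '|' -u -k1,1 -k3,10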

Posted: Thu May 12, 2005 12:34 am
by talk2shaanc
A question.
Say today you inserted/updated a record with some information. If tomorrow the same account comes in again with the same information as yesterday (duplicate records again), would you process it? (This can happen when your source system sends you a full-refresh file.)


If your answer is YES, then you should implement CDC (Change Data Capture), which means you process an account's information on a given day only if it has changed; otherwise you ignore/reject the data. This not only removes duplicates but also improves performance by keeping duplicates from being re-processed.
For this you need to use the checksum function of DataStage to compare a source record with the existing record for that account.
Using a hashed file in the design would improve your performance further.
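Outside DataStage, the checksum-compare idea reduces to something like this plain Unix sketch (not DataStage syntax; the file names, the pipe delimiter and the key-in-the-first-field layout are assumptions): a record is passed on only when its checksum is not already in the set saved by the previous run.

    # prev_checksums.txt holds "providerkey|md5" pairs written by the previous load
    while IFS= read -r line; do
        key=${line%%|*}                                     # provider key = first field
        sum=$(printf '%s' "$line" | md5sum | cut -d' ' -f1) # checksum of the whole record
        grep -qxF "${key}|${sum}" prev_checksums.txt || printf '%s\n' "$line"
    done < today.txt

Inside the job, that is roughly the comparison the hashed-file lookup against the stored checksum would be doing.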

Posted: Thu May 12, 2005 6:36 am
by chulett
Technically, in the "If your answer is YES" paragraph, what is being described is CDD - Change Data Detection - not CDC.

CDC is tightly coupled to the source system and is a process that can capture both before and after images of changed records, inserted records and even deleted records - sending you only those records that have been touched in some way since the last CDC run.

CDD is handled outside the source system where it's your responsibility to check for new and changed records. Unfortunately, you lose the ability to know when a source record has been deleted. Well, not really lose - more like don't have. :wink:

Posted: Thu May 12, 2005 11:03 am
by ranga1970
talk2shaanc, chulett:
Mine is a daily load, but I truncate the tables and reload, so there is no question of changed data and no lookup.
You can treat it as a one-time load.

Ray,
I will work on your suggestion and come back if I have any questions.

thanks