Comparing Two Datasets

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


iamnagus
Participant
Posts: 48
Joined: Wed Sep 29, 2004 1:16 am

Comparing Two Datasets

Post by iamnagus »

I have two input files,
Before_file1 and After_file2 (Sequential File stages).

Now I want to compare these two files and get the MODIFIED, NEW (inserted) and DELETED records.

I know how to do this in parallel jobs using the Change Capture stage,
but I don't know how to do it in Server Edition.

I need this for my current project.

Can anyone help me?

Thanks in Advance
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Load one of the files into a hashed file, then use that hashed file as a reference for the other. If the lookup fails it is a "N"ew record; if it succeeds it is a "M"odified one.
Capturing deleted records this way is a bit more difficult; you could always load both files into hashed files and do lookups both ways, or combine both files, sort and aggregate. A lot depends upon the sizes of your files.
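
A rough illustration of that lookup logic, outside DataStage (in a Server job the reference would be a Hashed File stage): the Python sketch below assumes comma-delimited files whose first column is the key; the ".txt" extensions and the single-column key are assumptions, not from the original post.

```python
import csv

def load_reference(path):
    """Load the 'before' file into an in-memory lookup (playing the role of the hashed file)."""
    with open(path, newline="") as f:
        return {row[0]: row for row in csv.reader(f)}

def classify_after(after_path, before_lookup):
    """Yield ('N'|'M'|'U', row) for every row of the 'after' file."""
    with open(after_path, newline="") as f:
        for row in csv.reader(f):
            before_row = before_lookup.get(row[0])
            if before_row is None:
                yield "N", row      # lookup failed -> new record
            elif before_row != row:
                yield "M", row      # lookup hit but data changed -> modified
            else:
                yield "U", row      # unchanged

# File names follow the original post; the .txt extension is assumed.
for flag, row in classify_after("After_file2.txt", load_reference("Before_file1.txt")):
    print(flag, row)
```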
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

For deletes, just reverse the roles and check again: a hashed file miss then indicates a deleted record.
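
Sketched the same way (Python just to show the logic, with the same assumed file layout and placeholder names as above), the reversed pass looks like this:

```python
import csv

def deleted_records(before_path, after_path):
    """Rows whose key exists in the 'before' file but not in the 'after' file."""
    with open(after_path, newline="") as f:
        after_keys = {row[0] for row in csv.reader(f)}   # 'after' file is now the reference
    with open(before_path, newline="") as f:
        for row in csv.reader(f):
            if row[0] not in after_keys:
                yield row                                # lookup miss -> deleted

for row in deleted_records("Before_file1.txt", "After_file2.txt"):
    print("D", row)
```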
-craig

"You can never have too many knives" -- Logan Nine Fingers
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Depending on how much data is involved with each row in the sequential files, follow Arnd's suggestion about using the hashed file, but only put the PK fields in there. This will make your hashed file smaller and possibly speed things up as well. Once again, it depends on the amount of data you are dealing with. Just a slight modification to his suggestion.
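
A sketch of the slimmed-down lookup (again Python for illustration only). It goes one step beyond the post by pairing each PK with a checksum of the non-key columns so modified rows can still be spotted; the checksum is an assumption added here, not part of the original suggestion.

```python
import csv
import zlib

def key_checksums(path):
    """Map each PK (assumed to be the first column) to a CRC32 of the remaining columns."""
    with open(path, newline="") as f:
        return {row[0]: zlib.crc32(",".join(row[1:]).encode())
                for row in csv.reader(f)}

before = key_checksums("Before_file1.txt")
after = key_checksums("After_file2.txt")

new_keys      = after.keys() - before.keys()              # only in 'after'
deleted_keys  = before.keys() - after.keys()              # only in 'before'
modified_keys = {k for k in after.keys() & before.keys()  # present in both, data changed
                 if after[k] != before[k]}
```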

Hope that helps!
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"