Best way for comparison of multiple columns

Post questions here relating to DataStage Enterprise/PX Edition, for areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Best way for comparison of multiple columns

Post by sharmabhavesh »

Hi,
I have a before dataset and an after dataset, and I want to identify all the updated records in the after dataset. The problem is that I need to compare records based on all the columns (1 key and all the others as change values).
I think doing it through a CDC or SCD stage will take a long time, as the job would be comparing 25-30 columns whenever the key value matches to see if there are changes.

Can I create a checksum value using the DataStage Checksum stage for all the value columns? Is there a guarantee that these checksum values will be unique for different sets of values?
Or is there any better way of doing this?
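
For illustration only, here is a minimal C sketch of the idea being asked about: serialize a row's value columns into one string, checksum it, and compare a single checksum instead of 25-30 columns. It uses zlib's crc32 purely as a stand-in (it does not reproduce what the DataStage Checksum stage actually does), and the column values and delimiter are made up.

Code:
/* Sketch: one per-row checksum compare instead of 25-30 column compares.
 * zlib's crc32 is used only as an illustration; values are made up. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Checksum of the concatenated, delimited value columns of one row. */
static uLong row_checksum(const char *value_columns)
{
    uLong crc = crc32(0L, Z_NULL, 0);
    return crc32(crc, (const Bytef *)value_columns, (uInt)strlen(value_columns));
}

int main(void)
{
    const char *before = "colA|colB|colC";     /* before-image values */
    const char *after  = "colA|colB|changed";  /* after-image values  */

    if (row_checksum(before) != row_checksum(after))
        printf("row updated\n");   /* one compare instead of 25-30 */
    return 0;
}

The point of the sketch is only that the per-row compare collapses to a single integer compare; whether computing the checksum is actually cheaper than comparing the columns directly is the question discussed in the replies below.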
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Best way for comparison of multiple columns

Post by chulett »

sharmabhavesh wrote: I think doing it through a CDC or SCD stage will take a long time, as the job would be comparing 25-30 columns whenever the key value matches to see if there are changes.
My suggestion? Rather than assuming something may be a problem and avoiding that approach, try it. Consider those other options if your concerns become a reality. I would imagine the Change Detection process is just fine for what you need to do.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Post by sharmabhavesh »

Surely I will try that as soon as I get DataStage access. The problem is that right now I am just at the design phase with hardly any access to DataStage, but your point is taken.
Could you please throw some light on the Checksum stage, though? Is the checksum always unique for different sets of values? What are the chances of a collision?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You are going to get differing opinions on that and Google will happily enlighten you on "collisions" in MD5 hashes. For what you need, the odds that a hash on the same key for different values would somehow be the same are infinitesimal. And there's no need to worry about them being "unique" in this application. I'm just concerned that the overhead of computing the checksum for each side of every comparison would far outweigh the time the CD stage would take. However, the only way to know for certain would be to try both in your environment with your data.
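
To put a rough number on "infinitesimal", here is a back-of-the-envelope birthday-bound estimate in C, assuming the checksum is a 128-bit MD5 hash; the row count is a made-up example.

Code:
/* Rough birthday-bound estimate of any MD5 (128-bit) collision among n rows.
 * The row count is a hypothetical example, not from this thread. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e9;                               /* hypothetical number of rows   */
    double space = ldexp(1.0, 128);               /* 2^128 possible MD5 values     */
    double p = (n * (n - 1.0) / 2.0) / space;     /* approx. collision probability */
    printf("approx collision probability: %g\n", p);  /* about 1.5e-21 */
    return 0;
}

Even at a billion rows, the estimated chance that any two different rows hash to the same value is on the order of 1e-21.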
-craig

"You can never have too many knives" -- Logan Nine Fingers
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

A checksum is going to cost as much as, or more than, just comparing the columns.

I would start with CDC.

If performance is an issue, there are two things you can try to boost it:
pipe the input into one of the Unix shell checksum commands (try sum),
or write a C memcmp function to do it (a rough sketch of the memcmp idea follows below).
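
For what it's worth, a rough sketch of that memcmp idea, assuming each row's value columns can be serialized into a fixed-width, zero-padded buffer (the width, layout, and sample values below are made up):

Code:
/* Sketch: if each row's value columns are serialized into a fixed-width,
 * zero-padded buffer, one memcmp replaces 25-30 field compares.
 * Buffer width and contents are assumptions for illustration. */
#include <string.h>
#include <stdio.h>

#define ROW_WIDTH 256   /* hypothetical fixed serialized width per row */

/* Returns nonzero when the two serialized rows differ anywhere. */
static int row_changed(const char before[ROW_WIDTH], const char after[ROW_WIDTH])
{
    return memcmp(before, after, ROW_WIDTH) != 0;
}

int main(void)
{
    /* Array initializers zero-fill the unused tail, so the full-width
     * memcmp is well defined. */
    char before[ROW_WIDTH] = "key001|val1|val2|val3";
    char after[ROW_WIDTH]  = "key001|val1|val2|val9";

    if (row_changed(before, after))
        printf("row updated\n");
    return 0;
}

The fixed width and zero padding matter: memcmp compares raw bytes, so comparing structs with padding bytes or buffers with uninitialized tails would give unreliable results.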