Hi,
I have a before and after dataset and I want to identify all the updated records in the after dataset. The problem is that I need to compare records based on all the columns (1 key and all other as change values).
I think doing it through a CDC or SCD will take a long time as job would be comparing 25-30 columns if key value matches to see if there are changes.
Can I create a checksum value using Datastage checksum stage for all the value columns. Is there a guarantee that these checksum values will be unique for different set of values?
Or any other better way of doing this?
Best way for comparison of multiple columns
Moderators: chulett, rschirm, roy
-
- Premium Member
- Posts: 38
- Joined: Tue Jun 19, 2012 11:03 pm
- Location: India
Re: Best way for comparison of multiple columns
My suggestion? Rather than think something may be a problem and avoid that approach - try it. Consider those other options if your concerns become a reality. I would imagine the Change Detection process is just fine for what you need to do.sharmabhavesh wrote:I think doing it through a CDC or SCD will take a long time as job would be comparing 25-30 columns if key value matches to see if there are changes.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Premium Member
- Posts: 38
- Joined: Tue Jun 19, 2012 11:03 pm
- Location: India
Surely, I will try that as soon as I get the DS access. The problem is, right now I am just at the design phase with hardly any access to Datastage but your point taken.
But could you please throw some light on the checksum stage. Is checksum always unique for different set of values. How much are the chances of collision?
But could you please throw some light on the checksum stage. Is checksum always unique for different set of values. How much are the chances of collision?
You are going to get differing opinions on that and Google will happily enlighten you on "collisions" in MD5 hashes. For what you need, the odds that a hash on the same key for different values would somehow be the same are infinitesimal. And there's no need to worry about them being "unique" in this application. I'm just concerned that the overhead of computing the checksum for each side of every comparison would far outweigh the time the CD stage would take. However, the only way to know for certain would be to try both in your environment with your data.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers