Best way for comparison of multiple columns

Post questions here relating to DataStage Enterprise/PX Edition, for areas such as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Best way for comparison of multiple columns

Post by sharmabhavesh »

Hi,
I have a before dataset and an after dataset, and I want to identify all the updated records in the after dataset. The problem is that I need to compare records based on all the columns (1 key and all the others as change values).
I think doing it through a CDC or SCD stage will take a long time, as the job would be comparing 25-30 columns whenever the key value matches to see if there are changes.

Can I create a checksum value using the DataStage Checksum stage for all the value columns? Is there a guarantee that these checksum values will be unique for different sets of values?
Or is there any better way of doing this?
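
For illustration only, here is a minimal C sketch of the idea being asked about: serialize a row's value columns into one string, checksum it, and compare a single checksum instead of 25-30 columns. It uses zlib's crc32 purely as a stand-in (it does not reproduce what the DataStage Checksum stage actually does), and the column values and delimiter are made up.

Code:
/* Sketch: one per-row checksum compare instead of 25-30 column compares.
 * zlib's crc32 is used only as an illustration; values are made up. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Checksum of the concatenated, delimited value columns of one row. */
static uLong row_checksum(const char *value_columns)
{
    uLong crc = crc32(0L, Z_NULL, 0);
    return crc32(crc, (const Bytef *)value_columns, (uInt)strlen(value_columns));
}

int main(void)
{
    const char *before = "colA|colB|colC";     /* before-image values */
    const char *after  = "colA|colB|changed";  /* after-image values  */

    if (row_checksum(before) != row_checksum(after))
        printf("row updated\n");   /* one compare instead of 25-30 */
    return 0;
}

The point of the sketch is only that the per-row compare collapses to a single integer compare; whether computing the checksum is actually cheaper than comparing the columns directly is the question discussed in the replies below.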
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Best way for comparison of multiple columns

Post by chulett »

sharmabhavesh wrote: I think doing it through a CDC or SCD stage will take a long time, as the job would be comparing 25-30 columns whenever the key value matches to see if there are changes.
My suggestion? Rather than assuming something may be a problem and avoiding that approach, try it. Consider those other options if your concerns become a reality. I would imagine the Change Detection process is just fine for what you need to do.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sharmabhavesh
Premium Member
Posts: 38
Joined: Tue Jun 19, 2012 11:03 pm
Location: India

Post by sharmabhavesh »

Surely I will try that as soon as I get DataStage access. The problem is that right now I am just at the design phase with hardly any access to DataStage, but your point is taken.
Could you please throw some light on the Checksum stage, though? Is the checksum always unique for different sets of values? What are the chances of a collision?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You are going to get differing opinions on that and Google will happily enlighten you on "collisions" in MD5 hashes. For what you need, the odds that a hash on the same key for different values would somehow be the same are infinitesimal. And there's no need to worry about them being "unique" in this application. I'm just concerned that the overhead of computing the checksum for each side of every comparison would far outweigh the time the CD stage would take. However, the only way to know for certain would be to try both in your environment with your data.
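
To put a rough number on "infinitesimal", here is a back-of-the-envelope birthday-bound estimate in C, assuming the checksum is a 128-bit MD5 hash; the row count is a made-up example.

Code:
/* Rough birthday-bound estimate of any MD5 (128-bit) collision among n rows.
 * The row count is a hypothetical example, not from this thread. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 1e9;                               /* hypothetical number of rows   */
    double space = ldexp(1.0, 128);               /* 2^128 possible MD5 values     */
    double p = (n * (n - 1.0) / 2.0) / space;     /* approx. collision probability */
    printf("approx collision probability: %g\n", p);  /* about 1.5e-21 */
    return 0;
}

Even at a billion rows, the estimated chance that any two different rows hash to the same value is on the order of 1e-21.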
-craig

"You can never have too many knives" -- Logan Nine Fingers
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

A checksum is going to cost as much as, or more than, just comparing the columns.

I would start with CDC.

If performance is an issue, there are two things you can try to boost it:
pipe the input into one of the Unix shell checksum commands (try sum),
or write a C memcmp function to do it (a rough sketch of the memcmp idea follows below).
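
For what it's worth, a rough sketch of that memcmp idea, assuming each row's value columns can be serialized into a fixed-width, zero-padded buffer (the width, layout, and sample values below are made up):

Code:
/* Sketch: if each row's value columns are serialized into a fixed-width,
 * zero-padded buffer, one memcmp replaces 25-30 field compares.
 * Buffer width and contents are assumptions for illustration. */
#include <string.h>
#include <stdio.h>

#define ROW_WIDTH 256   /* hypothetical fixed serialized width per row */

/* Returns nonzero when the two serialized rows differ anywhere. */
static int row_changed(const char before[ROW_WIDTH], const char after[ROW_WIDTH])
{
    return memcmp(before, after, ROW_WIDTH) != 0;
}

int main(void)
{
    /* Array initializers zero-fill the unused tail, so the full-width
     * memcmp is well defined. */
    char before[ROW_WIDTH] = "key001|val1|val2|val3";
    char after[ROW_WIDTH]  = "key001|val1|val2|val9";

    if (row_changed(before, after))
        printf("row updated\n");
    return 0;
}

The fixed width and zero padding matter: memcmp compares raw bytes, so comparing structs with padding bytes or buffers with uninitialized tails would give unreliable results.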