Page 1 of 1

How CRC can be implimented in Datastage to determine two rec

Posted: Fri Feb 04, 2005 9:01 am
by olgc
Hi there,

The cyclic redundancy checksum (CRC) can be useed to determine quickly if two complex records from source and target separately are the same or not. If they have the same CRC, then they can be treated as unchanged since last update. If the two CRCs are different, a new record is added to the target. The problem is how CRC is implemented in Datastage. In other language, such as C or C++, only a long pointer pointing the record storage is enuogh. But how can this be implemented in Datastage?

Have a nice day everybody!

Posted: Fri Feb 04, 2005 10:10 am
by ArndW
Hello olgc,

to do a CRC check on two records you would need to treat those records as one long string - that is not a big issue and is discussed elsewhere. DataStage has 2 (mundane) checks built into the system that both have their advantages/disadvantages:

a) do a SOUNDEX on a string. This is not a great way to work, as it misses out on lots of changes, but it does work.
b) do a <> string comparison directly. The code that gets executed and is buried underneath the covers does a bit of optimization, as it will stop the byte-by-byte compare once a difference is found.

Statistically you might not gain much execution speed using an algorithm vs. straight string compare. If most of your strings are the same and/or have differences at the far right AND the strings are long (let's use 1000 chars) then perhaps a simple BASIC routine that performs a CRC algorithm might be faster than a string "not equal" compare.

On the whole implementing some routine or even an external C++ routine call will take as long as a simple DataStage compare. With the Parallel Extender you have a CDC stage which elegantly solves this question.

What data volumes (no. of rows) and record lengths are you looking at?

Posted: Fri Feb 04, 2005 10:38 am
by chulett
Michael Hester has posted a nice CRC32 routine for DataStage. You can find the thread about it here and it links to where it lives in the Download section on ADN.

Posted: Fri Feb 04, 2005 10:42 am
by Sainath.Srinivasan
To perform this, you can concatenate the columns you want to locate the change in (delimited by some special character) and apply crc32 function in datastage.

The purpose of this special character delimiter is to differentiate between 2 fields of 'A,B' and 'AB,'.

I think soundex works mostly on the first few characters rather than the whole set.

Posted: Fri Feb 04, 2005 11:07 am
by ArndW
Sainath,

the soundex removes vowels then assigns numbers based in similar sounds (for English only). It is not really useful in this context, I just threw it in there to show that DS already has something akin to what he might want. I'll download the routine suggested above shortly, though...

Posted: Fri Feb 04, 2005 11:14 am
by ogmios
I stopped using CRC32 a couple of years ago after being bitten by the 'case that will never happen in your life time'. I had it 2 times within 6 months that changes were lost due to the fact that the CRC32 of 2 rows were the same while the contents was not :(

Now I just do a complete compare.

Ogmios

Posted: Fri Feb 04, 2005 3:12 pm
by olgc
Thanks everybody, I see the CRC32 function in Datastage.

There is another question: the concatenate operation ':'. When two strings concatenate, the result is null if any one is null. This causes problem. Is any opeartion there in Datastage that returns non null result if any of the two is not null. Until both null then returns null.

If no this kind of operations exist in Datastage, that means for a table, one routine has to be created to concatenate the fields of a record into a string before the CRC32 function can be used.

Have a nice weekend!

Posted: Fri Feb 04, 2005 4:09 pm
by Sainath.Srinivasan
You can write a routine for this as follows

NVL(InputValue, DefaultValue)
If IsNULL(InputValue) Or Trim(InputValue) ='' Then
Ans = DefaultValue
End
Else
Ans = InputValue
End

You can then use the above method in your transform derivations.

Posted: Fri Feb 04, 2005 4:43 pm
by olgc
Thanks, that's a good idea.

I write a routine as the following:

If IsNull(Arg1) Then
Ans = ''
End
Else
Ans = Arg1
End

When it is tested with the Arg1 = @Null, the result is @Null. Is the Datastage Code Test does not identify @NULL?

Thanks,

Posted: Fri Feb 04, 2005 6:06 pm
by chulett
There just so happens to be an entire thread on this very topic posted today! Short answer is you can't, but the linked thread has some workarounds.