How CRC can be implimented in Datastage to determine two rec

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

How CRC can be implimented in Datastage to determine two rec

Post by olgc »

Hi there,

The cyclic redundancy checksum (CRC) can be useed to determine quickly if two complex records from source and target separately are the same or not. If they have the same CRC, then they can be treated as unchanged since last update. If the two CRCs are different, a new record is added to the target. The problem is how CRC is implemented in Datastage. In other language, such as C or C++, only a long pointer pointing the record storage is enuogh. But how can this be implemented in Datastage?

Have a nice day everybody!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello olgc,

to do a CRC check on two records you would need to treat those records as one long string - that is not a big issue and is discussed elsewhere. DataStage has 2 (mundane) checks built into the system that both have their advantages/disadvantages:

a) do a SOUNDEX on a string. This is not a great way to work, as it misses out on lots of changes, but it does work.
b) do a <> string comparison directly. The code that gets executed and is buried underneath the covers does a bit of optimization, as it will stop the byte-by-byte compare once a difference is found.

Statistically you might not gain much execution speed using an algorithm vs. straight string compare. If most of your strings are the same and/or have differences at the far right AND the strings are long (let's use 1000 chars) then perhaps a simple BASIC routine that performs a CRC algorithm might be faster than a string "not equal" compare.

On the whole implementing some routine or even an external C++ routine call will take as long as a simple DataStage compare. With the Parallel Extender you have a CDC stage which elegantly solves this question.

What data volumes (no. of rows) and record lengths are you looking at?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Michael Hester has posted a nice CRC32 routine for DataStage. You can find the thread about it here and it links to where it lives in the Download section on ADN.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

To perform this, you can concatenate the columns you want to locate the change in (delimited by some special character) and apply crc32 function in datastage.

The purpose of this special character delimiter is to differentiate between 2 fields of 'A,B' and 'AB,'.

I think soundex works mostly on the first few characters rather than the whole set.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Sainath,

the soundex removes vowels then assigns numbers based in similar sounds (for English only). It is not really useful in this context, I just threw it in there to show that DS already has something akin to what he might want. I'll download the routine suggested above shortly, though...
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

I stopped using CRC32 a couple of years ago after being bitten by the 'case that will never happen in your life time'. I had it 2 times within 6 months that changes were lost due to the fact that the CRC32 of 2 rows were the same while the contents was not :(

Now I just do a complete compare.

Ogmios
In theory there's no difference between theory and practice. In practice there is.
olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

Post by olgc »

Thanks everybody, I see the CRC32 function in Datastage.

There is another question: the concatenate operation ':'. When two strings concatenate, the result is null if any one is null. This causes problem. Is any opeartion there in Datastage that returns non null result if any of the two is not null. Until both null then returns null.

If no this kind of operations exist in Datastage, that means for a table, one routine has to be created to concatenate the fields of a record into a string before the CRC32 function can be used.

Have a nice weekend!
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can write a routine for this as follows

NVL(InputValue, DefaultValue)
If IsNULL(InputValue) Or Trim(InputValue) ='' Then
Ans = DefaultValue
End
Else
Ans = InputValue
End

You can then use the above method in your transform derivations.
olgc
Participant
Posts: 145
Joined: Tue Nov 18, 2003 9:00 am

Post by olgc »

Thanks, that's a good idea.

I write a routine as the following:

If IsNull(Arg1) Then
Ans = ''
End
Else
Ans = Arg1
End

When it is tested with the Arg1 = @Null, the result is @Null. Is the Datastage Code Test does not identify @NULL?

Thanks,
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There just so happens to be an entire thread on this very topic posted today! Short answer is you can't, but the linked thread has some workarounds.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply