How CRC can be implimented in Datastage to determine two rec
Moderators: chulett, rschirm, roy
How CRC can be implimented in Datastage to determine two rec
Hi there,
The cyclic redundancy checksum (CRC) can be useed to determine quickly if two complex records from source and target separately are the same or not. If they have the same CRC, then they can be treated as unchanged since last update. If the two CRCs are different, a new record is added to the target. The problem is how CRC is implemented in Datastage. In other language, such as C or C++, only a long pointer pointing the record storage is enuogh. But how can this be implemented in Datastage?
Have a nice day everybody!
The cyclic redundancy checksum (CRC) can be useed to determine quickly if two complex records from source and target separately are the same or not. If they have the same CRC, then they can be treated as unchanged since last update. If the two CRCs are different, a new record is added to the target. The problem is how CRC is implemented in Datastage. In other language, such as C or C++, only a long pointer pointing the record storage is enuogh. But how can this be implemented in Datastage?
Have a nice day everybody!
Hello olgc,
to do a CRC check on two records you would need to treat those records as one long string - that is not a big issue and is discussed elsewhere. DataStage has 2 (mundane) checks built into the system that both have their advantages/disadvantages:
a) do a SOUNDEX on a string. This is not a great way to work, as it misses out on lots of changes, but it does work.
b) do a <> string comparison directly. The code that gets executed and is buried underneath the covers does a bit of optimization, as it will stop the byte-by-byte compare once a difference is found.
Statistically you might not gain much execution speed using an algorithm vs. straight string compare. If most of your strings are the same and/or have differences at the far right AND the strings are long (let's use 1000 chars) then perhaps a simple BASIC routine that performs a CRC algorithm might be faster than a string "not equal" compare.
On the whole implementing some routine or even an external C++ routine call will take as long as a simple DataStage compare. With the Parallel Extender you have a CDC stage which elegantly solves this question.
What data volumes (no. of rows) and record lengths are you looking at?
to do a CRC check on two records you would need to treat those records as one long string - that is not a big issue and is discussed elsewhere. DataStage has 2 (mundane) checks built into the system that both have their advantages/disadvantages:
a) do a SOUNDEX on a string. This is not a great way to work, as it misses out on lots of changes, but it does work.
b) do a <> string comparison directly. The code that gets executed and is buried underneath the covers does a bit of optimization, as it will stop the byte-by-byte compare once a difference is found.
Statistically you might not gain much execution speed using an algorithm vs. straight string compare. If most of your strings are the same and/or have differences at the far right AND the strings are long (let's use 1000 chars) then perhaps a simple BASIC routine that performs a CRC algorithm might be faster than a string "not equal" compare.
On the whole implementing some routine or even an external C++ routine call will take as long as a simple DataStage compare. With the Parallel Extender you have a CDC stage which elegantly solves this question.
What data volumes (no. of rows) and record lengths are you looking at?
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
Michael Hester has posted a nice CRC32 routine for DataStage. You can find the thread about it here and it links to where it lives in the Download section on ADN.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
To perform this, you can concatenate the columns you want to locate the change in (delimited by some special character) and apply crc32 function in datastage.
The purpose of this special character delimiter is to differentiate between 2 fields of 'A,B' and 'AB,'.
I think soundex works mostly on the first few characters rather than the whole set.
The purpose of this special character delimiter is to differentiate between 2 fields of 'A,B' and 'AB,'.
I think soundex works mostly on the first few characters rather than the whole set.
Sainath,
the soundex removes vowels then assigns numbers based in similar sounds (for English only). It is not really useful in this context, I just threw it in there to show that DS already has something akin to what he might want. I'll download the routine suggested above shortly, though...
the soundex removes vowels then assigns numbers based in similar sounds (for English only). It is not really useful in this context, I just threw it in there to show that DS already has something akin to what he might want. I'll download the routine suggested above shortly, though...
<a href=http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1 target="WCGWin">
</a>
</a>
I stopped using CRC32 a couple of years ago after being bitten by the 'case that will never happen in your life time'. I had it 2 times within 6 months that changes were lost due to the fact that the CRC32 of 2 rows were the same while the contents was not
Now I just do a complete compare.
Ogmios
Now I just do a complete compare.
Ogmios
In theory there's no difference between theory and practice. In practice there is.
Thanks everybody, I see the CRC32 function in Datastage.
There is another question: the concatenate operation ':'. When two strings concatenate, the result is null if any one is null. This causes problem. Is any opeartion there in Datastage that returns non null result if any of the two is not null. Until both null then returns null.
If no this kind of operations exist in Datastage, that means for a table, one routine has to be created to concatenate the fields of a record into a string before the CRC32 function can be used.
Have a nice weekend!
There is another question: the concatenate operation ':'. When two strings concatenate, the result is null if any one is null. This causes problem. Is any opeartion there in Datastage that returns non null result if any of the two is not null. Until both null then returns null.
If no this kind of operations exist in Datastage, that means for a table, one routine has to be created to concatenate the fields of a record into a string before the CRC32 function can be used.
Have a nice weekend!
-
- Participant
- Posts: 3337
- Joined: Mon Jan 17, 2005 4:49 am
- Location: United Kingdom
There just so happens to be an entire thread on this very topic posted today! Short answer is you can't, but the linked thread has some workarounds.
-craig
"You can never have too many knives" -- Logan Nine Fingers
"You can never have too many knives" -- Logan Nine Fingers