Storing CRC values

baabi_26
Participant
Posts: 14
Joined: Mon Jan 24, 2005 5:31 pm

Post by baabi_26 »

Thanks a lot.

http://img240.echo.cx/img240/7100/crc7ih.jpg

Please click on the link above to see my job design. It works this way:

I'll assume the file has col1 and col2, and that col1 is the key column.

1. I'll feed the Afterset through a transformer (TRXAfter) and add a column (col3) to the existing columns; its value is "1" for all rows.

2. I'll do the same for the BeforeSet, but the column value will be "2".

3. From TRXAfter, I'll store all the key column values in the HshAfter hashed file.

4. In TRXBefore, I'll use the HshAfter file as a lookup and store the BeforeSet keys that are also in the Afterset. The output of this is the HshCommon file.

5. I'll merge the rows from TRXAfter and TRXBefore into a single file.

6. I'll feed the rows from the merge file into an Aggregator, where I'll group by col1, col2 and select min(col3) (we'll call it mincol3) and max(col3) (maxcol3). (col3 is the flag column added in TRXAfter and TRXBefore.)

7. In the transformer (TRXIdentify) I'll do a lookup against the HshCommon file created earlier and identify the changed records by checking the following constraints (remember, HshCommon holds the keys that are common to both files):
a. Lookup not found and mincol3 = 1 (Insert)
b. Lookup found and maxcol3 = 1 (Update)
c. Lookup not found and mincol3 = 2 (Delete)
d. Lookup found and mincol3 <> maxcol3 (Unchanged)

This will give us four files: inserts, updates, deletes and unchanged. It becomes kind of tedious when you have a lot of columns in your source files, because the Aggregator takes that much longer to produce its output. I could have used a row collector instead of the merge, but then there would be a cyclic dependency in the job design. I hope that explains the approach; a rough sketch in code follows.
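
Here is a minimal sketch of steps 1-7 written outside DataStage, just to show the flag-and-aggregate logic. It assumes col1 is the key and both sets fit in memory; the names (classify, after_set, before_set) are made up for the example.

Code: Select all

def classify(after_set, before_set):
    """after_set / before_set: dicts mapping col1 -> col2."""
    # Steps 1-2: tag every row with a set flag, col3 (1 = after, 2 = before)
    merged = [(k, v, 1) for k, v in after_set.items()] + \
             [(k, v, 2) for k, v in before_set.items()]

    # Steps 3-4: HshAfter and HshCommon equivalents
    hsh_after = set(after_set)
    hsh_common = hsh_after & set(before_set)

    # Steps 5-6: the merge is the list above; group by (col1, col2) and
    # keep the min and max of the flag column
    groups = {}
    for col1, col2, col3 in merged:
        lo, hi = groups.get((col1, col2), (col3, col3))
        groups[(col1, col2)] = (min(lo, col3), max(hi, col3))

    # Step 7: apply the four constraints from the list above
    inserts, updates, deletes, unchanged = [], [], [], []
    for (col1, col2), (mincol3, maxcol3) in groups.items():
        found = col1 in hsh_common          # the HshCommon lookup
        if not found and mincol3 == 1:
            inserts.append((col1, col2))
        elif found and maxcol3 == 1:
            updates.append((col1, col2))
        elif not found and mincol3 == 2:
            deletes.append((col1, col2))
        elif found and mincol3 != maxcol3:
            unchanged.append((col1, col2))
        # anything else (the old image of an updated key) is dropped
    return inserts, updates, deletes, unchanged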

I'll explain with data:

Assume my Afterset has

col1 col2
--------------
10 XX
20 YY
30 ZZ
40 ZX
60 AA

My BeforeSet is

Col1 col2
----------------
10 XX
20 YY
30 ZZ
40 ZY
50 AB

If you look at the Afterset, row 40 has changed from ZY to ZX, row 60 is an insert, and row 50 from the BeforeSet is not in the Afterset, which means it's a delete.

Data in TRXAfter

col1 col2 col3
---------------------
10 XX 1
20 YY 1
30 ZZ 1
40 ZX 1
60 AA 1

Data in TRXBefore

col1 col2 col3
---------------------
10 XX 2
20 YY 2
30 ZZ 2
40 ZY 2
50 AB 2

Data in HshAfter

col1
----
10
20
30
40
60

Data in HshCommon

Col1
-----
10
20
30
40

Data in Merge

col1 col2 col3
---------------------
10 XX 1
20 YY 1
30 ZZ 1
40 ZX 1
60 AA 1
10 XX 2
20 YY 2
30 ZZ 2
40 ZY 2
50 AB 2

Data After Aggregator

col1 col2 mincol3 maxcol3
-------------------------------------
10 XX 1 2
20 YY 1 2
30 ZZ 1 2
40 ZX 1 1
60 AA 1 1
40 ZY 2 2
50 AB 2 2

In TRXIdentify, when we apply the constraints:

The insert is row 60, the delete is row 50, and the update is row 40. All remaining rows are unchanged.
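
Running the earlier sketch on this sample data gives the same answer (a quick check; after_set and before_set are just the tables above rendered as Python dicts):

Code: Select all

after_set  = {10: "XX", 20: "YY", 30: "ZZ", 40: "ZX", 60: "AA"}
before_set = {10: "XX", 20: "YY", 30: "ZZ", 40: "ZY", 50: "AB"}

inserts, updates, deletes, unchanged = classify(after_set, before_set)
print(inserts)    # [(60, 'AA')]  -> the insert
print(updates)    # [(40, 'ZX')]  -> the update
print(deletes)    # [(50, 'AB')]  -> the delete
print(unchanged)  # [(10, 'XX'), (20, 'YY'), (30, 'ZZ')]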

What do you think of this? I know it is not very customizable, but I just wanted to show an approach.
clshore
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

Hehe,
Mike and I have conversed at length on this subject.
But in general, for records much larger than 4 bytes, the chance of a CRC collision between two arbitrary records is at least 1 in 4,294,967,296.
And the larger your record set, the greater the chance of a collision.

The math *is* complex, but interesting. Google 'Birthday Paradox', or look here:

http://mathworld.wolfram.com/BirthdayProblem.html

It's not that the CRC is 'wrong', it's that two different records will yield the same computed CRC value.
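
For a concrete sense of how quickly that risk grows with record-set size, here is the standard birthday-bound approximation (a rough sketch in Python, assuming the CRC behaves like a uniformly random 32-bit hash):

Code: Select all

import math

def collision_probability(n, bits=32):
    """Birthday-bound estimate of P(at least one collision) among n values
    drawn uniformly at random from a space of 2**bits."""
    return 1 - math.exp(-n * (n - 1) / (2 * 2 ** bits))

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {collision_probability(n):.2%} chance of at least one collision")

Under that assumption it works out to roughly a 1% chance at 10,000 rows, about 69% at 100,000 rows, and a near certainty at 1,000,000 rows, which is why the size of the record set matters far more than the 1-in-4-billion figure suggests.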

If a missed update is no biggie, then don't sweat it.

Carter
mhester wrote:What Alan and Shawn recommend is {probably} the best practice regarding the usage of CRC32 and SCD processing. You need not be concerned about the likelihood of "collisions" for the reasons that Alan pointed out plus some fairly complex math. The likelihood of a failure to generate the correct CRC for a given row, thereby incorrectly updating or not updating a particular row, is 1 in 4,294,967,296. This does not mean that if you process 4,294,967,296 rows of data in a given run a failure will happen (and in all likelihood you will not see a failure in a recordset of this size).
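
For anyone who wants to experiment outside DataStage, the per-row CRC being discussed can be approximated with any standard CRC-32 implementation, for example Python's zlib. This is purely an illustration of the idea; the column concatenation and delimiter here are made up, not the exact byte layout the CRC32 transform uses.

Code: Select all

import zlib

def row_crc(row, sep="|"):
    """Join the row's columns with a delimiter and take a CRC-32 of the bytes.
    Two different rows can still hash to the same value -- that is the
    collision risk discussed above."""
    return zlib.crc32(sep.join(str(col) for col in row).encode("utf-8"))

old_row = (40, "ZY")
new_row = (40, "ZX")
print(row_crc(old_row), row_crc(new_row), row_crc(old_row) != row_crc(new_row))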