Binary_Checksum in SQL Server 2000 for data change detection

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Binary_Checksum in SQL Server 2000 for data change detection

Post by jdmiceli »

Hello to all you DataStage Uber-gurus!

I am still in self-paced training (which essentially means I'm reading the documentation for DataStage and postings on this wonderful website :oops: ) and I have not been using the tool for very long. I appreciate all the input I have received over the course of the last couple of months and I'm hoping someone can help me now. It is also possible this posting might help other SQL Server 2000 users.

I have been trying to get a handle on how to make data change detection work within the framework I have, which is a very basic, untransformed pull of data from a database on a production server to a copy on a data repository server that has a different collation. I understand the theory of data change detection, but I haven't yet figured out how to implement it in DataStage.

I was researching something for a stored procedure I needed to write for another task at work and stumbled across a reference to the Binary_Checksum() function in SQL Server 2000. In my initial tests it has been incredibly fast at the row level, but I'm wondering whether it will truly stay efficient when run against tables with millions of rows. Depending on which company's databases I'm processing (I have 16 companies to monitor and process, all with the same schema), I can have several million rows in each table, and always growing.
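
For what it's worth, here is a minimal sketch of the kind of extract I have been testing (the table and column names are just placeholders, not my real schema). BINARY_CHECKSUM(*) returns one integer per row, so it can ride along with the normal extract columns:

-- Placeholder names only; the point is the extra checksum column.
-- Note: BINARY_CHECKSUM skips noncomparable columns (text, ntext, image),
-- and collisions are possible, so it flags "probably changed" rather than
-- giving an absolute guarantee.
SELECT
    CustomerID,                          -- natural key
    LastName,
    CreditLimit,
    BINARY_CHECKSUM(*) AS RowChecksum    -- row-level change indicator
FROM dbo.Customer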

Does anyone have any input on whether this is doable and/or what the best way to go about it would be?

I believe I am following Ken Bland's general practices in creating the process:
1. Extract source data to sequential file
2. Extract natural keys from target into hashed file
3. Process sequential file and break rows out to insert and update files based on key comparisons.
4. Process insert file using bulk load
5. Process update file using OLE or ODBC or whatever gets me the best performance.

My initial thought is to add one field to steps 1 and 2 that would hold the binary_checksum value for each row, but I see an issue in that I would still be pushing through all the rows that shouldn't be processed at all because nothing in them changed.
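
In other words, using the same placeholder names as above, both extracts would carry the checksum column so the comparison in step 3 can be checksum-to-checksum instead of column-by-column:

-- Step 1 (source side): extract to the sequential file, checksum included.
SELECT CustomerID, LastName, CreditLimit,
       BINARY_CHECKSUM(*) AS RowChecksum
FROM dbo.Customer

-- Step 2 (target side): natural key plus checksum into the hashed file.
SELECT CustomerID,
       BINARY_CHECKSUM(*) AS RowChecksum
FROM dbo.Customer

One thing I have not verified: since the repository copy has a different collation, I'm not sure whether the same character data would produce the same checksum on both sides, so that is something I would want to test before trusting the comparison.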

Maybe it would be better to process the hashed file first (see, I have been paying attention), then use it as a limiting factor on which source rows get pulled. Does that make sense?
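
The database-side equivalent of that idea, purely as a sketch, would be to stage the target's key/checksum pairs back onto the source server (into a hypothetical work table like dbo.Customer_Checksums) and let the extract query itself filter down to new or changed rows:

-- Sketch only; assumes the target's (key, checksum) pairs have been staged
-- into dbo.Customer_Checksums on the source server.
SELECT s.*
FROM ( SELECT CustomerID, LastName, CreditLimit,
              BINARY_CHECKSUM(*) AS RowChecksum
       FROM dbo.Customer ) AS s
LEFT JOIN dbo.Customer_Checksums AS t
       ON t.CustomerID = s.CustomerID
WHERE t.CustomerID IS NULL             -- new row     -> insert file
   OR t.RowChecksum <> s.RowChecksum   -- changed row -> update file

I don't know whether staging that table back is practical in my environment, so treat it as an illustration of the idea rather than a plan.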

I would appreciate your input on this :) .
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"