Generic Data Reconciliation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Generic Data Reconciliation

Post by ray.wurlod »

I have been given a task with which I'd appreciate some guidance. The job is to move a large number of tables from Oracle and Microsoft SQL Server into another database (to be determined). The requirements are to perform each of the following with a generic job in each case (that is, one job that can handle any table) wrapped in a sequence that traverses a list of table names:
(1) to generate CREATE TABLE statements by reading metadata from the source systems' system tables, and to execute those statements
(2) to move all the rows of data from source to target (using Runtime Column Propagation)
(3) to reconcile the data transfer in each of three ways
__(a) verify identical row count in source and target tables
__(b) verify identical checksum total of rows in source and target tables
__(c) compare sums of all numeric columns and sums of lengths of all non-numeric columns
I have been able to do all except (3)(c). It is about this task that I am seeking ideas.
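For context, (3)(a) and (3)(b) might be handled with parameterized queries of roughly this shape, run identically against source and target (a sketch only - the #jp...# job parameters follow the convention used later in this thread, and the checksum function is database-specific):

-- (3)(a): identical row counts
SELECT COUNT(*) AS row_count FROM #jpTableName#

-- (3)(b): a checksum total over all rows; the function varies by database,
-- e.g. CHECKSUM_AGG(CHECKSUM(*)) on SQL Server or SUM(ORA_HASH(...)) on Oracle
SELECT CHECKSUM_AGG(CHECKSUM(*)) AS table_checksum FROM #jpTableName#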
Last edited by ray.wurlod on Sun Jan 30, 2011 1:26 pm, edited 1 time in total.

Post by ray.wurlod »

I'm rather hoping that someone else has already solved this problem.

Please note that this was presented as a technical challenge, so my normal "resist stupid requirements" advice cannot apply. If necessary I could probably argue for a two- or three-job solution to (3)(c), but definitely not for a separate job for each table (the whole point is "generic").

My early thinking is to create generic field names and make use of the fact that the Modify stage can use job parameters for field names in Specification expressions. Since the field names (Field001, Field002, etc.) would correspond to the column order in the CREATE TABLE statement, that would suffice for reconciliation - at worst, since that sequential order had been picked up from the source systems' system tables, it would be possible to map back to the original column names.
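As a sketch only of how that positional mapping could be derived on the Oracle side (ALL_TAB_COLUMNS and COLUMN_ID are genuine Oracle catalog names; the Field00n formatting is purely illustrative):

SELECT table_name,
       column_name,
       'Field' || LPAD(column_id, 3, '0') AS generic_name
FROM   all_tab_columns
WHERE  table_name = '#jpTableName#'
ORDER  BY column_id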
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Would a custom operator be out of scope? One could certainly traverse the incoming record schemas to identify data types and sum accordingly.

What options are open to you?
- james wiles


All generalizations are false, including this one - Mark Twain.

Post by ray.wurlod »

All options are open.

I think this can be resolved without a custom operator, however. The solution has to be left with, and maintained by, the client, where there are no C++ skills.

The outline of the design is a sequence to traverse the list of table names, and a (perhaps multi-instance) subsequence to traverse the list of column names for each table. The lists are obtained from system tables, and column information including data type, precision and scale is obtained along with the column names.

The job that performs the comparison will use a parameterized SQL statement of the form:
SELECT SUM(#jpScale# * 100 * #jpColumnName#) FROM #jpTableName#
or
SELECT SUM(CHAR_LENGTH(#jpColumnName#)) FROM #jpTableName#
(the first for numeric data types, the second for non-numeric data types).
I'll have to test how these work with nulls, but handling them would be a trivial addition.
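For what it's worth, SUM already ignores NULLs, so the main risk is an all-NULL column or an empty table returning NULL rather than zero; wrapping the aggregate in COALESCE would cover that. A sketch, keeping the same job-parameter convention and omitting the scale multiplier:

SELECT COALESCE(SUM(#jpColumnName#), 0) FROM #jpTableName#
SELECT COALESCE(SUM(CHAR_LENGTH(#jpColumnName#)), 0) FROM #jpTableName#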
The two results would then be joined on table name and column name and compared in a Transformer to yield a "Pass/Fail" result, and the names, totals and results written into a table.

That's my initial plan.
Can you see any flaws?

There are about a thousand tables to do, varying between five and about sixty columns. There are no BLOB columns (yet) but, even if there were, I think that the database should still be able to report the length of one of these.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

Ok, I see where you're headed with that. I hope that's not one job per database|table|column.

We did something similar to that (building a SQL query) using a BASIC routine called in the parameter fields of a Job Activity. You should be able to give a routine the table schema and it could spit out the SQL.

One possibility is a single SQL statement per table with all the sums in different columns, but you would then need to pivot that into single rows of table|column|sum as you're thinking.

Or, the SQL could be a number of unions of selects, each of which produces output for a single table|column combination. This wouldn't need the pivot, at least.
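A sketch of that union-of-selects shape (the table and column names are made up), with each branch yielding a table|column|sum row:

SELECT 'CUSTOMER' AS table_name, 'CUST_ID' AS column_name, SUM(CUST_ID) AS col_total
FROM   CUSTOMER
UNION ALL
SELECT 'CUSTOMER', 'CUST_NAME', SUM(CHAR_LENGTH(CUST_NAME))
FROM   CUSTOMER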

Would you want to dump the results for each table into text files, then read them as filesets (read pattern as fileset environment variable) to do the final join/comparison between databases?

Anything seem useful?
- james wiles


All generalizations are false, including this one - Mark Twain.

Post by ray.wurlod »

Thanks for the ideas, and the brainwork behind them.

The point of parameterized SQL is "one job to do any one column in any one table" executed many times to do all columns in all tables.

For reconciliation the same query must be run against source and target tables, and the results compared. To that end, the results table will use a large Decimal data type.

There may be a nice solution along similar lines using a server job. Hmm, will have to investigate that. I wonder whether Connectors used in server jobs will still support the SQL statement in a job parameter?

The argument against a BASIC routine is the same as that against a C++ routine; no in-house skills to maintain it. (Thinks: maybe I can sell them my "Programming with DataStage BASIC" class, also available through DSXchange Learning Center.)

Results will be written into a database table keyed by table name and column name, containing the two sums, their difference and a Pass/Fail result. This can be queried, filtered or grouped by table name for reporting purposes.
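Something along these lines for that results table (a sketch only - the table and column names are illustrative; only the wide Decimal and the tablename/columnname key come from the description above):

-- Hypothetical results table
CREATE TABLE RECON_RESULT (
    TABLE_NAME   VARCHAR(128) NOT NULL,
    COLUMN_NAME  VARCHAR(128) NOT NULL,
    SOURCE_TOTAL DECIMAL(38,4),
    TARGET_TOTAL DECIMAL(38,4),
    DIFFERENCE   DECIMAL(38,4),
    RESULT       CHAR(4),            -- 'Pass' or 'Fail'
    PRIMARY KEY (TABLE_NAME, COLUMN_NAME)
)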
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

I understand the constraint on Basic and C++ usage. That's unfortunate but we've both faced it before.

Can't find any logical flaws with the parameterized SQL approach. I would guess you'll be using ODBC connectors for simplicity, and that the comparisons run as part of the processing for a table.

Are there any error handling or auditing requirements at the processing level? Abort the sequence if x happens? % of job failures?

Would it be too much for them to run the comparisons multi-instance? Hmm, could possibly run into contention on inserts there, though. Just thinking of ways to reduce the overall time required without going beyond their capability to support.
- james wiles


All generalizations are false, including this one - Mark Twain.
ppgoml
Participant
Posts: 58
Joined: Mon Aug 20, 2007 11:00 pm

Post by ppgoml »

Ray,

Can you also give me a hint on how you implemented (1) and (2) in DataStage? My client has the same requirements.

Thanks for sharing.

Post by ray.wurlod »

Summary of techniques follows. Obviously the actual queries on the source system tables will vary with database type.

(1) For a given (job parameter) table name, read all the column names (excluding synonyms) with the corresponding data type, character maximum length, precision, scale and nullability and, if available, comment. Look up (from a manually constructed file) any data types that need transformation (for example, Oracle VARCHAR2 becomes VARCHAR in other databases). Construct the column definition clause in a Transformer stage. Terminate each column definition with a comma unless on the last row, in which case terminate with ");". Detect the last row with the LastRow() function - new in version 8.5. Use a before-job subroutine to execute echo statements to put the CREATE TABLE clause into the target file; the job appends to this file.
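On the Oracle side, the metadata read might look roughly like the following (ALL_TAB_COLUMNS, ALL_COL_COMMENTS and their column names are genuine Oracle catalog objects; the equivalent query for SQL Server would use its own system views):

SELECT c.column_name,
       c.data_type,
       c.char_length,
       c.data_precision,
       c.data_scale,
       c.nullable,
       m.comments
FROM   all_tab_columns c
       LEFT JOIN all_col_comments m
              ON  m.owner       = c.owner
              AND m.table_name  = c.table_name
              AND m.column_name = c.column_name
WHERE  c.table_name = '#jpTableName#'
ORDER  BY c.column_id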

(2) Point a Connector at the source, point a Connector at the target and put a Transformer stage between them. Specify generated SQL "Yes" in both Connectors, enable runtime column propagation and do not load table definitions. You may need to fiddle with array size and transaction size for some ODBC data targets.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by ray.wurlod »

Just to make life interesting, the customer has stored non-printing characters, including 0x00, in VARCHAR2 fields. Therefore the "purely generic read" had to be ruled out because these 0x00 characters prematurely truncated data.

The replacement approach, which is successful, is to generate SELECT and INSERT statements based on the Oracle system tables, applying RAWTOHEX for all string fields, and applying UNHEX to these when inserting into the target database.
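As an illustration only (the column names are made up, and the decode function on the target side depends on which database is chosen - UNHEX on some databases, HEXTORAW or a cast on others), the generated pair might look something like:

-- Source side (Oracle): hex-encode string columns so embedded 0x00 bytes survive
SELECT RAWTOHEX(CUST_NAME), CUST_BALANCE FROM CUSTOMER

-- Target side: decode the hex back to the original bytes on insert
INSERT INTO CUSTOMER (CUST_NAME, CUST_BALANCE) VALUES (UNHEX(?), ?)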

The SQL statements for CREATE TABLE, SELECT and INSERT were built in a Transformer stage, not in a BASIC routine. I was able to make use of the LastRow() function to avoid appending a comma to the final column in the CREATE TABLE statement. 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

Nice solution, Ray! I like it.

Cheers!
- james wiles


All generalizations are false, including this one - Mark Twain.
greggknight
Premium Member
Posts: 120
Joined: Thu Oct 28, 2004 4:24 pm

Post by greggknight »

One approach to think about would be using schema files with the job.
I had to read 100 different files from a DataMirror CDC output as the files arrived.
I did not want to have 100 jobs.

I wrote one job which, when run, used a parameter set to define which schema to use. This way I only needed one job. It is wrapped in a looping sequence and does make calls to routines.
Of course you would need a little more added

To simplify the schema-building process you could select your metadata from the system tables and build the schema file on the fly as a job in the sequence, based on parameters for #SERVER#, #DATABASE#, #TABLE#, ...

Your pre-counts would be the first job in the sequence and post-counts the last job.
Pre-counts would insert into a balancing table you create and post-counts would update the table.
Then the last job, outside the looping, would select all tables where the pre-count <> the post-count and create a sequential file which could then be emailed.
Just some ideas.
I know there's a little more to it.
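A sketch of that final mismatch query, against a hypothetical balancing table with pre_count and post_count columns:

SELECT table_name, pre_count, post_count
FROM   load_balance          -- hypothetical balancing table
WHERE  pre_count <> post_count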
"Don't let the bull between you and the fence"

Thanks
Gregg J Knight

"Never Never Never Quit"
Winston Churchill

Post by ray.wurlod »

Thanks, Gregg, but there's no Schema File property available with the ODBC Connector stage (nor with any of the other Connector stage types, as far as I am aware). Runtime Column Propagation looked after transmission of the columns for me.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

If you understood my completeness tables and jobs, then I would just build the above SQL statements and store them in my completeness tables. The reporting is designed to compare source to target numbers. You have to name each measure, so it could be table_col_length, table_col_sum and table_counts. Normally I always do counts and sums. Sums are usually done on fact tables.

The report comes out

Measure Name | Target Total | Source Total | Percent
Cust_Count | 97 | 100 | 97.00
Cust_Name_Len | 9,000,000 | 10,000,000 | 90.00

The comparisons can be summed at any time. The reports can be done at any time. There is a complete audit trail stored, with the SQL that was run and the number returned. You can build reports by groups if you like, so that all lengths are on one report and all counts on a different one.
Mamu Kim

Post by kduke »

I usually build the SQL in Access and push it into the real tables. I can write VB or SQL to help build my SQL statements. So in VB I build out my SQL from

SELECT SUM(CHAR_LENGTH(#jpColumnName#)) FROM #jpTableName#

You could use BASIC just as easily. Loop through all your table names and columns where the type is char or varchar and populate the ETL_QA_SQL table. When you run the completeness jobs, each SQL statement stored in ETL_QA_SQL gets run and the results are saved in ETL_QA_RESULT. The reports are run based on ETL_QA_RESULT_MAP. All this table stores is the name of the measure for the source side and the name for the target side. So each SQL statement has to have a measure name associated with it.
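To make that concrete, a populated row might look something like this (the ETL_QA_SQL column names here are guesses - the thread only gives the table names):

INSERT INTO ETL_QA_SQL (MEASURE_NAME, SQL_TEXT)
VALUES ('Cust_Name_Len', 'SELECT SUM(CHAR_LENGTH(CUST_NAME)) FROM CUSTOMER')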
Mamu Kim