Generic Data Reconciliation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Generic Data Reconciliation

Post by ray.wurlod »

I have been given a task with which I'd appreciate some guidance. The job is to move a large number of tables from Oracle and Microsoft SQL Server into another database (to be determined). The requirements are to perform each of the following with a generic job in each case (that is, one job that can handle any table) wrapped in a sequence that traverses a list of table names:
(1) to generate CREATE TABLE statements by reading metadata from the source systems' system tables, and to execute those statements
(2) to move all the rows of data from source to target (using Runtime Column Propagation)
(3) to reconcile the data transfer in each of three ways
__(a) verify identical row count in source and target tables
__(b) verify identical checksum total of rows in source and target tables
__(c) compare sums of all numeric columns and sums of lengths of all non-numeric columns
I have been able to do all except (3)(c). It is about this task that I am seeking ideas.
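For context, (3)(a) and (3)(b) might be handled with parameterized queries of roughly this shape, run identically against source and target (a sketch only - the #jp...# job parameters follow the convention used later in this thread, and the checksum function is database-specific):

-- (3)(a): identical row counts
SELECT COUNT(*) AS row_count FROM #jpTableName#

-- (3)(b): a checksum total over all rows; the function varies by database,
-- e.g. CHECKSUM_AGG(CHECKSUM(*)) on SQL Server or SUM(ORA_HASH(...)) on Oracle
SELECT CHECKSUM_AGG(CHECKSUM(*)) AS table_checksum FROM #jpTableName#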
Last edited by ray.wurlod on Sun Jan 30, 2011 1:26 pm, edited 1 time in total.

Post by ray.wurlod »

I'm rather hoping that someone else has already solved this problem.

Please note that this was presented as a technical challenge, so my normal "resist stupid requirements" advice cannot apply. If necessary I could probably argue for a two- or three-job solution to (3)(c), but definitely not for a separate job for each table (the whole point is "generic").

My early thinking is to create generic field names and make use of the fact that the Modify stage can use job parameters for field names in Specification expressions. Since the field names (Field001, Field002, etc.) would correspond to the column order in the CREATE TABLE statement, that would suffice for reconciliation - at worst, since that sequential order had been picked up from the source systems' system tables, it would be possible to map back to the original column names.
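As a sketch only of how that positional mapping could be derived on the Oracle side (ALL_TAB_COLUMNS and COLUMN_ID are genuine Oracle catalog names; the Field00n formatting is purely illustrative):

SELECT table_name,
       column_name,
       'Field' || LPAD(column_id, 3, '0') AS generic_name
FROM   all_tab_columns
WHERE  table_name = '#jpTableName#'
ORDER  BY column_id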
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Would a custom operator be out of scope? One could certainly traverse the incoming record schemas to identify data types and sum accordingly.

What options are open to you?
- james wiles


All generalizations are false, including this one - Mark Twain.

Post by ray.wurlod »

All options are open.

I think this can be resolved without a custom operator, however. The solution has to be left with, and maintained by, the client, where there are no C++ skills.

The outline of the design is a sequence to traverse the list of table names, and a (perhaps multi-instance) subsequence to traverse the list of column names for each table. The lists are obtained from system tables, and column information including data type, precision and scale is obtained along with the column names.

The job that performs the comparison will use a parameterized SQL statement of the form:
SELECT SUM(#jpScale# * 100 * #jpColumnName#) FROM #jpTableName#
or
SELECT SUM(CHAR_LENGTH(#jpColumnName#)) FROM #jpTableName#
(the first for numeric data types, the second for non-numeric data types).
I'll have to test how these work with nulls, but handling them would be a trivial addition.
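For what it's worth, SUM already ignores NULLs, so the main risk is an all-NULL column or an empty table returning NULL rather than zero; wrapping the aggregate in COALESCE would cover that. A sketch, keeping the same job-parameter convention and omitting the scale multiplier:

SELECT COALESCE(SUM(#jpColumnName#), 0) FROM #jpTableName#
SELECT COALESCE(SUM(CHAR_LENGTH(#jpColumnName#)), 0) FROM #jpTableName#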
The two results would then be joined on table name and column name and compared in a Transformer to yield a "Pass/Fail" result, and the names, totals and results written into a table.

That's my initial plan.
Can you see any flaws?

There are about a thousand tables to do, varying between five and about sixty columns. There are no BLOB columns (yet) but, even if there were, I think that the database should still be able to report the length of one of these.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

Ok, I see where you're headed with that. I hope that's not one job per database|table|column.

We did something similar to that (building a SQL query) using a BASIC routine called in the parameter fields of a Job Activity. You should be able to give a routine the table schema and it could spit out the SQL.

One possibility is a single SQL statement per table with all the sums in different columns, but you would then need to pivot that into single rows of table|column|sum as you're thinking.

Or, the SQL could be a number of unions of selects, each of which produces output for a single table|column combination. This wouldn't need the pivot, at least.
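A sketch of that union-of-selects shape (the table and column names are made up), with each branch yielding a table|column|sum row:

SELECT 'CUSTOMER' AS table_name, 'CUST_ID' AS column_name, SUM(CUST_ID) AS col_total
FROM   CUSTOMER
UNION ALL
SELECT 'CUSTOMER', 'CUST_NAME', SUM(CHAR_LENGTH(CUST_NAME))
FROM   CUSTOMER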

Would you want to dump the results for each table into text files, then read them as filesets (read pattern as fileset environment variable) to do the final join/comparison between databases?

Anything seem useful?
- james wiles


All generalizations are false, including this one - Mark Twain.

Post by ray.wurlod »

Thanks for the ideas, and the brainwork behind them.

The point of parameterized SQL is "one job to do any one column in any one table" executed many times to do all columns in all tables.

For reconciliation the same query must be run against source and target tables, and the results compared. To that end, the results table will use a large Decimal data type.

There may be a nice solution along similar lines using a server job. Hmm, will have to investigate that. I wonder whether Connectors used in server jobs will still support the SQL statement in a job parameter?

The argument against a BASIC routine is the same as that against a C++ routine; no in-house skills to maintain it. (Thinks: maybe I can sell them my "Programming with DataStage BASIC" class, also available through DSXchange Learning Center.)

Results will be written into a database table keyed by table name and column name, containing the two sums, their difference and a Pass/Fail result. This can be queried, filtered or grouped by table name for reporting purposes.
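Something along these lines for that results table (a sketch only - the table and column names are illustrative; only the wide Decimal and the tablename/columnname key come from the description above):

-- Hypothetical results table
CREATE TABLE RECON_RESULT (
    TABLE_NAME   VARCHAR(128) NOT NULL,
    COLUMN_NAME  VARCHAR(128) NOT NULL,
    SOURCE_TOTAL DECIMAL(38,4),
    TARGET_TOTAL DECIMAL(38,4),
    DIFFERENCE   DECIMAL(38,4),
    RESULT       CHAR(4),            -- 'Pass' or 'Fail'
    PRIMARY KEY (TABLE_NAME, COLUMN_NAME)
)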
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

I understand the constraint on Basic and C++ usage. That's unfortunate but we've both faced it before.

Can't find any logical flaws with the parameterized SQL approach. I would guess you'll be using ODBC connectors for simplicity, and that the comparisons run as part of the processing for a table.

Are there any error handling or auditing requirements at the processing level? Abort the sequence if x happens? % of job failures?

Would it be too much for them to run the comparisons multi-instance? Hmm, could possibly run into contention on inserts there, though. Just thinking of ways to reduce the overall time required without going beyond their capability to support.
- james wiles


All generalizations are false, including this one - Mark Twain.
ppgoml
Participant
Posts: 58
Joined: Mon Aug 20, 2007 11:00 pm

Post by ppgoml »

Ray,

Can you also give me a hint on how you implemented (1) and (2) in DataStage? My client has the same requirements.

Thanks for sharing.

Post by ray.wurlod »

Summary of techniques follows. Obviously the actual queries on the source system tables will vary with database type.

(1) For a given (job parameter) table name, read all the column names (excluding synonyms) with the corresponding data type, character maximum length, precision, scale and nullability and, if available, comment. Look up (from a manually constructed file) any data types that need transformation (for example, Oracle VARCHAR2 becomes VARCHAR in other databases). Construct the column definition clause in a Transformer stage. Terminate each column definition with a comma unless on the last row, in which case terminate with ");". Detect the last row with the LastRow() function - new in version 8.5. Use a before-job subroutine to execute echo statements to put the CREATE TABLE clause into the target file; the job appends to this file.
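On the Oracle side, the metadata read might look roughly like the following (ALL_TAB_COLUMNS, ALL_COL_COMMENTS and their column names are genuine Oracle catalog objects; the equivalent query for SQL Server would use its own system views):

SELECT c.column_name,
       c.data_type,
       c.char_length,
       c.data_precision,
       c.data_scale,
       c.nullable,
       m.comments
FROM   all_tab_columns c
       LEFT JOIN all_col_comments m
              ON  m.owner       = c.owner
              AND m.table_name  = c.table_name
              AND m.column_name = c.column_name
WHERE  c.table_name = '#jpTableName#'
ORDER  BY c.column_id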

(2) Point a Connector at the source, point a Connector at the target and put a Transformer stage between them. Specify generated SQL "Yes" in both Connectors, enable runtime column propagation and do not load table definitions. You may need to fiddle with array size and transaction size for some ODBC data targets.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by ray.wurlod »

Just to make life interesting, the customer has stored non-printing characters, including 0x00, in VARCHAR2 fields. Therefore the "purely generic read" had to be ruled out because these 0x00 characters prematurely truncated data.

The replacement approach, which is successful, is to generate SELECT and INSERT statements based on the Oracle system tables, applying RAWTOHEX for all string fields, and applying UNHEX to these when inserting into the target database.
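As an illustration only (the column names are made up, and the decode function on the target side depends on which database is chosen - UNHEX on some databases, HEXTORAW or a cast on others), the generated pair might look something like:

-- Source side (Oracle): hex-encode string columns so embedded 0x00 bytes survive
SELECT RAWTOHEX(CUST_NAME), CUST_BALANCE FROM CUSTOMER

-- Target side: decode the hex back to the original bytes on insert
INSERT INTO CUSTOMER (CUST_NAME, CUST_BALANCE) VALUES (UNHEX(?), ?)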

The SQL statements for CREATE TABLE, SELECT and INSERT were built in a Transformer stage, not in a BASIC routine. I was able to make use of the LastRow() function to avoid appending a comma to the final column in the CREATE TABLE statement. 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.

Post by jwiles »

Nice solution, Ray! I like it.

Cheers!
- james wiles


All generalizations are false, including this one - Mark Twain.
greggknight
Premium Member
Posts: 120
Joined: Thu Oct 28, 2004 4:24 pm

Post by greggknight »

One approach to think about would be using schema files with the job.
I had to read 100 different files from a DataMirror CDC output as the files arrived.
I did not want to have 100 jobs.

I wrote one job which, when run, used a parameter set to define which schema to use. This way I only needed one job. It is wrapped in a looping sequence and does make calls to routines.
Of course you would need a little more added

To simplify the schema-building process you could select your metadata from the system tables and build the schema file on the fly as a job in the sequence, based on parameters for #SERVER#, #DATABASE#, #TABLE#, ...

Your pre-counts would be the first job in the sequence and post-counts the last job.
Pre-counts would insert into a balancing table you create and post-counts would update the table.
Then the last job, outside the looping, would select all tables where the pre-count <> the post-count and create a sequential file which could then be emailed.
Just some ideas.
I know there's a little more to it.
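A sketch of that final mismatch query, against a hypothetical balancing table with pre_count and post_count columns:

SELECT table_name, pre_count, post_count
FROM   load_balance          -- hypothetical balancing table
WHERE  pre_count <> post_count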
"Don't let the bull between you and the fence"

Thanks
Gregg J Knight

"Never Never Never Quit"
Winston Churchill

Post by ray.wurlod »

Thanks, Gregg, but there's no Schema File property available with the ODBC Connector stage (nor with any of the other Connector stage types, as far as I am aware). Runtime Column Propagation looked after transmission of the columns for me.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

If you understood my completeness tables and jobs, then I would just build the above SQL statements and store them in my completeness tables. The reporting is designed to compare source to target numbers. You have to name each measure, so it could be table_col_length, table_col_sum and table_counts. Normally I always do counts and sums. Sums are usually done on fact tables.

The report comes out

Measure Name | Target Total | Source Total | Percent
Cust_Count | 97 | 100 | 97.00
Cust_Name_Len | 9,000,000 | 10,000,000 | 90.00

The comparisons can be summed at any time. The reports can be done at any time. There is a complete audit trail stored, with the SQL that was run and the number returned. You can build reports by groups if you like, so that all lengths are on one report and all counts on a different one.
Mamu Kim

Post by kduke »

I usually build the SQL in Access and push it into the real tables. I can write VB or SQL to help build my SQL statements. So in VB I build out my SQL from

SELECT SUM(CHAR_LENGTH(#jpColumnName#)) FROM #jpTableName#

You could use BASIC just as easily. Loop through all your table names and columns where the type is char or varchar and populate the ETL_QA_SQL table. When you run the completeness jobs, each SQL statement stored in ETL_QA_SQL gets run and the results are saved in ETL_QA_RESULT. The reports are run based on ETL_QA_RESULT_MAP. All this table stores is the name of the measure for the source side and the name for the target side. So each SQL statement has to have a measure name associated with it.
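To make that concrete, a populated row might look something like this (the ETL_QA_SQL column names here are guesses - the thread only gives the table names):

INSERT INTO ETL_QA_SQL (MEASURE_NAME, SQL_TEXT)
VALUES ('Cust_Name_Len', 'SELECT SUM(CHAR_LENGTH(CUST_NAME)) FROM CUSTOMER')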
Mamu Kim