wrong hash lookup..

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

scottr
Participant
Posts: 51
Joined: Thu Dec 02, 2004 11:20 am

wrong hash lookup..

Post by scottr »

We have a fact table with 200 million records. The previous developer built the job that loads this table using the wrong hashed file as a lookup, so one column in those 200 million rows is wrong. I now have the correct hashed file, but how can I update those 200 million records with the correct data?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Hello ScottR,

It sounds like you need to write a DataStage job to do this, or fix the original job and re-run it. Do you need to fix the hashed lookup file, the lookup algorithm, or the target table?
scottr
Participant
Posts: 51
Joined: Thu Dec 02, 2004 11:20 am

Post by scottr »

I need to fix the target table. The source is a flat file and there are several hashed lookup files. I have corrected the job and it has been working fine for the past two weeks, but my concern is how to fix the previously loaded data, which is around 189 million records.
ketfos
Participant
Posts: 562
Joined: Mon May 03, 2004 8:58 pm
Location: san francisco

Post by ketfos »

Hi,

You need to run the job selecting all records from the source to date, rather than limiting it with the WHERE criteria your daily update uses,
then match them against the hashed lookups before you populate the target.

This should be a one-time effort.

ketfos
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
An interesting dilemma.
Obviously, as already said, you need to correct the column.
Now there are several things to take into account:
is your table partitioned?
are there any indexes on it, and is any kind of update on it within reason and/or something that can be done?
what DB are you using?

Fact tables are usually partitioned. With that in mind, the approach that ensures the shortest downtime for users of the fact table is to build new partitions with the correct data, switch the partitions as each one is built, and eventually (after testing that the data is correct) delete the old partitions.

Using a bulk load is a good way to quickly rebuild the fact table if your daily job is not already using one (the database's own bulk load utility is generally preferred over the DS plug-ins).

If you specify your DB, time frames, a processing time estimate for building new partitions, an estimated update time if that is an option, and so on, people may be able to give advice more specific to your situation.
(For example, Sybase IQ is a killer at updating columns by a given key in large volumes.)

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
scottr
Participant
Posts: 51
Joined: Thu Dec 02, 2004 11:20 am

Post by scottr »

Hi Roy,
my DB is Oracle 9.2, my source is a flat file (which arrives once a week) and this job runs once a week. Creating new partitions is not an option. How do I update only the one column in the target?

thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Because hashed files operate by destructive overwrite, it is highly probable that writing to a hashed file with incorrect metadata (describing the key columns) has resulted in loss of data.
You have no safe option other than to re-run into a hashed file for which the metadata are correct.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
if you can't reproduce the fact from itself and the lookups, and you insist on updating the table...

you can update only that column by using the key columns to determine the rows to update, plus the column you actually want to update.
So let's say you have one column as the key column for the update and one column holding the new, correct data.
Using user-defined SQL, you can run an update that takes the key column and the update column as its input (both marked as key columns in the table definition of the update stage), with SQL along the lines of: UPDATE fact SET data = :2 WHERE key = :1
Do realize that something like this has the potential to overflow your rollback segment if performed on all the data in your fact table, and that even a single-row update might take a long time to complete.

Be advised that, IMHO, you're better off rebuilding your partitions and performing a partition switch operation.
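
In Oracle terms that partition switch is an exchange. Here is a rough sketch, assuming a range-partitioned fact table; the table, partition, and staging table names are invented for illustration, not taken from the thread:

Code: Select all

-- build the corrected rows for one partition into a plain staging table
-- (bulk loaded by the corrected DataStage job or SQL*Loader), then swap it in
ALTER TABLE fact_table
  EXCHANGE PARTITION p_200501 WITH TABLE fact_table_p_200501_fix
  INCLUDING INDEXES WITHOUT VALIDATION;

-- after verifying the swapped-in data, the old rows (now sitting in the
-- staging table) can be dropped
DROP TABLE fact_table_p_200501_fix;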

Good Luck,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Fixing your weekly load process and fixing your already loaded data are two different problems. Fixing the weekly process is easy and you probably already know how to do that.

Fixing the existing data is probably going to be more fun. In this case you're going to want to build a throwaway process. Fetch the primary key and the replacement column value and load them into a throwaway work table. Take advantage of parallel DML and simply update the existing bad value in the fact table with the correct value from the work table. By putting together a work table with a supporting index, you can write a correlated update statement that will fix your data efficiently and easily.

If I had a 500 million row table with a bad column value, I would fetch the affected keys and the good column value from the source and put them into a work table. The correlated update statement is easy:

Code: Select all

update bigtable set badcolumn=(select goodcolumn from worktable where bigtable.key = worktable.key)
where key in (select key from worktable)
If you can't handle the volume of updates, then range the query to handle something like a month at a time. You'll know how big a bite you can handle.
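
As a hedged sketch of those two ideas combined, assuming some date-based column exists to range on (the ranging column and parallel degree are placeholders; note that on this Oracle release a parallel UPDATE generally requires a partitioned target):

Code: Select all

-- enable parallel DML for the session
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(bigtable, 4) */ bigtable
SET    badcolumn = (SELECT goodcolumn FROM worktable
                    WHERE  worktable.key = bigtable.key)
WHERE  bigtable.key IN (SELECT key FROM worktable)
AND    bigtable.load_month = 200501;   -- hypothetical ranging column, one month per run

COMMIT;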
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

From a private message:
I already fixed the weekly load.

Here is the actual situation:
the fact table has 'claim_key' as its key column, and a 'part_key' column with wrong values.
There is a wrong hash file with 'part_key' and 'part_no' as fields, and a correct hash file with the same two fields.

After loading the work table with the key column and the wrong column, how do I update the work table using these two hash files?

I do not understand what you mean by two hash files. Please use EXACT names of tables and files and describe the primary key of each so that we can understand the relationships.

If your fact table XXXX has a dimension foreign key "PART_KEY" that is the wrong value, then you need to update the fact table XXXX with the correct foreign key value. Since it looks to be a surrogate key for PART_NO, you have to do a surrogate key substitution when correcting.

In order to do this, you must have the fact table XXXX CLAIM_KEY and the correct PART_NO. You have some design options now.

I would create a work table to hold the CLAIM_KEY, PART_KEY (null), and PART_NO and bulk load the original source data values. Then update it with the PART_KEY value from your dimension based on the PART_NO. I'd unique index the table on CLAIM_KEY for performance. Then, update your fact table using a correlated update statement like:

Code: Select all

UPDATE XXXX SET PART_KEY = (SELECT PART_KEY FROM worktable WHERE worktable.CLAIM_KEY = XXXX.CLAIM_KEY)
WHERE XXXX.CLAIM_KEY IN (SELECT CLAIM_KEY FROM worktable)
This solution can be the fastest because it bulk loads the source data into the work table and can use parallel DML to update both the work table and the fact table.
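
A minimal sketch of that work table set-up; the work table name, the datatypes, and the dimension table name (part_dim) are assumptions, not names confirmed in the thread:

Code: Select all

CREATE TABLE part_fix_work (
  claim_key NUMBER,
  part_key  NUMBER,          -- left null at bulk load time
  part_no   VARCHAR2(30)
);

-- bulk load claim_key / part_no from the original source files, then pick up
-- the correct surrogate key from the part dimension
UPDATE part_fix_work w
SET    w.part_key = (SELECT d.part_key FROM part_dim d
                     WHERE  d.part_no = w.part_no);

CREATE UNIQUE INDEX part_fix_work_ux ON part_fix_work (claim_key);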

A second solution is to use DataStage to stream your original data in a job and do a hash lookup from PART_NO to PART_KEY and write out an update file of CLAIM_KEY, PART_KEY. Then, you can update the data using a simple job. Or, bulk load that update data into a work table and then use the above correlated update. Or, using that update file, create an external table definition (you're on 9i) and again use the same correlated update. The problem with that last solution is that you can't put an index on the table, which absolutely will be needed.
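
For the external table variant, a rough 9i-style definition over that update file; the directory path, file name, and delimiter are placeholders:

Code: Select all

CREATE DIRECTORY fix_dir AS '/data/fix';   -- needs CREATE ANY DIRECTORY privilege

CREATE TABLE part_fix_ext (
  claim_key NUMBER,
  part_key  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY fix_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('claim_part_fix.csv')
);
-- usable directly in the correlated update above, but, as noted, it cannot be indexed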
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
scottr
Participant
Posts: 51
Joined: Thu Dec 02, 2004 11:20 am

Post by scottr »

fact_claim is the fact table. Its key column is 'claim_key', and it has a 'part_key' column containing wrong data. The developer before me designed the load job using a hashed file with wrong data as the lookup: that hash file has 'part_key' and 'part_no' fields, but all of its 'part_key' values are wrong, and they were loaded into the fact table (around 190 million records). I also have another hash file with the same two fields but with correct data.
I designed a job that uses the part_key from the fact table as a lookup against the wrong hash file to fetch part_no, then uses that part_no as a lookup against the correct hash file to fetch the correct part_key, and writes the result to flat files split by model_year so that I can bulk load. But somehow I am not getting the part_no back from the first file.
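
The hashed files themselves live outside Oracle, but the remapping logic scottr describes is, in SQL terms, just a two-step join; purely for illustration, with wrong_hash and correct_hash standing in for the two lookup sets as if they had been landed as tables:

Code: Select all

SELECT f.claim_key,
       c.part_key AS corrected_part_key
FROM   fact_claim   f,
       wrong_hash   w,      -- bad part_key -> part_no
       correct_hash c       -- part_no -> good part_key
WHERE  w.part_key = f.part_key
AND    c.part_no  = w.part_no;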
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

If your hash lookup is not finding rows, then either the primary key metadata in the job definition is wrong, or you have a trimming or text-case issue to deal with. Just like SQL, the hash lookup is an EXACT literal match: "XXX " is not equal to "XXX", and "XXX" is not equal to "xxx".
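
kcbland's point is about the hashed-file lookup inside the job, but the same exact-match rule can be checked on the database side once the data is landed; a purely illustrative query (part_fix_work and part_dim are hypothetical staging and dimension tables) that flags keys which only match after trimming and upper-casing:

Code: Select all

SELECT w.part_no
FROM   part_fix_work w
WHERE  NOT EXISTS (SELECT 1 FROM part_dim d
                   WHERE  d.part_no = w.part_no)                            -- no exact match
AND        EXISTS (SELECT 1 FROM part_dim d
                   WHERE  UPPER(TRIM(d.part_no)) = UPPER(TRIM(w.part_no))); -- match after cleanup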
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle