
Updating records in a hashed file

Posted: Mon Sep 25, 2006 4:55 pm
by aditya
All,

I have the following job design:

DB2 Stage ----------->IPC-------->SeqFile

SeqFile---------------->Transformer-------->HashedFile

The DB2 stage contains a user-defined query that joins two tables. Ten million records are extracted and then written to the hashed file.

While trying to do incremental loads into the hashed file, the problem is that all the records still have to be fetched through the DB2 stage and looked up against the previously generated hashed file (say ABC) so that the hashed file (ABC) can be refreshed.

Both tables have an update timestamp field defined, but that is a problem in itself: an update on one table doesn't necessitate an update on the other, and an insert into one table doesn't mean that a new record is also inserted into the other table.

Any suggestions please.

Thanks
Aditya.

Posted: Mon Sep 25, 2006 5:07 pm
by kris007
From what I understand, you are having difficulty retrieving the last-updated timestamp for incremental loads because you are looking at two date fields from two different tables. Am I correct? If that is the case, you might want to calculate the maximum of the two dates from the two tables and use the greater of them to drive the incremental load. If that is not the case, please explain it again, as I can't make out anything more than this.

Also, what I didn't understand was this:
aditya wrote: While trying to do incremental loads into the hashed file, the problem is that all the records still have to be fetched through the DB2 stage and looked up against the previously generated hashed file (say ABC) so that the hashed file (ABC) can be refreshed.
Why would you refresh the ABC hashed file? What exactly do you have in that file?
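
On the first point, here is a minimal sketch of deriving a single "greatest update timestamp" from the two joined tables. The table and column names are hypothetical, and the COALESCE calls guard against the NULLs a full outer join can produce:

    -- Hypothetical names: UPD_TS is each table's update timestamp column.
    SELECT  COALESCE(T1.KEY_COL, T2.KEY_COL) AS KEY_COL,
            CASE WHEN COALESCE(T1.UPD_TS, TIMESTAMP('0001-01-01-00.00.00.000000')) >=
                      COALESCE(T2.UPD_TS, TIMESTAMP('0001-01-01-00.00.00.000000'))
                 THEN T1.UPD_TS
                 ELSE T2.UPD_TS
            END AS LAST_UPD_TS
    FROM    SCHEMA1.TABLE_ONE T1
    FULL OUTER JOIN SCHEMA1.TABLE_TWO T2
         ON T1.KEY_COL = T2.KEY_COL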

Posted: Mon Sep 25, 2006 5:21 pm
by aditya
Kris,

Thanks for the reply:

The hashed file ABC that is used for the lookup is the same hashed file that has to be fed by the incremental loads.

Thanks
Aditya.

Posted: Mon Sep 25, 2006 5:45 pm
by kris007
You didn't answer the first question. Assuming you agree with what I asked, the best way I have found to do incremental loading is this: create a separate job that gets the last-updated timestamp from your target table or hashed file and stores it in a separate hashed file. You can then pass that date into the user-defined SQL in your DB2 stage as a parameter from within a job sequence, using the UtilityHashLookup routine. There are quite a number of posts describing this process and how to use the UtilityHashLookup routine; a search should find them.

To get the maximum date from your source data, you need to do what I explained in the earlier post. I have done this successfully in the past.
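
A minimal sketch of what the parameterized user-defined SQL might look like. The table and column names are hypothetical, and #LAST_LOAD_TS# stands for the job parameter the sequence fills in from the "last run" hashed file:

    -- Pull only the rows touched since the previous load.
    SELECT  T1.KEY_COL,
            T1.SOME_COL,
            T2.OTHER_COL
    FROM    SCHEMA1.TABLE_ONE T1
    JOIN    SCHEMA1.TABLE_TWO T2
         ON T1.KEY_COL = T2.KEY_COL
    WHERE   T1.UPD_TS > TIMESTAMP('#LAST_LOAD_TS#')
       OR   T2.UPD_TS > TIMESTAMP('#LAST_LOAD_TS#')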

IHTH

Posted: Mon Sep 25, 2006 5:51 pm
by aditya
Kris,

The two tables being joined each have an update timestamp field.

When I bring in the records which have been either updated or inserted in both of those tables and then do a full outer join on them, some of the records might have empty values for the fields.

-- Aditya.

Posted: Mon Sep 25, 2006 10:33 pm
by loveojha2
aditya wrote: When I bring in the records which have been either updated or inserted in both of those tables and then do a full outer join on them, some of the records might have empty values for the fields.

-- Aditya.
Can you share with us the structure of the hashed file (especially the key structure and which DB table the keys come from)?

The problem is very similar to what is commonly known as insert/update anomalies, applied to your hashed file design.

How do you plan to handle it? By creating a separate hashed file, maybe? :?

We would need more information to help you on this.

Posted: Mon Sep 25, 2006 11:21 pm
by ray.wurlod
Look at increasing the transaction isolation level in DB2 or find some other means for guaranteeing a consistent snapshot. That done, you can maintain the hashed file in "real time" using a pair of hashed file stages plus "disabled, lock for update" as the read cache setting.
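
For example, the isolation level can be pinned on the extraction statement itself in DB2 (a sketch only; table and column names are hypothetical):

    SELECT  T1.KEY_COL, T1.UPD_TS
    FROM    SCHEMA1.TABLE_ONE T1
    WITH RR   -- repeatable read, for a stable snapshot during the extract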

The job design in the original post - is this two separate jobs, or one job containing two independent streams? If the latter, how do you synchronize reading from and writing to the text file?

Can you specify (in English, ignoring the fact that DataStage might be involved) what the desired result is to be? That is, can you create some form of target-from-source mapping document, and post the gist of it here?