Hash File Architecture 101: Persistent vs. Rebuild

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I would go for option 2, just to reduce the load on your server. Keep option 1 available in case something happens to your hash file, and schedule a regular full rebuild, say once a month, just to ensure it stays up to date.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I go with #1. Your ETL tool should never be a repository of "production" data. Your migration path is affected by this design choice. Furthermore, your hash file state is guaranteed under #1, but under #2 you always have to deal with recovering it if a jobstream fails after the hash file has already been modified and you elect to cancel the run instead of finishing it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
tomengers
Participant
Posts: 167
Joined: Tue Nov 19, 2002 12:20 pm
Location: Key West

Post by tomengers »

kcbland wrote: ... Your migration path is affected by this design choice.
Thanks for the feedback guys ...

Ken, would appreciate a little more detail on how this approach will affect migration.

... tom
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If your design choice is to have hash files resident in the project, then you are at the mercy of where your code lives. If you instead go with explicitly pathed hash files, you have insulated yourself from the effects of switching from one project to another.
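
Here's a rough illustration of the difference in a routine (just a sketch; the parameter and file names like HashFilePath and CUST_LKP are placeholders, not anything from your jobs). A project-resident hash file is opened through the project's VOC with OPEN, while an explicitly pathed one is opened with OPENPATH and behaves the same no matter which project the code runs in:

Code:

      * Body of a DS BASIC transform function with arguments HashFilePath, CustKey.
      * HashFilePath would normally come in via a job parameter, e.g. /etl/lookups.
      Ans = ""
      * Project-resident style (resolved through the current project's VOC):
      *    OPEN "CUST_LKP" TO LkpFile ELSE ...
      * Explicitly pathed style (independent of the project):
      OPENPATH HashFilePath : "/CUST_LKP" TO LkpFile THEN
         READ Rec FROM LkpFile, CustKey THEN
            Ans = Rec<1>        ;* first attribute of the lookup record
         END ELSE
            Ans = ""            ;* key not found
         END
         CLOSE LkpFile
      END ELSE
         Call DSLogWarn("Cannot open hash file under " : HashFilePath, "CustLookup")
      END

(In a real routine you'd cache the file handle in COMMON rather than opening it on every call, but this shows the pathing difference.)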

If you take a release-oriented approach to rolling out new versions of your ETL code, projects are a fantastic way to do this. You could have a project named prod_3_0, with prod_3_1 as the next release. You can simply start executing the new release out of the new project on implementation day. Thus, your code can be laid down ahead of time (the difference between migration and implementation) and activated later. However, if your hash files are tied to the project, your implementation plan has to include converting the hash files over.

Not a big deal, but hash files created under version 5.1 won't work under version 7.0. I believe the underlying technology had some changes going from UniVerse to the DSEngine code. (I think.)

Your migration would require you to prepare your datasets every time you migrate. If your jobstream takes care of preparing all lookups fresh with each run, then you have an elegant architecture: you can deploy your app into any project on any server (dev, uat, prod) with very little preparation, and you have no conversion effort if a hash file undergoes a structural change, because it's wiped out and recreated each run.
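
As a sketch of what "preparing all lookups fresh with each run" can look like in job control (the job and parameter names here, like RefreshCustLkp, are made up for illustration):

Code:

      * DS BASIC job control fragment: run the lookup-refresh job before the main load.
      $INCLUDE DSINCLUDE JOBCONTROL.H
      hJob = DSAttachJob("RefreshCustLkp", DSJ.ERRFATAL)
      ErrCode = DSSetParam(hJob, "HashFilePath", "/etl/lookups")
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob)
      Status  = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
      ErrCode = DSDetachJob(hJob)
      If Status <> DSJ.RUNOK And Status <> DSJ.RUNWARN Then
         Call DSLogFatal("Lookup refresh failed; aborting the stream", "JobControl")
      End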

There are only two reasons not to follow this advice: (1) you don't want the added time of going to the database to get the data, and (2) you don't want the extra jobs that go to the database to get the data and then put it into the hash file.

(1) If the target table is large and you don't want to fetch the whole table every ETL cycle, there is a technique of collecting the natural keys from the source data and putting them into a work table in the target database. You then do an inner join against this work table instead of a full table extract, so you pull out only the small portion of the target rows you actually need. I covered this in detail in previous posts; I'll come back and drop the link.
(2) You'll still need jobs to refresh the hash file for conversions, or to clean up when a jobstream run is aborted and you need to recycle. A different approach is to take a filesystem-level copy of the hash file before it is manipulated and use that copy as your restore source (see the sketch after this list).
(3) So many benefits are gained by getting a before-update image of the row and putting it into a sequential file. That file can feed your hash file for reference, you have an audit trail, you have a piece of the incremental aggregate update puzzle, and you have a staging file/milestone/restart point.
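
The filesystem copy in (2) can be as simple as a before-job subroutine along these lines (purely a sketch, assuming UNIX; the paths and the hash file name CUST_LKP are placeholders, and a dynamic hash file is a directory, so it gets copied recursively):

Code:

      * DS BASIC before-job subroutine: snapshot the hash file before the run touches it.
      Subroutine BackupHashFile(InputArg, ErrorCode)
      ErrorCode = 0
      HashPath  = "/etl/lookups/CUST_LKP"
      BackupDir = "/etl/backup"
      Cmd = "rm -rf " : BackupDir : "/CUST_LKP && cp -r " : HashPath : " " : BackupDir
      Call DSExecute("UNIX", Cmd, Output, SysRet)
      If SysRet <> 0 Then
         Call DSLogWarn("Hash file backup failed: " : Output, "BackupHashFile")
         ErrorCode = 1    ;* non-zero stops the job before it modifies the hash file
      End
      Return

To recycle after an abort you'd just copy the saved directory back over the hash file before restarting the stream.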
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
tomengers
Participant
Posts: 167
Joined: Tue Nov 19, 2002 12:20 pm
Location: Key West

Post by tomengers »

... very comprehensive ... thanks Ken.

... tom