Hash File Architecture 101: Persistent vs. Rebuild

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I would go for option 2, just to reduce the load on your server. Keep option 1 available in case something happens to your hash file, and schedule a regular full rebuild, say once a month, just to ensure it stays up to date.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I go with #1. Your ETL tool should never be a repository of "production" data. Your migration path is affected by this design choice. Furthermore, your hash file state is guaranteed under #1, but under #2 you always have to deal with recovering it if a jobstream fails after the hash file has already been modified and you elect to cancel the run instead of finishing it.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
tomengers
Participant
Posts: 167
Joined: Tue Nov 19, 2002 12:20 pm
Location: Key West

Post by tomengers »

kcbland wrote: ... Your migration path is affected by this design choice.
Thanks for the feedback guys ...

Ken, would appreciate a little more detail on how this approach will affect migration.

... tom
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If your design choice is to have hash files resident in the project, then you are at the mercy of where your code lives. If you instead go with explicitly pathed hash files, you have insulated yourself from the effects of switching from one project to another.
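
Here's a rough illustration of the difference in a routine (just a sketch; the parameter and file names like HashFilePath and CUST_LKP are placeholders, not anything from your jobs). A project-resident hash file is opened through the project's VOC with OPEN, while an explicitly pathed one is opened with OPENPATH and behaves the same no matter which project the code runs in:

Code:

      * Body of a DS BASIC transform function with arguments HashFilePath, CustKey.
      * HashFilePath would normally come in via a job parameter, e.g. /etl/lookups.
      Ans = ""
      * Project-resident style (resolved through the current project's VOC):
      *    OPEN "CUST_LKP" TO LkpFile ELSE ...
      * Explicitly pathed style (independent of the project):
      OPENPATH HashFilePath : "/CUST_LKP" TO LkpFile THEN
         READ Rec FROM LkpFile, CustKey THEN
            Ans = Rec<1>        ;* first attribute of the lookup record
         END ELSE
            Ans = ""            ;* key not found
         END
         CLOSE LkpFile
      END ELSE
         Call DSLogWarn("Cannot open hash file under " : HashFilePath, "CustLookup")
      END

(In a real routine you'd cache the file handle in COMMON rather than opening it on every call, but this shows the pathing difference.)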

If you take a release-oriented approach to rolling out new versions of your ETL code, projects are a fantastic way to do this. You could have a project named prod_3_0, with prod_3_1 as the next release. You can simply start executing the new release out of the new project on implementation day. Thus, your code can be laid down ahead of time (the difference between migration and implementation) and activated later. However, if your hash files are tied to the project, your implementation plan has to include converting the hash files over.

Not a big deal, but hash files created under version 5.1 won't work under version 7.0. I believe the underlying technology had some changes going from UniVerse to the DSEngine code. (I think.)

Your migration would require you to prepare your datasets every time you migrate. If your jobstream takes care of preparing all lookups fresh with each run, then you have an elegant architecture: you can deploy your app into any project on any server (dev, uat, prod) with very little preparation, and you have no conversion effort if a hash file undergoes a structural change, because it's wiped out and recreated each run.
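
As a sketch of what "preparing all lookups fresh with each run" can look like in job control (the job and parameter names here, like RefreshCustLkp, are made up for illustration):

Code:

      * DS BASIC job control fragment: run the lookup-refresh job before the main load.
      $INCLUDE DSINCLUDE JOBCONTROL.H
      hJob = DSAttachJob("RefreshCustLkp", DSJ.ERRFATAL)
      ErrCode = DSSetParam(hJob, "HashFilePath", "/etl/lookups")
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      ErrCode = DSWaitForJob(hJob)
      Status  = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
      ErrCode = DSDetachJob(hJob)
      If Status <> DSJ.RUNOK And Status <> DSJ.RUNWARN Then
         Call DSLogFatal("Lookup refresh failed; aborting the stream", "JobControl")
      End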

There are only two reasons not to follow this advice: (1) you don't want the added time of going to the database to get the data, and (2) you don't want the extra jobs that go to the database to get the data and then put it into the hash file.

(1) If the target table is large and you don't want to fetch the whole table every ETL cycle, there is a technique of collecting the natural keys from the source data and putting them into a work table in the target database. You then do an inner join against this work table instead of a full table extract, so you pull out only the small portion of the target rows you actually need. I covered this in detail in previous posts; I'll come back and drop the link.
(2) You'll still need jobs to refresh the hash file for conversions, or to clean up when a jobstream run is aborted and you need to recycle. A different approach is to take a filesystem-level copy of the hash file before it is manipulated and use that copy as your restore source (see the sketch after this list).
(3) So many benefits are gained by getting a before-update image of the row and putting it into a sequential file. That file can feed your hash file for reference, you have an audit trail, you have a piece of the incremental aggregate update puzzle, and you have a staging file/milestone/restart point.
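
The filesystem copy in (2) can be as simple as a before-job subroutine along these lines (purely a sketch, assuming UNIX; the paths and the hash file name CUST_LKP are placeholders, and a dynamic hash file is a directory, so it gets copied recursively):

Code:

      * DS BASIC before-job subroutine: snapshot the hash file before the run touches it.
      Subroutine BackupHashFile(InputArg, ErrorCode)
      ErrorCode = 0
      HashPath  = "/etl/lookups/CUST_LKP"
      BackupDir = "/etl/backup"
      Cmd = "rm -rf " : BackupDir : "/CUST_LKP && cp -r " : HashPath : " " : BackupDir
      Call DSExecute("UNIX", Cmd, Output, SysRet)
      If SysRet <> 0 Then
         Call DSLogWarn("Hash file backup failed: " : Output, "BackupHashFile")
         ErrorCode = 1    ;* non-zero stops the job before it modifies the hash file
      End
      Return

To recycle after an abort you'd just copy the saved directory back over the hash file before restarting the stream.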
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
tomengers
Participant
Posts: 167
Joined: Tue Nov 19, 2002 12:20 pm
Location: Key West

Post by tomengers »

... very comprehensive ... thanks Ken.

... tom