Organizing hash files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Organizing hash files

Post by tonystark622 »

I would like to be able to access a hash file with the UV stage, but keep the hash file in a directory other than the project directory. I know that when I create a hash file, I can put it in another directory by using the "Use directory path" option. However, if I do this, I cannot access the hash file from a UV stage without adding an entry to the VOC with SETFILE, as has been discussed here before. I have added a SETFILE command as an After Stage routine in the job where I create the hash file, but it causes an error when I rerun the job and the VOC entry already exists. To correct this, I added a DELETE.FILE as a Before Stage routine to delete the VOC entry before the hash file is created, but that causes an error if the VOC entry doesn't exist.
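Concretely, the commands involved look roughly like this (the path and file name below are just examples):

    SETFILE /data/hashfiles/MYHASH MYHASH
    DELETE.FILE MYHASH

The first runs as the After Stage command to create the VOC pointer; the second runs as the Before Stage command on reruns to clear the pointer out first, but it errors if the entry doesn't exist yet.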

These problems make this process overly fragile, especially when moving a job from one project to another, or, say, from development to production. Is there a recommended way to do these things without making the job so brittle? I'm sure these behaviors are caused by my own lack of knowledge of the DataStage environment; this is not intended as a criticism of DataStage.

Does anyone have a best practice for this? I'd rather not keep these hash files in a project directory.

Thanks for your time,
Tony Stark
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Tony, just put a DataStage BATCH job in the execution path prior to this job. Put the necessary logic in it to check for the presence of the VOC entry, verify the existence of the data, and remove the file using the appropriate commands. Follow the transformation job with another BATCH job.

Everyone seems to get stuck when they can't fit a series of commands into a before/after job routine and have forgotten that you have a BATCH job available to you with the full force and power of the internal DS BASIC language.
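As a rough sketch of what the setup batch job's logic might look like (DS BASIC, with MYHASH, the path, and the routine name as placeholders, and assuming the "UV" shell type for DSExecute):

      * Make sure a VOC pointer exists for the externally pathed hash file.
      HashName = "MYHASH"                         ;* placeholder file name
      HashPath = "/data/hashfiles/MYHASH"         ;* placeholder external path

      Open "VOC" To VocFile Else
         Call DSLogFatal("Cannot open VOC", "HashSetupBatch")
      End

      Read VocRec From VocFile, HashName Then
         * Pointer is already there, nothing to do
         Null
      End Else
         Cmd = "SETFILE " : HashPath : " " : HashName
         Call DSExecute("UV", Cmd, Output, RtnCode)
         If RtnCode <> 0 Then
            Call DSLogFatal("SETFILE failed: " : Output, "HashSetupBatch")
         End
      End

The follow-on batch job after the transformation job can do the reverse checks and clean up in the same way.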

Good luck!

Kenneth Bland
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Kenneth,

I hadn't considered a batch job. That would take care of it and I could handle all the conditions that make the other approach somewhat brittle.

Am I thinking correctly that having several developers put hash files under the project directory is just asking for trouble (unless someone coordinates things)? It seems like a good idea to keep hash files for separate jobs in separate directories, but my need to access them through the UniVerse stage means I have to have a VOC entry for them.

Thanks for your reply,
Tony
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

I have waxed somewhat eloquent about the perils of hash files in the project directory on this forum, on datastagexchange, and on Oliver's mailing list. Hash files are fantastic for one-to-one reference lookups, but using them for relational work is sketchy. This is coming from a guy who lived and breathed the underlying technology for years. There are ways of building nested lookups for things like surrogate key lookups for ranged queries, and I have shown these techniques for years, using the power of hash files as they were meant to be used.

If you're doing relational work in hash files, you're limited by the engine's relational capabilities. If you are absolutely hell-bent on using hash files with UV stages, why can't you blast those rows into a relational table and work with them there? A common mistake people make is that instead of processing the children and looking up the parent, they process the parent and expect a Cartesian product from the one-to-many children lookup. If you simply reverse the process, it all works!

I suspect that you are not a UniVerse or UniData shop, but are simply trying to find a way to use hash files for reference purposes and have to deal with a ranged lookup (select row where natural_key = ? and ? between begin_effective_dt and end_effective_dt). If this is the case, I have a technique I can send you in a document that quite easily shows you how to build a hash lookup with a primary key of the natural key and a sorted multi-valued list of surrogate_keys and begin_effective_dt's. A couple of simple routines can then parse the multi-valued list of begin_effective_dt's, find the position of your fact's business date, and return the corresponding surrogate key for that point in time.
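As a rough illustration (not the full document), the core of such a routine might look like the sketch below. It assumes the lookup row carries @VM-delimited lists of begin_effective_dt's (sorted ascending, in internal date format) and matching surrogate keys, and that the routine's arguments are named BeginDates, SurrogateKeys, and BusinessDate:

      * Walk the sorted multi-valued begin dates and keep the latest one
      * that is on or before the fact's business date.
      Ans = ""                              ;* default when no range matches
      NumVals = DCount(BeginDates, @VM)
      For I = 1 To NumVals
         If BusinessDate >= BeginDates<1,I> Then
            Ans = SurrogateKeys<1,I>        ;* best candidate so far
         End Else
            Exit                            ;* dates are sorted, so stop at the first later one
         End
      Next I

With the whole list held in one hash row, the probe is a single keyed read plus an in-memory scan, which is where the speed comes from.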

When I demonstrated this a few years ago at a client who was using Informix CLI lookups for the BETWEEN statement, the proof-of-concept job (seq-->xfm w/1 CLI lookup-->seq) ran at 10 rows/second. When I introduced the predecessor lookup build job (1 minute), the proof-of-concept job using the nested hash lookup ran at 4,000 rows/second.

Good luck!

Kenneth Bland