Hash file creation & update clarification

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Hash file creation & update clarification

Post by PhilHibbs »

I'd just like to confirm a couple of assumptions about hash files.

If I have a single hash file stage with both an input link and an output link, and the same hash file name, the job first loads the hash file from the input link and only then allows the reference link to be invoked.

If I have two hash file stages with the same hash file name, where one is an output from a transformer and the other is a reference link to the same or a different transformer, then access to the reference link will not be delayed, and both instances of the hash file will be read and updated simultaneously.

I can't find any documentation of this behaviour, but is it true to say "If a single hash file stage is used as both an input and an output, then the output(s) will be disabled until the input(s) finish loading"?
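To make the two layouts concrete, something like this (stage names are illustrative):

Code:

(1) One hashed file stage, loaded and then referenced:

    Xfm1 ---load---> HashStage ---ref---> Xfm2

(2) Two hashed file stages naming the same file:

    Xfm1 ---load---> HashStage_A      HashStage_B ---ref---> Xfm2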

Supplemental question: how do you create an empty hash file? My colleague has got round this by dangling an output link to the hash file from a previous Transformer with a constraint of 0, but he complains that three copies of the hash file stage are excessive, and that it needs a Transformer that may serve no other purpose. Is there a way to omit this extra hash stage in a DataStage job?
Phil Hibbs | Capgemini
Technical Consultant
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Phil,

The hashed file stage does work as you described. Regarding the creation of an empty hashed file in this job, why don't you just set the create-file option in the hashed file stage's attributes and have it create the file for you?

Although a before-job routine or the job control tab could be used to execute the CREATE.FILE command, the better way would be to use the method your colleague has already implemented.
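For reference, a minimal sketch of such a before-job routine, assuming a type 30 (dynamic) hashed file in the project account; the routine name and the "UV" shell type passed to DSExecute are illustrative, not a tested implementation:

Code:

* Illustrative before-job subroutine: create the hashed file named
* in InputArg as a type 30 dynamic file in the current account.
Subroutine CreateHashedFile(InputArg, ErrorCode)
   ErrorCode = 0                                 ;* 0 lets the job continue
   Command = "CREATE.FILE ":Trim(InputArg):" 30"
   Call DSExecute("UV", Command, Output, SystemReturnCode)
   * If the file already exists CREATE.FILE simply complains, which is
   * harmless for this purpose; log the command output for reference.
   Call DSLogInfo(Output, "CreateHashedFile")
Return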
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Since the order of processing is basically "read a row from the input, do all the references, then write all the outputs", all references must exist before any output is written.

When a job starts, if the optimizer sees that some active stages (call them the second set) have to wait for passive reference files to be populated by other active stages (the first set), then the targets of those second-set active stages are not created or opened until their active stage actually starts. So if a first-set active stage references the target of a second-set active stage, that target may not exist yet: the CREATE TABLE or create-hashed-file step won't happen until the input link starts.

So you dangle an output link with an @FALSE constraint upstream of the referencing active stage, and opening that link does the create.
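Schematically, with made-up stage names, the dangling link looks something like:

Code:

Source ---> Xfm1 -----------------> Target
             |
             +--[@FALSE]--> HashedFile   (created when Xfm1 opens the
                                          link; no rows ever flow down it)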

It seems like a mess, but that's the way it works. A cleaner approach would be to create a single-purpose job that runs ahead of this one just to create the hash file, but there's no easier graphical solution. The before/after-job create-hash-file command is a worthy option.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Bala R
Participant
Posts: 66
Joined: Mon May 30, 2005 9:52 pm

Post by Bala R »

I think you cannot have a single hash stage with the same hash file for read and write; it would not compile.
We have a couple of jobs that exist just to create an empty hash file:

Code:

SourceDB ---> xfm ---> HashStage

where the output from the SourceDB is set to return no rows, e.g. WHERE 1 = 2.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

ArndW wrote:Regarding the creation of an empty hashed file in this job, why don't you just click on the attributes in the hashed file stage to have it create the file for you?
Because the job fails: the hash file does not exist when the job tries to use the reference link, before the first row has been output.
Bala R wrote:Think you could not have a single hashstage with the same hash file for read and write. It would not compile.
You can have the same hash stage with the same hash file for read and write. What you can't have, as I understand it, is the links coming from and going to the same data stream.
Phil Hibbs | Capgemini
Technical Consultant
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you validate the job, this will bring the hashed file into existence. You may need to temporarily change the hashed file name in the stage that reads from it.

Or you can execute the CREATE.FILE or CREATE TABLE or mkdbfile command in a before-job or before-stage subroutine.
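For example, with an illustrative file name (and, for mkdbfile, the commonly quoted default parameters, which you should verify for your release):

Code:

CREATE.FILE MyHashedFile 30
    (TCL: account-based dynamic hashed file)

$DSHOME/bin/mkdbfile /path/to/MyHashedFile 30 1 4 20 50 80 1628
    (shell: pathed hashed file)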
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

I have discovered a better way to solve this situation. Add a dummy transformer that has a link going into the Hashed File stage that is being used as the lookup. Set this dummy transformer up as a single-row generator: @OUTROWNUM = 1 as the constraint, plus a Stage Variable that is not used but exists only so that a null program gets generated. This causes the hashed file to be created before the main flow of the job executes, all in one place, with no untidy initialiser job.

Code:

     X---#
         :
 ------->X---------->
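(In the sketch, the top X is the dummy no-input transformer, # is the hashed file stage it writes to, and the dotted link is the reference lookup into the transformer on the main flow.)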
I'll probably post an image tomorrow.
Phil Hibbs | Capgemini
Technical Consultant
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

We've documented this technique before here. And all you need for the constraint is @FALSE, as no rows need to be processed simply to create or clear the existing hashed file; the link just needs to be 'opened'. 8)
-craig

"You can never have too many knives" -- Logan Nine Fingers
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

Ah, of course, we want no rows generated in this context, not one row! Also, I hadn't seen this no-input transformer used to create a hashed file in the same job that reads and updates it.
Phil Hibbs | Capgemini
Technical Consultant
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Stop by my house some time. :wink:

I think I found the earliest mention of it that I recall making here, when discussing generating test data. It was then extended to solve this situation where you read and update a hashed file in the same job... and yeah, it's quite cool when the light bulb goes on in your head and you make it work for the first time.
-craig

"You can never have too many knives" -- Logan Nine Fingers