Clearing a hash file

tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Clearing a hash file

Post by tonystark622 »

Good day, everyone.

I just started on a project where the client is using DataStage 6.x. I previously worked with DataStage 4.2.1, so some things have changed.

The folks at this client have just taken a DataStage class a couple of weeks ago and are still learning.

Anyway, to my question. In DataStage 4.x, when we had a job that wrote to the same hash file from multiple stages, we made sure the hash file was cleared at the beginning of the job, before any stage was allowed to write to it. One of the folks here thinks this may have changed since DataStage 4: he tried checking "Clear file before writing" in the two different places where the hash file was used and insists that he didn't have any problems with losing data.

Has this changed, or is it still possible to lose data in the hash file because it's cleared twice? I still think it's possible to lose data doing it this way. So, who's right?

I apologize if I haven't stated my question clearly. Please ask if I need to clarify something.

Thanks,
Tony
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First off, I really don't believe that the behaviour of clearing hash files has changed. I'd be happy to be proven wrong, [:I] but I haven't seen anything that would lead me to that conclusion.

In my experience, there is no need to check the 'Clear' box on more than one instance of a hash when multiple stages are writing to it. DataStage is smart enough to take a look at the job and clear any hash files that need it *before* any data is processed, not when the first row hits the affected stage. The same applies to Oracle and tables that need to be truncated.

It doesn't hurt, but there's no need for it. The only reason I can think of for doing it is so that, when someone else looks at the job, they know it is being cleared no matter which instance of the hash stage they look at. FWIW, I've found that text boxes (which we color code green, yellow and red based on content) are great for conveying that kind of information.

-craig
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

I have always been taught that the hash file needs to be cleared in an earlier part of the job, before any of the stages that write to it are allowed to do so. In fact, I'm pretty sure that when I did it the way you've described, I saw DataStage write data from one stage into the hash, and then the other stage cleared the hash file and erased what the first stage had written. Has anyone else had this experience besides me?

Thanks,
Tony
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If by "part of the job" you mean portions that run in a seperate wave, then yes setting your hash file to clear in a later "portion" may cause problems. In that case, don't do that. [:o)]

Here's a suggestion: write a test job that loads a small flat file into a hash file, with the hash file set to clear. Run it once so the hash file gets populated. Now run the job again using an empty input file. No rows will be processed, but the first thing that happens is the hash file being cleared.

Or just run it again with the same file but using the debugger. Set a breakpoint so that it stops when the first row comes out of the gate, before it gets processed at all. Check the row count in the hash file at that point - it will be empty. Heck, you could even try it again with a more complex test case that matches your worries and clear the hash later in the job stream - then you'd really know... but I'm sure you'll be fine.

-craig
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If there's only one hashed file to clear, set up a before-stage subroutine (or before-job subroutine) to use ExecUV to execute the CLEAR.FILE command.
If there's more than one hashed file to clear, use the Administrator Command window to build up a list of CLEAR.FILE commands, then multi-select these and save them with a name (ClearHashedFiles, for example). Then set up your before-stage (before-job) subroutine to execute this saved list of commands (which is called a "paragraph") via ExecUV.
Nothing has changed in this regard between 4.2 and 6.1. There are some extra capabilities regarding caching, particularly sharing cached hashed files between jobs and the ability to lock for update when a hashed file is cached; read the manual dskcache.pdf for much more information on these (it's installed with your DS6 client software in the Docs folder).
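
To make that concrete, here is a minimal sketch of such a before-job subroutine. It uses the supported DSExecute interface rather than my ExecUV, and the routine and file names (ClearHashedFile, MyHashedFile) are placeholders only:

   Subroutine ClearHashedFile(InputArg, ErrorCode)
   * Before-job routine sketch: clear a hashed file via a DataStage
   * engine (TCL) command. Setting ErrorCode <> 0 aborts the job.
      ErrorCode = 0
      Command = "CLEAR.FILE MyHashedFile"
      Call DSExecute("UV", Command, Output, SystemReturnCode)
      If SystemReturnCode <> 0 Then
         Call DSLogWarn("Command failed: " : Command : " - " : Output, "ClearHashedFile")
         ErrorCode = SystemReturnCode
      End
   Return

To run the saved paragraph instead, Command would simply be the paragraph name (ClearHashedFiles).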

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ray, are you implying in your post that there is some reason why you would not simply want to check the 'Clear' option in the stage, or are you just pointing out another option to accomplish the same thing?

It seems like extra (unnecessary) work unless there are issues I'm not aware of, issues to which Tony has been alluding perhaps? Seeking enlightenment...

-craig
[8D]
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Checking the Clear box will work, but it leaves the design vulnerable to what the OP suggested might be a problem: developers who check Clear in every hashed file stage, which CAN lead to lost data.
If designs go through a QA or peer review process, these things can be detected; it's my belief, based on experience, that it's better to do these things as explicitly as possible.
Wherever the hashed file is cleared, I would always include some text in the long description of the stage or job reminding future developers that it occurs; it's so easy to miss a checked checkbox.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I see your point and I totally agree - indiscriminate setting of the 'Clear' option can be dangerous. Make sure you understand how the job is structured and do it where appropriate, be it in the Stage itself or explicitly in a before-job routine.

I also pointed out the use of the Text Annotations (better than using the description space, IMHO) to indicate anything of that ilk that is important to point out to other people. We color the background of the text box to show the 'importance level' of information in it.

My main point was that, when a hashed file is checked to be cleared or an Oracle table is set to be truncated, it happens when the job starts *before* any rows are processed - not when the first row hits it. Not sure if that helped the OP, but wanted to point that out.

-craig
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Ray and Craig,

Thanks for your replies. I appreciate both your thoughts and suggestions.

I did come up with a job design that clears the hash file(s) before anything else occurs in the job. Ray, your suggestion seems a bit cleaner than mine. I'll look into it.

Tony
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Ray,

I don't find ExecUV anywhere. Is this something I have to install?

Thanks,
Tony
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm sure he meant ExecTCL, which executes a 'UniVerse' (TCL) command.

-craig
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Yes, sorry, ExecTCL(). [:I]

ExecUV is one of mine; it's a bit quieter than ExecTCL. I don't like cluttering the log file unnecessarily.
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Thanks again for your help, Ray and Craig.

I appreciate your help.

Tony
tonystark622
Premium Member
Posts: 483
Joined: Thu Jun 12, 2003 4:47 pm
Location: St. Louis, Missouri USA

Post by tonystark622 »

Ok, guys. I've been thinking about this for a few days now, trying to reconcile Ray's and Craig's answers. Let me see if I understand.

I mentioned checking the Clear checkbox on multiple hash file stages for the same hash file. Craig, you said this was unnecessary because DataStage is smart enough to see that a hash file needs to be cleared and clears it before any rows are processed. Ray said that I was correct: if you check the Clear checkbox on multiple hash file stages for the same hash file, you can lose data. Craig, what I understood you to be saying is that you should only check the Clear checkbox on ONE of the stages for that hash file, not two like I did. If so, Ray, why did you recommend using a before-job routine to do this work, if checking the checkbox will do the job?

Also, Ray, I couldn't get CLEAR.FILE to work. I'm using a directory path, not an account, for the hash file, if that makes any difference. I've tried everything I can think of in the Administrator, but I always get error 30144. Can you give me any hints?

Thanks for all your help,
Tony
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ray stated that he liked to *explicitly* clear the hash files so it is more obvious to Those Who Come After, rather than relying on a (possibly overlooked) check box. We both encouraged the use of Annotations to point out the clearing. [:)]

Yes, I was saying *one* check box will do the job. We were both trying to point out the dangers of checking multiple option boxes, as single jobs may execute portions in separate 'waves', where checking one in the first wave and one in the second wave could definitely bite you. [}:)]

CLEAR.FILE only works if the hashed file was created with CREATE.FILE (that is, in an account), and hashed files created with a directory path are not. I don't recall the details off the top of my head, so search the forum on this topic - Ray has laid out more than once exactly how clearing depends on how the file was created - I couldn't do it justice. [:I]
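
From memory, and purely as a sketch you should verify (the path and the pointer name MyHash below are made up), the usual workaround is to create a VOC pointer to the pathed hashed file with SETFILE, then clear through that pointer:

   * Assumption: a pathed hashed file has no VOC entry, so CLEAR.FILE
   * can't see it; SETFILE creates a VOC pointer to it first.
   Call DSExecute("UV", "SETFILE /data/hash/MyHash MyHash OVERWRITING", Output, RetCode)
   Call DSExecute("UV", "CLEAR.FILE MyHash", Output, RetCode)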

-craig