Ensuring hashed file is created when the job executes

Post questions here relating to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Ensuring hashed file is created when the job executes

Post by sbass1 »

Hi,

I've reviewed viewtopic.php?t=92114&highlight=CREATE+TABLE but still have this question...

I have a job like:

Code: Select all

DRS --> xfm --> DRS --> hashed_file (2)
         ^
         |
   hashed_file (1)
where hashed_file (1) and hashed_file (2) are the same physical file.

(sorry for the badly formatted ASCII art; hashed_file (1) is a lookup for the xfm transformer)

hashed_file (1) is pre-loaded into memory
hashed_file (2) has create file / clear file ticked

This is all to support surrogate key generation, with code like:

If IsNull(Lookup_SKs.FooSK) Then KeyMgtGetNextValueConcurrent("source.foo") Else Lookup_SKs.FooSK

although I intend to change this to:

If (Lookup_SKs.NOTFOUND) Then ...
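That is, the full derivation would presumably become something like:

Code: Select all

If Lookup_SKs.NOTFOUND Then KeyMgtGetNextValueConcurrent("source.foo") Else Lookup_SKs.FooSK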

My issue is, after physical file cleanup, or migration to a new environment, hashed_file (1) does not exist, and the job aborts.

What is the best practice and syntax to implement the following pseudocode:

If the file #DSStageName# does not exist in directory #$HashFileLocation#, create an empty hashed file at #$HashFileLocation#/#DSStageName#.

This would then allow the job to run, generating new SKs for each input row, then updating the hashed file.

The code should not raise an error if the hashed file already exists. Ideally it would support the DataStage automatic variables noted above (I doubt this will be the case - it would require a before-job routine to "peek into" the structure of the job), and it should optionally let me write a warning to the log when the file does not exist.

I tried this command from the operating system, with the noted results:

/tmp > /opt2/ds/Ascential/DataStage/DSEngine/bin/create.file DATA TEST DYNAMIC
Unable to open VOC file.
*** Processing cannot continue. ***

Various permutations of this command resulted in the same error.

Thanks,
Scott
Last edited by sbass1 on Thu Mar 12, 2009 1:39 am, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Put your ASCII art inside code tags.

Best practice to ensure that a hashed file exists is to create a separate job to create it. This needs only a transformer stage and a hashed file stage.
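As for your create.file attempt: the "Unable to open VOC file" error is because create.file must be run from within the project directory, where the project's VOC file lives - running it from /tmp will always fail. If you do want a before-job subroutine, something along these lines would do for an account-based hashed file (a sketch only - the routine name, warning text and choice of DYNAMIC file type are assumptions, not tested code):

Code: Select all

* Before-job subroutine sketch: create the hashed file if it is missing.
* InputArg is expected to carry the hashed file name (in the project account).
Subroutine EnsureHashedFile(InputArg, ErrorCode)
   ErrorCode = 0
   HashName = Trim(InputArg)
   Open HashName To F.Hash Then
      Close F.Hash   ;* file already exists - nothing to do
   End Else
      Call DSLogWarn("Hashed file " : HashName : " does not exist; creating it.", "EnsureHashedFile")
      Call DSExecute("TCL", "CREATE.FILE " : HashName : " DYNAMIC", Output, ReturnCode)
      If ReturnCode <> 0 Then ErrorCode = ReturnCode
   End
Return
Note this only covers account-based hashed files (the kind registered in the project's VOC); for directory-pathed hashed files you would need to test for the path itself instead.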
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Our latest approach is:

Code: Select all

DRS (1) --> xfm --> DRS (2) --> hashed_file (2)
             ^
             |
DRS (2) --> hashed_file (1)
This:

1) Ensures hashed_file (1) always exists
2) Is always in sync with DRS (2) (the target table)
3) Ensures new SKs are derived correctly

Pros:
Ensures the hashed file is always in sync with the target table, for example after a database refresh (whether truncating the table or adding rows to it, e.g. a refresh from UAT)

Cons:
Somewhat computationally inefficient, as it's unconditionally refreshing the hashed file when it might not need refreshing

I'm a bit ambivalent about this approach (suggested by a colleague) due to the cons above.

The reason we're using a hashed file instead of a direct lookup against the target table is that the lookups are orders of magnitude faster.

Comments?

P.S.: I've seen no difference in behaviour whether I tick "Create file" or not. DS always seems to create the hashed file if it does not already exist.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

sbass1 wrote:P.S.: I've seen no difference in behaviour whether I tick "Create file" or not. DS always seems to create the hashed file if it does not already exist.
Correct. Ticking 'Create File' just gives you the option to override the defaults used at creation time, plus access to the 'delete before' option. As you've found, it will always attempt to create the hashed file if it doesn't exist regardless of that setting.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

If you don't wish to unconditionally refresh the hashed file, you can replace your DRS (2) with a transformer stage having a stage variable and an output constraint of @FALSE. That will create the hashed file if it does not exist.

Mike
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Relying on auto-create will prove counterproductive if the job design reads from and writes to the same hashed file.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Mike wrote:If you don't wish to unconditionally refresh the hashed file, you can replace your DRS (2) with a transformer stage having a stage variable and an output constraint of @FALSE. That will create the hashed file if it does not exist.

Mike
Thanks Mike, slick trick. The hashed file never gets updated since the constraint is false, but gets created in accordance with the metadata of the link if it does not yet exist.
Last edited by sbass1 on Mon Mar 16, 2009 12:42 am, edited 1 time in total.
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Edit/Delete post is only letting me edit, not delete the extra post caused by hitting back in my browser. So sorry for the extra post (this one).
Last edited by sbass1 on Mon Mar 16, 2009 12:44 am, edited 1 time in total.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

IOW ?

Isn't that the noise a cat makes?
:?

Please refrain from all but the most common English abbreviations on DSXchange. Many of our readers don't have English as their first language, and it's tough enough as it is, without having to decode the mental processes behind arcane abbreviations.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

sbass1 wrote:Edit/Delete post is only letting me edit, not delete the extra post caused by hitting back in my browser. So sorry for the extra post (this one).
You can delete a post as long as no-one has replied to it yet, which is why I always check for duplicates right away if there's any delay in posting... delay = multiple postings.
-craig

"You can never have too many knives" -- Logan Nine Fingers
yserrano
Premium Member
Posts: 52
Joined: Thu Jan 31, 2008 1:23 pm
Location: Santo Domingo

Post by yserrano »

sbass1 wrote:Our latest approach is:

Code: Select all

DRS (1) --> xfm --> DRS (2) --> hashed_file (2)
             ^
             |
DRS (2) --> hashed_file (1)
This:

1) Ensures hashed_file (1) always exists
2) Is always in sync with DRS (2) (the target table)
3) Ensures new SKs are derived correctly

Pros:
Ensures the hashed file is always in sync with the target table, for example after a database refresh (whether truncating the table or adding rows to it, e.g. a refresh from UAT)

Cons:
Somewhat computationally inefficient, as it's unconditionally refreshing the hashed file when it might not need refreshing

I'm a bit ambivalent about this approach (suggested by a colleague) due to the cons above.

The reason we're using a hashed file instead of a direct lookup against the target table is that the lookups are orders of magnitude faster.

Comments?

P.S.: I've seen no difference in behaviour whether I tick "Create file" or not. DS always seems to create the hashed file if it does not already exist.
We have used another method very successfully to create surrogate keys: we created a transform routine that receives parameters and performs a max(SK) on the target table. The routine is used as the initial value for a stage variable that gets incremented for each insert.
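Roughly, the stage variable setup looks like this (the routine and variable names here are made up for illustration):

Code: Select all

* Stage variable svNextSK
* Initial value: GetMaxSK(TargetTable, KeyColumn)  ;* the transform routine doing SELECT MAX(SK)
* Derivation (evaluated once per row, so the counter advances for each insert):
svNextSK + 1
The output SK column then derives straight from svNextSK.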
Post Reply