Reading and writing from same table/hash file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Xanadu
Participant
Posts: 61
Joined: Thu Jul 22, 2004 9:29 am

Reading and writing from same table/hash file

Post by Xanadu »

*I made this post a couple of days back and had a discussion with Sumit, but the issue is still not resolved between us. Can others in the group share their thoughts on this, please?*
( viewtopic.php?t=88762 )

--------------------------------------------------------------------------------
Hello...
I have a job that does a lookup on Table A and writes to Table A, which in turn writes to hash_file_A.

The setup looks like this:
Table A ---------------> TFM --------------> Table A ---------> Hash_table_A
         (ref lookup)         (writes to A)          (writes hash)

The transformer does a lookup on Table A and writes to Table A. The data from Table A is then written to the hash file.

My question is: I know it's better if the lookup is against the hash file instead of the table. But will every new row be written to the hash file before the lookup on the next row is performed? (Every new row is included in the next lookup if I use the table for the lookup and set array and transaction sizes to 1 when writing to the table.)

Is there any way each row can be written to the table and written to the hash file immediately, and then the lookup performed on the next row?

I am using the table for the lookup (I set array size and transaction size to 1), so every new row is included in the next lookup. But are there any better methods?
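
Just to make the per-row behaviour I want concrete, here is a rough DataStage BASIC sketch (the file, variable and routine names are all made up; this is not the actual job) of what I would like to happen for each input row: look the key up in the hash file and, if it is new, write it back immediately so the very next row's lookup can see it.

Code: Select all

      * Rough illustration only - HASH_FILE_A, InKey, InValue are hypothetical.
      * In a real routine you would open once and keep the handle in COMMON.
      OPEN "HASH_FILE_A" TO HashA ELSE
         CALL DSLogFatal("Cannot open HASH_FILE_A", "LookupSketch")
      END

      READ Rec FROM HashA, InKey THEN
         Ans = Rec<1>                   ;* key already there - lookup hit
      END ELSE
         Rec = ""
         Rec<1> = InValue
         WRITE Rec TO HashA, InKey      ;* written immediately, so the next
                                        ;* row's lookup will find this key
         Ans = InValue
      END

This is only to illustrate the visibility I am after between one row and the next, not a suggestion to replace the job design.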

Sumit suggested that I write to the hash file just before I do the lookup (then I would write to the hash file twice).
Would this be slower than looking up from the table itself?
(I created an index on the key in the table.)

Any comments?
Thanks
-Xan
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

In the general ETL world, the closer the data is to the ETL process, the faster the ETL process. That is why I support maximum use of sequential and hash files.

I would invert your problem:

Code: Select all

Table A ---> xfr ---> Hash  (preload this hash file)

                       Hash
                        |
                        V
Table A -------------> xfr ---> Hash ---> xfr ---> Table A
                    (if row not found in hash, then add it)
                                                (insert flagged new rows into Table A)
A row is read from Table A, then a reference lookup is done against the hash file, and the constrained write to the hash file occurs before the next row is read from Table A. Therefore, data integrity is maintained.
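
For example, if the reference link to the hash file were named lkp_hash (the name is just for illustration), the constraint on the link that writes back to the hash file could be as simple as:

Code: Select all

lkp_hash.NOTFOUND

so only rows that fail the lookup are added, and with write cache disabled they are in the file before the next row is processed.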

If Table A's keys are not generated until the insert occurs, use a temporary, but unique, value in the hash file. After the insert into Table A, you can retrieve the assigned key and use the hash file as a cross-reference table for future processing.
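
As an illustration only (the link and column names are made up), the temporary value could be derived in the transformer with something like:

Code: Select all

If lkp_hash.NOTFOUND Then 'TMP-' : @OUTROWNUM Else lkp_hash.TABLE_A_KEY

The 'TMP-' prefix with @OUTROWNUM only guarantees uniqueness within the run, which is enough for a temporary value; a later step replaces it with the key Table A actually assigned.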
Xanadu
Participant
Posts: 61
Joined: Thu Jul 22, 2004 9:29 am

Post by Xanadu »

Thanks Chuck...
The format I described earlier is used by the existing ASCL jobs in PSFT EPM.
They actually used a hash file for the lookup, so the setup looks like this:
Step 1) Table A --> HASH_FILE_A (like you described, they first load the hash file)
Step 2) HASH_FILE_A ---------> TFM --------------> Table A ---------> Hash_File_A
                  (ref lookup)      (writes to A)          (writes hash)

But I found these problems in the existing jobs. Correct me if I am wrong.
Problem 1) When the hash file is being written in step 2, they are writing to the cache.
Problem 2) In step 2, the hash file is pre-loaded for the lookup, and they did not set the "lock for updates" option. So even if problem 1 is rectified (by writing to the file but not to the cache), how will the newly inserted row be available for the lookup? We never said go to the file if you don't find it in the cache.

Aren't these two valid concerns, or am I missing something here?

Also, say these two are rectified, i.e. I do not write to the cache when writing to the hash file, and when reading I set the pre-load option to "Enabled, lock for updates". Would this setup work? (If you notice, the data is first written to the table and not to the hash file as in your modified design, so it would cause a problem, right?)
To avoid all these problems, I just did the lookup from the table, writing to the table with array and transaction size set to 1 and reading from the table with "read committed". Do you see any problem with this?

Thanks for your response, Chuck.
-Xan
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

What is PSFT EPM?

In my design, I do not preload into memory or enable write cache. However, based on the hash file caching documentation, I believe other settings may work. Advanced hash file cache sharing shares a single cache for both reads and writes. The success of other options would depend on your DataStage version and the configuration in uvconfig.

I suspect that your solution with an array size and a transaction size of 1 would work, but it may not represent "best practice" ETL design, and I would be concerned about its performance in a production environment.

Perhaps one other modification I would consider for our designs would be to update the hash file and Table A at the same time from the same transformer, if other jobs depend on this data.
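
Something along these lines, using the same notation as before:

Code: Select all

                     Hash A (ref)
                        |
                        V
Table A ------------->  xfr -------> Table A
                        |
                        V
                      Hash A

so both targets are fed from the same row stream in the same transformer.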
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

chucksmith wrote:What is PSFT EPM?

PeopleSoft Enterprise Performance Management:

http://www.peoplesoft.com/corp/en/produ ... /index.jsp
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Xanadu wrote:My question is: I know it's better if the lookup is against the hash file instead of the table. But will every new row be written to the hash file before the lookup on the next row is performed? (Every new row is included in the next lookup if I use the table for the lookup and set array and transaction sizes to 1 when writing to the table.)
I believe the answer to your question is no, or at least not given the way you are creating the hash file.

With something like this:

Code: Select all

           Hash A
              |
              |
              V
Input -----> TFM ------> File A -----> Hash A

None of the input records written to File A will be loaded into Hash A until all rows have been processed through the Input-to-File A flow. Something like the following might work:

Code: Select all

           Hash A (ref)
              |
              |
              V
Input -----> TFM ------> File A
              |
              |
              V
           Hash A (output)
Disable write cache and do not preload to memory on the reference lookup. This is essentially what Chuck outlined, and it needs some constraints placed on the File A link if you only want to write records to File A when the lookup is successful.
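
For example (assuming the reference link is called Hash_A), the constraint on the File A link could be something like:

Code: Select all

Not(Hash_A.NOTFOUND)

i.e. only write the row to File A when the lookup succeeds.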

Regards,
Jamesvanam
Participant
Posts: 40
Joined: Tue May 18, 2004 10:51 pm

Post by Jamesvanam »

Hi,
I have a question on this: could I write to Hash A (output) only those records that are actually written to File A? In other words, can I avoid writing records to Hash A that do not make it into File A? For example, if for some reason a row is rejected or not written to File A but Hash A still gets updated, we end up with that record in Hash A but not in File A.

Is there any way to prevent this?

Regards
James