Reading and writing from same table/hash file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Xanadu
Participant
Posts: 61
Joined: Thu Jul 22, 2004 9:29 am

Reading and writing from same table/hash file

Post by Xanadu »

*I made this post a couple of days back and had a discussion with Sumit, but the issue is still not resolved between us. Can others in the group share their thoughts on this, please?*
( viewtopic.php?t=88762 )

--------------------------------------------------------------------------------
Hello...
I have a job that does a lookup on Table A and writes to Table A, which in turn writes to hash_file_A.

The setup looks like this:
Table A ---------------> TFM --------------> Table A ---------> Hash_table_A
         (ref lookup)         (writes to A)          (writes hash)

The transformer does a lookup on Table A and writes to Table A. The data from Table A is then written to the hash file.

My question is: I know it's better if the lookup is against the hash file instead of the table. But will every new row be written to the hash file before the lookup on the next row is performed? (Every new row is included in the next lookup if I use the table for the lookup and set array and transaction sizes to 1 when writing to the table.)

Is there any way each row can be written to the table and written to the hash file immediately, and then the lookup performed on the next row?

I am using the table for the lookup (I set array size and transaction size to 1), so every new row is included in the next lookup. But are there any better methods?
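
Just to make the per-row behaviour I want concrete, here is a rough DataStage BASIC sketch (the file, variable and routine names are all made up; this is not the actual job) of what I would like to happen for each input row: look the key up in the hash file and, if it is new, write it back immediately so the very next row's lookup can see it.

Code: Select all

      * Rough illustration only - HASH_FILE_A, InKey, InValue are hypothetical.
      * In a real routine you would open once and keep the handle in COMMON.
      OPEN "HASH_FILE_A" TO HashA ELSE
         CALL DSLogFatal("Cannot open HASH_FILE_A", "LookupSketch")
      END

      READ Rec FROM HashA, InKey THEN
         Ans = Rec<1>                   ;* key already there - lookup hit
      END ELSE
         Rec = ""
         Rec<1> = InValue
         WRITE Rec TO HashA, InKey      ;* written immediately, so the next
                                        ;* row's lookup will find this key
         Ans = InValue
      END

This is only to illustrate the visibility I am after between one row and the next, not a suggestion to replace the job design.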

Sumit suggested that I write to the hash file just before I do the lookup (then I would write to the hash file twice).
Would this be slower than looking up from the table itself?
(I created an index on the key in the table.)

Any comments?
Thanks
-Xan
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

In the general ETL world, the closer the data is to the ETL process, the faster the ETL process. That is why I support maximum use of sequential and hash files.

I would invert your problem:

Code: Select all

Table A ---> xfr ---> Hash  (preload this hash file)

                       Hash
                        |
                        V
Table A -------------> xfr ---> Hash ---> xfr ---> Table A
                    (if row not found in hash, then add it)
                                                (insert flagged new rows into Table A)
A row is read from Table A, then a reference lookup is done against the hash file, and the constrained write to the hash file occurs before the next row is read from Table A. Therefore, data integrity is maintained.
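
For example, if the reference link to the hash file were named lkp_hash (the name is just for illustration), the constraint on the link that writes back to the hash file could be as simple as:

Code: Select all

lkp_hash.NOTFOUND

so only rows that fail the lookup are added, and with write cache disabled they are in the file before the next row is processed.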

If Table A's keys are not generated until the insert occurs, use a temporary, but unique, value in the hash file. After the insert into Table A, you can retrieve the assigned key and use the hash file as a cross-reference table for future processing.
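
As an illustration only (the link and column names are made up), the temporary value could be derived in the transformer with something like:

Code: Select all

If lkp_hash.NOTFOUND Then 'TMP-' : @OUTROWNUM Else lkp_hash.TABLE_A_KEY

The 'TMP-' prefix with @OUTROWNUM only guarantees uniqueness within the run, which is enough for a temporary value; a later step replaces it with the key Table A actually assigned.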
Xanadu
Participant
Posts: 61
Joined: Thu Jul 22, 2004 9:29 am

Post by Xanadu »

Thanks Chuck...
The format I described earlier is used by the existing ASCL jobs in PSFT EPM.
They actually used a hash file for the lookup, so the setup looks like this:
Step 1) Table A --> HASH_FILE_A (like you described, they first load the hash file)
Step 2) HASH_FILE_A ---------> TFM --------------> Table A ---------> Hash_File_A
                  (ref lookup)      (writes to A)          (writes hash)

But I found these problems in the existing jobs. Correct me if I am wrong.
Problem 1) When the hash file is being written in step 2, they are writing to the cache.
Problem 2) In step 2, the hash file is pre-loaded for the lookup, and they did not set the "lock for updates" option. So even if problem 1 is rectified (by writing to the file but not to the cache), how will the newly inserted row be available for the lookup? We never said go to the file if you don't find it in the cache.

Aren't these two valid concerns, or am I missing something here?

Also, say these two are rectified, i.e. I do not write to the cache when writing to the hash file, and when reading I set the pre-load option to "Enabled, lock for updates". Would this setup work? (If you notice, the data is first written to the table and not to the hash file as in your modified design, so it would cause a problem, right?)
To avoid all these problems, I just did the lookup from the table, writing to the table with array and transaction size set to 1 and reading from the table with "read committed". Do you see any problem with this?

Thanks for your response, Chuck.
-Xan
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

What is PSFT EPM?

In my design, I do not preload into memory or enable write cache. However, based on the hash file caching documentation, I believe other settings may work. Advanced hash file cache sharing shares a single cache for both reads and writes. The success of other options would depend on your DataStage version and the configuration in uvconfig.

I suspect that your solution with an array size and a transaction size of 1 would work, but it may not represent "best practice" ETL design, and I would be concerned about its performance in a production environment.

Perhaps one other modification I would consider for our designs would be to update the hash file and Table A at the same time from the same transformer, if other jobs depend on this data.
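
Something along these lines, using the same notation as before:

Code: Select all

                     Hash A (ref)
                        |
                        V
Table A ------------->  xfr -------> Table A
                        |
                        V
                      Hash A

so both targets are fed from the same row stream in the same transformer.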
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Post by datastage »

chucksmith wrote:What is PSFT EPM?

PeopleSoft Enterprise Performance Management:

http://www.peoplesoft.com/corp/en/produ ... /index.jsp
Byron Paul
WARNING: DO NOT OPERATE DATASTAGE WITHOUT ADULT SUPERVISION.

"Strange things are afoot in the reject links" - from Bill & Ted's DataStage Adventure
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Xanadu wrote:My question is: I know it's better if the lookup is against the hash file instead of the table. But will every new row be written to the hash file before the lookup on the next row is performed? (Every new row is included in the next lookup if I use the table for the lookup and set array and transaction sizes to 1 when writing to the table.)
I believe the answer to your question is no, or at least not given the way you are creating the hash file.

With something like this:

Code: Select all

           Hash A
              |
              |
              V
Input -----> TFM ------> File A -----> Hash A

None of the input records written to File A will be loaded into Hash A until all rows have been processed through the Input-to-File A flow. Something like the following might work:

Code: Select all

           Hash A (ref)
              |
              |
              V
Input -----> TFM ------> File A
              |
              |
              V
           Hash A (output)
Disable write cache and do not preload to memory on the reference lookup. This is essentially what Chuck outlined, and it needs some constraints placed on the File A link if you only want to write records to File A when the lookup is successful.
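
For example (assuming the reference link is called Hash_A), the constraint on the File A link could be something like:

Code: Select all

Not(Hash_A.NOTFOUND)

i.e. only write the row to File A when the lookup succeeds.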

Regards,
Jamesvanam
Participant
Posts: 40
Joined: Tue May 18, 2004 10:51 pm

Post by Jamesvanam »

Hi,
I have a question on this: could I write to Hash A (output) only those records that are actually written to File A? In other words, can I avoid writing records to Hash A that do not make it into File A? For example, if for some reason a row is rejected or not written to File A but Hash A still gets updated, we end up with that record in Hash A but not in File A.

Is there any way to prevent this?

Regards
James