lookup only once from hashfile

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

lookup only once from hashfile

Post by spracht »

Is there an easy way to ensure that records from a hashed file lookup are picked only once, even if subsequent records from the stream input would match the same row again?

Stephan
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You have two choices: remove the row from the hashed file, or mark a column on the hashed file to indicate that the row has been referenced. Both are bad choices, because you cannot use read or write caching to improve performance. Removing the row requires a UV/SQL stage with a DELETE statement (the poorer choice). Marking the referenced row simply requires adding a column to indicate it has been referenced (faster than the UV/SQL stage, but still no caching).
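
In transformer terms, the marking design might look something like this (the link and column names below are only placeholders):

    Reference link "ref": lookup against the hashed file on the match key,
    with read cache and write-delay cache both switched off.

    Constraint on the main output link (take the row only if it matched
    and has not been used before):
        Not(ref.NOTFOUND) And ref.USED_FLAG <> 1

    Second output link, writing back to the same hashed file:
        key column(s) = the same match key
        USED_FLAG     = 1

Later stream rows that hit the same key then see USED_FLAG = 1 and fail the constraint.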

Maybe if you tell us what you're trying to do from a big picture, we can give an alternative approach.

Thanks.

Kenneth Bland
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

Ken

I already tried what you suggested, marking the used records, but this didn't work (I tried the options 'Disabled' and 'Disabled - lock for update' within the hashed file stages).

What I want to do is as follows: there is a data source outside our DWH which is fed from the same source, a flat file generated by the billing system. The external data source contains calculations at an elementary level that must be imported into the DWH. There is, however, no unique identifier, at least none common to both sources. The matching is done in several stages, starting with a 9-part key expression and gradually loosening to a 6-part key. It has to be ensured that every record from the source finds its way to the target without updating a record that was found before, because any aggregation of the imported columns should lead to the same result as in the source system. I'm currently achieving this with a sort stage and a stage variable in a subsequent transformer that eliminates the duplicates.
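
In outline, that sort-plus-stage-variable workaround looks roughly like this (column and variable names are just examples; the input is sorted on the match key and the stage variables are derived in this order):

    svIsDup   : If in.MatchKey = svPrevKey Then @TRUE Else @FALSE
    svPrevKey : in.MatchKey

    Constraint on the output link (pass only the first row per key):
        Not(svIsDup)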

Hope this makes it a bit more clear. Thanks and Regards

Stephan
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

My marking suggestion has to work. You must disable write-delay caching as well as read caching. NEVER USE LOCKING! It adds a huge overhead to your job and will degrade performance if the locks are not released. If the row is cached for write, the read cache won't "see" it.

Give this a try, it's probably your best bet. However, it brings up the point that I hope you are not PERMANENTLY storing this information in your hashed file. [:(] That would be a very bad practice. You should put this xref into a permanent datastore in your warehouse. Just a few words of wisdom...

Good luck.


Kenneth Bland
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Another solution is to maintain a list of keys that you have seen, and search this list. For a sufficiently small list (up to, say, a few thousand keys), a dynamic array in a variable declared to be in COMMON would suffice. For larger lists a more efficient search strategy would be needed, such as a binary chop (binary search).
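
A rough sketch of such a routine, written as a transform function with a single argument KeyValue (routine and variable names here are only examples):

      Common /SeenKeyList/ KeyList

      If UnAssigned(KeyList) Then KeyList = ''    ;* first call in this run

      Locate KeyValue In KeyList<1> By 'AL' Setting Pos Then
         Ans = 1                            ;* key already in the list
      End Else
         Ins KeyValue Before KeyList<Pos>   ;* remember it, keeping the list sorted
         Ans = 0                            ;* first occurrence of this key
      End

In the transformer you would call it once per row, for example in a stage variable svSeen = SeenBefore(in.MatchKey) (SeenBefore being whatever you name the routine), and constrain the output link with Not(svSeen).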

Yet another solution is a second hashed file, which starts empty and contains just the keys that have been seen. This is similar to the marking technique, but allows the first hashed file to be cached for reading. This approach might be preferred if, for example, only a small subset of keys in the main hashed file was likely to be seen.
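
Again only as a sketch (the hashed file and routine names are invented for the example), the check against that second hashed file could be a transform function that keeps its file handle open in COMMON:

      Common /SeenKeysFile/ FileIsOpen, SeenFile

      If UnAssigned(FileIsOpen) Then FileIsOpen = 0

      If Not(FileIsOpen) Then
         Open 'SEEN_KEYS' To SeenFile Else
            Call DSLogFatal('Cannot open hashed file SEEN_KEYS', 'KeySeen')
         End
         FileIsOpen = 1
      End

      Read Dummy From SeenFile, KeyValue Then
         Ans = 1                          ;* key already recorded in this run
      End Else
         Write '' To SeenFile, KeyValue   ;* record the key for later rows
         Ans = 0                          ;* first time this key has appeared
      End

The SEEN_KEYS hashed file would have to exist in the account and be cleared at the start of each run.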

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

Ken, Ray

Again: I assume that disabling the 'Pre-load file to memory' option is what is meant by not using the read cache. And again, the job fails to work as intended. Fewer records are marked as used in the hashed file than the monitor claims to have found (in the hashed file) and updated, both in the database table and in the hashed file. The job log even says 'read cache enabled' where I definitely disabled it in the job design. By the way, we are working with DS 5.2 on HP-UX 11. When I think of all the time I have wasted dealing with things that just don't work as they should, it's a nightmare. And if the DS server has no idea what happened, it just comes up with the message 'Abnormal termination detected'.

Stephan
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

Stephan,

I caught the phrase "database and hash" and would like to make sure that the file you are writing to and the file you are reading from are one and the same? This is the only way this will work.

Ken is correct in stating that if you have followed what he outlined, then it has to work. This is a fairly common design for things such as changed data and dimensional processing.

Please ensure that the reference link's "Pre-load file to memory" drop-down is set to Disabled and that "Allow stage write cache" is not checked on the output link. Performing the reference lookup and the output to the hashed file in the same transformer or in different transformers has no impact on the outcome.

Regards,

Michael Hester