Lookup

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

sdevashis
Participant
Posts: 54
Joined: Thu Oct 09, 2003 4:00 am
Location: India

Lookup

Post by sdevashis »

Hi,
I am new to DataStage. One of my jobs requires a lookup. I looked in the documentation; although it says there is a Lookup control in the palette, I don't see it. The control doesn't show up in any of the tabs.

I am using an evaluation version.

If I have to manage without that control, what's the best way of doing a lookup in DataStage?

Thanks for your replies.

Devashis
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The true "Lookup" stage only exists in the Mainframe palette - i.e., DataStage XE/390. If you check the online help you'll find it, but note that it is not classified as a Server stage.

I don't have access to DataStage right now, but the process of creating a lookup is documented in one of the .pdf manuals, probably the Server Job Developer's Guide. :?

Just use whatever stage you would normally use to interact with your database. For example, to do a lookup against Oracle, use an OCI stage. Drag a connecting link from the OCI stage to the Transformer to create a Reference Link; the link will be drawn as a dotted line to indicate that. Then, based on the key fields in the lookup table, define in your Transformer which fields in the input stream should be used to perform the lookup.
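As a rough sketch (the link and column names below are made up for illustration), the derivation for an output column in the Transformer might then look something like this, using the reference link's NOTFOUND variable to handle rows where no match is found:

    If LkpProducts.NOTFOUND Then "" Else LkpProducts.PRODUCT_NAME

In other words, the reference link's columns become available in the Transformer, and NOTFOUND lets you decide what to substitute when the key isn't found.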

Hope this helps,
-craig

"You can never have too many knives" -- Logan Nine Fingers
sdevashis
Participant
Posts: 54
Joined: Thu Oct 09, 2003 4:00 am
Location: India

Lookup on Access is working but hash file is not

Post by sdevashis »

Hi Chulett,
Thanks a lot for the help. I got that working, but I still have a hiccup; maybe a little more help from you will sort it out.

The scenario is this: I created a hash file with two columns. Once the hash file is created, I do a lookup by taking a reference from it. Surprisingly, the lookup returns NULL every time. :roll:

I loaded the data from the hash file into an Access database and tried the same lookup there, and it worked fine.

Is there anything I am missing on the hash file?

Devashis
/*Devashis*/
sdevashis
Participant
Posts: 54
Joined: Thu Oct 09, 2003 4:00 am
Location: India

Post by sdevashis »

Hi KC,
Probably I explained things badly... I have done things exactly the way you described.

I have a table PRODUCTCODE with PRODUCT_CODE and PRODUCT_NAME fields. I need to populate my PRODUCT_MASTER table, for which I need to do a lookup against the PRODUCTS table for PRODUCT_NAME. I have 12,000 products, and the best approach would be to build a hash file once and reuse it over and over again.

I did it exactly the way you described: I created a temporary table, populated it, checked the data quality, and then populated the hash file. What I was pondering after that was how to do the lookup, and Craig replied to use a reference link. I tried that but it is failing.

If a reference link is not the way to do a lookup against hash files, then what do I need to do?

Devashis
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Make sure your hash file has an identified primary key column; in your case PRODUCT_CODE sounds like it should have its key property set to YES. Make sure you SAVE YOUR HASH FILE METADATA TO THE LIBRARY.

In another job that has to look up against this hash file, LOAD THAT SAVED METADATA into your hash file reference stage. Press View Data and you should see your data displayed.

Whenever you do reference lookups, JUST AS IN SQL, you must make sure your data is trimmed if there's a chance of leading or trailing spaces breaking the equivalence on the reference. If you are comparing numbers, make sure you are not getting tripped up by leading or trailing zeroes (0.00 is NOT EQUAL to 0, 0123 is NOT EQUAL to 123). There is no datatyping; it is all string comparison in DataStage. Verify that you are not getting caught out by this. An easy way is to VIEW DATA and then double-click on the displayed field contents and look for highlighted leading or trailing spaces.
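As a purely illustrative sketch (the link and column names here are placeholders), the key expression on the stream side of the lookup in the Transformer could be wrapped like this to strip spaces, and, if you need it, to force a numeric code into a consistent form before the comparison:

    Trim(InLink.PRODUCT_CODE)
    Trim(InLink.PRODUCT_CODE) + 0

The first expression removes surrounding spaces; the second additionally coerces the value to a number, so "0123" and "123" compare as equal. Whether the numeric coercion is appropriate depends on how the keys were written into the hash file in the first place.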

Good luck
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Devashis,

As Ken mentions, this is almost certainly a "trim" problem. More than likely you have a key populated with (for example) "100" and you are trying to look it up with a key value of "100 " - or vice versa. Make sure all of your key information is neatly trimmed and you should be fine. :)
-craig

"You can never have too many knives" -- Logan Nine Fingers
sdevashis
Participant
Posts: 54
Joined: Thu Oct 09, 2003 4:00 am
Location: India

Thanks

Post by sdevashis »

Hi Craig, KC,

Thanks for the help... I tracked down the problem: there was an issue in one of the user-defined queries. I also learned a couple of things from KC's article about performance. Thanks a lot.


Devashis
/*Devashis*/
eoyylo
Participant
Posts: 57
Joined: Mon Jun 30, 2003 6:56 am

Post by eoyylo »

Hi Kenneth,
I am trying to understand how to determine the minimum modulus.
I have tried many times, but I never get an overflow file of 2048 bytes. The lowest value I reach is 372K!
You wrote "Since modulus is computed using average bytes per row and number of rows": what is the algorithm to calculate the minimum modulus?

thanks in advance.

regards
Mario
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

You can look at the unsupported hash file calculator utility on your DataStage CD, written by our illustrious Ray Wurlod, if you want to see exactly how to compute a minimum modulus. That's the easiest way to do it.
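If you just want a back-of-the-envelope figure rather than the utility (treat this as a rough approximation, not the official formula), the idea is simply data volume divided by group size, with some headroom. For example, with made-up numbers: 1,300,000 rows at an average of 60 bytes per row is roughly 78,000,000 bytes; divided by a 2048-byte group and allowing around 20% headroom, that suggests a minimum modulus somewhere in the region of 45,000-46,000. The calculator will give you a properly worked value for your actual row sizes and group size.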
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
eoyylo
Participant
Posts: 57
Joined: Mon Jun 30, 2003 6:56 am

Post by eoyylo »

Hi,
I tried to apply the suggestion with this example:
I have 1,000,000 records in the main stream, and I'd like to attach dimensions to this data.
The dimensions are in 3 hash files with these record counts:
Hash 1 - 1,300,000 records
Hash 2 - 2,500,000 records
Hash 3 - 1,300,000 records

I ran the experiment once with modulus = 1 and again with the correct value of the modulus.

The results are:

----- WRITE HASH FILE -------
With modulus = 1 the overflow files are huge.
With the correct value, the overflow file sizes are 2048 for Hash 1, 2048 for Hash 2 and 14336 for Hash 3.

With modulus 1 the creation of Hash 3 was faster than with the correct modulus. For the others, creation with the correct value was faster than with modulus 1.

----- READ HASH FILE -------
With modulus = 1 the read speed is about 5 times faster than with the correct modulus!

I thought that with the correct modulus the overflow would be small and so the speed would increase... but that doesn't happen. Why?

thanks in advance.

Regards

Mario
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

When you use MINIMUM.MODULUS, or a static hashed file, the hit of allocating disk space is taken when the hashed file is created rather than while it is being populated.

The best way to speed up population of a hashed file is to enable write cache, which means that the "writes to random groups" occur in memory and the result is flushed to disk one group at a time. However, the cache has a finite maximum size; if your hashed file is larger than what's available, caching will be reported first as requested and then as disabled in the job log.

When you report "faster" with modulo 1, this is an artifact of your forcing the hashed file to behave like a sequential file (all records in one group). Write performance WILL be a little faster; lookup performance, on the other hand, will be substantially slower because every lookup will have to scan the entire, badly overflowed, group.
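To put rough, purely illustrative numbers on that last point: with 2,500,000 rows sitting in a single group, each reference lookup has to scan on average around half of them, whereas with a sensible modulus each group holds only a handful of rows, so a lookup touches just a few records. The exact figures depend on row sizes and caching, but that is the order of magnitude at stake.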
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eoyylo
Participant
Posts: 57
Joined: Mon Jun 30, 2003 6:56 am

Post by eoyylo »

Hi Ray,
I did the test under the same conditions (same number of records, write cache, etc.).
First I created the hash file, and in a second job I used it to join the data.
In the second job, the results obtained during the join are:
with modulus = 1: 2500 rows/sec
with modulus = MINIMUM.MODULUS: 400 rows/sec
So reading is much slower with the correct modulus than with modulus = 1.
To me this is very strange and I don't understand why.
Could it depend on the disk configuration?

Thanks

Mario
P.S. I know this topic was started for DataStage 7, but I use DataStage 5.2.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Can't answer that until we know whether or not you've enabled read cache.

I have heard it argued, primarily for file dictionaries, that modulo 1 forces the contents of the hashed file all to be loaded into memory at first access (because the whole group needs to be scanned on first access), effectively caching the file dictionary at that time. Because it's being accessed for every row, that memory does not get aged by the operating system, and so tends to remain physically resident.

I suspect something similar may be occurring here, particularly if you do NOT have read cache enabled and if you're using a machine with ample memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eoyylo
Participant
Posts: 57
Joined: Mon Jun 30, 2003 6:56 am

Post by eoyylo »

Hi Ray,
in the Administrator the "Read cache size" is 128 MB,
the server has 500 MB of memory, and in the job I set "Pre-load in memory" = Enabled.

Is there a different way to force the entire hash file to be loaded into memory?

Thank you

Mario
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Write cache cannot be used if the hashed file is larger than the amount of allocated cache (or the percentage of a shared cache specified by DCMAXPCT and the other percentage parameters).

You can tune the amount of cache, but not to more than 999 MB.
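For what it's worth, the place to change these is normally the DataStage Administrator, in the project's Tunables settings, where the read and write cache sizes are specified in MB; the exact labels and limits can differ between versions, so check your own Administrator client rather than taking the names here as exact.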
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.