Hashed File vs Routine... Performance Question?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Hashed File vs Routine... Performance Question?

Post by kaps »

Guys

I would like to know which one will perform better for the scenario below.

Say I have a file with 500,000 records, and each record has 10 columns.
Out of the 10 columns, 4 need to be validated.
I have a hashed file with one record for each column validation rule.

Scenario 1
----------

My DS job is like this... I have four reference links from the hashed file, as there are 4 fields that need to be validated.

Code:

	         Hashed File
	             | | | |	
	             | | | |
	             | | | |
seq file------->Transformer--------->Seq file
Scenario 2
----------

Code:

seq file------->Transformer--------->Seq file
Here, instead of doing a lookup against a hashed file in the Transformer, I use a routine to do the same thing (passing the value to the routine and getting the value back).
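For illustration, a minimal sketch of what such a routine body could look like in DataStage BASIC. It assumes a hashed file named VALRULES keyed on column name plus input value, and arguments ColumnName and InputValue; all of these names are assumptions for the sketch, not from the actual job. The file handle is cached in named COMMON so the OPEN happens only once per job run:

Code:

* Sketch only: VALRULES, the key layout and the routine name are
* assumptions for illustration.
* Routine arguments: ColumnName, InputValue

Common /ValRulesCache/ FileIsOpen, RulesF

If UnAssigned(FileIsOpen) Then FileIsOpen = @FALSE

If Not(FileIsOpen) Then
   * Open the hashed file once per job run; the handle survives
   * across calls in the named COMMON block
   Open "VALRULES" To RulesF Else
      Call DSLogFatal("Cannot open VALRULES", "ValidateViaHash")
   End
   FileIsOpen = @TRUE
End

* Return the translated value when a rule record exists,
* otherwise pass the input value through unchanged
Read Rec From RulesF, ColumnName : @TM : InputValue Then
   Ans = Rec<1>
End Else
   Ans = InputValue
End

The Transformer derivation for each validated column would then be something like ValidateViaHash("CUST_ID", InLink.CUST_ID), with CUST_ID again just a placeholder name.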

My question is: which approach will be faster?

Thanks
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The routine will be faster.
but
How are you planning on getting your 500,000 x 11 (10 columns + 1 key) pieces of information into your routine? If your routine does a hashed file lookup to get this information, then your scenario 1 is suddenly the better one, since you can pre-load the data into memory and even use just one memory image with system-level caching if you enable that.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Why not build both and publish your results?

How good are you at writing efficient DataStage BASIC routines?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rleishman
Premium Member
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

This query has a certain academic interest, but surely coding your own lookups in BASIC defeats the purpose of spending AUD$100K on an ETL tool.
Ross Leishman
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Thanks for the replies.
Actually, we are trying to come up with the best approach to cleanse or translate our data. I was suggesting the first approach mentioned in my first post, but there was an argument for sending the whole record to a routine: the routine would look up the hashed file to get the translated value for each field that needs to be validated, when a matching record exists in the hashed file, or just pass the input value through when it does not. This can be dynamic, meaning that if we add or change a rule in the rule table it will be picked up automatically! But the process is going to spend a lot of time doing lookups for the columns which do not need to be validated.
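To make that trade-off concrete, a hedged sketch of the whole-record variant in DataStage BASIC. The pipe delimiter, the VALRULES hashed file and the ValidateCols argument (a comma-separated list of field positions to check) are all assumptions for illustration:

Code:

* Sketch only: the pipe delimiter, VALRULES and the argument names
* are assumptions for illustration.
* Routine arguments: InRecord     (pipe-delimited record)
*                    ValidateCols (field positions, e.g. "2,5,7,9")

Common /RecValCache/ FileIsOpen, RulesF

If UnAssigned(FileIsOpen) Then FileIsOpen = @FALSE

If Not(FileIsOpen) Then
   Open "VALRULES" To RulesF Else
      Call DSLogFatal("Cannot open VALRULES", "ValidateRecord")
   End
   FileIsOpen = @TRUE
End

OutRecord = InRecord

* Only the listed positions hit the hashed file; every other
* column passes through untouched
For I = 1 To Dcount(ValidateCols, ",")
   Pos = Field(ValidateCols, ",", I)
   InVal = Field(InRecord, "|", Pos)
   Read Rec From RulesF, Pos : @TM : InVal Then
      OutRecord = Fieldstore(OutRecord, "|", Pos, 1, Rec<1>)
   End
Next I

Ans = OutRecord

Even with the file handle cached in COMMON, every validated column still costs a Read per row, and every routine call crosses the BASIC function-call boundary, which is the overhead ArndW's scenario 1 avoids by letting the Transformer use a pre-loaded hashed file.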

Of course, as rleishman said, ETL tools are meant to reduce procedural-language coding, which hides the metadata of the process!

Thanks