Slow Writing to Hashed files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

ray.wurlod wrote:Total all the characters in a row and add 1 for each column. There are no data types - numbers are stored as character strings. ...
Great !! That seems pretty simple then..
Just to confirm: the number of characters would ideally be the field length, right?
Share to Learn, and Learn to Share.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not ideally; VarChars on average are shorter than the maximum length specified. But if you total the lengths, you'll err on the side of caution, and over-size your hashed file. This is far better than under-sizing it.
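As a rough sketch of that sizing rule (Python here purely for illustration; the column names and lengths below are made-up examples, not from the job being discussed):

```python
# Ray's rule of thumb: total the declared lengths of all columns and
# add 1 byte per column. Numbers are stored as character strings in a
# hashed file, so lengths are simply character counts.

def estimated_record_bytes(column_lengths):
    """Upper-bound bytes per record: declared field lengths plus 1 per column."""
    return sum(column_lengths) + len(column_lengths)

# Hypothetical example: a 10-char key plus three VarChar(30) columns.
cols = [10, 30, 30, 30]
per_row = estimated_record_bytes(cols)
total_bytes = per_row * 1_000_000   # estimated volume for a million rows
print(per_row, total_bytes)         # 104 104000000
```

Totalling declared maximum lengths over-estimates for VarChar rows, which, as noted above, errs on the safe side of over-sizing the hashed file.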
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi Ray,

I have a small doubt...
The data across the 10 instances will not be evenly distributed,
so if I tune the hashed file for some arbitrary input number of records,
won't it affect the performance? i.e. it will be good in one instance and poor in another...

Please correct me wherever I am wrong... :(
Share to Learn, and Learn to Share.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's where "dynamic" hashed file comes in. It self-manages its space.

Clever, innit?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
phanikvv
Participant
Posts: 1
Joined: Mon Oct 11, 2004 12:56 pm

Same CRC32 value for 2 totally different strings

Post by phanikvv »

kumar_s wrote:You are right: CRC32 will give you a single field value for a combination of several fields, so you can reduce the overhead of doing the lookup on all the fields. But while preparing the CRC field you need to make sure the data type and length are the same on both sides, else you will end up getting different values and a lookup mismatch.
Hi Kumar,

Thanks for suggesting the CRC32 generation. I also tried to use CRC32 to generate a unique identifier for lookup purposes. The volume I am handling is around 8 million rows. I got distinct CRC32 values for almost all the records, except for some 6,900 rows. Those rows have totally different field values, but still end up generating the same CRC value. I am not sure if I need to change something on the server side to get a unique value for each string that I process. Any help on this would be highly appreciated.

TIA
Phani
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Phani,
Welcome aboard :D !!!

CRC32 is not a suitable approach for your case. A CRC32 is only 32 bits, so there are roughly 4.3 billion possible values, and by the birthday effect hashing 8 million rows into that space is expected to produce thousands of colliding rows. Your ~6,900 rows are about what the math predicts, not a server-side configuration problem.
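For anyone curious, the expected collision count follows from the standard birthday approximation; a quick sketch (Python, just for the arithmetic):

```python
# Birthday approximation: hashing n distinct keys into a space of S
# values yields roughly n*(n-1)/(2*S) expected colliding pairs.
# For CRC32, S = 2**32 (about 4.3 billion possible values).
n = 8_000_000
space = 2 ** 32
expected_pairs = n * (n - 1) / (2 * space)
print(f"{expected_pairs:.0f}")   # on the order of 7,000+ expected collisions
```

That is in the same ballpark as the ~6,900 duplicated rows observed, so the behaviour is inherent to a 32-bit hash, not a fault in the job.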
I apologize for giving a bad suggestion, if you followed the approach from my post.
As widely suggested, you could use a sequence key generator, using DataStage macros such as @INROWNUM/@OUTROWNUM.
Or google and define your own hashing algorithm that uses more bytes.
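A language-neutral sketch of the sequence-key idea (Python rather than DataStage BASIC; the field values are hypothetical): hand out the next sequential integer the first time a natural key combination is seen, which guarantees uniqueness by construction.

```python
# Sketch (not DataStage code) of the surrogate-key idea behind
# @OUTROWNUM-style sequence generation: map each distinct natural key
# to the next sequential integer.

def make_key_generator():
    seen = {}
    def surrogate_key(*natural_key_fields):
        # Join fields with a separator unlikely to occur in the data.
        key = "\x00".join(str(f) for f in natural_key_fields)
        if key not in seen:
            seen[key] = len(seen) + 1
        return seen[key]
    return surrogate_key

sk = make_key_generator()
print(sk("CUST01", "2006-10-18"))  # 1
print(sk("CUST02", "2006-10-18"))  # 2
print(sk("CUST01", "2006-10-18"))  # 1 again: same natural key, same surrogate
```

Unlike a hash, this cannot collide, but it requires keeping the key-to-surrogate mapping (e.g. in a hashed file) across runs.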
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Even I am looking out for a better approach to generating an SK based on one or more existing keys, if the whole point of the approach is to create a single integer key from several Char fields, at least in order to save space. :roll:
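One possible "more bytes" variant (a hypothetical sketch, not a DataStage feature): take the first 64 bits of SHA-256 over the concatenated key fields. By the same birthday arithmetic as above, 8 million rows in a 64-bit space gives an expected collision count far below one, versus thousands for a 32-bit CRC.

```python
import hashlib

# Wider-hash sketch: reduce several character fields to one 64-bit
# integer by truncating SHA-256. Collisions are still possible in
# principle, just astronomically less likely than with CRC32.

def key_hash_64(*fields):
    joined = "\x00".join(str(f) for f in fields).encode("utf-8")
    return int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")

h1 = key_hash_64("CUST01", "2006-10-18")   # hypothetical key values
h2 = key_hash_64("CUST02", "2006-10-18")
print(h1 != h2)
```

The trade-off is that a 64-bit value no longer fits a standard 32-bit integer column, so it saves less space than a true sequence key.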
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
rodre
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Post by rodre »

Have you checked your Server CPUs?

I was working on a similar project and noticed that loading data into one large hashed file was taking 10% of CPU. Loading 2 hashed files in parallel was taking 20% of CPU, and so on...

My 2 cents...
Post Reply