Slow Writing to Hashed files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

ray.wurlod wrote:Total all the characters in a row and add 1 for each column. There are no data types - numbers are stored as character strings. ...
Great !! That seems pretty simple then..
Just to confirm: the number of characters would ideally be the field length, right?
Share to Learn, and Learn to Share.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Not ideally; VarChars on average are shorter than the maximum length specified. But if you total the lengths, you'll err on the side of caution, and over-size your hashed file. This is far better than under-sizing it.
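As a rough sketch of that sizing rule (Python here purely for illustration; the column names and lengths below are made-up examples, not from the job being discussed):

```python
# Ray's rule of thumb: total the declared lengths of all columns and
# add 1 byte per column. Numbers are stored as character strings in a
# hashed file, so lengths are simply character counts.

def estimated_record_bytes(column_lengths):
    """Upper-bound bytes per record: declared field lengths plus 1 per column."""
    return sum(column_lengths) + len(column_lengths)

# Hypothetical example: a 10-char key plus three VarChar(30) columns.
cols = [10, 30, 30, 30]
per_row = estimated_record_bytes(cols)
total_bytes = per_row * 1_000_000   # estimated volume for a million rows
print(per_row, total_bytes)         # 104 104000000
```

Totalling declared maximum lengths over-estimates for VarChar rows, which, as noted above, errs on the safe side of over-sizing the hashed file.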
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
asitagrawal
Premium Member
Posts: 273
Joined: Wed Oct 18, 2006 12:20 pm
Location: Porto

Post by asitagrawal »

Hi Ray,

I have a small doubt...
The data across the 10 instances will not be evenly distributed,
so if I tune the hashed file for some arbitrary input number of records,
won't it affect the performance? i.e. it will be good in one instance and poor in another...

Please correct me wherever I am wrong... :(
Share to Learn, and Learn to Share.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's where "dynamic" hashed file comes in. It self-manages its space.

Clever, innit?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
phanikvv
Participant
Posts: 1
Joined: Mon Oct 11, 2004 12:56 pm

Same CRC32 value for 2 totally different strings

Post by phanikvv »

kumar_s wrote:You are right: CRC32 will give you a single field value for a combination of several fields, so you can reduce the overhead of doing the lookup on all the fields. But while preparing the CRC field you need to make sure the data type and length are the same on both sides, else you will end up getting different values and a lookup mismatch.
Hi Kumar,

Thanks for suggesting the CRC32 generation. I also tried to use CRC32 to generate a unique identifier for lookup purposes. The volume I am handling is around 8 million rows. I got distinct CRC32 values for almost all the records, except for some 6,900 rows. Those rows have totally different field values, but still end up generating the same CRC value. I am not sure if I need to change something on the server side to get a unique value for each string that I process. Any help on this would be highly appreciated.

TIA
Phani
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Hi Phani,
Welcome aboard :D !!!

CRC32 is not a suitable approach for your case. A CRC32 is only 32 bits, so there are roughly 4.3 billion possible values, and by the birthday effect hashing 8 million rows into that space is expected to produce thousands of colliding rows. Your ~6,900 rows are about what the math predicts, not a server-side configuration problem.
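For anyone curious, the expected collision count follows from the standard birthday approximation; a quick sketch (Python, just for the arithmetic):

```python
# Birthday approximation: hashing n distinct keys into a space of S
# values yields roughly n*(n-1)/(2*S) expected colliding pairs.
# For CRC32, S = 2**32 (about 4.3 billion possible values).
n = 8_000_000
space = 2 ** 32
expected_pairs = n * (n - 1) / (2 * space)
print(f"{expected_pairs:.0f}")   # on the order of 7,000+ expected collisions
```

That is in the same ballpark as the ~6,900 duplicated rows observed, so the behaviour is inherent to a 32-bit hash, not a fault in the job.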
I apologize for giving a bad suggestion, if you followed the approach from my post.
As widely suggested, you could use a sequence key generator, using DataStage macros such as @INROWNUM/@OUTROWNUM.
Or google and define your own hashing algorithm that uses more bytes.
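A language-neutral sketch of the sequence-key idea (Python rather than DataStage BASIC; the field values are hypothetical): hand out the next sequential integer the first time a natural key combination is seen, which guarantees uniqueness by construction.

```python
# Sketch (not DataStage code) of the surrogate-key idea behind
# @OUTROWNUM-style sequence generation: map each distinct natural key
# to the next sequential integer.

def make_key_generator():
    seen = {}
    def surrogate_key(*natural_key_fields):
        # Join fields with a separator unlikely to occur in the data.
        key = "\x00".join(str(f) for f in natural_key_fields)
        if key not in seen:
            seen[key] = len(seen) + 1
        return seen[key]
    return surrogate_key

sk = make_key_generator()
print(sk("CUST01", "2006-10-18"))  # 1
print(sk("CUST02", "2006-10-18"))  # 2
print(sk("CUST01", "2006-10-18"))  # 1 again: same natural key, same surrogate
```

Unlike a hash, this cannot collide, but it requires keeping the key-to-surrogate mapping (e.g. in a hashed file) across runs.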
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

Even I am looking out for a better approach to generating an SK based on one or more existing keys, if the whole point of the approach is to create a single integer key from several Char fields, at least in order to save space. :roll:
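One possible "more bytes" variant (a hypothetical sketch, not a DataStage feature): take the first 64 bits of SHA-256 over the concatenated key fields. By the same birthday arithmetic as above, 8 million rows in a 64-bit space gives an expected collision count far below one, versus thousands for a 32-bit CRC.

```python
import hashlib

# Wider-hash sketch: reduce several character fields to one 64-bit
# integer by truncating SHA-256. Collisions are still possible in
# principle, just astronomically less likely than with CRC32.

def key_hash_64(*fields):
    joined = "\x00".join(str(f) for f in fields).encode("utf-8")
    return int.from_bytes(hashlib.sha256(joined).digest()[:8], "big")

h1 = key_hash_64("CUST01", "2006-10-18")   # hypothetical key values
h2 = key_hash_64("CUST02", "2006-10-18")
print(h1 != h2)
```

The trade-off is that a 64-bit value no longer fits a standard 32-bit integer column, so it saves less space than a true sequence key.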
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
rodre
Premium Member
Posts: 218
Joined: Wed Mar 01, 2006 1:28 pm
Location: Tennessee

Post by rodre »

Have you checked your Server CPUs?

I was working on a similar project and noticed that loading data into one large hashed file was taking 10% of CPU. Loading 2 hashed files in parallel was taking 20% of CPU, and so on...

My 2 cents...
Post Reply