Hash File Performance Degrading ??

Post questions here related to DataStage Server Edition, for such areas as Server job design, DS BASIC, Routines, Job Sequences, etc.


ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I suspect there is no VOC entry. Please report the result of the following command:

Code:

LIST.ITEM VOC 'Hsh_PDSPersCovgPrevHist_coal_13058'
If that reports that there is no such entry, please advise the pathname of your hashed file. You can create the VOC entry using a SETFILE command.
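For example (a sketch only; the pathname below is hypothetical, so substitute the actual location of your hashed file):

Code:

SETFILE /path/to/your/project/Hsh_PDSPersCovgPrevHist_coal_13058 Hsh_PDSPersCovgPrevHist_coal_13058 OVERWRITING

This creates (or replaces) a VOC file pointer of that name pointing at the given directory.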
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Hi,

These are the stats that I gathered with the help of the SETFILE command:

FILETYPE : DYNAMIC
HASHING ALGORITHM : GENERAL
MODULUS : 678494 (I had set this parameter)

LARGE RECORD SIZE : 1628
GROUP SIZE : 4096
LOAD FACTORS : SPLIT (80%), ACTUAL (55%), MERGE (20%) (I had set these parameters)
TOTAL SIZE : 3068252160 bytes
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

RECORDS COUNTED

Post by Vinodanand »

I am sorry, I missed the number of records counted. It is 2411941.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

You only have a 55% load with a modulus of 678494; try using a MINIMUM.MODULUS of 284969. Also, what does your key look like? Are the rightmost couple of characters/digits evenly spread? If so, you might try using the SEQ.NUM hashing algorithm.
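If the file already exists, one way to apply a new MINIMUM.MODULUS is from the TCL prompt, something along these lines (a sketch only; check the exact RESIZE options for dynamic files on your release):

Code:

RESIZE Hsh_PDSPersCovgPrevHist_coal_13058 * * * MINIMUM.MODULUS 284969

The hashing algorithm (GENERAL or SEQ.NUM) is chosen from the same set of dynamic-file parameters when the file is created.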
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Hi Arnd,

What does an evenly spread key mean? My key columns are an ID (like an SSN) and a coverage type. I have 2.4 million records of this combination, where the ID is a 22-byte number and the coverage type is a 2-byte number. Also, the minimum modulus that I set was computed from one of the posts I read on the forum. What does ACTUAL mean? If you look at the DATA.30 and OVER.30 there has been a considerable amount of overflow. Please tell me, is there a way to calculate the modulus apart from using the HFC, as I do not have the install CD?

Regards,
Vinod
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Look at the rightmost characters of your key: would they be very similar on all keys or spread out? For example, a sequential series of numbers is well spread out, but if the last character is always 'X' or 'Y' that is not well distributed. I've often found that using SEQ.NUM even on a string key gives good distribution and, particularly with static hashed files, the algorithm is much more efficient.

There are so many well-thought-out and descriptive posts on DSXchange regarding file sizing that I am not going to attempt to go into detail.

For dynamic files the modulus is dynamically computed according to your SPLIT and MERGE settings; using a large initial MINIMUM.MODULUS saves the time required for SPLITs during the data load, and also pre-allocates much of the disk space used for storing the hashed file, so that doesn't need to be done at runtime.

Setting the MINIMUM.MODULUS too high is not fatal, but can impact performance.
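As an illustration of pre-allocating at creation time (a sketch only; the modulus shown is just the figure suggested above, and the other parameters would need to match your own sizing):

Code:

CREATE.FILE Hsh_PDSPersCovgPrevHist_coal_13058 DYNAMIC GENERAL MINIMUM.MODULUS 284969 GROUP.SIZE 2 LARGE.RECORD 1628

The same parameters can also be set through the Create File options on the Hashed File stage, which may be where the modulus was set in your case.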
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Now I get what you are saying. My combination of ID and coverage type would be the key, but the last 6 digits of the 22-digit ID are always 0, and the coverage type is between 01 and 06. So a key looks like

.....000001, where 01 is the coverage type. I will try to decrease the modulus and see if I can get any gain, but Arnd, I have been working on this for 2 days; my hashed file takes 3 hours to build, and when I use it as a lookup it takes 7 hours for the job to complete. I have also reduced the columns from 67 to 55. Is there any more tuning that I can do? I have also enabled the stage write cache. I am not sure what else I need to do.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Vinodanand wrote: ...but the last 6 digits of the 22-digit ID are always 0
If you strip this dummy data out of your key you would save 18 MB of key space alone.
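(Roughly: 6 redundant bytes per key multiplied by the 2.4 million-odd rows reported above comes to about 15 MB of raw key data, before per-record storage overhead, which is presumably where the 18 MB estimate comes from.)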
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

ArndW wrote:
Vinodanand wrote: ...but the last 6 digits of the 22-digit ID are always 0
If you strip this dummy data out of your key you would save 18 MB of key space alone.
So what I need to do is ignore these 6 bytes when writing out the ID, change the modulus and the hashing algorithm to SEQ.NUM, and test with the same data set. I will do that to see if it helps my performance in any way. I also feel that, because of the overflow, the lookups are very slow. Thanks Arnd, let me try this first.

Regards,
Vinod
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Not exactly what I was thinking, but not really wrong, either.

First of all, if you have any 'unused' characters in the key or columns then remove them in order to reduce the file size. Delete any columns you aren't using in your lookup from the hashed file.
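For example (a sketch only; the link and column names here are hypothetical), the key derivation in the Transformer that loads the hashed file could keep only the 16 significant digits of the ID:

Code:

InLink.ID[1,16]

with the same substring applied to the key expression on the reference (lookup) link so that the values still match.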

Changing between the GENERAL and SEQ.NUM hashing algorithms isn't going to make as much of a difference as even removing a couple of bytes per row.

The MINIMUM.MODULUS setting might help your write times but won't affect your read times.
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Hi Arnd,

I thought it would be better if i replied in this topic rather than the one for computed blink.

I ran ANALYZE.FILE and got the following stats:

FILETYPE : DYNAMIC
HASHING ALGORITHM : GENERAL
MODULUS : 114057 (114057 minimum)

LARGE RECORD SIZE : 1628
GROUP SIZE : 2048
LOAD FACTORS : SPLIT (80%), ACTUAL (76%), MERGE (20%)
TOTAL SIZE : 2947037184 bytes
RECORD COUNT : 2814039


The write for this job ran for 4 hrs and the read ran for 8 hrs. I removed unwanted columns and the number of columns is 67 now. Any other suggestions?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Your write speed is about 200 KB per second - I don't know your system layout, so I cannot comment on whether that is a good speed or not.
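(Presumably derived from the ANALYZE.FILE output above: 2,947,037,184 bytes written in roughly 4 hours is 2,947,037,184 / 14,400 seconds, or about 205 KB per second.)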

Reading should be faster than writing.

How are you reading this file, i.e. with a Hashed File stage doing the read? If so, do a test job with just the read stage and an output Sequential File stage directed to /dev/null and see what the speed is. If it is high (which I suspect it will be) then your bottleneck is not the actual reading but some other stage in the job.

If this read speed remains only 1/2 of the write speed then I'd check out the hardware layer (i.e. is it on a SAN with some funky layout?).
Vinodanand
Premium Member
Posts: 112
Joined: Mon Jul 11, 2005 7:54 am

Post by Vinodanand »

Hi Arnd,

Where does /dev/null point to? Do you want me to point the directory path to /dev/null? My job is like this:

            HF1                     HF2 (2.8 million)
             |                       |
SeqFile --- TF1 ------------------- TF2 ------------------- SeqFile

TF (1 & 2) : Transformer
HF1 : Hashed File 1, looked up at TF1
HF2 : Hashed File 2, looked up at TF2

Because the second hashed file takes a lot of time, the entire process slows down. If I run it without HF2 then the job runs in 1 hr max.



Thanks,
Vinod