Creating hashed file takes too long

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse

Creating hashed file takes too long

Post by samsuf2002 »

Hi all, I am running a job that does a vertical pivot, creating 150 columns in the output hashed file. The input sequential file has 10 million rows. The job takes 11 hours to create the hashed file. I am using a Link Partitioner and Link Collector but it is not helping. Can anyone please suggest how I can improve the performance?

Thanks in advance
hi sam here
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Job design? Bottleneck? The pivot is a notoriously slow stage (especially given your volumes); your issue probably has nothing to do with the hashed file itself.
-craig

"You can never have too many knives" -- Logan Nine Fingers
samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse

Post by samsuf2002 »

Chulett, I am using stage variables in a Transformer instead of the Pivot stage to do the vertical pivot. My job design is:

Code:

Seq File ---> Link Partitioner ---> 4 streams of Transformer ---> Link Collector ---> Hashed File
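
The general shape of that stage-variable pivot is sketched below. It is only a sketch: the link, column and variable names (IN.KeyCol, IN.Value, svCol1 and so on) are made up, the real job carries 150 columns rather than two, and it assumes the input arrives sorted on the key.

Code:

Stage variables, evaluated top to bottom (svPrevKey initial value ''):

  svNewKey  :  IN.KeyCol <> svPrevKey
  svPos     :  IF svNewKey THEN 1 ELSE svPos + 1
  svCol1    :  IF svNewKey THEN IN.Value ELSE svCol1
  svCol2    :  IF svNewKey THEN '' ELSE IF svPos = 2 THEN IN.Value ELSE svCol2
  ...          (same pattern for the remaining pivot columns)
  svPrevKey :  IN.KeyCol

Output columns (hashed file keyed on KeyCol):  KeyCol = IN.KeyCol, Col1 = svCol1, Col2 = svCol2, ...

In this sketch every input row is written, and because a hashed file destructively overwrites on its key, only the last row of each key group (the one with all of its columns populated) remains once the load finishes.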
hi sam here
Krazykoolrohit
Charter Member
Posts: 560
Joined: Wed Jul 13, 2005 5:36 am
Location: Ohio

Post by Krazykoolrohit »

From the initial look of it, your hashed file may be corrupted. To check, open the hashed file's directory in Unix. If it is corrupted you will see a large number of files in there instead of the expected three (you may not even be able to open the directory).

Rename the hashed file and run the job again.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

To test whether the hashed file is the real culprit, split your job into two.
First job - write your output to a sequential file.
Second job - read from the sequential file and write to the hashed file.
This way you can identify whether your job design or the hashed file is the bottleneck.
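
In the same notation as the design posted earlier, the split might look like this (a Transformer is still needed in the second job because a server job needs an active stage between two passive ones):

Code:

Job 1 : Seq File ---> Link Partitioner ---> 4 streams of Transformer ---> Link Collector ---> Seq File
Job 2 : Seq File ---> Transformer ---> Hashed File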
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
fridge
Premium Member
Posts: 136
Joined: Sat Jan 10, 2004 8:51 am

Post by fridge »

Further to the above - if your hashed file IS the culprit, you may well need to tune it, as the default settings in terms of modulus, separation and type (i.e. wholly numeric key, numeric variance to the right, etc.) are often not best suited to your data.

A good place to start is the Hashed File Calculator (it should be in the Utilities folder on the client installation CD) or the HASH.HELP / HASH.ANALYZE commands that can be run from the Administrator client.
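
For example, from the Administrator client's Command window against the project that holds the hashed file (the name below is only an example; the SETFILE line is only needed if the file was created under a directory path rather than in the account, to set up a VOC pointer first):

Code:

SETFILE /path/to/MyHashedFile MyHashedFile OVERWRITING
HASH.HELP MyHashedFile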
samsuf2002
Premium Member
Posts: 397
Joined: Wed Apr 12, 2006 2:28 pm
Location: Tennesse

Post by samsuf2002 »

I replaced the output hashed file with a sequential file in the job design I showed above, and the job took 20 minutes to load all 10 million rows into the output sequential file. I am now using another job to load the sequential file into the hashed file, and it is very slow, around 600 rows/sec.
hi sam here
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Pre-allocating the hashed file's MINIMUM.MODULUS might make a difference. After the job completes, do a "HASH.HELP {file}" and it will tell you the current modulus, then do a "RESIZE {file} MINIMUM.MODULUS {size}" and re-run to see if the speed changes.
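
In concrete terms the RESIZE step might look like the line below, where the file name and size are purely illustrative and the asterisks leave the existing type, modulo and separation unchanged:

Code:

RESIZE MyHashedFile * * * MINIMUM.MODULUS 500000

The aim is to pre-allocate roughly as many groups as the fully loaded file will need, so the dynamic file is not repeatedly splitting groups while the 10 million rows are written.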
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Another thing to consider with the hashed file is that it has an upper size limit (2 GB unless it is created as 64-bit). If you are processing 10 million rows into a hashed file with 150 columns per row, then depending on the size of the data you could be shooting yourself in the foot.

Are you using a dynamic hashed file specification? If so, make sure you start with a very large minimum modulus so it doesn't have to reallocate too often. Also, you might want to consider pivoting the file at the OS level instead of doing it in a job. I posted Perl code that will do it for you, though it wasn't for this many fields; still, it should be pretty efficient once fleshed out. You can find my Perl utilities at viewtopic.php?t=113429&highlight=perl+scripts - it is the next-to-last script in the post. If you need assistance with it, let me know and I'll see if I can get you going.
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
vsi
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

How can I see the minimum modulus with a command from the Administrator tool?
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

If I understand you correctly, the minimum mod is 1 with a max of 999999. I don't see why you would need the admin tool to see this.
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The values with which the hashed file is created are on the Options tab of the Hashed File or UniVerse stage. Subsequently you can use an ANALYZE.FILE command to determine the current settings.

Use the Hashed File Calculator to recommend an appropriate setting.

This may be a case where a static hashed file (for example Type 2 or Type 18) performs better than a dynamic hashed file, particularly if the record sizes are relatively homogeneous.
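
By way of illustration only - the file name is made up, and the modulo and separation should come from the Hashed File Calculator rather than be guessed:

Code:

ANALYZE.FILE MyHashedFile
RESIZE MyHashedFile 18 {modulo} {separation}

A static type needs a modulo large enough that groups do not overflow once all 10 million rows are loaded, which is what the Hashed File Calculator works out from the expected row count and average record size.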
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.