What is the best practice to load 50 million+ rows into hashed files?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

datability_user
Premium Member
Posts: 50
Joined: Sat Jan 31, 2009 3:39 am
Location: Riyadh

What is the best practice to load 50 million+ rows into hashed files?

Post by datability_user »

Hello guys,

We have a requirement to load data for multiple dimensions into hashed files. The records are not unique, so we take advantage of the hashed file's primary key to overwrite duplicates.
I have created 64BIT pathed hashed files, but populating 6 files with 38 million records has taken more than 24 hours so far to load only 25 million, and it is still running.
The load started at 1700+ rows per second and has gradually dropped to under 300.
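That works out to roughly 290 rows per second on average (25,000,000 rows / 86,400 seconds), so at the current rate the remaining 13 million rows will need well over 12 more hours.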

The job is very simple: text file ----> Transformer ----> 6 hashed files.
Each file has 5 or 6 columns, of which 3 or 4 are key columns.

I would appreciate your help.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

After the file has been loaded with data, run "ANALYZE.FILE {YourHashedFile}" from either TCL or the Administrator client. Add 10% to your current number of groups, use "PRIME {YourNumber}" to get the next higher prime number, and choose that as the MINIMUM.MODULUS value when you create the file.
If you run "ANALYZE.FILE {YourHashedFile} STATS" and post the output here, it might be possible to tune the hashed file further.
The default DYNAMIC file might not be the best or most efficient solution in this case; a static hashed file might give you better performance. Is the write part more important than the later read from the hashed file?
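A minimal sketch of that sequence at TCL, keeping the same placeholders (substitute your own file name and numbers):

ANALYZE.FILE {YourHashedFile}
PRIME {GroupCountPlusTenPercent}
RESIZE {YourHashedFile} DYNAMIC MINIMUM.MODULUS {NextPrime} 64BIT

Here {GroupCountPlusTenPercent} is the group count that ANALYZE.FILE reports plus about 10%, {NextPrime} is the prime number that PRIME returns, and the 64BIT keyword is only needed if the file genuinely has to grow beyond 2 GB.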
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

First off, do you really need to load all records into the hashed files? Are you using the hashed files just to de-dupe the data? That doesn't seem to me to be the best approach for a volume like that. :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Hmm... so far the file is still below the 2 GB limit that would necessitate using a 64bit file (which is slower than the default file).

Can you post the "ANALYZE.FILE m_HFg2_6 STATS" output? That would give some more information. Do you know what the keys look like and whether they are more (or less) unique on the left, right or middle? Are the record lengths approximately the same for most records? In this case you have a modulo of 632705 with a minimum of 1 - make sure you change that minimum modulus to at least 632713.
datability_user
Premium Member
Posts: 50
Joined: Sat Jan 31, 2009 3:39 am
Location: Riyadh

Post by datability_user »

Hello and thanks Arnd,

Since the file is still loading, the output of ANALYZE.FILE m_HFg2_6 STATS is giving me only lines of stars.

To clarify, I have one sequential file writing to six different hashed files at the same time (m_HFg2_1, ..., m_HFg2_6).

The keys are also different in each file.
For example, m_HFg2_6 has 5 columns: 4 of them make up the key, and the last column is the month (YYYYMM), which is a constant value.
The key columns are like this:
- key1 (only 2 different values, 11 and 12)
- key2 (7 different values, between 1 and 7)
- key3 (only 2 different values, 1 or 2)
- key4 (50-70% unique)
- column5 is the constant value 200901

The other files are similar to this one.

I created all the hashed files using the Hashed File stage, then
used these commands to add the VOC entries and change them to 64BIT:
SETFILE /app/na/source/M_HFg2_1 M_HFg2_1 OVERWRITING
RESIZE M_HFg2_1 * * * 64BIT


Using this approach, what command should I use to set the minimum modulus?

Thanks
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

The good news is that there is a lot of room for improvement here. First off, can you remove the 5th column from the key completely? The shorter and simpler the key is, the better the hashing algorithm will work. If the 5th column cannot be removed from the key, move it to position 1 and specify SEQ.NUM for the hashing algorithm if you stick with a dynamic file. The numeric hashing method will use bits from the right-hand part of the key, which will be more unique than any other part of the key.
It would also make sense to go for a static hashed file in this scenario, but you will need the file to be completely filled at least once to get the correct modulo.
Likewise, if you stick with dynamic files, change your resize to "RESIZE M_HFg2_1 DYNAMIC MINIMUM.MODULUS 632713 64BIT" or use a larger number. This will pre-allocate the disk space for the file's groups up front, but it will make inserts go much faster.
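If you do keep that numeric column at the front of the key of a dynamic file, I believe the hashing algorithm can be named in the same RESIZE call - something like the line below - but check the exact keyword placement against the UniVerse command reference for your release:

RESIZE M_HFg2_1 DYNAMIC SEQ.NUM MINIMUM.MODULUS 632713 64BIT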

The STATS option will print rows of stars on the page while it scans the file's contents, but once that completes you will get a statistics page.
datability_user
Premium Member
Posts: 50
Joined: Sat Jan 31, 2009 3:39 am
Location: Riyadh

Post by datability_user »

Thanks Arnd,


The 5th column is not part of the key, and I can remove it completely.
I will go with your valuable suggestions and come back with the results.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I think the columns comprising the key should be as small as possible. Which columns, with what average total length, are in the data portion of each record?
datability_user
Premium Member
Posts: 50
Joined: Sat Jan 31, 2009 3:39 am
Location: Riyadh

Post by datability_user »

There is one 13-byte (char) column in all files; the other columns are between 1 and 5 bytes in all the hashed files.
The maximum concatenated length of the key columns is about 18-22 bytes (char).
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Once your file is full of data, get the number of groups, add 10%, find the next higher prime number and use that value as the MINIMUM.MODULUS for the dynamic hashed file. Also check the UNIX files DATA.30 and OVER.30 to see whether they are over 2 GB. If they aren't, a 32bit hashed file is preferable.
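A quick way to check those sizes from the UNIX side, using the pathed directory mentioned earlier in the thread (adjust for each of your hashed files):

ls -l /app/na/source/M_HFg2_1/DATA.30 /app/na/source/M_HFg2_1/OVER.30

If both files are comfortably under 2 GB (2147483648 bytes), the 32bit format is the better choice.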
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

datability_user wrote: There is one 13-byte (char) column in all files; the other columns are between 1 and 5 bytes in all the hashed files.
The maximum concatenated length of the key columns is about 18-22 bytes (char).
Plug those numbers into Hashed File Calculator and see the recommended size. Choose file type 18 as a first guess.
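If you then create the static file at TCL, the command takes the file name, type, modulo and separation; a sketch only, assuming the standard CREATE.FILE syntax (verify against your UniVerse documentation), with the modulo and separation as placeholders to be replaced by the calculator's recommendation:

CREATE.FILE M_HFg2_6 18 {RecommendedModulo} 4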
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datability_user
Premium Member
Posts: 50
Joined: Sat Jan 31, 2009 3:39 am
Location: Riyadh

Post by datability_user »

Thank you very much guys;

The job finished in around 92 hours, and I have more than one job built on the same concept, so I changed the whole scenario: I split the source data into 10 flat files and load them one by one in a sequence. That seems fine, since with the split files it takes around 90 minutes to finish all the jobs, and of course I have set the MINIMUM.MODULUS for every hashed file (more than 50 files).
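One way to produce the ten flat files on UNIX is the standard split command; the file name below is only an illustration, not the actual source file:

split -l 3800000 source_data.txt part_

This writes part_aa, part_ab, ... of 3.8 million lines each, and each part is then loaded by its own job in the sequence.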

So thank you again.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It might be interesting to run the ten jobs simultaneously rather than consecutively.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.