Creation of static hashed file runtime

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

parimi123
Participant
Posts: 12
Joined: Fri Nov 04, 2005 9:43 am
Location: Atlanta

Creation of static hashed file runtime

Post by parimi123 »

We have been using dynamic hashed files (default type 30) in our DataStage jobs. Until now this has not caused any issues, but we are now expecting much larger data volumes, and the current dynamic files do not seem to meet the requirement.
So we decided to go with static hashed files, and the results have been very good in our tests.

In this regard I need more information on how to create the static hashed file at run time depending on the size of the file.
We need to take the file size into account because it varies from a few thousand records to a couple of million. As part of the process we load the entire record into the hashed file. Each record in the hashed file is 915 bytes, and the key is alphanumeric (mostly a 13-digit telephone number, but sometimes alphabetic characters appear in the middle).

I want to use the mkdbfile utility to create the hashed file.

Please let me know what file type (e.g. type 30, 5, 18, etc.), separation (the Hashed File Calculator is suggesting 8) and modulus I need to use when creating the hashed file.
I think the modulus is the only parameter that varies with the size of the file, so I should be able to determine the modulus from the file size at run time.


Thank You,
Poorna
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Poorna,

The type in your case is probably going to be best set at 2 and the separation should be 1.

The static hashed file types are:

1/19 - Directory. Type 1 is limited to 14 character names, uses sub-dirs as delimiters (i.e. "testprogramname" will be "testprogramnam/e")
2 - numeric, 8 rightmost digits
3 - mostly numeric, 8 rightmost digits
4 - Alphabetic, last 5 characters
5 - ASCII, last 4 characters
6 - numeric, first 8 characters
7 - mostly numeric, first 8 characters
8 - alphabetic, first 5 characters
9 - ASCII, first 4 characters
10 - numeric, last 20 characters
11 - mostly numeric, last 20 characters
12 - alphabetic, last 16 characters
13 - ASCII, last 16 characters
14 - numeric, entire value
15 - mostly numeric, entire value
16 - alphabetic, entire value
17 - ASCII, entire value
18 - Anything, entire string
25 - b-tree, entire
30 - dynamic file

The less of the original key that needs to be hashed the better. Type 2 is generally the fastest algorithm and, in the case of telephone numbers, the further right you go the more the values will change - on the left you usually have a country code, area code and exchange as groupings. So most likely you will have by far the best distribution on the right-hand side of the key, and the few alphabetic characters you will occasionally see are not worth changing your file type to 3 or even 4. The difference in hashing speed between these types is going to be minimal, even with hundreds of thousands of operations.
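If it helps, the command the Hashed File Calculator generates for such a file looks roughly like the sketch below. The modulus of 40009 and the path are only placeholders, and I am assuming the path/type/modulus/separation argument order - copy the exact command from HFC rather than typing it from memory:

   # Illustrative only: type 2, placeholder modulus, separation 1.
   # Verify the argument order against the command HFC actually generates.
   $DSHOME/bin/mkdbfile /data/hash/CUST_HASH 2 40009 1

You would substitute the real path and whatever modulus HFC (or your own calculation) recommends for the volume you expect.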
parimi123
Participant
Posts: 12
Joined: Fri Nov 04, 2005 9:43 am
Location: Atlanta

Post by parimi123 »

Arnd,

Thanks for your quick reply.
By the way, sometimes my key (telephone number) has alphabetic characters in the middle; I think this should not be a problem for using type 2.

I have two more questions.

The first one is: can I still use type 2 for 64-bit hashed files as well?

My other question is about finding the modulus for the static hashed file. I need to determine the modulus at run time, since the number of records I load into the hashed file varies from a few thousand to a couple of million.

Thanks again,
Poorna
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

All hashed files allow the 64-bit option (I'm certain about all except the b-tree). The modulus imposes a very large performance hit if it is chosen too low; this is not the case when it is set too high. The downside of specifying a large modulus is that it reserves disk space even for empty files. So in your case, fill the file with the largest number of records, get a recommendation for a modulus from your favourite tool and use that value.
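As a very rough back-of-the-envelope check (a rule of thumb only, not what HFC actually computes), you can estimate a starting modulus from the row count, the record size and the group size, along these lines:

   # Rough modulus estimate - an assumption-laden rule of thumb, not HFC's algorithm.
   # Assumes 915-byte records, separation 8 (8 x 512 = 4096-byte groups) and about
   # 80% group fill; HFC will normally suggest a value (often a prime) in this region.
   ROWS=5500000
   MODULUS=`awk -v rows=$ROWS 'BEGIN { printf "%d", rows * 915 / (4096 * 0.8) + 1 }'`
   echo "Estimated modulus for $ROWS rows: $MODULUS"

Whatever you calculate, compare it with what HFC suggests for the same volume before settling on a value.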
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Trust the Hashed File Calculator.

A few alphabetics won't hurt the Type 2 hashing algorithm if the keys are primarily numeric, varying most at the right-hand (line number) end.

I must challenge your assertion that you are moving away from dynamic hashed files because of size. Any kind of hashed file, including dynamic, can store large amounts of data if the -64bit option is specified.

If HFC has recommended a separation of 8 (that is, 4KB groups), that suggests that your average record size is between approximately 667 and 1334 bytes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
parimi123
Participant
Posts: 12
Joined: Fri Nov 04, 2005 9:43 am
Location: Atlanta

Post by parimi123 »

Ray,

Thanks for your suggestions,

The reason for going to a static hashed file is my source file: it varies in size (from about 8,000 records up to 5,500,000), and the record size is 900 bytes.

For the 5.5-million-record file I need to create a 64-bit hashed file, as the hashed file will be around 4.4 GB. For the 8,000-record file I don't want to create a 64-bit one.

I forgot to mention that I call the DataStage jobs using a shell wrapper, so I want to create the static hashed file before calling dsjob, depending on the number of records.
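To be concrete, I am thinking of something roughly like the sketch below in the wrapper (untested; the paths, project and job names, the sizing rule of thumb, and the mkdbfile argument order of path, type, modulus, separation are all my assumptions - I would take the exact command syntax from the Hashed File Calculator). Please correct me if this is the wrong approach:

   #!/bin/ksh
   # Untested sketch of the wrapper logic. Paths, project/job names, the sizing
   # rule of thumb and the mkdbfile argument order (path type modulus separation)
   # are assumptions - take the exact syntax from the Hashed File Calculator.
   SRCFILE=/data/in/phone_feed.dat
   HASHFILE=/data/hash/CUST_HASH

   # Size the hashed file from the incoming record count
   ROWS=`awk 'END { print NR }' $SRCFILE`

   # Rule-of-thumb modulus: 900-byte records, roughly 80% group fill
   SEP=8                      # HFC suggested 8; Arnd suggested 1 - whichever we settle on
   GROUPBYTES=$(( SEP * 512 ))
   MODULUS=`awk -v rows=$ROWS -v g=$GROUPBYTES 'BEGIN { printf "%d", rows * 900 / (g * 0.8) + 1 }'`

   # Create the type 2 static hashed file before the job runs
   # (still need to work out how to make this 64-bit for the bigger feeds)
   $DSHOME/bin/mkdbfile $HASHFILE 2 $MODULUS $SEP

   # Then kick off the job
   dsjob -run -param SourceFile=$SRCFILE -param HashedFile=$HASHFILE MyProject LoadPhoneJob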
kris
Participant
Posts: 160
Joined: Tue Dec 09, 2003 2:45 pm
Location: virginia, usa

Post by kris »

ray.wurlod wrote: I must challenge your assertion that you are moving away from dynamic hashed files because of size. Any kind of hashed file, including dynamic, can store large amounts of data if the -64bit option is specified.
Hi Ray/Arnd,

Thanks for your time on this thread.

I worked with Parimi on this project.
At the time this project went live, we were processing at most about a million records per file, so we went with plain old 32-bit hashed files, nothing fancy. It went live and has been running fine without any problems.

Now they want to start adding more feeds to the process, and those will be very large. At least the initial loads of these new feeds are going to be huge.

As you said above, creating 64-bit dynamic hashed files and not worrying about the size of the hashed file may be the solution we need.

But we create these hashed files at run time, along with a lot of other intermediate sequential files, in one directory (specific to each file) which is also created at run time while processing each file. Can we create a 64-bit hashed file at run time?

Can we use mkdbfile to create a 64-bit dynamic hashed file?

The other option, changing 64BIT_FILES to 1 in uvconfig, is not what we want because many other projects share the same box.

Appreciate your time.

Thanks,
~Kris
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Add "64Bit" to the mkdbfile command.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You cannot do this with a Hashed File stage.

Add "-64bit" to a mkdbfile command executed as a before-job or before-stage subroutine. The Hashed File Calculator will give you the correct command syntax, which you can copy via the Edit menu.

Or use a UV stage with its DDL edited to include DATA, DICT and 64BIT keywords.
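For example, if you wrap the command in a small script and call it via the ExecSH before-job subroutine, it might look something like the line below (the path, modulus and flag placement are illustrative only - copy the real command, including the 64-bit keyword, from HFC rather than from this post):

   # before_hash.sh - hypothetical script invoked through the ExecSH before-job subroutine.
   # Take the exact mkdbfile syntax from the Hashed File Calculator (Edit menu > Copy)
   # and append the 64-bit flag to it.
   $DSHOME/bin/mkdbfile /data/hash/CUST_HASH 2 40009 1 -64bit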
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.