Distributed Hash Files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Distributed Hash Files

Post by ewartpm »

Hi Guys

I have created a job that creates 10 hash files. Each hash file will be incorporated into a distributed hash file. The average number of rows in each hash file is 6 million. The hash file has 4 columns that make up the key; the constraint in the job is based on the value of only one of those key columns.

When I create the distributed hash file, I use the INTERNAL option with @ID, e.g. INTERNAL "IF @ID >= 4000000 AND @ID < 8000000 THEN 1 ELSE 2"

Seeing that @ID is the whole key of the hash file, I'm not sure this will work when doing a lookup against the distributed hash file, because the job that creates the hash files uses only a part of the hash file key. :?

How does the SYSTEM option work when creating a distributed hash file? Will this perhaps solve my problem?

Any suggestions on how to create a distributed hash file when the hash file has more than one column as the key?
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

A Distributed Hash File works like this:

you fill several hash files that have the same metadata structure.
you build a "master" file, which presents all of those hash files as one.

And yes, the keys have to be completely distinct across the files.

You can get such keys, for example, by using @INROWNUM or @OUTROWNUM and concatenating 1, 2, 3, 4 ... up to 10 to the end (a different value for each file).

Or, if you know you have at most 10^11 rows, you can add a different offset (1*10^13, 2*10^13, and so on) to each row number to get unique keys.
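
For example - and this is only a sketch - suppose each instance of the job is passed its part number (1 to 10) in a job parameter; I will call it PartNo, which is not from the original post. The key derivation in that instance's Transformer could then be either of these:

Code: Select all

* option 1: append this instance's part number to the generated row number
@OUTROWNUM : "-" : PartNo

* option 2: add a distinct offset per instance (safe while each instance writes fewer than 10^13 rows)
@OUTROWNUM + (PartNo * 10000000000000)

Either way no two instances can generate the same key, which is what the Distributed file needs.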

Kind regards
Wolfgang
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you have 10 hashed files (with identical record structure and some characteristic of the key that consistently identifies a part file) then you can construct a Distributed file. However, your partitioning algorithm only specifies a two-part Distributed file. Which do you want: two parts or ten?

Imagine for a moment that the rightmost character of @ID (which is the entire key including separator characters (@TM by default)) is a numeric character; thus its domain is the digits "0" through "9". Assuming this gives a reasonably flat spread (which you can test with queries), you can define a Distributed file on that basis. Let us further assume that the ten hashed files share the one dictionary (D_HF) and are called HF01 through HF10.

Code: Select all

DEFINE.DF DATA DistFile ADDING     _
     HF01  1                       _
     HF02  2                       _
     HF03  3                       _
     HF04  4                       _
     HF05  5                       _
     HF06  6                       _
     HF07  7                       _
     HF08  8                       _
     HF09  9                       _
     HF10 10                       _
INTERNAL "IF NUM(@ID[1]) THEN @ID[1] ELSE 0"

UPDATE VOC SET F3 = 'D_HF' WHERE F0 = 'DistFile';
Allocating keys whose rightmost character is non-numeric to the (illegal) part number 0 means that such keys cannot be written to the Distributed file at all. If you do want to handle these, or generic keys, replace the partitioning algorithm above with something like

Code: Select all

INTERNAL "MOD(SEQ(@ID[1]),10)+1"
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Not sure what you mean, Wolfgang. I understand that each hash file's metadata has to be the same.

Here's my dilemma... if I use @INROWNUM to send 1000 rows to hash file 1, the next 1000 rows to hash file 2 and so on, how do I specify this in the DEFINE.DF statement when I'm creating the distributed hash file? Can I use any column of the hash file in the INTERNAL portion of the command, or do I have to use @ID?

Also, what happens when the SYSTEM option is used instead of INTERNAL?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

SYSTEM requires a key of the form partno:delimiter:remainder. For example: 1/7603, 2/1311 and so on. The delimiter character is specified in the algorithm, for example

Code: Select all

SYSTEM '/'
for the above keys.

So, if the first column of your key can be guaranteed to contain the part numbers (1 through 10?), then you can use the SYSTEM partitioning algorithm. If not, you can't.
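
To make that concrete - a sketch only, with DistFile2 as a made-up name and just two parts reusing the file names from my earlier example - a SYSTEM-based definition would look like:

Code: Select all

DEFINE.DF DATA DistFile2 ADDING   _
     HF01  1                      _
     HF02  2                      _
SYSTEM '/'

A record with key 1/7603 would then be stored in HF01, and one with key 2/1311 in HF02; the part number is simply whatever precedes the first "/" in the key.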

Incidentally, you can't define a Distributed file unless you have write permission both to VOC and to &PARTFILES&.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

And to answer your last open question - the algorithm that returns the part number is an I-type, that is, a compiled dictionary expression; it can reference any part of the record, but is usually restricted to the key column.
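
For example (a hypothetical dictionary item, not taken from ewartpm's job), an I-descriptor that derives a part number from the fourth @TM-separated piece of the key could be entered with ED roughly like this:

Code: Select all

ED DICT HF01 PARTNO
01 I
02 IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE 2
03
04 Part
05 3R
06 S

LIST HF01 PARTNO would then show which part each existing row maps to.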
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

1. The job creating the hash files is a multi-instance job.
2. Each instance of the job is passed different parameters for a sql select statement (where akakey >= startkeyvalue and akakey < endkeyvalue).
3. The hash file contains 4 key columns, akakey being the last of the 4.
4. Each key column is defined as a BigInt (19)
5. 10 hash files HF01 --> HF10 are created.

I tried this but got an error about the algorithm not matching:

INTERNAL "IF AKAKEY >= 4000000 AND AKAKEY < 8000000 THEN 1 ELSE IF AKAKEY >= 8000000 etc

Also tried this and got the same error
INTERNAL "IF @ID[59,19] < 4000000 AND @ID[59,19] < 8000000 THEN 1 ELSE etc

:oops: I seem to be missing the boat here. I did a LIST DICT HF01 and akakey is an I column, so why can't I use it?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

ewartpm -> The I-type needs to be in your VOC file; do a "COPY FROM DICT <PartfileWithItype> TO VOC akakey" and then run your "DEFINE.DF ..... INTERNAL akakey"
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Me again.

I tried all the options but still get the error about the partitioning algorithm for the distributed file not matching when I run the DEFINE.DF command.

I also tried:

INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE etc -- no joy

What is the correct procedure for creating a distributed hash file? Must I create empty hash files first, add them to the distributed file and then populate the hash files,
OR
create and populate the hash files and then add them to the distributed file? (I have tried the latter option.)
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

ewartpm,

did you copy the I-descriptor from your file to the VOC?

I did the following test:

1. CREATE.FILE AW1 DYNAMIC
2. CREATE.FILE AW2 DYNAMIC
3. ED DICT AW1 dfkey
01 I
02 IF @ID[1,1]='1' THEN 1 ELSE 2
...
05 3R
4. LIST AW1 dfkey {to force a compile of the I-descriptor}
5. COPY FROM DICT AW1 TO VOC dfkey
6. DEFINE.DF MYDF AW1 1 AW2 2 INTERNAL dfkey
7. {insert a couple of records in MYDF}
8. LIST AW1 / LIST AW2 to see if the algorithm is working.


All of this worked without a hitch, apart from my speeling [sic.] errors :)
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Hi Guys

Thanks for all the replies. What seems to have been the problem is the name of the distributed file, i.e. FT_AKANAMESHASH.

When I changed the name to FT_AKANAMES_HASH and used INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE etc it worked. I can view data in the distributed file, so all is almost well.
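
For the record, the definition that worked was along these lines (trimmed here to the first two part files and ranges; the real command lists all ten part files and extends the IF with one range per part):

Code: Select all

DEFINE.DF DATA FT_AKANAMES_HASH ADDING        _
     FT_AKANAMESHASH01  1                     _
     FT_AKANAMESHASH02  2                     _
INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE 2"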

When I try to do a DELETE.FILE FT_AKANAMESHASH, it does not work.
When I try DEFINE.DF FT_AKANAMESHASH REMOVING FT_AKANAMESHASH01 etc., it does not work either.

How do I delete a "corrupt" distributed file from the system?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

ewartpm,

If DELETE.DF doesn't work, delete the component files separately, delete the VOC entry for the master file and then remove the component entries in the &PARTFILES&.

Bonne chance,
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is no DELETE.DF (and the I-descriptor does not need to be in the VOC file - where did you read that?).

You could try reinstating the DF (verbs are VERIFY.DF and REBUILD.DF).

One problem that you may have is that each of the part files contains a block of data (the "partfile" block) indicating "I am part number N in a DF that uses partitioning algorithm blah", and these, too, need to be tidied up, unless you delete the individual hashed files at the operating system level, together with their individual VOC entries, the VOC entry for the DF, all related entries from &PARTFILES&, and the dictionary of the DF if you created that too.
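
At TCL that blow-everything-away route amounts to something like the following (a sketch only, using the names from this thread; repeat the first command for all ten part files, and the &PARTFILES& record id shown is just a placeholder - inspect that file before deleting anything):

Code: Select all

DELETE VOC FT_AKANAMESHASH01
DELETE VOC FT_AKANAMESHASH02
DELETE VOC FT_AKANAMESHASH
LIST &PARTFILES&
DELETE &PARTFILES& <related.record.id>

The hashed files themselves, and the dictionary of the DF if one was created, are then removed at the operating system level.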
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Damn - should have checked about the DELETE.DF before posting. Ray is right, it doesn't exist.

What you need to do is DEFINE.DF <distfile> REMOVING <partfile>
for all of your partfiles; when the last valid partfile is removed, the system will remove your master file definition. The partfiles can then be deleted by using DELETE.FILE.

Sorry about the error.
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Thanks Guys...

Will try the cleanup as per Ray's instructions.