Distributed Hash Files

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Distributed Hash Files

Post by ewartpm »

Hi Guys

I have created a job that creates 10 hash files. Each hash file will be incorporated into a distributed hash file. The average number of rows in each hash file is 6 million. The hash file has 4 columns that make up the key; the constraint in the job is based on the value of only one of those key columns.

When I create the distributed hash file, I use the INTERNAL option with @ID, e.g. INTERNAL "IF @ID >= 4000000 AND @ID < 8000000 THEN 1 ELSE 2"

Seeing that @ID is the whole key of the hash file, I'm not sure this will work when doing a lookup against the distributed hash file, because the job that creates the hash files uses only a part of the hash file key. :?

How does the SYSTEM option work when creating a distributed hash file? Will this perhaps solve my problem?

Any suggestions on how to create a distributed hash file when the hash file has more than one column as the key?
WoMaWil
Participant
Posts: 482
Joined: Thu Mar 13, 2003 7:17 am
Location: Amsterdam

Post by WoMaWil »

A Distributed Hash File works like this:

you fill several hash files that have the same metadata structure.
you build a "master" file, which presents all of those hash files as one.

And yes, the keys have to be completely distinct across the files.

You can get such keys, for example, by using @INROWNUM or @OUTROWNUM and concatenating 1, 2, 3, 4 ... up to 10 to the end (a different value for each file).

Or, if you know you have at most 10^11 rows, you can add a different offset (1*10^13, 2*10^13, and so on) to each row number to get unique keys.
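
For example - and this is only a sketch - suppose each instance of the job is passed its part number (1 to 10) in a job parameter; I will call it PartNo, which is not from the original post. The key derivation in that instance's Transformer could then be either of these:

Code: Select all

* option 1: append this instance's part number to the generated row number
@OUTROWNUM : "-" : PartNo

* option 2: add a distinct offset per instance (safe while each instance writes fewer than 10^13 rows)
@OUTROWNUM + (PartNo * 10000000000000)

Either way no two instances can generate the same key, which is what the Distributed file needs.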

Kind regards
Wolfgang
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you have 10 hashed files (with identical record structure and some characteristic of the key that consistently identifies a part file) then you can construct a Distributed file. However, your partitioning algorithm only specifies a two-part Distributed file. Which do you want: two parts or ten?

Imagine for a moment that the rightmost character of @ID (which is the entire key including separator characters (@TM by default)) is a numeric character; thus its domain is the digits "0" through "9". Assuming this gives a reasonably flat spread (which you can test with queries), you can define a Distributed file on that basis. Let us further assume that the ten hashed files share the one dictionary (D_HF) and are called HF01 through HF10.

Code: Select all

DEFINE.DF DATA DistFile ADDING     _
     HF01  1                       _
     HF02  2                       _
     HF03  3                       _
     HF04  4                       _
     HF05  5                       _
     HF06  6                       _
     HF07  7                       _
     HF08  8                       _
     HF09  9                       _
     HF10 10                       _
INTERNAL "IF NUM(@ID[1]) THEN @ID[1] ELSE 0"

UPDATE VOC SET F3 = 'D_HF' WHERE F0 = 'DistFile';
Allocating keys whose rightmost character is non-numeric to the (illegal) part number 0 means that such keys cannot be written to the Distributed file at all. If you do want to handle these, or generic keys, replace the partitioning algorithm above with something like

Code: Select all

INTERNAL "MOD(SEQ(@ID[1]),10)+1"
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Not sure what you mean, Wolfgang. I understand that each hash file's metadata has to be the same.

Here's my dilemma... if I use @INROWNUM to send 1000 rows to hash file 1, the next 1000 rows to hash file 2 and so on, how do I specify this in the DEFINE.DF statement when I'm creating the distributed hash file? Can I use any column of the hash file in the INTERNAL portion of the command, or do I have to use @ID?

Also, what happens when the SYSTEM option is used instead of INTERNAL?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

SYSTEM requires a key of the form partno:delimiter:remainder. For example: 1/7603, 2/1311 and so on. The delimiter character is specified in the algorithm, for example

Code: Select all

SYSTEM '/'
for the above keys.

So, if the first column of your key can be guaranteed to contain the part numbers (1 through 10?), then you can use the SYSTEM partitioning algorithm. If not, you can't.
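
To make that concrete - a sketch only, with DistFile2 as a made-up name and just two parts reusing the file names from my earlier example - a SYSTEM-based definition would look like:

Code: Select all

DEFINE.DF DATA DistFile2 ADDING   _
     HF01  1                      _
     HF02  2                      _
SYSTEM '/'

A record with key 1/7603 would then be stored in HF01, and one with key 2/1311 in HF02; the part number is simply whatever precedes the first "/" in the key.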

Incidentally, you can't define a Distributed file unless you have write permission both to VOC and to &PARTFILES&.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

And to answer your last open question - the algorithm that returns the part number is an I-type, that is, a compiled dictionary expression; it can reference any part of the record, but is usually restricted to the key column.
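
For example (a hypothetical dictionary item, not taken from ewartpm's job), an I-descriptor that derives a part number from the fourth @TM-separated piece of the key could be entered with ED roughly like this:

Code: Select all

ED DICT HF01 PARTNO
01 I
02 IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE 2
03
04 Part
05 3R
06 S

LIST HF01 PARTNO would then show which part each existing row maps to.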
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

1. The job creating the hash files is a multi-instance job.
2. Each instance of the job is passed different parameters for a sql select statement (where akakey >= startkeyvalue and akakey < endkeyvalue).
3. The hash file contains 4 key columns, akakey being the last of the 4.
4. Each key column is defined as a BigInt (19)
5. 10 hash files HF01 --> HF10 are created.

I tried this but got an error about the algorithm not matching:

INTERNAL "IF AKAKEY >= 4000000 AND AKAKEY < 8000000 THEN 1 ELSE IF AKAKEY >= 8000000 etc

Also tried this and got the same error
INTERNAL "IF @ID[59,19] < 4000000 AND @ID[59,19] < 8000000 THEN 1 ELSE etc

:oops: I seem to be missing the boat here. I did a LIST DICT HF01 and akakey is an I column, so why can't I use it?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

ewartpm -> The I-type needs to be in your VOC file; do a "COPY FROM DICT <PartfileWithItype> TO VOC akakey" and then run your "DEFINE.DF ..... INTERNAL akakey"
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Me again.

I tried all the options but still get the error about the partitioning algorithm for the distributed file not matching when I run the DEFINE.DF command.

I also tried:

INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE etc -- no joy

What is the correct procedure for creating a distributed hash file? Must I create empty hash files first, add them to the distributed file and then populate the hash files,
OR
create and populate the hash files and then add them to the distributed file? (I have tried the latter option.)
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

ewartpm,

did you copy the I-descriptor from your file to the VOC?

I did the following test:

1. CREATE.FILE AW1 DYNAMIC
2. CREATE.FILE AW2 DYNAMIC
3. ED DICT AW1 dfkey
01 I
02 IF @ID[1,1]='1' THEN 1 ELSE 2
...
05 3R
4. LIST AW1 dfkey {to force a compile of the I-descriptor}
5. COPY FROM DICT AW1 TO VOC dfkey
6. DEFINE.DF MYDF AW1 1 AW2 2 INTERNAL dfkey
7. {insert a couple of records in MYDF}
8. LIST AW1 / LIST AW2 to see if the algorithm is working.


All of this worked without a hitch, apart from my speeling [sic.] errors :)
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Hi Guys

Thanks for all the replies. What seems to have been the problem is the name of the distributed file, i.e. FT_AKANAMESHASH.

When I changed the name to FT_AKANAMES_HASH and used INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE etc it worked. I can view data in the distributed file, so all is almost well.
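
For the record, the definition that worked was along these lines (trimmed here to the first two part files and ranges; the real command lists all ten part files and extends the IF with one range per part):

Code: Select all

DEFINE.DF DATA FT_AKANAMES_HASH ADDING        _
     FT_AKANAMESHASH01  1                     _
     FT_AKANAMESHASH02  2                     _
INTERNAL "IF FIELD(@ID,@TM,4) < 4000000 THEN 1 ELSE 2"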

When I try to do a DELETE.FILE FT_AKANAMESHASH, it does not work.
When I try DEFINE.DF FT_AKANAMESHASH REMOVING FT_AKANAMESHASH01 etc., it does not work either.

How do I delete a "corrupt" distributed file from the system?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

ewartpm,

If DELETE.DF doesn't work, delete the component files separately, delete the VOC entry for the master file and then remove the component entries in the &PARTFILES&.

Bonne chance,
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There is no DELETE.DF (and the I-descriptor does not need to be in the VOC file - where did you read that?).

You could try reinstating the DF (verbs are VERIFY.DF and REBUILD.DF).

One problem that you may have is that each of the part files contains a block of data (the "partfile" block) indicating "I am part number N in a DF that uses partitioning algorithm blah", and these, too, need to be tidied up, unless you delete the individual hashed files at the operating system level, together with their individual VOC entries, the VOC entry for the DF, all related entries from &PARTFILES&, and the dictionary of the DF if you created that too.
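
At TCL that blow-everything-away route amounts to something like the following (a sketch only, using the names from this thread; repeat the first command for all ten part files, and the &PARTFILES& record id shown is just a placeholder - inspect that file before deleting anything):

Code: Select all

DELETE VOC FT_AKANAMESHASH01
DELETE VOC FT_AKANAMESHASH02
DELETE VOC FT_AKANAMESHASH
LIST &PARTFILES&
DELETE &PARTFILES& <related.record.id>

The hashed files themselves, and the dictionary of the DF if one was created, are then removed at the operating system level.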
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Damn - should have checked about the DELETE.DF before posting. Ray is right, it doesn't exist.

What you need to do is DEFINE.DF <distfile> REMOVING <partfile>
for all of your partfiles; when the last valid partfile is removed, the system will remove your master file definition. The partfiles can then be deleted by using DELETE.FILE.

Sorry about the error.
ewartpm
Participant
Posts: 97
Joined: Wed Jun 25, 2003 2:15 am
Location: South Africa

Post by ewartpm »

Thanks Guys...

Will try the cleanup as per Ray's instructions.