Sharing Hash File between multiple instances

asitagrawal · Post by **asitagrawal** » Mon Jan 15, 2007 3:44 pm

Hi,

In my job, I have a hashed file that is both updated and referenced in the same job.
I am running multiple instances of this job, each with a different set of input data: say the first 1000 rows go into the 1st instance, the next 1000 into the 2nd instance, and so on. I am running just 4 instances.

The hashed file counts the number of occurrences of each incoming value in the input set.

An example:

A value A1 appears 100 times in the input data (which is split across the 4 input sets).

At the end of the job, the count of A1 is not recorded as 100 in the hashed file. On several re-runs I did get 100, but the output is not stable.

Please suggest.

Kind regards,
Asit

ray.wurlod · Post by **ray.wurlod** » Mon Jan 15, 2007 3:56 pm

Are the different instances using the same keys? If so, your problem is obvious. Use different hashed files (note: it's not "hash file") or unique keys.
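
To make the "same keys" problem concrete, here is a minimal sketch in Python (not DataStage code). A plain dict stands in for the shared hashed file, and the sketch assumes nothing locks the record between the lookup and the write. The names `read_count` and `write_count` are hypothetical helpers for illustration only.

```python
# Sketch only: a plain dict stands in for the shared hashed file.
def read_count(hashed_file, key):
    """Look up the current count for a key (0 if the key is not there yet)."""
    return hashed_file.get(key, 0)

def write_count(hashed_file, key, count):
    """Write the count back, replacing whatever is stored under the key."""
    hashed_file[key] = count

hashed_file = {}

# Both instances look up the key before either one has written back.
seen_by_instance_1 = read_count(hashed_file, "A1")
seen_by_instance_2 = read_count(hashed_file, "A1")

# Each instance adds 1 to the value it saw and writes the result.
write_count(hashed_file, "A1", seen_by_instance_1 + 1)
write_count(hashed_file, "A1", seen_by_instance_2 + 1)  # overwrites the first update

final_count = hashed_file["A1"]
print(final_count)  # two rows were processed, but only one update survives
```

Whether the updates collide on a given run depends on timing, which would explain why the count is sometimes 100 and sometimes not.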

kcbland · Post by **kcbland** » Mon Jan 15, 2007 4:07 pm

Instead of using ranges of rows, use a small partitioning calculation, such as MOD, which will keep like values together on the same instance. For example, if you want all rows for a given "customer" to stay together on the same job instance, then pick a column that you can use in a constraint to keep them together. Since the rows will be processed by the same instance, they will be handled in the order of the input set.
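
A rough sketch of the idea in Python (not DataStage code), assuming a numeric `customer_id` column and four instances numbered 0 to 3. The sample rows and the helper name `instance_for` are made up for illustration. In the job itself, the equivalent would presumably be a constraint comparing the MOD result with the instance's own number.

```python
NUM_INSTANCES = 4  # assumed, matching the four instances in the question

def instance_for(customer_id, num_instances=NUM_INSTANCES):
    """Return the 0-based instance number that should process this row."""
    return customer_id % num_instances

# Made-up sample rows: (customer_id, payload)
rows = [(101, "a"), (102, "b"), (101, "c"), (104, "d"), (102, "e")]

by_instance = {i: [] for i in range(NUM_INSTANCES)}
for customer_id, payload in rows:
    by_instance[instance_for(customer_id)].append((customer_id, payload))

for instance, assigned in by_instance.items():
    print(instance, assigned)
```

Every row for a given customer lands in the same list, in input order, so no two instances ever touch the same key.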

asitagrawal · Post by **asitagrawal** » Mon Jan 15, 2007 4:13 pm

ray.wurlod wrote:Are the different instances using the same keys? If so, your problem is obvious. Use different hashed files (note: it's not "hash file") or unique keys. ...

Hi Ray,
I am sorry for giving a poor, in fact wrong, example.

It's like this:
Each occurrence of an input value, say A1, is given an incremental ID (based on the max recorded so far for A1). The value and the ID together, A1,1 to A1,100, form a composite key, so in the hashed file (not hash file, sorry for that) I expect to see A1,1 to A1,100, but this does not always happen.

Please let me know if you need any more clarification.

Regards,
Asit

asitagrawal · Post by **asitagrawal** » Mon Jan 15, 2007 4:16 pm

kcbland wrote:Instead of using ranges of rows, use a small partitioning calculation, such as MOD, which will keep like values together on the same instance. For example, if you want all rows for a given "customer" to stay together on the same job instance, then pick a column that you can use in a constraint to keep them together. Since the rows will be processed by the same instance, they will be handled in the order of the input set.

Hi Kenneth,

I have two considerations here:

1. Can MOD be applied to String data??
2. If my input data is not evenly distributed, say by Customer, then the spread will not be even across the multiple instances, which I think defeats the purpose of having multiple instances to achieve parallelism.

Kind Regards,
Asit

kcbland · Post by **kcbland** » Mon Jan 15, 2007 5:10 pm

asitagrawal wrote: 1. Can MOD be applied to String data??

Sure, just use a fancier user function. I suggest one that checks to see if the value is numeric and then applies MOD. If it's not numeric, remove all non-numeric characters and apply MOD to what is left. If nothing is left, take the last character of the string and MOD the ASCII value of that character. MOD has a 15 digit limit, so adjust for that.

Here's some code where Arg1 is the partitioning value and Arg2 is the number of partitions. If you're running 4 instances, use this function with Arg2 = 4 and the result (Ans) is always 0, 1, 2, or 3.

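Code:

* Arg1 = partitioning value, Arg2 = number of partitions
If NUM(Arg1) Then
   * Numeric: MOD the rightmost 15 digits
   Ans = MOD(RIGHT(Arg1,15), Arg2)
End Else
   * Not numeric: keep only the digits
   l_Value = OCONV(Arg1, "MCN")
   If l_Value = "" OR ISNULL(l_Value) Then
      * No digits at all: MOD the ASCII code of the last character
      Ans = MOD(SEQ(RIGHT(TRIM(Arg1),1)), Arg2)
   End Else
      Ans = MOD(RIGHT(l_Value,15), Arg2)
   End
End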
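
For anyone who wants to try the logic outside DataStage, here is an approximate Python rendering of the three cases with a few worked values. It is a sketch, not a faithful port: it treats only unsigned digit strings as "numeric" (so signs and decimals are handled differently from NUM), and sending empty input to partition 0 is an assumption of this sketch. The name `partition_number` is hypothetical.

```python
DIGITS = "0123456789"

def partition_number(value, num_partitions):
    """Map a partitioning value to a partition in 0..num_partitions-1."""
    text = str(value).strip()
    if not text:
        return 0  # assumption of this sketch: empty input goes to partition 0
    if all(ch in DIGITS for ch in text):
        # Numeric: use the rightmost 15 digits
        return int(text[-15:]) % num_partitions
    digits = "".join(ch for ch in text if ch in DIGITS)
    if digits:
        # Mixed: keep only the digits, then use the rightmost 15
        return int(digits[-15:]) % num_partitions
    # No digits at all: use the character code of the last character
    return ord(text[-1]) % num_partitions

for sample in ["12345", "CUST-0007", "AB12", "ABC"]:
    print(sample, partition_number(sample, 4))
```

With 4 partitions, "12345" goes to 1, "CUST-0007" to 3 (from 7), "AB12" to 0 (from 12), and "ABC" to 3 (the code for "C" is 67).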

asitagrawal wrote: 2. If my input data is not evenly distributed, say by Customer, then the spread will not be even across the multiple instances, which I think defeats the purpose of having multiple instances to achieve parallelism.

Pick a better column that gives you a better distribution. This partitioning concept is the heart of PX theory. You pick the partitioning column that keeps like data together. The number of instances equates to the number of processing nodes. PX itself uses a similar equation for "hashing" like data together. You're just doing it yourself.

The goal is not a perfect distribution, but a balanced one.
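
One way to compare candidate columns before committing to one is to count how many rows each partition would receive. A minimal Python sketch follows (not DataStage code); both sample columns are made-up data, and `rows_per_partition` is a hypothetical helper. Any candidate still has to keep together the rows that share a hashed file key.

```python
from collections import Counter

def rows_per_partition(keys, num_partitions):
    """Count how many rows each partition would receive under key MOD n."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Made-up data: one value dominates the first column, the second is spread out.
skewed_column_values = [100] * 7 + [101, 102, 103]
alternative_column_values = list(range(1, 11))

skewed_spread = rows_per_partition(skewed_column_values, 4)
alternative_spread = rows_per_partition(alternative_column_values, 4)

print(skewed_spread)
print(alternative_spread)
```

Here the first column sends 7 of 10 rows to partition 0, while the second gives 2, 3, 3 and 2 rows: not perfect, but balanced.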