Sharing Hash File between multiple instances

Post by asitagrawal »

Hi,

In my job, I have a hashed file that is both updated and referenced within the same job.
I am running multiple instances of this job, each with a different set of input data: say the first 1000 rows go to the 1st instance, the next 1000 to the 2nd instance, and so on. I am running just 4 instances.

The hashed file counts the number of occurrences of each incoming value in the input set.

An example:

A value A1 appears 100 times in total in the input data (which is split across the 4 input sets).

At the end of the job, the count of A1's is not reliably recorded as 100 in the hashed file. On several re-runs I did get 100, but the output is not stable.

Please suggest.

Kind regards,
Asit

Post by ray.wurlod »

Are the different instances using the same keys? If so, your problem is obvious. Use different hashed files (note: it's not "hash file") or unique keys.
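
To illustrate why shared keys can give an unstable count, here is a minimal sketch in Python (not DS Basic), with the hashed file modelled as a plain dictionary. The function names and the fixed interleaving are assumptions made for illustration; this shows one plausible way an update can be lost, not a trace of what DataStage actually does.

```python
from collections import Counter

def unlocked_increment_pair(hashed_file, key):
    """Two instances each process one occurrence of `key`, badly interleaved."""
    seen_by_a = hashed_file.get(key, 0)   # instance A reads the current count
    seen_by_b = hashed_file.get(key, 0)   # instance B reads the same count
    hashed_file[key] = seen_by_a + 1      # A writes count + 1
    hashed_file[key] = seen_by_b + 1      # B overwrites it with the same value
    return hashed_file[key]

def count_separately_then_merge(inputs_per_instance):
    """Each instance counts into its own store; totals are merged afterwards."""
    total = Counter()
    for rows in inputs_per_instance:
        total.update(Counter(rows))
    return total

shared = {}
final_count = unlocked_increment_pair(shared, "A1")   # one of two updates is lost
merged = count_separately_then_merge([["A1"], ["A1"]])
print(final_count, merged["A1"])
```

With separate stores per instance (or keys that no two instances share), no instance can overwrite another's update.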

Post by kcbland »

Instead of using ranges of rows, use a small partitioning calculation, such as MOD, which will keep like values together on the same instance. For example, if you want all rows for a given "customer" to stay together on the same job instance, then pick a column that you can use in a constraint to keep them together. Since the rows will be processed by the same instance, they will be handled in the order of the input set.
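
As a rough sketch of this idea in Python (not DS Basic), the snippet below routes each row to an instance by taking MOD of a numeric key. The column name `customer_id`, the `seq` field, and the helper names are hypothetical; in a Server job the same test would sit in a constraint.

```python
def instance_for(customer_id, instance_count):
    """Instance number (0-based) that owns every row for this customer."""
    return customer_id % instance_count

def split_rows(rows, instance_count):
    """Distribute rows to instances, preserving input order within each one."""
    buckets = [[] for _ in range(instance_count)]
    for row in rows:
        buckets[instance_for(row["customer_id"], instance_count)].append(row)
    return buckets

# "seq" records the original position of each row in the input set.
rows = [{"customer_id": cid, "seq": n}
        for n, cid in enumerate([101, 102, 101, 105, 102, 101])]
buckets = split_rows(rows, 4)
for number, bucket in enumerate(buckets):
    print(number, [(r["customer_id"], r["seq"]) for r in bucket])
```

Every row for customer 101 lands on the same instance and keeps its input order there, so no two instances ever touch the same key.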

Post by asitagrawal »

ray.wurlod wrote:Are the different instances using the same keys? If so, your problem is obvious. Use different hashed files (note: it's not "hash file") or unique keys. ...
Hi Ray,
I am sorry for giving a poor, rather wrong, example.

It is like this:
Each occurrence of an input value, say A1, is given an incremental ID (based on whatever maximum has been recorded so far for A1). So in the hashed file (not hash file, sorry for that), I expect to see A1,1 to A1,100, but this does not always happen. The value and its ID together (A1,1 to A1,100) form a composite key.

Please get back to me if any more clarification is needed.


Regards,
Asit

Post by asitagrawal »

kcbland wrote:Instead of using ranges of rows, use a small partitioning calculation, such as MOD, which will keep like values together on the same instance. For example, if you want all rows for a given "customer" to stay together on the same job instance, then pick a column that you can use in a constraint to keep them together. Since the rows will be processed by the same instance, they will be handled in the order of the input set.
Hi Kenneth,

I have two considerations here:

1. Can MOD be applied to string data?
2. If my input data is not evenly distributed, say for Customer, then the spread will not be even across the multiple instances, which I think defeats the purpose of having multiple instances to achieve parallelism.

Kind Regards,
Asit

Post by kcbland »

asitagrawal wrote: 1. Can MOD be applied to string data?
Sure, just use a fancier user function. I suggest one that checks whether the value is numeric and, if so, applies MOD. If it's not numeric, strip all non-numeric characters and apply MOD to what is left. If nothing is left, take the last character of the string and MOD its ASCII value. MOD has a 15-digit limit, so adjust for that.

Here's some code where Arg1 is the partitioning value and Arg2 is the number of partitions. If you're running 4 instances, use this function with Arg2 = 4 and the result (Ans) is always 0, 1, 2, or 3.

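* Arg1 = partitioning value, Arg2 = number of partitions (job instances)
If NUM(Arg1) Then
   * Numeric value: MOD on (at most) the last 15 digits
   Ans = MOD(RIGHT(Arg1,15), Arg2)
End Else
   * Keep only the numeric characters of the value
   l_Value = OCONV(Arg1, "MCN")
   If l_Value = "" OR ISNULL(l_Value) Then
      * No digits at all: use the character code of the last character
      Ans = MOD(SEQ(RIGHT(TRIM(Arg1),1)), Arg2)
   End Else
      Ans = MOD(RIGHT(l_Value,15), Arg2)
   End
End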
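
For anyone who wants to try the logic outside DataStage, here is a loose Python port of the routine above. It is only a sketch: the name `partition_of` is hypothetical, it treats values as unsigned digit strings or plain text (it does not reproduce how NUM and MOD handle signs or decimals), and it returns 0 for an empty value as an assumption.

```python
def partition_of(value, partitions):
    """Map a partitioning value to a partition number in 0..partitions-1."""
    text = str(value).strip()
    # Keep only the ASCII digits, similar in spirit to OCONV(Arg1, "MCN").
    digits = "".join(ch for ch in text if ch in "0123456789")
    if digits:
        # Use at most the last 15 digits, mirroring RIGHT(..., 15).
        return int(digits[-15:]) % partitions
    if text:
        # No digits at all: fall back to the code of the last character.
        return ord(text[-1]) % partitions
    return 0  # assumption: an empty value goes to partition 0

for sample in ["12345", "CUST-0042", "ABC", ""]:
    print(repr(sample), partition_of(sample, 4))
```

The same value always maps to the same partition, which is the property that matters for keeping like rows on one instance.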
asitagrawal wrote: 2. If my input data is not evenly distributed, say for Customer, then the spread will not be even across the multiple instances, which I think defeats the purpose of having multiple instances to achieve parallelism.
Pick a different column that gives you a better distribution. This partitioning concept is the heart of PX theory. You pick the partitioning column that keeps like data together. The number of instances equates to the number of processing nodes. PX itself uses a similar equation for "hashing" like data together. You're just doing it yourself.

The goal is not a perfect distribution, but a balanced one.
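
One way to judge whether a candidate column is balanced enough is to count how many rows each instance would receive before running the job. The Python sketch below does that for numeric keys; the helper names and the sample data are made up for illustration, and "skew" here simply means the busiest instance's row count divided by the average.

```python
from collections import Counter

def rows_per_instance(keys, instance_count):
    """Number of rows each instance would get if routed by MOD of the key."""
    counts = Counter(key % instance_count for key in keys)
    return [counts.get(i, 0) for i in range(instance_count)]

def skew(counts):
    """Busiest instance relative to the average; 1.0 means perfectly even."""
    average = sum(counts) / len(counts)
    return max(counts) / average

even_keys = list(range(1000))                  # every key appears once
skewed_keys = list(range(300)) + [7] * 700     # one key dominates the input

even_counts = rows_per_instance(even_keys, 4)
skewed_counts = rows_per_instance(skewed_keys, 4)
print(even_counts, skew(even_counts))
print(skewed_counts, skew(skewed_counts))
```

If one instance ends up with several times the average, as in the second data set, that column is a poor partitioning choice and another one is worth trying.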