Identifying different groups based on column valeus

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
SwathiCh
Premium Member
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Identifying different groups based on column valeus

Post by SwathiCh »

Hi All,

I have a requirement to identify groups which are having different values in a specific column,

suppose if we have data like as below
G1 M1 ST
G1 M2 PB
G1 M3 PD
G2 M4 P1
G2 M5 P2
G2 M6 P3

Then I need to identify group G1 since It is having the values ST and PB in column 3.

In the same way I have given different values for Col3, if Col3 contain any of those values then I need to filter out that group.

Is there any good idea to implement this logic in Parallel extender?
--
Swathi Ch
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Your explanation is incomplete. G1 has three distinct values in Col3, as does G2. Am I understanding that correctly?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SwathiCh
Premium Member
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

Ray,

Even though three different values are there in COL3 for same group, I need to filter out that group If I see values ST and PB in col3.
--
Swathi Ch
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What's your expected output in the above example?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SwathiCh
Premium Member
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

I need to get G1 out of that...
--
Swathi Ch
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Meaning... if Col3 has either 'ST' or 'PB' in it for any record in any particular 'Group', that entire group should be filtered out?
-craig

"You can never have too many knives" -- Logan Nine Fingers
SwathiCh
Premium Member
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

Yes chulett,

I am trying with Stage variables by comparing previous and current records. But not getting the results as expected, so looking for any other better solution.

Thanks,
--
Swathi Ch
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not sure it's possible with just just stage variables as you'll basically need to check the data twice - once to see if you find those values in a group and then a second time to pass the group only if the first check didn't find anything.

I've done this with a hashed file and a Server job, that is pretty straight-foward. Not quite sure of a PX approach, although I'm sure others will have ideas. The words 'fork' and 'join' come to mind. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
SwathiCh
Premium Member
Premium Member
Posts: 64
Joined: Mon Feb 08, 2010 7:17 pm

Post by SwathiCh »

Thanks chulett,

How we can implement in Server Job? If logic works then I can manage to use SERVER job only once for calculating this logic.
--
Swathi Ch
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You create a reference hashed file with a key for your 'group' column and a data column that is just an indicator - yes/no, true/false, whatever - to show if that group includes the target values. Do a lookup for each record and then write the result back to the same uncached hashed so it stays in sync.

Set the indicator to true when it succeeds and the incoming value is one of your target values. If the lookup suceeds and the incoming value is not one of the target values, do not update the hashed file. On the first lookup miss for each group, set the data value to true/false accordingly. When you are done you will have a hashed file which records which groups should pass through (flag is false) or should be filtered out (flag is true).

Then a second job (or another section in the same job after a process break) would check each record against the hashed file and constrain accordingly.

I'm sure an equivalent(ish) fork join PX design could be made to work as well. Set a flag to zero or one, aggregator max on that per group, fork join your original data stream and constrain out the ones. Something like that. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply