Identifying different groups based on column valeus

SwathiCh · Post by **SwathiCh** » Thu Oct 21, 2010 3:05 pm

Hi All,

I have a requirement to identify groups which are having different values in a specific column,

suppose if we have data like as below
G1 M1 ST
G1 M2 PB
G1 M3 PD
G2 M4 P1
G2 M5 P2
G2 M6 P3

Then I need to identify group G1 since It is having the values ST and PB in column 3.

In the same way I have given different values for Col3, if Col3 contain any of those values then I need to filter out that group.

Is there any good idea to implement this logic in Parallel extender?

ray.wurlod · Post by **ray.wurlod** » Thu Oct 21, 2010 4:57 pm

Your explanation is incomplete. G1 has three distinct values in Col3, as does G2. Am I understanding that correctly?

SwathiCh · Post by **SwathiCh** » Thu Oct 21, 2010 5:16 pm

Ray,

Even though three different values are there in COL3 for same group, I need to filter out that group If I see values ST and PB in col3.

ray.wurlod · Post by **ray.wurlod** » Thu Oct 21, 2010 7:35 pm

What's your expected output in the above example?

SwathiCh · Post by **SwathiCh** » Thu Oct 21, 2010 7:52 pm

I need to get G1 out of that...

chulett · Post by **chulett** » Thu Oct 21, 2010 8:20 pm

Meaning... if Col3 has either 'ST' or 'PB' in it for any record in any particular 'Group', that entire group should be filtered out?

SwathiCh · Post by **SwathiCh** » Thu Oct 21, 2010 8:42 pm

Yes chulett,

I am trying with Stage variables by comparing previous and current records. But not getting the results as expected, so looking for any other better solution.

Thanks,

chulett · Post by **chulett** » Thu Oct 21, 2010 8:47 pm

Not sure it's possible with just just stage variables as you'll basically need to check the data twice - once to see if you find those values in a group and then a second time to pass the group only if the first check didn't find anything.

I've done this with a hashed file and a Server job, that is pretty straight-foward. Not quite sure of a PX approach, although I'm sure others will have ideas. The words 'fork' and 'join' come to mind.

SwathiCh · Post by **SwathiCh** » Thu Oct 21, 2010 9:23 pm

Thanks chulett,

How we can implement in Server Job? If logic works then I can manage to use SERVER job only once for calculating this logic.

chulett · Post by **chulett** » Thu Oct 21, 2010 11:12 pm

You create a reference hashed file with a key for your 'group' column and a data column that is just an indicator - yes/no, true/false, whatever - to show if that group includes the target values. Do a lookup for each record and then write the result back to the same uncached hashed so it stays in sync.

Set the indicator to true when it succeeds and the incoming value is one of your target values. If the lookup suceeds and the incoming value is not one of the target values, do not update the hashed file. On the first lookup miss for each group, set the data value to true/false accordingly. When you are done you will have a hashed file which records which groups should pass through (flag is false) or should be filtered out (flag is true).

Then a second job (or another section in the same job after a process break) would check each record against the hashed file and constrain accordingly.

I'm sure an equivalent(ish) fork join PX design could be made to work as well. Set a flag to zero or one, aggregator max on that per group, fork join your original data stream and constrain out the ones. Something like that.