Finding Similar records & assigning Groups

DSFreddie · Post by **DSFreddie** » Sun Nov 06, 2011 10:55 pm

Thanks Ray for your reply. Yes I am appending "G".

Here is the output:
-------------------------
100 ABC 1 G1
100 ABC 0 G1
200 CDE 1 G1
200 CDE 0 G1
300 DEF 1 G2
300 DEF 0 G2
400 GHI 1 G3
400 DGI 0 G3
500 XYZ 1 G4
500 XYZ 0 G4

Here the records with key values 100 and 200 have ended up in the same group (G1), which is incorrect. The group number repeats for key values that are not similar.

Could this be caused by incorrect usage of the stage variable? (The derivation given in the stage variable is: if not(DSLink19.keyChange) then StageVar else StageVar+1)

Thanks.
Freddie

chulett · Post by **chulett** » Mon Nov 07, 2011 12:55 am

That could be a partitioning problem. If you are running on more than one node, a simple experiment should answer that question - run the job on a single node. Is the result correct then?

ray.wurlod · Post by **ray.wurlod** » Mon Nov 07, 2011 2:49 am

And if that works, partition on the key field. Modulus is a good choice for the partitioning algorithm, since you have an integer key.
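As a rough illustration in Python (not DataStage code), modulus partitioning can be thought of as taking the integer key modulo the number of partitions, so every row with the same key lands in the same partition. The function name below is hypothetical, and the sketch assumes partitions are numbered from 0.

```python
def modulus_partition(key: int, num_partitions: int) -> int:
    """Hypothetical helper: partition index = key mod number of partitions."""
    return key % num_partitions


keys = [100, 200, 300, 400, 500]

for nodes in (2, 3):
    # Map each key to the partition it would be sent to.
    assignment = {key: modulus_partition(key, nodes) for key in keys}
    print(nodes, "partitions:", assignment)

# Note: with 2 partitions every sample key is even, so all of them
# end up in partition 0 and are processed by a single node.
```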

suse_dk · Post by **suse_dk** » Mon Nov 07, 2011 3:11 am

I guess it could be called a partitioning problem... remember that if you are running your job on multiple nodes then you'll basically be performing the same calculation on each node... that is, starting from 1 and adding one each time you encounter a new key.

So, you'll need to include the partition # as well, for instance append it in front of the 'G' ....or run the job sequentially...
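A minimal Python sketch of the partition-number idea, assuming the data is already key-partitioned and sorted within each partition. The function name and the "0G1" label format are made up for illustration; in a parallel Transformer the partition number would come from the @PARTITIONNUM system variable.

```python
from itertools import groupby


def group_ids_with_partition(partitions):
    """Hypothetical helper: one counter per partition, prefixed with the partition number."""
    result = {}
    for part_no, keys in enumerate(partitions):
        counter = 0  # every partition starts counting from scratch
        for key, _rows in groupby(keys):
            counter += 1
            # Partition number in front of the 'G' keeps labels unique across nodes.
            result[key] = f"{part_no}G{counter}"
    return result


partitions = [
    [100, 100, 300, 300, 400, 400, 500, 500],  # partition 0
    [200, 200],                                # partition 1
]
print(group_ids_with_partition(partitions))
```

The labels are unique across partitions, but they are no longer one consecutive G1, G2, G3... sequence.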

chulett · Post by **chulett** » Mon Nov 07, 2011 7:52 am

At the moment, I don't see any other explanation for why it seemingly works for all group changes except from the first to the second.

And I don't believe you'll need to worry about the partition number, just partitioning things properly by the group keys should do the trick.

suse_dk · Post by **suse_dk** » Tue Nov 08, 2011 5:16 am

Let's just assume that you are running on a 2-node config file and you have a key-based partition on the "join" column.

Let's say that you'll have the keys 100, 300, 400 and 500 running on one node/partition and the key 200 on the second node...

In this case the group numbers generated based on the boolean values would be:

Node 1:
Key Group
100: 1
300: 1+1 = 2
400: 1+2 = 3
500: 1+3 = 4

Node 2:
Key Group
200: 1

To confirm this you could look at the distribution in the monitor... and also, you should be able to get the correct result when running sequentially.
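The two-node scenario above can be simulated outside DataStage. This Python sketch assumes the key distribution described in this post and mimics the "if not keyChange then StageVar else StageVar+1" derivation with a counter that starts at 0 on every node; all function names are hypothetical.

```python
def with_key_change(keys):
    """Yield (key, key_change), where key_change is True on the first row of each key."""
    previous = object()  # sentinel that never equals a real key
    for key in keys:
        yield key, key != previous
        previous = key


def assign_groups(keys):
    """Mimic the stage variable: unchanged if not key_change, otherwise +1."""
    counter = 0
    out = []
    for key, key_change in with_key_change(keys):
        counter = counter + 1 if key_change else counter
        out.append((key, f"G{counter}"))
    return out


node1 = [100, 100, 300, 300, 400, 400, 500, 500]
node2 = [200, 200]

# Each node runs its own counter, then the results are collected.
parallel = sorted(assign_groups(node1) + assign_groups(node2))
# A single node sees every key change in one stream.
sequential = assign_groups(sorted(node1 + node2))

print(parallel)    # keys 100 and 200 both get G1
print(sequential)  # keys get G1 through G5
```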

Arun Reddy · Post by **Arun Reddy** » Tue Nov 08, 2011 9:48 am

Hi DSFreddie,

1. Funnel stage
2. Sort on the key
3. In a Transformer, take 2 stage variables, stg1 and stg2
4. stg1 has initial value 1; stg2 takes the key column which you want to group on (e.g. the column holding 100, 200)
5. Finally, write this condition in stg1: if key column = stg2 then stg1+1 else 1

You will get group-wise numbers...
100 1
100 2
200 1
200 2
200 3
300 1
300 2....so on ...
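The steps above can be sketched in Python as follows. This is only an illustration of the stage-variable logic, assuming the rows are already sorted by key and that stg1 is evaluated before stg2 on every row; the function name is hypothetical. Note that it numbers the rows within each key rather than producing one ID per group.

```python
def number_within_group(keys):
    """Hypothetical helper: restart the counter at 1 whenever the key changes."""
    stg1 = 1      # running number within the current key
    stg2 = None   # key seen on the previous row (None before the first row)
    out = []
    for key in keys:
        # stg1 is derived first, while stg2 still holds the previous key.
        stg1 = stg1 + 1 if key == stg2 else 1
        stg2 = key
        out.append((key, stg1))
    return out


rows = [100, 100, 200, 200, 200, 300, 300]
for key, number in number_within_group(rows):
    print(key, number)
```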

DSFreddie · Post by **DSFreddie** » Wed Nov 09, 2011 3:12 pm

Thanks a lot, all, for all the inputs. It worked fine when I selected the correct partitioning.

Regards,
Freddie

suse_dk · Post by **suse_dk** » Thu Nov 10, 2011 1:46 am

And that would be...modulus or?

sanjayS · Post by **sanjayS** » Thu Nov 10, 2011 9:13 am

Hi,

In the Transformer, use 3 stage variables sv1, sv2, sv3 with datatype integer

sv1 : (Input column which needs to be grouped)

sv2 : If (sv1=sv3) Then sv2 Else (sv2+1)

sv3 : sv1

Map stage variable sv2 to the output column GroupId ---> 'G':sv2

In the Transformer, select sequential mode on the Advanced tab
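For anyone who wants to check the logic outside DataStage, here is a small Python sketch of the three stage variables, evaluated in the order sv1, sv2, sv3 on each row. It assumes the input is sorted by key and uses None as the starting value of sv3 so the first row always opens group 1; the function name is hypothetical.

```python
def assign_group_ids(keys):
    """Hypothetical helper mirroring sv1/sv2/sv3 on a single (sequential) stream."""
    sv2 = 0      # current group number
    sv3 = None   # key from the previous row
    out = []
    for key in keys:
        sv1 = key                              # sv1: the input key column
        sv2 = sv2 if sv1 == sv3 else sv2 + 1   # sv2: bump only when the key changes
        sv3 = sv1                              # sv3: remember the key for the next row
        out.append((key, "G" + str(sv2)))      # GroupId = 'G':sv2
    return out


rows = [100, 100, 200, 200, 300, 300]
for key, group_id in assign_group_ids(rows):
    print(key, group_id)
```

The order matters: sv2 has to be derived while sv3 still holds the previous row's key.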

Thanks,
Sanjay.

DSXchange
