Divide data in connected groups

Pacific007 · Post by **Pacific007** » Tue Apr 12, 2011 11:08 am

Hi All,

The input data is like this:

F007 XYZ/001 XYZ/002
F008 XYZ/002 JKL/003
F009 JKL/003 XYZ/254
F010 XYZ/002 XYZ/567
F011 XYZ/254 OPQ/111
F012 XYZ/254 XYZ/222
F001 ABC/001 BCD/002
F002 BCD/002 ABC/003
F003 ABC/003 FGH/254
F014 ABC/002 ABC/567
F015 FGH/254 ABC/111
F016 FGH/254 ABC/222

I need the output like this

Group1

F007 XYZ/001 XYZ/002
F008 XYZ/002 JKL/003
F009 JKL/003 XYZ/254
F010 XYZ/002 XYZ/567
F011 XYZ/254 OPQ/111
F012 XYZ/254 XYZ/222

Group2

F001 ABC/001 BCD/002
F002 BCD/002 ABC/003
F003 ABC/003 FGH/254
F014 ABC/002 ABC/567
F015 FGH/254 ABC/111
F016 FGH/254 ABC/222

Col 2 and col 3 are connected via link in col 1

Group 1 is connected with each other and Group 2 is connected with each other respectively

Group 1 and Group 2 are not connected via any means.

Please provide you valuable suggestions.

Regards,
Prashant

ray.wurlod · Post by **ray.wurlod** » Tue Apr 12, 2011 4:37 pm

What is the rule that determines whether an input row ends up in Group 1 or Group 2?

Pacific007 · Post by **Pacific007** » Wed Apr 13, 2011 9:01 am

ray.wurlod wrote:What is the rule that determines whether an input row ends up in Group 1 or Group 2? ...

The only thing which we have to concider while dividing in groups is that those all are connected. Here suppose col2 and col3 are two ends connected by wire col1, so one row represent one wire connection we have to chech is those wire ends are connected through others wires (rows) so we have to group respective network.

I hope now you are able to understand.

How to post images here otherwise I have drawn a picture and share. I will be good if i wil get at least few suggestions.

chulett · Post by **chulett** » Wed Apr 13, 2011 9:21 am

You post images here the same way you do on any other web board - host the image somewhere and then link to it here. There are plenty of file file hosting sites and most will automatically build the img 'image' tags you'll need.

The suggestions will come once people understand what it is you are trying to accomplish.

jgreve · Post by **jgreve** » Wed Apr 13, 2011 12:32 pm

"the only thing we have to consider"
Sounds like you have it solved already

Your example reminds me of an in-order tree traversal.
Here's the original data, reformatted a little - it was kind of noisy (I stripped zeros & the ABC XYZ parts, which may not be relevant).
Anyway, I find the pattern easier to see in the formatted version:

Code: Select all

 Original            |   Formatted
F007 XYZ/001 XYZ/002 | F7 (   1,   2 )
F008 XYZ/002 JKL/003 | F8 (   2,   3 )
F009 JKL/003 XYZ/254 | F9 (   3, 254 )
F010 XYZ/002 XYZ/567 | F10(   2, 567 )
F011 XYZ/254 OPQ/111 | F11( 254, 111 )
F012 XYZ/254 XYZ/222 | F12( 254, 222 )
F001 ABC/001 BCD/002 | F1 (   1,   2 )
F002 BCD/002 ABC/003 | F2 (   2,   3 )
F003 ABC/003 FGH/254 | F3 (   3, 254 )
F014 ABC/002 ABC/567 | F14(   2, 567 )
F015 FGH/254 ABC/111 | F15( 254, 111 )
F016 FGH/254 ABC/222 | F16( 254, 222 )

The numbers-pairs look like arcs in a graph; more or less a linked list.
Let's say "F7" is a root node of a tree, then linking the groups we get a structure like this:

Code: Select all

 F7(1,2)                     :
 |                           :    First Group
 +--> F8(2,3)                : F7 (   1,   2 )
       |                     : F8 (   2,   3 )
       +--> F9(3,254)        : F9 (   3, 254 )
       |  |                  : F10(   2, 567 )
       |  +--> F11(254,111)  : F11( 254, 111 )
       |  |                  : F12( 254, 222 )
       |  +--> F12(254,222)  :
       |                     :
       +--> F10(2,567)       :
                             :
 F1(1,2)                     :
 |                           :  Second Group
 +--> F2(2,3)                : F1 (    1,   2 )
       |                     : F2 (    2,   3 )
       +--> F3(3,254)        : F3 (    3, 254 )
       |  |                  : F14(    2, 567 )
       |  +--> F15(254,111)  : F15(  254, 111 )
       |  |                  : F16(  254, 222 )
       |  +--> F16(254,222)  : 
       |                     :
       +--> F14(2,567)       :

You know, DataStage might not be my first tool of choice for solving graph problems. I wonder if this would work: it seems like something you could do in a transformer and a hash file.

It may help you to think of the record structure like so: LABEL(NODE1,NODE2)
e.g. parse F16( 254, 222 ) out like this:

Code: Select all

   LABEL=F16 (the name of the "wire" between 2 nodes)
   NODE1=254
   NODE2=222

The problem doesn't change much if you need the XYZ and ABC included,
we just treat the NODE id's as composite keys.

The hard problem is identifying which sets of nodes are connected.
That will essentially be your "grouping".

Question: are the NODE id's distinct across groups?
If not, you'll need another attribute to distinguish them
(e.g. add another field to the composite key).

I would build a hash to track your nodes, and assign each completely
new pair of nodes to a new group. I don't know how to phrase this logic
in Transformer-ese in text, so here is some pseudocode. (I suppose
you'd put the if-expressions on link conditions and do the hash-updates
down-range in other stages via different links.)

Code: Select all

stage_variable: group_cnt = 0
stage_variable: group

input record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
g1: lookup node_hash( node1 )
g2: lookup node_hash( node2 )

output record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
group: the stage var, assigned by transformer's logic below


   if( g1 == null && g2 == null ) {
      // first time for both, start a new group
      group_cnt = group_cnt + 1;
      node_hash( node1 ) = group_cnt; // store the group# for node1
      node_hash( node2 ) = group_cnt; // store the group# for node2
      group = group_cnt; // use this in the output record.
   }
   if( g1 != null && g2 != null ) {
      if( g1 != g2 ) {
         abort("ERROR: "+label+"("+node1+","+node2+") can not belong to "+g1+" AND "+g2 );
      }
      // no need to register group# (e.g. no hash write)
      // because the group# is already in the hash for these nodes.
      // And at this point we could set group=g1 or group=g2.
      group = g1;
   }
   if( g1 == null && g2 != null ) {
      // We have seen node2 before but this is the
      // first time for node1, so use node2's group#.
      group = g2;
   }
   if( g1 != null && g2 == null ) {
      // at this point, g2 must be null & g1 has a group#.
      // We have seen node1 before but this is the
      // first time for node2, so use node1's group#.
      group = g1;
   }

DSXchange

Divide data in connected groups

Divide data in connected groups

Smells like graph theory...