Divide data in connected groups

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
Pacific007
Participant
Posts: 35
Joined: Wed Oct 06, 2010 11:24 am

Divide data in connected groups

Post by Pacific007 »

Hi All,

The input data is like this:


F007 XYZ/001 XYZ/002
F008 XYZ/002 JKL/003
F009 JKL/003 XYZ/254
F010 XYZ/002 XYZ/567
F011 XYZ/254 OPQ/111
F012 XYZ/254 XYZ/222
F001 ABC/001 BCD/002
F002 BCD/002 ABC/003
F003 ABC/003 FGH/254
F014 ABC/002 ABC/567
F015 FGH/254 ABC/111
F016 FGH/254 ABC/222

I need the output like this

Group1

F007 XYZ/001 XYZ/002
F008 XYZ/002 JKL/003
F009 JKL/003 XYZ/254
F010 XYZ/002 XYZ/567
F011 XYZ/254 OPQ/111
F012 XYZ/254 XYZ/222

Group2

F001 ABC/001 BCD/002
F002 BCD/002 ABC/003
F003 ABC/003 FGH/254
F014 ABC/002 ABC/567
F015 FGH/254 ABC/111
F016 FGH/254 ABC/222


Col 2 and col 3 are connected via link in col 1

Group 1 is connected with each other and Group 2 is connected with each other respectively

Group 1 and Group 2 are not connected via any means.


Please provide you valuable suggestions.


Regards,
Prashant
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What is the rule that determines whether an input row ends up in Group 1 or Group 2?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Pacific007
Participant
Posts: 35
Joined: Wed Oct 06, 2010 11:24 am

Post by Pacific007 »

ray.wurlod wrote:What is the rule that determines whether an input row ends up in Group 1 or Group 2? ...
The only thing which we have to concider while dividing in groups is that those all are connected. Here suppose col2 and col3 are two ends connected by wire col1, so one row represent one wire connection we have to chech is those wire ends are connected through others wires (rows) so we have to group respective network.

I hope now you are able to understand.

How to post images here otherwise I have drawn a picture and share. I will be good if i wil get at least few suggestions.
Pacific
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You post images here the same way you do on any other web board - host the image somewhere and then link to it here. There are plenty of file file hosting sites and most will automatically build the img 'image' tags you'll need.

The suggestions will come once people understand what it is you are trying to accomplish.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jgreve
Premium Member
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

Smells like graph theory...

Post by jgreve »

"the only thing we have to consider"
Sounds like you have it solved already :-)

Your example reminds me of an in-order tree traversal.
Here's the original data, reformatted a little - it was kind of noisy (I stripped zeros & the ABC XYZ parts, which may not be relevant).
Anyway, I find the pattern easier to see in the formatted version:

Code: Select all

 Original            |   Formatted
F007 XYZ/001 XYZ/002 | F7 (   1,   2 )
F008 XYZ/002 JKL/003 | F8 (   2,   3 )
F009 JKL/003 XYZ/254 | F9 (   3, 254 )
F010 XYZ/002 XYZ/567 | F10(   2, 567 )
F011 XYZ/254 OPQ/111 | F11( 254, 111 )
F012 XYZ/254 XYZ/222 | F12( 254, 222 )
F001 ABC/001 BCD/002 | F1 (   1,   2 )
F002 BCD/002 ABC/003 | F2 (   2,   3 )
F003 ABC/003 FGH/254 | F3 (   3, 254 )
F014 ABC/002 ABC/567 | F14(   2, 567 )
F015 FGH/254 ABC/111 | F15( 254, 111 )
F016 FGH/254 ABC/222 | F16( 254, 222 ) 

The numbers-pairs look like arcs in a graph; more or less a linked list.
Let's say "F7" is a root node of a tree, then linking the groups we get a structure like this:

Code: Select all

 F7(1,2)                     :
 |                           :    First Group
 +--> F8(2,3)                : F7 (   1,   2 )
       |                     : F8 (   2,   3 )
       +--> F9(3,254)        : F9 (   3, 254 )
       |  |                  : F10(   2, 567 )
       |  +--> F11(254,111)  : F11( 254, 111 )
       |  |                  : F12( 254, 222 )
       |  +--> F12(254,222)  :
       |                     :
       +--> F10(2,567)       :
                             :
 F1(1,2)                     :
 |                           :  Second Group
 +--> F2(2,3)                : F1 (    1,   2 )
       |                     : F2 (    2,   3 )
       +--> F3(3,254)        : F3 (    3, 254 )
       |  |                  : F14(    2, 567 )
       |  +--> F15(254,111)  : F15(  254, 111 )
       |  |                  : F16(  254, 222 )
       |  +--> F16(254,222)  : 
       |                     :
       +--> F14(2,567)       : 
You know, DataStage might not be my first tool of choice for solving graph problems. I wonder if this would work: it seems like something you could do in a transformer and a hash file.

It may help you to think of the record structure like so: LABEL(NODE1,NODE2)
e.g. parse F16( 254, 222 ) out like this:

Code: Select all

   LABEL=F16 (the name of the "wire" between 2 nodes)
   NODE1=254
   NODE2=222
The problem doesn't change much if you need the XYZ and ABC included,
we just treat the NODE id's as composite keys.

The hard problem is identifying which sets of nodes are connected.
That will essentially be your "grouping".


Question: are the NODE id's distinct across groups?
If not, you'll need another attribute to distinguish them
(e.g. add another field to the composite key).

I would build a hash to track your nodes, and assign each completely
new pair of nodes to a new group. I don't know how to phrase this logic
in Transformer-ese in text, so here is some pseudocode. (I suppose
you'd put the if-expressions on link conditions and do the hash-updates
down-range in other stages via different links.)

Code: Select all

stage_variable: group_cnt = 0
stage_variable: group

input record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
g1: lookup node_hash( node1 )
g2: lookup node_hash( node2 )

output record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
group: the stage var, assigned by transformer's logic below


   if( g1 == null && g2 == null ) {
      // first time for both, start a new group
      group_cnt = group_cnt + 1;
      node_hash( node1 ) = group_cnt; // store the group# for node1
      node_hash( node2 ) = group_cnt; // store the group# for node2
      group = group_cnt; // use this in the output record.
   }
   if( g1 != null && g2 != null ) {
      if( g1 != g2 ) {
         abort("ERROR: "+label+"("+node1+","+node2+") can not belong to "+g1+" AND "+g2 );
      }
      // no need to register group# (e.g. no hash write)
      // because the group# is already in the hash for these nodes.
      // And at this point we could set group=g1 or group=g2.
      group = g1;
   }
   if( g1 == null && g2 != null ) {
      // We have seen node2 before but this is the
      // first time for node1, so use node2's group#.
      group = g2;
   }
   if( g1 != null && g2 == null ) {
      // at this point, g2 must be null & g1 has a group#.
      // We have seen node1 before but this is the
      // first time for node2, so use node1's group#.
      group = g1;
   }
Post Reply