"the only thing we have to consider"
Sounds like you have it solved already
Your example reminds me of an in-order tree traversal.
Here's the original data, reformatted a little - it was kind of noisy (I stripped zeros & the ABC XYZ parts, which may not be relevant).
Anyway, I find the pattern easier to see in the formatted version:
Code: Select all
Original | Formatted
F007 XYZ/001 XYZ/002 | F7 ( 1, 2 )
F008 XYZ/002 JKL/003 | F8 ( 2, 3 )
F009 JKL/003 XYZ/254 | F9 ( 3, 254 )
F010 XYZ/002 XYZ/567 | F10( 2, 567 )
F011 XYZ/254 OPQ/111 | F11( 254, 111 )
F012 XYZ/254 XYZ/222 | F12( 254, 222 )
F001 ABC/001 BCD/002 | F1 ( 1, 2 )
F002 BCD/002 ABC/003 | F2 ( 2, 3 )
F003 ABC/003 FGH/254 | F3 ( 3, 254 )
F014 ABC/002 ABC/567 | F14( 2, 567 )
F015 FGH/254 ABC/111 | F15( 254, 111 )
F016 FGH/254 ABC/222 | F16( 254, 222 )
The numbers-pairs look like arcs in a graph; more or less a linked list.
Let's say "F7" is a root node of a tree, then linking the groups we get a structure like this:
Code: Select all
F7(1,2) :
| : First Group
+--> F8(2,3) : F7 ( 1, 2 )
| : F8 ( 2, 3 )
+--> F9(3,254) : F9 ( 3, 254 )
| | : F10( 2, 567 )
| +--> F11(254,111) : F11( 254, 111 )
| | : F12( 254, 222 )
| +--> F12(254,222) :
| :
+--> F10(2,567) :
:
F1(1,2) :
| : Second Group
+--> F2(2,3) : F1 ( 1, 2 )
| : F2 ( 2, 3 )
+--> F3(3,254) : F3 ( 3, 254 )
| | : F14( 2, 567 )
| +--> F15(254,111) : F15( 254, 111 )
| | : F16( 254, 222 )
| +--> F16(254,222) :
| :
+--> F14(2,567) :
You know, DataStage might not be my first tool of choice for solving graph problems. I wonder if this would work: it seems like something you could do in a transformer and a hash file.
It may help you to think of the record structure like so: LABEL(NODE1,NODE2)
e.g. parse
F16( 254, 222 ) out like this:
Code: Select all
LABEL=F16 (the name of the "wire" between 2 nodes)
NODE1=254
NODE2=222
The problem doesn't change much if you need the XYZ and ABC included,
we just treat the NODE id's as composite keys.
The hard problem is identifying which sets of nodes are connected.
That will essentially be your "grouping".
Question: are the NODE id's distinct across groups?
If not, you'll need another attribute to distinguish them
(e.g. add another field to the composite key).
I would build a hash to track your nodes, and assign each completely
new pair of nodes to a new group. I don't know how to phrase this logic
in Transformer-ese in text, so here is some pseudocode. (I suppose
you'd put the if-expressions on link conditions and do the hash-updates
down-range in other stages via different links.)
Code: Select all
stage_variable: group_cnt = 0
stage_variable: group
input record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
g1: lookup node_hash( node1 )
g2: lookup node_hash( node2 )
output record:
label: from input source (F1, F2, etc)
node1: from input source
node2: from input source
group: the stage var, assigned by transformer's logic below
if( g1 == null && g2 == null ) {
// first time for both, start a new group
group_cnt = group_cnt + 1;
node_hash( node1 ) = group_cnt; // store the group# for node1
node_hash( node2 ) = group_cnt; // store the group# for node2
group = group_cnt; // use this in the output record.
}
if( g1 != null && g2 != null ) {
if( g1 != g2 ) {
abort("ERROR: "+label+"("+node1+","+node2+") can not belong to "+g1+" AND "+g2 );
}
// no need to register group# (e.g. no hash write)
// because the group# is already in the hash for these nodes.
// And at this point we could set group=g1 or group=g2.
group = g1;
}
if( g1 == null && g2 != null ) {
// We have seen node2 before but this is the
// first time for node1, so use node2's group#.
group = g2;
}
if( g1 != null && g2 == null ) {
// at this point, g2 must be null & g1 has a group#.
// We have seen node1 before but this is the
// first time for node2, so use node1's group#.
group = g1;
}