Rangelookup
Moderators: chulett, rschirm, roy
Rangelookup
HI,
I have a requrement like i need to do range lookup.
Here i have my input file
123,167,7362,245,SDCCAT,3456,150,MED,MIB,H,ABCD,XYZ,JKH,VI
123,167,7362,245,SDCCAT,3456,150,MID,MIB,H,ABCD,XYZ,JKH,VI
LookupFIle:(ONE RECORD)
123,%,NOT50,(200-300),SDC%,3456,(100-50),MED,MIB,H,ABCD,XYZ,JKH,VI,MEDICARE
Here % means ANY,NOT50 means other than 50,SDC% Means Starting with SDC,(200-300) Range .
I need to handle all these situations to do the lookup.
Any idea to implement this in datastage using lookup stage or lookup file set stage.
Thanks
uma
I have a requrement like i need to do range lookup.
Here i have my input file
123,167,7362,245,SDCCAT,3456,150,MED,MIB,H,ABCD,XYZ,JKH,VI
123,167,7362,245,SDCCAT,3456,150,MID,MIB,H,ABCD,XYZ,JKH,VI
LookupFIle:(ONE RECORD)
123,%,NOT50,(200-300),SDC%,3456,(100-50),MED,MIB,H,ABCD,XYZ,JKH,VI,MEDICARE
Here % means ANY,NOT50 means other than 50,SDC% Means Starting with SDC,(200-300) Range .
I need to handle all these situations to do the lookup.
Any idea to implement this in datastage using lookup stage or lookup file set stage.
Thanks
uma
HI
-
- Charter Member
- Posts: 822
- Joined: Sat Sep 17, 2005 5:25 pm
- Location: USA
Looking at your sample date and explanation (which is vague), I think you need to do the lookup based on the columns other than the constraints you had defined(eg: first,sixth,eight and others) then filter the output of the lookup stage based on your constraints.
IHTH
Code: Select all
Something like this either in filter stage or transformer:
in.thirdfield <> 50 and in.fourthfield>=200 and in.fourthfield<=300 and substring(in.fifthfield,1,3)='SDC' and in.seventhfield>=50 and in.seventhfield<=100
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison(1847-1931)
Thomas Alva Edison(1847-1931)
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
It still makes no sense. Why is there only one lookup row? Does this row have those actual values it in or is this some type of pseudo code? It looks like you only join on about 10 columns and then use the pseudo code columns (such as NOT50 and 200-300) to build Stage Variable business rule code. But that's just a guess.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Re: Rangelookup
Are ranges always an integer value? Will you support date ranges?
Are ranges inclusive? e.g. is 500, 501, and 502 all part of (500-502) ?
Or just 501 ?
example broken out by fieldes:
So is the idea with your example that each one should be able to
decide it belongs to "MEDICARE" ?
Except that field #'s 7 & 8 don't match.
In your "lookup file" example, is the 7th field (100-50) a typo?
That makes more sense if it was (100-150) or (100-500).
Technically a range can just be two end-points, but it would be easier
on the matching engine to have them be (minimum-maximum).
I think this would be a Real Challenge to do without
procedural programming. If I had to solve this, it would
be enough to make me learn how to do custom-transformers with C++.
Or possibly write this as a database stored-proc and just feed
the MEDICARE (or whatever) category into DS.
Or possibly write this as a Perl pre-processor.
The only way I can think of to do what you want is to
stuff something like this into a custom transformer:
(warning - crappy pseudo-code ahead)
Are ranges inclusive? e.g. is 500, 501, and 502 all part of (500-502) ?
Or just 501 ?
example broken out by fieldes:
Code: Select all
LOOKUP rec . INPUT rec
01: 123 . 123 . =
02: % . 167 . = % matches anything
03: NOT50 . 7362 . = 7362 <> 50
04: (200-300) . 245 . = 245 is within the range
05: SDC% . SDCCAT . = SDCCAT begins with SDC
06: 3456 . 3456 . =
07: (100-50) . 150 . <> 150 is out of range
08: MED . MID . <>
09: MIB . MIB . =
10: H . H . =
11: ABCD . ABCD . =
12: XYZ . XYZ . =
13: JKH . JKH . =
14: VI . VI . =
15: MEDICARE . . lookup value
So is the idea with your example that each one should be able to
decide it belongs to "MEDICARE" ?
Except that field #'s 7 & 8 don't match.
In your "lookup file" example, is the 7th field (100-50) a typo?
That makes more sense if it was (100-150) or (100-500).
Technically a range can just be two end-points, but it would be easier
on the matching engine to have them be (minimum-maximum).
I think this would be a Real Challenge to do without
procedural programming. If I had to solve this, it would
be enough to make me learn how to do custom-transformers with C++.
Or possibly write this as a database stored-proc and just feed
the MEDICARE (or whatever) category into DS.
Or possibly write this as a Perl pre-processor.
The only way I can think of to do what you want is to
stuff something like this into a custom transformer:
(warning - crappy pseudo-code ahead)
Code: Select all
load match records (from some link?)
a[15] = ""
with current record a
ok = true
foreach match record b
for i = 1 to 14
if not match( a[i], b[i] ) then
ok = false
break
endif
next i
if ok then a[15] = b[15]
break
next b
end
sub match( pattern, value ) {
if indexof(pattern,"%") >= 0 then return wildcard(pattern, value);
if indexof(pattern,"-") >= 0 then return range(pattern, value);
// I would suggest modifying your "NOT" syntax to be "NOT:" or just "!<expr>"
// as that would let you recycle the match() function.
if indexof(pattern,"NOT:") = 0 then return not match(substr(pattern,3), value);
// nothing fancy, just see if they're equal.
// if c++ is playing nice, = will be overloaded
// for the various types.
return pattern = value;
}
// other subroutines left as an exercise for the reader
umamahes wrote:HI,
I have a requrement like i need to do range lookup.
Here i have my input file
123,167,7362,245,SDCCAT,3456,150,MED,MIB,H,ABCD,XYZ,JKH,VI
123,167,7362,245,SDCCAT,3456,150,MID,MIB,H,ABCD,XYZ,JKH,VI
LookupFIle:(ONE RECORD)
123,%,NOT50,(200-300),SDC%,3456,(100-50),MED,MIB,H,ABCD,XYZ,JKH,VI,MEDICARE
Here % means ANY,NOT50 means other than 50,SDC% Means Starting with SDC,(200-300) Range .
I need to handle all these situations to do the lookup.
Any idea to implement this in datastage using lookup stage or lookup file set stage.
Thanks
uma
-
- Premium Member
- Posts: 72
- Joined: Thu Sep 04, 2003 5:01 am
- Location: UK & Europe
Range lookup is coming in Hawk. Until then there are three possible solutions :-
1. Do the range lookup in your RDBMS wherever possible.
2. For a small number of possible values within the range, generate a reference dataset with a row for each value, do an exact match lookup or join, then immediately drop all those rows that do not match. We called this the Cartesian Product solution and it's pretty horrible for more than a small number of rows.
3. Code a Buildop. I have one that reads in the band names, upper and lower values from a text file in the Pre_loop code then uses these to determine which band a row falls in. Not pretty, but it gives reasonable performance in Live. Mail me if you could use further details. I include some sample code below. The advantage is that the number of bands and their bounds can be changed without recompiling.
regards
Phil Clarke.
Comments
=======
All rows from the reference link are read into the stage.
The Max, Min and Band Name for each band are stored in an internal table (three arrays). The bands are assumed to be in ascending order of Min Value, with no gaps between bands.
For each row on the input link:
The value is examined.
If it is less than the lowest minimum or greater than the highest maximum the row dropped or written with a literal value in Band indicating 'Out of Bounds'.
Loop though the band arrays:
If the value is between the Min and Max (inclusive) then
Write Band Name to the output Band name.
Set a Found indicator to true.
Break out of the loop.
Next Band
If Not Found, the row is dropped or written with a literal value in Band indicating 'Value not found'.
1. Do the range lookup in your RDBMS wherever possible.
2. For a small number of possible values within the range, generate a reference dataset with a row for each value, do an exact match lookup or join, then immediately drop all those rows that do not match. We called this the Cartesian Product solution and it's pretty horrible for more than a small number of rows.
3. Code a Buildop. I have one that reads in the band names, upper and lower values from a text file in the Pre_loop code then uses these to determine which band a row falls in. Not pretty, but it gives reasonable performance in Live. Mail me if you could use further details. I include some sample code below. The advantage is that the number of bands and their bounds can be changed without recompiling.
regards
Phil Clarke.
Comments
=======
All rows from the reference link are read into the stage.
The Max, Min and Band Name for each band are stored in an internal table (three arrays). The bands are assumed to be in ascending order of Min Value, with no gaps between bands.
For each row on the input link:
The value is examined.
If it is less than the lowest minimum or greater than the highest maximum the row dropped or written with a literal value in Band indicating 'Out of Bounds'.
Loop though the band arrays:
If the value is between the Min and Max (inclusive) then
Write Band Name to the output Band name.
Set a Found indicator to true.
Break out of the loop.
Next Band
If Not Found, the row is dropped or written with a literal value in Band indicating 'Value not found'.
Code: Select all
Definitions
===========
/****************************************************************
Declare the arrays to hold Name, Upper and Lower
Bounds of the ranges and the count of Bands
****************************************************************/
long Lower_Bound [100];
long Upper_Bound [100];
APT_String Band [100];
int NumBands=0;
Pre-Loop
//
// Pre-Loop Processing. Read the range names and bounds from the Reference Link (Link 1).
//
int n; // Used in Debugging only
readRecord(1); // Read in the first row from the ref link
while (!inputDone(1)) // Loop until End of File
{
if (NumBands <= 99) // If number of bands too high, ignore any further bands. nb arrays start at zero.
{
Lower_Bound[NumBands] = RefRec.Min; // Assign values from the file to the Band arrays
Upper_Bound[NumBands] = RefRec.Max;
Band[NumBands] = RefRec.RangeName;
NumBands++; // Add one to the Band Count
}
readRecord(1); // Read next row
}
cout << "Banding Data" << endl;
cout << "Lowest Min: " << Lower_Bound[0] << endl;
cout << "Highest Max: " << Upper_Bound[NumBands-1] << endl;
cout <<" " << endl;
cout << "Band Range" << endl;
for (n=0;n<NumBands;n++)
{
cout << Band[n] << " " << Lower_Bound[n] << " - " << Upper_Bound[n]<< endl;
}
Per-Loop
========
int found;
/*
found flag set as follows: 0 = value is in a band.
1 = value is out of bounds
2 = value is not found,
3 = value is null
*/
int ThisBand; // Current Band Number
long TestValue; // Current Value
APT_String OutMessage; // Message to insert if unbanded
TestValue = InRec.ValueIn; // Store the value
OutRec.ValueOut = InRec.ValueIn; // Move the value to the output buffer
found = 2; // Start with a 'not found' condition
// Test for Out-of-Range Values
if (TestValue == -1) // -1 indicates a Null, pass through unchanged
found = 3;
else
{
if (TestValue < Lower_Bound[0] || TestValue > Upper_Bound[NumBands-1])
found = 1;
else
{
for (ThisBand=0; ThisBand < NumBands; ThisBand++) // Loop through Bands
{
if (TestValue<Upper_Bound[ThisBand] && TestValue >=Lower_Bound[ThisBand])
{
found=0; // Found the cotrect band
OutRec.BandName = Band[ThisBand]; // Move band name to the output
break; // Break out of the loop
}
}
}
}
/*
Test the results of the search
If a valid band has not been found either
drop the row or insert an error message.
*/
if (found != 0)
{
if (DropRows=='True') // If DropRows property is True,
discardRecord(0); // discard this record.
else
{
if (found == 2) // Insert the appropriate error message
OutRec.BandName = ValueNotFoundMessage;
else if (found == 1)
OutRec.BandName = OutOfBoundsMessage;
else if (found == 3)
OutRec.BandName = "";
}
}
// Done