Rangelookup

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
umamahes
Premium Member
Premium Member
Posts: 110
Joined: Tue Jul 04, 2006 9:08 pm

Rangelookup

Post by umamahes »

HI,

I have a requrement like i need to do range lookup.

Here i have my input file

123,167,7362,245,SDCCAT,3456,150,MED,MIB,H,ABCD,XYZ,JKH,VI
123,167,7362,245,SDCCAT,3456,150,MID,MIB,H,ABCD,XYZ,JKH,VI

LookupFIle:(ONE RECORD)
123,%,NOT50,(200-300),SDC%,3456,(100-50),MED,MIB,H,ABCD,XYZ,JKH,VI,MEDICARE

Here % means ANY,NOT50 means other than 50,SDC% Means Starting with SDC,(200-300) Range .

I need to handle all these situations to do the lookup.

Any idea to implement this in datastage using lookup stage or lookup file set stage.

Thanks
uma
HI
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

How many keys are you going to be joining on?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
us1aslam1us
Charter Member
Charter Member
Posts: 822
Joined: Sat Sep 17, 2005 5:25 pm
Location: USA

Post by us1aslam1us »

Looking at your sample date and explanation (which is vague), I think you need to do the lookup based on the columns other than the constraints you had defined(eg: first,sixth,eight and others) then filter the output of the lookup stage based on your constraints.

Code: Select all

Something like this either in filter stage or transformer:

in.thirdfield <> 50 and in.fourthfield>=200 and in.fourthfield<=300 and substring(in.fifthfield,1,3)='SDC' and in.seventhfield>=50 and in.seventhfield<=100
IHTH
I haven't failed, I've found 10,000 ways that don't work.
Thomas Alva Edison(1847-1931)
umamahes
Premium Member
Premium Member
Posts: 110
Joined: Tue Jul 04, 2006 9:08 pm

Post by umamahes »

I need to join 14 columns and 15th column is my required out put.

Range is not constant some times it may be (500-700).

Thanks
Uma
HI
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

It still makes no sense. Why is there only one lookup row? Does this row have those actual values it in or is this some type of pseudo code? It looks like you only join on about 10 columns and then use the pseudo code columns (such as NOT50 and 200-300) to build Stage Variable business rule code. But that's just a guess.
jgreve
Premium Member
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

Re: Rangelookup

Post by jgreve »

Are ranges always an integer value? Will you support date ranges?
Are ranges inclusive? e.g. is 500, 501, and 502 all part of (500-502) ?
Or just 501 ?

example broken out by fieldes:

Code: Select all

    LOOKUP rec .  INPUT rec
01: 123        .  123           . =
02: %          .  167           . = % matches anything
03: NOT50      .  7362          . = 7362 <> 50
04: (200-300)  .  245           . = 245 is within the range
05: SDC%       .  SDCCAT        . = SDCCAT begins with SDC
06: 3456       .  3456          . =
07: (100-50)   .  150           . <>  150 is out of range
08: MED        .  MID           . <>
09: MIB        .  MIB           . =
10: H          .  H             . =
11: ABCD       .  ABCD          . =
12: XYZ        .  XYZ           . =
13: JKH        .  JKH           . =
14: VI         .  VI            . =
15: MEDICARE   .                . lookup value

So is the idea with your example that each one should be able to
decide it belongs to "MEDICARE" ?
Except that field #'s 7 & 8 don't match.

In your "lookup file" example, is the 7th field (100-50) a typo?
That makes more sense if it was (100-150) or (100-500).
Technically a range can just be two end-points, but it would be easier
on the matching engine to have them be (minimum-maximum).

I think this would be a Real Challenge to do without
procedural programming. If I had to solve this, it would
be enough to make me learn how to do custom-transformers with C++.

Or possibly write this as a database stored-proc and just feed
the MEDICARE (or whatever) category into DS.
Or possibly write this as a Perl pre-processor.

The only way I can think of to do what you want is to
stuff something like this into a custom transformer:
(warning - crappy pseudo-code ahead)

Code: Select all

load match records (from some link?)
a[15] = ""
with current record a
   ok = true
   foreach match record b
      for i = 1 to 14
         if not match( a[i], b[i] ) then
            ok = false
            break
         endif
      next i
      if ok then a[15] = b[15]
      break
   next b
end

sub match( pattern, value ) {
   if indexof(pattern,"%") >= 0 then return wildcard(pattern, value);
   if indexof(pattern,"-") >= 0 then return range(pattern, value);
   // I would suggest modifying your "NOT" syntax to be "NOT:" or just "!<expr>"
   // as that would let you recycle the match() function.
   if indexof(pattern,"NOT:") = 0 then return not match(substr(pattern,3), value);

   // nothing fancy, just see if they're equal.
   // if c++ is playing nice, = will be overloaded
   // for the various types.
   return pattern = value;
}

// other subroutines left as an exercise for the reader



umamahes wrote:HI,

I have a requrement like i need to do range lookup.

Here i have my input file

123,167,7362,245,SDCCAT,3456,150,MED,MIB,H,ABCD,XYZ,JKH,VI
123,167,7362,245,SDCCAT,3456,150,MID,MIB,H,ABCD,XYZ,JKH,VI

LookupFIle:(ONE RECORD)
123,%,NOT50,(200-300),SDC%,3456,(100-50),MED,MIB,H,ABCD,XYZ,JKH,VI,MEDICARE

Here % means ANY,NOT50 means other than 50,SDC% Means Starting with SDC,(200-300) Range .

I need to handle all these situations to do the lookup.

Any idea to implement this in datastage using lookup stage or lookup file set stage.

Thanks
uma
umamahes
Premium Member
Premium Member
Posts: 110
Joined: Tue Jul 04, 2006 9:08 pm

Post by umamahes »

jgreve,your understanding is correct.range is typo error.it is (50-100).

i need to develop parallel job for this.i wrote a c++ code but it is taking more time.can any one please help me out to solve this issue.

Thanks
Uma
HI
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Dont have to go through all that, just load your files into a two different tables in a schema and write sql to do it. It will be much easier.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
SettValleyConsulting
Premium Member
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

Range lookup is coming in Hawk. Until then there are three possible solutions :-

1. Do the range lookup in your RDBMS wherever possible.

2. For a small number of possible values within the range, generate a reference dataset with a row for each value, do an exact match lookup or join, then immediately drop all those rows that do not match. We called this the Cartesian Product solution and it's pretty horrible for more than a small number of rows.

3. Code a Buildop. I have one that reads in the band names, upper and lower values from a text file in the Pre_loop code then uses these to determine which band a row falls in. Not pretty, but it gives reasonable performance in Live. Mail me if you could use further details. I include some sample code below. The advantage is that the number of bands and their bounds can be changed without recompiling.

regards

Phil Clarke.

Comments
=======

All rows from the reference link are read into the stage.

The Max, Min and Band Name for each band are stored in an internal table (three arrays). The bands are assumed to be in ascending order of Min Value, with no gaps between bands.

For each row on the input link:

The value is examined.
If it is less than the lowest minimum or greater than the highest maximum the row dropped or written with a literal value in Band indicating 'Out of Bounds'.

Loop though the band arrays:
If the value is between the Min and Max (inclusive) then
Write Band Name to the output Band name.
Set a Found indicator to true.
Break out of the loop.
Next Band

If Not Found, the row is dropped or written with a literal value in Band indicating 'Value not found'.


Code: Select all

Definitions
===========

/****************************************************************
  Declare the arrays to hold Name, Upper and Lower
  Bounds of the ranges and the count of Bands
 ****************************************************************/ 

long      Lower_Bound [100];
long      Upper_Bound [100];
APT_String Band        [100];

int        NumBands=0;


Pre-Loop

//
// Pre-Loop Processing. Read the range names and bounds from the Reference Link (Link 1).
//

int n;                                                  // Used in Debugging only
readRecord(1);                                          // Read in the first row from the ref link
while (!inputDone(1))                                   // Loop until End of File
{
      if (NumBands <= 99)                               // If number of bands too high, ignore any further bands. nb arrays start at zero. 
        {
        Lower_Bound[NumBands] = RefRec.Min;             // Assign values from the file to the Band arrays
        Upper_Bound[NumBands] = RefRec.Max;
        Band[NumBands] = RefRec.RangeName;
        NumBands++;                                     // Add one to the Band Count 
        }
      readRecord(1);                                    // Read next row 
}

cout << "Banding Data"                             << endl;
cout << "Lowest Min: "  << Lower_Bound[0]          << endl;
cout << "Highest Max: " << Upper_Bound[NumBands-1] << endl;
cout <<" "                                         << endl;
cout << "Band    Range"                            << endl;

for (n=0;n<NumBands;n++)
  {
  cout << Band[n] << " " << Lower_Bound[n] << " - " << Upper_Bound[n]<< endl; 
  }

Per-Loop
========

int          found;                      

/* 
found flag set as follows:  0 = value is in a band. 
                            1 = value is out of bounds
                            2 = value is not found,
                            3 = value is null
*/

int          ThisBand;                   // Current Band Number
long         TestValue;                  // Current Value 
APT_String   OutMessage;                 // Message to insert if unbanded

TestValue       = InRec.ValueIn;         // Store the value
OutRec.ValueOut = InRec.ValueIn;         // Move the value to the output buffer

found = 2;                               // Start with a 'not found' condition

// Test for Out-of-Range Values

if (TestValue == -1)                      // -1 indicates a Null, pass through unchanged
  found = 3;
else
{
  if (TestValue < Lower_Bound[0] || TestValue > Upper_Bound[NumBands-1])
    found = 1;
  else
    {
    for (ThisBand=0; ThisBand < NumBands; ThisBand++)       // Loop through Bands
      {
      if (TestValue<Upper_Bound[ThisBand] && TestValue >=Lower_Bound[ThisBand])
        {
        found=0;                             // Found the cotrect band
        OutRec.BandName = Band[ThisBand];    // Move band name to the output
        break;                               // Break out of the loop
        }
      }
    }
}

/*  
    Test the results of the search
    If a valid band has not been found either
    drop the row or insert an error message.   
*/

  if (found != 0)
  {
    if (DropRows=='True')                  // If DropRows property is True,
        discardRecord(0);                  // discard this record.
    else 
      {
      if (found == 2)                      // Insert the appropriate error message
        OutRec.BandName = ValueNotFoundMessage;    
      else if (found == 1)
        OutRec.BandName = OutOfBoundsMessage;    
      else if (found == 3)
        OutRec.BandName = "";    
      }
  } 

// Done
Post Reply