Build Op for "LIKE" operation between data in two files

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Rakesh311
Participant
Posts: 36
Joined: Wed Aug 04, 2010 10:53 pm
Location: Banglore
Contact:

Build Op for "LIKE" operation between data in two files

Post by Rakesh311 »

Hi Guys,

I have a requirement to perform a "LIKE" operation between data in two files.
Since the data volume is large, I wrote a Build Op to do the operation, which works like a Join stage.


INPUT DATA File Format

InputCode|Input_Record


REF DATA File Format

RefCode|Ref_Record


Matching Condition

If InputCode = RefCode and Ref_Record LIKE '%Input_Record%', then output Input_Record.



The input file can have a maximum of 2 million records, while the ref file can be of any count. So I am using arrays sized for 2 million entries to store the input records, and passing the ref data through the Build Op for matching.

I am attaching the code I used:

Code:

Definitions:

int i = 0;                      // number of input records loaded in Pre-loop
string Inpcode[2000000];        // Inputcode values from the input link
string InpRecord[2000000];      // InputRecord values from the input link
string RefRecord;               // current RefRecord value
int j;


Pre-loop:

// Read the whole input link (link 0) into the arrays.
readRecord(0);
while (inputDone(0) != 1)
{
    doTransfersFrom(0);

    Inpcode[i] = InRec.Inputcode;
    InpRecord[i] = InRec.InputRecord;
    i++;

    readRecord(0);
}


Per-Record:

// For every ref record (link 1), scan all loaded input records.
readRecord(1);
while (inputDone(1) != 1)
{
    doTransfersFrom(1);
    RefRecord = (string) InRecRf.RefRecord;

    for (j = 0; j < i; j++)     // j < i: i is the count of loaded records
    {
        if ((string) InRecRf.Inputcode == Inpcode[j])
        {
            // "RefRecord LIKE %InputRecord%" as a substring search
            if (RefRecord.find(InpRecord[j]) != string::npos)
            {
                char *ipInputRecord = (char*) InpRecord[j].c_str();
                OutRec.InpMatchRecord = ipInputRecord;
                transferAndWriteRecord(0);
            }
        }
    }

    readRecord(1);
}


If the ref count is about 1 or 2 million, I get good performance. But when it goes beyond that, performance drops drastically and the job takes most of the CPU.

Can you guys please suggest a better way of doing this?
rAKESH
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Why not perform an equi-join on the ID values with a Join stage, and do the comparison in a Transformer stage (using the Index() function) downstream of that?
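
For illustration, the per-row check in that downstream Transformer is just a substring test on each row the Join stage emits. Here is a minimal C++ sketch of the logic only (in the real job this would be a Transformer constraint or derivation using Index(); the names are illustrative):

Code:

// Sketch only: what the Index()-based check boils down to for one joined row.
#include <string>

bool refContainsInput(const std::string& refRecord, const std::string& inputRecord)
{
    // Index(Ref_Record, Input_Record, 1) > 0  <=>  the substring is present
    return refRecord.find(inputRecord) != std::string::npos;
}

Rows that fail the test are simply dropped by the constraint.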
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Rakesh311
Participant
Posts: 36
Joined: Wed Aug 04, 2010 10:53 pm
Location: Banglore
Contact:

Post by Rakesh311 »

Hi Ray,

In this case we may have the same ID value repeating millions of times in the Ref file as well as in the Input file. So if we perform a Join, it will create a HUGE volume of data for comparison.

That in turn may cause performance issues and fill up the scratch disk space as well as the Dataset space.

That is the reason why we went for the Build Op!

Thanks
rAKESH
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Your logic's performance suffers because it is performing a loop through the entire set of Input records for each Ref record that passes through (causing the high CPU usage).

Are you wanting to keep only one copy of an input record regardless of how many ref records it matches to? Or do you want a copy of an input record for each ref record it matches to? With Ray's suggestion, you can easily control this in the transformer by using the appropriate combination of stage variables and constraints and avoid the problems you perceive with scratch space.

You could likely pivot the ref records on the refcode column, leaving you with one record containing all the ref record data, reducing the output of the join even further. This would help especially when you have multiple input records with the same ref code.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
Rakesh311
Participant
Posts: 36
Joined: Wed Aug 04, 2010 10:53 pm
Location: Banglore
Contact:

Post by Rakesh311 »

Yes, I do want to loop through the entire set of input records for each ref record.
Since I am using AUTO transfer mode for the ref link, I will get the InpRecord along with its matched RefRecord.

Doing this through a Transformer was my initial approach, but it gave a lot of issues related to performance and disk space.
So we went with the Build Op, and this works fine when the record count in input and ref is from a few thousand up to 1 million.

But now we may get more than 1 million records in both input and ref.
And for such cases we are facing issues with the Build Op as well.

Is there any way to optimize the Build Op code?

Thanks
rAKESH
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

You don't need to loop through the entire set of input records for each ref record. If your data is properly sorted and partitioned on each input, then you only need to search through the input records which have the same ref code as your current ref record. To search through ALL of them is wasteful and inefficient. 1 million searches through 1 million rows is 1 trillion comparisons.

If you were having disk space issues when attempting to use join and transformer, then your job was incorrectly designed. You shouldn't need to use any dataset storage, and you shouldn't need to use much more scratch space than is needed to sort your two source files. Pivoting the ref data on ref code will reduce the number of ref records flowing into the join and transformer.

If you insist on following the BuildOp route, adjust the starting point of your loop (variable j) based upon the ref code from your current ref record. For example: You know ref code '10' can't match to ref code '00' through '09', so don't loop through those input records! Start at the first input record with the ref code '10' (if there is one...if not, you can't match it anyway). You also know you can't match to any ref code value > '10', so stop looking (i.e. break out of your loop) when you hit an input record with a ref code > '10'.
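
As a rough illustration of that bounded scan, here is a minimal sketch in the style of the posted BuildOp code. It assumes the input link is sorted on Inputcode before it reaches the operator, so that equal codes occupy a contiguous slice of the arrays loaded in Pre-loop; firstIndexOfCode is a hypothetical helper, not part of any BuildOp API:

Code:

// Sketch only: binary search for the first array slot whose code is >= the
// current ref code, assuming Inpcode[] was loaded in sorted order.
int firstIndexOfCode(const string code[], int count, const string& refCode)
{
    int lo = 0, hi = count;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (code[mid] < refCode)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

// Per-Record (sketch): scan only the slice that can possibly match, and stop
// as soon as the code changes instead of walking all i input records.
//
//   refCode = (string) InRecRf.Inputcode;
//   j = firstIndexOfCode(Inpcode, i, refCode);
//   while (j < i && Inpcode[j] == refCode)
//   {
//       if (RefRecord.find(InpRecord[j]) != string::npos)
//       {
//           OutRec.InpMatchRecord = InpRecord[j].c_str();
//           transferAndWriteRecord(0);
//       }
//       j++;
//   }

This cuts each ref record's work from i comparisons down to the size of its matching code group.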

Alternatively, you could load ref records into your array, but ONLY those with a ref code which matches the ref code on your current input record. When the input record ref code changes, replace the current array of ref records with the next set which matches on ref code.

In short, the best (and really only) way to make your buildop more efficient is to reduce the number of comparisons you make (quit looping through the entire set of input records for each ref record). However, you don't need a buildop to do this...it can be done efficiently with a properly designed job using only sorts, a join and transformers (and optionally a pivot stage) and without excessive use of scratch storage.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.