
Build Op for "LIKE" operation between data in two files

Posted: Fri Dec 14, 2012 1:08 am
by Rakesh311
Hi Guys,

I have a requirement to perform a "LIKE" operation between data in two files. Since the data is large, I wrote a Build Op to do the operation, which works like a Join stage.


INPUT DATA File Format

InputCode| Input_Record


REF DATA File Format

RefCode|Ref_Record


Matching Condition

If InputCode = RefCode and RefRecord LIKE %InputRecord%, then output InputRecord.



The input can hold a maximum of 2 million records, while the ref file can be of any size. So I am using arrays of size 2 million to store the input records and streaming the ref data through the Build Op for matching.

I am attaching the code I used:

Code:

[b]Definitions:[/b]

int i = 0;                  // number of input records buffered
string Inpcode[2000000];    // input codes
string InpRecord[2000000];  // input records
string RefRecord;           // current ref record
int j;


[b]Pre-loop:[/b]

// Buffer every input record into the arrays.
readRecord(0);
while (inputDone(0) != 1)
{
    doTransfersFrom(0);

    Inpcode[i] = InRec.Inputcode;
    InpRecord[i] = InRec.InputRecord;
    i++;

    readRecord(0);
}


[b]Per-Record:[/b]

// For each ref record, scan the whole input array for matches.
readRecord(1);
while (inputDone(1) != 1)
{
    doTransfersFrom(1);
    RefRecord = (string) InRecRf.RefRecord;
    for (j = 0; j < i; j++)   // was j <= i, which read one element past the last buffered record
    {
        if ((string) InRecRf.Inputcode == Inpcode[j])
        {
            // "RefRecord LIKE %InputRecord%"
            if (RefRecord.find(InpRecord[j]) != string::npos)
            {
                OutRec.InpMatchRecord = InpRecord[j].c_str();
                transferAndWriteRecord(0);
            }
        }
    }

    readRecord(1);
}


When the ref count is about 1 or 2 million I get good performance, but beyond that performance degrades drastically and the job consumes most of the CPU.

Can you please suggest a better way of doing this?

Posted: Fri Dec 14, 2012 3:37 am
by ray.wurlod
Why not perform an equi-join on the ID values with a Join stage and the comparison in a Transformer stage (using the Index() function) downstream of that?

Posted: Fri Dec 14, 2012 4:21 am
by Rakesh311
Hi Ray,

In this case the same ID value may repeat millions of times in both the Ref file and the Input file, so performing a Join would create a huge volume of data for comparison.

That in turn may cause performance issues and fill up scratch disk space as well as Dataset space.

That is the reason why we went with a Build Op!

Thanks

Posted: Fri Dec 14, 2012 10:09 am
by jwiles
Your logic's performance suffers because it is performing a loop through the entire set of Input records for each Ref record that passes through (causing the high CPU usage).

Are you wanting to keep only one copy of an input record regardless of how many ref records it matches to? Or do you want a copy of an input record for each ref record it matches to? With Ray's suggestion, you can easily control this in the transformer by using the appropriate combination of stage variables and constraints and avoid the problems you perceive with scratch space.

You could likely pivot the ref records on the refcode column, leaving you with one record containing all the ref record data, reducing the output of the join even further. This would help especially when you have multiple input records with the same ref code.

Regards,

Posted: Fri Dec 14, 2012 11:02 pm
by Rakesh311
Yes, I do want to compare my entire set of input records against each of the ref records.
Since I am using AUTO transfer mode for the ref link, I get the InpRecord along with its matched RefRecord.

Doing this through a Transformer was my initial approach, but it gave us a lot of performance and disk space issues.
So we went with the Build Op, and it works fine when the record counts in input and ref range from a few thousand up to 1 million.

But now we may get more than 1 million records in both input and ref, and in such cases we face issues with the Build Op as well.

Is there any way to optimize the Build Op code?

Thanks

Posted: Sat Dec 15, 2012 12:31 am
by jwiles
You don't need to loop through the entire set of input records for each ref record. If your data is properly sorted and partitioned on each input, then you only need to search through the input records which have the same ref code as your current ref record. To search through ALL of them is wasteful and inefficient. 1 million searches through 1 million rows is 1 trillion comparisons.

If you were having disk space issues when attempting to use join and transformer, then your job was incorrectly designed. You shouldn't need to use any dataset storage, and you shouldn't need to use much more scratch space than is needed to sort your two source files. Pivoting the ref data on ref code will reduce the number of ref records flowing into the join and transformer.

If you insist on following the BuildOp route, adjust the starting point of your loop (variable j) based upon the ref code from your current ref record. For example: You know ref code '10' can't match to ref code '00' through '09', so don't loop through those input records! Start at the first input record with the ref code '10' (if there is one...if not, you can't match it anyway). You also know you can't match to any ref code value > '10', so stop looking (i.e. break out of your loop) when you hit an input record with a ref code > '10'.

Alternatively, you could load ref records into your array, but ONLY those with a ref code which matches the ref code on your current input record. When the input record ref code changes, replace the current array of ref records with the next set which matches on ref code.

In short, the best (and really the only) way to make your buildop more efficient is to reduce the number of comparisons you make (quit looping through the entire set of input records for each ref record). However, you don't need a buildop to do this...it can be done efficiently with a properly designed job using only sorts, a join and transformers (and optionally a pivot stage) and without an excessive use of scratch storage.

Regards,