
DataStage Parallel Routines and Malloc/New

Posted: Mon Jun 18, 2007 11:51 pm
by bmsq
Hi all,

I'm just about to start experimenting with DataStage Parallel routines. Before working with DataStage, C and C++ were my primary languages of choice, so I'm quite comfortable writing a px routine. However, I've done a bit of reading of the DS manuals and some of the threads in this forum and have found several example px routines which allocate small amounts of temporary memory.

Since these functions are potentially getting called millions of times during execution of a job, how does this allocation affect the overall performance? I would have expected lots of small allocations/deallocations to both reduce performance and seriously fragment memory. Is it better to be using a fast pooled allocator for small temporary allocations? Or is the performance overhead negligible compared to the Orchestrate engine itself?

Can anyone share any thoughts or experiences in this matter?

Cheers,
barry

Posted: Tue Jun 19, 2007 12:13 am
by bmsq
On the topic of memory allocation within a DataStage px routine, I found this example px routine (thanks DSguru2B):


#include "stdio.h"
#include "string.h"
#include "stdlib.h"

char* pxEreplace(char *str, char *subStr, char *rep, int num, int beg)
{
  char *result = (char *)malloc (sizeof(char *));
  int newlen = strlen(rep);
  int oldlen = strlen(subStr);
  int i, x, count = 0;
 
  //If beginning is less than or equal to 1 then default it to 1
  if (beg <= 1)
  {beg = 1;}

  //replace all instances if value of num less than or equal to 0
  if (num <= 0)
  {num = strlen(str);}

  //Get the character position in i for substring instance to start from
  for (i = 0; str[i] != '\0' ; i++)
   {
     if (strstr(&str[i], subStr) == &str[i])
     {
      count++;
      i += oldlen - 1;
      if (count == beg)
      {break;}
     }
   }

   //Get everything before position i before replacement begins

   x = 0;
   while (i != x)
   {  result[x++] = *str++; }

  //Start replacement
   while (*str) //for the complete input string
   {

    if (num != 0 ) // until no more occurrences need to be changed
    {
       if (strstr(str, subStr) == str )
       {
          strcpy(&result[x], rep);
          x += newlen;
          str += oldlen;
          num--;
       }
       else // if no match is found
       {
          result[x++] = *str++;
       }
    }
    else
    {
       result[x++] = *str++;
    }
   }

    result[x] = '\0'; //Terminate the string
    return result; //Return the replaced string
    free(result);   //free memory
}
However, I noticed a couple of things which concerned me and left me wondering how memory is usually allocated within a px routine such as this.

First, malloc only allocates a buffer large enough to hold a single pointer rather than the entire new string. A normal C program would exhibit strange behavior due to buffer overruns (maybe even a seg fault) in this situation. Is this an error in the example, or is it normal practice for px routines?

Second, how does DS free allocated memory? This example tries to free(result) after the return statement (this would probably get compiled out), so there appears to be a memory leak. Will DataStage free this variable later? Or is there some form of DS allocator to use which handles this transparently?
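
To make the first concern concrete, here is a minimal sketch of sizing the buffer from the actual strings instead of from sizeof(char *). It is not the routine quoted above: count_occurrences and replace_all are hypothetical names, the sketch replaces every occurrence (no num/beg handling), and it assumes the caller owns and frees the returned buffer, which may not match how DataStage treats return values.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count non-overlapping occurrences of sub in str. */
static size_t count_occurrences(const char *str, const char *sub)
{
    size_t sublen = strlen(sub);
    size_t count = 0;
    const char *p;

    if (sublen == 0)
        return 0;   /* an empty pattern never matches in this sketch */
    for (p = strstr(str, sub); p != NULL; p = strstr(p + sublen, sub))
        count++;
    return count;
}

/* Hypothetical helper: replace every occurrence of sub with rep.
 * Returns a malloc'd string sized exactly for the result, or NULL on
 * allocation failure. Assumption: the caller frees the result. */
char *replace_all(const char *str, const char *sub, const char *rep)
{
    size_t oldlen = strlen(sub);
    size_t newlen = strlen(rep);
    size_t n = count_occurrences(str, sub);
    /* Each match removes oldlen bytes and adds newlen; +1 for '\0'. */
    size_t size = strlen(str) - n * oldlen + n * newlen + 1;
    char *result = malloc(size);
    char *out = result;

    if (result == NULL)
        return NULL;
    while (*str) {
        if (oldlen > 0 && strncmp(str, sub, oldlen) == 0) {
            memcpy(out, rep, newlen);   /* copy the replacement */
            out += newlen;
            str += oldlen;              /* skip the matched text */
        } else {
            *out++ = *str++;            /* copy one unmatched byte */
        }
    }
    *out = '\0';
    return result;
}
```

A caller would use it as `char *r = replace_all(input, "-", "--");` and call `free(r)` once the value has been consumed. Counting first costs an extra pass over the input, but the allocation is then exact and the copy loop cannot overrun it.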

Thanks,
Barry

Posted: Tue Jun 19, 2007 12:15 am
by ArndW
malloc() is a pretty fast and efficient call; plus the space is released after the call is complete, so no fragmentation occurs due to these calls.

I'm not sure if implementations of fast pooled memory allocation are portable across UNIX versions, as the calls seem to be different.

It would certainly be worth a try to test the performance over millions of rows, but from past experience I would guess that the incremental time in cpu-ticks of a malloc() will disappear or become quite small when compared with the overhead of the PCL mechanism used to invoke the C++ routine per row.
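
A test like the one suggested above could start from a small standalone harness before involving DataStage at all. The sketch below is only that: work_malloc, work_stack and time_calls are hypothetical names, the 256-byte stack buffer is an arbitrary assumption, and the numbers say nothing about the per-row overhead of the engine itself. An optimizing compiler may also remove the malloc/free pair entirely, so check the build flags before trusting the timings.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Variant A: allocate and free a small scratch buffer on every call. */
static size_t work_malloc(const char *s)
{
    size_t len = strlen(s);
    size_t r;
    char *tmp = malloc(len + 1);

    if (tmp == NULL)
        return 0;
    memcpy(tmp, s, len + 1);
    r = strlen(tmp);
    free(tmp);
    return r;
}

/* Variant B: same work using a fixed stack buffer (assumed 256 bytes). */
static size_t work_stack(const char *s)
{
    char tmp[256];
    size_t len = strlen(s);

    if (len >= sizeof tmp)
        return 0;   /* input too long for the fixed buffer */
    memcpy(tmp, s, len + 1);
    return strlen(tmp);
}

/* Call fn(s) `calls` times; return elapsed CPU seconds.
 * The summed results go to *sink so the loop has an observable effect. */
static double time_calls(size_t (*fn)(const char *), const char *s,
                         long calls, size_t *sink)
{
    clock_t start = clock();
    size_t total = 0;
    long i;

    for (i = 0; i < calls; i++)
        total += fn(s);
    *sink = total;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_calls(work_malloc, "some sample value", 10000000L, &sink)` and the same with `work_stack` from a small main gives two figures to compare; the difference per call is the quantity to weigh against the cost of invoking the routine per row.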

Posted: Tue Jun 19, 2007 7:21 am
by DSguru2B
Allocating memory with malloc the way I did for result only allocates the size of a char pointer. Having written many more px routines since this one, I found out that it's better to explicitly specify the size based on the input string rather than just the pointer (I had memory overflow issues). So if anyone encounters problems with my routine, just change the size for the result pointer explicitly.
The free() after the return never gets executed, but being a C programmer, I always have free() statements in my main function. So I would say it's habitual.
The px engine frees the memory by itself and hence you will never encounter any memory leaks.

Posted: Tue Jun 19, 2007 4:30 pm
by bmsq
Thanks guys for the helpful answers!

So the DS engine will free the returned char array; well, that makes life a little easier then. Is there any way I can get DS to output the generated C code for a given transformer? I'm generally a curious fellow when it comes to things like this and would love to see how the transformer is actually implemented.

Thanks again,
Barry

Posted: Tue Jun 19, 2007 7:23 pm
by ray.wurlod
Find out the job number.
SELECT NAME, JOBNO FROM DS_JOBS WHERE NAME = '<<Job Name>>';

Look in subdirectory RT_SCnnn (where nnn is the job number) in your project directory on the server for the generated code, generated osh, and scripts to run them.

Posted: Mon Jul 23, 2012 8:13 am
by PhilHibbs
If you check back at the original thread, I have posted my bug-fixed version of pxEreplace.

Posted: Thu Sep 27, 2012 4:02 am
by PhilHibbs
DSguru2B wrote:The px engine frees the memory by itself and hence you will never encounter any memory leaks.
I'm sorry but that is incorrect. DataStage will not free the memory allocated within a Parallel Routine.
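
If the routine's allocations are indeed never freed by the engine, one way to keep memory bounded is to reuse a single grow-only buffer instead of calling malloc per row. This is a minimal sketch under stated assumptions, not a confirmed DataStage pattern: scratch_reserve and px_concat are hypothetical names, the routine is assumed to be called from one thread per process, and the caller is assumed to copy the returned string before the next call overwrites it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared scratch buffer: grows when needed, never shrinks. */
static char  *g_buf = NULL;
static size_t g_cap = 0;

/* Ensure the scratch buffer holds at least `needed` bytes.
 * Returns the buffer, or NULL if it could not be grown. */
static char *scratch_reserve(size_t needed)
{
    if (needed > g_cap) {
        char *p = realloc(g_buf, needed);
        if (p == NULL)
            return NULL;    /* old buffer is still valid and unchanged */
        g_buf = p;
        g_cap = needed;
    }
    return g_buf;
}

/* Hypothetical routine: concatenate a and b into the scratch buffer.
 * The result is only valid until the next call (assumption: the
 * caller copies it before calling again). */
char *px_concat(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    char *out = scratch_reserve(la + lb + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, a, la);
    memcpy(out + la, b, lb + 1);    /* includes the terminating '\0' */
    return out;
}
```

With this shape the process holds at most one buffer, sized for the longest result seen so far, rather than one leaked allocation per row. The trade-off is that the returned pointer is overwritten by the next call, so it is unsuitable if the caller keeps the pointer across rows.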

Posted: Thu Sep 27, 2012 7:31 am
by chulett
Well... perhaps that was true five years ago. :wink: