Parallel Routine to strip Non Ascii chars from a string.

alow
Premium Member
Posts: 17
Joined: Mon May 03, 2004 5:53 pm
Location: Geelong, Vic

Parallel Routine to strip Non Ascii chars from a string.

Post by alow »

Hi All,

As a result of a previous post (viewtopic.php?t=132126), I have received a few private messages asking about a parallel routine that I wrote to remove non-ASCII characters from an input string.

For anyone who does not know how to create parallel routines, I found the following website useful for getting started (http://it.toolbox.com/blogs/dw-soa/data ... easy-20926).

My C code to remove non-ASCII characters from a string is below:

Code: Select all

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

char * PRemoveNonAscii(char* varInStr)
{
        int varStrLen;                                  // Int to hold the char size of the incoming string. //
        int varCount=0;                                 // Int to be used as a counter. //

        varStrLen=strlen(varInStr);                     // Calculate the length of the input string. //
        char* varOutStr=(char*) malloc (varStrLen+1);   // Create a pointer to a char array & allocate memory for the output string. //

        while (*varInStr)                               // Loop through each char of the passed string until the null terminator is reached. //
        {
                if (isascii(*varInStr))                 // Check to see if char is a valid ascii char. //
                {
                        varOutStr[varCount]=*varInStr;  // Char is a valid ascii char so write to output array. //
                        varCount++;                     // Increment array counter. //
                }
                varInStr++;                             // Move to next char in input string. //
        }

        varOutStr[varCount]='\0';                       // Add a null terminator to the end of the output string. //
        return varOutStr;                               // Return result. //
        free(varInStr);                                 // Free the memory allocated to the input string. //
        free(varOutStr);                                // Free the memory allocated to the output string. //
}

PLEASE NOTE: I am not sure how DataStage handles memory allocation (specifically when it releases the memory allocated), which is why I have put two free commands at the end of my function (one for the input string and one for the output). I haven't had any issues using this function in our environment (DataStage 7.5.3), but we are yet to move it to production (as I thought it wise to have our IT department review this code before we migrate). I don't consider myself to be a professional C or C++ programmer.... so.... USE THIS CODE AT YOUR OWN RISK!



If anyone has any feedback on the above code, or an alternative way of achieving the same result, I would be interested in hearing your thoughts.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

I don't use C/C++ very often but one thing I am fairly sure of is that the free statements won't get executed due to the return being called prior to them (can't say I have ever thought about putting statements after the return so could be wrong)

As an alternative, declare an array the size (+1) of the input string; then at least it doesn't cause a memory leak (unless the free statements after the return really do work...)

Noting this as an issue really depends on how much data you send through the function and how much memory your server has to deal with it.
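
A minimal sketch of the point about statements placed after a return; it is not DataStage-specific. The names duplicateLeaky, countedFree and freeCalls are hypothetical and exist only to make the effect observable: the call written after the return is never reached, so the buffer stays allocated until the caller releases it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

static int freeCalls = 0;                 // hypothetical counter, for illustration only

static void countedFree(void* p)          // hypothetical wrapper around free()
{
    ++freeCalls;
    std::free(p);
}

char* duplicateLeaky(const char* in)      // hypothetical stand-in for the routine
{
    assert(in != NULL);
    char* out = static_cast<char*>(std::malloc(std::strlen(in) + 1));
    if (out == NULL) return NULL;         // allocation failed
    std::strcpy(out, in);
    return out;                           // control leaves the function here
    countedFree(out);                     // unreachable: never executed
}
```

If nothing frees the returned pointer, each call leaves strlen(in)+1 bytes allocated, which is why the row count matters.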
alow
Premium Member
Posts: 17
Joined: Mon May 03, 2004 5:53 pm
Location: Geelong, Vic

Post by alow »

Thanks for the feedback Kryt0n.
Kryt0n wrote:free statements won't get executed due to the return being called prior to them
You are probably right... I put the free statements in as an afterthought.... I would assume that DataStage frees the memory it's using at the completion of a process, but I am not sure?
Kryt0n wrote:declare an array as the size (+1) of the input string
I believe that's what I am doing here;

Code: Select all

char* varOutStr=(char*) malloc (varStrLen+1);
Or are you referring to something else?
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Alternative way, read space as ascii.

Code: Select all

string formnewString(string inStr, int len) 
{
        int i;
        for (i=0;i<len;i++)
        {

              if (isspace(inStr[i]))   

                { 
                    inStr[i]='+';
                } 
        } 
        return inStr;        
}


char* PRemoveNonAscii(char* varInStr)
{   
    int len;
    string t = string(varInStr);
    string getStr;
    len = t.length();
    getStr=formnewString(t,len);     
    char *OutStr = &getStr[0]; 
    return OutStr;
}
Joshy George
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

alow wrote:I would assume that DataStage frees the memory it's using at the completion of a process, but I am not sure?
Unfortunately not, it doesn't seem to care what you have done in relation to memory during the processing. We discovered this when trying to push 100m+ records through a routine using malloc.
alow wrote:I believe that's what I am doing here;

Code: Select all

char* varOutStr=(char*) malloc (varStrLen+1);
Not quite the same. When doing a malloc, you are allocating space on the heap. Memory allocated on the heap isn't reclaimed until either free is called or the process terminates. (Remember that your routine is purely a linked-in function.)
By doing

Code: Select all

char varOutStr[varStrLen+1]
you are allocating memory on the stack. Stack memory is reclaimed once the function terminates. (Note: I am not sure you can use a variable in the array length declaration... something is telling me C doesn't like that, in which case you will need to give a maximum expected length instead.)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Is there any scope to use a static variable instead?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Indeed yes, just as long as the user remembers to terminate it prior to each return... (which OP was doing)
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Workaround for memory leak issue in parallel routine

Post by JoshGeorge »

Code I posted does not have any memory leak issues.
Instead of

Code: Select all

char *OutStr = &getStr[0]; 
Use

Code: Select all

char *OutStr =(char *) getStr.c_str();
which is a better, more standard way of getting a null-terminated string from a string object.
Joshy George
alow
Premium Member
Posts: 17
Joined: Mon May 03, 2004 5:53 pm
Location: Geelong, Vic

Post by alow »

Thanks for the code, JoshGeorge. I haven't had a chance to test it yet, but I plan to give it a go in the coming days.


Kryt0n, thanks for the feedback.

Since your last post I have removed the two free() statements and I have changed

Code: Select all

char* varOutStr=(char*) malloc (varStrLen+1);
to

Code: Select all

char varOutStr[varStrLen+1];
So my full function is now;

Code: Select all

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

char * PRemoveNonAscii(char* varInStr)
{
        int varStrLen;                                  // Int to hold the char size of the incoming string. //
        int varCount=0;                                 // Int to be used as a counter. //

        varStrLen=strlen(varInStr);                     // Calculate the length of the input string. //
        char varOutStr[varStrLen+1];

        while (*varInStr)                               // Loop through each char of the passed string until the null terminator is reached. //
        {
                if (isascii(*varInStr))                 // Check to see if char is a valid ascii char. //
                {
                        varOutStr[varCount]=*varInStr;  // Char is a valid ascii char so write to output array. //
                        varCount++;                     // Increment array counter. //
                }
                varInStr++;                             // Move to next char in input string. //
        }

        varOutStr[varCount]='\0';                       // Add a null terminator to the end of the output string. //
        return varOutStr;                               // Return result. //
}
When I compile the above function, I receive the following warning;

Code: Select all

g++ -O -fPIC -Wno-deprecated -c usrRemoveNonAscii.cpp

usrRemoveNonAscii.cpp: In function `char* PRemoveNonAscii(char*)':
usrRemoveNonAscii.cpp:12: warning: address of local variable `varOutStr' returned
After reading up on stack vs heap memory allocation, the above warning makes sense to me. I am declaring a local array variable (varOutStr) in my function (PRemoveNonAscii), which uses the stack. So my understanding is that, in effect, I am returning the address of a locally declared variable which will be freed from memory when the function terminates.

So what I don't understand is how we can use the stack if it will be freed when the function terminates. Won't that mean that DataStage could go to access a memory address that has since been freed? Or am I missing something?
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

Indeed. To get around this you can declare the variable to be static, which places it in static storage rather than on the stack, so it outlives the function call. However, DataStage does still seem to handle the local variable correctly... although that may be down to pure luck...
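
A minimal sketch of the static-buffer idea, assuming a fixed upper bound on the input length (a static array cannot take its size from a runtime variable). The name PRemoveNonAsciiStatic and the bound kMaxLen are hypothetical; the ASCII test is written as a plain byte comparison rather than isascii().

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical upper bound on the column length; adjust to the real schema.
static const std::size_t kMaxLen = 4000;

char* PRemoveNonAsciiStatic(const char* in)   // hypothetical name
{
    assert(in != NULL);
    static char out[kMaxLen + 1];             // static storage: still valid after return
    std::size_t n = 0;

    // Copy only 7-bit bytes; input beyond kMaxLen bytes of output is dropped.
    for (; *in != '\0' && n < kMaxLen; ++in) {
        if (static_cast<unsigned char>(*in) < 0x80) {
            out[n++] = *in;
        }
    }
    out[n] = '\0';                            // terminate before every return
    return out;
}
```

The buffer is shared by all calls in the process, so each call overwrites the previous result and the function is not safe to call from several threads at once.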
alow
Premium Member
Posts: 17
Joined: Mon May 03, 2004 5:53 pm
Location: Geelong, Vic

Post by alow »

Thanks Kryt0n.... unfortunately simply using the local variable technique didn't work with our installation of DataStage (7.5.3).

So.... I have since added a static string variable to my code to achieve the desired result;

Code: Select all

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <string>
using namespace std;

char * PRemoveNonAscii(char* InStr)
{
        int varStrLen;                                          // Int to hold the char size of the incoming string. //
        int varCount=0;                                         // Int to be used as a counter. //
        static string varOutStr;                                // Static string to be used to hold the processed output string. //

        varStrLen=strlen(InStr);                                // Calculate the length of the input string. //
        char varTempStr[varStrLen+1];                           // Char array (on the stack) used to hold the new output string during processing. //

        while (*InStr)                                          // Loop through each char of the passed string until the null terminator is reached. //
        {
                if (isascii(*InStr))                            // Check to see if char is a valid ascii char. //
                {
                        varTempStr[varCount]=*InStr;            // Char is a valid ascii char so write to output array. //
                        varCount++;                             // Increment array counter. //
                }
                InStr++;                                        // Move to next char in input string. //
        }

        varTempStr[varCount]='\0';                              // Add a null terminator to the end of the new string. //

        varOutStr=varTempStr;                                   // Assign the output value to the static string variable. //
        char* OutStr =(char *) varOutStr.c_str();               // Create a pointer to the static string to return as function output. //

        return OutStr;                                          // Return result. //
}
The above code produces the desired result. Kryt0n, after reading up on the static keyword, it appears that my string var varOutStr should be freed from memory when the "program terminates". So what I am hoping this means is that the memory allocated to varOutStr will be released when my job completes its run?

I have managed to push a record set of 135 million records (3 varchar 240 columns) through a parallel job using the above code with no obvious issues, so I'm hoping my code no longer has any potential for a memory leak?

JoshGeorge, I managed to get your posted code working easily enough, but it doesn't produce the desired result. When I run a test string through your function it converts spaces to "+" and the non-ASCII chars are not removed. I'm not sure I understand how you are trying to use the isspace function to test for and remove non-ASCII chars?

Again, thanks for the help all. :D
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Code: Select all

string formnewString(string inStr, int len) 
{ 
        int i; 
        for (i=0;i<len;i++) 
        { 

              if (!isascii(inStr[i]))    

                { 
                    inStr[i]=' '; 
                } 
        } 
        return inStr;        
}


char* PRemoveNonAscii(char* varInStr) 
{    
    int len; 
    string t = string(varInStr); 
    string getStr; 
    len = t.length(); 
    getStr=formnewString(t,len);      
    char *OutStr =(char *) getStr.c_str();
    return OutStr; 
}
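
A variation worth considering, as a sketch only: a pointer obtained from c_str() on a local std::string stops being valid once the function returns, so the version below keeps the result in a static std::string, and it erases non-ASCII bytes instead of replacing them with spaces. The names PRemoveNonAsciiStr and isNonAscii are hypothetical, and returning const char* rather than char* is an assumption about what the calling side accepts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Predicate: true for bytes outside the 7-bit ASCII range.
static bool isNonAscii(char c)
{
    return static_cast<unsigned char>(c) > 0x7F;
}

const char* PRemoveNonAsciiStr(const char* in)   // hypothetical name
{
    assert(in != NULL);
    static std::string out;      // outlives the call; overwritten by the next call
    out.assign(in);
    // Erase-remove idiom: shift the kept bytes forward, then trim the tail.
    out.erase(std::remove_if(out.begin(), out.end(), isNonAscii), out.end());
    return out.c_str();
}
```

The returned pointer is only good until the next call to the function, so the caller has to consume or copy the value before calling again.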
Joshy George
Kryt0n
Participant
Posts: 584
Joined: Wed Jun 22, 2005 7:28 pm

Post by Kryt0n »

alow wrote:The above code produces the desired result. Kryt0n, after reading up on the static keyword, it appears that my string var varOutStr should be freed from memory when the "program terminates". So what I am hoping this means is that the memory allocated to varOutStr will be released when my job completes its run?
Yep, that would be my understanding too...
alow wrote: I have managed to push a record set of 135 million records (3 varchar 240 columns) through a parallel job using the above code with no obvious issues, so I'm hoping my code no longer has any potential for a memory leak?
Monitor "top" and see if your processes are taking up as much memory as the sum of the output data for all 135m rows. If close, you have a memory leak, if only a small percentage you should be fine.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Maybe the following can help. I cannot compile and test it right now, so I have written a simple alpha extract.

You can replace the condition with the corresponding type check, e.g. isascii(), isdigit() etc.

Code: Select all

#include <stdio.h>

// Keeps only the letters A-Z and a-z, compacting the string in place.
char *extractAlpha(char *dataStr)
{
   int actualPtr, asciiPtr = 0;      // read index and write index
   for (actualPtr = 0; dataStr[actualPtr]; actualPtr++)
       if ((dataStr[actualPtr] >= 'A' && dataStr[actualPtr] <= 'Z') ||
           (dataStr[actualPtr] >= 'a' && dataStr[actualPtr] <= 'z'))
              dataStr[asciiPtr++] = dataStr[actualPtr];   // keep this character
   dataStr[asciiPtr] = '\0';         // terminate the shortened string
   return dataStr;
}
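
For the original non-ASCII requirement, the same in-place approach might look like the sketch below. The name stripNonAsciiInPlace is hypothetical, and it assumes the caller is allowed to modify the input buffer, which has not been confirmed for DataStage routine arguments.

```cpp
#include <cassert>
#include <cstddef>

char* stripNonAsciiInPlace(char* data)   // hypothetical name
{
    assert(data != NULL);
    std::size_t write = 0;               // next position to keep a byte at

    for (std::size_t read = 0; data[read] != '\0'; ++read) {
        // Keep only 7-bit bytes; the cast avoids surprises with signed char.
        if (static_cast<unsigned char>(data[read]) < 0x80) {
            data[write++] = data[read];
        }
    }
    data[write] = '\0';                  // terminate the shortened string
    return data;                         // same buffer that was passed in
}
```

Because the output is never longer than the input, no extra memory is allocated and nothing needs to be freed.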