External Routines

jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Hi all,

I've been trying to get some information on the mysteries of external functions and although I've seen quite a number of threads about this, I am still a little confused.

I've written the following routine.

Code: Select all

#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define ISALPHA(c) ((c>='a'&&c<='z')||(c>='A'&&c<='Z'))
#define ISDIGIT(c) (c>='0'&&c<='9')
#define ISSPACE(c) (c==' '||c==9||c==10||c==11||c==13)
#define ISPUNCTUATION(c) (!((c>='a'&&c<='z')||(c>='A'&&c<='Z'))&&!(c>='0'&&c<='9')&&!(c==' '||c==9||c==10||c==11||c==13))


char* titleCase(char* s) {
	const int size = 4000;

	// allocate memory
	char *buf = (char *)malloc(size);
	assert(buf != NULL); // check the pointer itself, not the byte it points to

	// set some variables
	char *ch = buf;
	int endOfWord = 1; // set when ch is not alpha
	int inWord = 0; // set when ch is not the first letter of a word

	// copy input string to buffer (bounded, so long input cannot overflow it)
	// and begin checking each character
	strncpy(buf, s, size - 1);
	buf[size - 1] = '\0';

	while (*ch) {

		if (endOfWord) {
			if (ISALPHA(*ch)) {
				// found an alpha after end of word.
				// must be a start of new word
				inWord = 1;
				endOfWord = 0;
				*ch = toupper(*ch);
			}
		} else {
			if (inWord) {
				if (ISALPHA(*ch)) {
					// if previous 2 characters were 'Mc' then
					// this should be uppercase
					// otherwise lowercase
					if ((ch - buf) >= 2) { // >= 2 so that "Mc" at the very start of the string is handled too
						if (*(ch-2)=='M'&&*(ch-1)=='c') {
							*ch = toupper(*ch);
						} else {
							*ch = tolower(*ch);
						}
					} else {
						*ch = tolower(*ch);
					}
				} else {
					// found a non-alpha
					// must be the end of a word
					inWord = 0;
					endOfWord = 1;
				}
			}
		}

		// advance to the next character
		ch++;
	}

	// NB no need to free because DataStage will do that for you (I hope)
	return buf;
}
This works fine in my test harness below:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
	char name[] = "mike o'brien, mcfee, mcdonald, this is a title, MR, MRS, MISS, mR, mRs, mIss";

	printf("%s\n",name);
	char *tCase = titleCase(name);
	printf("%s\n",tCase);
	free(tCase);

	return 0;
}
as these results show

Code: Select all

mike o'brien, mcfee, mcdonald, this is a title, MR, MRS, MISS, mR, mRs, mIss
Mike O'Brien, McFee, McDonald, This Is A Title, Mr, Mrs, Miss, Mr, Mrs, Miss
Press any key to continue . . .
The first question is this:
Although I've read on this forum that DataStage will take care of freeing memory, the generated code does not show any evidence of this (as shown below). Has anyone got any definitive statements from IBM about this?

Code: Select all

//
// Generated file to implement the V0S1_Try_ExternalRoutine_Transformer_1 transform operator.
//

// define external functions used
extern int32 hasDigit(string inString);
extern string titleCase(string inString);

// define our input/output link names
inputname 0 DSLink9;
outputname 0 hasDigit;
outputname 1 clean;

initialize {
	// define our row rejected variable
	int8 RowRejected0;

	// define our null set variable
	int8 NullSetVar0;

	// define and initialise each link row count variable required
	uint64 RowCount0_1;
	RowCount0_1 = 0;

	// Stage variable declaration and initialisation
	string StageVar0_svDigitFlag;
	StageVar0_svDigitFlag = "";
}

mainloop {
	// initialise our row rejected variable
	RowRejected0 = 1;

	// declare our intermediate variables for this section
	int64 InterVar0_0;
	string InterVar0_1;

	// evaluate the stage variables first
	StageVar0_svDigitFlag = hasDigit(DSLink9.nme_sur);

	// evaluate constraint and columns for link: hasDigit
	InterVar0_0 = StageVar0_svDigitFlag;
	if (InterVar0_0)
	{
		InterVar0_1 = titleCase(DSLink9.emp_tle);
		hasDigit.emp_tle = InterVar0_1;
		InterVar0_1 = titleCase(DSLink9.nme_sur);
		hasDigit.nme_sur = InterVar0_1;
		writerecord 0;
		RowRejected0 = 0;
	}
	// evaluate constraint and columns for link: clean
	InterVar0_0 = RowRejected0;
	if (InterVar0_0)
	{
		clean.emp_tle_1 = DSLink9.emp_tle;
		InterVar0_1 = titleCase(DSLink9.emp_tle);
		clean.emp_tle = InterVar0_1;
		clean.nme_sur_1 = DSLink9.nme_sur;
		InterVar0_1 = titleCase(DSLink9.nme_sur);
		clean.nme_sur = InterVar0_1;
		writerecord 1;
		RowRejected0 = 0;
		RowCount0_1 = RowCount0_1 + 1;
	}
}

finish {
	// Log warnings for any reject links
	string LogMsg0;
	string LogLink0;
	if (RowCount0_1 > 0) {
		LogMsg0 = RowCount0_1;
		LogLink0 = " rows written to reject link: ";
		LogMsg0 = LogMsg0 + LogLink0;
		LogLink0 = "clean";
		LogMsg0 = LogMsg0 + LogLink0;
		print_message(LogMsg0);
	}

}


The second question is this:
When I use the function in a DataStage Transform stage, I get no compile errors but the output is unchanged as shown below.

Code: Select all

Original	Modified
Ms	Ms
Ms	Ms
MRS	MRS
Mrs	Mrs
Mrs	Mrs
Ms	Ms
Any help would be greatly appreciated.
Jim Paradies
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Post how you compiled and how you set it up in DataStage. Your routine looks correct. I suspect you set it up wrong or compiled it incorrectly. You need to compile and not link. You should point DataStage to the .o file.

Do a search and maybe post the link you used to set this up.
Mamu Kim
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

Thanks for your reply Kim.

I don't know what I did the first time but I re-compiled and tried again and it now seems to be working.

Do you have any idea about the memory management issue?


Thanks,

Jim
Jim Paradies
sud
Premium Member
Posts: 366
Joined: Fri Dec 02, 2005 5:00 am
Location: Here I Am

Post by sud »

By the way, where did you learn that DataStage will free memory?
It took me fifteen years to discover I had no talent for ETL, but I couldn't give it up because by that time I was too famous.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

I'm not saying that DataStage will free memory - I'm asking if it will.

The question arises from a few posts on this forum. A good example is on the following thread.

viewtopic.php?t=111614&highlight=free+memory

Specifically, this comment by DSGuru2B:

"Yes. DataStage will.
You don't need it to be in writing to believe it. It makes sense.
You cannot free it before returning it, you cannot have a free statement after the return statement as it will never be executed. So the DSEngine takes care of it."

But there are a lot of other examples which allocate memory without freeing.

viewtopic.php?t=117585&highlight=free%28
viewtopic.php?t=110802&highlight=free%28
Jim Paradies
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

My understanding is that memory allocated for the return value will be freed automatically by DataStage after the routine returns. However, any other memory allocated within the routine (say, for local working buffers) does need to be freed explicitly.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jim.paradies
Premium Member
Posts: 25
Joined: Thu Jan 31, 2008 11:06 pm
Location: Australia

Post by jim.paradies »

I've finally received a response from IBM support and the bottom line is that DataStage will NOT release memory for you.

Here's the response:
DataStage will not free up the memory that you have allocated, you have a memory leak here, and because buf is an automatic variable you will not be able to access the memory again over routine calls, herein lies the solution to your problem. This is not a defect with DataStage, how could it know that you have malloc'd the memory?

If you allocate memory in an external routine, it is your responsibility to free it up, to see that you in fact have a leak here, use a row generator and put through a few million rows, monitor the size of the osh process, if it is monotonically increasing, there is a leak, and if enough rows go through will eventually lead to the failure of your job.

If the number of rows is small it will still affect your performance but you may be able to live with it, however I would suggest that you make buf static initialised to NULL, check its value on entrance to the routine, if it is not NULL, free it and re allocate it.

Other scenarios are that you could make it a static char array of long enough length and overwrite the same memory locations with every call.
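
For anyone who finds this thread later, here is a minimal sketch of the two approaches suggested in the response. It is not IBM's code and has not been run inside a DataStage job. The names titleCaseInPlace, titleCaseRealloc, titleCaseFixed and TITLE_BUF_SIZE are made up for illustration, and the 4000-byte limit is simply carried over from the original malloc. Both variants assume that the caller copies the returned string before the routine is called again, and that the routine is not called from several threads in one process.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define TITLE_BUF_SIZE 4000 /* assumed maximum, taken from the original malloc size */

/* Title-case buf in place, using the same "Mc" rule as the original routine. */
static void titleCaseInPlace(char *buf) {
	int startOfWord = 1;
	for (char *ch = buf; *ch; ch++) {
		unsigned char c = (unsigned char)*ch;
		if (!isalpha(c)) {
			startOfWord = 1; /* any non-letter ends the current word */
		} else if (startOfWord) {
			*ch = (char)toupper(c);
			startOfWord = 0;
		} else if (ch - buf >= 2 && ch[-2] == 'M' && ch[-1] == 'c') {
			*ch = (char)toupper(c); /* letter following "Mc" */
		} else {
			*ch = (char)tolower(c);
		}
	}
}

/* Option 1 (hypothetical name): static pointer, freed and reallocated on each call. */
char *titleCaseRealloc(char *s) {
	static char *buf = NULL; /* keeps its value between calls */
	assert(s != NULL);
	free(buf); /* releases the previous result; free(NULL) is a no-op */
	buf = (char *)malloc(strlen(s) + 1);
	if (buf == NULL) return NULL;
	strcpy(buf, s);
	titleCaseInPlace(buf);
	return buf;
}

/* Option 2 (hypothetical name): static array, overwritten on each call. */
char *titleCaseFixed(char *s) {
	static char buf[TITLE_BUF_SIZE];
	size_t n;
	assert(s != NULL);
	n = strlen(s);
	if (n >= TITLE_BUF_SIZE) n = TITLE_BUF_SIZE - 1; /* truncate oversized input */
	memcpy(buf, s, n);
	buf[n] = '\0';
	titleCaseInPlace(buf);
	return buf;
}
```

With either variant at most one buffer exists at a time, so the process size should stay flat under the row generator test described above. Option 2 avoids malloc entirely, at the cost of silently truncating any input longer than the fixed size.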
Jim Paradies