C++ routine Vs Aggregator stage

mouni
Charter Member
Posts: 49
Joined: Tue Jul 11, 2006 11:30 pm

C++ routine Vs Aggregator stage

Post by mouni »

Hi,
I have a requirement to count the rows in an input file and write the final count to the output. My solution is a C++ routine that counts the number of lines and returns the total; it is called from DataStage. This works fine and is quite fast.
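
For readers following along, a minimal sketch of what such a line-counting routine might look like. This is not the poster's actual code: the function names `count_lines` and `count_lines_in_file` are hypothetical, and how the function gets registered and called as a DataStage parallel routine is not shown. It uses only the C++ standard library, so it should behave the same on AIX and Windows.

```cpp
#include <cstdint>
#include <fstream>
#include <istream>
#include <string>

// Hypothetical helper: counts newline-delimited records in a stream.
// A last line without a trailing newline is still counted as a record
// (unlike `wc -l`, which counts newline characters only).
std::int64_t count_lines(std::istream& in) {
    std::int64_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++count;
    }
    return count;
}

// Hypothetical entry point: returns the record count of the file at `path`,
// or -1 if the file cannot be opened.
std::int64_t count_lines_in_file(const std::string& path) {
    // Binary mode avoids platform-specific newline translation; a stray
    // '\r' stays inside the line and does not change the count.
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;
    }
    return count_lines(in);
}
```

Note that a routine like this counts lines in the raw file, which is not necessarily the number of rows a stage accepts if some records are rejected on read.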

My customer proposes using an Aggregator stage instead, which I feel is a bit heavy just to count the input rows (especially if the input file is huge).

I would like to know the pros and cons of a C++ routine vs. the Aggregator stage. Please give your views.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

How about using UNIX wc -l to count the number of rows in the input file? I guess this will be faster than both the routine and the Aggregator stage.
mouni
Charter Member
Posts: 49
Joined: Tue Jul 11, 2006 11:30 pm

Post by mouni »

We are coding on an AIX server, but the code will later be ported to Windows, so we want to make sure we do not use anything Unix-specific that may not work on Windows.
balajisr
Charter Member
Posts: 785
Joined: Thu Jul 28, 2005 8:58 am

Post by balajisr »

In that case you will be using the 7.5x2 version of DataStage.
Below is one of many alternatives to the routine and the Aggregator stage.

In the Sequential File stage, generate a "Row Number Column".
Add a Tail stage after the Sequential File stage and run it sequentially.
Use the Tail stage to read a single record (the last one).
Read the value of the "Row Number Column" from that record to get the number of rows in the file.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You might be surprised how slick the Aggregator is for counting. You don't use the Column for Calculation method, you simply use Count. It's a very fast transit through the code.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Telenet
Premium Member
Posts: 14
Joined: Fri Nov 24, 2006 6:18 am
Location: Telenet

Post by Telenet »

I think this depends on what you want to do with it.
If you have a complex process and want to check that the number of records at the end matches the number at the beginning, I would not use the Aggregator. The moment you read a file in DataStage, it might reject records that don't have the correct format; in that case I would go for wc -l before starting.
chandra
Participant
Posts: 88
Joined: Sun Apr 02, 2006 6:50 pm
Location: India

Post by chandra »

Is there any group-by key in your count?
chandra ,
Hyd
jgreve
Premium Member
Posts: 107
Joined: Mon Sep 25, 2006 4:25 pm

unix code on windows

Post by jgreve »

mouni wrote:Our coding is on AIX server, but the code will be later ported to Windows. So, We want to make sure that we do not use anything Unix specific, that may not work on Windows.
If you have to deploy on Windows, your server's DataStage install is going to include the MKS Toolkit (the parallel stuff requires it). I bet that common Unix utilities like "wc" will run just fine in either environment.

In that scenario, I would worry more about hardcoding path references into your jobs or helper scripts than

Speaking of deploying on Windows: do you have other C++ custom code in your system? If not, you might be able to escape from getting a Windows C++ compiler (of course, if you already have one, then it doesn't matter).

John G.
mouni
Charter Member
Posts: 49
Joined: Tue Jul 11, 2006 11:30 pm

Post by mouni »

Thanks guys for the help.

Telenet - Our customer is hesitant to use wc -l on Windows even though it works fine with the MKS Toolkit installed, so we were looking for an alternative method. They also wanted us to do this using vanilla DataStage.

jgreve - We have a C++ compiler installed that is compatible with DataStage, and we have several complex routines coded in C++ that DataStage uses.

We have now settled on the Tail stage. Every record coming into the Tail stage will carry @INROWCOUNT and @OUTROWCOUNT, so the Tail gives us the final count of the input and output records. This seems to work fine with a single partition, which solves our problem.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

...working fine with a single partition which solves our problem...
Which doesn't utilize the PX functionality.
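
The general idea behind counting without giving up parallelism can be sketched in plain C++ (not DataStage stages): count each partition independently, then add the partial counts in one final step. The `Partition` type and the `total_row_count` name are hypothetical and only illustrate the arithmetic, not how any particular stage is implemented.

```cpp
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for the rows held by one partition.
using Partition = std::vector<std::string>;

// Counts rows per partition first, then sums the partial counts.
// The per-partition step has no shared state, so it could run in parallel;
// only the final sum needs to see all partitions.
std::int64_t total_row_count(const std::vector<Partition>& partitions) {
    std::vector<std::int64_t> partial;
    partial.reserve(partitions.size());
    for (const Partition& p : partitions) {
        partial.push_back(static_cast<std::int64_t>(p.size()));
    }
    return std::accumulate(partial.begin(), partial.end(), std::int64_t{0});
}
```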
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'