C++ routine Vs Aggregator stage

mouni · Post by **mouni** » Mon Dec 04, 2006 4:18 am

Hi,
I have a requirement to count the rows in an input file and write the final count to the output. The solution I came up with is a C++ routine that counts the number of lines and returns the total; it is called from DataStage. It works fine and is quite fast.

My customer is proposing to use an Aggregator stage for this, which I feel is a bit heavy just to count input rows, especially if the input file is very large.

I would like to know the pros and cons of a C++ routine vs the Aggregator stage. Please share your views.

balajisr · Post by **balajisr** » Mon Dec 04, 2006 4:22 am

How about using UNIX wc -l to count the number of rows in the input file? I guess this will be faster than both the routine and the Aggregator stage.

mouni · Post by **mouni** » Mon Dec 04, 2006 5:50 am

Our coding is on an AIX server, but the code will later be ported to Windows. So we want to make sure that we do not use anything Unix-specific that may not work on Windows.

balajisr · Post by **balajisr** » Mon Dec 04, 2006 6:11 am

In that case you will be using the 7.5x2 version of DataStage.
Below is one of the many alternatives to using a routine or the Aggregator stage.

In the Sequential File stage, generate the "Row Number Column".
Add a Tail stage after the Sequential File stage and run it sequentially.
Read one record (the last one) using the Tail stage.
Read the value of the "Row Number Column" from that record to get the number of rows in the file.

ray.wurlod · Post by **ray.wurlod** » Mon Dec 04, 2006 6:25 am

You might be surprised how slick the Aggregator is for counting. You don't use the Column for Calculation method, you simply use Count. It's a very fast transit through the code.

Telenet · Post by **Telenet** » Mon Dec 04, 2006 8:11 am

I think this depends on what you want to do with it.
If you have a complex process and you want to know whether the number of records at the end matches the number at the beginning, I would not use the Aggregator. The moment you read a file in DataStage, it might reject records that don't have the correct format; in this case I would go for wc -l before starting.

chandra · Post by **chandra** » Mon Dec 04, 2006 12:51 pm

Is there any group-by key in your count?

jgreve · Post by **jgreve** » Mon Dec 04, 2006 5:24 pm

mouni wrote: Our coding is on an AIX server, but the code will later be ported to Windows. So we want to make sure that we do not use anything Unix-specific that may not work on Windows.

If you have to deploy on Windows, your server's DataStage install is going to include the MKS Toolkit (the parallel stuff requires it). I bet that common Unix utilities, like "wc", will run just fine in either environment.

In that scenario, I would worry more about hardcoding path references into your jobs or helper scripts than

Speaking of deploying on Windows: do you have other custom C++ code in your system? If not, you might be able to avoid having to get a Windows C++ compiler (of course, if you already have one, then it doesn't matter).

John G.

mouni · Post by **mouni** » Thu Dec 07, 2006 2:14 am

Thanks guys for the help.

Telenet - Our customer is hesitant to use wc -l on Windows even though it works fine with the MKS Toolkit installed. So we were looking for an alternative method. They also wanted us to do this using vanilla flavors of DataStage.

jgreve - We have a C++ compiler installed which is compatible with DataStage, and we have several complex routines coded in C++ that are used by DataStage.

We have now settled on using the Tail stage. Every record coming into the Tail stage will carry @INROWCOUNT and @OUTROWCOUNT, so the Tail stage gives us the final count of the input and output records. This seems to be working fine with a single partition, which solves our problem.

kumar_s · Post by **kumar_s** » Thu Dec 07, 2006 2:21 am

...working fine with a single partition which solves our problem...

Which doesn't utilize the PX functionality.
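To illustrate the partitioning point in general terms: a row count does not have to be collected on one partition, because each partition can count its own rows and the partial counts can then be summed. The sketch below only models that idea in plain C++; `Partition` and `total_rows` are hypothetical names, and nothing here is DataStage API.

```cpp
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical model: a partition is just the list of records it holds.
using Partition = std::vector<std::string>;

// Sums the per-partition row counts into one total.
// Each partition's count is independent, so in a parallel engine
// the partial counts could be computed concurrently.
std::int64_t total_rows(const std::vector<Partition>& partitions) {
    return std::accumulate(
        partitions.begin(), partitions.end(), std::int64_t{0},
        [](std::int64_t sum, const Partition& p) {
            return sum + static_cast<std::int64_t>(p.size());
        });
}
```

The trade-off is one extra summing step at the end in exchange for not forcing every record through a single partition.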

DSXchange
