Multiple Column Count

skp · Post by **skp** » Thu Mar 06, 2014 9:12 am

Hi,

I have 5 columns(say a,b,c,d,e) and i want the counts of each column ignoring null values.

My input file will be in tera bytes. Could some one suggest the best method to design in datastage in parallel version?

Thanks,
Kumar

chulett · Post by **chulett** » Thu Mar 06, 2014 10:19 am

You want to count the number of columns that do not contain nulls or do something with the counts from inside all non-null columns? Seems to me you'll still have to check each one individually for null so I'm thinking just stage variables in a transformer will be appropriate.

ray.wurlod · Post by **ray.wurlod** » Thu Mar 06, 2014 8:05 pm

Aggregator stage, Non-missing values count output column calculation method. Split your data into five streams, one for each column to be counted. Eliminate all unneeded data.

skp · Post by **skp** » Fri Mar 07, 2014 7:17 am

Thanks for the reply.

I think in both the scenarios above, the data stage job should run in sequence mode(Not in parallel). Is my understadning correct?

chulett · Post by **chulett** » Fri Mar 07, 2014 8:26 am

Why do you think that? Are there other requirements for grouping that you haven't shared?

thompsonp · Post by **thompsonp** » Fri Mar 07, 2014 9:36 am

If there are no grouping columns you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).
However if you have lots of data and wish to take adavantage of parallelism you can perform the initial counts in parallel and then add another stage running sequentially to sum up the counts across the partitions.

chulett · Post by **chulett** » Fri Mar 07, 2014 10:42 am

thompsonp wrote:If there are no grouping columns you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).

Good point. That's true... unless (as you noted) there are other columns not mentioned that will allow say a hash partition so that does not become an issue.

ray.wurlod · Post by **ray.wurlod** » Fri Mar 07, 2014 2:45 pm

Generate a constant and group by that.

skp · Post by **skp** » Mon Mar 10, 2014 3:52 am

Seems i got very good replies here. Yes i will process in parallel mode and at the end i will group based on constact value.

Thanks guys.