Multiple Column Count

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
skp
Premium Member
Premium Member
Posts: 135
Joined: Wed Dec 26, 2007 1:56 am
Location: India

Multiple Column Count

Post by skp »

Hi,

I have 5 columns(say a,b,c,d,e) and i want the counts of each column ignoring null values.

My input file will be in tera bytes. Could some one suggest the best method to design in datastage in parallel version?

Thanks,
Kumar
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You want to count the number of columns that do not contain nulls or do something with the counts from inside all non-null columns? Seems to me you'll still have to check each one individually for null so I'm thinking just stage variables in a transformer will be appropriate.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Aggregator stage, Non-missing values count output column calculation method. Split your data into five streams, one for each column to be counted. Eliminate all unneeded data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
skp
Premium Member
Premium Member
Posts: 135
Joined: Wed Dec 26, 2007 1:56 am
Location: India

Post by skp »

Thanks for the reply.

I think in both the scenarios above, the data stage job should run in sequence mode(Not in parallel). Is my understadning correct?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Why do you think that? Are there other requirements for grouping that you haven't shared?
-craig

"You can never have too many knives" -- Logan Nine Fingers
thompsonp
Premium Member
Premium Member
Posts: 205
Joined: Tue Mar 01, 2005 8:41 am

Post by thompsonp »

If there are no grouping columns you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).
However if you have lots of data and wish to take adavantage of parallelism you can perform the initial counts in parallel and then add another stage running sequentially to sum up the counts across the partitions.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

thompsonp wrote:If there are no grouping columns you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).
Good point. That's true... unless (as you noted) there are other columns not mentioned that will allow say a hash partition so that does not become an issue.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Generate a constant and group by that.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
skp
Premium Member
Premium Member
Posts: 135
Joined: Wed Dec 26, 2007 1:56 am
Location: India

Post by skp »

Seems i got very good replies here. Yes i will process in parallel mode and at the end i will group based on constact value.

Thanks guys.
Post Reply