Multiple Column Count
Hi,
I have 5 columns (say a, b, c, d, e) and I want the count of each column, ignoring null values.
My input file will be terabytes in size. Could someone suggest the best way to design this in the parallel version of DataStage?
Thanks,
Kumar
Do you want to count the number of columns that do not contain nulls, or do something with the counts from inside all the non-null columns? It seems to me you'll still have to check each column individually for null, so I'm thinking stage variables in a Transformer would be appropriate.
-craig
"You can never have too many knives" -- Logan Nine Fingers
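To illustrate the per-column null check suggested above, here is a minimal sketch in Python rather than DataStage. It is not Transformer derivation syntax; the running counters only play the role that stage variables would. The column names come from the question, and the sample rows and the function name `count_non_null` are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Sequence

# Column names taken from the original question.
COLUMNS = ("a", "b", "c", "d", "e")


def count_non_null(rows: Iterable[Sequence[Optional[str]]]) -> Dict[str, int]:
    """Single pass over the rows, keeping one running counter per column."""
    counts = [0] * len(COLUMNS)
    for row in rows:
        for i, value in enumerate(row):
            # Each column is checked individually; None stands in for NULL.
            if value is not None:
                counts[i] += 1
    return dict(zip(COLUMNS, counts))


# Hypothetical sample data, for illustration only.
rows = [
    ("1", None, "x", None, "5"),
    (None, None, "y", "4", "6"),
    ("3", "2", None, None, "7"),
]
print(count_non_null(rows))
```

In a real job the counters would have to be emitted once at end of data, and each partition would hold its own set of counters.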
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Use the Aggregator stage with the "Non-missing values count" output column calculation method. Split your data into five streams, one for each column to be counted, and eliminate all unneeded data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
If there are no grouping columns, you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).
However, if you have lots of data and wish to take advantage of parallelism, you can perform the initial counts in parallel and then add another stage running sequentially to sum the counts across the partitions.
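The two-phase idea can be sketched outside DataStage as follows. This is only a rough Python illustration of the shape of the job, not DataStage code: `partial_counts` stands in for the stage that runs on every partition, and the final loop in `total_counts` stands in for the sequential stage. Both function names, the sample partitions, and the use of a thread pool are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Column names taken from the original question.
COLUMNS = ("a", "b", "c", "d", "e")


def partial_counts(partition):
    """Phase 1: non-null counts per column within a single partition."""
    counts = Counter()
    for row in partition:
        for name, value in zip(COLUMNS, row):
            if value is not None:  # None stands in for NULL
                counts[name] += 1
    return counts


def total_counts(partitions):
    """Phase 2: sum the per-partition counts in one sequential step."""
    # The thread pool only mimics "one worker per partition"; it is not
    # meant as a statement about how the parallel engine schedules work.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_counts, partitions))
    total = Counter({name: 0 for name in COLUMNS})
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical sample data split into two partitions.
partitions = [
    [("1", None, "x", None, "5"), (None, None, "y", "4", "6")],
    [("3", "2", None, None, "7")],
]
print(total_counts(partitions))
```

The sequential step only ever sees one small row of counts per partition, so it should stay cheap however large the input is.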
thompsonp wrote: If there are no grouping columns you will at some point need to perform a count sequentially (otherwise you'll get one answer for each partition).
Good point. That's true... unless (as you noted) there are other columns not mentioned that would allow, say, a hash partition, so that this does not become an issue.
-craig
"You can never have too many knives" -- Logan Nine Fingers