Page 1 of 2

Count(*)

Posted: Tue Apr 05, 2011 3:38 pm
by Nagaraj
I have a requirement where i am creating counts in a delimited file, something like below

Source

col1
Yes
Yes
No
No


Target(File)(based on three condiitions YES/NO/NA)

2|2|0

the reason is even if there is no value for any of the 3 conditions i need zero.

i am using TFM splitiing by using constraints and then using Agg stage for each link and then i am using link collector and seperating out in a file.
(using stage variables to generate the format based on some dummy key)

all i am getting is 2|2

how do i get zero in place of nothing?

Posted: Tue Apr 05, 2011 3:52 pm
by ray.wurlod
I could give a parallel job answer, since you posted it in the parallel jobs forum. However your mention of a Link Collector suggests server job, and this is how you marked the job type.

Seems to me that this could be solved with three links each with its own counting Aggregator stage. One link handles Y, one link handles N, and the third link handles "neither Y nor N". Carry the same constant on each link, and join the three links together using a Transformer stage joining on the constant.

Posted: Tue Apr 05, 2011 4:14 pm
by chulett
... and here we are in Server world now.

Posted: Tue Apr 05, 2011 6:00 pm
by Nagaraj
oh right...i just didnt notice :) :) :)

Posted: Wed Apr 06, 2011 9:56 am
by Nagaraj
tried the same thing.

"join the three links together using a Transformer stage joining on the constant"

but looks like there is timing issue.

the three links which are doing counts takes time but before that my trasnformer main stream link then looks up hashfile created or still getting created will not find any records in hash files though there are records which appears late untill then the output will just show 0 for all the counts.

i hope i am clear in explaining the scenario here

Posted: Wed Apr 06, 2011 9:58 am
by chulett
Nope, not clear at all.

Can you draw a 'picture' of the job?

Posted: Wed Apr 06, 2011 10:31 am
by Nagaraj
not sure how to draw a picture...checked FAQ section didnt find any...!

Code: Select all

               AGG----->TFM4---->Hash1    
                ^           
                i           
                i           
                i           
                i           
seq--->TFM1-->TFM2--------->TFM3------->Hash2
	  				            ^ 
                             | 
                             | 
                             |
                             |
                            Hash1

Posted: Wed Apr 06, 2011 10:33 am
by Nagaraj
the upper link will be coming from 2nd TFM, and the lower reference link is towards 3rd TFM.

:)

Moderator: like that. The need to use code tags to preserve whitespace has been repeated here a ton of times. I also numbered your transformers for easier reference.

Posted: Wed Apr 06, 2011 10:39 am
by Nagaraj
i can split the job into two, which will definately work, but then i want to use this functionality where i can write and read from the same hash file.

Posted: Wed Apr 06, 2011 11:18 am
by arunkumarmm
If my understanding is correct the below approach will work.

SeqFile ----> Transformer ----> HashedFile ---->SeqFile


In the transformer, create 3 stage variables:

sv1 - If InputLink.Col1 = 'YES' Then sv1+1 Else sv1
sv2 - If InputLink.Col1 = 'NO' Then sv2+1 Else sv2
sv3 - If InputLink.Col1 = 'NA' Then sv3+1 Else sv3

And give default value for the stage variables as zero. Make a dummy key column in the hashed file and write all these stage variables to the hashed file(So as the Hashed file is updated for every record with the counts). Read the same hashed file and write it to your target SeqFile.

Posted: Wed Apr 06, 2011 11:44 am
by Nagaraj
Yes this logic works for only one set of data,
it should be able to handle different group types(i.e the key column keeps changing)

Posted: Wed Apr 06, 2011 12:24 pm
by DSguru2B
Why not set another column thats a 1 for YES/NO and 0 for NA. Then send it down to the aggregator stage and group on the type and sum up this new column.
Then just pivot it.
Am I missing something from the requirement?

Posted: Wed Apr 06, 2011 12:27 pm
by Nagaraj
sorted by key columns first and then
Tried by adding one more stage variable checking for key column change,
didnt workout.

Posted: Wed Apr 06, 2011 12:36 pm
by Nagaraj
Okay folks,

I am splitting the job into two and referencing the hash files in 2nd job which will give me counts and as well as zeros.....thanks for your time and ideas, appreciate it.

Posted: Wed Apr 06, 2011 12:57 pm
by arunkumarmm
Nagaraj wrote:Yes this logic works for only one set of data,
it should be able to handle different group types(i.e the key column keeps changing)
I did not understand this. Can you give some example? And if you are talking abouut the Key column in the Hashed file, it should be a dummy key, say 1 for all the rows.