Count(*)

Nagaraj · Post by **Nagaraj** » Tue Apr 05, 2011 3:38 pm

I have a requirement where i am creating counts in a delimited file, something like below

Source

col1
Yes
Yes
No
No

Target(File)(based on three condiitions YES/NO/NA)

2|2|0

the reason is even if there is no value for any of the 3 conditions i need zero.

i am using TFM splitiing by using constraints and then using Agg stage for each link and then i am using link collector and seperating out in a file.
(using stage variables to generate the format based on some dummy key)

all i am getting is 2|2

how do i get zero in place of nothing?

ray.wurlod · Post by **ray.wurlod** » Tue Apr 05, 2011 3:52 pm

I could give a parallel job answer, since you posted it in the parallel jobs forum. However your mention of a Link Collector suggests server job, and this is how you marked the job type.

Seems to me that this could be solved with three links each with its own counting Aggregator stage. One link handles Y, one link handles N, and the third link handles "neither Y nor N". Carry the same constant on each link, and join the three links together using a Transformer stage joining on the constant.

chulett · Post by **chulett** » Tue Apr 05, 2011 4:14 pm

... and here we are in Server world now.

Nagaraj · Post by **Nagaraj** » Tue Apr 05, 2011 6:00 pm

oh right...i just didnt notice

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 9:56 am

tried the same thing.

"join the three links together using a Transformer stage joining on the constant"

but looks like there is timing issue.

the three links which are doing counts takes time but before that my trasnformer main stream link then looks up hashfile created or still getting created will not find any records in hash files though there are records which appears late untill then the output will just show 0 for all the counts.

i hope i am clear in explaining the scenario here

chulett · Post by **chulett** » Wed Apr 06, 2011 9:58 am

Nope, not clear at all.

Can you draw a 'picture' of the job?

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 10:31 am

not sure how to draw a picture...checked FAQ section didnt find any...!

Code: Select all

               AGG----->TFM4---->Hash1    
                ^           
                i           
                i           
                i           
                i           
seq--->TFM1-->TFM2--------->TFM3------->Hash2
	  				            ^ 
                             | 
                             | 
                             |
                             |
                            Hash1

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 10:33 am

the upper link will be coming from 2nd TFM, and the lower reference link is towards 3rd TFM.

Moderator: like that. The need to use code tags to preserve whitespace has been repeated here a ton of times. I also numbered your transformers for easier reference.

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 10:39 am

i can split the job into two, which will definately work, but then i want to use this functionality where i can write and read from the same hash file.

arunkumarmm · Post by **arunkumarmm** » Wed Apr 06, 2011 11:18 am

If my understanding is correct the below approach will work.

SeqFile ----> Transformer ----> HashedFile ---->SeqFile

In the transformer, create 3 stage variables:

sv1 - If InputLink.Col1 = 'YES' Then sv1+1 Else sv1
sv2 - If InputLink.Col1 = 'NO' Then sv2+1 Else sv2
sv3 - If InputLink.Col1 = 'NA' Then sv3+1 Else sv3

And give default value for the stage variables as zero. Make a dummy key column in the hashed file and write all these stage variables to the hashed file(So as the Hashed file is updated for every record with the counts). Read the same hashed file and write it to your target SeqFile.

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 11:44 am

Yes this logic works for only one set of data,
it should be able to handle different group types(i.e the key column keeps changing)

DSguru2B · Post by **DSguru2B** » Wed Apr 06, 2011 12:24 pm

Why not set another column thats a 1 for YES/NO and 0 for NA. Then send it down to the aggregator stage and group on the type and sum up this new column.
Then just pivot it.
Am I missing something from the requirement?

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 12:27 pm

sorted by key columns first and then
Tried by adding one more stage variable checking for key column change,
didnt workout.

Nagaraj · Post by **Nagaraj** » Wed Apr 06, 2011 12:36 pm

Okay folks,

I am splitting the job into two and referencing the hash files in 2nd job which will give me counts and as well as zeros.....thanks for your time and ideas, appreciate it.

arunkumarmm · Post by **arunkumarmm** » Wed Apr 06, 2011 12:57 pm

Nagaraj wrote:Yes this logic works for only one set of data,
it should be able to handle different group types(i.e the key column keeps changing)

I did not understand this. Can you give some example? And if you are talking abouut the Key column in the Hashed file, it should be a dummy key, say 1 for all the rows.