Page 1 of 1

Calculating subtotals

Posted: Tue Mar 15, 2011 4:26 pm
by AngelMady
Hello,

I have in input file that looks like this:

id|colA|colB
1|x|z
2|y|t
1|a|b
3|c|d


and the required output:

id|colA|colB
1|x|z
1|a|b
1|2|
2|y|t
2|1|
3|c|d
3|1|

My input is a sequential file and for each id, I have to calculate the number of lines for each id.

Could you give me any suggestions in solving this quest?

Thank you in advance for your return.

Madalina

Posted: Tue Mar 15, 2011 4:44 pm
by ray.wurlod
Fork join. (Search)

In the Aggegregtor stage count the rows for each group (each key value).

Funnel based on the key value.

Make sure your data are properly sorted and partitioned on the key value (or, for a small enough number of rows, process on a single node).

Re: Calculating subtotals

Posted: Tue Mar 15, 2011 5:18 pm
by LS
Hi AngelMady.

Yes I can. I am assuming that the only column you want sorted in the output is the id.

Job: File -> Sort -> Transformer -> Sort -> Transformer -> File

1. Read you file and hash by id, so all ids with same value are on the same partition.
2. Sort by id.
3. Add a transformer and with the use of 3 stage variables create a new column on the output to hold the Key_Change(varchar(100)). Stage vars like:

CurrentId = inputLink.ID
IsNewId = if CurrentId <> PreviousId Then "1" Else "0"
PreviousId = CurrentId

Make the new output column Key_Change to have IsNewId value. It will look like (abstract partitions):
id|colA|colB|Key_Change
1|x|z|1
1|a|b|0
2|y|t|1
3|c|d|1


4. Sort by Id asc and Key_Change desc. It will look like:
id|colA|colB|Key_Change
1|a|b|0
1|x|z|1
2|y|t|1
3|c|d|1

5. Add a transformer with a stage var Count like:

CurrentId = inputLink.ID
Count = if CurrentId <> PreviousId Then 1 Else Count + 1
PreviousId = CurrentId

The output columns will be only Id, colA and colB (make this big enough to hold the size of 2 records). Make the derivation for colB like:
If input.Key_Change = "0" Then input.colB Else input.colB : "**BREAK**" : input.Id : "|" : Count

It will look like:

id|colA|colB
1|a|b|
1|x|z**BREAK**1|2
2|y|t**BREAK**2|1
3|c|d**BREAK**3|1

Don't forget to sort merge on the collector.

On the after run a sed pr awk to change "**BREAK**" for a unix/windows newline.

In v8 the Sort has a Key Change Column option that would make this job much smaller:
Job: File -> Sort -> Transformer -> File

As a payment tomorrow you will have to smile to 10 strangers.

Have fun,
me.

[snip]

Posted: Tue Mar 15, 2011 6:45 pm
by chulett
Dear LS,

:idea:

Please give our friend the Reply to topic button a try next time, there's no need to quote everything every time you want to add to a conversation. I would be eternally grateful and it will cut down on my superfluous quotation cleanup work. :wink:

Have fun,
me!

Posted: Sat Mar 19, 2011 9:43 am
by AngelMady
Hello,

Thank you for your answers.

Starting from your answers, I decided to go on a solution using aggregator, funnel and sort.

In the first step, I draw the design in order to analyze it, to see where it can be optimized.

For the moment, I'm working at the extraction part. I'll get back to you with the result.