Calculating subtotals

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
AngelMady
Participant
Posts: 16
Joined: Fri Feb 23, 2007 6:25 pm
Location: Romania

Calculating subtotals

Post by AngelMady »

Hello,

I have in input file that looks like this:

id|colA|colB
1|x|z
2|y|t
1|a|b
3|c|d


and the required output:

id|colA|colB
1|x|z
1|a|b
1|2|
2|y|t
2|1|
3|c|d
3|1|

My input is a sequential file and for each id, I have to calculate the number of lines for each id.

Could you give me any suggestions in solving this quest?

Thank you in advance for your return.

Madalina
O:-)There can be only one>:)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Fork join. (Search)

In the Aggegregtor stage count the rows for each group (each key value).

Funnel based on the key value.

Make sure your data are properly sorted and partitioned on the key value (or, for a small enough number of rows, process on a single node).
Last edited by ray.wurlod on Tue Mar 15, 2011 7:35 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
LS
Participant
Posts: 5
Joined: Tue Mar 15, 2011 3:48 pm
Location: Europe
Contact:

Re: Calculating subtotals

Post by LS »

Hi AngelMady.

Yes I can. I am assuming that the only column you want sorted in the output is the id.

Job: File -> Sort -> Transformer -> Sort -> Transformer -> File

1. Read you file and hash by id, so all ids with same value are on the same partition.
2. Sort by id.
3. Add a transformer and with the use of 3 stage variables create a new column on the output to hold the Key_Change(varchar(100)). Stage vars like:

CurrentId = inputLink.ID
IsNewId = if CurrentId <> PreviousId Then "1" Else "0"
PreviousId = CurrentId

Make the new output column Key_Change to have IsNewId value. It will look like (abstract partitions):
id|colA|colB|Key_Change
1|x|z|1
1|a|b|0
2|y|t|1
3|c|d|1


4. Sort by Id asc and Key_Change desc. It will look like:
id|colA|colB|Key_Change
1|a|b|0
1|x|z|1
2|y|t|1
3|c|d|1

5. Add a transformer with a stage var Count like:

CurrentId = inputLink.ID
Count = if CurrentId <> PreviousId Then 1 Else Count + 1
PreviousId = CurrentId

The output columns will be only Id, colA and colB (make this big enough to hold the size of 2 records). Make the derivation for colB like:
If input.Key_Change = "0" Then input.colB Else input.colB : "**BREAK**" : input.Id : "|" : Count

It will look like:

id|colA|colB
1|a|b|
1|x|z**BREAK**1|2
2|y|t**BREAK**2|1
3|c|d**BREAK**3|1

Don't forget to sort merge on the collector.

On the after run a sed pr awk to change "**BREAK**" for a unix/windows newline.

In v8 the Sort has a Key Change Column option that would make this job much smaller:
Job: File -> Sort -> Transformer -> File

As a payment tomorrow you will have to smile to 10 strangers.

Have fun,
me.

[snip]
I'm just a Player, with no Section Leader.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Dear LS,

:idea:

Please give our friend the Reply to topic button a try next time, there's no need to quote everything every time you want to add to a conversation. I would be eternally grateful and it will cut down on my superfluous quotation cleanup work. :wink:

Have fun,
me!
-craig

"You can never have too many knives" -- Logan Nine Fingers
AngelMady
Participant
Posts: 16
Joined: Fri Feb 23, 2007 6:25 pm
Location: Romania

Post by AngelMady »

Hello,

Thank you for your answers.

Starting from your answers, I decided to go on a solution using aggregator, funnel and sort.

In the first step, I draw the design in order to analyze it, to see where it can be optimized.

For the moment, I'm working at the extraction part. I'll get back to you with the result.
O:-)There can be only one>:)
Post Reply