query on vertical pivot

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.


manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

query on vertical pivot

Post by manish1005 »

Hi, I am doing a vertical pivot along the lines of what's described at http://dsxchange.com/viewtopic.php?t=97 ... d58c385d2b

Now, in the above solution the final data is written to a hash file, which overwrites values for the same key with no extra effort. Since my data runs into millions of rows, I was thinking of saving some of the hash file writes by using an Aggregator stage with a 'Last' derivation. My data is sorted on the key. I am then thinking of putting a sequential file after the aggregator, to be used by Oracle's bulk loading utility to load the data.

old: seq. file 1 --> transformer --> hash file
new: seq. file 1 --> transformer --> aggregator --> seq. file 2
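
For illustration only, a minimal Python sketch (not DataStage code; the names are made up) of what both designs boil down to: keeping the last row written for each key. The hash file gets there by destructive overwrite on the key; the Aggregator gets there explicitly with a 'Last' derivation.

def last_per_key(rows):
    """rows: iterable of (key, record) pairs."""
    result = {}
    for key, record in rows:
        result[key] = record  # a later row for the same key overwrites the earlier one
    return result

rows = [("0", "partial"), ("0", "complete"), ("1", "partial"), ("1", "complete")]
print(last_per_key(rows))  # {'0': 'complete', '1': 'complete'}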

My question is: can this approach put too much load on main memory? Will the Aggregator stage use too much memory, since the data volume is so large?
Or is there another way that would be fast for large volumes of data?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If the data are sorted on the grouping keys, and the Aggregator stage is informed of this fact, then the volumes are irrelevant. The stage can output a row (and free the memory associated with it) each time a sorted grouping key changes value, because the sorting means that value will never be seen again. With sorted input, the Aggregator stage becomes very lightweight in its use of memory.
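
A minimal Python sketch of that streaming behaviour (for illustration only, not DataStage internals): with sorted input, only the current group is ever held in memory.

def last_per_group_sorted(rows):
    """rows: iterable of (key, record) pairs, pre-sorted on key.
    Yields one (key, last record) pair per group, holding only
    the current group in memory."""
    prev_key = prev_record = None
    for key, record in rows:
        if prev_key is not None and key != prev_key:
            yield prev_key, prev_record  # group finished: this key never recurs
        prev_key, prev_record = key, record
    if prev_key is not None:
        yield prev_key, prev_record  # flush the final group
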
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

I tried using an aggregator in a sample job, replacing one of the final hash files with an Aggregator stage.
But in DS Designer, after I run the job, I see twice the number of rows passing between the Transformer and Aggregator stages. How is that possible?
source --32 rows--> transformer --20 rows--> transformer --40 rows--> aggregator --5 rows--> seq. file
                                                 |
                                                 +--12 rows--> transformer --4 rows--> hash file

It's the 40-row link that shows double the rows that the previous hash file design was showing.

In the aggregator I have simply grouped by the index field, which is sorted, and used 'Last' on the record field, which is a comma-separated record:
INDEX,"Record"
0,"0,3434,001,asdsa,22"
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Unless you tell us precisely how you are processing in the Transformer stage, we cannot answer your question. It IS possible for more rows to come out of a Transformer than go in, but you have to program this; it cannot happen of its own accord.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

The requirement is to combine these multiple rows with the same index value (the integer in the third field) into one row.
sample input:
Field001,Field001_1,Field001_2,Field001_3,Value
"wfappwols","hierarchies","0","cisid","3434"
"wfappwols","hierarchies","0","cisbankid","001"
"wfappwols","hierarchies","0","ciscompanyname","asdsa"
"wfappwols","hierarchies","0","cislevel","22"
"wfappwols","hierarchies","1","cisid","3435"
"wfappwols","hierarchies","1","cisbankid","002"
"wfappwols","hierarchies","1","ciscompanyname","asdsadsad"
"wfappwols","hierarchies","1","cislevel","22"

sample output:
INDEX,"Record-(cisid,cisbankid,ciscompanyname,cislevel)"
0,"0,3434,001,asdsa,22"
1,"1,3435,002,asdsadsad,22"

In the Transformer I am doing the vertical pivot the following way. The input link consists of five fields:
Field001, Field001_1, Field001_2, Field001_3 and Value.

Stage Variables:

currentKey:
Default = ""
Derivation = InputLink.Field001_2

newRecord:
Default = ""
Derivation = If currentKey <> lastKey Then InputLink.Field001_2 : "," : InputLink.Value Else newRecord : "," : InputLink.Value

lastKey:
Default = ""
Derivation = currentKey

Output Link:
INDEX = InputLink.Field001_2
Record = newRecord
So now I can't understand why the number of output rows from the Transformer gets doubled when I use the Aggregator stage.
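
For illustration, a Python sketch of the stage-variable logic above (assuming the variables are evaluated top to bottom once per input row, as DataStage stage variables are). Note that the output link carries no constraint, so one row is emitted per input row, including rows that hold only partially built records:

def pivot_unconstrained(rows):
    """rows: iterable of (key, value) pairs, sorted on key,
    mimicking the stage variables currentKey / newRecord / lastKey."""
    last_key = ""
    new_record = ""
    out = []
    for key, value in rows:
        current_key = key
        if current_key != last_key:
            new_record = current_key + "," + value  # key changed: start a new record
        else:
            new_record = new_record + "," + value   # same key: keep accumulating
        last_key = current_key
        out.append((current_key, new_record))       # no constraint: one output row per input row
    return out

With the eight sample rows above this returns eight rows, and only the fourth and eighth hold complete records; a keyed hash file write silently keeps just the last row per key, whereas every row arriving on an Aggregator link is counted.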
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

How many output rows, how many input rows?

You probably need to constrain the output to pass only those rows that are complete.

Take a look at the actual rows being output.
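
For illustration, a Python sketch of one way such a constraint could work (an assumption about the approach, not taken from the thread): defer output by one group, so a record is emitted only when the key changes and the previous group is known to be complete.

def pivot_constrained(rows):
    """rows: iterable of (key, value) pairs, sorted on key.
    Emits each accumulated record only once its group is complete,
    i.e. when the key changes (plus a flush at end of input)."""
    prev_key = None
    record = ""
    for key, value in rows:
        if key != prev_key:
            if prev_key is not None:
                yield prev_key, record  # previous group is now complete
            record = key + "," + value
        else:
            record = record + "," + value
        prev_key = key
    if prev_key is not None:
        yield prev_key, record  # flush the last group
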
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

"You probably need to constrain the output to pass only those rows that are complete."

That's precisely why I am using an Aggregator stage (with a 'Last' derivation) or a hash file stage, which gives me the last write.

I will try putting a constraint in the transformer itself, but the question still remains unanswered: where are these extra rows coming from with the Aggregator when they did not appear with the hash file?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Please answer my questions about your Transformer stage job.

Help us to help you.

An extra question, then. What are the row counts when you use an Aggregator stage - inputs and outputs to that stage?

Can you please post the job design in each case? But use Code tags (which give a fixed-width typeface) rather than what you attempted earlier, which was not totally clear.

What are you doing (precisely) in the second Transformer stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.