Generating Multiple rows in output from single row in input

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dhiren
Premium Member
Premium Member
Posts: 43
Joined: Thu Aug 11, 2005 12:24 am

Generating Multiple rows in output from single row in input

Post by dhiren »

Hi All
In our project requirement, where there is a single row in input containing 3 fields F1, F2 and F3.

I have to generate an output file where there are four fields F1,F2,F3 and F4.

Now depending on the value in the input field F1, single/multiple output rows will be generated.

For eg.
The input file structure is
F1, F2, F3

The values are:
3 , a, b

Since the value of F1 is 3, i will need 3 rows in output as below:

F1, F2, F3 , F4
3, a , b , 1
3, a , b , 2
3, a , b , 3

Note: The value of F1 can be anything ; there is no upper limit for this.

Any advice towards achieving this will be of great help.

Regards,
Dhiren
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think you're going to need a custom stage for this. My first thought was that you have a horizontal pivot here, but you would need to generate an arbitrary number of columns in the row (based on the value of F1) prior to the Pivot stage. And DataStage doesn't work with an arbitrary number of columns in a row. Is there a finite upper limit to F1 values? If so, this approach may be viable. F1, F2 and F3 would be pivot keys. (Yes, I note that you indicated that there is no upper limit, but it's always worth checking.)

In the custom stage you would need a loop that generated F1 rows for each input row, with an incrementing value for F4 reset to 1 for each new row being processed.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dhiren
Premium Member
Premium Member
Posts: 43
Joined: Thu Aug 11, 2005 12:24 am

Post by dhiren »

Ray,
Since we are dealing with the input processed data, i see that there will be no change in the upper limit.
The upper limit can be assumed to be 10000 here.

Regards,
Dhiren
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How do you feel about a row with 10003 columns? The last 10000 will have very similar derivations: If x > F1 Then @NULL Else x (for x running from 1 to 10000). Not pretty, I think the custom stage is for you.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

Can you use an external target stage to write to a delimited sequential file? Can you use awk? If so, I think this awk tidbit might do the trick.

Code: Select all

#!/bin/ksh

FNAME="${1}"
AFS="${2}"

awk -vAFS="${AFS}" -vFNAME="${FNAME}" 'BEGIN{FS=AFS;OFS=AFS;}
{
   COUNT=1;
   while ( COUNT <= $1 )
   {
      print $1,$2,$3,COUNT >FNAME;
      COUNT++;
   }
}'

exit 0
bhavna.jha
Participant
Posts: 1
Joined: Fri Feb 02, 2007 4:33 am

Post by bhavna.jha »

hi Dhiren,

i think you can do it by using dimentation table also. you need to create a table which contain all possible value of F1 fields than join(inner join) this table with source table. After that u need to apply filter records which satisfy the condition f1<=3 like. try this may be this can help u.[/quote]
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

awk will definately do the trick. You can also write a C routine which takes in your three columns and returns the generated rows.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Or you can write the same C code in a Build stage, which is my original suggestion.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Actually C wont work. Come to think of it, C variables wont hold values with line feeds. The workaround could have been writing to a file but that would require the execution to be sequential. Parallel writes wont work. I guess its either a Basic transformer or an awk script. C will work but it has to be sequential.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's news to me. So my sprintf(LineCount, "%d lines written.\n\n", lines); statement, which has worked for many years, isn't valid?
:?
Last edited by ray.wurlod on Wed Feb 14, 2007 6:24 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

My apologies. I was trying to build the same but was messing up at a place. I take my words back. If I get this done in a while Ill post the code.
My apologies once again.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Ah, the ever-hasty DSguruji!
:roll:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Well I wrote a small piece of code in the morning and was stuck in between how to return a value with all rows. I was getting some junk data. Therefore my misinterepretation and hence my assumption that I made earlier. But your example Ray, was like a wake up call as I use that all the time too. I looked at my code again and added strcpy(All,"") in my code below. It just got rid of the junk data that I was getting.
Anyways, here is the complete routine.

Code: Select all

//Header Files
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

//Function that generates the rows
char * generateRows(char* F1, char* F2, char* F3)
{
  //Initialize variables
  const int SIZE = 1000;
  int intF1;
  
  char* temp = (char* )malloc(4*SIZE);

 //typecasting of F1 to get the number of duplicates
  intF1 = atoi(F1);

  char* All = (char* )malloc(intF1*SIZE);

//Initialize variable
  strcpy(All, "");

//Duplicating records and storing in variable All
  for (int i = 1; i <= intF1 ; i++)
  {
     sprintf(temp, "%s,%s,%s,%i\n", F1, F2, F3, i);
     strcat(All, temp);
	 
  }

  free(temp);
  return All;
}
dhiren, you can use this code after you do thorough testing. I tested it at home by manually passing data. It should word just fine.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
bcarlson
Premium Member
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

I think a buildop would be much easier. Then there are also no restrictions on sequential vs. parallel. That always throws a red flag for me.

We use buildops to handle situations similar to this, where the COBOL copybook has a single record with an occurs statement embedded. In these cases we'll read one record in and then interrogate the data within the occurs to determine if a new record should be created. If there are a max of 10, we may create 0 to 10 records as output from the single input.

Dhiren, your situation can be very easily handled in a buildop - I think this may be what Ray was talking about.

I used the following pieces to construct the buildop:

Input schema/table definition:

Code: Select all

record
(
        f1:int32;
        f2:string[1];
        f3:string[1];
)
Output schema/table definition:

Code: Select all

record
(
        f1:int32;
        f2:string[1];
        f3:string[1];
        f4:int32;
)
Per-record logic:

Code: Select all

// Drop default record, output will be handled manually 
dropRecord(); 

int i;

for (i = 1; i <= in.f1; i++) {
    out.f1 = in.f1;
    out.f2 = in.f2;
    out.f3 = in.f3;
    
    out.f4 = i;

    // Transfer record to buffer and output
    transferRecord(); 
    outputRecord();
}
Only 2 things to keep in mind if you want to manually control how the records are output. 1) drop the record at the beginning - this essentially clears the buffer so you can populate it yourself. 2) For each record you create, transfer the data to the buffer and output (write) it.

If you are interested, here is a link where we used similar logic to handle an occurs in our input data:

Splitting One Record to Many

Multiple record entries to be produced from a single row.

Hope this helps.

Brad.
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

bcarlson wrote:I think a buildop would be much easier. Then there are also no restrictions on sequential vs. parallel. That always throws a red flag for me.
No restrictions for sequential vs. parallel in a routine as well. I think it depends upon how a person wants to do it.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply