Derive Column names from Rows

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Derive Column names from Rows

Post by UAUITSBI »

Hello,

Below is the scenario:

In the data that I am getting from the source, the column names are embedded in the rows, and I have to pivot these rows and derive the column names from them. I have done the pivoting part, which is shown in the sample below; I am left with the challenge of deriving the column names from the rows:

Example:

Code: Select all

Column1  Column2  Column3
ABC      DEF      XYZ
123      234      345
In the above example 'ABC', 'DEF', 'XYZ' are the names of the columns. I am using 3 columns as an example, but there can be more than 100 columns one day and maybe 10 columns the next, so they are dynamic.

Also, to add to this: the above sample data is for one instance. If there is more than one instance, the data will look like this:

Code: Select all

Column1  Column2  Column3
ABC      DEF      XYZ
123      234      345
GHI      JKL      MNO
333      222      666
ABC      MNO
789      555


Output intended:

Code: Select all

ABC     DEF     XYZ     GHI     JKL     MNO
123     234     345     333     222     666
789     NULL    NULL    NULL    NULL    555


The column names are not fixed; they vary dynamically.

Please advise.

Any help is appreciated.

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I don't believe what you are trying to do is legally possible.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Is there a way to assign some sort of sequence number to the group of data that could make this possible?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

UAUITSBI wrote:The column names are not fixed they vary dynamically.
How do you envisage "some sort of sequence number" would help?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Is there any logic to the way the data is received?

From your example it looks like you receive 3 columns at a time, and the rows arrive as header row, data row, header row, data row?

If that is the case, I would try to get the data into 2 columns, one with the header row value and one with the data value, e.g.

Code: Select all

 Headers   Data
ABC       123
DEF       234
XYZ       345
GHI       333
JKL       222
MNO       666
ABC       789
MNO       555
From here, get a distinct list of header values and form a single column of delimited header values for all the values being input, e.g. ABC|DEF|GHI|JKL|MNO|XYZ. Join this value to the existing input.

For the distinct list of header values, give each one a position number.
1 ABC
2 DEF
3 GHI
4 JKL
5 MNO
6 XYZ

At the same time, apply a sequence number to count the number of times each header arrives (sort the data by headers, implement a counter for the header, increment it by 1, and reset it when the header name changes).

So by this point you can have your data looking like this

Code: Select all

 
Headers   Data   AllHeaders                Count  Position
ABC       123    ABC|DEF|GHI|JKL|MNO|XYZ   1      1
DEF       234    ABC|DEF|GHI|JKL|MNO|XYZ   1      2
XYZ       345    ABC|DEF|GHI|JKL|MNO|XYZ   1      6
GHI       333    ABC|DEF|GHI|JKL|MNO|XYZ   1      3
JKL       222    ABC|DEF|GHI|JKL|MNO|XYZ   1      4
MNO       666    ABC|DEF|GHI|JKL|MNO|XYZ   1      5
ABC       789    ABC|DEF|GHI|JKL|MNO|XYZ   2      1
MNO       555    ABC|DEF|GHI|JKL|MNO|XYZ   2      5
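
For reference, here is a minimal sketch of this preparation step in plain Python, outside DataStage. The names are illustrative, and the input is just the two-column sample above.

Code: Select all

# Sketch only: derive the delimited header list, a per-header occurrence
# count, and a 1-based position number from (header, data) pairs.
rows = [
    ("ABC", "123"), ("DEF", "234"), ("XYZ", "345"),
    ("GHI", "333"), ("JKL", "222"), ("MNO", "666"),
    ("ABC", "789"), ("MNO", "555"),
]
headers = sorted({h for h, _ in rows})                 # distinct header list
all_headers = "|".join(headers)                        # ABC|DEF|GHI|JKL|MNO|XYZ
position = {h: i + 1 for i, h in enumerate(headers)}   # position number per header
seen = {}                                              # occurrence counter per header
enriched = []
for header, data in rows:
    seen[header] = seen.get(header, 0) + 1
    enriched.append((header, data, all_headers, seen[header], position[header]))
for row in enriched:
    print(row)   # e.g. ('ABC', '789', 'ABC|DEF|GHI|JKL|MNO|XYZ', 2, 1)
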
Resort your data by the count and the position

In a transformer create variables for
* Current Row (svCurrentRow)
* Change in Row (svRowChange)
* AllHeaders value (svHeaders)
* Count of Headers (ie DCount on All Headers) (svHeaderCount)
* Current Position (svCurrentPosId)
* Setting empty columns (svRepeatDelim)
* Current column value (svDataValue)
* Building the output (svBuildData)
* Previous Position (svPreviousPosId)
* Previous row (svPreviousRow)

Logic
svCurrentRow = lnk_data_in.Count
svRowChange = svCurrentRow<>svPreviousRow
svHeaders = lnk_data_in.AllHeaders
svHeaderCount = DCount(lnk_data_in.AllHeaders,'|')
svCurrentPosId = lnk_data_in.Position
svRepeatDelim = If svRowChange then svCurrentPosId-1 else svCurrentPosId-svPreviousPosId
svDataValue = lnk_data_in.Data
svBuildData = If svRowChange then Str('|',svRepeatDelim) : svDataValue else svBuildData:Str('|',svRepeatDelim) : svDataValue
svPreviousPosId = lnk_data_in.Position
svPreviousRow = lnk_data_in.Count

Set a constraint on the output as LastRowInGroup(lnk_data_in.Count)

Output the data to a sequential file with one column. Set the output column to be
If @OUTROWNUM = 1 then svHeaders : Char(10) : svBuildData else svBuildData

This should give you an output of

Code: Select all

 
Values
ABC|DEF|GHI|JKL|MNO|XYZ
123|234|333|222|666|345
789||||555|
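
For reference, the same folding logic as a short plain-Python sketch. The variable names are mine, and the trailing padding (so that a short group still covers every header) is an assumption based on the expected output above.

Code: Select all

# Sketch only: fold (Count, Position, Data) rows, pre-sorted by Count then
# Position, into one delimited line per Count, mirroring the stage variables.
all_headers = "ABC|DEF|GHI|JKL|MNO|XYZ"
header_count = len(all_headers.split("|"))
rows = [
    (1, 1, "123"), (1, 2, "234"), (1, 3, "333"),
    (1, 4, "222"), (1, 5, "666"), (1, 6, "345"),
    (2, 1, "789"), (2, 5, "555"),
]
lines, build, prev_count, prev_pos = [], "", None, 0
for count, pos, data in rows:
    row_change = count != prev_count
    if row_change and prev_count is not None:
        lines.append(build + "|" * (header_count - prev_pos))   # flush previous group
    repeat_delim = (pos - 1) if row_change else (pos - prev_pos)
    build = ("" if row_change else build) + "|" * repeat_delim + data
    prev_pos, prev_count = pos, count
lines.append(build + "|" * (header_count - prev_pos))           # flush the last group
print(all_headers)
for line in lines:
    print(line)
# Prints: ABC|DEF|GHI|JKL|MNO|XYZ, then 123|234|333|222|666|345, then 789||||555|
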
I might have glossed over a few things, but this should get you pretty close. Assuming of course you can get your data into a format of column name and column value at the start.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I envisaged something like Shane's solution, but abandoned it because I assumed that your possible range of headings was infinite. Solutions such as Shane's are fine for a limited, finite set of headings only.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Hello Shane,

Thanks for your extensive insight into this.

My data is sourced from XML and I pivot the data, which is shown in my example. It is not always 3 columns at a time; there can be any number of columns, and as Ray pointed out it is dynamic.

Yes, rows arrive as header row, data row, header row, data row.

In fact, from the source I get the data as you have shown:

Code: Select all

 Headers   Data 
ABC       123 
DEF       234 
XYZ       345 
GHI       333 
JKL       222 
MNO       666 
ABC       789 
MNO       555
But then I do a vertical pivot of this data in order to convert the headers into one row so that I can derive the column names.

Your solution will lead me in the right direction. But is it possible when the header has not just 3 but any number of columns?

I worked with my analysts and was able to locate a sequence number which would establish a relationship between Header and Data.

Code: Select all


Column1  Column2  Column3  UniqueNumber
ABC      DEF      XYZ      1234567
123      234      345      1234567
Will this unique number be of any assistance if the number of columns is dynamic?
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

If your source is already coming in as a 2 column input of header values and data values, i.e.

Code: Select all

Headers   Data 
ABC       123 
DEF       234 
XYZ       345 
GHI       333 
JKL       222 
MNO       666 
ABC       789 
MNO       555 
then just follow the logic that I provided from that point. There is no need to pivot the data. From this point you can create an output with any number of "columns" based on the input. I have personally used this method to create a process which creates over 1000 files at the same time, all with different "structures" (i.e. some have 5 columns, some have 200). This is because you are actually outputting to a single column which contains delimited values, so at that point it doesn't have a "structure" as such; rather, the file structure will be interpreted when you read the file again (or pass the output stream to another stage which can interpret the delimited values).
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Great !! Thanks again, I will implement this logic per my data and will get back if I stumble into an issue.

Will update this thread soon. Thanks !!! :)
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Hello Shane,

Your solution was very handy. I ran into a minor pitfall by not sorting the data well, which led to a mismatch between the Headers and Data columns, but I was able to fix it.

I have a couple of questions on this design; please share some insight:

1. If I get another column from the source say "ACTION":

Code: Select all

Headers   Data   Action
ABC       123    Insert
DEF       234    Update
XYZ       345    Insert
GHI       333    Update
JKL       222    Update
MNO       666
ABC       789    Insert
MNO       555    Update


Does your design stay the same? Or does the above design work for just 2 columns, so that in order to get the data from the third, fourth or Nth column the design needs to be different? I thought that if it is not feasible then I would probably need to use the COUNT counter that you suggested as a lookup key to get the "ACTION" column in a later part of the job, after completing the logic you suggested, but I was not sure how to concatenate it back into the one column that we are loading into the sequential file.

2. After we load the sequential file with one column, how can we split the data into individual columns dynamically? I know we can use the Field function based on delimiter position, but as the data here is dynamic, how can I split it to map to the target columns when the position of a column may vary depending on the source data? Or is the only solution to use the column position and the Field function to map the target columns manually?

Thanks again for the insight into this, appreciate it!!
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Any extra column shouldn't matter because you are processing each column individually.

However, the fact that you have this extra column with what looks like DB actions could mean there are other options, and it throws up other questions. But we will progress with what we have for now.

If you have output the data to a file, the question is what do you need to do with it? Given that there was an indication of some sort of DB action (Inserts and Updates), are you writing to a DB?
Is it all to the one table? Are you able to identify the key columns from the data provided? Are the key columns the same each time?
info_ds
Charter Member
Posts: 88
Joined: Thu Feb 10, 2005 4:36 am
Location: B'lore

Post by info_ds »

Hi,
Here is my thought on the scenario.
1) Sort the data by header.
2) Generate a seq num based on the repeated header.
3) Take 3 links out of the transformer to:
3a) Get the max of the generated seq num & add a dummy column.
3b) Remove duplicate headers, then add a dummy column & generate a row num again using @INROWNUM.
Join 3a & 3b by the dummy column, then set the loop condition to @ITERATION <= max(generated seq num) from step 3a.
Set a new field & map @ITERATION to get a unique number per iteration.
3c) Get header, data & seq num (from step 2).
4) Do a lookup on the 3c data against the output of 3b, based on header & seq num (from step 2).
5) Sort the data by seq num & header.
6) Finally you can do a row compare on the seq num & split your header & data, as sketched below.
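
As I read these steps, the net effect is to pair every distinct header with every seq num and pick up the data where it exists. A rough plain-Python sketch of that idea follows; the names are illustrative, and a dictionary lookup stands in for the transformer loop and the lookup stage.

Code: Select all

# Sketch only: seq num per repeated header, then every (seq num, header)
# combination is resolved against the actual data; missing ones become empty.
rows = [
    ("ABC", "123"), ("DEF", "234"), ("XYZ", "345"),
    ("GHI", "333"), ("JKL", "222"), ("MNO", "666"),
    ("ABC", "789"), ("MNO", "555"),
]
seen, numbered = {}, []
for header, data in rows:
    seen[header] = seen.get(header, 0) + 1               # step 2: seq num per header
    numbered.append((seen[header], header, data))
headers = sorted({h for _, h, _ in numbered})             # step 3b: distinct headers
max_seq = max(s for s, _, _ in numbered)                  # step 3a: max seq num
data_by_key = {(s, h): d for s, h, d in numbered}         # step 4: lookup by header + seq
print("|".join(headers))                                  # header row
for s in range(1, max_seq + 1):                           # steps 5-6: one row per seq num
    print("|".join(data_by_key.get((s, h), "") for h in headers))
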
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Hello Shane,

Yes, the target is a database, so after I have loaded the sequential file as per your solution, how can I go about using it dynamically?
Yes, that 3rd column is the DB action, but I don't do anything with it as of now. I will load 2 staging tables which will have the respective column names that are in the "Headers" column, as shown in my example.

Code: Select all

STAGING TABLE 1    STAGING TABLE 2
ABC                GHI
DEF                JKL
XYZ                MNO
So I have to convert the output from the file into these column names to match the staging tables, then map the respective columns.

Yes, each staging table has its own key column.

ShaneMuir wrote: Are the key columns the same each time?
Do you mean whether I will get the key columns from the source every time? If that is what you meant, then yes, I will get the key columns.
UAUITSBI
Premium Member
Posts: 117
Joined: Thu Aug 13, 2009 3:31 pm
Location: University of Arizona

Post by UAUITSBI »

Hello Info_ds,

Appreciate your insight on this. When you say row compare, do you mean joining the data with the seq num? While performing the split of the data into header & data, is it possible to split dynamically and map to the respective target columns?
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

What I meant when I said "are the key columns always the same" was: are the key columns identifiable, and are they the same in each target table?

Depending on your requirements you should be able to use a sequential file stage to read in the data. Turn RCP on and set First row is Column headers. This will set the column names in the DS Job. If you leave RCP on it will pass all the way through to the target table. Any Insert can be done without surfacing any columns. It will insert new rows based on the column names in the input file, and insert nulls for values which are not provided. For an update you will probably need to surface the key columns - this would mean having a job per table to update.
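
For illustration only, a rough plain-Python analogue of that idea, with the insert statement driven entirely by whatever column names the file's first row contains. The file name, staging table, and the SQLite stand-in are hypothetical, not part of the actual job.

Code: Select all

# Sketch only: build the INSERT from the header row, so the column list does
# not need to be known in advance.
import csv
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging1 (ABC TEXT, DEF TEXT, XYZ TEXT, GHI TEXT, JKL TEXT, MNO TEXT)")
with open("pivoted_output.txt", newline="") as f:          # hypothetical file name
    reader = csv.reader(f, delimiter="|")
    columns = next(reader)                                 # first row = column names
    placeholders = ",".join("?" * len(columns))
    sql = "INSERT INTO staging1 (%s) VALUES (%s)" % (",".join(columns), placeholders)
    for row in reader:
        conn.execute(sql, [value or None for value in row])   # empty fields become NULL
conn.commit()
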