Hi all,
We have a problem that we have been trying to solve without much success and your help would be much appreciated.
The requirement is to prepare a record to be written to output in XML format, two fields of which are complex types that can be populated any number of times.
The scenario is as follows:
1. A long list of source columns (about 90) from a couple of joined tables
2. A dynamic number of reference records whose column name values identify the columns we need
3. We then need to combine the two to produce the column name/value pairs to populate into the complex XML type
Some sample data below to clarify the requirement
Input
PK| BK| COL_A| COL_B| COL_C| COL_D| COL_E| COLxxx
-------------------------------------------------------------------------------
23| Cust_123| 590| 15| 25| 35| 40
Reference
COL_ID| COL_NAME
-------------------------
3| COL_A
6| COL_B
8| COL_E
Output
PK| BK| COL_A| COL_B| COL_E
--------------------------------------------------
23| Cust_123| 590| 15| 40
Alternative Output
PK| BK| COL_NAME| COL_VALUE
--------------------------------------------------
23| Cust_123| COL_A| 590
23| Cust_123| COL_B| 15
23| Cust_123| COL_E| 40
So we need to read the column names from the reference data to determine which input column name/value pairs to take; in this case only COL_A, COL_B and COL_E are taken. These column name/value pairs will end up in the XML complex type, but we should be fine with that part.
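To make the requirement concrete, here is a minimal Python sketch (illustration only, outside DataStage) of the reference-driven selection producing the alternative name/value output; the literal values are taken from the sample data above:

```python
# Illustrative only: select input columns named in the reference data,
# then pivot the kept columns into name/value rows.
input_row = {"PK": 23, "BK": "Cust_123", "COL_A": 590, "COL_B": 15,
             "COL_C": 25, "COL_D": 35, "COL_E": 40}
reference = {3: "COL_A", 6: "COL_B", 8: "COL_E"}  # COL_ID -> COL_NAME

wanted = set(reference.values())
pairs = [(name, value) for name, value in input_row.items() if name in wanted]

# Pivoted name/value form, one row per selected column:
output = [{"PK": input_row["PK"], "BK": input_row["BK"],
           "COL_NAME": n, "COL_VALUE": v} for n, v in pairs]
```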
Would anyone have any suggestions on the best way to do this? Custom routines are highly undesirable because this needs to fit into existing code.
Many thanks,
Novak
Column output list based on reference data
Couple of questions:
Does each input row potentially have a different set of reference column requirements? If so, how do you determine from your input row which columns need to be referenced?
I.e., in your example you suggest that your input row requires an output of only columns A, B and E. Does your job expect that all input rows will require A, B and E for that run, with the reference requirements changing on the next run, OR is the requirement different for each input row, so this example requires A, B and E but the next input row in the same run of the job requires C, D and F?
If the latter, how can you tell which reference columns the input row requires?
Hi Shane,
Thanks for looking into this. Each job run may have a different set of columns needing to be output, determined by the reference data and applicable to all the records.
Every record for a particular job run, using the same example from before, will output columns A, B and E.
In the next run it could be columns B, E, D and F for all of the source records that come through.
Cheers
So you know before you run the job which columns are required for reference?
In that case I would just use a Modify stage and set a KEEP specification in which you list the column names to keep. You would pass these column names via a job parameter which you populate as the job starts.
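As a sketch of that idea (assuming a job parameter named pKeepColumns that you populate before the run; check the Modify stage specification syntax for your DataStage version), the Specification property might look roughly like:

```
KEEP PK, BK, #pKeepColumns#
```

where #pKeepColumns# expands at run time to something like COL_A, COL_B, COL_E. The parameter name here is hypothetical; the point is that the keep list is resolved from the reference data before the job starts.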
OR
You might be able to use a combination of Column Import and Export stages in conjunction with Schema files based on the columns that you require.
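For the schema-file variant, a sketch of what a generated schema might look like (the column types here are assumptions; a pre-job step would write this file from the reference data before the run):

```
record (
  PK: int32;
  BK: string[max=30];
  COL_A: int32;
  COL_B: int32;
  COL_E: int32;
)
```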
OK, so you only know which columns you require once the job has started running.
In that case you will need to pivot your data, whilst at the same time adding a column which contains your column names.
This can be achieved in a variety of ways:
1. Use a Peek stage with an output link.
Set the Peek to return all records, all columns, all partitions, with the output mode set to Output (rather than Log) and Show Column Names set to True. This will output the data in the format
ColumnName:Value|ColumnName:Value etc. (where | is the delimiter as set in the Peek stage, under Options). You should then be able to split the values into column name and value columns, and perform a lookup on the column name, keeping only those where there is a match.
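The split-and-lookup step can be sketched in plain Python (illustration only; inside DataStage this would be a Transformer plus a Lookup stage, and the sample values are assumptions based on the data above):

```python
# Illustrative only: parse a Peek-style "name:value|name:value" record and
# keep only the columns named in the reference data.
peek_record = "PK:23|BK:Cust_123|COL_A:590|COL_B:15|COL_C:25|COL_D:35|COL_E:40"
reference_names = {"COL_A", "COL_B", "COL_E"}  # from the reference lookup

# Split on the Peek delimiter, then split each field into name and value.
pairs = [field.split(":", 1) for field in peek_record.split("|")]
kept = [(name, value) for name, value in pairs if name in reference_names]
```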
2. Import the column names from your DB. This option will depend on how you are reading your data, but you might be able to read the system tables to get your column names (easier if extracting from one table or a view). Once you have read in your header values you need to attach them to your data values. Again there are multiple ways of achieving this: you can add a position number column to both the headers and the values and perform a join, OR you could concatenate all the column headers into a single record and the values row into a single record, then use a loop variable in a Transformer to split them back out into column name/value pairs.
Pros: Can change your source and you won't have to change your job.
Cons: The order of columns must exactly match the order of the input data, else the wrong data can end up in the column.
There are also other ways - but the key is getting the column name associated with the value so that you can perform a lookup on your reference table.
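The concatenate-and-split technique from option 2 can be sketched as follows (plain Python for illustration; in DataStage the split would be a Transformer loop variable, and the sample records are assumptions based on the data above):

```python
# Illustrative only: pair a concatenated header record with a concatenated
# value record, then keep only the reference columns.
header_record = "PK,BK,COL_A,COL_B,COL_C,COL_D,COL_E"  # e.g. from system tables
value_record = "23,Cust_123,590,15,25,35,40"           # one source row
reference_names = {"COL_A", "COL_B", "COL_E"}

names = header_record.split(",")
values = value_record.split(",")
assert len(names) == len(values)  # order must match exactly (the Con above)

kept = [(n, v) for n, v in zip(names, values) if n in reference_names]
```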