Hi all,
We have a problem that we have been trying to solve without much success and your help would be much appreciated.
The requirement is to prepare a record to be written to output in XML format, two fields of which are complex types that can be populated any number of times.
The scenario is as follows:
1. A long list of source columns (about 90) from a couple of joined tables
2. A dynamic number of reference records whose column name values identify the columns we need
3. We then need to combine the two to produce the column name/value pairs to populate into the complex XML type
Some sample data below to clarify the requirement
Input
PK| BK| COL_A| COL_B| COL_C| COL_D| COL_E| COLxxx
-------------------------------------------------------------------------------
23| Cust_123| 590| 15| 25| 35| 40
Reference
COL_ID| COL_NAME
-------------------------
3| COL_A
6| COL_B
8| COL_E
Output
PK| BK| COL_A| COL_B| COL_E
--------------------------------------------------
23| Cust_123| 590| 15| 40
Alternative Output
PK| BK| COL_NAME| COL_VALUE
--------------------------------------------------
23| Cust_123| COL_A| 590
23| Cust_123| COL_B| 15
23| Cust_123| COL_E| 40
So we need to read the column names from the reference data to determine which input column name/value pairs to take; in this case only COL_A, COL_B and COL_E are taken. These column name/value pairs will end up in the XML complex type, but we should be fine with that part.
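To make the requirement concrete, here is a minimal Python sketch (illustration only, outside DataStage) of the reference-driven selection producing the alternative name/value output; the literal values are taken from the sample data above:

```python
# Illustrative only: select input columns named in the reference data,
# then pivot the kept columns into name/value rows.
input_row = {"PK": 23, "BK": "Cust_123", "COL_A": 590, "COL_B": 15,
             "COL_C": 25, "COL_D": 35, "COL_E": 40}
reference = {3: "COL_A", 6: "COL_B", 8: "COL_E"}  # COL_ID -> COL_NAME

wanted = set(reference.values())
pairs = [(name, value) for name, value in input_row.items() if name in wanted]

# Pivoted name/value form, one row per selected column:
output = [{"PK": input_row["PK"], "BK": input_row["BK"],
           "COL_NAME": n, "COL_VALUE": v} for n, v in pairs]
```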
Would anyone have any suggestions on the best way to do this? Custom routines are highly undesirable because this needs to fit into existing code.
Many thanks,
Novak
Column output list based on reference data
Couple of questions:
Does each input row potentially have a different set of reference column requirements? If so, how do you determine from your input row which columns need to be referenced?
I.e., in your example you suggest that your input row requires an output of only columns A, B and E. Does your job expect that all input rows will require A, B and E for that run, with the reference requirements changing on the next run, OR is the requirement different for each input row, so this example requires A, B and E but the next input row in the same run of the job requires C, D and F?
If the latter, how can you tell which reference columns the input row requires?
Hi Shane,
Thanks for looking into this. Each job run may have a different set of columns needing to be output, determined by the reference data and applicable to all the records.
Every record for a particular job run, using the same example from before, will output columns A, B and E.
In the next run it could be columns B, E, D and F for all of the source records that come through.
Cheers
So you know before you run the job which columns are required for reference?
In that case I would just use a Modify stage and set a KEEP specification in which you list the column names to keep. You would pass these column names via a job parameter which you populate as the job starts.
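As a sketch of that idea (assuming a job parameter named pKeepColumns that you populate before the run; check the Modify stage specification syntax for your DataStage version), the Specification property might look roughly like:

```
KEEP PK, BK, #pKeepColumns#
```

where #pKeepColumns# expands at run time to something like COL_A, COL_B, COL_E. The parameter name here is hypothetical; the point is that the keep list is resolved from the reference data before the job starts.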
OR
You might be able to use a combination of Column Import and Export stages in conjunction with Schema files based on the columns that you require.
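For the schema-file variant, a sketch of what a generated schema might look like (the column types here are assumptions; a pre-job step would write this file from the reference data before the run):

```
record (
  PK: int32;
  BK: string[max=30];
  COL_A: int32;
  COL_B: int32;
  COL_E: int32;
)
```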
OK, so you only know which columns you require once the job has started running.
In that case you will need to pivot your data, whilst at the same time adding a column which contains your column names.
This can be achieved in a variety of ways:
1. Use a Peek stage with an output link.
Set the Peek to return all records, all columns, all partitions, with the output mode set to Output (rather than Log) and Show Column Names set to True. This will output the data in the format
ColumnName:Value|ColumnName:Value etc. (where | is the delimiter as set in the Peek stage, under Options). You should then be able to split the values into column name and value columns, and perform a lookup on the column name, keeping only those where there is a match.
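The split-and-lookup step can be sketched in plain Python (illustration only; inside DataStage this would be a Transformer plus a Lookup stage, and the sample values are assumptions based on the data above):

```python
# Illustrative only: parse a Peek-style "name:value|name:value" record and
# keep only the columns named in the reference data.
peek_record = "PK:23|BK:Cust_123|COL_A:590|COL_B:15|COL_C:25|COL_D:35|COL_E:40"
reference_names = {"COL_A", "COL_B", "COL_E"}  # from the reference lookup

# Split on the Peek delimiter, then split each field into name and value.
pairs = [field.split(":", 1) for field in peek_record.split("|")]
kept = [(name, value) for name, value in pairs if name in reference_names]
```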
2. Import the column names from your DB. This option will depend on how you are reading your data, but you might be able to read the system tables to get your column names (easier if extracting from one table or a view). Once you have read in your header values you need to attach them to your data values. Again there are multiple ways of achieving this: you can add a position number column to both the headers and the values and perform a join, OR you could concatenate all the column headers into a single record and the values row into a single record, then use a loop variable in a Transformer to split them back out into column name/value pairs.
Pros: Can change your source and you won't have to change your job.
Cons: The order of columns must exactly match the order of the input data, else the wrong data can end up in the column.
There are also other ways - but the key is getting the column name associated with the value so that you can perform a lookup on your reference table.
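The concatenate-and-split technique from option 2 can be sketched as follows (plain Python for illustration; in DataStage the split would be a Transformer loop variable, and the sample records are assumptions based on the data above):

```python
# Illustrative only: pair a concatenated header record with a concatenated
# value record, then keep only the reference columns.
header_record = "PK,BK,COL_A,COL_B,COL_C,COL_D,COL_E"  # e.g. from system tables
value_record = "23,Cust_123,590,15,25,35,40"           # one source row
reference_names = {"COL_A", "COL_B", "COL_E"}

names = header_record.split(",")
values = value_record.split(",")
assert len(names) == len(values)  # order must match exactly (the Con above)

kept = [(n, v) for n, v in zip(names, values) if n in reference_names]
```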