Split one row in multiple based on identifier

ambasta · Post by **ambasta** » Mon Oct 04, 2010 2:20 pm

Hi,
We have a requirement where we are getting a fixed width file of length 60. The source system is sending the data in below format

A-----A-----B---------B---------B---------B---------@@@@@@@@
A-----A-----A-----A-----B---------B---------@@@@@@@@@@@@@@@@
A-----A-----A-----A-----A-----A-----A-----A-----B---------@

Where '-' represents data and '@' represents space...
Althought the above data(Just for represntation) doesn't look fixed width but actaully it is..All records is of length 60 bytes.

A is the identifier which contains 5 bytes of data and similarly B is identifier which contains 9 bytes of data.

Requirement is to break the data and put in two files with data like...
1st file...
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----

2nd File
B---------
B---------
B---------
B---------
B---------
B---------
B---------

Can you pls help me about the approach. I am using DataStage 7.5.3 version.

Thanks in Advance.

ray.wurlod · Post by **ray.wurlod** » Mon Oct 04, 2010 3:33 pm

Read the records as a single string, use TrimB() function to strip the trailing space characters. In a Transformer stage have as many output links as there are fields in the record, and send one field to each. Downstream of that use a Funnel stage to bring them together.

ambasta · Post by **ambasta** » Tue Oct 05, 2010 11:40 pm

Thanks for the suggession... I am not sure how many output link i would require and also the number of recors within the same row is not fixed??? Is there any other way to achieve it??
Thanks for you help!!

vkhandel · Post by **vkhandel** » Wed Oct 06, 2010 12:05 am

Is it confirmed that after first occurrances of "B" there would not be any occurrance of "A"? If that's the case, then you can use the Index() function in transformer to get the position of first occurrance of "B", and then use that to cut the input string into 2 columns - one having all the occurrances of "A" and other having only "B".

Now, you can send these columns separately into 2 datasets to be converted into multiple rows.

To achieve that, we may need to write a job-control (server job) to invoke 2 parallal jobs in loop -

1. The first job will cut the first five characters of dataset storing "A" to a master dataset in "Append" mode, and the remaining string to a temporary dataset in "Overwrite" mode.

2. The second job will simply rename the temporary dataset created in 1st job as the input dataset of 1st job.

The loop should be continued untill the dataset record count is reached to zero.

The same loop must be performed for dataset storing "B".

ray.wurlod · Post by **ray.wurlod** » Wed Oct 06, 2010 12:44 am

ambasta wrote:Thanks for the suggession... I am not sure how many output link i would require and also the number of recors within the same row is not fixed??? Is there any other way to achieve it??
Thanks for you help!!

Not more than 10 "A" records is possible and not more than 6 "B" records. Therefore you need 16 output links.

Constrain each output link on the existence of a value.

ambasta · Post by **ambasta** » Wed Oct 06, 2010 10:58 am

Thanks for replying.
Actaully, it is fixed that there will be no A type records after B type . But the character "B" can come anywhere in data. If "B" comes anywhere within data then use of Index function will fail as it is not actually the start identifier of B type records.

jcthornton · Post by **jcthornton** » Wed Oct 06, 2010 1:41 pm

Still doable, although it complicates things a bit more. Any other missing bits of information or requirements?

My first thought is to build the job as:

Input->Transformer->Funnel->Switch->Datasets

The transformer would have 10 outputs so it can handle any combination of A and B identifiers in a single record.

I would use cascading stage variable to split the string into your identifier (take the first identifier off the front of the string into 1 variable, put the remainder of the string into another variable. Rinse and repeat until you have 10 identifier variables - don't forget to properly handle nulls/spaces)

svID1
svString1
svID2
...
svString9
svID10

Each set of svID(N)/svString(N) is dependent upon svString(N-1). No need for an svString10 either.

Then each of your identifier variables is sent to a different output link with constraints to prevent invalid values (blanks) from being sent.

The switch would split the records by type.

And alternative would be to have multiple outputs for each identifier variable (1 if A and 1 if B) and use 2 funnels and no switch. But I don't think is it as easy to understand.

Others I work with would prefer the alternative because you avoid splitting an input, bringing them back together and splitting again.