Split one row in multiple based on identifier

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
ambasta
Participant
Posts: 93
Joined: Thu Jan 19, 2006 10:29 pm
Location: Bangalore

Split one row in multiple based on identifier

Post by ambasta »

Hi,
We have a requirement where we are getting a fixed width file of length 60. The source system is sending the data in below format

A-----A-----B---------B---------B---------B---------@@@@@@@@
A-----A-----A-----A-----B---------B---------@@@@@@@@@@@@@@@@
A-----A-----A-----A-----A-----A-----A-----A-----B---------@

Where '-' represents data and '@' represents space...
Althought the above data(Just for represntation) doesn't look fixed width but actaully it is..All records is of length 60 bytes.

A is the identifier which contains 5 bytes of data and similarly B is identifier which contains 9 bytes of data.

Requirement is to break the data and put in two files with data like...
1st file...
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----
A-----

2nd File
B---------
B---------
B---------
B---------
B---------
B---------
B---------

Can you pls help me about the approach. I am using DataStage 7.5.3 version.

Thanks in Advance.
ambasta
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Read the records as a single string, use TrimB() function to strip the trailing space characters. In a Transformer stage have as many output links as there are fields in the record, and send one field to each. Downstream of that use a Funnel stage to bring them together.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ambasta
Participant
Posts: 93
Joined: Thu Jan 19, 2006 10:29 pm
Location: Bangalore

Post by ambasta »

Thanks for the suggession... I am not sure how many output link i would require and also the number of recors within the same row is not fixed??? Is there any other way to achieve it??
Thanks for you help!!
ambasta
vkhandel
Participant
Posts: 35
Joined: Wed Oct 04, 2006 12:12 am
Location: Pune

Post by vkhandel »

Is it confirmed that after first occurrances of "B" there would not be any occurrance of "A"? If that's the case, then you can use the Index() function in transformer to get the position of first occurrance of "B", and then use that to cut the input string into 2 columns - one having all the occurrances of "A" and other having only "B".

Now, you can send these columns separately into 2 datasets to be converted into multiple rows.

To achieve that, we may need to write a job-control (server job) to invoke 2 parallal jobs in loop -

1. The first job will cut the first five characters of dataset storing "A" to a master dataset in "Append" mode, and the remaining string to a temporary dataset in "Overwrite" mode.

2. The second job will simply rename the temporary dataset created in 1st job as the input dataset of 1st job.

The loop should be continued untill the dataset record count is reached to zero.

The same loop must be performed for dataset storing "B".
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

ambasta wrote:Thanks for the suggession... I am not sure how many output link i would require and also the number of recors within the same row is not fixed??? Is there any other way to achieve it??
Thanks for you help!!
Not more than 10 "A" records is possible and not more than 6 "B" records. Therefore you need 16 output links.

Constrain each output link on the existence of a value.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ambasta
Participant
Posts: 93
Joined: Thu Jan 19, 2006 10:29 pm
Location: Bangalore

Post by ambasta »

Thanks for replying.
Actaully, it is fixed that there will be no A type records after B type . But the character "B" can come anywhere in data. If "B" comes anywhere within data then use of Index function will fail as it is not actually the start identifier of B type records.
ambasta
jcthornton
Premium Member
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

Still doable, although it complicates things a bit more. Any other missing bits of information or requirements?

My first thought is to build the job as:

Input->Transformer->Funnel->Switch->Datasets

The transformer would have 10 outputs so it can handle any combination of A and B identifiers in a single record.

I would use cascading stage variable to split the string into your identifier (take the first identifier off the front of the string into 1 variable, put the remainder of the string into another variable. Rinse and repeat until you have 10 identifier variables - don't forget to properly handle nulls/spaces)

svID1
svString1
svID2
...
svString9
svID10

Each set of svID(N)/svString(N) is dependent upon svString(N-1). No need for an svString10 either.

Then each of your identifier variables is sent to a different output link with constraints to prevent invalid values (blanks) from being sent.

The switch would split the records by type.


And alternative would be to have multiple outputs for each identifier variable (1 if A and 1 if B) and use 2 funnels and no switch. But I don't think is it as easy to understand.

Others I work with would prefer the alternative because you avoid splitting an input, bringing them back together and splitting again.
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
Post Reply