reading variable length data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

reading variable length data

Post by varshanswamy »

Hi,

I have a sequential file containing variable-length data, where the number of columns per line is not fixed.
For example:

1|12345|abcd|abef|1a
2|3456|
3|123456|abcde|abef|1a|12457



I would like to use PX to convert this information as follows:

1,12345
1,abcd
1,abef
1,1a
2,3456
3,123456
3,abcde
3,abef
3,1a
3,12457
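The transformation described above can be sketched outside DataStage, e.g. in Python, assuming '|' is the delimiter and a trailing '|' (as on the second sample line) just leaves an empty field to ignore:

```python
# A sketch of the desired pivot: split each line on '|', take the first
# field as the key, and emit one "key,value" row per remaining non-empty field.
def pivot(lines):
    out = []
    for line in lines:
        fields = line.split("|")
        key, rest = fields[0], fields[1:]
        for value in rest:
            if value:                      # a trailing '|' leaves an empty field
                out.append(key + "," + value)
    return out

sample = [
    "1|12345|abcd|abef|1a",
    "2|3456|",
    "3|123456|abcde|abef|1a|12457",
]
for row in pivot(sample):
    print(row)
```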

Thanks in advance.

regards,
varsha
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can build your custom operator to do it.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

One way to do this would be:

1. Read each row as a single column.
2. Process each row through a custom routine which:
a. uses count() to count the '|' delimiters to determine the number of fields in the row -- vN
b. uses field() to grab the first column's value -- vFirstValue
c. uses field() in a loop from 2 to N to gather subsequent field values -- vNextValue
d. uses writeseq() to write vFirstValue:",":vNextValue:char(10) to a sequential file on each pass through the loop.

I can think of at least one other way of doing this, using convert() to change '|' to @TM and then parsing... (I like MV fields.)

I'm sure there are many other ways to accomplish this as well, but the one illustrated above is quite simple, I think.
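The routine steps above can be sketched in Python; the count_() and field_() helpers below are my stand-ins for the BASIC Count() and Field() functions (Field() is 1-based, so field_() is too):

```python
# Python stand-ins for the BASIC functions named in the steps above.
def count_(s, delim):
    return s.count(delim)

def field_(s, delim, n):
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

def process_row(row, out):
    n = count_(row, "|") + 1            # step a: number of fields in the row
    first = field_(row, "|", 1)         # step b: the first column's value
    for i in range(2, n + 1):           # step c: fields 2 to N
        value = field_(row, "|", i)
        if value:                       # skip the empty field a trailing '|' creates
            out.append(first + "," + value + chr(10))  # step d: key,value + LF

rows = []
process_row("3|123456|abcde|abef|1a|12457", rows)
print("".join(rows), end="")
```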

"Coding is FUNdamental"
8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Varsha,

Lance is correct. EE expects input rows to be of an identical format -- the same number of columns for each row. Since your input rows have different numbers of columns (assuming you use the '|' as a delimiter), you'll need to specify a different delimiter (one that never occurs in the data) so that each row is treated as if it had the same number of columns -- in this case, just one.

Furthermore, you wish to output more rows than are in the input dataset. There aren't really any Parallel stages that provide this functionality exactly as you have presented it. Therefore, you should write a custom routine as Lance suggests. The char(10) is the line feed character and will effectively create additional output rows for you.

You should probably create a Server job to do this manipulation, as the custom code that Lance has presented is much more readily developed using the traditional Server platform rather than the newer Parallel framework.

Alan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you can figure out some way of loading everything after field 1 into a variable-length vector, the Split Vector stage will do exactly what you want.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

ray.wurlod wrote:If you can figure out some way of loading everything after field 1 into a variable-length vector, the Split Vector stage will do exactly what you want.
I would like to know what a variable-length vector is.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

All you want and more can be found in the Parallel Job Developer's Guide manual.
The section on data types in Chapter 2, and the chapters on Split Vector and Make Vector stages will be a good starting point.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

Ray,

Don't say it! I know, I have to modify my server-oriented thinking.
I'm gonna try that Vector thingy. ;-)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Lance, Ray,

I looked at those vector stages in the guide before my earlier response, but didn't think they would work since the incoming data has a variable number of fields. I'd be curious to see what either of you comes up with using the vector stages.

Alan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Hence my opening "If you can figure out some way of loading everything after field 1 into a variable-length vector". I won't be able to devote any time to it; my current gig is server-only (so I don't have PX to play with).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Lance, a vector is a close analogy to a multi-valued field. A vector of subrecords is a close analogy to an associated set of multi-valued fields. Does that help any?

For COBOL folks, a fixed length vector corresponds roughly to an OCCURS clause, and a variable length vector corresponds roughly to an OCCURS DEPENDING ON clause.
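As a rough sketch of that analogy (illustrative only -- not DataStage or COBOL syntax): a fixed-length vector always holds a known number of elements, while a variable-length vector carries its own element count, much like OCCURS DEPENDING ON:

```python
# Illustrative analogy only: a record whose repeating group has a fixed
# size vs. one whose size depends on a counter field in the same record.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FixedVectorRec:                      # roughly: OCCURS 5 TIMES
    key: str
    values: List[str] = field(default_factory=lambda: [""] * 5)

@dataclass
class VarVectorRec:                        # roughly: OCCURS DEPENDING ON n_values
    key: str
    n_values: int                          # the "DEPENDING ON" counter
    values: List[str] = field(default_factory=list)  # length == n_values

rec = VarVectorRec(key="3", n_values=5,
                   values=["123456", "abcde", "abef", "1a", "12457"])
assert rec.n_values == len(rec.values)
```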
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

Thanks for the info, Ray. I just started a new gig. Just a lot of reading this week (zzzzzzzzzzzzz). I'll try to make some time for this. :-)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
benny.lbs
Participant
Posts: 125
Joined: Wed Feb 23, 2005 3:46 am

Post by benny.lbs »

I have tried Lance's suggestion before; it works well. However, performance degrades as the incoming data grows larger. But anyway, it is one of the solutions.
lshort wrote:One way to do this would be:

1. Read each row as a single column.
2. Process each row through a custom routine which:
a. uses count() to count the '|' delimiters to determine the number of fields in the row -- vN
b. uses field() to grab the first column's value -- vFirstValue
c. uses field() in a loop from 2 to N to gather subsequent field values -- vNextValue
d. uses writeseq() to write vFirstValue:",":vNextValue:char(10) to a sequential file on each pass through the loop.

I can think of at least one other way of doing this, using convert() to change '|' to @TM and then parsing... (I like MV fields.)

I'm sure there are many other ways to accomplish this as well, but the one illustrated above is quite simple, I think.

"Coding is FUNdamental"
8)