reading variable length data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

reading variable length data

Post by varshanswamy »

Hi,

I have a sequential file containing variable-length data, where the number of columns per line is not fixed.
For example:

1|12345|abcd|abef|1a
2|3456|
3|123456|abcde|abef|1a|12457



I would like to use PX to convert this information as follows:

1,12345
1,abcd
1,abef
1,1a
2,3456
3,123456
3,abcde
3,abef
3,1a
3,12457
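The transformation described above can be sketched outside DataStage, e.g. in Python, assuming '|' is the delimiter and a trailing '|' (as on the second sample line) just leaves an empty field to ignore:

```python
# A sketch of the desired pivot: split each line on '|', take the first
# field as the key, and emit one "key,value" row per remaining non-empty field.
def pivot(lines):
    out = []
    for line in lines:
        fields = line.split("|")
        key, rest = fields[0], fields[1:]
        for value in rest:
            if value:                      # a trailing '|' leaves an empty field
                out.append(key + "," + value)
    return out

sample = [
    "1|12345|abcd|abef|1a",
    "2|3456|",
    "3|123456|abcde|abef|1a|12457",
]
for row in pivot(sample):
    print(row)
```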

Thanks in advance.

regards,
varsha
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

You can build your custom operator to do it.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

One way to do this would be:

1. Read each row as a single column.
2. Process each row through a custom routine which:
a. uses count() to count the '|' delimiters to determine the number of fields in the row -- vN
b. uses field() to grab the first column's value -- vFirstValue
c. uses field() in a loop from 2 to N to gather subsequent field values -- vNextValue
d. uses writeseq() to write vFirstValue:",":vNextValue:char(10) to a sequential file on each pass through the loop.

I can think of at least one other way of doing this, using convert() to change '|' to @TM and then parsing... (I like MV fields.)

I'm sure there are many other ways to accomplish this as well, but the one illustrated above is quite simple, I think.
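The routine steps above can be sketched in Python; the count_() and field_() helpers below are my stand-ins for the BASIC Count() and Field() functions (Field() is 1-based, so field_() is too):

```python
# Python stand-ins for the BASIC functions named in the steps above.
def count_(s, delim):
    return s.count(delim)

def field_(s, delim, n):
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

def process_row(row, out):
    n = count_(row, "|") + 1            # step a: number of fields in the row
    first = field_(row, "|", 1)         # step b: the first column's value
    for i in range(2, n + 1):           # step c: fields 2 to N
        value = field_(row, "|", i)
        if value:                       # skip the empty field a trailing '|' creates
            out.append(first + "," + value + chr(10))  # step d: key,value + LF

rows = []
process_row("3|123456|abcde|abef|1a|12457", rows)
print("".join(rows), end="")
```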

"Coding is FUNdamental"
8)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Varsha,

Lance is correct. EE expects input rows to be of an identical format -- the same number of columns for each row. Since your input rows have different numbers of columns (assuming you use the '|' as a delimiter), you'll need to specify a different delimiter (one that never occurs in the data) so that each row is treated as if it had the same number of columns -- in this case, just one.

Furthermore, you wish to output more rows than are in the input dataset. There aren't really any Parallel stages that provide this functionality exactly as you have presented it. Therefore, you should write a custom routine as Lance suggests. The char(10) is the line feed character and will effectively create additional output rows for you.

You should probably create a Server job to do this manipulation, as the custom code that Lance has presented is much more readily developed using the traditional Server platform rather than the newer Parallel framework.

Alan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you can figure out some way of loading everything after field 1 into a variable-length vector, the Split Vector stage will do exactly what you want.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
varshanswamy
Participant
Posts: 48
Joined: Thu Mar 11, 2004 10:32 pm

Post by varshanswamy »

ray.wurlod wrote:If you can figure out some way of loading everything after field 1 into a variable-length vector, the Split Vector stage will do exactly what you want.
I would like to know what a variable-length vector is.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

All you want and more can be found in the Parallel Job Developer's Guide manual.
The section on data types in Chapter 2, and the chapters on Split Vector and Make Vector stages will be a good starting point.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

Ray,

Don't say it! I know, I have to modify my server-oriented thinking.
I'm gonna try that Vector thingy. ;-)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
alanwms
Charter Member
Posts: 28
Joined: Wed Feb 26, 2003 2:51 pm
Location: Atlanta/UK

Post by alanwms »

Lance, Ray,

I looked at those vector stages in the guide before my earlier response, but didn't think they would work since the incoming data has a variable number of fields. I'd be curious to see what either of you comes up with using the vector stages.

Alan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Hence my opening "If you can figure out some way of loading everything after field 1 into a variable-length vector". I won't be able to devote any time to it; my current gig is server-only (so I don't have PX to play with).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Lance, a vector is a close analogy to a multi-valued field. A vector of subrecords is a close analogy to an associated set of multi-valued fields. Does that help any?

For COBOL folks, a fixed length vector corresponds roughly to an OCCURS clause, and a variable length vector corresponds roughly to an OCCURS DEPENDING ON clause.
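As a rough sketch of that analogy (illustrative only -- not DataStage or COBOL syntax): a fixed-length vector always holds a known number of elements, while a variable-length vector carries its own element count, much like OCCURS DEPENDING ON:

```python
# Illustrative analogy only: a record whose repeating group has a fixed
# size vs. one whose size depends on a counter field in the same record.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FixedVectorRec:                      # roughly: OCCURS 5 TIMES
    key: str
    values: List[str] = field(default_factory=lambda: [""] * 5)

@dataclass
class VarVectorRec:                        # roughly: OCCURS DEPENDING ON n_values
    key: str
    n_values: int                          # the "DEPENDING ON" counter
    values: List[str] = field(default_factory=list)  # length == n_values

rec = VarVectorRec(key="3", n_values=5,
                   values=["123456", "abcde", "abef", "1a", "12457"])
assert rec.n_values == len(rec.values)
```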
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
lshort
Premium Member
Posts: 139
Joined: Tue Oct 29, 2002 11:40 am
Location: Toronto

Post by lshort »

Thanks for the info, Ray. I just started a new gig. Just a lot of reading this week (zzzzzzzzzzzzz). I'll try to make some time for this. :-)
Lance Short
"infinite diversity in infinite combinations"
***
"The absence of evidence is not evidence of absence."
benny.lbs
Participant
Posts: 125
Joined: Wed Feb 23, 2005 3:46 am

Post by benny.lbs »

I have tried Lance's suggestion before; it works well. However, performance degrades as the incoming data grows larger. But anyway, it is one of the solutions.
lshort wrote:One way to do this would be:

1. Read each row as a single column.
2. Process each row through a custom routine which:
a. uses count() to count the '|' delimiters to determine the number of fields in the row -- vN
b. uses field() to grab the first column's value -- vFirstValue
c. uses field() in a loop from 2 to N to gather subsequent field values -- vNextValue
d. uses writeseq() to write vFirstValue:",":vNextValue:char(10) to a sequential file on each pass through the loop.

I can think of at least one other way of doing this, using convert() to change '|' to @TM and then parsing... (I like MV fields.)

I'm sure there are many other ways to accomplish this as well, but the one illustrated above is quite simple, I think.

"Coding is FUNdamental"
8)