Best practice for handling a large variable list

sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia


Post by sbass1 »

Say I have a file or table:

Column1
Column2
Column3
...
Column100

with no real structure to the column names (Column1 etc. is just for illustration).

I want to concatenate the values of all these columns together separated by @VM.

Right now I:

* create stage variables V1 - V100 (real names, kept as short as possible)
* drag my columns of interest to the stage variables to map them
* create stage variable DLM = @VM (makes changing the delimiter easier)
* stage variable CONCAT = V1:DLM:V2:DLM:V3:DLM:...:V100

Then do whatever I want to do with CONCAT.
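
For concreteness, the derivations end up looking something like this (link, column, and variable names are illustrative):

V1 = In.Column1
V2 = In.Column2
...
V100 = In.Column100
DLM = @VM
CONCAT = V1 : DLM : V2 : DLM : V3 : DLM : ... : DLM : V100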

I was hoping there would be a better way, as this code is a royal PITA!@# (Ray, you want me to spell it out? :wink: ) to build and maintain.

I was hoping arrays might help.

As a frame of reference, this is how I would do it in SAS:

CONCAT = catx(@VM,of V: ) * all vars starting with V
CONCAT = catx(@VM,of V1-V100) * all vars named V#, in numeric naming sequence
CONCAT = catx(@VM,of V1--V100) * all vars between V1 and V100 in the program data vector
CONCAT = catx(@VM,of Column1--Column100) * skip copying the columns to stage variables altogether.

"of" is a keyword to the compiler saying what follows is a variable list.

All the DS compiler would have to do is support variable-list tokenization in the source code and build the list for me. I'm sure the compiled code would be no different from specifying:

CONCAT = catx(@VM,V1,V2,V3,...,V100)

but the compiler would do the work for me.
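
DataStage BASIC has no equivalent variable-list syntax, but a small server routine can come close by taking the whole delimited record as a single argument and looping over its fields. A minimal sketch, assuming a hypothetical routine CatFieldsVM(InRec, InDelim):

* CatFieldsVM(InRec, InDelim): return every field of a delimited record,
* concatenated with value marks between them.
      Ans = ''
      NumFlds = DCount(InRec, InDelim)
      For FldNo = 1 To NumFlds
         Ans<1,-1> = Field(InRec, InDelim, FldNo)   ;* append as a new value (@VM separator)
      Next FldNo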

As an aside, what is the best way to get enhancement requests to IBM?
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

If the source is a seq file, describe the full record (all columns) as a single column with a different field delimiter.

In the transformation, replace all delimiters with @VM.
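
In a Transformer derivation that replacement is a single Convert() call, e.g. if the whole record comes in as one comma-delimited column (the link and column names here are made up):

CONCAT = Convert(',', @VM, In.WholeRecord)

Convert() works character by character, so this assumes a single-character field delimiter.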
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

Thanks for the reply.

I should have explained the problem better.

My sequential file is something like:

Key1 Key2 Var1 Var2 Key3 Var3 Var4 Var5 ... VarN VarX

I want to create the output:

Key1 Key2 Key3 Var1:@VM:Var2:@VM:Var3:@VM:...@VM:VarN

In other words, I want to copy the key variables, then concatenate a subset of the other variables delimited by @VM. This is to create a CRC of "variables of interest".

I'd also like the solution to support DRS stages as a source.

Thanks,
Scott
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Seems like this is why you've asked about the RowMerger stage.

One thought... perhaps array notation would be more manageable and would get you the Value Marks automatically. Use ArrayName<-1> to push an element onto the end of the array in a loop. And if you end up needing a different delimiter, then Convert() can easily change them all to something else in one swell foop.
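
A minimal sketch of that pattern in routine code, assuming V1-V3 already hold the column values (names are illustrative); note that <-1> appends at the field-mark level, so a Convert() swaps the marks afterwards:

      DynList = ''
      DynList<-1> = V1      ;* <-1> pushes each element on as a new field (@FM)
      DynList<-1> = V2
      DynList<-1> = V3
      DynList = Convert(@FM, @VM, DynList)   ;* swap every delimiter in one go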
-craig

"You can never have too many knives" -- Logan Nine Fingers
sbass1
Premium Member
Posts: 211
Joined: Wed Jan 28, 2009 9:00 pm
Location: Sydney, Australia

Post by sbass1 »

chulett wrote:Seems like this is why you've asked about the RowMerger stage.

One thought... perhaps array notation would be more manageable and would get you the Value Marks automatically. Use ArrayName<-1> to push an element onto the end of the array in a loop. And if you end up needing a different delimiter, then Convert() can easily change them all to something else in one swell foop.
Yes, that's correct. I feel like my posts on this subject are "all over the place", but I've tried to keep each issue as a separate post with a meaningful subject. Not sure if that's good or bad. Probably good for future search hits, but bad for "seeing" the entire current problem in context.

A summary of various issues I've run into:

* NULLs in the seq file input are problematic when concatenating the columns to build the CRC: concatenating NULL onto anything = NULL (one workaround is sketched after this list).

* "Bulk converting" the NULLs to empty string by fiddling with the seq file format (Map to empty string, "NULL" for alternative pad character) fixes this issue, but is problematic when writing the target table. For example, writing an empty string to a datetime column in SQL Server results in a date of 1900-01-01. Basically, if the input data was NULL, I want the target table to contain NULL (and barf if NOT NULL is specified, so the input data gets fixed).

* The long list of variables to concatenate, and the lack of variable-list support in DS, make building the code cumbersome.

* I can't get the RowMerger stage to concatenate only a subset of the output columns. What I want is all of the input fields (needed to update the target table), properly delimited (split out into individual fields, with NULLs as appropriate), plus a merged column of selected columns used to build the CRC value (which determines whether to close out a historical record).

* The Merge stage doesn't accept input streams, so I'm having a problem using it in the same job to merge two seq files together. I get a code execution timing problem. I'm unwilling to run the merge as a separate job.

My current approach is to use two seq file stages reading the same input source file. One maps missing columns to SQL NULL, the other maps missing columns to empty string. The second, empty-string stage writes out the key fields, maps the desired "variables of interest" to temporary stage variables, concatenates them together, and calculates the CRC. This is written to a hashed file of keys + CRC. That hashed file is then joined back to the first seq file to attach the CRC to the entire input stream. This is also joined with a similar hashed file kept in sync with the target table, holding keys + CRC for the current SCD2 record. CRCNew <> CRCOld is the trigger to close out the old history record and write out the new one.

* If my input source is changed to a DRS table, NULLs again become problematic. The approach I'm using is to write the DRS (in this case SQL Server) table to a seq file, then follow the approach above (fiddle with the seq file format to map NULLs to empty string). I'm concerned that this additional I/O will hurt performance on a large table. I'm investigating whether SQL Server has an option to "bulk convert" NULLs to empty string, preferably one that works with SELECT *, so the NULL-to-empty-string conversion can happen directly in the database query.
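
One way to keep the NULL handling local to the CRC is to wrap each "variable of interest" before concatenating, along these lines (a sketch only; link, column, and stage variable names are illustrative, and CRC32 stands for whichever CRC routine you already call):

svVar1 = If IsNull(In.Var1) Then '' Else In.Var1
svVar2 = If IsNull(In.Var2) Then '' Else In.Var2
...
svVarN = If IsNull(In.VarN) Then '' Else In.VarN
svConcat = svVar1 : @VM : svVar2 : @VM : ... : @VM : svVarN
svCRC = CRC32(svConcat)

The real input columns stay untouched, so NULLs can still flow through to the target table while the CRC input never sees one.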

Thanks...
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

What happens if you write the data in order of key fields first, followed by data fields?

Assuming you have k1, k2, v1, v2, k3, v3, v4, ..., vn, write them to a sequential file as k1, k2, k3, v1, v2, ..., vn.

Then read the file defining just the keys and one additional catch-all column, and convert that column's delimiters into @VM.

Or do a row merge and take the values as
k1 = Field(mergedValue, ',', 1)
k2 = Field(mergedValue, ',', 2)
...
km = Field(mergedValue, ',', m)

restOfData = mergedValue[Index(mergedValue, ',', m) + 1, 99999]