Is there a better way? Adding / Combining Rows

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Is there a better way? Adding / Combining Rows

Post by asorrell »

Ok - to start with, I have a solution for this particular problem, I just want to know if there is a BETTER solution.

I'm getting a daily feed that contains sales amounts that are tagged with time-of-sale in 15 minute increments. A simplified look at a row of data looks something like this:

Simplified Input:

Store Date____ Time___ Amount
0001 20060418 093000 $157.00

The required XML output format needs amounts summed per 15-minute segment and then output in a single row per store, per date, with 96 comma-separated values: an empty string if no sales were submitted for a 15-minute period, or the sum of that timeslot's sales amounts. For example:

Required Output:

Store Date___ Summed Sales Per 15-Minute Increments
0001 20060418 ,,,,,,,,,,$300.00,$456.00,$950.00,,,$340.00,,,,,,,, (etc)

Here's my current solution:

1) Job1 sorts and aggregates data for each 15-minute increment. It outputs these aggregates into a temp file (call it the actual_sales file). It also uses a de-duplicator to produce a separate file containing each valid store / sales-date combination.

2) I have a static sequential file containing the 96 possible valid timeslots for each day (from 001500 to 240000). A post-processing script for Job1 takes the file with valid store / sales-date combinations and uses the static increment file to create 96 rows for each store / sales-date combination (call it the possible_sales file).

3) Job2 takes the possible_sales file and performs a left outer join with the actual_sales file. This gives me 96 records per store/day combo, with nulls for rows that had no sales in that time slot and valid amounts when sales occurred.

4) Post-processing scripts for Job2 then use awk to take the first entry for each store/date, strip off the time increment, and concatenate the sales amounts (null or otherwise) from the 95 following records. This is stuffed into a final sequential file. (A rough Python sketch of steps 2-4 follows this list.)

5) Job3 then takes this output and formats it for XML - which is another story for another day...
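For anyone who wants to see the logic outside DataStage, here is a minimal Python sketch of steps 2-4. The sample rows, field layout, and dollar formatting are hypothetical; it just shows the densify-then-concatenate idea:

# Steps 2-4 in miniature: build the 96 possible timeslots, left-outer-join
# the aggregated sales against them, and emit one comma-separated row per
# store/date. Sample data below is made up for illustration.
actual_sales = [
    ("0001", "20060418", "023000", "$300.00"),
    ("0001", "20060418", "024500", "$456.00"),
    ("0001", "20060418", "030000", "$950.00"),
]

# Step 2: the 96 valid timeslots per day, 001500 through 240000.
timeslots = [f"{m // 60:02d}{m % 60:02d}00" for m in range(15, 1441, 15)]

# Step 3: index the aggregates so missing slots fall out as empty strings.
sales = {(s, d, t): amt for (s, d, t, amt) in actual_sales}
store_dates = sorted({(s, d) for (s, d, _, _) in actual_sales})

# Step 4: one row per store/date with 96 comma-separated slot values.
for store, date in store_dates:
    print(store, date, ",".join(sales.get((store, date, t), "") for t in timeslots))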

I had a very limited amount of time to find a solution, and this was the first one I thought of (inelegant as it was). It works, though I'm a bit dissatisfied with its heavy reliance on single-threaded scripts. I was just wondering if there is a better solution to the two major issues:

1) Creating missing rows
2) Combining elements from rows

I tried various parallel stages and at this point I'm stumped. Any thoughts?
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

All PX, or can you use a database (Oracle maybe) to facilitate?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
emma
Premium Member
Posts: 95
Joined: Fri Aug 08, 2003 10:30 am
Location: Montreal

Post by emma »

Did you try the switch stage?

input_file ---> Filter_Store ---> Switch_Slot ---> Funnel_sequence_mode ---> Transformer ---> output_file

or

input_file ---> Switch_Slot ---> Funnel_sequence_mode ---> Transformer ---> output_file
Thanks,
Emma
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Is there a better way? Adding / Combining Rows

Post by asorrell »

Ken - Due to some external constraints here I'm trying to do this in PX / local processing only w/o a database.

Emma - I'll play around with switch this evening and see if it works - thanks for the suggestions.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

Something we did (but with a lot fewer fields):

In a transformer, create the record with all the fields, so 96 instances of logic like: if time=001500 then amount else NULL for field amount0015; if time=003000 then amount else NULL for field amount0030; ...

After this, do an aggregation where you sum all these fields (which is really just to collapse everything into one record, since it sums one value with a lot of nulls).

We're doing this with good performance for 12 columns; I don't know about 96. But I feel it will look better in the job.
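In Python terms the idea is roughly this (field names and sample rows are illustrative only):

from collections import defaultdict

timeslots = [f"{m // 60:02d}{m % 60:02d}00" for m in range(15, 1441, 15)]
slot_index = {t: i for i, t in enumerate(timeslots)}

rows = [
    ("0001", "20060418", "093000", 157.00),
    ("0001", "20060418", "093000", 143.00),
]

# Transformer step: spread each row into a 96-wide record where only the
# matching slot column carries the amount. Aggregator step: summing per
# store/date collapses those sparse records into one.
pivot = defaultdict(lambda: [None] * len(timeslots))
for store, date, time, amount in rows:
    cells = pivot[(store, date)]
    i = slot_index[time]
    cells[i] = (cells[i] or 0.0) + amount

for (store, date), cells in sorted(pivot.items()):
    print(store, date, ",".join("" if c is None else f"${c:.2f}" for c in cells))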
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Found a better way! Combining Rows / Consolidating Fields

Post by asorrell »

It turns out there IS a better way. After giving it some thought, we found a way to use a transformer stage to consolidate the rows.

First of all, the consolidation process requires that each "group" (i.e., store / date / increments) be contiguous and in sorted order. To do that I set the transformer to run in sequential mode and sorted the input by store, then date, then increment.

I then created several stage variables with the following names and properties:

svNewStoreDate, VarChar(1), default="1"
svCompletedRecord, VarChar(2000), default=""
svBuildingRecord, VarChar(2000), default=""
svWriteFlag, VarChar(3), default="No"
svStore, VarChar(4), default="0000"
svDate, VarChar(8), default="00000000"

The variables are then derived as follows in the transformer. Please note: the order is VERY important, as these must be evaluated in the proper sequence for this to work:


svNewStoreDate: If (incoming.Store <> svStore) or (incoming.Date <> svDate) Then 1 Else 0

svCompletedRecord: If svNewStoreDate then svBuildingRecord Else ''

svBuildingRecord: If svNewStoreDate then incoming.Store:"|":incoming.Date:"|": incoming.IncrementSalesAmount Else svBuildingRecord:",":incoming.IncrementSalesAmount

svWriteFlag: If svNewStoreDate and (svCompletedRecord <> '') then 'Yes' Else 'No'

svStore: incoming.Store

svDate: incoming.Date


The output link for the transformer has the following constraint in it:
svWriteFlag="Yes"

and it outputs svCompletedRecord to an outgoing VarChar(2000) column


In essence - this transformer notes when either the store or date changes. When they DO change - it realizes the record it has been building is completed, and should be written out (unless it is empty because it is the first record), so it turns the WriteFlag on to allow the constraint to output the record. If the store/date match what it saw on the previous record, it grabs the field it needs and appends it to the record it is building.

The only thing to note is that a dummy record (I used 9999|99999999|999999) must be appended to the very end of the input file to "kick out" the very last record being constructed.
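If it helps to see the evaluation order outside the transformer, here is a Python sketch of the same stage-variable logic, including the dummy trailer (sample rows are made up):

# Mirrors the top-to-bottom stage-variable evaluation, assuming input
# sorted by store, then date, then increment.
rows = [
    ("0001", "20060418", "$300.00"),
    ("0001", "20060418", "$456.00"),
    ("0002", "20060418", "$950.00"),
    ("9999", "99999999", "999999"),   # dummy record to flush the last group
]

sv_store, sv_date = "0000", "00000000"   # svStore / svDate defaults
building = ""                            # svBuildingRecord

for store, date, amount in rows:
    new_group = (store != sv_store) or (date != sv_date)   # svNewStoreDate
    completed = building if new_group else ""              # svCompletedRecord
    if new_group:
        building = f"{store}|{date}|{amount}"              # start a new record
    else:
        building = f"{building},{amount}"                  # append this slot
    if new_group and completed:                            # svWriteFlag = "Yes"
        print(completed)                                   # constraint lets the row out
    sv_store, sv_date = store, date                        # remember the keys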

It runs about two orders of magnitude faster than the awk script I was using (even in "single stream" mode). Processing time for a few thousand records dropped from 20 minutes (ugh!) to less than 15 seconds.

Hope this comes in handy!
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

If you place a Sort stage, part of this logic can be done in there. I'm not sure about the exact option name, but there is an option for a key-change column. It will be set to 1 if the previous record had a different key than the current one.
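In Python terms the key-change column behaves roughly like this (assuming rows already sorted by store and date):

rows = [("0001", "20060418"), ("0001", "20060418"), ("0002", "20060418")]

prev = None
for row in rows:
    key_change = 1 if row != prev else 0   # 1 on the first row of each key group
    print(row, key_change)                 # could replace the svNewStoreDate derivation
    prev = row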