How to Handle Overlapping Priorities in Data Values

myukassign · Post by **myukassign** » Mon May 24, 2010 6:32 am

Hi ,

I have a output to achieve for which my experience in datastage is not helping much. I request if anyone can help me.

My Input
-----------

Code: Select all

FILE 1

PID        ORG    DC             Stdate           EndDate            Price

P1         10       1           1-MAR-10          30-OCT-10          100


FILE 2

PID        ORG    DC             Stdate           EndDate            Price

P1         10       1           22-MAR-10          1-OCT-10          98

MY OUTPUT SHOULD BE
-----------------------------

Code: Select all

PID         ORG    DC             Stdate           EndDate            Price

P1         10       1           1-MAR-10          21-MAR-10          100
P1         10       1           22-MAR-10         01-OCT-10          98
P1         10       1           2-OCT-10          30-OCT-10          100

Please suggest logic to do that.

Rules here are.

The records in the second file should always get priority. The first file start and end date and price should adjust according to a record in the second file. Please note that in my output, the dates/period of price is closing before the start date in second file

The keys of the table / file are PID, ORG & DC

Any ideas to do this .....

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Mon May 24, 2010 8:55 am

Interesting problem. Four questions:

1) I assume it is a many to many relationship? There can be multiple records in both File 1 and File 2 with the same keys?

2) What do you do if a price in File 1 is completely overlaid (start to finish) by a higher priority price in file 2? Eliminate the record?

3) Do any of the start / end dates in File 2 overlap for identical keys? (I assume not.)

4) Do you have complete coverage of all dates or would their possibly be gaps with no prices (for example holidays)?

myukassign · Post by **myukassign** » Mon May 24, 2010 8:53 pm

asorrell wrote:Interesting problem. Four questions:

1) I assume it is a many to many relationship? There can be multiple records in both File 1 and File 2 with the same keys?

2) What do you do if a price in File 1 is completely overlaid (start to finish) by a higher priority price in file 2? Eliminate the record?

3) Do any of the start / end dates in File 2 overlap for identical keys? (I assume not.)

4) Do you have complete coverage of all dates or would their possibly be gaps with no prices (for example holidays)?

1. Yes...

2. yes...

3. possible

4. No there wont be gaps......

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Tue May 25, 2010 9:45 am

Because of the "many to many" relationships and the multiple overlays that are possible, I'm thinking the best way to do this might be to build a lookup table from the data, and then use that to re-create your date ranges. This is a complex operation because some of it really would be easier to do in a programming language, but I think it can be done.. At a very high level, here's an overview of the steps...

1) make one pass through file one - storing prices in a lookup table via:
PID, Org, DC, "inclusive dates" - which means if you have an entry from May 1 through May 5, there would be five entries in the table (one for each day).

2) Make a pass through file two, doing the same - which means that in many cases you are overlaying entries for file one.

3) Then use a row generator to generate rows for each valid PID, ORG, DC, date combination and lookup the price.

4) Feed the output to a transformer that uses save variables to store the current PID, Org, DC, StartDate and Price. It should also have a flag that indicates a "price change" (ie - current incoming price and stored price don't match). A constraint on the output link should only output a "record" when the price changes. When the price changes, output the saved values with the previous day as the "end date". Then store the new incoming values and keep looking for the next price change. The constraint should also not output any date ranges where the price is "null" - just in case there are gaps.

You'll also have to have a "trailer" of some sort to close out the last range.

Even with this overview, this will be difficult to do and will require an extensive knowledge of DataStage. Not an easy task...

Also - because you have marked this as a "parallel" job - I'm going to move this to the parallel forum in a couple of days.

Sainath.Srinivasan · Post by **Sainath.Srinivasan** » Wed May 26, 2010 2:09 am

Probably a simpler solution could be to

1.) combine both files into a single one
2.) ordered by key, fromDate and toDate
3.) append an extra dummy (end-of-file) record (if parallel, then 1 for each node)
4.) Using stage variables, compare current record with previous record and write out previous record when there is a change in the key.
By this way, you can ensure correct start and end dates are provided

Give it a shot and respond if you have any issues.

myukassign · Post by **myukassign** » Wed May 26, 2010 3:20 am

Sainath.Srinivasan wrote:Probably a simpler solution could be to

1.) combine both files into a single one
2.) ordered by key, fromDate and toDate
3.) append an extra dummy (end-of-file) record (if parallel, then 1 for each node)
4.) Using stage variables, compare current record with previous record and write out previous record when there is a change in the key.
By this way, you can ensure correct start and end dates are provided

Give it a shot and respond if you have any issues.

Can you please elaborate step 3. I did not understand.

chulett · Post by **chulett** » Wed May 26, 2010 6:15 am

You need something to drive out the last group change and a known "EOD" record will allow that. This capability will be built natively into "a future" (the next?) release.

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Wed May 26, 2010 9:11 am

The solution posted by Sainath.Srinivasan is not as simple as he proposed. It would require VERY careful tracking of both File 1 and File 2 current prices in order to determine the correct output in overlapping dates.

For example:

File 1 has prices of: May 1 - May 10, $100 and May 11 - May 20, $50
File 2 has price of May 5 - May 15, $75

Should result in required output of
May 1 - May 4 $100
May 5 - May 15, $75
May 16 - May 20, $50

Also (2nd example)

File 1 has price of June 1 - June 20, $100
File 2 has prices of June 5 - June 10, $75 and June 15 - June 18, $50

Should result in required output of

June 1 - June 4, $100
June 5 - June 10, $75
June 11 - June 14, $100
June 15 - June 18, $50
June 19 - June 20, $100

Doing that using just saved variables might be do-able, but would be fairly difficult, requiring numerous "state" variables to keep track of active and inactive prices. I was trying to post something that might be simpler to implement, but in truth, there probably isn't an "easy" solution to this, it will be tough either way.

Note - I still plan on moving this thread to parallel forum - and will probably change the title to "Handling Overlapping Priorities in Data Values". The current title about "Extrapolation" isn't relevant, there is no extrapolation going on here - everything is spelled out.

myukassign · Post by **myukassign** » Mon May 31, 2010 5:03 am

I was doing Unix shell scripting to achive the same. It's in the final stage. So I will use an exec command stge n sequncer to call my Unix script wo do the same for me.

It was really helpful. Thanks a lot for everyone who supported me.

I am marking this post as 'Resolved with workaround"

Thanks.

swades · Post by **swades** » Tue Jun 01, 2010 2:31 pm

I have DS job design ready for this issue. Where I can upload the Job DSX