calculating values based on previous row values

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
tsanan
Premium Member
Premium Member
Posts: 8
Joined: Thu Mar 10, 2005 8:45 am

calculating values based on previous row values

Post by tsanan »

Hello All

I am having a problem calculating values based on previous row values while Hash partitioning. I would appreciate any advise.

I have account-wise daily data and am trying to create monthly aggregates:

acctid, calid, val1, val2,val3,closing - this is the layout of my data

there are roughly about 2000 different accounts and 1500 cal ids for each account.

For each account, closing of day1 becomes opening of day2 and so on. I am using stage variables for getting the previous day closing as current day's opening
and I have calculations based on closing, opening and val1,val2, val3 etc.

The job has
an enterprise stage for sourcing the data,
a lookup stage to lookup the calid,
transformer stage where the above calculations are done
and a remove duplicates stage to output only the last day of month's values.

The job works fine and I get the right results when I run all the stages in sequential mode. But when I partition the transformer in Hash partition and sort on the acctid and calid (I also tried sorting on acctid, calid in the SQL statement and Hash partitioning just on acctid as well) I do not get the right previous days values. I am thinking I need all rows of one acctid to be processed on one node and in the increasing order of calid. :oops:

Any inputs as to where I am wrong and how to rectify it, would be greatly appreciated.

Thanks
tsanan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What's the key of the reference table for the lookup? What partitioning algorithm are you using for the reference input link?

I suspect you may be missing some calid lookups if you're not hash partitioning on calid on both inputs to the Lookup stage (or Entire partitioning on calid on the reference input).

Given that you logic works in sequential mode, I'd guess there's nothing wrong with your logic.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tsanan
Premium Member
Premium Member
Posts: 8
Joined: Thu Mar 10, 2005 8:45 am

Post by tsanan »

Thanks for your response Ray.
ray.wurlod wrote:What's the key of the reference table for the lookup? What partitioning algorithm are you using for the reference input link.
The key to the lookup is calendarid which is a surrogate id and am looking up for the calendar year and month.
ray.wurlod wrote:I suspect you may be missing some calid lookups if you're not hash partitioning on calid on both inputs to the Lookup stage (or Entire partitioning on calid on the reference input)..
I have been testing it with 158 rows and see the same number 158 on the reference link as well.


ray.wurlod wrote:Given that you logic works in sequential mode, I'd guess there's nothing wrong with your logic.

Yes, Ray it works in sequential mode and it also works on single node. When I run in parallel I am not getting the right previous values.
I was wondering if the O/S has anything to do with it.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How are the data partitioned on both inputs to the Lookup stage?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tsanan
Premium Member
Premium Member
Posts: 8
Joined: Thu Mar 10, 2005 8:45 am

Post by tsanan »

ray.wurlod wrote:How are the data partitioned on both inputs to the Lookup stage?
I had been trying different ways:

1. Used Order by actid,calid clause in the enterprise stage which is a stream input to the lookup stage. The reference lookup being Parallel execution mode and Preserve partition set to Propogate.
The lookup stage - Parallel, HAsh Partition (actid,calid) and Propogate.

2. Also tried without the order by clause and a sort on the the Lookup stage.

I am having the logic for obtaining the previos values and the calculations on the transformer stage which is the next one. I am having a hash partition and sort on actid (tried with actid and calid aswell).

I am a little confused as to what HAsh Partitioning I should use - Actid alone or Actid,Calid to be able to process all rows pertaining one actid on one node at the same time having the rows in the calid order.

Appreciate all the help I get from dsxchange - though I have been lurking and benefitting from the searches I often do, I never took the time to come out to even thank all of you.

Thanks
tsanan
tsanan
Premium Member
Premium Member
Posts: 8
Joined: Thu Mar 10, 2005 8:45 am

Re: calculating values based on previous row values

Post by tsanan »

tsanan wrote: The job works fine and I get the right results when I run all the stages in sequential mode. But when I partition the transformer in Hash partition and sort on the acctid and calid (I also tried sorting on acctid, calid in the SQL statement and Hash partitioning just on acctid as well) I do not get the right previous days values. I am thinking I need all rows of one acctid to be processed on one node and in the increasing order of calid. :oops:

Any inputs as to where I am wrong and how to rectify it, would be greatly appreciated.

I was able to resolve this. What I did was:
Use Order by and not sort in any of the subsequent stages. I just did a Hash partition by acctid and now get right values.

Thanks for your inputs, Ray.

Thanks
tsanan
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

You can mark your post as "Resolved" then. :)
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply