Implementation problem

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

piyu
Participant
Posts: 52
Joined: Thu Jul 21, 2005 12:22 am
Location: Singapore

Implementation problem

Post by piyu »

I have an implementation/optimization problem.

a. There is a set of week-ending dates which comes in a file.
b. There is a set of account IDs which comes from a second stream (the driver).
c. There is a history table (over 650 million records) which I have to query:
1. To get the subset of account IDs from the table which have the week-ending dates given in the file mentioned above.
2. To do a self-join on the extracted accounts to find accounts related by a particular field.

Till now, this was being done manually in a single query: create a temp table of the accounts mentioned in (b) above, join this table to the history table (c), and self-join the history table to itself, with the date range (a) hard-coded in the query.
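
In rough SQL terms it was something like this (all table and column names here are made up for illustration):

    CREATE TABLE tmp_accts (account_id INTEGER);
    -- tmp_accts is loaded with the driver account IDs from (b)

    SELECT DISTINCT h2.account_id
    FROM   tmp_accts d
    JOIN   acct_hist h1 ON h1.account_id  = d.account_id
    JOIN   acct_hist h2 ON h2.related_fld = h1.related_fld  -- the self-join
    WHERE  h1.weekend_dt IN ('2005-07-02', '2005-07-09')    -- dates hard-coded
    AND    h2.weekend_dt = h1.weekend_dt;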

When I try to implement this in DataStage, the history table read pulls in all the records, which is where the problem is. The history table has over 650 million records and it takes several hours just to read them all in.

Is there a way I can avoid creating temp tables and avoid doing an SQL join, and have it implemented in DataStage?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can restrict the large table SELECT to those rows that fall between the largest and smallest dates in your driver data. Then you should be able to perform a range lookup.
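
For example, with the bounds supplied as DataStage job parameters (the table and column names below are illustrative only):

    SELECT account_id, related_fld, weekend_dt
    FROM   acct_hist
    WHERE  weekend_dt BETWEEN '#MIN_DATE#' AND '#MAX_DATE#'

That cuts the read down to the date window you actually need before any lookup happens.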
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

When you say 'accounts', does it imply 'customer accounts created' or 'account transactions'?

Assuming you have partitioned the data by date, it will be easy to access by the date value. Also assuming that accountId is indexed, joining on this column will be faster. So 650m rows should not be a big problem.

You can try populating a temporary table (as you already had), doing the join at the DB level rather than in DataStage. This table can then be the source for your downstream process.
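
For example (names illustrative; the exact temp-table syntax depends on your database):

    INSERT INTO tmp_result
    SELECT h2.account_id, h2.weekend_dt
    FROM   driver_accts d
    JOIN   acct_hist h1 ON h1.account_id  = d.account_id
    JOIN   acct_hist h2 ON h2.related_fld = h1.related_fld
                       AND h2.weekend_dt  = h1.weekend_dt;

The DataStage job then only has to read tmp_result.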
piyu
Participant
Posts: 52
Joined: Thu Jul 21, 2005 12:22 am
Location: Singapore

Post by piyu »

It's an account profile weekly history. The incoming count is barely 6000, and the weekly history has 750 million records falling in that window (that's assuming I hard-code the dates).

I think a sparse lookup for every account, or an in-SQL join with the accounts in a temp table, would be better. :?
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

For such a small incoming volume of data, a sparse lookup is the best.
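
A sparse lookup fires the query once per incoming row; the user-defined SQL in the database stage would look something like this (names illustrative; ORCHESTRATE.column is how the Enterprise stages reference the incoming row's columns):

    SELECT h2.account_id
    FROM   acct_hist h1
    JOIN   acct_hist h2 ON h2.related_fld = h1.related_fld
                       AND h2.weekend_dt  = h1.weekend_dt
    WHERE  h1.account_id = ORCHESTRATE.account_id
    AND    h1.weekend_dt = ORCHESTRATE.weekend_dt

With only ~6000 driver rows, that is ~6000 indexed probes instead of a scan of the whole history table.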

But why don't you do the join in the DB itself rather than in DataStage?
piyu
Participant
Posts: 52
Joined: Thu Jul 21, 2005 12:22 am
Location: Singapore

Post by piyu »

I will need to get the incoming account data into a load/replace table (getting a table created is very difficult in this establishment! :cry: ) to be able to do the self-join plus the join to the incoming account data. Anyway, this was a migration project... Will probably go with the sparse lookup or SQL join now. :) Thanks.