
Implementation problem

Posted: Wed Aug 18, 2010 8:47 pm
by piyu
I have an implementation/optimization problem.

a. There is a set of weekending dates which comes in a file.
b. There is a set of account ids which comes from a second stream (driver)
c. There is a history table (over 650 million records) which I have to query:
1. To get a subset of account ids from the table which have the weekending dates given in the file mentioned above.
2. To do a self join on the extracted accounts to find accounts related by a particular field.

Till now, this was being done manually in a single query: a temp table was created with the accounts mentioned in (b) above, that table was joined to the history table (c), the history table was self-joined to itself, and the date range (a) was hard-coded in the query.
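
Roughly like this (the table and column names here are simplified for illustration, not the real ones):

  -- TMP_DRIVER_ACCTS holds the account ids from the driver stream (b)
  SELECT h2.ACCOUNT_ID, h2.WEEKEND_DT
  FROM   TMP_DRIVER_ACCTS d
  JOIN   ACCT_WEEKLY_HIST h1 ON h1.ACCOUNT_ID  = d.ACCOUNT_ID
  JOIN   ACCT_WEEKLY_HIST h2 ON h2.RELATED_FLD = h1.RELATED_FLD
  WHERE  h1.WEEKEND_DT IN ('2010-07-03', '2010-07-10')  -- dates from the file (a), hard-coded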

When I try to implement this in DataStage, the history table read pulls in all the records, which is where the problem is. The history table has over 650 million records and it takes several hours just to read them all in.

Is there a way that I can avoid creating temp tables and doing an SQL join, and implement this in DataStage instead?

Posted: Wed Aug 18, 2010 11:24 pm
by ray.wurlod
You can restrict the large table SELECT to those rows that fall between the largest and smallest dates in your driver data. Then you should be able to perform a range lookup.
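
For example, if job parameters pMinDate and pMaxDate carry the smallest and largest dates from your driver data (the parameter, table and column names here are only illustrative), the source SELECT on the history table might look like:

  SELECT ACCOUNT_ID, WEEKEND_DT, RELATED_FLD
  FROM   ACCT_WEEKLY_HIST
  WHERE  WEEKEND_DT BETWEEN '#pMinDate#' AND '#pMaxDate#'

The range lookup against the individual driver dates then happens inside the job, against a much smaller slice of the table.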

Posted: Thu Aug 19, 2010 2:39 am
by Sainath.Srinivasan
When you say 'accounts', does it imply 'customer accounts created' or 'account transactions' ?

Assuming you have partitioned the data by the date, it will be easy to access by the date value. Also assuming that accountId is indexed, joining on this column will be faster. So 650m rows should not be a big problem.

You can try populating a temporary table (as you already had), performing the join at the database level rather than in DataStage. This table can then be the source for your downstream process.

Posted: Fri Aug 20, 2010 3:11 am
by piyu
It's an account profile weekly history. The incoming count is barely 6000, and the weekly history has 750 million records falling in that window (that's assuming I am hard-coding the dates).

I think a sparse lookup for every account will be better, or an in-SQL join with the accounts in a temp table. :?

Posted: Fri Aug 20, 2010 3:21 am
by Sainath.Srinivasan
For such a small incoming volume of data, a sparse lookup is best.

But why don't you do the join in the db itself rather than in DataStage?
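
With the connector set to sparse lookup, the user-defined SQL would be roughly like this (table and column names are invented for illustration), keyed on the value coming down the input link:

  SELECT h.ACCOUNT_ID, h.WEEKEND_DT, h.RELATED_FLD
  FROM   ACCT_WEEKLY_HIST h
  WHERE  h.ACCOUNT_ID = ORCHESTRATE.ACCOUNT_ID

At roughly 6000 incoming rows, that is only a few thousand indexed singleton queries against the history table.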

Posted: Fri Aug 20, 2010 3:40 am
by piyu
I will need to get the incoming account data into a load/replace table (getting a table created is very difficult in this establishment! :cry: ) to be able to do the self join plus the join to the incoming account data. Anyway, this was a migration project... Will probably go with the sparse lookup or SQL join now :) Thanks.