How to handle many records in datasets?


say2prabhu1
Participant
Posts: 27
Joined: Mon Jul 09, 2007 1:06 am

How to handle many records in datasets?

Post by say2prabhu1 »

I am extracting data from two views based on key column values.
Both views have more than 100 million records, and I am joining them on key columns, which produces several million records.
The source is an Oracle table (40,000 records), and I am pulling records from the views based on the key column.

1) I can't use a sparse lookup, because only part of the key column is used as the key (the lookup key length is 21 and the source table key length is 10, so I need a substring in the lookup - see the sketch after this list).
2) If I use a normal lookup, it takes too long just to read the data from the view itself (approximately 30 hours).
3) Finally, I write the records into a dataset. After some amount of time the job aborts with the error message "Not enough space. Error finalizing / saving table /tmp/dynLUT1183942da1c5888".
4) If I put a condition in the lookup such as rownum < 1000000, it works.
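
To illustrate point 1, the lookup would need to do something like the following against the joined view (the view, column, and bind names here are placeholders, not the real ones):

    SELECT v.col_a, v.col_b
    FROM   joined_view v
    WHERE  SUBSTR(v.lookup_key, 1, 10) = :src_key  -- only 10 of the 21 key characters match the source key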



I am planning to write the records into an Oracle temp table, and then read the records from that temp table for further use instead of reading the data from the dataset (roughly as sketched below). Is that a good move?
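
Something along these lines is what I have in mind - the table, view, and column names are only illustrative:

    CREATE TABLE stg_view_join AS
      SELECT v1.key_col, v1.col_a, v2.col_b
      FROM   view1 v1
      JOIN   view2 v2 ON v1.key_col = v2.key_col;

    CREATE INDEX stg_view_join_ix ON stg_view_join (key_col);

The index is there so that later keyed reads against the staging table do not have to scan the whole of it.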


What is your suggestion on this issue?



Note:
Source: Oracle table (approximately 40,000 records or more)
Lookup: join of two Oracle views on a key (both views have more than 100 million records)
Lookup key: needs a substring in the lookup
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If you insist upon using a Lookup stage, then the entire reference data set must fit in memory. Clearly in your case it doesn't, so allocate a lot more scratchdisk resource.
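
Scratchdisk is declared per node in the parallel configuration file (the one pointed to by $APT_CONFIG_FILE). A minimal sketch, with purely illustrative host and path names:

    {
      node "node1"
      {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/disk0" {pools ""}
        resource scratchdisk "/scratch/ds0" {pools ""}
        resource scratchdisk "/scratch/ds1" {pools ""}
      }
    }

Point the scratchdisk entries at file systems with plenty of free space; /tmp, where your dynLUT file was being written, is rarely big enough.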

Search for ways to make the extraction more efficient - do the views REALLY need to expose 100 million records? Can you, for example, extract from the base tables rather than from the views, so that the join can occur at that level and the time taken to materialize the views can be saved? This only requires SELECT privilege on the relevant columns of the base tables; you should be able to convince a reasonable DBA that SELECT privilege does not allow you to do any damage. I suspect that a goodly part of your 30 hours is spent materializing two 100-million-row views.

If the DBA won't permit access to the base tables, how about asking for materialized views, so that at least that part of the extraction has already been done? Joins on tables can be supported (sped up) through the use of indexes; joins on views cannot.
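
By way of illustration only (the object names are invented), a materialized view that pre-computes the join, plus an index to support keyed access to it, would look something like:

    CREATE MATERIALIZED VIEW mv_ref_join
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      AS
      SELECT b1.key_col, b1.col_a, b2.col_b
      FROM   base_table1 b1
      JOIN   base_table2 b2 ON b1.key_col = b2.key_col;

    CREATE INDEX mv_ref_join_ix ON mv_ref_join (key_col);

Your job then reads mv_ref_join as if it were an ordinary table, and the materialization cost is paid once, at refresh time, rather than on every run.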
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
say2prabhu1
Participant
Posts: 27
Joined: Mon Jul 09, 2007 1:06 am

Post by say2prabhu1 »

Any suggestions?

Please post your comments on this.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What is this, an interview? You asked for suggestions, I gave you suggestions based on more than 12 years' experience, and you immediately respond with "any suggestions?". I resent that, I really do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.