Loading Huge Data

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
pradkumar
Charter Member
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Loading Huge Data

Post by pradkumar »

Hi

I am getting around 40-50 million records from an oracle table. I need to do some join and simple transformations and load the data into another table.

ODBC-->Transformer-->ODBC
I went through the forum which discused the use of IPC, Multiple instances etc.
I am thinking of creating multiple instances of my job.
I know to create multiple instances in DS. But the concept of how to partition is something new to me.
Could anyone explain me in little bit more detail, how to accomplish this.

Thanks
Pradeep Kumar
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yuck... that will take for-ev-a. Look into making it a two step process... first extract and transform, landing it to a flat file. This job can be 'multi-instanced' rather easily, using mod in Oracle and a 'partition number' as the Invocation ID. Make sure the output files have a unique name as well by including the Invocation ID as part of the filename.

Then suggest you bulk load the data if at all possible. That could get your load portion of this little passion play down to minutes rather than many, many (many) hours.
-craig

"You can never have too many knives" -- Logan Nine Fingers
vinodhraj
Participant
Posts: 53
Joined: Mon Sep 12, 2005 6:51 am

Post by vinodhraj »

Hi,

seperate the table based on keys

for eg: if the table consists of 10 county data, split it into 10 based on country, giving country as a job parameter.

Give the county code as multiple instance, shedule it in such a way so that it runs parallelly.

Vinod
saikir
Participant
Posts: 92
Joined: Wed Nov 08, 2006 12:25 am
Location: Minneapolis
Contact:

Post by saikir »

Hi,

You can also try using the link partitioner and link collector stage. The link collector will split the load into links and the other end you can collect all the links using link collector.
Sreedhar
Participant
Posts: 187
Joined: Mon Oct 30, 2006 12:16 am

Post by Sreedhar »

While extracting the data from the source give an GROUP by clause so that the data coming out of the ODBC is in sorted order, Do Not partition the data at any further stages, do your normal transformation and load.

This should improve the throughput.


Regards,
Regards,
Shree
785-816-0728
aakashahuja
Premium Member
Premium Member
Posts: 210
Joined: Wed Feb 16, 2005 7:17 am

Post by aakashahuja »

The job type heading of ur post says Server, so if u r using 7.5.2, then u might consider redeveloping ur job in PX
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Most people that post here as 'Server' peoples do so because their company has not licensed the PX product.
-craig

"You can never have too many knives" -- Logan Nine Fingers
pradkumar
Charter Member
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

I am thinking to create a job sequence by allowing multiple instances of the job. I decided to create 4 instances of the original job.

I am thinking to partition the input data based on a period (as ia m loading only 2005 data). Periods are different. So this should be given as a parameter and in where clause. If so, could you give some guidance on how to do it
Pradeep Kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You have it exactly.

Use the job parameter references to replace constants in the WHERE clause of the extraction SQL. For example:

Code: Select all

WHERE sel_date >= '#StartDate#' AND sel_date < '#EndDate#'
You can do this on the Selection tab - you do not need to resort to user-defined SQL.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pradkumar
Charter Member
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

Thx for the reply.

So we need to give different ranges of satrt date and end date in each job sequence.

I need to do partition first on year and then on period (1-12 for 12 months). My Where clause in main job is

"WHERE tablLE_NAME.PERIOD_NAME = '#PERIOD#'.

PERIOD_Name is a column in my source table. has values like(Jan-05,Feb-05.........,Jan-04......) Each year is having around 5 million records. So first we decided to load data year wise and use 4 instances in each year. Each instance will load three periods of data.
So now this PERIOD is appearing in each and every joba ctivity in sequence. Now I need to pass the values to these 4 different joba ctivities for PERIOD. Liek First Joba ctivity should get Jan-05,Feb-05,Mar-05 and second one hsould get Apr,May,Jun and so on..


How these values can be passed from sequencer to main job. I am not sure how to specify it in sequencer.

How to accomplish this
Pradeep Kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If you generate YEAR upstream in the job sequence (maybe as a sequence parameter) you can use a Routine activity to generate the four date ranges, perhaps pass them back as delimited strings, and unpack them. For example the routine could return a dynamic array containing four fields each with three PERIOD values. You would then load three job parameters with these and specify then in a WHERE clause of the form

Code: Select all

WHERE tablename.PERIOD_NAME IN ('#jpPeriod1#', '#jpPeriod2#', '#jpPeriod3')
Example Routine:

Code: Select all

FUNCTION Periods(Year)
* Returns four fields containing three period names each
Year2 = Right(Year,2) ; * rightmost two digits of year
Ans = ""
Ans<1> = "Jan-" : Year2 : @VM : "Feb-" : Year2 : @VM : "Mar-" : Year2
Ans<2> = "Apr-" : Year2 : @VM : "May-" : Year2 : @VM : "Jun-" : Year2
Ans<3> = "Jul-" : Year2 : @VM : "Aug-" : Year2 : @VM : "Sep-" : Year2
Ans<4> = "Oct-" : Year2 : @VM : "Nov-" : Year2 : @VM : "Dec-" : Year2
RETURN(Ans)
In instance number 4, which handles the fourth quarter, job parameter values are derived as

Code: Select all

jpPeriod1     RoutineActivityName.$ReturnValue<4,1>
jpPeriod2     RoutineActivityName.$ReturnValue<4,2>
jpPeriod3     RoutineActivityName.$ReturnValue<4,3>
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pradkumar
Charter Member
Charter Member
Posts: 393
Joined: Wed Oct 18, 2006 1:09 pm

Post by pradkumar »

Thx very much. i will try doing the thing you suggested
Pradeep Kumar
Post Reply