Loading Huge Data

pradkumar · Post by **pradkumar** » Tue Nov 28, 2006 9:45 pm

Hi

I am getting around 40-50 million records from an oracle table. I need to do some join and simple transformations and load the data into another table.

ODBC-->Transformer-->ODBC
I went through the forum which discused the use of IPC, Multiple instances etc.
I am thinking of creating multiple instances of my job.
I know to create multiple instances in DS. But the concept of how to partition is something new to me.
Could anyone explain me in little bit more detail, how to accomplish this.

Thanks

chulett · Post by **chulett** » Tue Nov 28, 2006 9:54 pm

Yuck... that will take for-ev-a. Look into making it a two step process... first extract and transform, landing it to a flat file. This job can be 'multi-instanced' rather easily, using mod in Oracle and a 'partition number' as the Invocation ID. Make sure the output files have a unique name as well by including the Invocation ID as part of the filename.

Then suggest you bulk load the data if at all possible. That could get your load portion of this little passion play down to minutes rather than many, many (many) hours.

vinodhraj · Post by **vinodhraj** » Wed Nov 29, 2006 1:11 am

Hi,

seperate the table based on keys

for eg: if the table consists of 10 county data, split it into 10 based on country, giving country as a job parameter.

Give the county code as multiple instance, shedule it in such a way so that it runs parallelly.

Vinod

saikir · Post by **saikir** » Wed Nov 29, 2006 4:36 am

Hi,

You can also try using the link partitioner and link collector stage. The link collector will split the load into links and the other end you can collect all the links using link collector.

Sreedhar · Post by **Sreedhar** » Wed Nov 29, 2006 4:45 am

While extracting the data from the source give an GROUP by clause so that the data coming out of the ODBC is in sorted order, Do Not partition the data at any further stages, do your normal transformation and load.

This should improve the throughput.

Regards,

aakashahuja · Post by **aakashahuja** » Wed Nov 29, 2006 8:30 am

The job type heading of ur post says Server, so if u r using 7.5.2, then u might consider redeveloping ur job in PX

chulett · Post by **chulett** » Wed Nov 29, 2006 8:48 am

Most people that post here as 'Server' peoples do so because their company has not licensed the PX product.

pradkumar · Post by **pradkumar** » Wed Nov 29, 2006 1:48 pm

I am thinking to create a job sequence by allowing multiple instances of the job. I decided to create 4 instances of the original job.

I am thinking to partition the input data based on a period (as ia m loading only 2005 data). Periods are different. So this should be given as a parameter and in where clause. If so, could you give some guidance on how to do it

ray.wurlod · Post by **ray.wurlod** » Wed Nov 29, 2006 2:02 pm

You have it exactly.

Use the job parameter references to replace constants in the WHERE clause of the extraction SQL. For example:

Code: Select all

WHERE sel_date >= '#StartDate#' AND sel_date < '#EndDate#'

You can do this on the Selection tab - you do not need to resort to user-defined SQL.

pradkumar · Post by **pradkumar** » Wed Nov 29, 2006 2:59 pm

Thx for the reply.

So we need to give different ranges of satrt date and end date in each job sequence.

I need to do partition first on year and then on period (1-12 for 12 months). My Where clause in main job is

"WHERE tablLE_NAME.PERIOD_NAME = '#PERIOD#'.

PERIOD_Name is a column in my source table. has values like(Jan-05,Feb-05.........,Jan-04......) Each year is having around 5 million records. So first we decided to load data year wise and use 4 instances in each year. Each instance will load three periods of data.
So now this PERIOD is appearing in each and every joba ctivity in sequence. Now I need to pass the values to these 4 different joba ctivities for PERIOD. Liek First Joba ctivity should get Jan-05,Feb-05,Mar-05 and second one hsould get Apr,May,Jun and so on..

How these values can be passed from sequencer to main job. I am not sure how to specify it in sequencer.

How to accomplish this

ray.wurlod · Post by **ray.wurlod** » Wed Nov 29, 2006 5:10 pm

If you generate YEAR upstream in the job sequence (maybe as a sequence parameter) you can use a Routine activity to generate the four date ranges, perhaps pass them back as delimited strings, and unpack them. For example the routine could return a dynamic array containing four fields each with three PERIOD values. You would then load three job parameters with these and specify then in a WHERE clause of the form

Code: Select all

WHERE tablename.PERIOD_NAME IN ('#jpPeriod1#', '#jpPeriod2#', '#jpPeriod3')

Example Routine:

Code: Select all

FUNCTION Periods(Year)
* Returns four fields containing three period names each
Year2 = Right(Year,2) ; * rightmost two digits of year
Ans = ""
Ans<1> = "Jan-" : Year2 : @VM : "Feb-" : Year2 : @VM : "Mar-" : Year2
Ans<2> = "Apr-" : Year2 : @VM : "May-" : Year2 : @VM : "Jun-" : Year2
Ans<3> = "Jul-" : Year2 : @VM : "Aug-" : Year2 : @VM : "Sep-" : Year2
Ans<4> = "Oct-" : Year2 : @VM : "Nov-" : Year2 : @VM : "Dec-" : Year2
RETURN(Ans)

In instance number 4, which handles the fourth quarter, job parameter values are derived as

Code: Select all

jpPeriod1     RoutineActivityName.$ReturnValue<4,1>
jpPeriod2     RoutineActivityName.$ReturnValue<4,2>
jpPeriod3     RoutineActivityName.$ReturnValue<4,3>

pradkumar · Post by **pradkumar** » Wed Nov 29, 2006 7:32 pm

Thx very much. i will try doing the thing you suggested