Processing Large Amounts of Data

ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Processing Large Amounts of Data

Post by ashik_punar »

Hi All,

We are using Server jobs in our project and have a requirement to process large amounts of data. The jobs need to run on a monthly basis, while the data files are received daily. We load the data files into our source tables (Oracle 10g) every day, and at the end of the month we fetch all the closed and active account data from the source, aggregate it, apply the business logic and write the results to flat files. On average we receive 10 million records every 3 days, so by the end of the month we have nearly 100 million records to process. We have already designed the job, but for one day of data it takes nearly an hour, so the month-end run would take many hours. We tried a number of design options, described below:

Option 1 (execution time nearly 1 hour for 2 million records; we hit the database twice for reading and once for writing):
Job 1:
We take all the distinct Transaction Ids and load them into a hash file. This is the first hit on the database.
Job 2:
In this job we read all the records from the source and aggregate the data by distinct Transaction Id, loading the results into a second hash file.
Job 3: In this job we read the hash file of distinct Transaction Ids and look each Id up against the hash file holding the Transaction Ids and their aggregated amounts, pick up the amounts for each Id and load them into the table.

Option 2 (execution time nearly 35 minutes for 2 million records; we hit the database once for reading and once for writing):
Job 1:
We read the source data, aggregate it by distinct Transaction Id and load the results into a hash file.
Job 2: In this job we read the hash file holding the distinct Transaction Ids and their aggregated amounts and load them into the table.

The problem with both of the above designs is that they take a very long time to run. Please advise whether we can improve or change the design to boost the performance of the jobs.

Thank You in advance.
Punar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Consider ways in which you can build in some parallel processing, for example multi-instance jobs.
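For example, each instance of a multi-instance job could be given its own disjoint slice of the source data. A minimal sketch, assuming a numeric Transaction Id and two made-up job parameters (#PartNum#, #PartCount#) substituted into the extract SQL:

    -- Each instance extracts its own slice of the source rows.
    -- #PartCount# = number of instances, #PartNum# = this instance (0-based);
    -- both are hypothetical job parameters, as are the table and column names.
    SELECT transaction_id, amount, account_status
    FROM   daily_transactions
    WHERE  MOD(transaction_id, #PartCount#) = #PartNum#

Run N invocations of the job, each with a different #PartNum#, and the extraction and processing happen in parallel.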

Can you create an Oracle table of the distinct transaction IDs and join to that, to minimize the number of rows that you need to extract from Oracle? Such a table is called a "bucket table" - it contains a "bucket of keys".
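A rough sketch of what that might look like, with made-up object names:

    -- Hypothetical "bucket of keys": the Transaction Ids actually needed
    -- for the month-end run, built once inside Oracle.
    CREATE TABLE txn_bucket AS
      SELECT DISTINCT transaction_id
      FROM   accounts
      WHERE  account_status IN ('ACTIVE', 'CLOSED');

    -- Join the extract to the bucket so Oracle returns only rows for those keys.
    SELECT t.transaction_id, t.amount
    FROM   daily_transactions t
    JOIN   txn_bucket b ON b.transaction_id = t.transaction_id;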

Separate processing for active and closed accounts?

What does "applying the business logic" involve? Are there any opportunities for efficiency here?

Have you monitored your jobs to identify the "hot spots", both with the Monitor and by collecting statistics on the stage tracing tab when you issue a job run request through Director?

These are just some quick ideas without much thought; I'm sure there are others that would surface if the actual application were known in more detail.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
tkbharani
Premium Member
Posts: 71
Joined: Wed Dec 27, 2006 8:12 am
Location: Sydney

Post by tkbharani »

Here is one suggestion.

Since you receive the data daily, process it daily so that the month-end run can complete quickly.
You can process each day's 10 million records, aggregate them and store the daily summaries. At the end of the month you then only need a final summary or rollup of the daily summarized data, and by that point there are comparatively few records left to process.
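As a rough sketch only, assuming a hypothetical daily summary table (all names and columns are illustrative):

    -- Run daily: summarize the day's ~10 million rows down to one row per key.
    INSERT INTO txn_daily_summary (summary_date, transaction_id, total_amount)
    SELECT :run_date, transaction_id, SUM(amount)
    FROM   daily_transactions
    WHERE  TRUNC(load_date) = :run_date
    GROUP BY transaction_id;

    -- Run at month end: roll up the daily summaries, now over far fewer rows.
    SELECT   transaction_id, SUM(total_amount) AS month_total
    FROM     txn_daily_summary
    WHERE    summary_date BETWEEN :month_start AND :month_end
    GROUP BY transaction_id;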

If the database write is what is taking the time, consider EXCHANGE PARTITION. It is faster than SQL*Loader or a fast load.
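An illustrative example of the technique, with made-up partition and table names: load the month's output into an ordinary staging table first, then swap it into the partitioned target, which is a data-dictionary operation rather than a row-by-row load:

    -- Swap the fully loaded staging table into the target's monthly partition;
    -- no rows are physically copied, only the dictionary is updated.
    ALTER TABLE monthly_results
      EXCHANGE PARTITION p_month_end WITH TABLE monthly_results_stage
      INCLUDING INDEXES WITHOUT VALIDATION;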
Thanks, BK
ashik_punar
Premium Member
Posts: 71
Joined: Mon Nov 13, 2006 12:40 am

Post by ashik_punar »

Hi All,
Thank you for your valuable input. We have improved performance by changing our source queries and shifting the load onto the database rather than doing everything in DataStage. I will keep you posted on further developments. Thank you.
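For example, the kind of change we made looks roughly like this (table and column names are simplified placeholders, not our actual queries): the aggregation now happens inside Oracle, so only one summary row per Transaction Id is extracted instead of every detail row.

    -- Simplified, illustrative version of the reworked source query.
    SELECT   transaction_id,
             SUM(amount) AS total_amount
    FROM     daily_transactions
    WHERE    load_date BETWEEN :period_start AND :period_end
    GROUP BY transaction_id;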

Punar Deep Singh
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Building a little on the groundwork Ray and Bharani have already laid for you:

I had the same problem with very large amounts of data taking too much time to process. I removed the hashed file creation part of the process and switched to staging tables holding the same information (actually a little more, which removed another step later on). Also, if your process allows you to split inserts and updates into separate processing files, you may be able to take advantage of bulk loading for the inserts and whatever method works best for the updates (sorry, I'm not fluent in Oracle :oops: ).
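On the Oracle side that split might look something like the following, though this is only a sketch with placeholder names and I'll leave the exact Oracle mechanics to others:

    -- New rows: direct-path (bulk) insert from the insert-only staging table.
    INSERT /*+ APPEND */ INTO target_amounts (transaction_id, total_amount)
    SELECT transaction_id, total_amount
    FROM   stage_inserts;

    -- Changed rows: applied separately, here with a correlated update.
    UPDATE target_amounts t
    SET    t.total_amount = (SELECT s.total_amount
                             FROM   stage_updates s
                             WHERE  s.transaction_id = t.transaction_id)
    WHERE  EXISTS (SELECT 1
                   FROM   stage_updates s
                   WHERE  s.transaction_id = t.transaction_id);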

Hope that helps!
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"