Page 1 of 1

Profile Stage performance

Posted: Sat Oct 30, 2004 11:17 am
by mee
We have about 100GB of data that we want to profile. For business reason, we can't use the sampling, and must profile the entire data set.

The profiling seem to be hanging (or not able to complete). The data is ASCII data. Machine is a 4 CPU.

I have two questions.

a) Has anyone else seen this behaviour ?
b) What is the largest amount of data volume that anyone has profiled in 2 to 3 hours ?

Thanks for any help here.

Posted: Sun Oct 31, 2004 1:10 am
by ray.wurlod
At my current site a few runs were performed that ran more than 2-3 hours, and were definitely nowhere near 100GB. Of course, elapsed time is as much as anything a function of how complex the analysis you're performing is. More complex analyses take longer; this is a simple fact of life.
It being Sunday as I write I don't have access to the actual figures. However, I think you're being over-optimistic expecting to process that volume of data in so short an interval.

Posted: Sun Oct 31, 2004 9:22 pm
by mee
Ray, Thanks for the reply.

We are doing some basic column profiling and inter table key relationship. How much time should I allow for that volume of data ?

How much data did you profile in 2-3 hours and what sort of analysis? This would help me compare also.

Posted: Mon Nov 01, 2004 1:37 am
by ray.wurlod
Here's a reply from the guy who actually ran the Profile Stage analyses at my current site. Hope it's useful.

I'm not sure how useful this information is going to be because, as you pointed out, there are many factors that affect the processing time. Probably the most relevant example that I can recount is the table that contained 54 million rows (one of the claim history tables if memory serves me correctly). This table took 14 hours to perform the column analysis within ProfileStage. One of the factors that affects performance is the number of columns that the table contains as well as the number of rows that the table contains. You can generally get an idea of how long the column analysis will take by setting it off and checking the log to see how long it takes for ProfileStage to process each 10,000 rows. The time gap between each of the 10,000 rows can then be used, in conjunction with the total number of rows in the table, to calculate a rough estimate of how long it will take to perform the column analysis for all of the rows in that table. It won't be exact, but it would give you an idea. Once you've got this estimate, you can kill the ProfileStage job and then plan to schedule and run it "properly".

The table in question had only 28 columns. I'm afraid I don't have any size figures available. ProfileStage was running on a 4 CPU AIX machine, but other DataStage processes, also processing similar volumes, were also running during the same overnight period, some moving data from a remote Informix database to a local Oracle database.

So it really is "how long is a piece of string".

Posted: Wed Nov 24, 2004 1:52 pm
by metabill
Other factors can also affect throughput for other phases of analysis. For instance, since table analysis looks at functional dependencies among the columns within a given table, the number of columns and the degree to which the column values imply dependencies will affect how many dependencies are found, and hence the elapsed time required.

one trick is to use 4 cpu work for u

Posted: Wed Feb 16, 2005 8:57 am
by namisbis
i noticed that u got 4 cpu server.
One trick probably can help to fasten the column analysis:
f.inst: u hv 4 tables to analyze, u can create 4 jobs,each contains only one table on the analysis-server, profilestage wl use the 4 cpu do 4 analysis simulatanously, which wl much shrtern the time.

hope this help u.

franky