How to measure performance?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

How to measure performance?

Post by kduke »

I was wondering, if I created some jobs to gather job run time statistics, would you run them on your site and share the results? If we knew your hardware, like a Sun 480, and then rows per second for all jobs by stage type, and megabytes per second for the same, we could compare. Let's say your rows per second were averaging 500 and everyone else's were 5,000. Why is your performance so slow? Let's say we could also somehow measure network speed. Maybe we could also measure rows per second on the exact same job for 10,000 rows and for 1,000 rows inserted into Oracle, a hash file or a sequential file. This would isolate connection time and give you true rows per second to compare across databases or systems. If you are a lot slower than similar systems, then you have a problem.
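The two-run idea above reduces to simple arithmetic: if elapsed time is modelled as a fixed connection overhead plus rows divided by a steady-state rate, then two timed runs of the same job at different row counts are enough to solve for both. A minimal sketch in Python (the function name and the example timings are made up for illustration):

```python
def true_throughput(n_small, t_small, n_large, t_large):
    """Estimate steady-state rows/sec and fixed connection overhead
    from two timed runs of the same job at different row counts.
    Model: elapsed = overhead + rows / throughput."""
    throughput = (n_large - n_small) / (t_large - t_small)
    overhead = t_small - n_small / throughput
    return throughput, overhead

# Hypothetical example: 1,000 rows in 7 s, 10,000 rows in 25 s.
rate, conn = true_throughput(1_000, 7.0, 10_000, 25.0)
# rate is the "true" rows per second; conn is the connection time
# that a single-run measurement would have hidden inside the average.
```

With those example numbers the fixed overhead works out to several seconds, which is exactly the distortion a single small-run measurement would bake into its rows-per-second figure.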

The question is: would you share your numbers to see how you rank? If DSX could guarantee that nobody could see anyone else's numbers, would you share?
Mamu Kim
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I would like to do the same for jobs built. If we could gather the number of jobs, links and columns created each month, then the difference is how many new items were created. Maybe we cannot easily get the number of items changed. If we could, then we could post the highest number of jobs per developer, the lowest, the average, and your rank.

We could post the average number of stages per job, average number of links and columns. We could post the max and min as well. It would be nice to know how many jobs per target table and average number of source tables per target.
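Once per-developer counts have been collected, the max/min/average and ranking described above are a few lines of code. A sketch, assuming the counts are already gathered (the developer names and numbers here are entirely made up):

```python
from statistics import mean

# Hypothetical counts of jobs created this month, per developer.
jobs_by_dev = {"alice": 12, "bob": 4, "carol": 9}

# The summary figures to publish: max, min and average.
counts = sorted(jobs_by_dev.values(), reverse=True)
summary = {"max": counts[0], "min": counts[-1], "avg": mean(counts)}

def rank(dev):
    """A developer's rank, where 1 = most jobs created this month."""
    return counts.index(jobs_by_dev[dev]) + 1
```

The same shape of computation would apply to links, columns, stages per job, and so on.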

I think I could write the jobs to gather this info, if people would run them, and I think we could write a process to post the results to a DSX computer to average the averages and produce the rankings. I want to make it extremely easy to gather this information, so maybe just run one sequence, and it may email the results.

I am just brainstorming. If you are using Version Control then maybe we can get changes as well.

Maybe we need a poll.
Mamu Kim
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Kim,

I think that it would be great to get a wide list of performance numbers for a baseline of common types of jobs. Jobs along the general lines of:

1. Read sequential -> Write Sequential
2. Read Sequential -> CPU transforms -> Write Sequential
3. Read Db -> Write Sequential
4. Read Sequential -> Write Db

with some limits on other concurrent processes, disk HW configuration, etc.
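Baseline 1 above (read sequential -&gt; write sequential) can be approximated outside DataStage with a trivial timing harness, useful as a sanity check on raw sequential throughput for a given box. This is only a rough sketch, not a DataStage job, and the function name is invented:

```python
import time

def sequential_copy_rate(src, dst):
    """Read a flat file line by line, write it back out, and return
    rows per second -- a crude floor for baseline job type 1."""
    start = time.perf_counter()
    rows = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(line)
            rows += 1
    elapsed = time.perf_counter() - start
    return rows / elapsed if elapsed > 0 else float("inf")
```

Baselines 3 and 4 would wrap the same timer around a database cursor instead of one of the files, which is exactly where the connection-overhead question from earlier in the thread comes in.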
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think you need to restrict the requirement to production environments.

I write, and encourage others to write, small jobs as part of the learning experience, experimenting with different ways of doing things. I feel that this would bias any results.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Kim,

It is great to have the statistics as Arnd mentioned and that can be used as a benchmark for developing simple jobs and this could be useful to validate our environment as well.

Regards
Saravanan
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I plan on talking to Dennis today. Dennis owns this web site, DSX. We need a place to gather the numbers and process them. I will let you know, but it will take a few days to build the jobs and a few weeks to build the web pages.

I would still like to see how many people would run the jobs and post the numbers to this site. Post a reply if you are willing, along with suggestions like Ray's and Arnd's.
Mamu Kim
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If the idea of the Ascential benchmark center survived the IBM takeover, then they would certainly be more than a bit interested in this type of information (particularly where DSXchange results differ from published numbers) :)

Nonetheless I would consider this type of data to be very valuable to those of us involved in both sizing and performance tuning portions of DataStage development.

Even a simple DataStage installation involves quite a few decisions that can significantly impact performance of different aspects of a job run. I think that the test program would need to include not only the test run and results but also a rough idea of the actual environment (real HW, virtual views, OS config, UV.CONFIG, Px config, etc.). But even without this detailed information it would suffice to know that on at least one installation of DS7.x on HW platform y and DataBase z the write did ~2000 rows per second.

I'd be willing to put in some hours to help create some programs/jobs to collect this data.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
I'd like to help if I can.
Naturally I think publishing performance numbers on different platforms and configurations is beneficial to everyone, including the site publishing them, so I'll try encouraging most of the people I know to participate.

p.s.
AFAIK using NLS has a considerable impact on performance, so it should be noted as well.
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Ascential/IBM will hate this idea. It would be easy for a competitor to take the worst results for each platform and compare them to the results they have produced in a favourable lab environment and publish them or present them to potential customers. Meanwhile IBM/Ascential cannot use the favourable results because they cannot prove that they are accurate.

But speaking as someone not employed by Ascential/IBM, I think it is an interesting idea, and it is better to share information than to live in fear of information misuse. I'd be happy to participate. Is it worth waiting for the September Hawk release and the new repository? Process metadata may be easier to extract in that release than it is in the previous releases.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd be happy to help as well - as long as you promise not to laugh at the numbers we get off of our [cough] 'Super' [/cough] domes, that is. :(
vmcburney wrote:Is it worth waiting for the Sept Hawk release and the new repository? Process metadata may be easier to extract in that release then it is in the previous releases.
Vince, are you really planning on upgrading the moment this release hits the street? I think for most of us out there that won't happen for quite some time, but then we'd be perfectly happy for you to take point on that trail... and possibly the first arrow as well. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

It would not be difficult to create a server job (with routines) that could capture all of the required information from an entire project and, indeed, email the results. Gathering information about the platform is more difficult but, again, not impossible. This could be distributed to anyone interested in participating.

Initially, at least, I think that we need to guarantee that no information will be included in published results that could be used to identify contributors; that is, only summary statistics will be available.

It would also be necessary to draw up some heavy-duty caveats. These data cannot be taken to be representative of anything, because there has been no control over, for example, the optimization skills of the developer.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I agree, Ray, but I think we need to get it started and then fine-tune it later. I think we need to do a uname -a and maybe a sysdef. It would be interesting to have some command to measure network speed or throughput. Some of this might be harder to get on Windows; I may need some help there. What would be the equivalent of uname -a for Windows? We need the hostname, the make and model of the computer, the amount of RAM, the number of processors, the speed of the processors, stuff like that. If we can get it, then that would be great; otherwise, what can we get easily, quickly and consistently? Rows per second should be easy.
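For the cross-platform question, Python's standard library offers a portable approximation of most of what uname -a reports, on both Unix and Windows; RAM is the one item it does not expose portably (systeminfo on Windows or free/prtconf on Unix would be needed for that). A sketch, with a function name of my own invention:

```python
import os
import platform

def host_profile():
    """Cross-platform stand-in for 'uname -a': hostname, OS name,
    OS release, architecture and CPU count. RAM is deliberately
    omitted -- the standard library has no portable way to read it."""
    return {
        "hostname": platform.node(),
        "os": platform.system(),        # e.g. 'Linux' or 'Windows'
        "release": platform.release(),
        "machine": platform.machine(),
        "cpus": os.cpu_count(),
    }
```

Processor speed and the machine's make and model would still need per-OS commands, but a dictionary like this covers the consistent, easy-to-get core on every platform.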
Mamu Kim
elavenil
Premium Member
Posts: 467
Joined: Thu Jan 31, 2002 10:20 pm
Location: Singapore

Post by elavenil »

Hi,
I would like to help by developing the jobs and routines to get the numbers.

Regards
Saravanan
Sreenivasulu
Premium Member
Posts: 892
Joined: Thu Oct 16, 2003 5:18 am

Post by Sreenivasulu »

Hi All,

We measure performance in the following ways:

1. Suppose the target is a database. Replace the database stage with a sequential file stage and see whether the job takes the same time. This tells us whether the database connection to the target (generally a remote connection) is slow, or whether the volume of data is simply huge and therefore takes time.

2. In the transformation section, set all transformations to default values. This helps us determine whether the job is running slowly because of the transformations.

3. Suppose the source is a database; then run the query using hints/partitions/indexes (with the help of the DBA). This gives an insight into whether the source query is the bottleneck.

4. Aggregators. These are part of the transformation bottleneck but need special attention. An Aggregator stage in the middle of a big job makes the entire job slow, since all the records need to pass through the Aggregator (it cannot be processed in parallel).

These are the four points we stress while measuring performance.

Regards
Sreenivasulu
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

I don't know whether anyone would hate this idea, but to use this feature it would be better to have a separate subscription to it with a decent cover.
:?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply