Page 1 of 2

Should I use DataSets?

Posted: Mon Jun 28, 2010 1:13 pm
by dougcl
Hi folks, we have a brand new system, and I building my first job. Our system was sized with only 45GB available for DataSets. Our old custom PL-SQL process transfers about 1TB of data every night. Seems to me if I am thinking about landing data in DataSets I will immediately overrun my 45GB. So I have a big decision. Ask for more disk space or not? Plan on landing or not?

Is building a system without the option of landing DataSets considered a compromise?

Thanks,
Doug

Posted: Mon Jun 28, 2010 1:33 pm
by Raftsman
I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.

If datasets are your only option, I think your going to have performance issues.

Posted: Mon Jun 28, 2010 2:26 pm
by dougcl
Raftsman wrote:I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.

If datasets are your only option, I think your going to have performance issues.
Hi Jim, you mean retrieving data from a remote database is faster than reading a DataSet from a local directory? I'm surprised, but this is great info if it is true. Are you using an order by clause on the source tables?

Do you find that write performance to Oracle is faster than DatSets as well?

Side question, do you happen to be using a particular partitioned read strategy on Oracle?

Posted: Mon Jun 28, 2010 5:20 pm
by kduke
Jim, you have either seriously slow disk drives or your config files are poorly designed. No way a database should be as fast as a dataset. Now landing the data maybe slower than not landing the data. That is a whole different issue.

Posted: Mon Jun 28, 2010 5:25 pm
by kduke
Also undersized hardware will always cause issues. Not enough scratch space means you cannot sort. Not enough RAM will cause performance issues. 45GB is nothing. My laptop has more than that. If you put a Vette motor in a VW doesn't men it can perform like a Vette. All these things need to work together. You need adequate disk space and RAM and CPU power before DataStage can perform properly.

Posted: Mon Jun 28, 2010 7:21 pm
by Raftsman
Our system has over 500G of space and has 32G of ram. I wrote 13 million rows in ~30 minutes. When I write to the Oracle base, I write the same in 15 minutes. When I read, I get approx 4000 row/sec. from a dataset. When I read from Oracle, I get 30000 rows/sec.

There is a substantial gain.

I tried running on 1, 2, 4 and 6 nodes. I gain the most performance at 4.

I asked our network/server about the configuration. He said it is optimized as best can be. Since I am no expert in server configs, I assume his answer to be correct.

It would be nice to know if I could increase performance from a dataset.

I agree, it doesn't make sense that write to a disk on the same server be slower than writing to a dataset.

If I can provide more info to help rectify this problem, please let me know.

Thanks.

Jim

Posted: Mon Jun 28, 2010 7:22 pm
by Raftsman
Dougcl,

No, we are not reading using a particular partitioning strategy.

Jim

Posted: Mon Jun 28, 2010 11:23 pm
by kduke
That is very bad performance writing to a dataset. I would expect your disk space is a SAN and it remotely mounted. You need locally mounted disk space that is not RAID. Get the cheapest fastest drives. ETL does not need to be fault tolerant. Most ETL space is temporary or should be.

Your Oracle performance is very respectable. Could you be sorting your data when you do not need to be? Something seems very wrong. I am very curious now. I wonder if others see similar rows per second on datasets and databases. Also, why only a 6 nodes? Have you tried 12 or 16? Maybe your SAN is configured poorly. I have seen admins configure SANs with one big logical volume on top of lots of physical volumes. This is easy to admin but allows poor performance. If the logical volume more closely matched the physical volumes then your config files would spread the disk access out more evenly. One logical volume may mean the SAN is writing to one physical disk and not spreading the workload across multiple physical drives. So basically you are single threaded on your SAN. Not good.

Posted: Tue Jun 29, 2010 6:50 am
by Raftsman
Let me check with the server tech and get back to you. I know it's not a SAN. Our disk resides on the same server and whatever the default that DS provides during installation are used.

Follow-up to come.

Thanks

Posted: Tue Jun 29, 2010 6:57 am
by ArndW
One other aspect is that the type of file system and journalling can play a significant performance impact. Are you using JFS for the data file partition?

Posted: Tue Jun 29, 2010 7:44 am
by Raftsman
I googled your question and if I am correct, this is related to a 64 bit system.

No, we are running under a 32 bit system.

Posted: Tue Jun 29, 2010 7:50 am
by chulett
Nope, got nothing to do with the "bit-ness" of the system.

Posted: Tue Jun 29, 2010 8:09 am
by Raftsman
Answer to two of the questions asked from our Server tech.

First:

The disks are local mounted raid5 and scratch is a local mounted non-raid.

Second:

No... JFS is only on unix.

I hope I was clear that we are running Windows. This may be the reason for most of our issues.

Posted: Tue Jun 29, 2010 8:11 am
by Raftsman
I apologize, I wasn't clear because I responded to a question asked with a Unix system.

No matter what, should we not get better performance with datasets?

THanks

Posted: Tue Jun 29, 2010 8:15 am
by chulett
You marked your post as using the UNIX system for your server, hence the confusion.