Should I use DataSets?
Moderators: chulett, rschirm, roy
Should I use DataSets?
Hi folks, we have a brand new system, and I building my first job. Our system was sized with only 45GB available for DataSets. Our old custom PL-SQL process transfers about 1TB of data every night. Seems to me if I am thinking about landing data in DataSets I will immediately overrun my 45GB. So I have a big decision. Ask for more disk space or not? Plan on landing or not?
Is building a system without the option of landing DataSets considered a compromise?
Thanks,
Doug
Is building a system without the option of landing DataSets considered a compromise?
Thanks,
Doug
I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.
If datasets are your only option, I think your going to have performance issues.
If datasets are your only option, I think your going to have performance issues.
Jim Stewart
Hi Jim, you mean retrieving data from a remote database is faster than reading a DataSet from a local directory? I'm surprised, but this is great info if it is true. Are you using an order by clause on the source tables?Raftsman wrote:I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.
If datasets are your only option, I think your going to have performance issues.
Do you find that write performance to Oracle is faster than DatSets as well?
Side question, do you happen to be using a particular partitioned read strategy on Oracle?
Also undersized hardware will always cause issues. Not enough scratch space means you cannot sort. Not enough RAM will cause performance issues. 45GB is nothing. My laptop has more than that. If you put a Vette motor in a VW doesn't men it can perform like a Vette. All these things need to work together. You need adequate disk space and RAM and CPU power before DataStage can perform properly.
Mamu Kim
Our system has over 500G of space and has 32G of ram. I wrote 13 million rows in ~30 minutes. When I write to the Oracle base, I write the same in 15 minutes. When I read, I get approx 4000 row/sec. from a dataset. When I read from Oracle, I get 30000 rows/sec.
There is a substantial gain.
I tried running on 1, 2, 4 and 6 nodes. I gain the most performance at 4.
I asked our network/server about the configuration. He said it is optimized as best can be. Since I am no expert in server configs, I assume his answer to be correct.
It would be nice to know if I could increase performance from a dataset.
I agree, it doesn't make sense that write to a disk on the same server be slower than writing to a dataset.
If I can provide more info to help rectify this problem, please let me know.
Thanks.
Jim
There is a substantial gain.
I tried running on 1, 2, 4 and 6 nodes. I gain the most performance at 4.
I asked our network/server about the configuration. He said it is optimized as best can be. Since I am no expert in server configs, I assume his answer to be correct.
It would be nice to know if I could increase performance from a dataset.
I agree, it doesn't make sense that write to a disk on the same server be slower than writing to a dataset.
If I can provide more info to help rectify this problem, please let me know.
Thanks.
Jim
Jim Stewart
That is very bad performance writing to a dataset. I would expect your disk space is a SAN and it remotely mounted. You need locally mounted disk space that is not RAID. Get the cheapest fastest drives. ETL does not need to be fault tolerant. Most ETL space is temporary or should be.
Your Oracle performance is very respectable. Could you be sorting your data when you do not need to be? Something seems very wrong. I am very curious now. I wonder if others see similar rows per second on datasets and databases. Also, why only a 6 nodes? Have you tried 12 or 16? Maybe your SAN is configured poorly. I have seen admins configure SANs with one big logical volume on top of lots of physical volumes. This is easy to admin but allows poor performance. If the logical volume more closely matched the physical volumes then your config files would spread the disk access out more evenly. One logical volume may mean the SAN is writing to one physical disk and not spreading the workload across multiple physical drives. So basically you are single threaded on your SAN. Not good.
Your Oracle performance is very respectable. Could you be sorting your data when you do not need to be? Something seems very wrong. I am very curious now. I wonder if others see similar rows per second on datasets and databases. Also, why only a 6 nodes? Have you tried 12 or 16? Maybe your SAN is configured poorly. I have seen admins configure SANs with one big logical volume on top of lots of physical volumes. This is easy to admin but allows poor performance. If the logical volume more closely matched the physical volumes then your config files would spread the disk access out more evenly. One logical volume may mean the SAN is writing to one physical disk and not spreading the workload across multiple physical drives. So basically you are single threaded on your SAN. Not good.
Mamu Kim