Should I use DataSets?

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dougcl
Premium Member
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Should I use DataSets?

Post by dougcl »

Hi folks, we have a brand new system, and I building my first job. Our system was sized with only 45GB available for DataSets. Our old custom PL-SQL process transfers about 1TB of data every night. Seems to me if I am thinking about landing data in DataSets I will immediately overrun my 45GB. So I have a big decision. Ask for more disk space or not? Plan on landing or not?

Is building a system without the option of landing DataSets considered a compromise?

Thanks,
Doug
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.

If datasets are your only option, I think your going to have performance issues.
Jim Stewart
dougcl
Premium Member
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Raftsman wrote:I have been experimenting with datasets and I have found that using a DBMS is more efficient that using them. Our environment uses Oracle and I am using TMP Oracle datasets as a holding area. The R/W process is much faster.

If datasets are your only option, I think your going to have performance issues.
Hi Jim, you mean retrieving data from a remote database is faster than reading a DataSet from a local directory? I'm surprised, but this is great info if it is true. Are you using an order by clause on the source tables?

Do you find that write performance to Oracle is faster than DatSets as well?

Side question, do you happen to be using a particular partitioned read strategy on Oracle?
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Jim, you have either seriously slow disk drives or your config files are poorly designed. No way a database should be as fast as a dataset. Now landing the data maybe slower than not landing the data. That is a whole different issue.
Mamu Kim
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Also undersized hardware will always cause issues. Not enough scratch space means you cannot sort. Not enough RAM will cause performance issues. 45GB is nothing. My laptop has more than that. If you put a Vette motor in a VW doesn't men it can perform like a Vette. All these things need to work together. You need adequate disk space and RAM and CPU power before DataStage can perform properly.
Mamu Kim
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

Our system has over 500G of space and has 32G of ram. I wrote 13 million rows in ~30 minutes. When I write to the Oracle base, I write the same in 15 minutes. When I read, I get approx 4000 row/sec. from a dataset. When I read from Oracle, I get 30000 rows/sec.

There is a substantial gain.

I tried running on 1, 2, 4 and 6 nodes. I gain the most performance at 4.

I asked our network/server about the configuration. He said it is optimized as best can be. Since I am no expert in server configs, I assume his answer to be correct.

It would be nice to know if I could increase performance from a dataset.

I agree, it doesn't make sense that write to a disk on the same server be slower than writing to a dataset.

If I can provide more info to help rectify this problem, please let me know.

Thanks.

Jim
Jim Stewart
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

Dougcl,

No, we are not reading using a particular partitioning strategy.

Jim
Jim Stewart
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

That is very bad performance writing to a dataset. I would expect your disk space is a SAN and it remotely mounted. You need locally mounted disk space that is not RAID. Get the cheapest fastest drives. ETL does not need to be fault tolerant. Most ETL space is temporary or should be.

Your Oracle performance is very respectable. Could you be sorting your data when you do not need to be? Something seems very wrong. I am very curious now. I wonder if others see similar rows per second on datasets and databases. Also, why only a 6 nodes? Have you tried 12 or 16? Maybe your SAN is configured poorly. I have seen admins configure SANs with one big logical volume on top of lots of physical volumes. This is easy to admin but allows poor performance. If the logical volume more closely matched the physical volumes then your config files would spread the disk access out more evenly. One logical volume may mean the SAN is writing to one physical disk and not spreading the workload across multiple physical drives. So basically you are single threaded on your SAN. Not good.
Mamu Kim
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

Let me check with the server tech and get back to you. I know it's not a SAN. Our disk resides on the same server and whatever the default that DS provides during installation are used.

Follow-up to come.

Thanks
Jim Stewart
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

One other aspect is that the type of file system and journalling can play a significant performance impact. Are you using JFS for the data file partition?
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I googled your question and if I am correct, this is related to a 64 bit system.

No, we are running under a 32 bit system.
Jim Stewart
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Nope, got nothing to do with the "bit-ness" of the system.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

Answer to two of the questions asked from our Server tech.

First:

The disks are local mounted raid5 and scratch is a local mounted non-raid.

Second:

No... JFS is only on unix.

I hope I was clear that we are running Windows. This may be the reason for most of our issues.
Jim Stewart
Raftsman
Premium Member
Premium Member
Posts: 335
Joined: Thu May 26, 2005 8:56 am
Location: Ottawa, Canada

Post by Raftsman »

I apologize, I wasn't clear because I responded to a question asked with a Unix system.

No matter what, should we not get better performance with datasets?

THanks
Jim Stewart
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You marked your post as using the UNIX system for your server, hence the confusion.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Post Reply