DataSet size very large with very little data inside

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jmessiha
Participant
Posts: 21
Joined: Fri Nov 12, 2004 9:48 am

DataSet size very large with very little data inside

Post by jmessiha »

I have 150 columns, all VarChar(500), but the partitions in the scratch space are still very large even when I set every one of those columns to null using the SetNull() function. ~75,000 rows produces a dataset with 2 partitions of roughly 6.0 GB each, around 12.0 GB in total.

The only thing I could think of was that maybe the internal delimiter was taking up a lot of space, but that still seems like far too much, right?

Any ideas?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Take a look using the Data Set Management utility in Manager to see how much of each segment in each partition is actually in use. Then ask the vendor how come the files are so large.
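From the command line, the orchadmin utility reports similar per-partition and per-segment information about a dataset; the invocation below is from memory, so treat the exact syntax as an assumption and check it against your release (the path is just a placeholder):

orchadmin describe /path/to/your_dataset.ds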
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jmessiha
Participant
Posts: 21
Joined: Fri Nov 12, 2004 9:48 am

Post by jmessiha »

The problem is that the file never finishes being created because it fills up the disk too quickly. I guess I could try reducing the input to the point where the dataset could actually be created in full (75,000 rows would finish... the original input was 1.2 million). I'll try that and see what happens...

I did notice that dataset partitions are always much larger than the amount of user data actually stored in them. Why is that? I may have to resort to filesets to handle this problem... Any other ideas?
ray.wurlod wrote:Take a look using the Data Set Management utility in Manager to see how much of each segment in each partition is actually in use. Then ask the vendor how come the files are so large.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan

Post by Ultramundane »

I had asked this question of the vendor. Ascential said that datasets allocate nearly the full amount of space that could be used by each column, even when no data for that column exists. Thus, we have to use sequential files or the workaround they provided. They said this was a performance enhancement made in version 7.0.1 and up.

Here is the quote from Ascential:
After DS 7.0.1 there was a change in the way varchar columns are stored in datasets. I didn't think this was related because none of the files that I had received used varchar types, but this explains what you are describing.
Here is an extract of the advanced parallel guide regarding this issue:

"Some parallel datasets generated with DataStage 7.0.1 and later releases
require more disk space when the columns are of type VarChar when compared to 7.0. This is due to changes added for performance improvements for bounded length VarChars in 7.0.1. The
preferred solution is to use unbounded length VarChars (don't set any
length) for columns where the maximum length is rarely used.
Alternatively, you can set the environment variable,
APT_OLD_BOUNDED_LENGTH (see page 6-19), but this is not recommended, as it leads to performance degradation."
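That explanation also lines up with the numbers in the original post. As a rough back-of-the-envelope check (assuming each bounded VarChar(500) is pre-allocated at its full declared length, and that NLS ustring storage roughly doubles that at two bytes per character -- both assumptions on my part, not vendor-confirmed figures):

cols = 150          # VarChar(500) columns from the original post
declared_len = 500  # declared bound per column, in characters
rows = 75000        # rows written to the dataset

bytes_per_row = cols * declared_len        # 75,000 bytes per row at 1 byte/char
single_byte_total = rows * bytes_per_row   # total if stored as 1 byte per character
double_byte_total = single_byte_total * 2  # total if stored as 2-byte NLS ustrings (assumption)

gb = 10 ** 9
print("1 byte/char:  %.1f GB" % (single_byte_total / float(gb)))   # ~5.6 GB
print("2 bytes/char: %.1f GB" % (double_byte_total / float(gb)))   # ~11.2 GB, close to the ~12 GB observed

Either way, pre-allocating the full 500-character bound for every column explains datasets in the multi-gigabyte range even when every value is null.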