DataSet size very large with very little data inside

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

jmessiha
Participant
Posts: 21
Joined: Fri Nov 12, 2004 9:48 am

DataSet size very large with very little data inside

Post by jmessiha »

I have 150 columns, all VarChar(500), but the partitions in the scratch space are still very large even when I set every one of those columns to null using the SetNull() function. ~75,000 rows produces a dataset with 2 partitions of roughly 6.0 GB each, around 12.0 GB in total.

The only thing I could think of was that maybe the internal delimiter was taking up a lot of space, but that still seems like far too much, right?

Any ideas?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Take a look using the Data Set Management utility in Manager to see how much of each segment in each partition is actually in use. Then ask the vendor how come the files are so large.
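From the command line, the orchadmin utility reports similar per-partition and per-segment information about a dataset; the invocation below is from memory, so treat the exact syntax as an assumption and check it against your release (the path is just a placeholder):

orchadmin describe /path/to/your_dataset.ds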
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jmessiha
Participant
Posts: 21
Joined: Fri Nov 12, 2004 9:48 am

Post by jmessiha »

The problem is that the file never finishes being created because it fills up the disk too quickly. I guess I could try reducing the input to the point where the dataset could actually be created in full (75,000 rows would finish... the original input was 1.2 million). I'll try that and see what happens...

I did notice that dataset partitions are always much larger than the amount of user data actually stored in them. Why is that? I may have to resort to filesets to handle this problem... Any other ideas?
ray.wurlod wrote:Take a look using the Data Set Management utility in Manager to see how much of each segment in each partition is actually in use. Then ask the vendor how come the files are so large.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan

Post by Ultramundane »

I had asked this question of the vendor. Ascential said that datasets allocate nearly the full amount of space that could be used by each column, even when no data for that column exists. Thus, we have to use sequential files or the workaround they provided. They said this was a performance enhancement made in version 7.0.1 and up.

Here is the quote from Ascential:
After DS 7.0.1 there was a change in the way varchar columns are stored in datasets. I didn't think this was related because none of the files that I had received used varchar types, but this explains what you are describing.
Here is an extract of the advanced parallel guide regarding this issue:

"Some parallel datasets generated with DataStage 7.0.1 and later releases
require more disk space when the columns are of type VarChar when compared to 7.0. This is due to changes added for performance improvements for bounded length VarChars in 7.0.1. The
preferred solution is to use unbounded length VarChars (don't set any
length) for columns where the maximum length is rarely used.
Alternatively, you can set the environment variable,
APT_OLD_BOUNDED_LENGTH (see page 6-19), but this is not recommended, as it leads to performance degradation."
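That explanation also lines up with the numbers in the original post. As a rough back-of-the-envelope check (assuming each bounded VarChar(500) is pre-allocated at its full declared length, and that NLS ustring storage roughly doubles that at two bytes per character -- both assumptions on my part, not vendor-confirmed figures):

cols = 150          # VarChar(500) columns from the original post
declared_len = 500  # declared bound per column, in characters
rows = 75000        # rows written to the dataset

bytes_per_row = cols * declared_len        # 75,000 bytes per row at 1 byte/char
single_byte_total = rows * bytes_per_row   # total if stored as 1 byte per character
double_byte_total = single_byte_total * 2  # total if stored as 2-byte NLS ustrings (assumption)

gb = 10 ** 9
print("1 byte/char:  %.1f GB" % (single_byte_total / float(gb)))   # ~5.6 GB
print("2 bytes/char: %.1f GB" % (double_byte_total / float(gb)))   # ~11.2 GB, close to the ~12 GB observed

Either way, pre-allocating the full 500-character bound for every column explains datasets in the multi-gigabyte range even when every value is null.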