Dataset is occupying more space than file

jagadam
Premium Member
Posts: 107
Joined: Wed Jul 01, 2009 4:55 pm
Location: Phili

Dataset is occupying more space than file

Post by jagadam »

Hi,

A job is reading from a database, passing the data through a transformer and writing to a dataset. There are 50 columns, and 33 of the fields have datatype varchar(4000). I see that the dataset is occupying more than 200 GB on disk, while the equivalent sequential file is only 600 MB.

I have searched the forum and found a similar post:

http://dsxchange.com/viewtopic.php?t=13 ... 2b8fd1bd68

1) I am trimming the data for all fields in the transformer.
2) Auto partitioning is set on the dataset.

The configuration file we are using is:

{
    node "node1"
    {
        fastname "xxxx"
        pools ""
        resource disk "/A/B/C/resource/resource1" {pools ""}
        resource disk "/A/B/C/resource/resource2" {pools ""}
        resource disk "/A/B/C/resource/resource3" {pools ""}
        resource disk "/A/B/C/resource/resource4" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
    }
    node "node2"
    {
        fastname "xxxx"
        pools ""
        resource disk "/A/B/C/resource/resource1" {pools ""}
        resource disk "/A/B/C/resource/resource2" {pools ""}
        resource disk "/A/B/C/resource/resource3" {pools ""}
        resource disk "/A/B/C/resource/resource4" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
    }
}

So is it occupying so much space because of the 33 varchar(4000) fields? Could you please explain why it is taking so much space? And in what way is a dataset better than a text file?

Please help.

Thanks
NJ
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Doesn't the post you linked to answer all of your questions?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

To answer your final question, Data Set is perfect for staging data between parallel jobs. It preserves internal format, sorted order and partitioning. None of these is true for a sequential file. The operator generated by a Data Set stage is copy - it simply copies the virtual Data Set (what's on the link) to/from disk.
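
If you need to look at a persistent Data Set outside a job, the orchadmin utility will show you its schema and segment files. A minimal sketch, assuming orchadmin is on your PATH and that the paths shown (which are placeholders) point at your configuration file and dataset:

export APT_CONFIG_FILE=/A/B/C/config.apt   # placeholder path to your config file
orchadmin describe /A/B/C/mydata.ds        # show the dataset's schema and segments
orchadmin rm /A/B/C/mydata.ds              # removes the data segments as well as the descriptor

Never delete a Data Set with plain rm; that removes only the descriptor file and leaves the data segments orphaned on the resource disks.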
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Well, your datasets are likely large because the PX framework treats bounded fields as fixed width for performance reasons. No amount of trimming etc. will do away with that. This also holds true for how data are moved between operators: fixed width, again, for performance reasons.
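
A rough back-of-the-envelope calculation shows why the numbers you quoted are plausible (the average record size below is an assumption for illustration, not taken from your post):

33 fields x 4,000 bytes = 132,000 bytes stored per record for the bounded varchars alone.
If the actual data averages, say, ~400 bytes per record, that is a 132,000 / 400 = 330x blow-up,
which is right in line with a 600 MB file becoming a ~200 GB dataset.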

There is an environment variable that might help you here:

APT_COMPRESS_BOUNDED_FIELDS (which is the default in version 8.5)

When set, it generates a modify adapter in copy operators that are writing file datasets, converting bounded-length fields to variable-length fields for storage in the dataset. The dataset in this case will contain a modify adapter to convert the variable-length fields back to bounded length. Note that this requires that copy operators not be optimized out by the score composer.
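
For what it's worth, a minimal sketch of enabling it from the shell before a run (the value 1 is an assumption on my part; you can equally define it as a project- or job-level environment variable in the Administrator):

# e.g. in dsenv, or exported before invoking the job
export APT_COMPRESS_BOUNDED_FIELDS=1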

I felt you also needed to understand what it does, and not just that it is there.

Have fun!
jagadam
Premium Member
Posts: 107
Joined: Wed Jul 01, 2009 4:55 pm
Location: Phili

Post by jagadam »

Thanks for all the inputs.
NJ