Dataset is occupying more space than file

jagadam
Premium Member
Posts: 107
Joined: Wed Jul 01, 2009 4:55 pm
Location: Phili

Dataset is occupying more space than file

Post by jagadam »

Hi,

A job is reading from a database, passing the data through a transformer and writing to a dataset. There are 50 columns, and 33 of the fields have datatype varchar(4000). I see that the dataset is occupying more than 200 GB on disk, while the equivalent sequential file is only 600 MB.

I have searched the forum and found a similar post:

http://dsxchange.com/viewtopic.php?t=13 ... 2b8fd1bd68

1) I am trimming the data for all fields in the transformer.
2) Auto partitioning is set on the dataset.

The configuration file we are using is:

{
    node "node1"
    {
        fastname "xxxx"
        pools ""
        resource disk "/A/B/C/resource/resource1" {pools ""}
        resource disk "/A/B/C/resource/resource2" {pools ""}
        resource disk "/A/B/C/resource/resource3" {pools ""}
        resource disk "/A/B/C/resource/resource4" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
    }
    node "node2"
    {
        fastname "xxxx"
        pools ""
        resource disk "/A/B/C/resource/resource1" {pools ""}
        resource disk "/A/B/C/resource/resource2" {pools ""}
        resource disk "/A/B/C/resource/resource3" {pools ""}
        resource disk "/A/B/C/resource/resource4" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
        resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
    }
}

So is it occupying so much space because of the 33 varchar(4000) fields? Could you please explain why it is taking so much space? And in what way is a dataset better than a text file?

Please help.

Thanks
NJ
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Doesn't the post you linked to answer all of your questions?
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

To answer your final question, Data Set is perfect for staging data between parallel jobs. It preserves internal format, sorted order and partitioning. None of these is true for a sequential file. The operator generated by a Data Set stage is copy - it simply copies the virtual Data Set (what's on the link) to/from disk.
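
If you need to look at a persistent Data Set outside a job, the orchadmin utility will show you its schema and segment files. A minimal sketch, assuming orchadmin is on your PATH and that the paths shown (which are placeholders) point at your configuration file and dataset:

export APT_CONFIG_FILE=/A/B/C/config.apt   # placeholder path to your config file
orchadmin describe /A/B/C/mydata.ds        # show the dataset's schema and segments
orchadmin rm /A/B/C/mydata.ds              # removes the data segments as well as the descriptor

Never delete a Data Set with plain rm; that removes only the descriptor file and leaves the data segments orphaned on the resource disks.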
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Well, your datasets are likely large because the PX framework treats bounded fields as fixed width for performance reasons. No amount of trimming etc. will do away with that. This also holds true for how data are moved between operators: fixed width, again, for performance reasons.
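
A rough back-of-the-envelope calculation shows why the numbers you quoted are plausible (the average record size below is an assumption for illustration, not taken from your post):

33 fields x 4,000 bytes = 132,000 bytes stored per record for the bounded varchars alone.
If the actual data averages, say, ~400 bytes per record, that is a 132,000 / 400 = 330x blow-up,
which is right in line with a 600 MB file becoming a ~200 GB dataset.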

There is an environment variable that might help you here:

APT_COMPRESS_BOUNDED_FIELDS (which is the default in version 8.5)

When set, it generates a modify adapter in copy operators that are writing file datasets, converting bounded-length fields to variable-length fields for storage in the dataset. The dataset in this case will contain a modify adapter to convert the variable-length fields back to bounded length. Note that this requires that copy operators not be optimized out by the score composer.
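
For what it's worth, a minimal sketch of enabling it from the shell before a run (the value 1 is an assumption on my part; you can equally define it as a project- or job-level environment variable in the Administrator):

# e.g. in dsenv, or exported before invoking the job
export APT_COMPRESS_BOUNDED_FIELDS=1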

I felt you also needed to understand what it does, and not just that it is there.

Have fun!
jagadam
Premium Member
Posts: 107
Joined: Wed Jul 01, 2009 4:55 pm
Location: Phili

Post by jagadam »

Thanks for all the inputs.
NJ