Hi,
A job reads from a database, passes the data through a Transformer, and writes to a Data Set. There are 50 columns, 33 of which have datatype VarChar(4000). I see that the Data Set occupies more than 200 GB of disk space, compared with the sequential file (600 MB).
I have searched the forum and found a similar post:
http://dsxchange.com/viewtopic.php?t=13 ... 2b8fd1bd68
1) I am trimming the data for all fields in the Transformer.
2) Auto partitioning on the Data Set stage.
The configuration file we are using is:
{
	node "node1"
	{
		fastname "xxxx"
		pools ""
		resource disk "/A/B/C/resource/resource1" {pools ""}
		resource disk "/A/B/C/resource/resource2" {pools ""}
		resource disk "/A/B/C/resource/resource3" {pools ""}
		resource disk "/A/B/C/resource/resource4" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
	}
	node "node2"
	{
		fastname "xxxx"
		pools ""
		resource disk "/A/B/C/resource/resource1" {pools ""}
		resource disk "/A/B/C/resource/resource2" {pools ""}
		resource disk "/A/B/C/resource/resource3" {pools ""}
		resource disk "/A/B/C/resource/resource4" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch1" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch2" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch3" {pools ""}
		resource scratchdisk "/A/B/C/scratch/scratch4" {pools ""}
	}
}
So is it occupying that much space because of the 33 VarChar(4000) fields? Could you please explain why it is taking so much space? And in what way is a Data Set better than a text file?
Please help.
Thanks
To answer your final question: a Data Set is perfect for staging data between parallel jobs. It preserves internal format, sort order and partitioning; none of these is true of a sequential file. The operator generated by a Data Set stage is copy: it simply copies the virtual Data Set (what's on the link) to/from disk.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Well, your datasets are likely large because the PX framework treats bounded fields as fixed width for performance reasons. No amount of trimming will do away with that. The same holds for how data are moved between operators: fixed width, again, for performance reasons.
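To see why fixed-width storage blows the size up, here is a rough back-of-the-envelope calculation. The 33-field and 4000-byte figures come from the question; the row count is purely hypothetical, chosen to show how quickly the total reaches the observed range, and per-record overhead and the other 17 columns are ignored:

```python
# Hedged sketch: estimate on-disk size when bounded VarChar fields are
# stored at their full declared width (assuming 1 byte per character,
# ignoring record overhead and the remaining 17 columns).
n_varchar_fields = 33        # fields declared VarChar(4000)
declared_width = 4000        # bytes each field occupies when fixed-width
bytes_per_row = n_varchar_fields * declared_width   # 132,000 bytes/row

rows = 1_500_000             # hypothetical row count, for illustration only
total_gb = rows * bytes_per_row / 1024**3
print(f"{total_gb:.1f} GB")  # prints "184.4 GB"
```

Even if the actual data in each field averages only a few dozen bytes, every row still pays the full declared width, which is why the Data Set dwarfs the 600 MB sequential file.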
There is an environment variable that might help you here: APT_COMPRESS_BOUNDED_FIELDS (which is the default in version 8.5).
When set, it generates a modify adapter in copy operators that write file datasets, converting bounded-length fields to variable-length fields for storage in the dataset. The dataset in this case will contain a modify adapter to convert the variable-length fields back to bounded length. Note that this requires that copy operators not be optimized out by the score composer.
I felt you also needed to understand what it did and not just that it is there.
Have fun!
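The trade-off the variable makes can be shown with a toy model. This is my own illustration, not DataStage internals: the 4-byte length prefix per variable-length value is an assumption, and real datasets add headers and alignment on top:

```python
# Hedged toy model: bounded fields padded to their declared width
# versus stored at actual data length with a length prefix.
values = ["short", "a slightly longer value", ""]
declared = 4000

fixed_bytes = len(values) * declared              # every value padded to the bound
variable_bytes = sum(4 + len(v) for v in values)  # assumed 4-byte length prefix each
print(fixed_bytes, variable_bytes)                # prints "12000 40"
```

The ratio gets more extreme the emptier the fields are, which is exactly the situation with 33 mostly-sparse VarChar(4000) columns.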
Mike Hester
mhester@petra-ps.com