Dataset occupying a lot more space compared to a text file.

mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Hello,

A job reads from a sequential file, passes the data through a Transformer, and writes to a DataSet. The source and the destination each have 7 key columns. The sequential file has 293 fields and 5 million records, and the DataSet has the same number of fields and records. Yet the DataSet occupies 7-8 times more disk space than the sequential file.

I'd be thankful if anyone could explain why this is so and whether there is a way to reduce the space consumed by the DataSet.
jcthornton
Premium Member
Posts: 79
Joined: Thu Mar 22, 2007 4:58 pm
Location: USA

Post by jcthornton »

Multiple reasons come to mind that could explain at least a part of the growth.

First, the dataset stores the records using the internal representation of that data. Depending on how those columns are defined and what they contain, this representation can reasonably be larger than a source text file.

Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.

I do not think those two reasons alone would account for a 7-8x increase in space on disk, however.

The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.
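As a rough illustration of the 'entire' partitioning point, here is a minimal sketch of the arithmetic. The function name and the byte and node counts are hypothetical, and it assumes every node's partition holds one full copy of the data under 'entire', while any other method splits the data with no duplication.

```python
def dataset_footprint(bytes_per_copy: int, nodes: int, partitioning: str) -> int:
    """Approximate total bytes on disk across all partitions.

    Assumption: 'entire' writes one full copy per node; every other
    partitioning method spreads a single copy across the nodes.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if partitioning == "entire":
        return bytes_per_copy * nodes
    return bytes_per_copy


# Hypothetical figures: 1 GB of data on a 4-node configuration.
one_gb = 1_000_000_000
entire_total = dataset_footprint(one_gb, 4, "entire")
hash_total = dataset_footprint(one_gb, 4, "hash")
print(entire_total // hash_total)
```

Under this model, the growth factor equals the node count, so a single-node configuration would see no growth from this cause.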
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Varchar fields are stored at their full size, from what I recall.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

jcthornton,
jcthornton wrote: Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.
Does additional metadata include indexes on key columns? What are the other things that are included in metadata?
jcthornton wrote: The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.
The job uses only 1 node config file.

chulett,
chulett wrote: Varchar fields are stored at their full size, from what I recall.
Is this mentioned in IBM's documents? In which document can I find it?

Thanks for your time.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Not sure where that is documented other than in posts here. Unbounded varchar fields are stored as you would expect but declare a size (which most everyone does) and it takes that full size in the dataset.
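To show how much this behaviour could matter, here is a minimal back-of-the-envelope sketch. It assumes the storage rule as described above (a bounded varchar occupies its full declared length) and ignores any other per-field or per-record overhead. The 293-field count comes from the original question; the 3-byte values, the declared length of 30, and the function names are hypothetical.

```python
def text_record_bytes(values, delimiter_bytes=1, newline_bytes=1):
    """Bytes for one delimited text record: data, delimiters, line terminator."""
    data = sum(len(v.encode("utf-8")) for v in values)
    return data + delimiter_bytes * (len(values) - 1) + newline_bytes


def bounded_record_bytes(declared_lengths):
    """Bytes for one record if every bounded varchar takes its declared length.

    Assumption: no length prefixes, null indicators or other overhead are counted.
    """
    return sum(declared_lengths)


# Hypothetical record: 293 fields, each holding 3 bytes but declared as varchar(30).
values = ["abc"] * 293
declared = [30] * 293

text_bytes = text_record_bytes(values)
dataset_bytes = bounded_record_bytes(declared)
ratio = dataset_bytes / text_bytes
print(text_bytes, dataset_bytes, ratio)  # 1172 8790 7.5
```

With these made-up numbers, the fixed-width record is 7.5 times the size of the text record, which is in the same range as the 7-8x growth reported. Whether that matches the real job depends on how the columns are actually declared and populated.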
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

chulett wrote: Not sure where that is documented other than in posts here. Unbounded varchar fields are stored as you would expect but declare a size (which most everyone does) and it takes that full size in the dataset.
I created a tiny job to verify it and what you said is correct.

Thank you!

Thanks jcthornton!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Data Sets carry four additional fields (per record) used for recording the structure - including sort order - of the data in the Data Set. These fields can be seen when inspecting the record schema of the Data Set (or File Set, which also carry these fields).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Ray,

Please correct me if I'm wrong.

ray.wurlod wrote: inspecting the record schema of the Data Set ...
By this, did you mean

orchadmin ll <datasetname.ds> ?

With the above command, this is what I see for a Data Set with a single varchar column, and I don't see the four additional fields anywhere.

Schema:
record
( col1: string;
)
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, I meant cat /path/fileset.fs (these are easier, since there's no binary data - but you can do something similar with ".ds" files).
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

ray.wurlod wrote: Data Sets carry four additional fields (per record) used for recording the structure
Could anyone please throw some light on these extra columns, beyond the sort order already mentioned? I am also curious about varchar fields taking up their full defined size. The discussion always says they are variable length. If so, what do the remaining bytes of the field contain when there is no data? (In a char field they are padded with spaces; what happens here?)

Thanks in advance
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I have nothing to add to what I have already said.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Sequential files can take more space than data sets too, up to 7x more. As with many cases, it depends! I would agree that in most cases, the data set file(s) by nature will require more space than the same data in text files. See this topic for an example of such an exception:

viewtopic.php?t=143906
Choose a job you love, and you will never have to work a day in your life. - Confucius