Dataset occupying a lot more space compared to a text file.
Hello,
A job reads from a sequential file, passes the data through a Transformer, and writes to a DataSet. There are 7 key columns for source and destination. The sequential file has 293 fields and 5 million records. The DataSet has the same number of fields and records, but it occupies 7-8 times more space on disk than the sequential file.
I'd be thankful if anyone could explain why it is so and if there is a way to reduce space consumed by the DataSet.
-
- Premium Member
- Posts: 79
- Joined: Thu Mar 22, 2007 4:58 pm
- Location: USA
Multiple reasons come to mind that could explain at least a part of the growth.
First, the dataset stores the records using the internal representation of that data. Depending on how those columns are defined and what they contain, this representation can reasonably be larger than a source text file.
Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.
I do not think that those two reasons would account for a 7-8x increase in space on disk however.
The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.
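To put rough numbers on the 'entire' partitioning point, here is a minimal sketch. It assumes each partition holds one full copy of the data and ignores metadata and any per-record overhead; the function name and the 2 GiB figure are hypothetical, chosen only for illustration.

```python
def entire_partition_footprint(single_copy_bytes: int, node_count: int) -> int:
    """Estimate bytes on disk when every partition stores a full copy.

    Assumption: with 'entire' partitioning, each of the node_count
    partitions holds the complete data, so the footprint scales linearly.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return single_copy_bytes * node_count


# Hypothetical size of one copy of the data (2 GiB), for illustration only.
one_copy = 2 * 1024**3

for nodes in (1, 2, 4, 8):
    total = entire_partition_footprint(one_copy, nodes)
    print(f"{nodes} node(s): {total / 1024**3:.0f} GiB on disk")
```

On a single-node configuration this factor is 1, so 'entire' partitioning alone would not explain any growth there.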
Jack Thornton
----------------
Spectacular achievement is always preceded by spectacular preparation - Robert H. Schuller
jcthornton,
jcthornton wrote: Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.
Does the additional metadata include indexes on key columns? What else is included in the metadata?
jcthornton wrote: The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.
The job uses a config file with only one node.
chulett,
chulett wrote: Varchar fields are stored at their full size, from what I recall.
Is this mentioned in IBM's documentation? In which document can I find it?
Thanks for your time.
chulett wrote: Not sure where that is documented other than in posts here. Unbounded varchar fields are stored as you would expect but declare a size (which most everyone does) and it takes that full size in the dataset.
I created a tiny job to verify it, and what you said is correct.
Thank you!
Thanks jcthornton!
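The effect of bounded varchar fields being stored at their full declared size can be estimated with simple arithmetic. The sketch below is only an illustration: the 293 fields and 5 million records come from the original post, but the declared length (100), the average actual length (12), the one-byte delimiter, and the single-byte character assumption are all hypothetical, and per-record overhead in the Data Set is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarcharColumn:
    declared_length: int        # size declared in the column definition
    average_actual_length: int  # average number of characters actually present


def text_file_bytes(columns, records, delimiter_bytes=1):
    """Rough size of a delimited text file: actual data plus one delimiter per field."""
    per_record = sum(c.average_actual_length + delimiter_bytes for c in columns)
    return per_record * records


def full_size_dataset_bytes(columns, records):
    """Rough size if every bounded varchar occupies its full declared length."""
    per_record = sum(c.declared_length for c in columns)
    return per_record * records


# 293 fields and 5 million records as in the original post;
# the lengths below are assumptions made up for this example.
columns = [VarcharColumn(declared_length=100, average_actual_length=12)] * 293
records = 5_000_000

text_size = text_file_bytes(columns, records)
dataset_size = full_size_dataset_bytes(columns, records)
print(f"text file: {text_size / 1024**3:.1f} GiB")
print(f"data set:  {dataset_size / 1024**3:.1f} GiB")
print(f"ratio:     {dataset_size / text_size:.1f}x")
```

Plugging in the real declared lengths and typical data lengths from the job's metadata should show whether full-size varchar storage alone can account for the observed growth.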
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Data Sets carry four additional fields (per record) used for recording the structure - including sort order - of the data in the Data Set. These fields can be seen when inspecting the record schema of the Data Set (or File Set, which also carry these fields).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Ray,
Please correct me if I'm wrong.
ray.wurlod wrote: inspecting the record schema of the Data Set ...
By this, did you mean
orchadmin ll <datasetname.ds> ?
For the above command, here is what I see for a single varchar column, and I don't see the four additional fields anywhere.
Schema:
record
( col1: string;
)
-
- Premium Member
- Posts: 730
- Joined: Tue Nov 04, 2008 10:14 am
- Location: Bangalore
ray.wurlod wrote: Data Sets carry four additional fields (per record) used for recording the structure
Could anyone please throw some light on these extra columns, beyond the sort order already mentioned? I am also curious about varchar fields taking up their full defined space. The discussion always says they are variable length; if so, what do the remaining memory locations of the field contain where there is no data? (In a char field they are padded with spaces; what happens here?)
Thanks in advance
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Sequential files can take more space than data sets too, up to 7x more. As with many cases, it depends! I would agree that in most cases, the data set file(s) by nature will require more space than the same data in text files. See this topic for an example of such an exception:
viewtopic.php?t=143906
Choose a job you love, and you will never have to work a day in your life. - Confucius