Dataset occupying a lot more space compared to a text file.

mavrick21 · Post by **mavrick21** » Wed Jun 16, 2010 11:58 am

Hello,

A job reads from a sequential file, passes the data through a transformer and writes to a DataSet. There are 7 key columns for source and destination. The sequential file has 293 fields and 5 million records. The DataSet has the same number of fields and records, but it occupies 7-8 times more space on disk than the sequential file.

I'd be thankful if anyone could explain why this is so and whether there is a way to reduce the space consumed by the DataSet.

jcthornton · Post by **jcthornton** » Wed Jun 16, 2010 12:49 pm

Multiple reasons come to mind that could explain at least a part of the growth.

First, the dataset stores the records using the internal representation of that data. Depending on how those columns are defined and what they contain, this representation can reasonably be larger than a source text file.

Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.

I do not think that those two reasons would account for a 7-8x increase in space on disk, however.

The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.
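
To make the 'entire' point concrete, here is a rough sketch of the arithmetic, assuming every partition holds a complete copy of the data. The function name and the sizes are hypothetical and are not taken from the job in question.

```python
def entire_partition_footprint(data_bytes: int, partitions: int) -> int:
    """Approximate on-disk size when each partition stores a full copy.

    Assumption: 'entire' partitioning puts the complete data in every
    partition, so the footprint grows linearly with the partition count.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    return data_bytes * partitions


one_gib = 1024 ** 3  # hypothetical size of a single copy of the data
for nodes in (1, 4, 8):
    size = entire_partition_footprint(one_gib, nodes)
    print(f"{nodes} node(s): {size / one_gib:.0f} GiB")
```

On a single-node configuration the factor is 1, so this effect alone would not explain any growth there.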

chulett · Post by **chulett** » Wed Jun 16, 2010 2:10 pm

Varchar fields are stored at their full size, from what I recall.

mavrick21 · Post by **mavrick21** » Wed Jun 16, 2010 2:16 pm

jcthornton,

jcthornton wrote: Second, there is additional metadata that is stored in the dataset that will not be in a similar text file.

Does additional metadata include indexes on key columns? What are the other things that are included in metadata?

jcthornton wrote: The only other item that comes to mind is that perhaps you are doing an 'entire' partition on a multi-node setup. Since the dataset stores the contents of each partition independently, if every partition has the same data (entire) it could account for significant growth.

The job uses only 1 node config file.

chulett,

chulett wrote: Varchar fields are stored at their full size, from what I recall.

Is this mentioned in IBM's documents? In which document can I find it?

Thanks for your time.

chulett · Post by **chulett** » Wed Jun 16, 2010 2:23 pm

Not sure where that is documented other than in posts here. Unbounded varchar fields are stored as you would expect but declare a size (which most everyone does) and it takes that full size in the dataset.
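
A rough way to see how much this can matter, as a sketch only: the estimate below compares a delimited text record with a record in which every bounded varchar takes its full declared length. The 293 fields come from the original question, but the declared length of 50, the average value length of 6 and the one-byte-per-character encoding are made-up assumptions, and any per-field or per-record overhead in the dataset is ignored.

```python
def text_record_bytes(value_lengths, delimiter_bytes=1, newline_bytes=1):
    """Size of one delimited text record: values + delimiters + newline."""
    n = len(value_lengths)
    return sum(value_lengths) + delimiter_bytes * max(n - 1, 0) + newline_bytes


def full_width_record_bytes(declared_lengths):
    """Size of one record if every bounded varchar uses its declared length."""
    return sum(declared_lengths)


fields = 293               # from the original question
actual = [6] * fields      # hypothetical: average value is 6 characters
declared = [50] * fields   # hypothetical: every column declared as varchar(50)

text_size = text_record_bytes(actual)
full_size = full_width_record_bytes(declared)
ratio = full_size / text_size

print(f"text record:       {text_size} bytes")
print(f"full-width record: {full_size} bytes")
print(f"ratio:             {ratio:.2f}x")
```

With these hypothetical numbers the ratio comes out at roughly 7, but the real figure depends entirely on the declared lengths and on how long the actual values are.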

mavrick21 · Post by **mavrick21** » Wed Jun 16, 2010 2:48 pm

chulett wrote: Not sure where that is documented other than in posts here. Unbounded varchar fields are stored as you would expect but declare a size (which most everyone does) and it takes that full size in the dataset.

I created a tiny job to verify it and what you said is correct.

Thank you!

Thanks jcthornton!

ray.wurlod · Post by **ray.wurlod** » Wed Jun 16, 2010 5:03 pm

Data Sets carry four additional fields (per record) used for recording the structure - including sort order - of the data in the Data Set. These fields can be seen when inspecting the record schema of the Data Set (or of a File Set, which also carries these fields).

mavrick21 · Post by **mavrick21** » Thu Jun 17, 2010 11:47 am

Ray,

Please correct me if I'm wrong.

By

ray.wurlod wrote: inspecting the record schema of the Data Set ...

did you mean

orchadmin ll <datasetname.ds> ?

For the above command, here is what I see for a single varchar column; I don't see the four additional fields anywhere.

Schema:
record
( col1: string;
)

ray.wurlod · Post by **ray.wurlod** » Thu Jun 17, 2010 5:11 pm

No, I meant cat /path/fileset.fs (these are easier, since there's no binary data - but you can do something similar with ".ds" files).

zulfi123786 · Post by **zulfi123786** » Mon Jul 12, 2010 5:14 am

ray.wurlod wrote: Data Sets carry four additional fields (per record) used for recording the structure

Could anyone please throw some light on these extra columns, in addition to the sort order already mentioned? I am also curious about varchar fields taking up their full defined space. The discussion always says they are of variable length; if so, what do the remaining memory locations of the field contain where there is no data? (In a char field the value is padded with spaces; what happens here?)

Thanks in advance

ray.wurlod · Post by **ray.wurlod** » Mon Jul 12, 2010 5:33 am

I have nothing to add to what I have already said.

qt_ky · Post by **qt_ky** » Thu Jan 05, 2012 10:43 pm

Sequential files can take more space than data sets too, up to 7x more. As with many cases, it depends! I would agree that in most cases, the data set file(s) by nature will require more space than the same data in text files. See this topic for an example of such an exception:

viewtopic.php?t=143906
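
One way a text file can end up larger, offered as an assumption and not as something confirmed by the linked topic: numeric values written out as text can need more bytes than a fixed-width binary encoding. The sketch below compares the text form of an integer (plus one delimiter byte) with a 4-byte binary integer; the helper names are hypothetical, and the actual internal layout of a data set is not shown here.

```python
import struct


def text_field_bytes(value: int, delimiter_bytes: int = 1) -> int:
    """Bytes needed to write the integer as text, plus its delimiter."""
    return len(str(value)) + delimiter_bytes


def binary_int32_bytes(value: int) -> int:
    """Bytes needed for a fixed-width 32-bit binary integer."""
    return len(struct.pack("<i", value))


for value in (7, 123456, -2147483648):
    print(f"{value:>12}: text={text_field_bytes(value)} bytes, "
          f"binary={binary_int32_bytes(value)} bytes")
```

Small values are cheaper as text and large ones are cheaper in binary, which fits the "it depends" above.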