DataSet Size

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

DataSet Size

Post by sendmkpk »

Hi All,

We are doing some capacity planning for our project.

We have a fixed file of size 2436999939 bytes. My job removes the header and trailer records from the file and creates a dataset with only a few necessary columns, plus two additional columns: one varchar(5) with a constant value and one timestamp column holding the current timestamp.

The size of the dataset (via Data Set Management in the Manager) is 2436998939 bytes, almost the same as the source file.

We were under the assumption that it would take less space. All the columns are varchar.

Any help on this?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Take a look at the record schema for the Data Set. There is a storage overhead of 80 bits per row. Can you post the size of each of the segment files making up the Data Set (in the bottom frame of the GUI)?
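As a rough illustration only (the row count below is purely hypothetical, and this is just the arithmetic behind the 80-bit figure, not output from any DataStage tool):

Code:

# Back-of-the-envelope view of the per-row Data Set overhead mentioned above.
# 80 bits per row = 10 bytes per row; the row count is purely illustrative.
row_count = 10_000_000                      # hypothetical number of rows
overhead_bytes = row_count * 10             # 80 bits -> 10 bytes per row
print(f"Overhead for {row_count:,} rows: ~{overhead_bytes / 1024**2:.1f} MiB")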
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If you use unbounded VarChar() columns you will use a lot less space when your strings are nowhere near their maximum length.

For example, on a single node, if you create a DataSet with 1 million rows declared as varchar(1024) and fill them with "hello world", the file size will be 984.38 MB, but if you declare the column unbounded the same DataSet will only be 15.38 MB in size.
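
A minimal sketch of where those numbers come from, assuming each varchar value carries a 4-byte length prefix and ignoring per-row overhead (both are assumptions for illustration, not anything stated above):

Code:

# Approximate dataset size for 1,000,000 rows of "hello world", stored as
# bounded varchar(1024) versus unbounded varchar.
# Assumption: 4-byte length prefix per value; per-row overhead ignored.
rows = 1_000_000
value = "hello world"

bounded_bytes = rows * (1024 + 4)          # full declared length is written out
unbounded_bytes = rows * (len(value) + 4)  # only the actual string is written

print(f"varchar(1024): ~{bounded_bytes / 1024**2:.2f} MiB")   # in the region of the 984 MB quoted
print(f"unbounded:     ~{unbounded_bytes / 1024**2:.2f} MiB") # in the region of the 15 MB quoted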
SettValleyConsulting
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

Here is the storage required for each data type. Using these you can get a pretty good estimate of your dataset record size (a small worked sketch follows the list)...
  • Integer: 4 bytes
  • Small Integer: 2 bytes
  • Tiny Integer: 1 byte
  • Big Integer: 8 bytes
  • Decimal: (precision+1)/2 bytes, rounded up
  • Float: 8 bytes
  • VarChar(n): n + 4 bytes for non-NLS data, 2n + 4 bytes for NLS data
  • Char(n): n bytes for non-NLS data, 2n bytes for NLS data
  • Time: 4 bytes, 8 bytes with msec resolution
  • Date: 4 bytes
  • Timestamp: 8 bytes, 12 bytes with msec resolution

Add (number of nullable fields)/8 bytes for null indicators.
Add one byte per column for field alignment (on average).
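
A minimal estimator along the lines of the table above; the column layout at the bottom is hypothetical, and rounding the null-indicator bytes up is an assumption:

Code:

# Rough per-row size estimator using the byte counts listed above.
import math

def varchar(n, nls=False): return (2 * n if nls else n) + 4
def char(n, nls=False):    return 2 * n if nls else n
def decimal(precision):    return math.ceil((precision + 1) / 2)

def row_size(field_sizes, nullable_fields=0):
    total = sum(field_sizes)
    total += math.ceil(nullable_fields / 8)  # null indicators (rounding up is an assumption)
    total += len(field_sizes)                # ~1 byte per column for alignment, on average
    return total

# Hypothetical layout: three varchar columns plus a timestamp without msec resolution.
fields = [varchar(30), varchar(100), varchar(5), 8]
print(row_size(fields, nullable_fields=2), "bytes per row (estimate)")

Multiplying the per-row estimate by the row count, plus the per-row overhead Ray mentions, gives a ballpark Data Set size.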

As ArndW says, using unsized Varchars is inefficient, not just in datasets, and should be avoided. (But read on.)

Phil Clarke.
Last edited by SettValleyConsulting on Thu May 01, 2008 9:59 am, edited 1 time in total.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Actually, I stated the exact opposite regarding storage size of unbounded VarChar() fields for datasets.
SettValleyConsulting
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

Ooops :oops:

So you did. In case of conflicting advice, readers should believe Arnd over me every time :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Naah, that isn't the case when it comes to opinions. I always favor bounded over unbounded strings, but when it comes to DataSets with large volumes, unbounded ones save both space and time. Using the fixed-length maximum size makes indexing faster, but the cost in space is horrendous with strings that are "sparse".