
DataSet Size

Posted: Thu May 01, 2008 6:01 am
by sendmkpk
Hi All,

We are doing some capacity planning for our project.

We have a fixed-width file of 2436999939 bytes. My job removes the header and trailer records from the file and creates a dataset with only a few necessary columns, plus two additional columns: one VarChar(5) with a constant value and one Timestamp column holding the current timestamp.

The size of the dataset (via the Data Set Management tool in the Manager) is reported as 2436998939 bytes, almost the same as the source file.

We were under the assumption that it would take less space. All the columns are VarChar.

Any help on this?

Thanks

Posted: Thu May 01, 2008 6:15 am
by ray.wurlod
Take a look at the record schema for the Data Set. There is a storage overhead of 80 bits per row. Can you post the size of each of the segment files making up the Data Set (in the bottom frame of the GUI)?
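For scale, 80 bits is 10 bytes per row, which adds up quickly on large volumes. A back-of-envelope sketch (the row count here is hypothetical, not from this thread):

Code:
    # 80 bits of per-row overhead = 10 bytes per row
    rows = 10_000_000                                  # hypothetical row count
    overhead_bytes = rows * 80 // 8
    print(f"{overhead_bytes / (1024 ** 2):.1f} MB")    # ~95.4 MB of pure overhead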

Posted: Thu May 01, 2008 6:31 am
by ArndW
If you use unbounded VarChar() columns, you will use a lot less space when your strings are nowhere near their maximum length.

For example, on a single node, if you create a Data Set of 1 million rows declared as VarChar(1024) and fill them with "hello world", the file will be 984.38 MB; declare the column unbounded and the same Data Set is only 15.38 MB.
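Those figures are easy to sanity-check, assuming a bounded VarChar(n) stores n + 4 bytes regardless of content while an unbounded VarChar stores only the actual string length plus a 4-byte prefix (a rough sketch; the real Data Set layout adds further per-row overhead):

Code:
    rows = 1_000_000
    MB = 1024 * 1024

    bounded = rows * (1024 + 4)                   # VarChar(1024), padded to full length
    unbounded = rows * (len("hello world") + 4)   # only the bytes actually used

    print(f"bounded:   {bounded / MB:.2f} MB")    # ~980 MB, close to the quoted 984.38 MB
    print(f"unbounded: {unbounded / MB:.2f} MB")  # ~14 MB, close to the quoted 15.38 MB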

Posted: Thu May 01, 2008 9:22 am
by SettValleyConsulting
Here is the storage required by each data type. Using these figures you can get a pretty good estimate of your dataset record size; a worked sketch follows the list.
  • Integer: 4 bytes
  • Small Integer: 2 bytes
  • Tiny Integer: 1 byte
  • Big Integer: 8 bytes
  • Decimal: (precision + 1)/2 bytes, rounded up
  • Float: 8 bytes
  • VarChar(n): n + 4 bytes for non-NLS data, 2n + 4 bytes for NLS data
  • Char(n): n bytes for non-NLS data, 2n bytes for NLS data
  • Time: 4 bytes, or 8 bytes with msec resolution
  • Date: 4 bytes
  • Timestamp: 8 bytes, or 12 bytes with msec resolution

Add (number of nullable fields)/8 bytes, rounded up, for null indicators.
Add one byte per column for field alignment (on average).
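Put together, those rules give a quick way to estimate record size in code. A minimal sketch (the helper names and the example columns are illustrative, not an official API):

Code:
    import math

    def varchar_size(n, nls=False):
        # Bounded VarChar(n): n bytes (2n for NLS) plus a 4-byte length prefix
        return (2 * n if nls else n) + 4

    def decimal_size(precision):
        # Decimal: (precision + 1)/2 bytes, rounded up
        return math.ceil((precision + 1) / 2)

    def record_size(field_sizes, nullable_fields=0):
        total = sum(field_sizes)
        total += math.ceil(nullable_fields / 8)   # packed null-indicator bits
        total += len(field_sizes)                 # ~1 byte/column alignment, on average
        return total

    # e.g. the original poster's two added columns: a VarChar(5) constant
    # and a timestamp with msec resolution (12 bytes)
    print(record_size([varchar_size(5), 12]))     # -> 23 bytes per record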

As ArndW says, using unsized VarChars is inefficient, not just in datasets, and should be avoided. (But read on.)

Phil Clarke.

Posted: Thu May 01, 2008 9:26 am
by ArndW
Actually, I stated the exact opposite regarding storage size of unbounded VarChar() fields for datasets.

Posted: Thu May 01, 2008 9:47 am
by SettValleyConsulting
Ooops :oops:

So you did. In case of conflicting advice, readers should believe Arnd over me every time :wink:

Posted: Thu May 01, 2008 9:51 am
by ArndW
Naah, that isn't the case when it comes to opinions. I generally favor bounded over unbounded strings, but when it comes to Data Sets with large volumes, unbounded strings save both space and time. Using the fixed maximum length makes indexing faster, but the cost in space is horrendous when the strings are "sparse" (far shorter than their declared maximum).