Hi All,
We are doing some capacity planning for our project.
We have a fixed-width file of size 2,436,999,939 bytes. My job removes the header and trailer records from the file and creates a dataset with only a few necessary columns, plus two additional columns: one varchar(5) with a constant value and one timestamp column holding the current timestamp.
The size of the dataset (via Data Set Management in Manager) is 2,436,998,939 bytes, almost the same as the source file.
We were under the assumption that it would take less space. All the columns are varchar.
Any help on this?
Thanks
DataSet Size
Take a look at the record schema for the Data Set. There is a storage overhead of 80 bits per row. Can you post the size of each of the segment files making up the Data Set (in the bottom frame of the GUI)?
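As a quick sanity check on that figure: 80 bits is 10 bytes per row, so the fixed per-row contribution is easy to estimate. A minimal sketch (the 80-bits-per-row figure is the one quoted above; the example row count is purely illustrative):

```python
# Fixed Data Set overhead: 80 bits = 10 bytes per row (figure quoted above).
BITS_PER_ROW = 80

def overhead_bytes(row_count):
    """Total fixed overhead contributed by row_count rows."""
    return row_count * BITS_PER_ROW // 8

# e.g. 10 million rows carry 100,000,000 bytes (~95 MB) of row overhead
print(overhead_bytes(10_000_000))  # 100000000
```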
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
If you use unbounded VarChar() columns you will use a lot less space when your strings are nowhere near their maximum length.
For example, with a single node, if you create a DataSet with 1 million rows declared as varchar(1024) and fill them with "hello world", the file size will be 984.38 MB, but if you declare the column unbounded the same DataSet will only be 15.38 MB.
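The arithmetic behind that example can be sketched as follows. This is only an estimate, assuming a bounded VarChar(n) stores its full n bytes plus a 4-byte length prefix per row, while an unbounded VarChar stores just the actual string plus the prefix; the real figures in the post include some extra per-row overhead, so they come out slightly higher:

```python
# Rough Data Set size estimate for a single VarChar column (non-NLS).
# Assumption: bounded varchar(n) -> n + 4 bytes per row;
# unbounded varchar -> actual length + 4 bytes per row.

def bounded_varchar_mb(rows, declared_len):
    """Estimated size in MB when the column is declared varchar(declared_len)."""
    return rows * (declared_len + 4) / (1024 * 1024)

def unbounded_varchar_mb(rows, actual_len):
    """Estimated size in MB when the column is an unbounded varchar."""
    return rows * (actual_len + 4) / (1024 * 1024)

rows = 1_000_000
print(f"varchar(1024): {bounded_varchar_mb(rows, 1024):.2f} MB")              # ~980 MB
print(f"unbounded:     {unbounded_varchar_mb(rows, len('hello world')):.2f} MB")  # ~14 MB
```

The estimates land close to the 984.38 MB and 15.38 MB figures quoted above, with the gap explained by per-row overhead not modelled here.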
Here is the storage required for each data type. Using these you can get a pretty good estimate of your dataset record size...
- Data Type: Size
- Integer: 4 bytes
- Small Integer: 2 bytes
- Tiny Integer: 1 byte
- Big Integer: 8 bytes
- Decimal: (precision+1)/2 bytes, rounded up
- Float: 8 bytes
- VarChar(n): n + 4 bytes for non-NLS data, 2n + 4 bytes for NLS data
- Char(n): n bytes for non-NLS data, 2n bytes for NLS data
- Time: 4 bytes, 8 bytes with msec resolution
- Date: 4 bytes
- Timestamp: 8 bytes, 12 bytes with msec resolution
Add one byte per column for field alignment (on average)
As Arndw says, using unsized Varchars is inefficient, not just in datasets and should be avoided. (But read on)
Phil Clarke.
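The per-type figures above can be turned into a small record-size estimator. A hedged sketch (non-NLS sizes only; the +1 byte per column is the average alignment figure quoted in the post, and the example columns are the varchar(5) and timestamp from the original question):

```python
# Estimate a Data Set record size from the per-type storage figures
# listed above (non-NLS). Alignment is approximated as +1 byte/column.
import math

FIXED_SIZES = {"integer": 4, "smallint": 2, "tinyint": 1, "bigint": 8,
               "float": 8, "date": 4, "time": 4, "timestamp": 8}

def field_bytes(dtype, n=0, precision=0):
    """Storage bytes for one field, before alignment."""
    if dtype == "varchar":
        return n + 4                          # n bytes + 4-byte length prefix
    if dtype == "char":
        return n
    if dtype == "decimal":
        return math.ceil((precision + 1) / 2)
    return FIXED_SIZES[dtype]

def record_bytes(fields):
    """fields: list of (dtype, n) tuples; n doubles as precision for decimal."""
    return sum(field_bytes(d, n, n) + 1 for d, n in fields)  # +1 alignment/column

# The two extra columns from the original question:
print(record_bytes([("varchar", 5), ("timestamp", 0)]))  # (5+4+1) + (8+1) = 19
```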
Last edited by SettValleyConsulting on Thu May 01, 2008 9:59 am, edited 1 time in total.
Naah, that isn't the case when it comes to opinions. I normally favour bounded over unbounded strings, but when it comes to DataSets with large volumes, unbounded strings save both space and time. Using the fixed-length maximum size makes indexing faster, but the cost in space is horrendous when the strings are "sparse" (far shorter than their declared length).