DataSet Size

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

sendmkpk
Premium Member
Posts: 97
Joined: Mon Apr 02, 2007 2:47 am

DataSet Size

Post by sendmkpk »

Hi All,

We are doing some capacity planning for our project.

We have a fixed file of size 2436999939 bytes. My job removes the header and trailer records from the file and creates a dataset with only a few necessary columns, plus two additional columns: one varchar(5) with a constant value and one timestamp column holding the current timestamp.

The size of the dataset (via Data Set Management in the Manager) is 2436998939 bytes, almost the same as the source file.

We were under the assumption that it would take less space. All the columns are varchar.

Any help on this?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Take a look at the record schema for the Data Set. There is a storage overhead of 80 bits per row. Can you post the size of each of the segment files making up the Data Set (in the bottom frame of the GUI)?
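As a rough illustration only (the row count below is purely hypothetical, and this is just the arithmetic behind the 80-bit figure, not output from any DataStage tool):

Code:

# Back-of-the-envelope view of the per-row Data Set overhead mentioned above.
# 80 bits per row = 10 bytes per row; the row count is purely illustrative.
row_count = 10_000_000                      # hypothetical number of rows
overhead_bytes = row_count * 10             # 80 bits -> 10 bytes per row
print(f"Overhead for {row_count:,} rows: ~{overhead_bytes / 1024**2:.1f} MiB")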
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

If you use unbounded VarChar() columns you will use a lot less space when your strings are nowhere near their maximum length.

For example, on a single node, if you create a DataSet with 1 million rows declared as varchar(1024) and fill them with "hello world", the file size will be 984.38 MB, but if you declare the column unbounded the same DataSet will only be 15.38 MB in size.
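
A minimal sketch of where those numbers come from, assuming each varchar value carries a 4-byte length prefix and ignoring per-row overhead (both are assumptions for illustration, not anything stated above):

Code:

# Approximate dataset size for 1,000,000 rows of "hello world", stored as
# bounded varchar(1024) versus unbounded varchar.
# Assumption: 4-byte length prefix per value; per-row overhead ignored.
rows = 1_000_000
value = "hello world"

bounded_bytes = rows * (1024 + 4)          # full declared length is written out
unbounded_bytes = rows * (len(value) + 4)  # only the actual string is written

print(f"varchar(1024): ~{bounded_bytes / 1024**2:.2f} MiB")   # in the region of the 984 MB quoted
print(f"unbounded:     ~{unbounded_bytes / 1024**2:.2f} MiB") # in the region of the 15 MB quoted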
SettValleyConsulting
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

Here is the storage required for each data type. Using these you can get a pretty good estimate of your dataset record size (a small worked sketch follows the list)...
  • Integer: 4 bytes
  • Small Integer: 2 bytes
  • Tiny Integer: 1 byte
  • Big Integer: 8 bytes
  • Decimal: (precision+1)/2 bytes, rounded up
  • Float: 8 bytes
  • VarChar(n): n + 4 bytes for non-NLS data, 2n + 4 bytes for NLS data
  • Char(n): n bytes for non-NLS data, 2n bytes for NLS data
  • Time: 4 bytes, 8 bytes with msec resolution
  • Date: 4 bytes
  • Timestamp: 8 bytes, 12 bytes with msec resolution

Add (number of nullable fields)/8 bytes for null indicators.
Add one byte per column for field alignment (on average).
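
A minimal estimator along the lines of the table above; the column layout at the bottom is hypothetical, and rounding the null-indicator bytes up is an assumption:

Code:

# Rough per-row size estimator using the byte counts listed above.
import math

def varchar(n, nls=False): return (2 * n if nls else n) + 4
def char(n, nls=False):    return 2 * n if nls else n
def decimal(precision):    return math.ceil((precision + 1) / 2)

def row_size(field_sizes, nullable_fields=0):
    total = sum(field_sizes)
    total += math.ceil(nullable_fields / 8)  # null indicators (rounding up is an assumption)
    total += len(field_sizes)                # ~1 byte per column for alignment, on average
    return total

# Hypothetical layout: three varchar columns plus a timestamp without msec resolution.
fields = [varchar(30), varchar(100), varchar(5), 8]
print(row_size(fields, nullable_fields=2), "bytes per row (estimate)")

Multiplying the per-row estimate by the row count, plus the per-row overhead Ray mentions, gives a ballpark Data Set size.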

As ArndW says, using unsized Varchars is inefficient, not just in datasets, and should be avoided. (But read on.)

Phil Clarke.
Last edited by SettValleyConsulting on Thu May 01, 2008 9:59 am, edited 1 time in total.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Actually, I stated the exact opposite regarding storage size of unbounded VarChar() fields for datasets.
SettValleyConsulting
Premium Member
Posts: 72
Joined: Thu Sep 04, 2003 5:01 am
Location: UK & Europe

Post by SettValleyConsulting »

Ooops :oops:

So you did. In case of conflicting advice, readers should believe Arnd over me every time :wink:
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Naah, that isn't the case when it comes to opinions. I always favor bounded over unbounded strings, but when it comes to DataSets with large volumes, unbounded ones save both space and time. Using the fixed-length maximum size makes indexing faster, but the cost in space is horrendous with strings that are "sparse".