BLANKS stored in VARCHAR fields of a dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

BLANKS stored in VARCHAR fields of a dataset

Post by ag_ram »

Hi,

I have a job that reads records from a CSV flat file. The source has four fields, each defined as varchar(250). The records read from the file are written to a dataset, where the columns are also defined as varchar(250).

The dataset file is currently consuming 1KB to store each record even though the input data size per record is only 20 bytes. Is there any environment variable that can prevent the dataset from storing BLANK values for VARCHAR fields?

Thanks,
Ravish.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, but you can apply the Trim() function in a Transformer stage. Also investigate derivation substitution; you will only need to enter one Trim() function!
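
For reference, roughly what a Trim() derivation does to each value (a Python sketch of the common case, stripping leading and trailing spaces; the real function also takes character and option arguments):

Code: Select all

def trim(value: str) -> str:
    # Rough analogue of a Trim() derivation: strip the leading and
    # trailing spaces so only the real content is carried forward.
    return value.strip(" ")

print(repr(trim("hello" + " " * 245)))  # 'hello'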
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

DataSets store their data in fixed-width format internally, so "hello" stored in a VarChar(1000) field will use up 1002 bytes.
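
A minimal sketch of that fixed-width layout, assuming (hypothetically) a 2-byte length prefix in front of each field; it reproduces both the 1002-byte figure and the original ~1KB-per-record observation:

Code: Select all

import struct

def fixed_width_field(value: bytes, declared_len: int) -> bytes:
    # Hypothetical model: a 2-byte length prefix followed by the value
    # padded out to the full declared width with spaces.
    return struct.pack("<H", len(value)) + value.ljust(declared_len, b" ")

print(len(fixed_width_field(b"hello", 1000)))  # 1002 bytes

# Four varchar(250) fields holding ~5 bytes each, as in the original post:
record = b"".join(fixed_width_field(b"abcde", 250) for _ in range(4))
print(len(record))  # 1008 bytes, i.e. roughly 1KB per record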
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

Thanks Ray.

But in production the volume is going to be on the order of a few hundred million records, so I was thinking that adding a Transformer stage with a Trim() in it could affect my performance.

So I did the following:
1. Added a Copy stage between the Sequential File stage and the Dataset stage.
2. Mapped VARCHAR(250) on the Sequential File side to unbounded VARCHAR on the Dataset side, and the size of the output dataset was reduced dramatically.

So my question is: what is the difference between VARCHAR(250) and VARCHAR? Are there any known issues in v7.5 with using unbounded VARCHAR?

Thanks again.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I am going to have to check whether using unbounded strings changes the DataSet behaviour on my system as well. If so, it is an interesting way of making datasets use less space.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I just did a test: if I declare a dataset with one column of VarChar(1024) and fill it with just a single character per row, my test job of 500K rows generates a data set of 263479K, whereas if I remove the length, the same number of rows takes up only 2752K.
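
Working backwards from those measured sizes gives a feel for the difference (a rough check; per-partition overhead and alignment will blur the exact figures):

Code: Select all

rows = 500_000
for label, kbytes in [("VarChar(1024)", 263_479), ("unbounded VarChar", 2_752)]:
    print(f"{label}: {kbytes * 1024 / rows:.1f} bytes/row")
# VarChar(1024):     539.6 bytes/row
# unbounded VarChar:   5.6 bytes/row -- about a length prefix plus the
# single character actually stored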

I wasn't aware of that change in behaviour, but I am applying it to a job that I need to run today; the dataset filled up 400GB of space last time, and it has a lot of VarChar columns that are only partially filled!
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

Arnd,

Can you please post the results of the job run today? Did you observe any data integrity issues? Did the size of the output dataset come down from 400GB?

Thanks,
Ravish.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Storage in Data Sets is about the only recommended place to use unbounded VarChars. They are stored as you would expect from the designator in the score, string[max=N]{prefix_bytes m}, where m is the size of the binary number indicating how many characters are stored and N is the maximum number of characters to be stored. However, only the actual characters in the string are stored (to conserve space); the prefix acts as a pointer to the next available storage location.
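
A toy sketch of that length-prefixed layout, with the prefix size m as a parameter (the actual on-disk format is internal to the engine):

Code: Select all

import struct

def prefixed(value: bytes, prefix_bytes: int = 4) -> bytes:
    # Write a binary length prefix of m bytes, then only the characters
    # actually present -- no padding out to the declared maximum N.
    fmt = {1: "<B", 2: "<H", 4: "<I"}[prefix_bytes]
    return struct.pack(fmt, len(value)) + value

print(len(prefixed(b"hello")))     # 9 bytes: 4-byte prefix + 5 characters
print(len(prefixed(b"hello", 2)))  # 7 bytes: 2-byte prefix + 5 characters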

Anywhere else (that is, when processing), bounded VarChar is to be preferred.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Just for information: I did a test run with 5% of the data into a dataset and it was 17.661GB; the same data going to unbounded VarChar columns was only 6.109GB on disk (120 columns, 80 of them VarChar with lengths of 1-50). The savings in disk space were great, but the runtime was even better on this I/O-bound machine.
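
Scaling that 5% sample linearly (a rough extrapolation only) lines up with the ~400GB dataset mentioned earlier:

Code: Select all

sample_fraction = 0.05
bounded_gb, unbounded_gb = 17.661, 6.109
print(f"projected full run, bounded:   ~{bounded_gb / sample_fraction:.0f}GB")    # ~353GB
print(f"projected full run, unbounded: ~{unbounded_gb / sample_fraction:.0f}GB")  # ~122GB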