BLANKS stored in VARCHAR fields of a dataset

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

BLANKS stored in VARCHAR fields of a dataset

Post by ag_ram »

Hi,

I have a job that reads records from a CSV flat file. The source has four fields, each defined as varchar(250). The records read from the file are written to a dataset, where the columns are also defined as varchar(250).

The dataset file is currently consuming 1KB to store each record even though the input data size per record is only 20 bytes. Is there any environment variable that can prevent the dataset from storing BLANK values for VARCHAR fields?

Thanks,
Ravish.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No, but you can apply the Trim() function in a Transformer stage. Also investigate derivation substitution; you will only need to enter one Trim() function!
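
For reference, roughly what a Trim() derivation does to each value (a Python sketch of the common case, stripping leading and trailing spaces; the real function also takes character and option arguments):

Code: Select all

def trim(value: str) -> str:
    # Rough analogue of a Trim() derivation: strip the leading and
    # trailing spaces so only the real content is carried forward.
    return value.strip(" ")

print(repr(trim("hello" + " " * 245)))  # 'hello'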
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

DataSets store their data in fixed-width format internally, so "hello" stored in a VarChar(1000) field will use up 1002 bytes.
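
A minimal sketch of that fixed-width layout, assuming (hypothetically) a 2-byte length prefix in front of each field; it reproduces both the 1002-byte figure and the original ~1KB-per-record observation:

Code: Select all

import struct

def fixed_width_field(value: bytes, declared_len: int) -> bytes:
    # Hypothetical model: a 2-byte length prefix followed by the value
    # padded out to the full declared width with spaces.
    return struct.pack("<H", len(value)) + value.ljust(declared_len, b" ")

print(len(fixed_width_field(b"hello", 1000)))  # 1002 bytes

# Four varchar(250) fields holding ~5 bytes each, as in the original post:
record = b"".join(fixed_width_field(b"abcde", 250) for _ in range(4))
print(len(record))  # 1008 bytes, i.e. roughly 1KB per record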
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

Thanks Ray.

But in production the volume is going to be on the order of a few hundred million records, so I was thinking that adding a Transformer stage with a Trim() in it could affect my performance.

So I did the following:
1. Added a Copy stage between the Sequential File stage and the Dataset stage.
2. Mapped VARCHAR(250) on the Sequential File side to unbounded VARCHAR on the Dataset side, and the size of the output dataset was reduced dramatically.

So my question is: what is the difference between VARCHAR(250) and VARCHAR? Are there any known issues in v7.5 with using unbounded VARCHAR?

Thanks again.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I am going to have to check whether using unbounded strings changes the DataSet behaviour on my system as well. If so, it is an interesting way of making datasets use less space.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I just did a test: if I declare a dataset with one column of VarChar(1024) and fill it with just a single character per row, my test job of 500K rows generates a data set of 263479K, whereas if I remove the length, the same number of rows takes up only 2752K.
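
Working backwards from those measured sizes gives a feel for the difference (a rough check; per-partition overhead and alignment will blur the exact figures):

Code: Select all

rows = 500_000
for label, kbytes in [("VarChar(1024)", 263_479), ("unbounded VarChar", 2_752)]:
    print(f"{label}: {kbytes * 1024 / rows:.1f} bytes/row")
# VarChar(1024):     539.6 bytes/row
# unbounded VarChar:   5.6 bytes/row -- about a length prefix plus the
# single character actually stored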

I wasn't aware of that change in behaviour, but I am applying it to a job that I need to run today; the dataset filled up 400GB of space last time, and it has a lot of VarChar columns that are only partially filled!
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

Arnd,

Can you please post the results of the job run today? Did you observe any data integrity issues? Did the size of the output dataset come down from 400GB?

Thanks,
Ravish.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Storage in Data Sets is about the only recommended place to use unbounded VarChars. They are stored as you would expect from the designator in the score, string[max=N]{prefix_bytes m}, where m is the size of the binary number indicating how many characters are stored and N is the maximum number of characters to be stored. However, only the actual characters in the string are stored (to conserve space); the prefix acts as a pointer to the next available storage location.
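
A toy sketch of that length-prefixed layout, with the prefix size m as a parameter (the actual on-disk format is internal to the engine):

Code: Select all

import struct

def prefixed(value: bytes, prefix_bytes: int = 4) -> bytes:
    # Write a binary length prefix of m bytes, then only the characters
    # actually present -- no padding out to the declared maximum N.
    fmt = {1: "<B", 2: "<H", 4: "<I"}[prefix_bytes]
    return struct.pack(fmt, len(value)) + value

print(len(prefixed(b"hello")))     # 9 bytes: 4-byte prefix + 5 characters
print(len(prefixed(b"hello", 2)))  # 7 bytes: 2-byte prefix + 5 characters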

Anywhere else (that is, when processing), bounded VarChar is to be preferred.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Just for information: I did a test run with 5% of the data into a dataset and it was 17.661GB; the same data going to unbounded VarChar columns was only 6.109GB on disk (120 columns, 80 of them VarChar with lengths of 1-50). The savings in disk space were great, but the runtime was even better on this I/O-bound machine.
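
Scaling that 5% sample linearly (a rough extrapolation only) lines up with the ~400GB dataset mentioned earlier:

Code: Select all

sample_fraction = 0.05
bounded_gb, unbounded_gb = 17.661, 6.109
print(f"projected full run, bounded:   ~{bounded_gb / sample_fraction:.0f}GB")    # ~353GB
print(f"projected full run, unbounded: ~{unbounded_gb / sample_fraction:.0f}GB")  # ~122GB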