BLANKS stored in VARCHAR fields of a dataset
Moderators: chulett, rschirm, roy
Hi,
I have a job that reads records from a CSV flat file. The source has four fields, each of which is defined as varchar(250). The records read from the file are written to a dataset, where the column length is also varchar(250).
The dataset file is currently consuming 1 KB to store each record even though the input data is only 20 bytes per record. Is there any environment variable that can prevent the dataset from storing BLANK padding for VARCHAR fields?
Thanks,
Ravish.
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
DataSets store the data in fixed-width format internally, so "hello" stored in a VarChar(1000) field will use up 1002 bytes.
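A rough way to see where the 1 KB per record comes from (a sketch only, assuming a 2-byte length prefix to account for the 1002 bytes mentioned above; not the actual DataSet internals):

```python
# Hypothetical model of fixed-width storage for a bounded VarChar:
# the full declared length is reserved on disk plus a small length
# prefix, regardless of how many characters are actually present.
def fixed_width_bytes(declared_len, value, prefix_bytes=2):
    """Bytes consumed by one bounded VarChar field in this model."""
    assert len(value) <= declared_len, "value exceeds declared length"
    return prefix_bytes + declared_len  # full width always reserved

# "hello" in a VarChar(1000): 2 + 1000 = 1002 bytes
print(fixed_width_bytes(1000, "hello"))

# Four VarChar(250) fields holding ~5 chars each, as in the original
# question: 4 * (2 + 250) = 1008 bytes, i.e. roughly 1 KB per record.
print(4 * fixed_width_bytes(250, "abcde"))
```

Under this model the per-record cost depends only on the declared lengths, which is why the 20 bytes of actual data still occupy about 1 KB.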
Thanks Ray.
But in production the volume is going to be on the order of a few hundred million records, so I was concerned that adding a Transformer stage with trim() in it could hurt performance.
So I did this instead:
1. Added a Copy stage between the Sequential File stage and the Dataset stage.
2. Mapped VARCHAR(250) (Sequential) to unbounded VARCHAR (Dataset), and the size of the output dataset dropped dramatically.
So my question: what is the difference between VARCHAR(250) and VARCHAR? Are there any known issues in v7.5 with using unbounded VARCHAR?
Thanks again.
I am going to have to check whether using unbounded strings changes the DataSet behaviour on my system as well. If so, it is an interesting way of making sure that datasets use less space.
I just did a test: if I declare a dataset with one column of VarChar(1024) and fill it with just a single character per line, my test job of 500K rows generated a data set of 263,479 KB, whereas if I remove the length, the same number of rows only takes up 2,752 KB.
I wasn't aware of that change in behaviour, but I am applying it to a job that I need to run today - the dataset filled up 400Gb of space last time but it has a lot of VarChar columns that are only partially filled!
Storage in Data Sets is about the only recommended place to use unbounded VarChars. They are stored as you would expect from the designator in the score: string[max=N]{prefix_bytes=m}, where m is the size of the binary number indicating how many characters are stored and N is the maximum number of characters that can be stored. However, only the actual characters in the string are stored (to conserve space); the prefix acts as a pointer to the next available storage location.
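The length-prefixed layout described above can be sketched roughly like this (an illustration only; the prefix size and byte order here are assumptions, not the actual DataSet on-disk format):

```python
import struct

def pack_length_prefixed(values, prefix_bytes=2):
    """Pack strings as <length prefix><actual characters> back to back.

    Each field costs only prefix_bytes plus the characters actually
    present, so mostly-empty VarChar columns stay small on disk.
    """
    fmt = {1: "B", 2: "H", 4: "I"}[prefix_bytes]  # prefix integer size
    out = b""
    for v in values:
        data = v.encode("utf-8")
        out += struct.pack("<" + fmt, len(data)) + data
    return out

# One record with four sparsely filled fields:
record = ["abc", "de", "", "hello"]
packed = pack_length_prefixed(record)
# 4 prefixes * 2 bytes + (3 + 2 + 0 + 5) data bytes = 18 bytes total,
# versus the full declared widths a bounded VarChar would reserve.
print(len(packed))
```

Reading the data back simply walks the buffer: each prefix tells the reader how many bytes to consume before the next prefix begins, which is the "pointer to the next available storage location" behaviour described above.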
Anywhere else (that is, when processing), bounded VarChar is to be preferred.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Just as an FYI, I did a test run with 5% of the data going into a dataset, and it was 17.661 GB; the same data going into unbounded VarChar columns was only 6.109 GB on disk (120 columns, 80 of them VarChar with lengths of 1-50). The savings in disk space were great, but the runtime improvement was even better on this I/O-bound machine.