
VarChar Memory Usage

Posted: Mon Jun 28, 2010 3:51 pm
by dougcl
Hi folks, I've seen elsewhere that
a Varchar(N) is always exactly N+4 bytes when in a dataset, and
a varchar/char(N) with NLS will always be 2N bytes (+4 for varchar).
But I am wondering what the memory and scratch footprint is. Is it the same as above, or is it variable?

At the time of this post, a search on "varchar memory" turns up exactly one hit: this post.

Thanks,
Doug

Posted: Mon Jun 28, 2010 7:07 pm
by ray.wurlod
I would be interested to know what that "elsewhere" is. It's not totally correct. In particular, the numeric prefix is not necessarily four bytes; it may be as small as one or two bytes.

Essentially, the "prefix" size in the record schema gives you this value. Storage in memory aims to use only the required amount. For example, a VarChar(200) that contains only 16 characters would typically occupy 19 bytes without NLS and 35 bytes with NLS; the extra byte is the C-style string terminator (0x00).

In practice the memory demanded will be in "whole word" units: multiples of 4 bytes on a 32-bit operating system or multiples of 8 bytes on a 64-bit operating system. So my 16-character string would actually occupy 20 bytes without NLS and 36 bytes with NLS on a 32-bit operating system, or 24 bytes without NLS and 40 bytes with NLS on a 64-bit operating system. It's just far easier (faster) to move things around in memory on whole-word boundaries.
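
To make the arithmetic concrete, here is a minimal sketch of the sizing rule described above, assuming a 2-byte length prefix and a 1-byte terminator (those sizes, and the function itself, are illustrative, not DataStage internals):

def varchar_footprint(chars, nls=False, word=4, prefix=2, terminator=1):
    """Estimate in-memory bytes for a varchar holding that many characters."""
    data = chars * (2 if nls else 1)          # NLS stores 2 bytes per character
    raw = prefix + data + terminator          # prefix + payload + 0x00 terminator
    return ((raw + word - 1) // word) * word  # round up to a whole-word boundary

# Reproduce the figures above: 16 characters in a VarChar(200).
print(varchar_footprint(16, nls=False, word=4))  # 20 on 32-bit
print(varchar_footprint(16, nls=True,  word=4))  # 36 on 32-bit
print(varchar_footprint(16, nls=False, word=8))  # 24 on 64-bit
print(varchar_footprint(16, nls=True,  word=8))  # 40 on 64-bit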

Posted: Mon Jun 28, 2010 7:30 pm
by dougcl
ray.wurlod wrote:I would be interested to know what that "elsewhere" is.
Forum on IBM. No big deal.

Posted: Tue Jun 29, 2010 2:03 am
by ArndW
There is an anomaly with Data Sets where a Varchar(nnn) column will use nnn bytes of storage per row regardless of the string contents, whereas an unbounded string will allocate just the number of bytes in the string.

Posted: Tue Jun 29, 2010 2:30 am
by ray.wurlod
Not an anomaly - a deliberate design decision. Storage in Data Sets is fixed width (the same model as COBOL) so that addition and retrieval happen in constant-sized chunks, which is very, very efficient (the ZMOV instruction or equivalent) rather than needing to calculate offsets.
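
As an illustration of the offset arithmetic (a Python sketch, not DataStage code; the 2-byte length prefix in the variable-length case is an assumption): with fixed widths every field sits at a constant offset, whereas length-prefixed fields force a walk over the record to locate field k.

import struct

FIXED_WIDTHS = [10, 200, 4]  # e.g. Char(10), a VarChar(200) stored at full width, Int32

def fixed_field(row, k):
    offset = sum(FIXED_WIDTHS[:k])             # constant per schema, no per-row work
    return row[offset:offset + FIXED_WIDTHS[k]]

def variable_field(row, k):
    pos = 0
    for _ in range(k):                         # must scan earlier fields every time
        (length,) = struct.unpack_from("<H", row, pos)
        pos += 2 + length
    (length,) = struct.unpack_from("<H", row, pos)
    return row[pos + 2:pos + 2 + length]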

Posted: Tue Jun 29, 2010 2:33 am
by ArndW
I realize that this was a deliberate design decision, but it is counterintuitive that a VarChar() column is internally represented as a Char() field in a Data Set, and thus I maintain that it is an anomaly. I gather that this behaviour was different in releases prior to 7.

Posted: Tue Jun 29, 2010 2:34 am
by ray.wurlod
ArndW wrote:I gather that this behaviour was different at releases prior to 7.
Yep. Heaps slower, for one thing.

Posted: Tue Jun 29, 2010 2:05 pm
by dougcl
Hi folks, the answer to this question seems to be that DataStage will require the full upper bound for each varchar if the upper bound (the varchar size) is specified in the column definitions.

If you remove the varchar sizes from the metadata (i.e. leave them unbounded), then DataStage consumes only what the actual data requires.
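
A back-of-the-envelope sketch of the difference (the 4-byte length prefix for the unbounded case is an assumption, and the numbers are illustrative, not measured):

def dataset_bytes(rows, bounds, avg_lengths, bounded=True, prefix=4):
    """Rough Data Set storage estimate for varchar columns."""
    if bounded:
        per_row = sum(bounds)                                   # full declared width, every row
    else:
        per_row = sum(length + prefix for length in avg_lengths)
    return rows * per_row

# Example: 10 million rows, three VarChar(500) columns averaging 30 characters.
print(dataset_bytes(10_000_000, [500] * 3, [30] * 3, bounded=True))   # 15,000,000,000 bytes (~14 GB)
print(dataset_bytes(10_000_000, [500] * 3, [30] * 3, bounded=False))  # 1,020,000,000 bytes (~1 GB)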

Thanks,
Doug

Posted: Tue Jun 29, 2010 3:06 pm
by chulett
That behaviour is specific to datasets, AFAIK.

Posted: Tue Jun 29, 2010 5:55 pm
by dougcl
chulett wrote:That behaviour is specific to datasets, AFAIK. ...

Based on testing, and watching df -h in the scratch area while the job runs, it appears that this applies to the scratch area too. I think it follows, then, that this also applies to memory.
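
The same check can be scripted rather than eyeballing df -h; a minimal sketch (the /scratch path below is a placeholder - use whatever your APT configuration file lists as scratchdisk):

import shutil
import time

SCRATCH = "/scratch"  # hypothetical scratchdisk path

def watch_scratch(seconds=60, interval=5):
    start = shutil.disk_usage(SCRATCH).used
    for _ in range(seconds // interval):
        time.sleep(interval)
        used = shutil.disk_usage(SCRATCH).used
        print(f"scratch grew by {(used - start) / 1024**2:.1f} MiB")

watch_scratch()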

Thanks,
Doug

Posted: Wed Jun 30, 2010 6:08 am
by chulett
Fair enough. Actually, thinking back on this and Ray's comment regarding the change, I think the change in behaviour is specific to datasets, to make them act like their cousins, Scratch and Memory, seeing as how in earlier releases it all did not match up.

Posted: Mon Oct 25, 2010 12:46 pm
by dougcl
Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.

Meanwhile I am trying to get away from post-import table def edits. Can someone remind me what environment variable addresses the way varchars are handled? I know there is one, but I can't remember its name. I've been searching all morning.

Thanks,
Doug

Posted: Mon Oct 25, 2010 12:48 pm
by dougcl
Ah! Found it cached in my browser:

APT_OLD_BOUNDED_LENGTH

Posted: Tue Oct 26, 2010 2:50 pm
by mhester
dougcl wrote:Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.
I would not go so far as to say this. That is certainly not IBM's take on character string handling. The framework of DataStage moves fixed-width data regardless of whether you set up the environment to trim varchar fields in a dataset. With varchars that are unbounded, the framework will move the actual length of each column, which is what produces the overhead.

In 8.1 there is a new environment variable which controls this, and in 8.5 the behaviour is the default. In 8.1, in order to use it, you will need to add a copy prior to the write of a dataset and the read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).

IBM's guidance has always been to have bounded character columns unless that column is generally longer than roughly 100 bytes.

Job failures can and do happen with unbounded character fields under certain load conditions.

The dataset stuff does not affect the manner in which data are moved between operators within a flow.

Posted: Fri Nov 19, 2010 3:54 pm
by dougcl
mhester wrote:In 8.1 there is a new environment variable which controls this, and in 8.5 the behaviour is the default. In 8.1, in order to use it, you will need to add a copy prior to the write of a dataset and the read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).
Great post, but I have no idea what this means. It sounds good, whatever it is. Basically, either we use APT_OLD_BOUNDED_LENGTH or our scratch and dataset storage requirements are astronomical, ridiculous, and prohibitive.
mhester wrote:IBM's guidance has always been to have bounded character columns unless that column is generally longer than roughly 100 bytes.
We are unable to make any changes to Table Defs because changes break the shared table status with Metadata Workbench, and re-import using the Import Connector Wizard blows away the changes in the Table Def. So even if I wanted to follow this advice on a column-by-column basis, I can't, because it is impractical.
mhester wrote:Job failures can and do happen with unbounded character fields under certain load conditions.
All of our jobs fail 100% of the time if DataStage uses disk space unconditionally based upon the varchar lengths in the metadata.
mhester wrote:The dataset stuff does not affect the manner in which data are moved between operators within a flow.
Perhaps I am mistaken in assuming the varchar limits in the metadata impact memory consumption. I am basing this on the footprint in the scratch area. It is plausible that I have inferred incorrectly.