VarChar Memory Usage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

VarChar Memory Usage

Post by dougcl »

Hi folks, I've seen elsewhere that
Varchar(N) is always exactly N+4 bytes when in a dataset.
Varchar/Char(N) with NLS will always be 2N bytes (+4 for Varchar).
But I am wondering what the memory and scratch footprint is. Is it the same as above, or is it variable?

At the time of this post, a search on "varchar memory" turns up exactly one hit: this post.

Thanks,
Doug
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I would be interested to know what that "elsewhere" is. It's not totally correct. In particular, the numeric prefix is not necessarily four bytes; it may be as small as one or two bytes.

Essentially, the "prefix" size in the record schema gives you this value. Storage in memory aims to use only the required amount. For example a VarChar(200) that contains only 16 characters would typically occupy 19 bytes without NLS and 35 bytes with NLS. The extra byte is the C-style string terminator (0x00). In practice the memory demanded will be in "whole word" units - multiples of 4 bytes on a 32-bit operating system or multiples of 8 bytes on a 64-bit operating system. So my 16 characters strings would actually occupy 20 bytes without NLS and 36 bytes with NLS on a 32-bit operating system, or 20 bytes without NLS and 40 bytes with NLS on a 64-bit operating system. It's just far easier (faster) to move things around in memory on whole-word boundaries.
Last edited by ray.wurlod on Mon Jun 28, 2010 9:07 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

ray.wurlod wrote:I would be interested to know what that "elsewhere" is.
Forum on IBM. No big deal.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

There is an anomaly with Data Sets where a Varchar(nnn) column will use nnn bytes of storage per row regardless of the string contents, whereas an unbounded string will allocate just the number of bytes actually in the string.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not an anomaly - a deliberate design decision. Storage in Data Sets is fixed width (the same model as COBOL) so that addition and retrieval happen in constant-sized chunks, which is very efficient (the ZMOV instruction or equivalent) rather than needing to calculate offsets.
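A toy illustration of why fixed-width records are cheap to address (this is just the offset arithmetic, not DataStage code; the column widths are made up):

Code: Select all

# Toy illustration (not DataStage internals): with fixed-width storage the
# byte offset of any row/column is plain arithmetic, so nothing has to be
# scanned and no per-row lengths have to be read.
FIELD_WIDTHS = [10, 200, 4]        # e.g. Char(10), VarChar(200) stored fixed, Int32
RECORD_WIDTH = sum(FIELD_WIDTHS)   # every record is exactly 214 bytes

def fixed_offset(row, col):
    return row * RECORD_WIDTH + sum(FIELD_WIDTHS[:col])

print(fixed_offset(1000, 1))       # 214010 - computed in constant time

# With variable-length storage you would have to walk every preceding row
# and field, reading each length prefix, before you knew where anything was.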
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I realize that this was a deliberate design decision, but it is counterintuitive that a VarChar() column in DS is internally represented as a Char() field in Data Sets, and thus I maintain that it is an anomaly. I gather that this behaviour was different at releases prior to 7.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

ArndW wrote:I gather that this behaviour was different at releases prior to 7.
Yep. Heaps slower, for one thing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Hi folks, the answer to this question seems to be that DataStage will require the full upper bound for each varchar if the upper bound (the varchar size) is specified in the column definitions.

If you remove the varchar sizes from the metadata, then DataStage will only consume what the actual data requires.
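A back-of-the-envelope comparison of what that means for a dataset on disk (the 4-byte per-row prefix in the unbounded case is an assumption for illustration; the point is only that a bounded column always costs its declared maximum):

Code: Select all

# Back-of-the-envelope dataset footprint for a single VarChar column.
# Bounded:   every row stores the declared maximum, regardless of content.
# Unbounded: every row stores roughly its actual length plus a small
#            length prefix (the 4-byte prefix here is an assumption).
def bounded_bytes(rows, declared_len):
    return rows * declared_len

def unbounded_bytes(actual_lengths, prefix=4):
    return sum(n + prefix for n in actual_lengths)

rows = 10_000_000
print(bounded_bytes(rows, 200))        # 2,000,000,000 bytes (~2 GB)
print(unbounded_bytes([16] * rows))    # 200,000,000 bytes (~200 MB)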

Thanks,
Doug
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That behaviour is specific to datasets, AFAIK.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

chulett wrote:That behaviour is specific to datasets, AFAIK. ...

Based on testing and using df -h in the scratch area while the job runs, it appears that this applies to the scratch area too. I think it follows then that this also applies to memory.

Thanks,
Doug
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Fair enough. Actually, thinking back on this and Ray's comment regarding the change, I think the change in behaviour is specific to datasets, to make them act like their cousins, Scratch and Memory, seeing as in earlier releases it all did not match up.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.

Meanwhile I am trying to get away from post-import table def edits. Can someone remind me what environment variable addresses the way varchars are handled? I know there is one, but I can't remember its name. I've been searching all morning.

Thanks,
Doug
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Ah! Found it cached in my browser:

APT_OLD_BOUNDED_LENGTH
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

dougcl wrote:Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.
Thanks,
Doug
I would not go so far as to say this. That is certainly not IBM's take on character string handling. The DataStage framework moves fixed-width data regardless of whether you set up the environment to trim varchar fields in a dataset. With varchars that are unbounded, the framework will move the actual length of each column, which is what produces the overhead.

In 8.1 there is a new environment variable which controls this, and in 8.5 the behaviour is the default. In 8.1, in order to use it, you will need to add a copy prior to the write of a dataset and the read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).

IBM's guidance has always been to use bounded character columns unless the column is generally > ~100 bytes.

Job failures can and do happen with unbounded character fields under certain load conditions.

The dataset stuff does not affect the manner in which data are moved between operators within a flow.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

mhester wrote:in 8.1 there is a new environment variable which controls this and in 8.5 the behaviour is default. In 8.1 in order to use it you will need to add a copy prior to the write of a dataset and read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).
Great post, but I have no idea what this means. It sounds good, whatever it is. Basically, either we use APT_OLD_BOUNDED_LENGTH or our scratch and dataset storage requirements are astronomical, ridiculous, and prohibitive.
mhester wrote:IBM's guidance has always been to have bound character columns unless that column is generally > ~ 100 bytes.
We are unable to make any changes to Table Defs because changes break the shared-table status with Metadata Workbench, and re-import using the Import Connector Wizard blows away the changes in the Table Def. So even if I wanted to follow this advice on a column-by-column basis, I can't, because it is impractical.
mhester wrote:Job failures can and do happen with unbound character fields in certain load conditions.
All of our jobs fail 100% if DataStage uses disk space unconditionally based upon the varchar lengths in the metadata.
mhester wrote:The dataset stuff does not affect the manner in which data are moved between operators within a flow.
Perhaps I am mistaken in assuming that the varchar limits in the metadata impact memory consumption. I am basing this on the footprint in the scratch area. It is plausible that I have inferred incorrectly.