VarChar Memory Usage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

VarChar Memory Usage

Post by dougcl »

Hi folks, I've seen elsewhere that
Varchar(N) is always exactly N+4 bytes when in a dataset.
Varchar/Char(N) with NLS will always be 2N bytes (+4 for Varchar).
But I am wondering what the memory and scratch footprint is. Is it the same as above, or is it variable?

At the time of this post, a search on "varchar memory" turns up exactly one hit: this post.

Thanks,
Doug
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

I would be interested to know what that "elsewhere" is. It's not totally correct. In particular, the numeric prefix is not necessarily four bytes; it may be as small as one or two bytes.

Essentially, the "prefix" size in the record schema gives you this value. Storage in memory aims to use only the required amount. For example a VarChar(200) that contains only 16 characters would typically occupy 19 bytes without NLS and 35 bytes with NLS. The extra byte is the C-style string terminator (0x00). In practice the memory demanded will be in "whole word" units - multiples of 4 bytes on a 32-bit operating system or multiples of 8 bytes on a 64-bit operating system. So my 16 characters strings would actually occupy 20 bytes without NLS and 36 bytes with NLS on a 32-bit operating system, or 20 bytes without NLS and 40 bytes with NLS on a 64-bit operating system. It's just far easier (faster) to move things around in memory on whole-word boundaries.
Last edited by ray.wurlod on Mon Jun 28, 2010 9:07 pm, edited 1 time in total.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

ray.wurlod wrote:I would be interested to know what that "elsewhere" is.
Forum on IBM. No big deal.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

There is an anomaly with Data Sets where a Varchar(nnn) column will use nnn bytes of storage per row regardless of the string contents, whereas an unbounded string will allocate just the number of bytes actually in the string.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Not an anomaly - a deliberate design decision. Storage in Data Sets is fixed width (the same model as COBOL) so that addition and retrieval happen in constant-sized chunks, which is very efficient (the ZMOV instruction or equivalent) rather than needing to calculate offsets.
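A toy illustration of why fixed-width records are cheap to address (this is just the offset arithmetic, not DataStage code; the column widths are made up):

Code: Select all

# Toy illustration (not DataStage internals): with fixed-width storage the
# byte offset of any row/column is plain arithmetic, so nothing has to be
# scanned and no per-row lengths have to be read.
FIELD_WIDTHS = [10, 200, 4]        # e.g. Char(10), VarChar(200) stored fixed, Int32
RECORD_WIDTH = sum(FIELD_WIDTHS)   # every record is exactly 214 bytes

def fixed_offset(row, col):
    return row * RECORD_WIDTH + sum(FIELD_WIDTHS[:col])

print(fixed_offset(1000, 1))       # 214010 - computed in constant time

# With variable-length storage you would have to walk every preceding row
# and field, reading each length prefix, before you knew where anything was.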
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

I realize that this was a deliberate design decision, but it is counterintuitive that a VarChar() column in DS is internally represented as a Char() field in Data Sets, and thus I maintain that it is an anomaly. I gather that this behaviour was different at releases prior to 7.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

ArndW wrote:I gather that this behaviour was different at releases prior to 7.
Yep. Heaps slower, for one thing.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Hi folks, the answer to this question seems to be that DataStage will require the full upper bound for each varchar if the upper bound (the varchar size) is specified in the column definitions.

If you remove the varchar sizes from the metadata, then DataStage will only consume what the actual data requires.
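A back-of-the-envelope comparison of what that means for a dataset on disk (the 4-byte per-row prefix in the unbounded case is an assumption for illustration; the point is only that a bounded column always costs its declared maximum):

Code: Select all

# Back-of-the-envelope dataset footprint for a single VarChar column.
# Bounded:   every row stores the declared maximum, regardless of content.
# Unbounded: every row stores roughly its actual length plus a small
#            length prefix (the 4-byte prefix here is an assumption).
def bounded_bytes(rows, declared_len):
    return rows * declared_len

def unbounded_bytes(actual_lengths, prefix=4):
    return sum(n + prefix for n in actual_lengths)

rows = 10_000_000
print(bounded_bytes(rows, 200))        # 2,000,000,000 bytes (~2 GB)
print(unbounded_bytes([16] * rows))    # 200,000,000 bytes (~200 MB)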

Thanks,
Doug
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That behaviour is specific to datasets, AFAIK.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

chulett wrote:That behaviour is specific to datasets, AFAIK. ...

Based on testing and using df -h in the scratch area while the job runs, it appears that this applies to the scratch area too. I think it follows then that this also applies to memory.

Thanks,
Doug
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Fair enough. Actually, thinking back on this and Ray's comment regarding the change, I think the change in behaviour is specific to datasets, to make them act like their cousins, Scratch and Memory, seeing as in earlier releases it all did not match up.
-craig

"You can never have too many knives" -- Logan Nine Fingers
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.

Meanwhile I am trying to get away from post-import table def edits. Can someone remind me what environment variable addresses the way varchars are handled? I know there is one, but I can't remember its name. I've been searching all morning.

Thanks,
Doug
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

Ah! Found it cached in my browser:

APT_OLD_BOUNDED_LENGTH
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ

Post by mhester »

dougcl wrote:Hi folks, looks like everyone is getting on board with the unbounded varchar idea for reducing memory and disk consumption.
Thanks,
Doug
I would not go so far as to say this. That is certainly not IBM's take on character string handling. The DataStage framework moves fixed-width data regardless of whether you set up the environment to trim varchar fields in a dataset. With varchars that are unbounded, the framework will move the actual length of each column, which is what produces the overhead.

In 8.1 there is a new environment variable which controls this, and in 8.5 the behaviour is the default. In 8.1, in order to use it, you will need to add a copy prior to the write of a dataset and the read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).

IBM's guidance has always been to use bounded character columns unless the column is generally > ~100 bytes.

Job failures can and do happen with unbounded character fields under certain load conditions.

The dataset stuff does not affect the manner in which data are moved between operators within a flow.
dougcl
Premium Member
Posts: 137
Joined: Thu Jun 24, 2010 4:28 pm

Post by dougcl »

mhester wrote:in 8.1 there is a new environment variable which controls this and in 8.5 the behaviour is default. In 8.1 in order to use it you will need to add a copy prior to the write of a dataset and read of a dataset to allow for a modify adapter (so you cannot optimize out the copy).
Great post, but I have no idea what this means. It sounds good, whatever it is. Basically, either we use APT_OLD_BOUNDED_LENGTH or our scratch and dataset storage requirements are astronomical, ridiculous, and prohibitive.
mhester wrote:IBM's guidance has always been to have bound character columns unless that column is generally > ~ 100 bytes.
We are unable to make any changes to Table Defs because changes break the shared-table status with Metadata Workbench, and re-import using the Import Connector Wizard blows away the changes in the Table Def. So even if I wanted to follow this advice on a column-by-column basis, I can't, because it is impractical.
mhester wrote:Job failures can and do happen with unbound character fields in certain load conditions.
All of our jobs fail 100% if DataStage uses disk space unconditionally based upon the varchar lengths in the metadata.
mhester wrote:The dataset stuff does not affect the manner in which data are moved between operators within a flow.
Perhaps I am mistaken in assuming that the varchar limits in the metadata impact memory consumption. I am basing this on the footprint in the scratch area. It is plausible that I have inferred incorrectly.