Join Key column data type decision

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
SBSA_DW2
Premium Member
Premium Member
Posts: 37
Joined: Fri Jul 25, 2008 1:24 am

Join Key column data type decision

Post by SBSA_DW2 »

Hi Guys

I'm trying to decide if we are overusing varchar data types when doing joins/lookups in Datastage. In 99% of situations we cast data to same varchar types and lengths before doing a join, merge, lookup etc.

It may be we're not experienced in using decimals/integers etc, but my real question is 'what is the performance impact of converting and joining using varchar instead of a number datatype (decimal/integer etc)'?

in database joins joining by using varchar as keys is expensive, does the same apply in datastage?

are other developers experiencing similar 'over-use' of varchar?

thanks!
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

One of the basic tenets for performance is never to perform unnecessary conversions. If your source and target type is numeric for a given column then performing 2 conversion to VarChar on the way is inefficient. Likewise, a comparison of two numbers is far more efficient than comparing two strings. Numbers are far more compact than strings and by using them you save space; particularly as data streams between stages are not always just in memory but sometimes also get buffered to disk. Two bytes can store numbers from -32767 to +32768, but storing "-32767" in a string takes 7 or 8 bytes (depending upon what type of string you have). A BigInt representation of -2147483647 with 4 bytes is compact compared to 13 bytes for string "-2147483647" - with one million rows that is a difference in size of 9Mb.
Post Reply