join & lookup

dsa · Post by **dsa** » Fri Oct 15, 2010 4:35 am

Hi,

Lookup uses scratch memory, while join uses disk (physical) memory for the sorting it performs.

Is that a correct statement?

ArndW · Post by **ArndW** » Fri Oct 15, 2010 4:56 am

No, that statement does not reflect what happens. Both methods use memory, but the lookup keeps the reference data in memory, while the join stage sorts the streams on the join key(s) and then needs only minimal memory at runtime.

dsa · Post by **dsa** » Fri Oct 15, 2010 5:00 am

Sorry, what I meant was that the lookup keeps reference data in scratch.

Is that right now?

ArndW · Post by **ArndW** » Fri Oct 15, 2010 5:06 am

Lookup keeps reference data in memory, not on disk.

dsa · Post by **dsa** » Fri Oct 15, 2010 5:08 am

My understanding is:
Scratch is temporary memory, and when we say resource disk it means permanent memory or disk.

Please correct me if I am wrong.

ArndW · Post by **ArndW** » Fri Oct 15, 2010 5:57 am

"Scratch" is temporary disk space, which is different from "temporary memory" but otherwise the definition is not wrong.

dsa · Post by **dsa** » Fri Oct 15, 2010 6:20 am

Oh, so join uses permanent memory, which is also not resource disk?

ArndW · Post by **ArndW** » Fri Oct 15, 2010 6:54 am

No, I never said that. Join stages work by sorting the input links (which may or may not require scratch storage or buffer storage) and then doing an efficient comparison of records from the links. Because the data is sorted, it is not necessary to use much memory, unlike the lookup stage which requires that the complete reference data is in memory.
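
The contrast can be sketched in plain Python. This is only an illustration of the two general techniques (an in-memory hash lookup versus a merge over sorted inputs), not how the stages are actually implemented. The function names `lookup_join` and `merge_join` are made up for the example, and it assumes an inner join with unique keys on the reference side.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[int, str]          # (key, value); a hypothetical record layout
Joined = Tuple[int, str, str]  # (key, stream value, reference value)


def lookup_join(stream: Iterable[Row], reference: Iterable[Row]) -> Iterator[Joined]:
    """Lookup style: the complete reference data is loaded into memory first."""
    table: Dict[int, str] = {key: value for key, value in reference}
    for key, value in stream:
        if key in table:
            yield key, value, table[key]


def merge_join(left_sorted: Iterable[Row], right_sorted: Iterable[Row]) -> Iterator[Joined]:
    """Join style: both inputs are already sorted on the key, so only the
    current record from each side has to be held in memory.
    Assumes keys on the right (reference) side are unique."""
    right_iter = iter(right_sorted)
    current = next(right_iter, None)
    for key, value in left_sorted:
        # Advance the right side until it catches up with the left key.
        while current is not None and current[0] < key:
            current = next(right_iter, None)
        if current is not None and current[0] == key:
            yield key, value, current[1]


stream = [(3, "c"), (1, "a"), (2, "b"), (4, "d")]
reference = [(2, "two"), (1, "one"), (5, "five")]

lookup_result = list(lookup_join(stream, reference))
merge_result = list(merge_join(sorted(stream), sorted(reference)))
```

The memory cost of `lookup_join` grows with the size of the reference data, while `merge_join` pays for the sort up front and then walks both inputs once.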

dsa · Post by **dsa** » Fri Oct 15, 2010 11:09 am

Thanks for clearing up my doubts!

ray.wurlod · Post by **ray.wurlod** » Fri Oct 15, 2010 3:45 pm

The reference data set for a Lookup stage must be able to reside in physical memory (other than for a sparse lookup).

Any other stage that uses memory, such as the Sort, Aggregator, and Join stage types, will use the amount of memory allocated to it. Only if it needs more memory than that will it spill to scratchdisk.
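
As a rough picture of "use the allocated memory, spill only when it is exceeded", here is a minimal external-sort sketch in Python. It is not the engine's algorithm. `sort_with_spill`, `max_in_memory` and `scratch_dir` are hypothetical names, and the memory limit is counted in rows rather than bytes to keep the example short.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def sort_with_spill(values: Iterable[int], max_in_memory: int, scratch_dir: str) -> Iterator[int]:
    """Sort integers, writing sorted runs to scratch_dir only when the
    in-memory buffer would otherwise exceed max_in_memory rows."""
    run_paths: List[str] = []
    buffer: List[int] = []
    for value in values:
        if len(buffer) == max_in_memory:
            # Buffer is full and more data is arriving: spill a sorted run.
            buffer.sort()
            fd, path = tempfile.mkstemp(dir=scratch_dir, suffix=".run")
            with os.fdopen(fd, "w") as handle:
                handle.writelines(f"{item}\n" for item in buffer)
            run_paths.append(path)
            buffer = []
        buffer.append(value)

    buffer.sort()
    if not run_paths:
        # Everything fitted in memory, so scratch was never touched.
        yield from buffer
        return

    files = [open(path) for path in run_paths]
    try:
        runs = [(int(line) for line in handle) for handle in files]
        # Merge the spilled runs with whatever is still in memory.
        yield from heapq.merge(buffer, *runs)
    finally:
        for handle in files:
            handle.close()
        for path in run_paths:
            os.remove(path)


with tempfile.TemporaryDirectory() as scratch:
    spilled = list(sort_with_spill([5, 3, 9, 1, 7, 2], max_in_memory=2, scratch_dir=scratch))
    in_memory = list(sort_with_spill([5, 3, 9], max_in_memory=10, scratch_dir=scratch))
    leftover_files = os.listdir(scratch)
```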

Disk pools may get involved. For example the Sort stage will first spill to scratchdisk resources identified as being in the "sort" disk pool. If these fill, or if the disk pool does not exist, it will use the default disk pool (""). If this fills it will use the directory identified by the TMPDIR environment variable. If this fills it will use /tmp. If this fills you're dead.
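
The fallback order for the Sort stage can be written down as a short Python sketch. The data structures here are assumptions made for illustration: `disk_pools` is a hypothetical mapping from pool name to scratch directories, and `has_space` is a caller-supplied check, since how "full" is detected is not stated in this thread.

```python
from typing import Callable, List, Mapping


def scratch_candidates(disk_pools: Mapping[str, List[str]], env: Mapping[str, str]) -> List[str]:
    """Return scratch locations in the order described above:
    "sort" pool, default pool (""), TMPDIR, then /tmp."""
    candidates: List[str] = []
    candidates.extend(disk_pools.get("sort", []))  # skipped if the pool does not exist
    candidates.extend(disk_pools.get("", []))      # default disk pool
    tmpdir = env.get("TMPDIR")
    if tmpdir:
        candidates.append(tmpdir)
    candidates.append("/tmp")
    return candidates


def pick_scratch(candidates: List[str], has_space: Callable[[str], bool]) -> str:
    """Return the first candidate that still has room, or fail if none does."""
    for path in candidates:
        if has_space(path):
            return path
    raise OSError("all scratch locations are full")


# Hypothetical configuration, for illustration only.
pools = {"sort": ["/scratch/sort"], "": ["/scratch/a"]}
order = scratch_candidates(pools, {"TMPDIR": "/var/tmp"})
chosen = pick_scratch(order, has_space=lambda path: path == "/var/tmp")
```

In a real check, `has_space` could be built on something like `shutil.disk_usage(path).free`, compared against whatever threshold makes sense for the job.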