Maximizing resources utilized for sorting

timsmith_s
Participant
Posts: 54
Joined: Sun Nov 13, 2005 9:25 pm

Maximizing resources utilized for sorting

Post by timsmith_s »

I have reviewed several threads regarding scratch space issues; however, I was hoping that someone might summarize the best approach for maximizing the resources used for sorting (tsort).

That is, I am in a situation where I have a very large file to sort. If I use an explicit Sort stage and specify the memory, I can only go up to just under 1 GB - otherwise I get an mmap error. I have also tried setting the TSORT environment variable ($APT_TSORT_STRESS_BLOCKSIZE), but I understand that this is essentially the same thing, except the value is then set globally rather than at the stage level.

In the end, I am just trying to get a checklist of things I can set to get the job to complete - not necessarily fast, just complete - without having to allocate large scratch space filesystems.
felixyong
Participant
Posts: 35
Joined: Tue Jul 22, 2003 7:24 pm
Location: Australia

Re: Maximizing resources utilized for sorting

Post by felixyong »

This will be a global setting within the job if you specify it as part of the Job Parameters:
$APT_TSORT_STRESS_BLOCKSIZE = <size in MB>
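As a minimal sketch (the value, project, and job names below are just examples, not recommendations), you could also export it in the environment the job inherits before launching it from the command line:

  # Allow roughly 512 MB per sort operator before it spills to scratch
  export APT_TSORT_STRESS_BLOCKSIZE=512
  # Project and job names here are hypothetical
  dsjob -run -mode NORMAL MyProject BigSortJob

Setting it as a Job Parameter instead keeps the change scoped to that one job.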

When the memory buffer is filled, sort uses temporary disk space in the following order (a sketch of a matching configuration file follows this list):
1. Scratch disks in the $APT_CONFIG_FILE "sort" named disk pool
2. Scratch disks in the $APT_CONFIG_FILE default disk pool
3. The default directory specified by $TMPDIR
4. The UNIX /tmp directory
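For reference, a single-node configuration file with a dedicated "sort" scratch pool might look like this sketch (the hostname and paths are hypothetical; "sort" is the pool name tsort checks first):

  {
    node "node1"
    {
      fastname "etlhost"
      pools ""
      resource disk "/data/datasets" {pools ""}
      resource scratchdisk "/scratch/sort" {pools "sort"}
      resource scratchdisk "/scratch" {pools ""}
    }
  }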

The other parameters you can play with are the buffer settings, which also fill memory before writing to disk, spilling in the same order as sort (see the sketch after the list):
$APT_BUFFER_MAXIMUM_MEMORY
$APT_BUFFER_FREE_RUN
$APT_BUFFER_DISK_WRITE_INCREMENT
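As an illustrative sketch only (the values shown are the commonly documented defaults, which you would raise from there):

  # Per-link buffer memory before spilling, in bytes (default ~3 MB)
  export APT_BUFFER_MAXIMUM_MEMORY=3145728
  # Fraction of that buffer to fill before the operator pushes back (default 0.5)
  export APT_BUFFER_FREE_RUN=0.5
  # Size of each chunk written to disk when the buffer spills, in bytes (default 1 MB)
  export APT_BUFFER_DISK_WRITE_INCREMENT=1048576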

It depends on what we are trying to achieve: using all the resources is meant to get the best performance, but that is not necessarily always true in practice.

timsmith_s wrote: I was hoping that someone might summarize the best approach for maximizing resources used for sorting (tsort)? ...
Regards
Felix
timsmith_s
Participant
Posts: 54
Joined: Sun Nov 13, 2005 9:25 pm

Post by timsmith_s »

Great feedback - thank you.

I understand about $APT_TSORT_STRESS_BLOCKSIZE. Or rather, I understand it's the memory setting, but is this the memory setting per node? For instance, say I have 4 GB of RAM per node; it doesn't appear that DSEE is burning up the RAM before it starts hitting the scratch partitions during a sort operation. Maybe this is a two-part question that I should defer to another thread.
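For instance (just my back-of-the-envelope reading, assuming the block size applies per sort operator per partition - someone please correct me if that's wrong): on a 4-node configuration, $APT_TSORT_STRESS_BLOCKSIZE=256 would use roughly 4 x 256 MB = 1 GB of RAM across the whole job before spilling, leaving most of the 4 GB per node untouched.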
Prakashs
Participant
Posts: 26
Joined: Mon Jun 06, 2005 5:43 am
Location: Melbourne, Australia

Post by Prakashs »

Adjusting DataStage heap space may allow you to sort larger files.
timsmith_s
Participant
Posts: 54
Joined: Sun Nov 13, 2005 9:25 pm

Post by timsmith_s »

How is the heap space modified? Do you mean the process heap space, say at the UNIX OS level?
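For example, is it something like the following at the shell level (illustrative only)?

  ulimit -a               # show the current per-process limits
  ulimit -d unlimited     # raise the data segment (heap) limit for this shell and its children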