datastage performance-- document 2

Archive of postings to DataStageUsers@Oliver.com. This forum is intended only as a reference and cannot be posted to.

Moderators: chulett, rschirm

Locked
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

datastage performance-- document 2

Post by admin »

This is a topic for an orphaned message.
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Bibhu,
Here is the text from the second document.
HTH,
Mark


Document 2
-------------------------------------------------
DataStage Reference Lookups

The most important factor in maximizing DataStage reference-lookup performance with large hashed files is to exploit the Windows NT Cache Manager by pre-loading entire files, or major portions of them, into cache. However, if a file is small enough to fit in physical memory, the pre-load check box in the Hashed File Stage can be used to load it into memory, bypassing the Windows NT cache entirely.
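As a rough illustration of the pre-load idea, a small lookup file can be pulled entirely into process memory up front, so every subsequent lookup is an in-memory hit. This is only a sketch in Python; the tab-separated layout and file names are assumptions for illustration, not the real hashed-file format:

```python
import os
import tempfile

def preload_lookup(path):
    """Read an entire lookup file into an in-process dict up front.

    A loose analogue of the Hashed File Stage pre-load option: once the
    whole file is in memory, lookups never touch the OS cache or the
    disk at all. Assumes a simple key<TAB>value layout, one record per
    line (the real hashed-file format is more involved).
    """
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\t")
            table[key] = value
    return table

# Tiny demo file standing in for a hashed reference file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("CUST1\tAcme\nCUST2\tGlobex\n")
    demo = tmp.name

lookup = preload_lookup(demo)
print(lookup["CUST2"])  # Globex
os.remove(demo)
```

The trade-off is the same one the document describes: this only pays off when the whole file fits comfortably in physical memory.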

Windows NT Cache Manager

Unless an application specifies the FILE_FLAG_NO_BUFFERING flag when opening a file, the file-system cache is used whenever the disk is accessed. On reads from the device, the data is first placed into the cache. On writes, the data goes into the cache before going to the disk. If the data has already been pre-loaded into cache, the disk I/O request to load it is eliminated.

Unless told otherwise via explicit flags when a file is opened, the Cache Manager automatically determines which of several methods to use to cache the file.

If a file is accessed sequentially, the file system driver detects this and performs predictive, asynchronous read-ahead operations to pre-load the data into cache before the application actually requires it. This is the most efficient way to read a file.
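The read-ahead idea can be sketched in Python: a background thread keeps a small queue of chunks filled in front of the consumer, much as the Cache Manager's asynchronous read-ahead stays in front of the application. The chunk size and queue depth here are arbitrary illustration values:

```python
import os
import queue
import tempfile
import threading

def prefetching_reader(path, chunk_size=4096, depth=8):
    """Yield file chunks while a background thread reads ahead.

    A toy analogue of asynchronous read-ahead: the reader thread stays
    up to `depth` chunks in front of the consumer, so the consumer
    rarely has to wait on the next read.
    """
    chunks = queue.Queue(maxsize=depth)
    sentinel = object()

    def reader():
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                chunks.put(chunk)
        chunks.put(sentinel)  # signal end-of-file to the consumer

    threading.Thread(target=reader, daemon=True).start()
    while True:
        chunk = chunks.get()
        if chunk is sentinel:
            return
        yield chunk

# Demo: round-trip a small file through the prefetching reader.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10_000)
    demo = tmp.name

data = b"".join(prefetching_reader(demo, chunk_size=1024))
print(len(data))  # 10000
os.remove(demo)
```

The bounded queue is the design point: it lets the reader run ahead without loading the entire file into memory at once.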

If a file is accessed randomly, as in the majority of DataStage reference lookups, no read-ahead operations are done. Performance degrades, since the data must be read from disk, loaded into cache, and mapped into the application's workspace before it can be used.

Knowing how the Cache Manager works, we can assume that if we could pre-load the hashed reference file into cache and keep it there, DataStage reference-lookup performance would be maximized, since lookup results would come from cache rather than from disk. However, testing made the following evident:

If a process reads the entire contents of a file sequentially, it in effect loads the file into cache. This assumes that enough free memory is available; otherwise only parts of the file are cached.

When the process terminates, all cache references to the file are removed and the memory allocated by the Cache Manager for the file is returned. However, if another process references the same file, the file remains cached until that process's references are removed.

Pre-Loading a Hashed File Into Cache

To pre-load a file into cache, we must ensure that the same process used by the DataStage Transformer also pre-loads the file. This can be achieved with the following procedure:

* Use the ExecTCL Before Stage routine and enter the following:

COUNT FileName

FileName should be the name of the hashed file used for reference lookups. Remember to specify the correct case.

If more than one file needs to be referenced, a paragraph must be created at the UniVerse TCL level using the editor, and the name of the paragraph entered as the ExecTCL Before Stage routine. This can be accomplished by opening a Telnet session to the DataStage Server. From the > prompt, enter the following:

ED VOC ParagraphName (Substitute a descriptive name for ParagraphName)

The editor will output status information indicating that this is a New Record. If it does not, type Q to exit the editor and choose a different name for your paragraph.

Type I to enter input mode.
Type PA to specify that this entry is a paragraph.
Type COUNT FileName1 (substitute the actual file name for FileName1).
Type COUNT FileName2 (substitute the actual file name for FileName2).
Enter as many files as are required, then press the Enter key on a blank line to return to command mode, and type FILE to save the paragraph.

* COUNT reads each page of the file sequentially to return a total row count. The side effect is to pre-load the file into cache.

Since the Cache Manager is accessing the file sequentially, it uses the sequential access algorithm and reads ahead using a separate thread to maximize throughput.

Since the COUNT verb executes as part of the DataStage Transformer process, the memory allocated to cache the file is not freed, so when the Before Stage routine completes and the Transformer begins processing, most of the reference-lookup results will be served from cache.

* When the DataStage Transformer process completes, all memory used to cache disk data for the process is returned to Windows NT, unless another process holds a reference to the cached file.
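What COUNT does for the cache can be imitated by any full sequential pass over the file. Here is a hedged Python sketch; the page size is an illustration value, and unlike the real COUNT verb (which counts hashed-file records) this simply touches every byte:

```python
import os
import tempfile

def warm_and_count(path, page_size=4096):
    """Sequentially touch every page of a file, returning the page count.

    The return value is incidental; the useful side effect, as with
    COUNT, is that a full sequential pass pulls the file into the OS
    file cache (given enough free memory), so later random lookups by
    the same process are served from cache rather than from disk.
    """
    pages = 0
    with open(path, "rb") as f:
        while f.read(page_size):
            pages += 1
    return pages

# Demo file standing in for a hashed reference file: 10 full pages.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"r" * 10 * 4096)
    demo = tmp.name

print(warm_and_count(demo))  # 10
os.remove(demo)
```

As the bullet above notes, the warming only helps while the process (or some other process referencing the file) is still alive; the cached pages are reclaimed once the last reference goes away.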

Determining Process Boundaries for DataStage Jobs

To take advantage of the previous methods, it is essential to know when and how DataStage breaks a job into multiple processes for serial or parallel execution. The following rule applies:

* A passive stage used as a stream input, linked to an active stage and then to one or more passive stages, runs as an individual process. If multiple active stages are linked between the passive stages, as in the following example, the chain still runs as a single process.


In the following example DataStage breaks the job into four processes:


Process 1 consists of PassiveStream, Active1, PassiveStage2, and PassiveStage3.
Process 2 consists of PassiveStage3, Active2, and PassiveStage4.
Process 3 consists of PassiveStage3, Active3, and PassiveStage5.
Process 4 consists of PassiveStage4, Active4, and PassiveStage6.

The order of execution is: Process 1 runs to completion, then Processes 2-4 execute in parallel. PassiveStage3 (or any passive stage, for that matter) serves as a synchronization stage, in that Processes 2-4 wait until Process 1 completes before starting. It also defines process boundaries, since PassiveStage3 is used as a stream input to Active2-Active4, allowing us to take advantage of multiple processors.
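That execution order can be modeled in a few lines of Python: a stand-in for Process 1 runs to completion first, and only then do stand-ins for Processes 2-4 run concurrently. The process names and the worker function are illustrative only, not anything DataStage exposes:

```python
from concurrent.futures import ThreadPoolExecutor

started = []  # records the order in which the processes begin

def run(name):
    started.append(name)  # list.append is thread-safe in CPython
    return name

# PassiveStage3 is the synchronization point: Process 1 must run to
# completion before anything downstream of it may start.
run("Process 1")

# Processes 2-4 read from completed passive stages and are independent
# of one another, so they may run in parallel on separate processors.
with ThreadPoolExecutor(max_workers=3) as pool:
    done = list(pool.map(run, ["Process 2", "Process 3", "Process 4"]))

print(started[0])  # Process 1
```

The point of the model is the barrier: Process 1 is guaranteed to appear first in `started`, while Processes 2-4 may begin in any order relative to each other.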

By using passive files to define new process boundaries, we can exploit the Windows NT Cache Manager by pre-loading files into cache, knowing that the memory used for cache will be released once the process referencing it terminates. This rule also applies to files pre-loaded into memory via the pre-load check box in the Hashed File Stage, allowing a finer level of control over available memory resources.

-----Original Message-----
From: mark.huffman@baesystems.com [mailto:mark.huffman@baesystems.com]
Sent: Friday, August 10, 2001 12:24 PM
To: datastage-users@oliver.com
Subject: RE: datastage performance



Bibhu,
I asked the same question of DS tech support a few months ago, and they sent me these two articles in response. I have attached them as zipped Word 2000 documents.

HTH,
Mark


-----Original Message-----
From: Bibhu C [mailto:bibhuds@yahoo.com]
Sent: Friday, August 10, 2001 11:55 AM
To: datastage-users@oliver.com
Subject: datastage performance


Hi All,

Even though I am still in the design phase of my ETL jobs, I am curious about ways to improve the performance of DataStage jobs. I understand that the use of hash files is one of those ways, but the manuals are really tight-lipped about them and about any other paths one might tread to improve load performance.

I am looking to this knowledgeable group to tell me about the various performance-enhancing strategies in DataStage, and to tell me if I have missed something in the manuals.

Thanks
Bibhu

