Multiple targets and restart

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

chaitanyavm
Participant
Posts: 8
Joined: Fri Apr 09, 2004 7:06 pm

Multiple targets and restart

Post by chaitanyavm »

Hi

This is my first post.

We have interfaces with multiple targets such as MQ and databases (Oracle, DB2, etc.). Our interfaces (DataStage load jobs) read multiple files, one per target, and write to MQ, a database or a file. In several cases there must be no duplicate processing for any of the targets. So if we need to re-run an interface, what is the usual way of keeping track of the records that were processed in the previous run (for both parallel and server jobs)?

Also, how do we set the commit frequency on databases? The MQ stage commits as soon as the data is written.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D
chaitanyavm wrote: The MQ stage commits as soon as the data is written.
This is expected behaviour for a message queue or named pipe.

If you need to control when and where everything happens, consider a staged approach: prepare all the data through the ETL process but stage it to disk rather than to the final target, and only then transmit the results to the final destination. Read Chapter 16 of The Data Warehouse Lifecycle Toolkit by Ralph Kimball et al. for more information on staging data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Regarding your question about commit frequency on databases: database stages have fields for controlling the transaction size. Open your database stage and bring up the help screen; it has a short description of each of these fields. If you set the transaction size to 1 you will get a commit on every row, but your performance will suffer.
GIDs
Participant
Posts: 16
Joined: Sun May 23, 2004 2:39 pm
Location: San Francisco

Re: Multiple targets and restart

Post by GIDs »

chaitanyavm wrote: So if we need to re-run an interface, what is the usual way of keeping track of the records that were processed in the previous run (for both parallel and server jobs)?

Also, how do we set the commit frequency on databases? The MQ stage commits as soon as the data is written.
For commits on Oracle (DB2 has a similar variable), you can use the environment variables
APT_ORA_UPSERT_ROW_INTERVAL
APT_ORA_UPSERT_TIME_INTERVAL
Orchestrate commits when the first of these two thresholds is reached (i.e. at the minimum of the two).

There are a couple of ways to handle the restart:
1. The best way is to reduce the granularity of your jobs (as suggested by Ray).
2. For server jobs, keep track of the last commit via hashed files or a similar mechanism (see the sketch after this list).
3. For parallel jobs it is quite difficult because of the parallel / pipelined data streams. If your volumes are low, record the key for every row (which means you should sort on the key for a quicker restart), or record a unique ID (whichever goes into the database). This can be done within the Oracle stage (write the unique ID into some sort of audit table), or land the data and use manual sqlldr (in the case of Oracle) to load it, which logs the number of records that made it into the database.
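To illustrate option 2, here is a minimal sketch of a server routine that reads the last committed row count for a job from a hashed file. The file name RESTART_TRACKER, the routine name and the single JobName argument are just assumptions for the example, not anything DataStage creates for you:

   * GetLastCommittedRow (sketch) - argument: JobName
   * Returns the last committed row count recorded for JobName, or 0 if none yet.
      LastRow = 0
      Open "RESTART_TRACKER" To TrackerFile Then
         Read Rec From TrackerFile, JobName Then
            LastRow = Rec<1>   ;* field 1 holds the last committed row count
         End Else
            LastRow = 0        ;* no record yet: treat as first run
         End
         Close TrackerFile
      End Else
         Call DSLogWarn("Cannot open hashed file RESTART_TRACKER", "GetLastCommittedRow")
      End
      Ans = LastRow

After each successful commit the load job (or an after-job routine) would write the new count back with something like Write NewCount On TrackerFile, JobName, so the next restart knows where to pick up.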
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Do the APT_... environment variables work with server jobs too?!! :o

Because chaitanyavm is a newcomer, let me explain that "Orchestrate" was the name of the parallel engine before Ascential acquired it (and Torrent Systems) and renamed it "parallel extender". I believe, therefore, that the APT_... environment variables apply only to parallel jobs; chaitanyavm specified server jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
GIDs
Participant
Posts: 16
Joined: Sun May 23, 2004 2:39 pm
Location: San Francisco

Post by GIDs »

You are right, Ray; I should have specified that the APT variables work only on parallel jobs (though I thought chaitanyavm was referring to both types of job). vmcburney has already covered the commit frequency for server jobs (transaction size), so to summarize...

For server jobs,
Use Transaction size

For Parallel jobs,
Use APT variables.
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

My 2 cents on restarting: include a unique, ever-increasing number in your source and your target. You probably have control over the target, so add an extra number column to the table(s).

Take the highest number (say A) already in the target and write it somewhere in your staging area, then load your data into the staging area with a number that starts incrementing from A+1.
If the load into the staging area fails, run the complete cycle again, taking the highest number in the target again. If it succeeds, you have landed your data in the staging area.

Then take the highest number (say A) in the target again and write it to your staging area. Now move your data from the staging area to the target, but only the rows with a number higher than A.
If the job fails halfway, rerun this last part and you will still process each row exactly once: if the job got partway through before it aborted, the value of A in the target will have moved up accordingly. A sketch of that filter as a Transformer constraint follows.
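As a minimal sketch of that filter (assuming a link column named SEQ_NO for the extra number and a job parameter HighWatermark loaded with A before the run; both names are made up for the example), the constraint on the Transformer output link feeding the target could simply be:

   InLink.SEQ_NO > HighWatermark

Only rows numbered above the highest value already committed to the target are then passed through on a rerun.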

Several other variations are of course possible.

Ogmios
chaitanyavm
Participant
Posts: 8
Joined: Fri Apr 09, 2004 7:06 pm

Post by chaitanyavm »

Thanks for all the responses. We do have granular jobs: an extract job, a transform job and a load job. I was only referring to the load jobs and restarting them in case of aborts. We were also trying to use @OUTROWNUM, maintained in a hashed file, to keep track of which rows were sent to the target. Should we use @INROWNUM or @OUTROWNUM for tracking the rows sent to the target?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd suggest using @INROWNUM. Assuming your 'load' jobs work from landed flat files, this gives you a count of records to skip when restarting the job, regardless of how many rows this turns out to be in the target table (due to constraints, etc).

In other words, don't track 'rows sent to target', track staging rows processed into the job.
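As a minimal sketch (SkipCount here is an assumed job parameter holding the @INROWNUM value saved to the hashed file by the previous run), the constraint on the load job's Transformer output link could be:

   @INROWNUM > SkipCount

On a restart the job then reads the landed flat file from the top and simply discards the rows that were already processed.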

My .02 anywho. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers