Multiple targets and restart

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

chaitanyavm
Participant
Posts: 8
Joined: Fri Apr 09, 2004 7:06 pm

Multiple targets and restart

Post by chaitanyavm »

Hi

This is my first post.

We have interfaces with multiple targets such as MQ and databases (Oracle, DB2, etc.). Our interfaces (DataStage load jobs) read multiple files, one per target, and write to MQ, a database or a file. In several cases there must be no duplicate processing for any of the targets. So if we need to re-run an interface, what is the usual way of keeping track of the records that were processed in the previous run (for both parallel and server jobs)?

Also, how do we set the commit frequency on databases? The MQ stage commits as soon as the data is written.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D
chaitanyavm wrote: The MQ stage commits as soon as the data is written.
This is expected behaviour for a message queue or named pipe.

If you need to control when and where everything happens, consider a staged approach: prepare all the data through the ETL process but stage it to disk rather than to the final target, and only then transmit the results to the final destination. Read Chapter 16 of The Data Warehouse Lifecycle Toolkit by Ralph Kimball et al. for more information on staging data.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Regarding your question about commit frequency on databases: database stages have fields for controlling the transaction size. Open your database stage and bring up the help screen; it has a short description of each of these fields. If you set the transaction size to 1 you will get a commit on every row, but your performance will suffer.
GIDs
Participant
Posts: 16
Joined: Sun May 23, 2004 2:39 pm
Location: San Francisco

Re: Multiple targets and restart

Post by GIDs »

chaitanyavm wrote: So if we need to re-run an interface, what is the usual way of keeping track of the records that were processed in the previous run (for both parallel and server jobs)?

Also, how do we set the commit frequency on databases? The MQ stage commits as soon as the data is written.
For commits on Oracle (DB2 has a similar variable), you can use the environment variables
APT_ORA_UPSERT_ROW_INTERVAL
APT_ORA_UPSERT_TIME_INTERVAL
Orchestrate commits when the first of these two thresholds is reached (i.e. at the minimum of the two).

There are a couple of ways to handle the restart:
1. The best way is to reduce the granularity of your jobs (as suggested by Ray).
2. For server jobs, keep track of the last commit via hashed files or a similar mechanism (see the sketch after this list).
3. For parallel jobs it is quite difficult because of the parallel / pipelined data streams. If your volumes are low, record the key for every row (which means you should sort on the key for a quicker restart), or record a unique ID (whichever goes into the database). This can be done within the Oracle stage (write the unique ID into some sort of audit table), or land the data and use manual sqlldr (in the case of Oracle) to load it, which logs the number of records that made it into the database.
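To illustrate option 2, here is a minimal sketch of a server routine that reads the last committed row count for a job from a hashed file. The file name RESTART_TRACKER, the routine name and the single JobName argument are just assumptions for the example, not anything DataStage creates for you:

   * GetLastCommittedRow (sketch) - argument: JobName
   * Returns the last committed row count recorded for JobName, or 0 if none yet.
      LastRow = 0
      Open "RESTART_TRACKER" To TrackerFile Then
         Read Rec From TrackerFile, JobName Then
            LastRow = Rec<1>   ;* field 1 holds the last committed row count
         End Else
            LastRow = 0        ;* no record yet: treat as first run
         End
         Close TrackerFile
      End Else
         Call DSLogWarn("Cannot open hashed file RESTART_TRACKER", "GetLastCommittedRow")
      End
      Ans = LastRow

After each successful commit the load job (or an after-job routine) would write the new count back with something like Write NewCount On TrackerFile, JobName, so the next restart knows where to pick up.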
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Do the APT_... environment variables work with server jobs too?!! :o

Because chaitanyavm is a newcomer, let me explain that "Orchestrate" was the name of the parallel engine before Ascential acquired it (and Torrent Systems) and renamed it "parallel extender". I believe, therefore, that the APT_... environment variables apply only to parallel jobs; chaitanyavm specified server jobs.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
GIDs
Participant
Posts: 16
Joined: Sun May 23, 2004 2:39 pm
Location: San Francisco

Post by GIDs »

You are right, Ray; I should have specified that the APT variables work only on parallel jobs (though I thought chaitanyavm was referring to both types of job). vmcburney has already covered the commit frequency for server jobs (transaction size), so to summarize...

For server jobs,
Use Transaction size

For Parallel jobs,
Use APT variables.
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

My 2 cents on restarting: include a unique, ever-increasing number in your source and your target. You probably have control over the target, so add an extra number column to the table(s).

Take the highest number (say A) already in the target and write it somewhere in your staging area, then load your data into the staging area with a number that starts incrementing from A+1.
If the load into the staging area fails, run the complete cycle again, taking the highest number in the target again. If it succeeds, you have landed your data in the staging area.

Then take the highest number (say A) in the target again and write it to your staging area. Now move your data from the staging area to the target, but only the rows with a number higher than A.
If the job fails halfway, rerun this last part and you will still process each row exactly once: if the job got partway through before it aborted, the value of A in the target will have moved up accordingly. A sketch of that filter as a Transformer constraint follows.
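As a minimal sketch of that filter (assuming a link column named SEQ_NO for the extra number and a job parameter HighWatermark loaded with A before the run; both names are made up for the example), the constraint on the Transformer output link feeding the target could simply be:

   InLink.SEQ_NO > HighWatermark

Only rows numbered above the highest value already committed to the target are then passed through on a rerun.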

Several other variations are of course possible.

Ogmios
chaitanyavm
Participant
Posts: 8
Joined: Fri Apr 09, 2004 7:06 pm

Post by chaitanyavm »

Thanks for all the responses. We do have granular jobs: an extract job, a transform job and a load job. I was only referring to the load jobs and restarting them in case of aborts. We were also trying to use @OUTROWNUM, maintained in a hashed file, to keep track of which rows were sent to the target. Should we use @INROWNUM or @OUTROWNUM for tracking the rows sent to the target?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'd suggest using @INROWNUM. Assuming your 'load' jobs work from landed flat files, this gives you a count of records to skip when restarting the job, regardless of how many rows this turns out to be in the target table (due to constraints, etc).

In other words, don't track 'rows sent to target', track staging rows processed into the job.
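As a minimal sketch (SkipCount here is an assumed job parameter holding the @INROWNUM value saved to the hashed file by the previous run), the constraint on the load job's Transformer output link could be:

   @INROWNUM > SkipCount

On a restart the job then reads the landed flat file from the top and simply discards the rows that were already processed.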

My .02 anywho. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers