Need to create a job that does logical deletes (update)

Su
Participant
Posts: 13
Joined: Thu Sep 26, 2002 7:16 am

Need to create a job that does logical deletes (update)

Post by Su »

DataStage Version = 6 with PX
O/S - AIX 5

Hi,
We have a requirement where we basically need to do a logical delete of records from a target table (say table TGT) where records don't exist in a Source table (Table SRC).

We already have a Server Job that creates a hash file of the SRC records, uses it as a lookup in a Transformer stage, and does the logical delete (i.e. update table TGT and set delete_flag = 'Y').

This job works fine, but for large tables the hash file creation takes a lot of time.

Wondering if there is an efficient way to do this with PX?
We are new to PX and have been experimenting with Change Capture and Change Apply, but the jobs don't seem to run.

Thanks,
Surekha
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Understanding the problem

Post by 1stpoint »

Su wrote: Hi,
We have a requirement where we basically need to do a logical delete of records from a target table (say table TGT) where records don't exist in a Source table (Table SRC).

We already have a Server Job that creates a hash file of the SRC records, uses it as a lookup in a Transformer stage, and does the logical delete (i.e. update table TGT and set delete_flag = 'Y').

This job works fine, but for large tables the hash file creation takes a lot of time.

So you basically want to perform a set-based update:

Update TGT Set delete_flag='Y'
where not exists(select 1 from SRC where....)

Question: Why do you need a hash file?

You can do this by changing the update to user-defined SQL and inserting the not exists condition (assuming that SRC is indexed by the joined columns).
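
For illustration, here is a minimal sketch of what that user-defined SQL could look like, assuming both tables are visible to the same database and join on a single hypothetical key column KEY_COL (substitute the real join columns):

Code: Select all

-- set-based logical delete; KEY_COL is a placeholder for the real join key(s)
UPDATE TGT
   SET delete_flag = 'Y'
 WHERE NOT EXISTS (SELECT 1
                     FROM SRC
                    WHERE SRC.KEY_COL = TGT.KEY_COL);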

You can also multi-thread this job without necessarily using PX, although I believe PX will improve performance as well.
Su
Participant
Posts: 13
Joined: Thu Sep 26, 2002 7:16 am

Post by Su »

Thanks, but the 'not exists' won't be possible because the tables are not in the same database.

I forgot to mention in my initial post that the TGT and SRC tables are on different databases/servers.
Su
Participant
Posts: 13
Joined: Thu Sep 26, 2002 7:16 am

Post by Su »

Also wanted to mention that we do not want to use DB links.

I am not sure if PX lets you use DB links.
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Idea

Post by 1stpoint »

Yes, if the tables are on physically separate DBs then a hash file is the solution.

With PX, you lose the capability of using hash files.

Are you re-creating the entire hash file every time you run?

It seems like your bottleneck is in the hash file rebuild.

Why not skip clearing the hash file and simply update it incrementally (using a UV stage)?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Just load a work table of the keys from the source into your target instance. That should be obscenely easy and fast. Then, using parallel DML you can execute the DELETE FROM TGT WHERE TGT.KEY NOT IN (SELECT KEY FROM YOURWORKTABLE). Avoid spanning instances with dblinks.
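
A rough sketch of that pattern, assuming an Oracle target, a hypothetical work table WORK_SRC_KEYS, a single key column KEY_COL, and an arbitrary parallel degree (all placeholders):

Code: Select all

-- load the source keys into a work table in the target instance (via a simple job),
-- then do the whole delete as one parallel, set-based statement
ALTER SESSION ENABLE PARALLEL DML;

DELETE /*+ PARALLEL(TGT, 4) */ FROM TGT
 WHERE TGT.KEY_COL NOT IN (SELECT KEY_COL FROM WORK_SRC_KEYS);

COMMIT;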
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Su
Participant
Posts: 13
Joined: Thu Sep 26, 2002 7:16 am

Post by Su »

1stpoint and Kenneth,

Thanks for your replies.

I will check if I am rebuilding the hash file every time.

Also, work tables were the first thing I thought of, but creating work tables/temp tables is something our admins frown upon and is only to be done if everything else fails.

I did create a job in PX with just a Change Capture stage going to a target Oracle stage where I do the logical delete (update TGT set delete_flag = 'Y' where key_col = etc.). It seems to work once I set only 'Drop Output for Delete' to False and everything else to True.
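
For reference, a minimal sketch of the user-defined update SQL that target Oracle stage could run; KEY_COL is a placeholder for the real key column, and the ORCHESTRATE.column_name placeholder syntax is an assumption about the PX Oracle stage, so check it against your stage documentation:

Code: Select all

-- logical delete applied for each row coming out of the Change Capture stage;
-- KEY_COL and the ORCHESTRATE placeholder are illustrative, not verified
UPDATE TGT
   SET delete_flag = 'Y'
 WHERE KEY_COL = ORCHESTRATE.KEY_COL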

-Surekha
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

If speed is important, then a parallel-query-based delete from a work table of keys is the fastest method for mass deleting. If you have a single-threaded process handing keys one at a time to the database to delete, you wait while each key is found in the table and deleted.

If the table is partitioned and you don't supply a column that helps the optimizer identify the partition (partition pruning), you will hit every partition and it will be slower. You're going to have to keep issues like this in mind when deleting data. You may have to put more information into the DELETE statement to assist the database.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

huh

Post by 1stpoint »

Ken,
He is UPDATING (logical delete), not deleting, so partitioning would be defeatist here. I do like the work table idea as well. You can also trim down your hash file to just the key columns.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Need to create a job that does logical deletes (update)

Post by Teej »

Su wrote: DataStage Version = 6 with PX
O/S - AIX 5

Hi,
We have a requirement where we basically need to do a logical delete of records from a target table (say table TGT) where records don't exist in a Source table (Table SRC).

We already have a Server Job that creates a hash file of the SRC records, uses it as a lookup in a Transformer stage, and does the logical delete (i.e. update table TGT and set delete_flag = 'Y').

This job works fine, but for large tables the hash file creation takes a lot of time.

Wondering if there is an efficient way to do this with PX?
We are new to PX and have been experimenting with Change Capture and Change Apply, but the jobs don't seem to run.

Thanks,
Surekha
Upgrade to 7.0.1 as soon as possible (mid-December is its current scheduled release). This would help with the performance issues on Lookup stages.

Ideal solution:

Code: Select all

               Source Table
                     |
                     V
Target Table -> lookup -> dead link (copy stage)
                     |
                     V
               (Reject) Delete table
Or you can take it a step further and...

Instead of the dead link and reject link, output the table to itself with the truncate option enabled. This is ideal if you have a LOT of deletes going on, enough that the above solution would take a while.

Naturally, the above solutions are dangerous if there is a failure. You need to ensure that you have a way to recover from a mistake/failure.

The lookup table should have a hash sort, and the input table should be "partitioned" (if Oracle; not sure about other databases). It is always a good thing to analyze the table afterward (Oracle-wise).
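
As an example of that last step, a sketch of one way to refresh statistics after the run, assuming Oracle and placeholder schema/table names:

Code: Select all

-- refresh optimizer statistics on the target after the logical deletes;
-- 'DWH' and 'TGT' are placeholder schema and table names
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'TGT');
END;
/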

Naturally, there are other ways to handle it, but this is the best 'performance-wise', especially if your database is running on a smaller box than your DataStage server.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Re: huh

Post by kcbland »

1stpoint wrote: Ken,
He is UPDATING (logical delete), not deleting, so partitioning would be defeatist here.
Okay, you got me, I wasn't paying total attention. Instead of a delete, it's an update.

Now, can you tell me why a parallel DML query against a partitioned table is NOT the best solution if you are able to supply the partition key and are using locally managed indexes? (Big assumption on my part that this is a data warehouse and is following good DBA practices --> see Tom Kyte's Expert One-on-One book)
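
Spelling that out as a hedged sketch, still assuming Oracle, the placeholder work table of source keys from earlier, and a hypothetical partition key LOAD_DATE:

Code: Select all

-- parallel DML logical delete against the partitioned target;
-- LOAD_DATE, KEY_COL and WORK_SRC_KEYS are placeholders
ALTER SESSION ENABLE PARALLEL DML;

UPDATE /*+ PARALLEL(TGT, 4) */ TGT
   SET delete_flag = 'Y'
 WHERE TGT.LOAD_DATE >= TO_DATE('2003-12-01', 'YYYY-MM-DD')  -- lets the optimizer prune partitions
   AND TGT.KEY_COL NOT IN (SELECT KEY_COL FROM WORK_SRC_KEYS);

COMMIT;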
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Post by 1stpoint »

kcbland wrote: Now, can you tell me why a parallel DML query against a partitioned table is NOT the best solution if you are able to supply the partition key and are using locally managed indexes? (Big assumption on my part that this is a data warehouse and is following good DBA practices --> see Tom Kyte's Expert One-on-One book)
That is definitely a good solution, assuming the table is partitioned and they supply the partition key. If I recall the original question correctly, it wasn't the update operation that was slow; it was the hash build.
Su
Participant
Posts: 13
Joined: Thu Sep 26, 2002 7:16 am

Post by Su »

Thanks to all for the wealth of info.

We will be doing physical deletes also, and all this will be very useful.