Deleting orphaned datasets

Inquisitive
Charter Member
Posts: 88
Joined: Tue Jan 13, 2004 3:07 pm

Post by Inquisitive »

Hi,
Is there any command for identifying and deleting orphaned datasets (datasets whose descriptor files have been deleted)?

Thanks.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Inquisitive,

No, there is not. I just wrote some code to do this for a machine, and it ended up being a bit complex; I invested enough work in it that I am loath to post it in the public domain.

I used the UNIX file command with my own magic file to identify dataset and lookup file set headers (trivial to set up), then ran a system-wide find to get all the descriptors on the system. Next I read the configuration file named by $APT_CONFIG_FILE to find the temp directories and gathered all of the actual data file names (I should have used the same magic number file mechanism for this). Using calls to orchadmin to get the detail information from each descriptor, I retrieved the list of non-orphaned data files; whatever remained in the file list were orphans.
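
For readers who want a starting point, here is a minimal Python sketch of the comparison step only. It is not ArndW's code. It assumes the configuration file lists storage as `resource disk "<dir>"` and `resource scratchdisk "<dir>"` entries and that dataset data files live in the `resource disk` directories. The names `resource_dirs` and `find_orphans` are hypothetical, and the list of referenced data files is passed in as an argument because the exact orchadmin calls are not given in this thread.

```python
import os
import re

# Matches entries such as: resource disk "/data/ds" {pools ""}
RESOURCE_RE = re.compile(r'resource\s+(disk|scratchdisk)\s+"([^"]+)"')


def resource_dirs(config_text, kinds=("disk",)):
    """Return the distinct resource directories of the requested kinds,
    in the order they first appear in the configuration text."""
    dirs = []
    for kind, path in RESOURCE_RE.findall(config_text):
        if kind in kinds and path not in dirs:
            dirs.append(path)
    return dirs


def find_orphans(data_dirs, referenced_files):
    """Return the sorted files under data_dirs that no descriptor references.

    referenced_files is assumed to come from whatever per-descriptor
    detail listing you already have (for example, parsed orchadmin output).
    """
    referenced = {os.path.realpath(p) for p in referenced_files}
    orphans = []
    for top in data_dirs:
        for dirpath, _dirnames, filenames in os.walk(top):
            for name in filenames:
                full = os.path.realpath(os.path.join(dirpath, name))
                if full not in referenced:
                    orphans.append(full)
    return sorted(orphans)
```

The sketch only reports candidates. Anything else stored in those directories will also show up, so review the list before deleting anything.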

The actual coding is straightforward, but add in all the error handling and beeps-bells-and-whistles and it became a bigger program and routine set.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

We have a similar process. We opted not to do a system-wide search because we force all of our *.ds files to be written to a very select set of directories, and all of the physical data files (referenced by the datasets) are written to only one location. That made the search much quicker (the find command can be very slow on a large system, as I am sure you have noticed).
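
A rough illustration of the restricted search, in Python: instead of a system-wide find, look only in the known descriptor directories. The function name `list_descriptors` and the non-recursive `*.ds` pattern are assumptions for this sketch, not a description of bcarlson's actual process.

```python
import glob
import os


def list_descriptors(descriptor_dirs, pattern="*.ds"):
    """Return the sorted descriptor files found directly inside
    the given directories (subdirectories are not searched)."""
    found = []
    for directory in descriptor_dirs:
        found.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(found)
```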

The other complication we have run into is when you are on an MPP system. On our production system, we have an application server tied to multiple database servers (development is currently only one server - much easier to handle orphan searches). So, the DataStage program runs from the app server, but it spans (parallel processing, you know) all of the servers ... and so do the data files. So even though the dataset itself is on server A, the data it references will be spread across several servers.

We have not implemented the orphan search on production. Thankfully we have a lot of space available, so orphans have been less of an issue. However, when we do, we plan to run concurrent searches for the data files on each server, send the combined list to one server to be compared against the list of expected data files culled from the datasets themselves, and then send an orphan list back to each server so the orphans can be deleted there.
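
The planned comparison could look something like the following Python sketch. It assumes the per-server file lists have already been gathered and that the expected files are known as (server, path) pairs. Both the function name `orphans_by_server` and those input shapes are hypothetical.

```python
def orphans_by_server(found_by_server, expected_files):
    """Map each server to the sorted paths it holds that no dataset expects.

    found_by_server: dict of server name -> iterable of paths found there
    expected_files:  iterable of (server, path) pairs culled from the datasets
    """
    expected = set(expected_files)
    result = {}
    for server, paths in found_by_server.items():
        unexpected = sorted(p for p in set(paths) if (server, p) not in expected)
        if unexpected:
            result[server] = unexpected
    return result
```

Keying on the (server, path) pair matters because the same path can exist on several servers, and a file expected on one server may be an orphan on another.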

So, like ArndW mentioned before - easy concept, but pain in the rear to implement.