Designing Jobs

Post questions here relating to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

bala_135
Premium Member
Posts: 156
Joined: Fri Oct 28, 2005 1:00 am
Location: Melbourne,Australia

Designing Jobs

Post by bala_135 »

Hello All,

A clarification on designing jobs.

I have a job that extracts from a source table, does some transformations (separating the new records from the updates), and then does an insert (separate link) and an update (separate link) into the same table.

My question is: can I do the insert and the update in the same job, or should I separate them into an insert job and an update job? Which is the ideal approach?

I am following this approach:
Extract the data from the table and dump it onto a dataset.
Read the dataset, do the transformations, and load it onto another dataset.
Load the new inserts separately.
Load the updates separately.

Problem: with this approach I am increasing the number of jobs. And what happens with high data volumes, or with low data volumes?

Any inputs would be most appreciated.

Regards,
Bala.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

A greater number of jobs (modularity) will give you ease of debugging, along with restartability. A single huge job will make debugging a nightmare, and you can forget about restartability. Weigh your options.
You can probably create a single job to extract, transform and load to staging datasets, and two more jobs: one for the inserts and the other for the updates.
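If you drive those three jobs from a script rather than a sequence, the restart point comes almost for free. A rough sketch only - the project and job names here are made up, and you should check the exit-code mapping documented for the dsjob release you are on:

Code: Select all

# sequence_jobs.py - run the three modular jobs in order with the dsjob CLI.
# PROJECT and the job names are made-up placeholders; substitute your own.
import subprocess
import sys

PROJECT = "DWPROJ"
JOBS = ["ExtractTransform", "LoadInserts", "LoadUpdates"]

for job in JOBS:
    # -jobstatus waits for the job to finish and returns the job status
    # as the exit code: 1 = ran OK, 2 = ran with warnings (verify this
    # mapping against the dsjob documentation for your release).
    rc = subprocess.call(["dsjob", "-run", "-jobstatus", PROJECT, job])
    if rc not in (1, 2):
        # Restartability: fix the failed job and rerun from this point.
        sys.exit("%s failed (dsjob exit code %d)" % (job, rc))
    print("%s finished with exit code %d" % (job, rc))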
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

DSguru2B wrote: You can probably create a single job to extract, transform and load to staging datasets
I would split these jobs as well. That way, all my extraction jobs can use whatever small window of time I have to source the rows, and then free the source databases.
gateleys
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Another good point by gateleys. Modularization has lots of benefits as opposed to its counterpart, the monolithic design.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
bala_135
Premium Member
Posts: 156
Joined: Fri Oct 28, 2005 1:00 am
Location: Melbourne,Australia

Post by bala_135 »

Hi All,

Thanks for the inputs. So I guess I am going ahead with the right approach:

Extract the data from the table and dump it onto a dataset.
Read the dataset, do the transformations, and load it onto another dataset.
Load the new inserts separately.
Load the updates separately.


Another question: how can I decide on the size of the project directory? If I am creating many datasets as intermediate targets, each roughly 50 MB in size, is there a proportional formula, or can I determine the project directory size from the number and size of the datasets, apart from the space for the installables?

Regards,
Bala.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

In a word, "MORE".

Data Sets, particularly with unbounded strings, consume rather more disk space than you would expect. There is also an 80-bit (10-byte) per-record storage overhead to be considered. The Parallel Job Developer's Guide (page 2-32) helps you to calculate the storage requirement for each data type.

In addition to space on your resource disk, where Data Set data files reside, you also need to configure lots of space on scratch disk. How much is really a function of what kind of processing you are doing and how much physical memory can be allocated to those processes - any extra spills to scratch disk.
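As a back-of-envelope illustration only - the per-field byte costs below are assumptions, so take the real figures for your data types from the Guide:

Code: Select all

# dataset_size.py - rough estimate of parallel Data Set disk usage.
RECORD_OVERHEAD_BYTES = 10          # the 80-bit per-record overhead

def estimate_mb(record_count, field_bytes):
    """(sum of field storage + per-record overhead) x record count."""
    record_size = sum(field_bytes) + RECORD_OVERHEAD_BYTES
    return record_count * record_size / (1024.0 ** 2)

# Example: 1,000,000 records of int32 (4 bytes), date (4 bytes) and a
# bounded varchar(50) (50 bytes plus an assumed length prefix).
print("%.1f MB" % estimate_mb(1000000, [4, 4, 54]))
# Unbounded strings are the wildcard - budget generously for them.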
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
bala_135
Premium Member
Posts: 156
Joined: Fri Oct 28, 2005 1:00 am
Location: Melbourne,Australia

Post by bala_135 »

Hi,

Thanks for the response.
Are there any performance implications in loading the data to the database directly from a dataset, versus passing it through a Copy stage and then to the database?
My business requirement has future enhancements coming, so I am keeping a Copy stage. Kindly give your inputs on this, and on my design approach as well.

Regards,
Bala.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Copy stage that does nothing will be optimized out. You won't see it in the score.
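You can verify this for yourself: set APT_DUMP_SCORE to True (as a job parameter, or project-wide through the Administrator) and the score is written to the job log in Director.

Code: Select all

APT_DUMP_SCORE=True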
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.