Page 1 of 1

Insert/Update Job Design: Comments?

Posted: Wed Jun 08, 2005 10:02 am
by lshort
I am experimenting with the Compare Stage (trying to determine its usefullness). I am exploring the following scenario.
Requirement:
A. Insert of all New records to DataWarehouse
B. Update DW when changes are made to specific columns ONLY.
Any thoughts or comments? :)

Step 1. Obtain seq file containing TODAY transactions.

Step 2.

Job1: DBMS lookup to determine INSERTS (key not found)

Source Link= Today.seq
Ref Link = DW
<<LookupStage>>
Output Link1 [NOT.FOUND]= Inserts.ds
Output Link2 [FOUND]= CouldBeUpdate.ds
Output Link3 [FOUND]= FoundInDw.ds

Job2: Compare-Determine which rows have a change to the required columns

Source Link1= CouldBeUpdate.ds
Source Link2= FoundInDw.ds
<<CompareStage>>
Output Link =CompareResults.ds

Job3: Seperate updates from no change rows

Source Link1= CompareResults.ds
<<SwitchStage>>
Output Link1 [Result Code<>0] = Updates.ds
Output Link2 [rej] = NoChange.ds

Step 3.
Do Inserts and Updates to the DW
Inserts.ds
Updates.ds

Posted: Wed Jun 08, 2005 1:46 pm
by richdhan
Hi lshort,

Look into the CDC stage. The documentation in parjdev.pdf gives additional information on CDC stage. It is used to identify new inserts, updates as well as deletes.

You can finish it in one job. The CDC stage requires 2 inputs. Use the flat file as one input and the DBMS data as the other input. Define the keys and values in CDC stage. The CDC stage provides an additional column known as change_code which is used to identify an insert, update, delete or an exact copy.

HTH
Rich

Posted: Wed Jun 08, 2005 1:59 pm
by lshort
I'll look into it. Thanks for the info. 8)

Posted: Wed Jun 08, 2005 4:40 pm
by gh_amitava
Hi,

You can use the "Upsert" feature of DataStage.

Regards
Amitava

Posted: Thu Jun 09, 2005 7:20 am
by lshort
True. I could use upsert. However, if I should need to be able to track the number of inserts versus updates upsert will not allow for it. 8)

Posted: Fri Jun 10, 2005 2:57 pm
by bcarlson
richdhan wrote:Hi lshort,

Look into the CDC stage. The documentation in parjdev.pdf gives additional information on CDC stage. It is used to identify new inserts, updates as well as deletes.

Rich
Pardon a potentially dumb question. What is the CDC Stage? I couldn't find references to it in the parjdev manual. Is this the change capture stage?

We are looking into a similar situation (insert new/update only if certain fields change) and I found this posting in a search.

Posted: Tue Jun 14, 2005 11:52 am
by richdhan
Hi Carlson

Yes, I was referring to the Change Capture stage.

The Change Capture stage will work perfectly for your situation. You have to define the fields in the change values section of the CDC stage and change_code additional column will detect the update.

HTH
Rich

Posted: Fri Aug 26, 2005 1:24 pm
by varshanswamy
With respect to the performance does the change capture stage provide better performance when compared to the lookup stage.

Posted: Fri Aug 26, 2005 4:44 pm
by ray.wurlod
It provides different functionality, so you'd be trying to compare apples with oranges even if you did have a definition of "performance".

Posted: Fri Aug 26, 2005 4:48 pm
by vmcburney
I always felt you were better off doing surrogate key lookup first and then doing change data capture on the rows that already exist and sending the new rows directly down an insert path. A surrogate key lookup only retrieves a couple columns from the database while CDC returns every column. So lookup followed by CDC should be faster then CDC followed by lookup.

You can't get away with a CDC alone unless you have built dimensions without surrogate keys. A CDC stage can bring in rows from the dimension for comparison but it will not let you take or use a dimension field. You can only output rows from the "After" link, not the "Before" link.

We had shocking performance of upserts on DB2 and had to make sure we used seperate insert and update paths. The initial load of a table is the worst performer, DB2 has to perform full table scans for every row as it cannot use an index until stats have been run on the table.

a different type of question on this

Posted: Mon Aug 29, 2005 8:29 pm
by balakot
i was working on some thing like the same job that you guys were discussing and in the end we want to write to the data base only there records that have a same data type and the ones that dont have the data types should be dropped out and these records should be written out to the reject file.
Is there any function out there like get stage info or some thin that can give us some details about the record be dropped since there is a different data type out there?

Thanks
-babloo