Finding the Updates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

Finding the Updates

Post by vzmz »

I have a tables which have to be updated using another table (which have all the updated records)
How can i achieve it?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That's going to depend on a number of factors, none of which you bothered to tell us about. Let's call TableA the table that has to be updated, and TableB the table that contains the updates. Does TableB contain only records to cause updates to TableA, or does TableB contain these and other records besides?
(1) If the record structures are identical, and if the rows will be new rows in TableA, and TableA and TableB are in the same database instance, then it could be done with SQL, for example INSERT INTO TableA SELECT * FROM TableB; (possibly qualified with a WHERE clause in the SELECT).
(2) Even if the row structures weren't identical, but there were enough information in TableB rows, this approach would be doable.
(3) If TableA has to be updated rather than get new rows, you could create an UPDATE statement containing a correlated sub-query, but this may run slowly.
(4) You could create a DataStage job. This might be the best approach if, for example, TableA and TableB are in different databases, or if you need a robust mechanism for identifying which rows from TableB will generate INSERT operations on TableA and which rows from TableB will generate UPDATE operations on TableB (for example load a hashed file with primary key values from TableA and perform lookups against it, and maintain the hashed file as new rows are inserted into TableA.
(5) Are the records in TableB timestamped in any way? In this case you may be able to leverage this information to determine which rows have to be affected in TableA.
There are probably other strategies, depending on exactly what it is you want to do. How about revealing just a little more, so we can be a little more targetted in providing assistance? Right now your question falls into the same category as "how do I sort fruit?".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

Also, if you only want to update SOME of TableB records onto TableA, you can use the Lookup stage to verify that you are only pulling TableB records that have the same key values as TableA. This can hurt performance espeically with very large tables.

Can you please share your business requirements with us so we can provide a better (and more precise) suggestion?

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

Post by vzmz »

ray.wurlod wrote:That's going to depend on a number of factors, none of which you bothered to tell us about. Let's call TableA the table that has to be updated, and TableB the table that contains the updates. Does TableB contain only records to cause updates to TableA, or does TableB contain these and other records besides?
(1) If the record structures are identical, and if the rows will be new rows in TableA, and TableA and TableB are in the same database instance, then it could be done with SQL, for example INSERT INTO TableA SELECT * FROM TableB; (possibly qualified with a WHERE clause in the SELECT).
(2) Even if the row structures weren't identical, but there were enough information in TableB rows, this approach would be doable.
(3) If TableA has to be updated rather than get new rows, you could create an UPDATE statement containing a correlated sub-query, but this may run slowly.
(4) You could create a DataStage job. This might be the best approach if, for example, TableA and TableB are in different databases, or if you need a robust mechanism for identifying which rows from TableB will generate INSERT operations on TableA and which rows from TableB will generate UPDATE operations on TableB (for example load a hashed file with primary key values from TableA and perform lookups against it, and maintain the hashed file as new rows are inserted into TableA.
(5) Are the records in TableB timestamped in any way? In this case you may be able to leverage this information to determine which rows have to be affected in TableA.
There are probably other strategies, depending on exactly what it is you want to do. How about revealing just a little more, so we can be a little more targetted in providing assistance? Right now your question falls into the same category as "how do I sort fruit?".
Thanks Ray,
I have a master table called account (The Warehouse table) which has to update with new data coming from different sources. The new data in form of sequential files which has to be validated for nulls and duplicates. After which I populated another table account_new using this data. The new data that we have has new records, old and updated records. Using these 2 tables I have to find Insert Updates and Old records and insert them in different tables namely account_ins(for insert records) account_upd(for all the records that are updates) and account_old(for all the old records).
I am able to extract the information in these tables with select statements (select * from account_new where account_new primaray key not in account) similarly for the rest of the tables.
Now that we have insert which is pretty simple do and updates, for updates I have to write UPDATES query. I have to compare every field and update if different and I am stuck here now.
Now we know that all the records that have to be updated are in account_upd using this I have to modify the table account to reflect the changes.
My question is how I can do this,
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.

Code: Select all

                      lookup
                         :
                         V
  staging  ----->   Transformer   -----> inserts
                         |
                         +-----> updates
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vzmz
Participant
Posts: 36
Joined: Sun Nov 23, 2003 12:10 pm
Location: Dallas

Post by vzmz »

ray.wurlod wrote:The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.

Code: Select all

                      lookup
                         :
                         V
  staging  ----->   Transformer   -----> inserts
                         |
                         +-----> updates
I am getting close. For the look up i was trying to load the account(DW) table into a hash file(keeping in mind that this table have of million of records) when i complied it says key has to be specified but i haven't seen any key column.
So i thought of

Code: Select all

                              TeraData stage (account table)
                                      :  Refernce link  
                                      :
                                      V
(account_upd table)                                     Account table(DW)
        TeraData Stage------->transformer---------->TeraData Stage
                        (steam input)          (stream output)
can we have 2 instance of table one used for lookup other for insert or update.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Yes, you can have two (or more) links to the same table from DataStage. You may need to be careful with transactions locking the table, however.

This is another reason for preferring a hashed file. Millions of key values is still not that large a hashed file. You only need the primary key values, because all you're checking for is existence.

Are you saying that the target table does not have a primary key defined? In that case, the design won't work very efficiently (or at all), since a reference lookup is based on WHERE keycolumn = value (inspect the generated SQL in the Teradata stage).

Hashed files MUST have a key defined. Edit the column metadata in the hashed file stage to specify which column is the key column.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sri1dhar
Charter Member
Charter Member
Posts: 54
Joined: Mon Nov 03, 2003 3:57 pm

Post by sri1dhar »

Ray,

But I believe "update existing rows" option updates the record if there is a record with the matching key exists, irrespective of whether any of the column data really changed. How do I do a conditional update? Meaning update the record only if the data changed for atleast one column.

Thanks
Sridhar

[quote="ray.wurlod"]The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.

[code]
lookup
:
V
staging -----> Transformer -----> inserts
|
+-----> updates[/code][/quote]
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

sri1dhar wrote:But I believe "update existing rows" option updates the record if there is a record with the matching key exists, irrespective of whether any of the column data really changed. How do I do a conditional update? Meaning update the record only if the data changed for atleast one column.
You are correct, it does the update irrespective of changes. If you truly want to do a 'conditional update' then you will need to determine in your job if something has changed and only issue the update in that case.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In short, you use the output link constraint from a Transformer stage to identify those rows that will be sent on for further processing (in your case, updating rows in a table). There is no limit on what you can do (except that, if it's too complex, you may need to write a routine; there is an upper limit on the lengtth of an expression).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply