Finding the Updates

vzmz · Post by **vzmz** » Fri Jan 02, 2004 9:23 pm

I have a tables which have to be updated using another table (which have all the updated records)
How can i achieve it?

ray.wurlod · Post by **ray.wurlod** » Fri Jan 02, 2004 11:25 pm

That's going to depend on a number of factors, none of which you bothered to tell us about. Let's call TableA the table that has to be updated, and TableB the table that contains the updates. Does TableB contain only records to cause updates to TableA, or does TableB contain these and other records besides?
(1) If the record structures are identical, and if the rows will be new rows in TableA, and TableA and TableB are in the same database instance, then it could be done with SQL, for example INSERT INTO TableA SELECT * FROM TableB; (possibly qualified with a WHERE clause in the SELECT).
(2) Even if the row structures weren't identical, but there were enough information in TableB rows, this approach would be doable.
(3) If TableA has to be updated rather than get new rows, you could create an UPDATE statement containing a correlated sub-query, but this may run slowly.
(4) You could create a DataStage job. This might be the best approach if, for example, TableA and TableB are in different databases, or if you need a robust mechanism for identifying which rows from TableB will generate INSERT operations on TableA and which rows from TableB will generate UPDATE operations on TableB (for example load a hashed file with primary key values from TableA and perform lookups against it, and maintain the hashed file as new rows are inserted into TableA.
(5) Are the records in TableB timestamped in any way? In this case you may be able to leverage this information to determine which rows have to be affected in TableA.
There are probably other strategies, depending on exactly what it is you want to do. How about revealing just a little more, so we can be a little more targetted in providing assistance? Right now your question falls into the same category as "how do I sort fruit?".

Teej · Post by **Teej** » Sat Jan 03, 2004 10:40 am

Also, if you only want to update SOME of TableB records onto TableA, you can use the Lookup stage to verify that you are only pulling TableB records that have the same key values as TableA. This can hurt performance espeically with very large tables.

Can you please share your business requirements with us so we can provide a better (and more precise) suggestion?

-T.J.

vzmz · Post by **vzmz** » Mon Jan 05, 2004 1:50 pm

ray.wurlod wrote:That's going to depend on a number of factors, none of which you bothered to tell us about. Let's call TableA the table that has to be updated, and TableB the table that contains the updates. Does TableB contain only records to cause updates to TableA, or does TableB contain these and other records besides?
(1) If the record structures are identical, and if the rows will be new rows in TableA, and TableA and TableB are in the same database instance, then it could be done with SQL, for example INSERT INTO TableA SELECT * FROM TableB; (possibly qualified with a WHERE clause in the SELECT).
(2) Even if the row structures weren't identical, but there were enough information in TableB rows, this approach would be doable.
(3) If TableA has to be updated rather than get new rows, you could create an UPDATE statement containing a correlated sub-query, but this may run slowly.
(4) You could create a DataStage job. This might be the best approach if, for example, TableA and TableB are in different databases, or if you need a robust mechanism for identifying which rows from TableB will generate INSERT operations on TableA and which rows from TableB will generate UPDATE operations on TableB (for example load a hashed file with primary key values from TableA and perform lookups against it, and maintain the hashed file as new rows are inserted into TableA.
(5) Are the records in TableB timestamped in any way? In this case you may be able to leverage this information to determine which rows have to be affected in TableA.
There are probably other strategies, depending on exactly what it is you want to do. How about revealing just a little more, so we can be a little more targetted in providing assistance? Right now your question falls into the same category as "how do I sort fruit?".

Thanks Ray,
I have a master table called account (The Warehouse table) which has to update with new data coming from different sources. The new data in form of sequential files which has to be validated for nulls and duplicates. After which I populated another table account_new using this data. The new data that we have has new records, old and updated records. Using these 2 tables I have to find Insert Updates and Old records and insert them in different tables namely account_ins(for insert records) account_upd(for all the records that are updates) and account_old(for all the old records).
I am able to extract the information in these tables with select statements (select * from account_new where account_new primaray key not in account) similarly for the rest of the tables.
Now that we have insert which is pretty simple do and updates, for updates I have to write UPDATES query. I have to compare every field and update if different and I am stuck here now.
Now we know that all the records that have to be updated are in account_upd using this I have to modify the table account to reflect the changes.
My question is how I can do this,

ray.wurlod · Post by **ray.wurlod** » Mon Jan 05, 2004 5:51 pm

The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.

Code: Select all

                      lookup
                         :
                         V
  staging  ----->   Transformer   -----> inserts
                         |
                         +-----> updates

vzmz · Post by **vzmz** » Mon Jan 05, 2004 6:25 pm

ray.wurlod wrote:The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.
Code: Select all
                      lookup
                         :
                         V
  staging  ----->   Transformer   -----> inserts
                         |
                         +-----> updates

I am getting close. For the look up i was trying to load the account(DW) table into a hash file(keeping in mind that this table have of million of records) when i complied it says key has to be specified but i haven't seen any key column.
So i thought of

Code: Select all

                              TeraData stage (account table)
                                      :  Refernce link  
                                      :
                                      V
(account_upd table)                                     Account table(DW)
        TeraData Stage------->transformer---------->TeraData Stage
                        (steam input)          (stream output)

can we have 2 instance of table one used for lookup other for insert or update.

ray.wurlod · Post by **ray.wurlod** » Mon Jan 05, 2004 6:31 pm

Yes, you can have two (or more) links to the same table from DataStage. You may need to be careful with transactions locking the table, however.

This is another reason for preferring a hashed file. Millions of key values is still not that large a hashed file. You only need the primary key values, because all you're checking for is existence.

Are you saying that the target table does not have a primary key defined? In that case, the design won't work very efficiently (or at all), since a reference lookup is based on WHERE keycolumn = value (inspect the generated SQL in the Teradata stage).

Hashed files MUST have a key defined. Edit the column metadata in the hashed file stage to specify which column is the key column.

sri1dhar · Post by **sri1dhar** » Sat Jan 10, 2004 10:12 am

Ray,

But I believe "update existing rows" option updates the record if there is a record with the matching key exists, irrespective of whether any of the column data really changed. How do I do a conditional update? Meaning update the record only if the data changed for atleast one column.

Thanks
Sridhar

[quote="ray.wurlod"]The simplest thing is to let DataStage generate the SQL for you; depending on the stage type you are using there are between six and ten "update rules", such as "insert new rows only", "update existing rows", and so on.

Feed the rows to be updated through a stage with "update existing rows" as the update rule, and inspect the SQL that is generated. For each (non-key) column to be updated there is a parameter marker in the SET clause, while for each key column there is a parameter marker in the WHERE clause.

Your DataStage job design requires a way to determine which rows need to be inserted and which rows need to be updated. The easiest way to do this is a lookup, possibly from the DW table itself, but more likely from a hashed file containing just the primary key values from the DW table.

If the key value exists, then an update is required; if the key value does not exist then an insert is required.

[code]
lookup
:
V
staging -----> Transformer -----> inserts
|
+-----> updates[/code][/quote]

chulett · Post by **chulett** » Sat Jan 10, 2004 10:31 am

sri1dhar wrote:But I believe "update existing rows" option updates the record if there is a record with the matching key exists, irrespective of whether any of the column data really changed. How do I do a conditional update? Meaning update the record only if the data changed for atleast one column.

You are correct, it does the update irrespective of changes. If you truly want to do a 'conditional update' then you will need to determine in your job if something has changed and only issue the update in that case.

ray.wurlod · Post by **ray.wurlod** » Sat Jan 10, 2004 9:45 pm

In short, you use the output link constraint from a Transformer stage to identify those rows that will be sent on for further processing (in your case, updating rows in a table). There is no limit on what you can do (except that, if it's too complex, you may need to write a routine; there is an upper limit on the lengtth of an expression).