
Update table using ODBC

Posted: Tue Jul 20, 2004 12:06 pm
by appcon
Hi,

I am designing a job with an ODBC stage in which I need to update existing records and insert new records, and I want to clarify one of the options given in the Update Action drop-down menu.

There is an option that says "Update existing records and insert new records". My question is: when a new record is inserted, will it be appended at the end of the table, or will this feature re-sort the records and place the new one according to a certain sort order?

I would appreciate it if anyone can advise me on this.

Thanks,
Sri.

Posted: Tue Jul 20, 2004 3:40 pm
by roy
Hi,
Well, your question may seem simple to some, but I can see why you ask.
The answer depends on your table and database.
In general, if a new row is inserted it behaves exactly as if you had issued an insert statement via any SQL editor.
So the exact answer should come from your DBA for this particular database and table.
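To illustrate the point, the stage is just generating ordinary DML for each row; the table and column names below are hypothetical, and where the new row physically lands is entirely up to the database:

  -- Roughly what the "update existing / insert new" action issues per row
  -- (hypothetical table MY_TARGET keyed on CUST_ID)
  UPDATE my_target
     SET cust_name = ?, cust_city = ?
   WHERE cust_id = ?;

  -- If no row was updated, the stage falls back to an insert
  INSERT INTO my_target (cust_id, cust_name, cust_city)
  VALUES (?, ?, ?);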

IHTH,

Posted: Tue Jul 20, 2004 3:56 pm
by appcon
Thanks Roy... I sure will check with my DBA.

Posted: Tue Jul 20, 2004 5:12 pm
by vmcburney
The table itself does not need to be re-sorted when an insert is made. The primary key index and any other indexes on the table may need to be updated after an insert. This does add overhead to your insert statements and is one of the reasons why a bulk load is a lot faster than a large number of individual insert statements. A bulk load is able to optimise the handling of keys and indexes to provide faster inserts. For example, you can configure an Oracle bulk load to disable keys and indexes and rebuild them after all the inserts are done.
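As a rough sketch of what that can look like on the Oracle side (the index, table and control file names are made up, and the exact steps depend on your loader options and your DBA's standards):

  -- Hypothetical target BIG_FACT with index BIG_FACT_PK_IDX
  ALTER INDEX big_fact_pk_idx UNUSABLE;

  -- Direct path load with index maintenance deferred (SQL*Loader command line):
  --   sqlldr userid=scott control=big_fact.ctl direct=true skip_index_maintenance=true

  -- Rebuild the index once the load has completed
  ALTER INDEX big_fact_pk_idx REBUILD;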

You will find many threads on this forum advocating splitting data into an inserts stream and an updates stream and using bulk load for the inserts part.

Posted: Tue Jul 20, 2004 5:39 pm
by rasi
When you use the update existing/insert new option from the drop-down list, DataStage doesn't do anything special to your database. It sends the records to the table as ordinary update and insert SQL. Where the new records end up depends on your database and the way you created the table.

Thanks
Siva

Posted: Tue Jul 20, 2004 10:07 pm
by kate_gkl
vmcburney wrote: For example, you can configure an Oracle bulk load to disable keys and indexes and rebuild them after all the inserts are done.
I have tried this option when loading my data into an Oracle database. However, it does not check for duplicate data. For example, if I have two records with the same primary key, both records will be loaded into the database, when in fact we only need one of them to be loaded.

May I know if there is any way for me to check for the duplicates and remove them?

Thanks very much in advance.

Posted: Tue Jul 20, 2004 10:18 pm
by vmcburney
This is part of the process of separating your inserts from your updates. You load the primary keys of your target table into a hash file and then look up your input stream against that hash file: any rows that find a match become update statements, and any rows without a match become inserts.

The updates are delivered straight to the database via the OCI stage. The inserts are written to a bulk load stage.
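Expressed as set-based SQL rather than a per-row hash file lookup, the split looks something like this (table and column names are hypothetical, just to show the logic):

  -- Rows whose key already exists in the target become updates
  SELECT s.*
    FROM stage_cust s
   WHERE EXISTS (SELECT 1 FROM target_cust t WHERE t.cust_id = s.cust_id);

  -- Rows whose key does not yet exist become inserts (bulk load candidates)
  SELECT s.*
    FROM stage_cust s
   WHERE NOT EXISTS (SELECT 1 FROM target_cust t WHERE t.cust_id = s.cust_id);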

Posted: Tue Jul 20, 2004 10:23 pm
by rasi
This process doesn't check whether or not you have duplicate records in your dataset. If you have multiple records with the same primary key in the source, then it will update the target table that many times. The last record which comes from the input will be the one that is active in your database.

To avoid this you need to have a process in place that eliminates the duplicates. You can sort the records and then, based on your business rule, keep whichever one you want. That removes duplicates from the input dataset. To avoid duplicates against the target table, put the primary keys of the target into a hash file and filter against it in your stage.

Note: if you define a primary key on your table, the database will never allow you to insert two records with the same primary key.
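For example, if your staging data carries something that indicates arrival order, one way to keep only the last record per key before loading is a query along these lines (the staging table STAGE_CUST, key CUST_ID and sequence column LOAD_SEQ are all hypothetical):

  SELECT cust_id, cust_name, cust_city
    FROM (SELECT s.*,
                 ROW_NUMBER() OVER (PARTITION BY cust_id
                                    ORDER BY load_seq DESC) AS rn
            FROM stage_cust s)
   WHERE rn = 1;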

Thanks
Siva

Posted: Tue Jul 20, 2004 10:29 pm
by kate_gkl
Thanks for your reply. I understand what you were explaining. Actually, I am not trying to update the data in this case; I am just inserting the data and happen to have two records with the same primary key. If I use the Oracle stage to insert, I am able to modify the user-defined query in order to get only unique data loaded. However, I could not find such an option in the Oracle Bulk Load stage.

So, how can I get the same output from the Oracle Bulk Load stage as from the Oracle Enterprise stage (i.e. not loading duplicate data when inserting)?

Thanks. Cheers~

Posted: Tue Jul 20, 2004 10:37 pm
by kate_gkl
rasi wrote: The last record which comes from the input will be the one that is active in your database.
Do you mean we will have all the records with the same primary key in the table, but only the last inserted record is active?


rasi wrote: To avoid duplicates against the target table, put the primary keys of the target into a hash file and filter against it in your stage.
So, I need to get the primary keys of the target table into a hash file and filter against them in a stage? Can I use the Remove Duplicates stage to do it?

Thanks for all your help. Appreciate it so much. :)

Posted: Tue Jul 20, 2004 11:13 pm
by rasi
If you have duplicate records in your source and update the target table using the insert new/update existing option, then the last duplicate record from your source will be the one that is active in the database table.

If you want to avoid duplicates against your existing table then yes, you need to put your existing primary keys into a hash file and look the source up against it. Unfortunately there is no stage in DataStage to remove duplicates automatically.

Thanks
Siva

Posted: Wed Jul 21, 2004 7:15 am
by chulett
rasi wrote: Unfortunately there is no stage in DataStage to remove duplicates automatically.
Actually, I believe there is on the PX side. No such luck on the Server side. :?

:!: Warning: Opinion. Using the "insert then update" update action or its twin brother allows ETL jobs to get "sloppy" and can hide a multitude of sins, never mind the fact that it is about the slowest option you can pick. I strongly discourage their use. The job designer should be hashing up primary keys and doing a proper job of determining updates versus inserts, duplicates versus unique records, etc. You can then handle each as appropriate for its type and your requirements. I mean, why take one new record with four "duplicates" behind it and write to your database five times? Better to determine this ahead of time and do only one operation against your database, whatever that one operation is.

In a Server job, use hash files and stage variables (if required) to handle duplicate records. Keep in mind that you can write to the same hash file you are using for the current lookups; this allows new keys to be added to the lookup as they are added to the database. Use a mechanism like this to help handle duplicates inside your source data.

Any bulk loader simply shoves records into a database, hopefully as fast as possible. For a Direct Path load into Oracle, for example, all indexes are disabled (among other things) to allow records into the database at warp speed. At the completion of the load it will then attempt to rebuild / re-enable them. If you've allowed duplicate records to be bulk loaded into a table with a unique constraint, you are going to have some issues to deal with. :lol: The table basically becomes unusable until it is "de-duped" and the constraints are successfully re-established.
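If you do end up in that situation, the cleanup is typically done in SQL along these lines (the table, key and constraint names are hypothetical):

  -- Find keys that were loaded more than once
  SELECT cust_id, COUNT(*)
    FROM target_cust
   GROUP BY cust_id
  HAVING COUNT(*) > 1;

  -- Remove the extras, keeping one row per key
  DELETE FROM target_cust t
   WHERE t.rowid NOT IN (SELECT MIN(rowid)
                           FROM target_cust
                          GROUP BY cust_id);

  -- Then re-enable the constraint, or rebuild its index if it was only marked unusable
  ALTER TABLE target_cust ENABLE CONSTRAINT target_cust_pk;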

Posted: Fri Jul 23, 2004 4:22 am
by ray.wurlod
The simple solution is to declare a UNIQUE constraint on the key column of the database table.
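In Oracle that is a one-liner (table and constraint names hypothetical):

  ALTER TABLE target_cust
    ADD CONSTRAINT target_cust_uk UNIQUE (cust_id);

With conventional inserts the second duplicate row is then rejected with an ORA-00001 error instead of being loaded; as noted above, a direct path load only verifies the constraint when the index is rebuilt at the end of the load.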