
Insert and update records in SQL Server

Posted: Thu Mar 02, 2006 1:10 pm
by samba
Hi,

I am loading data into SQL Server (using the Dynamic RDBMS stage) and using an insert-or-update query to load the data. The record count is around 1 million, and I have to refresh the data daily using this job. When I run the same job again, it loads the entire set of records once more; I need to load only the new records.

Thanks

Posted: Thu Mar 02, 2006 1:18 pm
by Mat01
Hi Samba,

I don't know how you extract your data, but if your goal is to load only new records, you could use a Change Capture stage with your new data as the after dataset and the target table as the before dataset, and retain only the new records (inserts are change_code = 1).
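For reference, the same comparison can be sketched in plain SQL; the names below (after-extract after_XYZ, target/before table before_XYZ, key column A) are hypothetical stand-ins:

Code:

-- After-rows whose key is absent from the before-set are the new records
-- (the rows Change Capture flags with change_code = 1)
SELECT a.*
FROM   after_XYZ a
LEFT JOIN before_XYZ b ON b.A = a.A
WHERE  b.A IS NULL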

Regards,

Mat

Posted: Thu Mar 02, 2006 1:35 pm
by samba
Thanks

Posted: Thu Mar 02, 2006 1:40 pm
by kwwilliams
1. Your database table should have a primary key or at least a unique constraint.

2. Is there a way that you can select only a delta from your million-row source table? An update datetime, perhaps? If so, you can store this off and pass it into your job as a parameter in your select statement.

3. Now that your unique constraint is in place, you should be able to specify an upsert: insert then update (if there are more new rows than existing rows) or update then insert (if there are more rows to update than to insert); see the T-SQL sketch below.
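A minimal T-SQL sketch of points 1 and 3, with tgt_XYZ, src_XYZ, key column A, and data columns B and C as hypothetical names:

Code:

-- Point 1: unique constraint on the key used in the update's WHERE clause
ALTER TABLE tgt_XYZ
  ADD CONSTRAINT uq_tgt_XYZ_A UNIQUE (A)

-- Point 3, update then insert: update matching keys first...
UPDATE t
SET    t.B = s.B,
       t.C = s.C
FROM   tgt_XYZ t
JOIN   src_XYZ s ON s.A = t.A

-- ...then insert the keys that did not match
INSERT INTO tgt_XYZ (A, B, C)
SELECT s.A, s.B, s.C
FROM   src_XYZ s
WHERE  NOT EXISTS (SELECT 1 FROM tgt_XYZ t WHERE t.A = s.A)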

Posted: Thu Mar 02, 2006 1:43 pm
by samba
I need to update the existing records as well.

Posted: Thu Mar 02, 2006 1:46 pm
by samba
kwwilliams,

But it's taking a long time to update or insert the entire data set, nearly 25 hours.

Posted: Thu Mar 02, 2006 1:47 pm
by kwwilliams
Then get a unique constraint on the table (which will help with job performance as well) and perform an update then insert; that will do the trick for you.

Posted: Thu Mar 02, 2006 2:10 pm
by Mat01
If your update and insert volumes are both very large, try splitting your updates and your inserts into two separate datasets (Change Capture will tell you which is which, as will any timestamp you put in the source data). This way you can update and insert separately, which reduces database operations.
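If the split is done on the database side rather than in a Change Capture stage, an EXISTS test against the target gives the same two streams; a minimal sketch, with src_XYZ, tgt_XYZ, and key column A again as hypothetical names:

Code:

-- Keys already present in the target: these rows feed the UPDATE stream
SELECT s.*
FROM   src_XYZ s
WHERE  EXISTS (SELECT 1 FROM tgt_XYZ t WHERE t.A = s.A)

-- Keys not yet in the target: these rows feed the INSERT stream
SELECT s.*
FROM   src_XYZ s
WHERE  NOT EXISTS (SELECT 1 FROM tgt_XYZ t WHERE t.A = s.A)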

HTH,

Mat

Posted: Thu Mar 02, 2006 2:19 pm
by DSguru2B
Tell me this, samba:
when a record gets updated in your source table, is there a column that stores a timestamp telling you when the record was last updated?

Posted: Thu Mar 02, 2006 2:56 pm
by Ultramundane
You could add a column with the datatype timestamp/ROWVERSION (I believe this works in both MSSQL and Sybase). A column of this system type automatically stores the database timestamp whenever an insert or update is done to the row. If you keep track of this varbinary value, you can use it to determine which records changed and pull just those records into an upsert. It would be basically the same thing as having a datetime column that all of your applications (or a trigger) always update; however, if someone failed to update that column, you would lose that update. That is why I like the timestamp datatype.

You would miss records that need to be deleted using this approach, but you might not care about that for your purpose.

One method to implement this approach:

Add a column with the datatype timestamp to the tables you need to copy.
--> Eg) timestamp timestamp

Create a table to keep track of the database timestamp between runs ([table] is bracketed because TABLE is a reserved word in T-SQL):

create table ts_track (
    [table] varchar(128) not null,
    timestamp timestamp not null,
    old_timestamp varbinary(8) null,
    row_update_dtime datetime not null)

Insert a record into the ts_track table (the timestamp column is omitted because SQL Server populates it automatically):

insert into ts_track ([table], old_timestamp, row_update_dtime)
values ('your_table', 0x0, getdate())

Do a full sync.

Update ts_track before the next read so that we do not lose record modifications. This could result in a few extra rows being re-pulled, but that should be no big deal.

Job1:
UPDATE ts_track
SET old_timestamp = timestamp,
row_update_dtime = getdate()
WHERE [table] = 'your_table'

Job2:
The next time you read, do a join with the tracking table.
Eg)
SELECT T1.*
FROM your_table T1,
ts_track T2
WHERE T1.timestamp >= T2.old_timestamp
AND T2.[table] = 'your_table'

Posted: Thu Mar 02, 2006 2:58 pm
by kwwilliams
samba wrote: kwwilliams,

But it's taking a long time to update or insert the entire data set, nearly 25 hours.
What indexes do you have on the table? That is why I said this would improve your performance. My guess is that a unique constraint on your key field (the same key that would be in your where clause) would significantly improve your performance. 1 million rows is not an unreasonable amount of data; I have a job that does this with 7 million rows in just over an hour.

Posted: Thu Mar 02, 2006 3:06 pm
by DSguru2B
Samba, if you have a column in your source table that maintains the timestamp of when each record was last updated or inserted, then the easiest thing for you to do is to select only the changed records from your source table.
In the DRS stage, select user-defined SQL, and in the SQL select only the changed records.
Say you have a table XYZ with four columns A, B, C, DT, where DT is the column that maintains the timestamp of when the record was last modified (inserted or updated).

In the user-defined SQL you could do something like this:

Code:

SELECT A, B, C, DT
FROM   XYZ
WHERE  DT BETWEEN (SELECT MAX(DT)
                   FROM   XYZ_target) AND GETDATE()
-- XYZ_target is a stand-in for the table you load into; its MAX(DT) marks
-- the most recent record already loaded, so only newer source rows return
This SQL will get only the changed records.
In your target DRS stage, choose the option update then insert or insert then update, depending upon the frequency of updates and inserts you get.
This will also solve your initial problem of records getting loaded twice.
I hope it helps.

Posted: Thu Mar 02, 2006 5:42 pm
by kwwilliams
I agree with DSguru2B.

That's why I stated this earlier:

2. Is there a way that you can select only a delta from your million-row source table? An update datetime, perhaps? If so, you can store this off and pass it into your job as a parameter in your select statement.

Posted: Fri Mar 03, 2006 1:44 am
by jasper
A completely different scenario: if the number of inserts/updates is large and it's difficult to find the changed records, just truncate and reload.
I'm not used to SQL Server, but on Oracle this is the fastest solution once you reach about 10% inserts/updates.
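For reference, a minimal T-SQL sketch of the truncate-and-reload pattern, with src_XYZ and tgt_XYZ as hypothetical source and target names:

Code:

-- Empties the target quickly (minimally logged, but it requires ALTER
-- permission on the table and removes every row, so the reload must
-- fully succeed before the data is usable again)
TRUNCATE TABLE tgt_XYZ

-- Reload the full data set in one bulk insert
INSERT INTO tgt_XYZ (A, B, C)
SELECT A, B, C
FROM   src_XYZ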

Posted: Fri Mar 03, 2006 3:34 am
by kumar_s
jasper wrote: A completely different scenario: if the number of inserts/updates is large and it's difficult to find the changed records, just truncate and reload.
I'm not used to SQL Server, but on Oracle this is the fastest solution once you reach about 10% inserts/updates.
What if you use the Change Capture stage?