Insert and update records in SQL Server

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

samba
Premium Member
Posts: 62
Joined: Wed Dec 07, 2005 11:44 am

Insert and update records in SQL Server

Post by samba »

Hi,

I am loading data into SQL Server (using the Dynamic RDBMS stage), and I am using an insert-or-update query to load the data.
The record count is around 1 million, and I have to refresh the data daily using this job.
When I run the same job again it loads all the records once more; I need to load only the new records.

Thanks
samba
Mat01
Premium Member
Posts: 50
Joined: Wed Jun 02, 2004 11:12 am
Location: Montreal, Canada

Post by Mat01 »

Hi Samba,

I don't know how you extract your data, but if your goal is to load only new records, you could use a Change Capture stage with your new data as the after dataset and the target table as the before dataset, and retain only the new records (inserts have change_code = 1).
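
If it helps to see the equivalent logic outside the Change Capture stage, here is a minimal SQL sketch of "keep only the new records". KEYCOL, SRC_TABLE and TGT_TABLE are placeholder names, and it assumes both tables are reachable from one query:

Code: Select all

-- Sketch only: rows in the source with no matching key in the target,
-- i.e. the same rows Change Capture flags with change_code = 1 (insert).
SELECT S.*
FROM   SRC_TABLE S
WHERE  NOT EXISTS (SELECT 1
                   FROM   TGT_TABLE T
                   WHERE  T.KEYCOL = S.KEYCOL)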

Regards,

Mat
samba
Premium Member
Posts: 62
Joined: Wed Dec 07, 2005 11:44 am

Post by samba »

Thanks
samba
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

1. Your database table should have a primary key or at least a unique constraint on the table.

2. Is there a way that you can select only a delta from your million-row source table? An update datetime perhaps? If so, you can store this off and pass it into your job as a parameter in your select statement (see the sketch after this list).

3. Once your unique constraint is in place, you should be able to specify an upsert: insert then update (if there are more new rows than existing rows) or update then insert (if there are more rows to update than to insert).
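
As a rough illustration of point 2: assuming the source table has a last-update column and the job receives the previous run's high-water mark as a parameter (SOURCE_TABLE, LAST_UPDATE_DT and #LastRunTimestamp# are illustrative names, not anything from this thread), the delta select could look like:

Code: Select all

-- Sketch only: pull just the rows changed since the previous run.
-- The DataStage job parameter #LastRunTimestamp# is substituted into the SQL at run time.
SELECT *
FROM   SOURCE_TABLE
WHERE  LAST_UPDATE_DT >  '#LastRunTimestamp#'
  AND  LAST_UPDATE_DT <= GETDATE()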
samba
Premium Member
Posts: 62
Joined: Wed Dec 07, 2005 11:44 am

Post by samba »

I need to update the existing records also
samba
samba
Premium Member
Posts: 62
Joined: Wed Dec 07, 2005 11:44 am

Post by samba »

kwwilliams

But it's taking a long time to update or insert the entire data set, nearly 25 hours.
samba
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

Then get a unique constraint on the table (which will help with job performance as well) and perform an update-then-insert; that will do the trick for you.
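
For example, a minimal sketch of adding such a constraint in SQL Server (your_table, uq_your_table_key and key_column are placeholder names):

Code: Select all

-- Illustrative only: a unique constraint on the key used in the update's WHERE clause.
-- This gives the optimizer an index to seek on instead of scanning the table for every row.
ALTER TABLE your_table
ADD CONSTRAINT uq_your_table_key UNIQUE (key_column)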
Mat01
Premium Member
Posts: 50
Joined: Wed Jun 02, 2004 11:12 am
Location: Montreal, Canada

Post by Mat01 »

If your update and insert volumes are both very large, try splitting your updates and your inserts into two separate datasets (the Change Capture stage will tell you which is which, or any timestamp you carry in the source data will). This way, you can update and insert separately, which reduces database operations.

HTH,

Mat
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Tell me this, samba:
when a record gets updated in your source table, is there a column that stores a timestamp telling you when that record was last updated?
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Ultramundane
Participant
Posts: 407
Joined: Mon Jun 27, 2005 8:54 am
Location: Walker, Michigan
Contact:

Post by Ultramundane »

You could add a column with the datatype "timestamp"/ROWVERSION (I believe this will work in both MSSQL and Sybase). This system column type stores the database timestamp automatically for you whenever an insert or an update is done to the table. If you keep track of this varbinary you can use it to determine which records changed, and you can pull just those records to use in an upsert. It would basically be the same thing as having a datetime column that all of your applications (or a trigger) always updated as well; however, if someone did not update that column, you would miss that update. That is why I like the timestamp datatype.

This approach would miss records that need to be deleted, but you might not care about that for your purpose.

One method to implement this approach:

Add a column with the datatype timestamp to each table that you need to copy.
--> Eg) [timestamp] timestamp

Create a table to keep track of the database timestamp between runs ("table" and "timestamp" are reserved words in SQL Server, hence the brackets):

CREATE TABLE ts_track (
    [table]          varchar(128) NOT NULL,
    [timestamp]      timestamp    NOT NULL,   -- current rowversion, maintained by the server
    old_timestamp    varbinary(8) NULL,       -- high-water mark from the previous run
    row_update_dtime datetime     NOT NULL)

Insert a record into the ts_track table (the [timestamp] column is generated automatically and cannot be inserted explicitly):

INSERT INTO ts_track ([table], old_timestamp, row_update_dtime)
VALUES ('your_table', 0x0, GETDATE())

Do a full sync.

Update ts_track before the next read so that we do not lose record modifications. This could pick up a few extra rows, but that should be no big deal.

Job1:

UPDATE ts_track
SET    old_timestamp = [timestamp],
       row_update_dtime = GETDATE()
WHERE  [table] = 'your_table'

Job2:
Next time you read, join your table with the tracking table.
Eg)

SELECT T1.*
FROM   your_table T1,
       ts_track   T2
WHERE  T1.[timestamp] >= T2.old_timestamp
AND    T2.[table] = 'your_table'
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

samba wrote:kwwilliams

But it's taking a long time to update or insert the entire data set, nearly 25 hours.
What indexes do you have on the table? That is why I said this would improve your performance. My guess is that a unique constraint on your key field (the same key that would be in your where clause) would significantly improve your performance. One million rows is not an unreasonable amount of data; I have a job that does this with 7 million rows in just over an hour.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Samba, if you have a column in your source table that maintains the timestamp of when each record was last inserted or updated, then the easiest thing for you to do is to select only the changed records from your source table.
In the source DRS stage, choose user-defined SQL and select only the changed records.
Say you have a table XYZ with four columns A, B, C, DT, where DT is the column that maintains the timestamp of when the record was last modified (inserted or updated).

In the user-defined SQL you could do something like this:

Code: Select all

SELECT A, B, C, DT
FROM   XYZ
WHERE  DT BETWEEN (SELECT MAX(DT)
                   FROM   TARGET_XYZ)  -- the latest timestamp already loaded into the target copy
              AND GETDATE()
This SQL will get only the records changed since the last load; the subquery must look at the copy of the table you have already loaded (called TARGET_XYZ here for illustration), not at the source itself, otherwise it would return only the very latest record.
In your target DRS stage, choose the option of update then insert or insert then update, depending on the mix of updates and inserts you get.
This will also solve your initial problem of records being loaded twice.
I hope it helps.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

I agree with DSguru2B.

That's why I stated this earlier:

2. Is there a way that you can only select a delta from your million row source table? An update datetime perhaps? If so you can store this off and pass it into your job as a parameter in your select statement.
jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

Post by jasper »

A completely different scenario: if the number of inserts/updates is large and it's difficult to find the changed records, just truncate and reload.
I'm not used to SQL Server, but on Oracle this is the fastest solution once you reach about 10% inserts/updates.
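
If you go that route, the truncate itself is a single statement in SQL Server (the table name is a placeholder), run before the job reloads the full extract with plain inserts:

Code: Select all

-- Sketch only: empty the target quickly, then let the job bulk-insert everything again.
TRUNCATE TABLE your_table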
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

jasper wrote:A completely different scenario: if the number of inserts/updates is large and it's difficult to find the changed records, just truncate and reload.
I'm not used to SQL Server, but on Oracle this is the fastest solution once you reach about 10% inserts/updates.
What if you use the Change Capture stage?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'