
Best way for incremental loading using sequential file

Posted: Thu Jul 05, 2018 7:08 pm
by sivakr4056@gmail.com
Hi, I have an Oracle database as the source and a Sq file as the output. I would like to know the best way to do an incremental load into the Sq file based on the updated timestamp.

Below is what I am thinking. Please tell me if there is any better approach.

Parallel job stages: Oracle Connector -> Transformer -> 2 Sq files (1st Sq file for saving the last execution timestamp, 2nd Sq file for the actual data load)

In the Oracle Connector, I use "SELECT * FROM table_name WHERE date >= last_execution_timestamp" (the last execution timestamp is what I save to the 1st Sq file from the Transformer stage).
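
Roughly, the extract query in the connector looks like this (the table name, column name, and the #LAST_EXEC_TS# job parameter are placeholders, not my real names):

-- Hypothetical sketch: #LAST_EXEC_TS# is a DataStage job parameter
-- holding the timestamp saved to the 1st Sq file by the previous run.
SELECT *
  FROM source_table
 WHERE updated_ts >= TO_TIMESTAMP('#LAST_EXEC_TS#', 'YYYY-MM-DD HH24:MI:SS')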

Can anyone suggest a simple and better approach for incremental loading?

Posted: Thu Jul 05, 2018 8:41 pm
by chulett
Welcome!

Can you detail for us exactly what you mean by "Sq file"? I'd rather not guess what it is and how you plan on using it for an incremental load. Thanks.

Posted: Fri Jul 06, 2018 5:17 am
by sivakr4056@gmail.com
Sq file -> sequential file.
The 1st Sq file contains the last updated date and the 2nd Sq file contains the incremental data.

Posted: Fri Jul 06, 2018 6:45 am
by chulett
Okay... I was hoping for a wee bit more detail than that. So we're not even touching on the 'loading' aspect of this, just the extraction side for now. Fine.

Here's a high-level look at how we handle things of that nature. We generally stream data in from the source where the data is > our last high water mark. Note that I'm not talking about execution time but an actual timestamp value from the source. In some instances we use ">=", depending on the data and the frequency of change, but then we also add duplicate handling, since the overlap between two loads means some rows get pulled twice.
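
As a rough sketch, the extract under that approach looks something like this (the table, column, and #HWM# parameter names are placeholders):

-- Strict ">" avoids re-reading the boundary rows; switch to ">=" only
-- if you also handle the duplicates from the overlap downstream.
SELECT src.*
  FROM source_table src
 WHERE src.updated_ts > TO_TIMESTAMP('#HWM#', 'YYYY-MM-DD HH24:MI:SS.FF')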

Then it's straight on to the target, no landed file required, with a lookup driving the action there: insert, update, or discard. We also split the source field that drives the extraction off to an Aggregator set to "max" and capture that output as our next starting high water mark. For us, that HWM is stored in a database table (one record per source and table) and is selected out as the parameter for the next run.
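
A minimal sketch of that kind of control table and its upkeep, with hypothetical names throughout:

-- One HWM record per source/table pair.
CREATE TABLE etl_high_water_mark (
    source_name VARCHAR2(30),
    table_name  VARCHAR2(30),
    hwm_ts      TIMESTAMP,
    CONSTRAINT pk_hwm PRIMARY KEY (source_name, table_name)
);

-- Selected before the run to supply the extract parameter...
SELECT hwm_ts
  FROM etl_high_water_mark
 WHERE source_name = 'ORA_PROD'
   AND table_name  = 'ORDERS';

-- ...and updated after the run with the Aggregator's MAX(updated_ts).
UPDATE etl_high_water_mark
   SET hwm_ts = :new_hwm
 WHERE source_name = 'ORA_PROD'
   AND table_name  = 'ORDERS';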

Now, maybe you are landing the data because they are always inserts and they are all handled by a bulk loader... but you didn't share that detail with us.