Problem extracting records within time frame

Archive of postings to DataStageUsers@Oliver.com. This forum is intended only as a reference and cannot be posted to.

Moderators: chulett, rschirm

Locked
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Get Oracle to select just the required records.

Use a job parameter in the WHERE clause of user-defined SQL:

SELECT col1, col2, ... FROM table WHERE INSERT_DATETIME > #LastRetrieveDate#;

Create a job control routine to set the value of the controlled job's parameter, after reading it from wherever it is stored.
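The mechanics of the approach Ray describes can be sketched outside DataStage. Below is a minimal Python sketch of the watermark pattern, with hypothetical file and query names; in the real job the #LastRetrieveDate# parameter would be substituted into the user-defined SQL at job start.

```python
import os

# Hypothetical names for illustration only.
WATERMARK_FILE = "last_retrieve_date.txt"
QUERY_TEMPLATE = (
    "SELECT col1, col2 FROM payments "
    "WHERE INSERT_DATETIME > TO_DATE('{last_retrieve_date}', "
    "'YYYY-MM-DD HH24:MI:SS')"
)

def read_watermark(path):
    """Return the stored last-retrieve date, or a floor value on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    return "1970-01-01 00:00:00"

def build_query(last_retrieve_date):
    """Substitute the watermark into the WHERE clause, as the job
    parameter #LastRetrieveDate# would be resolved at job start."""
    return QUERY_TEMPLATE.format(last_retrieve_date=last_retrieve_date)

def advance_watermark(path, new_value):
    """Persist this run's timestamp for the next 15-minute cycle."""
    with open(path, "w") as f:
        f.write(new_value)
```

With this shape, Oracle only ever returns rows newer than the watermark, so no scan-and-filter is needed downstream.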

-----Original Message-----
From: Melvin Thong [mailto:melvin@adtel.esb.com.my]
Sent: Wednesday, 09 May 2001 13:39
To: informix-datastage@oliver.com
Cc: ray_wurlod@informix.com
Subject: Problem extracting records within time frame


admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Problem extracting records within time frame

Post by admin »

Hello,

I am sorry for posting this message with the wrong subject previously. I am re-sending the email; my problem description is below.

I am encountering a problem with a DataStage job that I have developed. The job is scheduled to run every 15 minutes to extract data from an Oracle table and insert it into a Unidata table.

The job extracts the new Oracle records based on the timestamp in a field called INSERT_DATETIME in the Oracle table. The comparison in the transformer's Constraints is something like this: INSERT_DATETIME >= LAST_RETRIEVE_DATE

LAST_RETRIEVE_DATE is the last retrieval datetime, written to a file. When the job starts executing, it reads the last retrieval date from that file.

The problem I am facing is that the Oracle table keeps growing and currently holds more than 5 million records. I have noticed that the DataStage job reads through the whole table, from record 1 to record 5 million, and compares each retrieved record's INSERT_DATETIME with LAST_RETRIEVE_DATE.

This has hurt the performance of the DataStage job, which is supposed to complete within 15 minutes; instead it exceeds the 15-minute window.

My question is: does DataStage have a way to recognise the newly inserted records in the Oracle table without scanning the Oracle table from record 1 each time it does the comparison?

Our customer has been complaining about this performance issue, since the job cannot complete within the 15-minute window.

I hope you can give some suggestions on how to rectify this problem. I would appreciate your help. Thank you.

Regards,
Melvin
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

As you have figured out, your input query is always reading the entire table.

I suggest that you put the last_retrieve_date in an Oracle table rather than in a file.

You can then change your input to a user-defined query that joins the data table with the single row from your other table holding the last_retrieve_date. Only the new data is then presented to DataStage, and you won't need a constraint.

Alternatively, have a calling job which obtains the last_retrieve_date from the file and passes it as a parameter to your job. In your ORAOCI8 input stage, add a constraint which says

INSERT_DATETIME >= TO_DATE('#last_retrieve_date#', 'YYYY-MM-DD HH24:MI:SS')

This assumes the last_retrieve_date is in the standard timestamp format that DataStage uses with Oracle.
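To make the control-table join concrete, here is a sketch using SQLite as a stand-in for Oracle; the table and column names are illustrative, not from the original job.

```python
import sqlite3

# SQLite stands in for Oracle; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER, insert_datetime TEXT);
    CREATE TABLE extract_control (last_retrieve_date TEXT);

    INSERT INTO payments VALUES (1, '2001-05-09 13:00:00');
    INSERT INTO payments VALUES (2, '2001-05-09 13:20:00');
    INSERT INTO payments VALUES (3, '2001-05-09 13:40:00');

    -- one-row control table holding the watermark
    INSERT INTO extract_control VALUES ('2001-05-09 13:15:00');
""")

# Because the watermark lives in a table, the join means only rows
# newer than it ever leave the database, so no downstream constraint
# is needed.
new_rows = conn.execute("""
    SELECT p.id, p.insert_datetime
    FROM payments p, extract_control c
    WHERE p.insert_datetime >= c.last_retrieve_date
    ORDER BY p.id
""").fetchall()
```

Here only rows 2 and 3 (the ones newer than the watermark) come back; row 1 never reaches the ETL tool.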


David Barham
Information Technology Consultant
InformAtect Pty Ltd
Brisbane, Australia

e-mail: david@barham.hm

-----Original Message-----
From: Melvin Thong [mailto:melvin@adtel.esb.com.my]
Sent: Wednesday, 9 May 2001 1:39 PM
To: informix-datastage@oliver.com
Cc: ray_wurlod@informix.com
Subject: Problem extracting records within time frame

admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Ray,

To quote you:

Create a job control routine to set the value of the controlled job's parameter, after reading it from wherever it is stored.

Are you simply being economical with your words, or are you implying that there is some way a job (in job control) can change the value of its own parameters?

David Barham
Information Technology Consultant
InformAtect Pty Ltd
Brisbane, Australia

e-mail: david@barham.hm

-----Original Message-----
From: Ray Wurlod [mailto:ray.wurlod@Informix.Com]
Sent: Wednesday, 9 May 2001 1:33 PM
To: informix-datastage@oliver.com
Subject: RE: Problem extracting records within time frame


Create a job control routine to set the value of the controlled job's parameter, after reading it from wherever it is stored.
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

No.

While a job can change its own parameters, it's too late by then: all resolution of parameter values occurs at job start time, so changing them afterwards has no effect.

I really did mean two jobs - a controlling job and a controlled job. Sorry if this was not clear - I am trying to respond to these questions while teaching (during the breaks in) a "Programming with DataStage BASIC" class.


The class, by the way, has just been updated for release 4.0+ of DataStage, with new exercises and new material (e.g. on stage variables).
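The start-time resolution can be illustrated with a hypothetical sketch, plain Python functions standing in for a controlling and a controlled job (all names here are illustrative): the parameter's value is captured when the controlled job starts, so only the controlling job, which supplies the value beforehand, can effectively change it.

```python
def controlled_job(last_retrieve_date):
    # Parameter resolution happens here, at "job start": the query is
    # built immediately from the supplied value.
    query = ("SELECT col1 FROM payments "
             f"WHERE INSERT_DATETIME > '{last_retrieve_date}'")
    # Reassigning the parameter now has no effect on the built query.
    last_retrieve_date = "changed too late"
    return query

def controlling_job(stored_value):
    # The controlling job reads the stored watermark and passes it in
    # *before* the controlled job starts - the only effective point.
    return controlled_job(stored_value)
```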

Regards,
Ray

-----Original Message-----
From: David Barham [mailto:david@barham.hm]
Sent: Wednesday, 09 May 2001 13:52
To: informix-datastage@oliver.com
Subject: RE: Problem extracting records within time frame


admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

How long does the SELECT take in Oracle (without DataStage)? There's nothing you can do in DataStage to reduce this time. Please don't blame DataStage if the delay is external to DataStage.

You could certainly try putting an index on the insert_date field in the Oracle table. This should speed up the restriction or join.

Don't do the comparison in DataStage. Put #LastRetrieveDate# into the WHERE clause of the SQL that selects the rows from Oracle. Or, as David suggested, have Oracle perform a join with the table containing the LastRetrieveDate value.
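The effect of the index suggestion can be checked outside DataStage. Here is a sketch in SQLite (standing in for Oracle; names are illustrative) showing that with an index on the timestamp column, the range predicate is satisfied by an index search rather than a scan of every row.

```python
import sqlite3

# SQLite stands in for Oracle; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, insert_datetime TEXT)")
conn.execute("CREATE INDEX idx_insert ON payments (insert_datetime)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM payments "
    "WHERE insert_datetime > '2001-05-09 13:15:00'"
).fetchall()

# The last column of each plan row is a human-readable detail string,
# e.g. "SEARCH payments USING INDEX idx_insert (insert_datetime>?)".
plan_text = " ".join(row[-1] for row in plan)
```

Without the CREATE INDEX line, the same query plan reports a full table scan, which is what the 5-million-row job was paying for on every run.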
admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Hi David,

Thanks for your prompt reply.

My job design is almost the same as your suggestion. I have written some UV Basic code in the calling job to read the last retrieve date from a file and supply the value to the job parameter, which is also called #LastRetrieveDate# in the called job. After the job finishes running, it captures the last run timestamp and saves it to a file.

My main concern is the performance of the DataStage job, not the logic of the design. I am sorry if my previous email misled you.

My question is: how am I going to retrieve records within 15 minutes from an Oracle table which has more than 5 million records? My DataStage job scans through the Oracle table and compares each record's insert_date field with the parameter #LastRetrieveDate#.

As mentioned earlier, I think an Oracle table with a total of 5 million rows is huge, and my job has trouble finishing the extraction in 15 minutes. For your information, this Oracle table keeps growing; it is a payment transaction table keeping track of payment records.

My DataStage job is scheduled to run at 15-minute intervals. On each run I expect it to extract the latest records inserted, based on the insert_date field in the Oracle table.

As I can see in the DataStage Director, the number of records read from the Oracle table keeps increasing, and after the transformer stage the number of records is smaller because of the date-comparison filtering. This huge number of Oracle rows is what hurts the performance of my DataStage job.

Is there any way DataStage can extract the records without scanning the Oracle table from record 1 to the last record? Can you imagine how long it takes to scan from record 1 to the 5 millionth record just to extract perhaps 100 rows inserted into the table within the last 15 minutes?

I am interested to know whether you have experience dealing with Oracle tables this large, with millions of records.

I hope to hear some suggestions or workarounds that I can apply with a DataStage job. I would appreciate your assistance.

Regards,
Melvin

----- Original Message -----
From: "David Barham"
Date: Wednesday, May 9, 2001 11:47 am
Subject: RE: Problem extracting records within time frame

admin
Posts: 8720
Joined: Sun Jan 12, 2003 11:26 pm

Post by admin »

Hi Ray,

I got your message.

I appreciate your advice and I will try out your method.

Thank you.

Regards,
Melvin


--
==================================================
Thong Yip Chin (Melvin)
Systems Engineer
Technical Department

Adtel Systems Sdn. Bhd.
34 & 36 Jalan SS 22/21,
Damansara Jaya,
47400 Petaling Jaya,
Selangor Darul Ehsan,
Malaysia.

Tel: 03-77277277
Fax: 03-77273198
Mobile: 017-8819665
Email: melvin@adtel.esb.com.my
Website: http://www.esb.com.my/
==================================================
Locked