Datastage logic vs SQL

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Datastage logic vs SQL

Post by avi21st »

Hi


I am working on a project with a strong fan following for SQL and PL/SQL. Still, as a DataStage developer I want to implement most of my logic in DataStage. But there is often a functional overlap between SQL and DataStage stages.

My question is: when should I use DataStage logic, and when should I use SQL?

Which is faster: SQL joins or DataStage joins? As I understand it, when combining data from very large tables, or when the source includes a large number of database tables, the DataStage EE in-link sort and Join stages are faster than an equivalent SQL query. But for Server jobs I think SQL would be faster than using a Hashed File.

Again, stored procedures like Oracle PL/SQL work on a per-row basis within a high-volume data flow, so I think DataStage EE is far better there. But here we are using Standard Edition (Server): is it better than PL/SQL? Should we use PL/SQL or DataStage?

Another reason is that the Oracle (OCI) stage cannot track Oracle errors properly, so in my project they want to use a Unix script to call a SQL or PL/SQL script; that way they can track the errors. Is that the correct approach?

Some of us are willing to mix these two well; please suggest the grey areas where SQL or PL/SQL works better than DataStage :)
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: Datastage logic vs SQL

Post by gateleys »

If performance were the only issue, you could do everything within Oracle itself (since you have not pointed out any other data sources). However, one of the major objectives of an ETL tool (such as DataStage) is to make the process modular, and hence maintainable. In such a scenario, one would want most of the logic embedded within DataStage. This means minimizing the use of external processes, such as PL/SQL, DLLs, etc., as much as possible.
Yes, SQL in ORAOCI will definitely be faster for joins in the Server edition, and it is recommended to filter out as many records as possible at the source itself. Regarding errors generated by ORAOCI, yeah, I agree, and I have had to live with that (design in a way that errors are almost eliminated :wink: ).
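To illustrate the "filter at the source" advice above, here is a toy sketch using Python's built-in sqlite3 as a stand-in for the Oracle source; the table and column names are hypothetical, not from any real job:

```python
import sqlite3

# In-memory SQLite stands in for the Oracle source database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders (order_id INTEGER, status TEXT)")
con.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "OPEN"), (2, "CLOSED"), (3, "OPEN"), (4, "CANCELLED")])

# Filtered at the source: only the rows of interest cross the link.
filtered = con.execute(
    "SELECT order_id FROM src_orders WHERE status = 'OPEN'").fetchall()

# Unfiltered: every row is fetched and then discarded downstream,
# which is the pattern the advice above warns against.
all_rows = con.execute("SELECT order_id, status FROM src_orders").fetchall()
```

The less data crosses the stage link, the less work the job has to do; the database can usually discard out-of-scope rows far more cheaply than the ETL layer can.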

gateleys
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your question requires at least 4 hours to verbally answer and show the various pros and cons to the most common architectural solutions. Are you doing batch ETL, micro-batch, realtime, one-time data migration, billions or millions or thousands of rows of data? Do you have unlimited budget and developers and time or no budget and just yourself and need it in two weeks? Is your hardware a 64cpu unix monster or a single cpu Windoze machine? Are your source and target on the same database instance or are they different database technologies on different servers?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

Long story, but the architecture of my current DW is such that the entire Staging layer is in Oracle Externally Organised Tables, and the Enterprise (3NF) layer is in regular Oracle tables. Instead of doing lookups with Hashed Files, this enabled us to use SQL joins on Oracle.

Of course, it would then be possible to take it a step further and perform the whole transformation in Oracle, but we made a design rule that joins would be done in SQL and business rules applied in DS. The end result was SQL with simple join conditions, no WHERE filters (unless there were thousands of out-of-scope rows in the Staging layer), no functions/arithmetic/concatenation in the SELECT clause, no sub-queries, no other funky SQL stuff.

The presentation layer (star schemas) is all Materialized Views over the Enterprise Layer, so no DS was necessary.

An equally valid approach (if you're a UV guru) is to put the Staging and Enterprise layers in UniVerse, and Presentation in Oracle. Then you can use Hashed Files for lookups, or the UV stage for joins.
I don't think I'd be courageous enough to do this though, as I don't have the experience to fix UV problems when things go pear-shaped.
Ross Leishman
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I really like those policies; they show a good knowledge of the strengths of an ETL tool and of metadata lineage and reporting.

We had a very good example today of the ETL vs SQL debate. We have a fact summary table built from a fact detail table. It could have been written as a fairly simple INSERT INTO ... SELECT statement, as both tables are in the same schema. Instead we built a parallel job with an aggregation stage, because we wanted to track it through our ETL scheduling, logging and metrics.

Just this week someone decided we needed a new dimension based on date calculations: each fact summary row needed to join to a dimension table based on a date field being due this month, due in the future, 1-6 months old, or greater than 6 months old. If we had taken the insert-SQL path, we would have had some difficult SELECT functions to write, with poor error handling around them. Since we had taken the ETL path, it was an easy matter to add a transformer before the aggregation stage to derive the fields, with a reject link for failed calculations.
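The derivation that transformer performs could look something like this sketch; the bucket names and the month arithmetic here are my own assumptions, not the actual job's logic:

```python
from datetime import date

def age_bucket(due: date, today: date) -> str:
    """Classify a fact row's due date into one of the four
    hypothetical dimension buckets described above."""
    # Whole months between the due date and the current month.
    months_apart = (today.year - due.year) * 12 + (today.month - due.month)
    if months_apart < 0:
        return "DUE_FUTURE"        # due after the current month
    if months_apart == 0:
        return "DUE_THIS_MONTH"
    if months_apart <= 6:
        return "AGE_1_TO_6_MONTHS"
    return "AGE_OVER_6_MONTHS"
```

In a transformer this would be a stage-variable or derivation expression, with rows that fail the date parse sent down the reject link instead of raising a SQL error mid-statement.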

So I would argue that the ETL approach is more flexible and maintainable and has better reporting and metadata options.

We are using Enterprise Edition so we have very good trapping of failed SQL inserts and updates and some derived data quality metrics from this.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

Sorry for being late in replying to your valuable posts.

I would like to provide the following information on what we plan:

We are working on two things, Conversion (Initial Load) and Update, on a Unix platform with 8 CPUs.

Our conversion would be really big (4 terabytes !!!), but I think it doesn't really matter, as we don't care if our initial load takes a day :lol:

Now the problem is with the updates: we still have at most 20 to 30 million records across the different updates, divided into weekly and quarterly runs.


Only for some monster dimensions like Account and Email Option do we not use Hashed Files for lookups. We load the incoming file into a staging table and then do the lookup using SQL joins. So the staging tables serve a QA purpose as well as helping us get rid of Hashed File lookups.
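A minimal sketch of that staging-table lookup pattern, using sqlite3 in place of Oracle; all table and column names here are hypothetical:

```python
import sqlite3

# In-memory SQLite stands in for the Oracle staging schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_incoming (acct_no TEXT, amount REAL);
    CREATE TABLE dim_account (acct_no TEXT PRIMARY KEY, acct_key INTEGER);
""")
# Step 1: bulk-load the incoming file into the staging table
# (in a real job this would be SQL*Loader or an OCI bulk load).
con.executemany("INSERT INTO stg_incoming VALUES (?, ?)",
                [("A1", 100.0), ("A2", 250.0), ("A9", 75.0)])
con.executemany("INSERT INTO dim_account VALUES (?, ?)",
                [("A1", 1), ("A2", 2)])
# Step 2: resolve the lookup with a SQL join instead of a hashed file;
# a LEFT JOIN keeps unmatched rows so they can be rejected or audited.
rows = con.execute("""
    SELECT s.acct_no, s.amount, d.acct_key
    FROM stg_incoming s
    LEFT JOIN dim_account d ON d.acct_no = s.acct_no
""").fetchall()
```

The staged copy of the file also gives QA something persistent to query, which a hashed-file lookup would not.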

Again, for inserts and updates we use an Oracle MERGE and call it through a DataStage Execute Command activity in the sequences.
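For readers unfamiliar with MERGE, this sketch shows the insert-or-update ("upsert") semantics it provides. SQLite's ON CONFLICT clause is used here as a stand-in (Oracle's actual MERGE ... WHEN MATCHED / WHEN NOT MATCHED syntax differs), and the table names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tgt_account (acct_no TEXT PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO tgt_account VALUES ('A1', 100.0)")

# One pass over the staged rows: existing keys are updated, new keys
# are inserted -- the same effect as MERGE ... WHEN MATCHED THEN
# UPDATE / WHEN NOT MATCHED THEN INSERT.
staged = [("A1", 150.0), ("A2", 250.0)]
con.executemany("""
    INSERT INTO tgt_account (acct_no, amount) VALUES (?, ?)
    ON CONFLICT(acct_no) DO UPDATE SET amount = excluded.amount
""", staged)
rows = dict(con.execute("SELECT acct_no, amount FROM tgt_account"))
```

The attraction is a single set-based statement instead of a per-row insert-else-update decision inside the job.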

So mainly they are planning a SQL or PL/SQL project :oops:

Another thing I dislike is that we are using DataStage as a scheduling tool here. We write the SQL code, wrap it in a Unix shell script, and then schedule that shell script using DataStage ( :shock: )

This is my first experience with DataStage scheduling :wink: I previously used Autosys or Control-M. I don't really agree with many aspects of how we are handling things here.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage is not a scheduler.

It happens to contain a GUI so that you can use cron or at, but it is not a scheduler.

Sounds like a huge waste of money if that's all they're using DataStage to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Many companies just buy it, Ray, you know that ;)
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

That's true, Ray; can't help it. Just as DSguru2B said... they just bought it.

They don't really have anyone here who knows DataStage well, either. They hired me, but since joining this project I am pretty bored writing silly jobs (only simple load jobs).

Anyway, I can't do anything more than make suggestions...

So long live the DataStage Scheduler :lol:
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can do more. You can challenge.

One of the biggest DataStage sales in the Asia region was made because the pre-sales engineer (not I, btw) had the guts to stand up and tell them their model was wrong (and why, and how it should be) rather than just kowtowing to this very important company.

Find out why these decisions have been made. Get them out of their comfort zones at least long enough to examine their own - possibly invalid - assumptions. Be prepared to answer their questions factually.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

Post by xli »

ray.wurlod wrote:Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository. ...
Then what would be used to replace UniVerse as the DataStage repository?

thanks, xli
Hadi
Participant
Posts: 17
Joined: Wed Mar 22, 2006 4:06 am

Post by Hadi »

ray.wurlod wrote:DataStage is not a scheduler. It happens to contain a GUI so that you can use cron or at but it is not a scheduler.
Hi Ray,

When do we use the scheduling tool in DataStage Director, then? What is it used for?
thanks & regards,
hadi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

xli wrote:
ray.wurlod wrote:Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository. ...
then, what would be used to replace Universe as DataStage repository ?

thanks, xli
The next ("Hawk") release has an "open" repository that is shared by all the Ascential products, and which you no longer access directly. "Open" in the first instance means your choice of DB2, Oracle or SQL Server. Since Hawk is completely service-oriented, you will access the repository via a couple of services, one of which is the Metadata Delivery service.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hadi wrote:When do we use Scheduling tool in DataStage Director then? What is it used for?
The Schedule view in Director is nothing more than a GUI front end to the operating system's scheduling service: the cron or at command.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.