Datastage logic vs SQL

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Datastage logic vs SQL

Post by avi21st »

Hi


I am working on a project with a strong fan following for SQL and PL/SQL. Still, as a DataStage developer I want to implement most of my logic in DataStage. But there is often a functional overlap between SQL and DataStage stages.

My question is: when should I use DataStage logic, and when should I use SQL?

Which is faster: SQL joins or DataStage joins? As I understand it, when combining data from very large tables, or when the source includes a large number of database tables, the DataStage EE in-link sort and Join stages are faster than an equivalent SQL query. But for Server jobs I think SQL would be faster than using a Hashed File.

Again, stored procedures like Oracle PL/SQL work on a per-row basis within a high-volume data flow, so I think DataStage EE is far better there. But here we are using Standard Edition (Server): is it better than PL/SQL? Should we use PL/SQL or DataStage?

Another reason is that the Oracle (OCI) stage cannot track Oracle errors properly, so in my project they want to use a Unix script to call a SQL or PL/SQL script; that way they can track the errors. Is that the correct approach?

Some of us are willing to mix these two well; please suggest the grey areas where SQL or PL/SQL works better than DataStage :)
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Re: Datastage logic vs SQL

Post by gateleys »

If performance were the only issue, you could do everything within Oracle itself (since you have not pointed out any other data sources). However, one of the major objectives of an ETL tool (such as DataStage) is to make the process modular, and hence maintainable. In such a scenario, one would want most of the logic embedded within DataStage. This means minimizing the use of external processes, such as PL/SQL, DLLs, etc., as much as possible.
Yes, SQL in ORAOCI will definitely be faster for joins in the Server edition, and it is recommended to filter out as many records as possible at the source itself. Regarding errors generated by ORAOCI, yeah, I agree, and I have had to live with that (design in a way that errors are almost eliminated :wink: ).
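To illustrate the "filter at the source" advice above, here is a toy sketch using Python's built-in sqlite3 as a stand-in for the Oracle source; the table and column names are hypothetical, not from any real job:

```python
import sqlite3

# In-memory SQLite stands in for the Oracle source database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_orders (order_id INTEGER, status TEXT)")
con.executemany("INSERT INTO src_orders VALUES (?, ?)",
                [(1, "OPEN"), (2, "CLOSED"), (3, "OPEN"), (4, "CANCELLED")])

# Filtered at the source: only the rows of interest cross the link.
filtered = con.execute(
    "SELECT order_id FROM src_orders WHERE status = 'OPEN'").fetchall()

# Unfiltered: every row is fetched and then discarded downstream,
# which is the pattern the advice above warns against.
all_rows = con.execute("SELECT order_id, status FROM src_orders").fetchall()
```

The less data crosses the stage link, the less work the job has to do; the database can usually discard out-of-scope rows far more cheaply than the ETL layer can.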

gateleys
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Your question requires at least 4 hours to verbally answer and show the various pros and cons to the most common architectural solutions. Are you doing batch ETL, micro-batch, realtime, one-time data migration, billions or millions or thousands of rows of data? Do you have unlimited budget and developers and time or no budget and just yourself and need it in two weeks? Is your hardware a 64cpu unix monster or a single cpu Windoze machine? Are your source and target on the same database instance or are they different database technologies on different servers?
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia
Contact:

Post by rleishman »

Long story, but the architecture of my current DW is such that the entire Staging layer is in Oracle Externally Organised Tables, and the Enterprise (3NF) layer is in regular Oracle tables. Instead of doing lookups with Hashed Files, this enabled us to use SQL joins on Oracle.

Of course, it would then be possible to take it a step further and perform the whole transformation in Oracle, but we made a design rule that joins would be done in SQL and business rules applied in DS. The end result was SQL with simple join conditions, no WHERE filters (unless there were thousands of out-of-scope rows in the Staging layer), no functions/arithmetic/concatenation in the SELECT clause, no sub-queries, no other funky SQL stuff.

The presentation layer (star schemas) is all Materialized Views over the Enterprise Layer, so no DS was necessary.

An equally valid approach (if you're a UV guru) is to put the Staging and Enterprise layers in UniVerse, and Presentation in Oracle. Then you can use Hashed Files for lookups, or the UV stage for joins.
I don't think I'd be courageous enough to do this though, as I don't have the experience to fix UV problems when things go pear-shaped.
Ross Leishman
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

I really like those policies; they show a good knowledge of the strengths of an ETL tool and of metadata lineage and reporting.

We had a very good example today of the ETL vs SQL debate. We have a fact summary table built from a fact detail table. It could have been written as a fairly simple INSERT INTO ... SELECT statement, as both tables are in the same schema. Instead we built a parallel job with an aggregation stage, because we wanted to track it through our ETL scheduling, logging and metrics.

Just this week someone decided we needed a new dimension based on date calculations: each fact summary row needed to join to a dimension table based on a date field being due this month, due in the future, 1-6 months old, or greater than 6 months old. If we had taken the insert-SQL path, we would have had some difficult SELECT functions to write, with poor error handling around them. Since we had taken the ETL path, it was an easy matter to add a transformer before the aggregation stage to derive the fields, with a reject link for failed calculations.
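The derivation that transformer performs could look something like this sketch; the bucket names and the month arithmetic here are my own assumptions, not the actual job's logic:

```python
from datetime import date

def age_bucket(due: date, today: date) -> str:
    """Classify a fact row's due date into one of the four
    hypothetical dimension buckets described above."""
    # Whole months between the due date and the current month.
    months_apart = (today.year - due.year) * 12 + (today.month - due.month)
    if months_apart < 0:
        return "DUE_FUTURE"        # due after the current month
    if months_apart == 0:
        return "DUE_THIS_MONTH"
    if months_apart <= 6:
        return "AGE_1_TO_6_MONTHS"
    return "AGE_OVER_6_MONTHS"
```

In a transformer this would be a stage-variable or derivation expression, with rows that fail the date parse sent down the reject link instead of raising a SQL error mid-statement.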

So I would argue that the ETL approach is more flexible and maintainable and has better reporting and metadata options.

We are using Enterprise Edition so we have very good trapping of failed SQL inserts and updates and some derived data quality metrics from this.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

Sorry for being late in replying to your valuable posts.

I would like to provide the following information on what we plan:

We are working on two things, Conversion (Initial Load) and Update, on a Unix platform with 8 CPUs.

Our conversion would be really big (4 terabytes !!!), but I think it doesn't really matter, as we don't care if our initial load takes a day :lol:

Now the problem is with the updates: we still have at most 20 to 30 million records across the different updates, divided into weekly and quarterly runs.


Only for some monster dimensions like Account and Email Option do we not use Hashed Files for lookups. We load the incoming file into a staging table and then do the lookup using SQL joins. So the staging tables serve a QA purpose as well as helping us get rid of Hashed File lookups.
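A minimal sketch of that staging-table lookup pattern, using sqlite3 in place of Oracle; all table and column names here are hypothetical:

```python
import sqlite3

# In-memory SQLite stands in for the Oracle staging schema.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_incoming (acct_no TEXT, amount REAL);
    CREATE TABLE dim_account (acct_no TEXT PRIMARY KEY, acct_key INTEGER);
""")
# Step 1: bulk-load the incoming file into the staging table
# (in a real job this would be SQL*Loader or an OCI bulk load).
con.executemany("INSERT INTO stg_incoming VALUES (?, ?)",
                [("A1", 100.0), ("A2", 250.0), ("A9", 75.0)])
con.executemany("INSERT INTO dim_account VALUES (?, ?)",
                [("A1", 1), ("A2", 2)])
# Step 2: resolve the lookup with a SQL join instead of a hashed file;
# a LEFT JOIN keeps unmatched rows so they can be rejected or audited.
rows = con.execute("""
    SELECT s.acct_no, s.amount, d.acct_key
    FROM stg_incoming s
    LEFT JOIN dim_account d ON d.acct_no = s.acct_no
""").fetchall()
```

The staged copy of the file also gives QA something persistent to query, which a hashed-file lookup would not.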

Again, for inserts and updates we use an Oracle MERGE and call it through a DataStage Execute Command activity in the sequences.
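For readers unfamiliar with MERGE, this sketch shows the insert-or-update ("upsert") semantics it provides. SQLite's ON CONFLICT clause is used here as a stand-in (Oracle's actual MERGE ... WHEN MATCHED / WHEN NOT MATCHED syntax differs), and the table names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tgt_account (acct_no TEXT PRIMARY KEY, amount REAL)")
con.execute("INSERT INTO tgt_account VALUES ('A1', 100.0)")

# One pass over the staged rows: existing keys are updated, new keys
# are inserted -- the same effect as MERGE ... WHEN MATCHED THEN
# UPDATE / WHEN NOT MATCHED THEN INSERT.
staged = [("A1", 150.0), ("A2", 250.0)]
con.executemany("""
    INSERT INTO tgt_account (acct_no, amount) VALUES (?, ?)
    ON CONFLICT(acct_no) DO UPDATE SET amount = excluded.amount
""", staged)
rows = dict(con.execute("SELECT acct_no, amount FROM tgt_account"))
```

The attraction is a single set-based statement instead of a per-row insert-else-update decision inside the job.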

So mainly they are planning a SQL or PL/SQL project :oops:

Another thing I dislike is that we are using DataStage as a scheduling tool here. We write the SQL code, wrap it in a Unix shell script, and then schedule that shell script using DataStage ( :shock: )

This is my first experience with DataStage scheduling :wink: I previously used Autosys or Control-M. I don't really agree with many aspects of how we are handling things here.
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

DataStage is not a scheduler.

It happens to contain a GUI so that you can use cron or at, but it is not a scheduler.

Sounds like a huge waste of money if that's all they're using DataStage to do.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Many companies just buy it, Ray, you know that ;)
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
avi21st
Charter Member
Posts: 135
Joined: Thu May 26, 2005 10:21 am
Location: USA

Post by avi21st »

That's true, Ray; can't help it. Just as DSguru2B said... they just bought it.

They don't really have anyone here who knows DataStage well, either. They hired me, but since joining this project I am pretty bored writing silly jobs (only simple load jobs).

Anyway, I can't do anything more than make suggestions...

So long live the DataStage Scheduler :lol:
Avishek Mukherjee
Data Integration Architect
Chicago, IL, USA.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You can do more. You can challenge.

One of the biggest DataStage sales in the Asia region was made because the pre-sales engineer (not I, btw) had the guts to stand up and tell them their model was wrong (and why, and how it should be) rather than just kowtowing to this very important company.

Find out why these decisions have been made. Get them out of their comfort zones at least long enough to examine their own - possibly invalid - assumptions. Be prepared to answer their questions factually.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

Post by xli »

ray.wurlod wrote:Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository. ...
Then what would be used to replace UniVerse as the DataStage repository?

thanks, xli
Hadi
Participant
Posts: 17
Joined: Wed Mar 22, 2006 4:06 am

Post by Hadi »

ray.wurlod wrote:DataStage is not a scheduler. It happens to contain a GUI so that you can use cron or at but it is not a scheduler.
Hi Ray,

When do we use the scheduling tool in DataStage Director, then? What is it used for?
thanks & regards,
hadi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

xli wrote:
ray.wurlod wrote:Don't rely on UniVerse unless you have a UniVerse licence. The next release of DataStage does not use a UniVerse-like repository. ...
then, what would be used to replace Universe as DataStage repository ?

thanks, xli
The next ("Hawk") release has an "open" repository that is shared by all the Ascential products, and which you no longer access directly. "Open" in the first instance means your choice of DB2, Oracle or SQL Server. Since Hawk is completely service-oriented, you will access the repository via a couple of services, one of which is the Metadata Delivery service.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Hadi wrote:When do we use Scheduling tool in DataStage Director then? What is it used for?
The Schedule view in Director is nothing more than a GUI front end to the operating system's scheduling service: the cron or at command.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.