most frequently used stages?

mctny · Post by **mctny** » Sat Mar 11, 2006 1:30 pm

Hello guys,

I am very new to DS EE and for now I am trying to learn it just by reading. I was wondering which stages you use most in DS EE, so that I can study them first. Any suggestions for learning DS EE would also be appreciated.

Thanks in advance
Cetin

ray.wurlod · Post by **ray.wurlod** » Sat Mar 11, 2006 1:57 pm

Welcome aboard. :D

You did not state whether you are coming from a server edition background or are entirely new to DataStage EE.

I doubt that you will get much consensus.

DataStage EE has many more stage types than server, because it follows a philosophy of one stage, one task (unlike the Transformer stage in server jobs, which performs a multitude of tasks).

People will choose the stage types that connect to whatever data sources they have - be it DB2, Oracle, SQL Server or others. Enterprise stages are to be preferred in general, because they innately support parallelism (though not in all circumstances).

You choose the appropriate Processing stage for the task at hand. Some advise avoiding Transformer stages (only because they have to generate C++ code that has to be compiled and linked back into the job), but sometimes you just have to use one.

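Differentiate between Lookup, Join and Merge, which combine data horizontally (adding columns from another input by key), and Funnel, which combines data vertically (appending rows from several inputs). The first three have subtle differences from one another. Choose the one that does exactly what you require.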
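
As a rough illustration of how these combining styles differ, here is a minimal sketch in plain Python. It is an analogy only, not DataStage code: the function names `lookup`, `inner_join` and `funnel` and the sample rows are made up for the example, the lookup here simply passes unmatched rows through (the real stage lets you choose what happens on a failed lookup), and Merge is left out.

```python
# Plain-Python analogy only; none of this is DataStage code.
orders = [
    {"cust_id": 1, "amount": 50},
    {"cust_id": 2, "amount": 75},
    {"cust_id": 9, "amount": 20},  # no matching customer
]
customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
]

def lookup(stream, reference, key):
    """Keep every stream row; add reference columns when the key is found."""
    ref_by_key = {row[key]: row for row in reference}  # reference held in memory
    return [{**row, **ref_by_key.get(row[key], {})} for row in stream]

def inner_join(left, right, key):
    """Keep only rows whose key appears in both inputs."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

def funnel(*inputs):
    """Append rows from several inputs that share the same columns."""
    return [row for rows in inputs for row in rows]

looked_up = lookup(orders, customers, "cust_id")     # 3 rows, one without a name
joined = inner_join(orders, customers, "cust_id")    # 2 rows, both with a name
funnelled = funnel(orders[:2], orders[2:])           # rows stacked, no new columns
```

The lookup and the join both widen each row with extra columns, while the funnel only lengthens the data set.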

Similarly, if splitting (uncombining) data, maybe the Switch stage, maybe the Filter stage, maybe the Copy stage is sufficient.

Summarizing data is possibly the only unambiguous choice (Aggregator), though it is possible to summarize using a Transformer stage with stage variables.
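
To show what "summarizing with stage variables" amounts to, here is a small Python sketch under stated assumptions: the input is already sorted on the grouping key, and the local variables `prev_key` and `total` stand in for stage variables. The function names and sample data are hypothetical, and this is not how either stage is implemented.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows, already sorted by the grouping key (an assumption of both functions).
rows = [
    {"region": "east", "sales": 10},
    {"region": "east", "sales": 5},
    {"region": "west", "sales": 7},
]

def aggregate(rows, key, value):
    """Aggregator-style: one output row per group."""
    return [
        {key: k, "total": sum(r[value] for r in group)}
        for k, group in groupby(rows, key=itemgetter(key))
    ]

def summarize_with_running_vars(rows, key, value):
    """Row-by-row style: keep a running total and emit it when the key changes."""
    results = []
    no_key = object()      # sentinel meaning "no row seen yet"
    prev_key = no_key
    total = 0
    for row in rows:
        if prev_key is not no_key and row[key] != prev_key:
            results.append({key: prev_key, "total": total})
            total = 0
        total += row[value]
        prev_key = row[key]
    if prev_key is not no_key:
        results.append({key: prev_key, "total": total})  # flush the last group
    return results

by_aggregate = aggregate(rows, "region", "sales")
by_running = summarize_with_running_vars(rows, "region", "sales")
```

Both approaches give the same totals here; the row-by-row version needs extra care to flush the final group, which is part of why a dedicated summarizing stage is the simpler choice.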

Many stages require sorted data. Most stages provide for sorting on their input links, but there is also a Sort stage (with the ability also to access external sorting programs). The Sort stage gives better control over how much memory is allocated to sorting.

Change detection can be performed by Difference, Compare or Change Capture stages. Each is slightly different, each has a specific outcome.
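
As a hedged sketch of the general idea behind change detection, the following plain Python compares a "before" and an "after" data set on a key and labels each row. The labels (`copy`, `edit`, `insert`, `delete`), the function name `capture_changes` and the sample data are illustrative assumptions; the actual stages have their own options and output formats.

```python
def capture_changes(before, after, key):
    """Label each row by comparing two data sets on a key column."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}
    changes = []
    for k, new in after_by_key.items():
        old = before_by_key.get(k)
        if old is None:
            changes.append(("insert", new))   # key only in the after set
        elif old != new:
            changes.append(("edit", new))     # key in both, values differ
        else:
            changes.append(("copy", new))     # key in both, values identical
    for k, old in before_by_key.items():
        if k not in after_by_key:
            changes.append(("delete", old))   # key only in the before set
    return changes

before = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Rome"},
    {"id": 3, "city": "Lima"},
]
after = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Paris"},
    {"id": 4, "city": "Cairo"},
]
changes = capture_changes(before, after, "id")
```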

And so on.

You can probably bypass the Restructure stages (Make/Split/Promote Subrecord, Make/Split Vector, Column Export/Import, Combine Records) and Custom stages while in the beginner phase.

What you do really need to get your head around are parallelism concepts, such as partitioning of data. A good starting point is Chapter 2 of the Parallel Job Developer's Guide - then you may begin to understand what is being asked of you on the Partitioning tab on most stages' Input links.
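
To make the partitioning idea concrete, here is a minimal Python sketch of two common partitioning schemes, hash and round robin. It is an analogy under assumptions, not the engine's implementation: the function names, the use of `zlib.crc32` as the hash and the sample rows are all made up for illustration.

```python
import zlib

def hash_partition(rows, key, n):
    """Send rows with the same key value to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is used only because it is deterministic across runs.
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % n
        parts[idx].append(row)
    return parts

def round_robin_partition(rows, n):
    """Deal rows out evenly, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

rows = [{"cust_id": i % 3, "amount": i} for i in range(9)]
hashed = hash_partition(rows, "cust_id", 2)
dealt = round_robin_partition(rows, 2)
```

Round robin balances the partitions but scatters rows with the same key, whereas hash partitioning keeps each key together, which matters for key-based work such as joining, aggregating or removing duplicates.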

A transition class from server to parallel mindset is offered through this site (watch the home page for announcements - one was run last week). If you're totally new, IBM offers an Enterprise Edition essentials class.

mctny · Post by **mctny** » Sat Mar 11, 2006 3:41 pm

Thank you very much Ray, it looks like you summarized it very well.

I know a little about DS server edition, but I don't have much experience with it either. I assume it is not necessary to be very good at DS server edition (or to know it completely) before starting on DS EE, is it?

So would you say that the Lookup, Join, Merge, Funnel, Switch, Aggregator, Difference, Compare and Change Capture stages are used more frequently and are more critical to learn in the beginning?

ray.wurlod · Post by **ray.wurlod** » Sat Mar 11, 2006 5:10 pm

No. Only if those are the tasks you need to perform. But they're as good a starting point as any, provided you don't overlook the technicalities of getting the passive stages to work right. Otherwise you won't be able to do the "E" and "L" parts of "ETL"! And you really will need one or both of Modify and Transformer stages from time to time.

kumar_s · Post by **kumar_s** » Sat Mar 11, 2006 11:07 pm

ray.wurlod wrote: "Some advise to avoid Transformer stages (only because they have to generate C++ code that has to be compiled and linked back into the job), but sometimes you just have to use it."

Hi Ray,
Is the myth about the Transformer only about the first run (until the job gets compiled)?
It shouldn't be producing any C++ code at run time.

ray.wurlod · Post by **ray.wurlod** » Sun Mar 12, 2006 7:41 am

That's true as far as it goes, but remember that the libraries produced by compiling the Transformer stage have to be dynamically linked at run time. So there is also the call overhead.

I prefer to use stage types that use operators directly, because the Orchestrate engine is geared to the explicit use of operators and ought - solely on that basis - to be more efficient than external code. Further, however, I'd like to think that the supplied operators, having been around for a while, have been proven to be optimally efficient and robust; can I believe the same about the Transformer-generated code, which is a new piece from the (admittedly competent) Ascential engineers when Orchestrate was as new to them as it was to us?