Diff between Server Job and Parallel Job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Re: Comments about Parallel routines

Post by kcbland »

prabu wrote:What about Parallel routines, which force the DataStage folks to know C? Comments please
Are you moving millions or billions of rows of data?

Is your hardware thousands or hundreds of thousands of dollars?

Is your software budget one hundred or five hundred thousand dollars?

Is your team skillset beginner or advanced data integration experienced?

These should be the questions you're answering. Server is for lower volumes, a less powerful server, and developers with a smaller skillset. Parallel is for higher volumes, more powerful multi-node servers, and developers with a higher skillset. C is more complicated, but it is the requirement for powerful and expeditious transformations.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The hash file stage is powerful in that it allows for reading, writing, and updating a reference dataset.
Hi,
Lookup File Set / Lookup stage with a hash-partitioned Data Set - won't it do the equivalent work of the hash file stage?

regards
kumar
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It's your idea Kumar_S - you run with it.

Prove to us, or otherwise, your hypothesis that a Lookup File Set can be updated in real time; that is, that the import operation that maps it into a virtual Data Set is sensitive to changes in the underlying Lookup File Set. The alternative possibility is that the virtual Data Set is loaded when the Lookup stage starts and can not subsequently be updated.

Please post your methodology and your results.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

Server jobs are easier than parallel jobs; a hash file stage and a transformer stage are easier to learn than lookup + join + merge + change data capture + filter + modify + transformer + change data apply + dataset + fileset. Especially since the change capture stage produces a mountain of warning messages, the modify stage has a different function list from the transformer stage, which has a different syntax from the filter stage, none of the stages provides a good range ("between") lookup, and half of them have reject links and half don't.

However, most clients working with large amounts of data need some parallel jobs, and for those sites it is a question of using only parallel jobs or using a mixture of both.

The argument for server jobs for smaller volumes is that they are easier and faster to build and they use fewer resources. A small parallel job will compete for node resources with larger jobs and have longer start-up and shut-down times. Many of these smaller jobs use database tables that are not partitioned, so they perform a lot of useless partitioning unless you force them to run single-threaded.

The argument for sticking with parallel jobs is to promote re-use between jobs: e.g. a large job writing a dataset for a smaller job to use, copying stages between jobs, sharing custom routines, parallel shared containers. Having only one type of job to learn is an important consideration for the initial development team and for subsequent maintenance and enhancement of the jobs.

I've been happy to go with parallel only, server only or a mix of both. For the sake of maintainability I would try and stick with just one job type.
kanchub79
Participant
Posts: 2
Joined: Tue Oct 18, 2005 7:19 am

Re: Diff between Server Job and Parallel Job

Post by kanchub79 »

Hi,

I have some idea on this; maybe it will clear some doubts on this topic.

Server jobs run on a single node, whereas parallel jobs run on SMP and MPP machines, where we have multiple nodes to process the data.

Given a scenario, I would prefer using server jobs if the data we have on the source side is not of huge volume, since with huge volumes the main concern is performance: the performance of server job active stages suffers on huge data. There is a workaround where we do the aggregation or pre-sorting at the database level, so that we benefit from the indexes and partitions on the tables on the database side. But this solves the performance issue only to some extent.

Now, with parallel jobs we have 2, 4 or 8 nodes in the dev/test/prod environments, so we can process huge data by selecting the appropriate partitioning method (Round Robin, Hash, Modulus, etc.), whereby the data processed by the active stages is divided according to the partitioning mode across all the nodes available in that environment. This can improve our performance in handling huge data.
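
To make the idea of the key-based methods concrete, here is a small illustrative C sketch (the hash function, key values and node count are made up for the example; this is not the actual DataStage implementation). Each record's key decides which of the N nodes processes it, so rows with the same key always land on the same node.

    /* Illustrative sketch: mapping a record key to one of N processing nodes. */
    #include <stdio.h>

    static unsigned hash_key(const char *key)
    {
        unsigned h = 5381;                      /* simple string hash         */
        while (*key)
            h = h * 33u + (unsigned char)*key++;
        return h;
    }

    int main(void)
    {
        const unsigned nodes = 4;               /* size of the node pool      */

        /* Hash partitioning: any key type, hashed then reduced modulo N.     */
        printf("customer 'C1042' -> node %u\n", hash_key("C1042") % nodes);

        /* Modulus partitioning: an integer key taken directly modulo N.      */
        unsigned long order_id = 987654;
        printf("order %lu -> node %lu\n", order_id, order_id % nodes);

        return 0;
    }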

Again, all of this rests with the organization, as others have pointed out in their replies about the cost of each.

Please correct me if I have understood wrongly.
Thanks,
saravan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You are right that server jobs must run on a single machine; however, that machine can have lots of CPUs, and it is possible to implement both pipeline and partition parallelism in server jobs. This parallelism must be designed in, though, and it does not automatically scale if, for example, one desires to assign more processors to the task. There is no scope for processing on the multiple machines of an MPP cluster. Server jobs can deal with large volumes of data.

Parallel jobs, on the other hand, can scale automatically by the simple expedient of running with a configuration file that specifies a different number of processing nodes and associated resources. Parallel jobs can be executed using multiple machines in a cluster or a grid. Parallel jobs can deal not just with large volumes of data, but with huge volumes of data.
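
As a minimal sketch of what such a configuration file might look like (the host name and resource paths here are placeholders, not taken from any real installation), a two-node layout could be written as:

    {
        node "node1"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/ds/node1" {pools ""}
        }
        node "node2"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/node2" {pools ""}
            resource scratchdisk "/scratch/ds/node2" {pools ""}
        }
    }

Pointing the same job at a four-node or eight-node version of this file changes the degree of parallelism without touching the job design.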

And it needs to be said that developing in parallel jobs is a different skill set from developing in server jobs; server job techniques and assumptions do not, in general, translate very well to the parallel job environment. Come along to the "Server to Parallel Transition" class to learn more; next presentation is November 16-18 in Las Vegas.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
prabu
Participant
Posts: 146
Joined: Fri Oct 22, 2004 9:12 am

Parallel jobs - C dependency...

Post by prabu »

ray.wurlod wrote:Server routines force you to know DataStage BASIC.

And there are probably a lot more C programmers out there than there are DataStage/UniVerse BASIC programmers.

What's the point? If you're going to program, you need to know/learn the programming language of choice for the application.
Ray, I agree with you to some extent. My idea is to highlight that parallel jobs don't support a lot of built-in functions (as compared to server jobs), which is not good news. Example: trim(string)
:(

Also, C uses 1 byte for char whereas DataStage, I believe, uses 2 bytes. wchar_t* doesn't seem to be supported.

I can see a lot of considerations when data types are passed around between different execution environments. Is there any guarantee that the data-type mapping between DataStage and C is 1-to-1? :idea:

The switch-over to execute external routines will definitely be costly.
I would like to see more discussion on this topic...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You'd perhaps be surprised at how few external routines you can get away with using if your knowledge of what's available in the product is comprehensive.

I agree that there's a need for something like the SDK suite for parallel jobs; such a suite is indicative of a mature product - it took some years before it was developed for server jobs.

DataStage BASIC does not really have data types in the conventional sense; it uses a structure called a DATUM that can change "data type" on the fly. Clearly there are overheads involved, but it is not true to say that server requires two bytes for Char - it can be more, and depending on a number of factors it probably will be; for example, any DATUM needs to carry a REMOVE pointer and a hint mechanism for the EXTRACT function.
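
Purely as an illustration of the idea - this is not the real DATUM layout, which is internal to the engine - a value whose "data type" can change on the fly can be pictured as a tagged union in C:

    /* Illustrative only: a dynamically typed value modelled as a tagged union. */
    typedef enum { DT_STRING, DT_INTEGER, DT_FLOAT } dyn_type;

    typedef struct {
        dyn_type type;      /* which member of the union is currently valid */
        union {
            char  *str;     /* string form                                  */
            long   ival;    /* integer form                                 */
            double fval;    /* floating-point form                          */
        } u;
    } dyn_value;            /* the tag can change at run time               */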

DataStage BASIC can pass data to and from C functions through interfaces such as the General Call Interface, which is essentially a scratch pad for converting between the strongly-typed C environment and the DATUM-based DataStage BASIC environment.

This conversion overhead is the main issue when using BASIC Transformer stage in parallel jobs.

Except when using Server Shared Containers or BASIC Transformer stages, and when passing parameter values from a job sequence to a parallel job, there is not much call for passing values between the two environments.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ameyvaidya
Charter Member
Posts: 166
Joined: Wed Mar 16, 2005 6:52 am
Location: Mumbai, India

Post by ameyvaidya »

2 points more that I'd like to add:

1. Suppose you have a warehouse to be built for a (very) rapidly growing client (retail would be a good example), and let's assume that the data volumes are expected to double, say, every quarter (it is possible). However, considering the client's area of operation, you only have a fixed time frame (say 1 hour) to execute the entire warehouse refresh.

How will you approach this problem in Server Jobs??

You write the best, most optimized job design, implement partitioning and parallelism, and get the warehouse refresh done in 15 minutes. But there will always come a time when you will need to optimize again and again and again. And by optimization I mean development... change control... testing... QC :(

In parallel jobs, the easiest solution is to double the processing power. If I had a single computer, I would move to an MPP of 2, then 4, then 8 machines. (Theoretically at least) DSEE could be scaled up that way almost indefinitely. The job design, once optimized, can remain constant.
And if the company is growing that rapidly, it usually can afford to throw money at hardware, but users will raise :evil: HELL :evil: if they find incorrect data because of your latest optimization.

2.
A little birdie (actually an Ascential Consultant) once whispered in one of my previous clients' ears that Server jobs are due to become extinct in some future version of DataStage and that he'd be better off doing his entire development in PX for future maintainability.
Amey Vaidya
I am rarely happier than when spending an entire day programming my computer to perform automatically a task that it would otherwise take me a good ten seconds to do by hand.
- Douglas Adams
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Balance that whispered advice with the fact that the sales dude's commission is far greater for EE than for server.

A job sequence is a special case of a server job (with job control code). The server engine functionality is not going away any time soon.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I thought it was theoretically possible to implement server jobs on an MPP by using the SOA edition. Your large server jobs would be turned into real-time jobs but called via a standard batch scheduler; the SOA agents would then decide which server to run them on. Sorry, I'm a bit vague on the details - I had a quick discussion about it at Ascworld 2004 but have never used SOA Edition. A large job could then be split into multiple instances running on different servers. Still more difficult than EE, as you have to consider repartitioning and sharing data.
prabu
Participant
Posts: 146
Joined: Fri Oct 22, 2004 9:12 am

Post by prabu »

ray.wurlod wrote:
I agree that there's a need for something like the SDK suite for parallel jobs; such a suite is indicative of a mature product - it took some years before it was developed for server jobs.
Exactly - the whole idea of using an ETL tool is to bring down the development time. From my experience, the problem with an external function is testing it comprehensively before publishing it [exception handling and such]. I remember someone asking in another post how to get the full month name from an input date value... write some case statement like "if month = 1 then January"... :oops:
ray.wurlod wrote:
DataStage BASIC does not really have data types in the conventional sense; it uses a structure called a DATUM that can change "data type" on the fly. Clearly there are overheads involved, but it is not true to say that server requires two bytes for Char - it can be more, and depending on a number of factors it probably will be; for example, any DATUM needs to carry a REMOVE pointer and a hint mechanism for the EXTRACT function.
If the internal memory allocation of a data type is not fixed [like, say, 2 bytes for char, 1 byte for int], how can it read any input? Is DATUM something like "read until the end-of-string character '\0' is found"? I think '\0' is true for strings only :shock:

I thought Unicode characters could be represented in 2 bytes. Could you please be kind enough to elaborate more on DATUM?


A web definition of Unicode
==================
"A 16-bit character encoding scheme allowing characters from Western European, Eastern European, Cyrillic, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Thai, Urdu, Hindi and all other major world languages, living and dead, to be encoded in a single character set"

ray.wurlod wrote:
DataStage BASIC can pass data to and from C functions through interfaces such as the General Call Interface, which is essentially a scratch pad for converting between the strongly-typed C environment and the DATUM-based DataStage BASIC environment.
Any call interface should at least have the data-type mappings/conversions correct.

example:

I have some string "abcde" in some system, say UNIX, and let us assume that this system allocates 2 bytes per character; in total it will take 5+1 characters (1 for the end-of-string marker), i.e. 12 bytes.

Say DataStage tries to read the characters assuming that each char is 1 byte; then it will read the 2 bytes representing 1 character on that system as 2 characters and produce 2 chars, won't it?
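
To make the byte arithmetic concrete, here is a small stand-alone C sketch (illustrative only; wchar_t stands in for whatever wide representation the other system uses):

    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int main(void)
    {
        const char    narrow[] = "abcde";   /* 1 byte per character             */
        const wchar_t wide[]   = L"abcde";  /* 2 or 4 bytes, platform-dependent */

        printf("narrow: %zu chars, %zu bytes (incl. terminator)\n",
               strlen(narrow), sizeof narrow);
        printf("wide  : %zu chars, %zu bytes (incl. terminator)\n",
               wcslen(wide), sizeof wide);

        /* A reader that assumed one byte per character would count
           sizeof(wchar_t) "characters" for every real wide character. */
        return 0;
    }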
ray.wurlod wrote:
This conversion overhead is the main issue when using BASIC Transformer stage in parallel jobs.

Except when using Server Shared Containers or BASIC Transformer stages, and when passing parameter values from a job sequence to a parallel job, there is not much call for passing values between the two environments.
I am sorry, I am confused here - which are the two environments?
Is it not true that every time an external function is called, we will incur an overhead?

A general question [maybe hijacking the thread]
=============
Why is there no stand-alone literal for NULL in a parallel job - the literal which represents "unknown" values? It sucks when the only equivalent is setNull(), which unfortunately returns an int8.


As an example, I have written a C function to implement the trim functionality. I have passed test strings like space(8):string, string:space(8), space(8), etc. When I try to pass a NULL, it doesn't allow me, because there is no way to represent a NULL in a parallel job.

So I have created one more dummy C function which just returns NULL:

    /* C return value (signature assumed for illustration) */
    const char *Exfn_givemeNULL(void) { return NULL; }

My call goes like Exfn_Trim(Exfn_givemeNULL) for every input string. The worst case is when I return NULL from my C routine and check for ISNULL(Exfn_Trim(Exfn_givemeNULL)) - it doesn't recognize it.

It treats the NULL return value from the external C function like an empty string "".

If it will help the discussion, I can paste my C functions here.

I think this non-availability of "good old" NULL is a serious limitation - why can't it understand nulls? :twisted:
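
For illustration only, here is a rough sketch of the kind of trim routine being described - the name follows the post above, but the body and signature are assumptions, not the actual code:

    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    /* Trims leading and trailing whitespace; propagates NULL for NULL input. */
    char *Exfn_Trim(const char *in)
    {
        if (in == NULL)
            return NULL;    /* per the post above, the job sees "" here, not NULL */

        while (*in && isspace((unsigned char)*in))
            in++;           /* skip leading whitespace                            */

        size_t len = strlen(in);
        while (len > 0 && isspace((unsigned char)in[len - 1]))
            len--;          /* drop trailing whitespace                           */

        char *out = malloc(len + 1);
        if (out != NULL) {
            memcpy(out, in, len);
            out[len] = '\0';
        }
        return out;         /* caller/framework is assumed to own this buffer     */
    }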

thanks for your patience

regards,
Prabu
Last edited by prabu on Wed Oct 19, 2005 8:31 pm, edited 4 times in total.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

ameyvaidya wrote:A little birdie (actually an Ascential Consultant) once whispered in one of my previous clients' ears that Server jobs are due to become extinct in some future version of DataStage and that he'd be better off doing his entire development in PX for future maintainability.

Har har har. Yeah, that's a good one. Do you think a company would alienate 80%+ of their customer base's established ETL solutions? Back in 2001, when Ascential acquired Torrent's technology for the PX framework, they had what, 5 installed customers compared to 2000+ Server installations. How many Server licenses compared to PX licenses would they have to sell in 4 years to justify shutting off that maintenance stream for Server? Do you realize how many of those Server licenses are on NT platforms that currently aren't in the support path for PX?

Do you know, or have you ever spoken to, Informatica sales reps who have made offers to DS customers to swap out licenses for free? Given the choice of a forced rewrite from Server to PX, or a free switchover to a new tool and then a forced rewrite, some might just opt for the cheaper maintenance fee on the new product.

Companies make decisions based on getting the job done now, not on saying "maybe we should wait a year or two for the vaporware to materialize". You do what you have to do. Sometimes the lesser tool is used because it gets the job done now, as opposed to waiting until later for a different tool on a promise that it will get the job done better. Somehow the world turned, data got loaded, billions of rows were juggled, and all without PX. I'm not saying it's a junk tool; I'm just saying that data is managed just fine using Java, C, shell scripts, bulk loaders, stored procedures, and yes, even Server jobs can load multi-terabyte warehouses.

One last note, as I've posted many times previously: the best tool/database can be made to look like garbage by the worst code designer. The best code designer can make the worst tool/database look awesome. Folks think that just because you write a PX job it's automatically the fastest loading solution. Read these forums - there are plenty of people wondering why their PX jobs are loading at 50 rows per second.
Last edited by kcbland on Wed Oct 19, 2005 8:56 pm, edited 1 time in total.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
prabu
Participant
Posts: 146
Joined: Fri Oct 22, 2004 9:12 am

Post by prabu »

ameyvaidya wrote:2 points more that I'd like to add:

1. Suppose you have a warehouse to be built for a (very) rapidly growing client (retail would be a good example), and let's assume that the data volumes are expected to double, say, every quarter (it is possible). However, considering the client's area of operation, you only have a fixed time frame (say 1 hour) to execute the entire warehouse refresh.

How will you approach this problem in Server Jobs??

You write the best, most optimized job design, implement partitioning and parallelism, and get the warehouse refresh done in 15 minutes. But there will always come a time when you will need to optimize again and again and again. And by optimization I mean development... change control... testing... QC :(
I am waiting for a tool from Ascential which can migrate Server jobs into Parallel jobs.
ameyvaidya wrote: 2.
A little birdie (actually an Ascential Consultant) once whispered in one of my previous clients' ears that Server jobs are due to become extinct in some future version of DataStage and that he'd be better off doing his entire development in PX for future maintainability.
Read my blind guess above, but the said tool may cost $$$$$$$$$.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

prabu wrote:I am waiting for a tool from Ascential which can migrate Server jobs into Parallel jobs.
There's no equivalent of a hash file, all user-written DS BASIC functions would have to be magically ported, and you'd end up with non-parallel jobs just to satisfy jobs that HAVE TO sequentially process data, reading from and writing to staging hash files without caching where row-order dependencies exist - unless you're going to create a magical custom-operator writer to handle multiple-record iteration/reconciliation (vec to rec, agg to rec, whatever).

Server and PX are about as similar as Informatica and PX. The technologies, languages, etc. are too different; there are few analogous components. If you're waiting, then I'll pass you by with a team of consultants doing the total rewrite to PX, and you'll get a better result. Ever try to maintain or enhance code that was initially generated instead of architected? It's brutal, and a company that does it deserves what comes from that poor choice.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle