Server to Parallel

sam334 · Post by **sam334** » Sat Apr 19, 2014 12:57 am

All,
Hope you are doing well. I need some advice. I have been working in DataStage for almost 3 years now, currently on 9.1, but my current project uses only server jobs, not parallel. Basically, I started my DataStage career with server jobs. It seems like all the companies in the market are using parallel jobs. Can you please suggest how I can move from server to parallel jobs? I know shell scripting and PL/SQL, which are useful for server jobs. But what should I do to really move to a parallel environment?

Thanks,

Appreciate your thought.

Sam.

qt_ky · Post by **qt_ky** » Sat Apr 19, 2014 7:52 am

You could start by reading all the available product documentation.

eostic · Post by **eostic** » Sat Apr 19, 2014 10:17 am

Exactly...and just start building some Parallel Jobs and playing with it. See what's different...and what is the same. In addition to the documentation, review the threads here...do some searching --- there are resources on how to build parallel Jobs and why/what, etc. all over the place...not just in this forum.

...and don't drop your Server skills...they will come in very handy.....there are times when you will want Server Jobs alongside, or instead of, EE Jobs.

Ernie

kduke · Post by **kduke** » Sat Apr 19, 2014 11:42 am

Once config files are set up, PX jobs are not much different. It is a slightly different mindset. Sort only when you need it. Set your partitioning and leave it the same. Only a few stages, like Merge, require sorted input. So you need a hash partition on the key, and then Same partitioning from the time you sort until the Merge stage.
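To picture why the hash partition matters, here is a minimal Python sketch of the idea, not DataStage code. The names `hash_partition`, `sort_within_partitions`, `cust_id` and the 4-node count are all made up for illustration. It shows that hashing both inputs on the same key puts matching rows in the same partition, so each partition can be sorted and merged on its own.

```python
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_partitions):
    """Route each row to a partition chosen by a stable hash of its key value."""
    partitions = defaultdict(list)
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str
        index = zlib.crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions

def sort_within_partitions(partitions, key):
    """Sort each partition independently; no data moves between partitions."""
    return {index: sorted(rows, key=lambda r: r[key]) for index, rows in partitions.items()}

def partition_of(partitions, key):
    """Map each key value to the set of partitions it landed in."""
    location = {}
    for index, rows in partitions.items():
        for row in rows:
            location.setdefault(row[key], set()).add(index)
    return location

# Hypothetical sample data: a master input and an update input sharing cust_id
master = [{"cust_id": i % 5, "amount": i} for i in range(20)]
update = [{"cust_id": i, "status": "active"} for i in range(5)]

NODES = 4  # assumed node count, purely illustrative
master_parts = sort_within_partitions(hash_partition(master, "cust_id", NODES), "cust_id")
update_parts = sort_within_partitions(hash_partition(update, "cust_id", NODES), "cust_id")

master_loc = partition_of(master_parts, "cust_id")
update_loc = partition_of(update_parts, "cust_id")
```

If you repartition on a different key somewhere between the sort and the Merge, that guarantee is gone, which is why you leave it on Same.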

The biggest mindset change is to never land the data. One big job is usually better than landing the data and a lot of little jobs. Grab your data. Do everything you need and land it in the target. One big job. Often a sort in the database, especially on multiple keys, does not match what DataStage wants. So sort it in your job when you need it. Debugging this can be a nightmare.

More stages in PX is not necessarily a bad thing. PX jobs will optimize out Copy stages. A trick I learned from IBM developers was to have a Copy stage right before a Join, Merge or Lookup stage. Throw in a Peek stage to see why your join, merge or lookup is not working like you think.

You need to think about how memory is used. Lookups are all in memory on each node. So keep as few fields as possible. If you have an 8 node config and millions of rows on a lookup, then you might need to change to a join or a merge. Write your job both ways and test it. Look at memory usage. Remember production probably has a lot more jobs all running at the same time. So one job in DEV may run great and in PROD it kills the server.
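A back-of-the-envelope way to see the memory point, as a rough Python sketch. It assumes the worst case where every node holds a full copy of the reference data. The real footprint depends on how the reference link is partitioned and on the platform, and the row counts and byte sizes below are invented numbers.

```python
def lookup_memory_bytes(rows, bytes_per_row, nodes):
    """Rough upper bound: every node keeps its own full copy of the reference rows."""
    return rows * bytes_per_row * nodes

# Hypothetical reference table: 5 million rows on an 8 node config
wide = lookup_memory_bytes(rows=5_000_000, bytes_per_row=400, nodes=8)   # all columns kept
narrow = lookup_memory_bytes(rows=5_000_000, bytes_per_row=40, nodes=8)  # key plus one field

print(f"wide:   {wide / 1024**3:.1f} GiB")
print(f"narrow: {narrow / 1024**3:.1f} GiB")
```

Dropping the columns you do not need cuts the estimate by the same factor as the row width, which is the reason to keep as few fields as possible on the reference link.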

Just because something runs without warnings does not mean it is accurate or optimized. Your lookups or joins could be failing because of something you did not think about, like metadata. If one side is trimmed and mapped to a varchar and the other side is not trimmed or is mapped to char, then you are not going to get the results you expect.
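The char versus varchar trap is easy to reproduce outside DataStage. This is only a Python illustration with made-up keys and an assumed fixed width of 8, where a padded key stands in for a char column and a trimmed key for a varchar:

```python
# Reference data: keys trimmed, as if mapped to varchar
reference = {"A100": "Widget", "B200": "Gadget"}

# Source stream: same keys padded with spaces to a fixed width, as if mapped to char(8)
source_keys = ["A100".ljust(8), "B200".ljust(8)]

# Comparing as-is finds nothing, and nothing raises an error or warning
naive_matches = [key for key in source_keys if key in reference]

# Trimming the padded side first makes both sides agree
trimmed_matches = [key.rstrip() for key in source_keys if key.rstrip() in reference]

print(len(naive_matches), len(trimmed_matches))
```

The job "works" in both cases. Only the row counts tell you the first version matched nothing.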

PX is a lot more picky about metadata. It will do implicit conversions when you are sloppy. You either need to convert your column types up front or right before insert/update. Be consistent. It will save you lots of work debugging.

All outputs need reject links. You never know if an insert or update failed without reject links. You want to trap all rejects so you know how many failed.

Remember PX is not that much different than server. If you are good in one you can be good in the other. Try to imagine what is happening when it splits a stream into multiple partitions. Sometimes 4 nodes will outperform 6 nodes if you can do lookups without sorts compared to merge with sorts. The same is true for options on stages. Just because it will run with an array size of 30,000 does not mean that it is faster. Maybe 5,000 is faster in production because the database is not so overworked creating rollback segments and monitoring locks.

Most of this is what we in Texas call "common sense". A term we grew up with meaning think it through, it might not be as simple as you thought.

chulett · Post by **chulett** » Sun Apr 20, 2014 7:39 am

Not sure how common 'common sense' is any more, Kim.

All great advice. From a reading standpoint, I highly recommend this IBM Redbook: InfoSphere DataStage Parallel Framework Standard Practices.

IBM Analytics Champion 2009 - 2020 · Post by **asorrell** » Mon Apr 21, 2014 7:32 am

Reading many of the manuals is not extremely helpful. Quite a few are extremely repetitive (boring!) and are self referential ("the Buffer button is used to set Buffers" - Doh!). They are a good reference if you know what you are looking for.

I'd say a better place to start is the Redbook IBM InfoSphere DataStage Data Flow and Job Design.

http://www.redbooks.ibm.com/abstracts/sg247576.html

chulett · Post by **chulett** » Mon Apr 21, 2014 7:35 am

Yah... that one too.

rkashyap · Post by **rkashyap** » Mon Apr 21, 2014 9:39 am

You can also start by viewing the Server to Parallel Transition Lab with Ray Wurlod.

chulett · Post by **chulett** » Mon Apr 21, 2014 9:59 am

Wowzers... that brings back memories.

sam334 · Post by **sam334** » Mon Apr 21, 2014 1:15 pm

Thanks everybody. I will start reading the materials very soon.

Craig, thanks for your suggestion to post it in general blog. Worked great.

Also, if somebody can open the non visible part for a while that will be great. As said before, my premium membership is not activated yet though I paid almost a month back.

Thanks....