I have a document on DS EE best practices. It says this about the Modify stage:
"3.3.7 Modify stage
After DataStage release 7.5.1, Transformer stage performs better than Modify stage even for simple null handling operations. Moreover Modify stage breaks the metadata link between the stages. So it is not recommended to use Modify stage in the jobs."
Is this true?
Is it worse than a Transformer, or is this just another made-up argument?
I am not sure about your statement about the Modify stage. In which document did you read these performance tips? The performance documents I have say that the Modify stage is one of the more useful stages in DS Parallel. So far, I have not found any drastic performance degradation with the Modify stage while handling nulls.
NageshSunkoji
If you know anything SHARE it.............
If you Don't know anything LEARN it...............
As far as I know, the Modify stage gives better performance than the Transformer stage. We mostly use the Modify stage in our jobs instead of the Transformer (wherever possible), and we try to avoid the Transformer stage because of performance.
It is true that the Transformer is more efficient in 7.5.1 than in previous versions, but the Modify stage should be at least as efficient as the Transformer, if not more so.
There are far too many unsupported assertions in that document, which - if it is or is based upon the one of which I'm thinking - you should not have (it's IBM Internal Confidential, produced by the Center of Excellence for use by IBM consultants).
While it is true to claim that performance improvements have been made in the Transformer stage, it remains true that the very primitive Modify stage is very efficient precisely because it is primitive. Indeed, if you inspect the code generated when a Transformer stage is compiled, you are very likely to see modify operators used in that code!
Find the author, demand objective proof!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
I talked about this in my blog entry Is the DataStage parallel transformer evil? and my approach is to always go with the Transformer first since it is the easiest stage to use and the most user-friendly.
The Modify stage can be plain nasty. It's okay if you are just doing trimming, but if you need to perform more than one function on a field, forget about it; and if you haven't used it before and need several types of functions, you could spend hours getting the syntax right. The Transformer, on the other hand, helps you with the syntax via the right-click menu and syntax checking.
I would only use the Modify stage if I needed to eke some extra performance out of a job, so I would add it after I had completed my job design and discovered in performance testing that it was too slow. Even then, I wouldn't be surprised to see only a 2% performance improvement.
What I encountered was that errors generated while using functions in the Modify stage were far more difficult to correct and took many iterations. I almost gave up on PX when confronted with the Modify stage. On the theoretical side, however, I am led to believe that an equivalent transformation done by the Modify stage will be faster than a PX Transformer stage. I admit I am still unsure what to do when confronted with something that could be done by either stage.
It would be great if IBM could develop an expression editor for writing modify specifications, similar to the Transformer's. That would solve the problem to a certain extent.
I took some time to play with it, to learn its idiosyncrasies. Its very value is in how primitive an operator it is. It IS worth learning for all those little things (null handling, column name change, data type change) that you often have to do to get downstream stages to work properly.
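For illustration, here is a small modify specification sketch covering those little things. The column names are hypothetical, and the syntax is from the Modify stage chapter of the Parallel Job Developer's Guide, so verify it against the documentation for your version:

```
DROP audit_flag;
order_id = cust_order_id;
amount = handle_null(amount, 0);
qty:int32 = qty
```

Reading top to bottom: `DROP` removes a column, `order_id = cust_order_id` renames a column, `handle_null` replaces NULL with 0 (and makes the output column non-nullable), and `qty:int32 = qty` redefines the column's data type using the default type conversion.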
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
There is documentation for the Modify Stage in the Parallel Job Developer's Guide for your version (assuming you are using at least v7.5.1 or above). In the 7.5.1 doc, it's in Chapter 28 and includes most if not all available functions and the proper syntax.
To "preserve" your metadata in the visual sense (in that it's displayed on your output column links), make use of table definitions and load them and/or manually add columns to the metadata grid. There is no mapping tab as in other stages.
Internally, the operator itself will generate the proper output metadata to be shared with the next operator downstream. You can see this by setting the $OSH_PRINT_SCHEMAS environment variable.
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.