Hi,
Somebody in my project designed a job with two Transformer stages in it.
The first Transformer uses Hash partitioning and the second uses Same partitioning. Everything else is a straight move (columns copied through unchanged).
Job design is like this :
Ext Src Stg---->Tfm--->Remdup---->OraConnector---->Tfm---->OraConnector
Can anyone suggest how to improve performance in this scenario?
Thanks
hk
Transformer Stage Replacement
Moderators: chulett, rschirm, roy
Re: Transformer Stage Replacement
If the Transformers only handle partitioning, there is actually no need for them; partitioning can be set on any other stage.
For example, for the Remove Duplicates stage, put a Sort stage before it with Hash partitioning, then set the partitioning on the Remove Duplicates stage to Same.
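The hash-partition, sort, then remove-duplicates pattern can be sketched in plain Python (this is illustrative only, not DataStage syntax; the column names and partition count are made up). Rows with the same key hash to the same partition, each partition is sorted on the key, and only the first row per key is kept:

```python
# Sketch of hash partitioning + sort + remove-duplicates semantics.
# Rows with the same key land in the same partition, so each
# partition can be deduplicated independently and in parallel.

rows = [
    {"cust_id": 3, "amt": 10},
    {"cust_id": 1, "amt": 5},
    {"cust_id": 3, "amt": 7},
    {"cust_id": 2, "amt": 9},
]

NUM_PARTITIONS = 2  # hypothetical degree of parallelism

# Hash-partition on the key column.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    partitions[hash(row["cust_id"]) % NUM_PARTITIONS].append(row)

# Per partition: sort on the key, then keep the first row per key,
# which is what a Remove Duplicates stage retaining "first" does.
deduped = []
for part in partitions:
    part.sort(key=lambda r: r["cust_id"])
    last_key = object()  # sentinel that matches no real key
    for row in part:
        if row["cust_id"] != last_key:
            deduped.append(row)
            last_key = row["cust_id"]

print(sorted(r["cust_id"] for r in deduped))  # one row per cust_id
```

Because the hash guarantees all rows for a key are in one partition, downstream Same partitioning preserves that grouping without a costly repartition.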
Soumya
FYI: There is typically little need to explicitly specify Same partitioning. When partitioning is set to Auto, the engine will normally not repartition unless it determines a need to do so.
If all that the transformers are doing is copying input columns to the output (no null checking, datatype conversion, column generation, etc.), then they are unnecessary and you can do as soumya has suggested.
What type of data volume is being processed and what do the Oracle connectors in the job do, especially the first one?
Regards,
- james wiles
All generalizations are false, including this one - Mark Twain.
10 million rows through transformers is not an issue in your case, so long as you are doing as little in them as you have stated. That leaves the Oracle Connectors, Remove Duplicates and the External Source stage. So, another series of questions as we try to narrow down.
1) How long does this job run when processing 10 million records?
2) What is External Source calling to provide the data?
3) (as asked before) What do the Oracle Connectors do in this job?
4) What degree of parallelism is the job being run at?
5) How wide are the records? (Average bytes per record)
Potential bottlenecks are the External Source stage, the Oracle Connectors, and the Remove Duplicates stage.
1) Determine how quickly the External Source stage runs by creating a job with just it dumping to a Copy stage: ExtSrc-->Copy. Compile and run to see how long it takes to provide the 10 million records.
2) The Remove Duplicates probably has a Sort inserted in front of it at runtime. Depending on the total size of the data being processed, it may have a minor impact.
3) The Oracle connectors: What does each one do (I suspect the first is an upsert of some sort and the second loads rejects)? Do they load to the same table or separate tables? Do they use custom SQL statements? Are the tables being loaded to indexed, and are the indexes kept up to date?
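As an aside on why the index question matters: an upsert must probe the target table by key for every incoming row, and without an index on that key each probe is a full scan. A minimal sketch of the upsert pattern, using sqlite3 as a stand-in for Oracle (the table and column names are invented for illustration):

```python
import sqlite3

# Illustrative only: sqlite3 stands in for Oracle, and the schema is
# made up. An upsert probes the target by key for every row, so the
# key column needs an index (here the PRIMARY KEY provides one).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgt (cust_id INTEGER PRIMARY KEY, amt INTEGER)")

incoming = [(1, 5), (2, 9), (1, 7)]  # key 1 repeats: second write updates

for cust_id, amt in incoming:
    # Insert-or-update on the key: the upsert pattern.
    conn.execute(
        "INSERT INTO tgt (cust_id, amt) VALUES (?, ?) "
        "ON CONFLICT(cust_id) DO UPDATE SET amt = excluded.amt",
        (cust_id, amt),
    )
conn.commit()

print(conn.execute("SELECT cust_id, amt FROM tgt ORDER BY cust_id").fetchall())
```

If the target table's key is not indexed (or the index is stale), each of those probes degrades badly at 10 million rows, which is why the connector doing the upsert is a prime suspect.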
- james wiles
All generalizations are false, including this one - Mark Twain.