Will unused columns in Source stage affect performance?

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Will unused columns in Source stage affect performance?

Post by raju_chvr »

Will unused columns in the Source stage affect performance?

What I mean is: when we use the 'Load' utility to load the table definition/metadata into the Source stage, whether it is an ODBC table or a Sequential file, you load all of the columns for a given table, and in many cases you may not use them. Will this affect the speed of the job? I am asking because I have one job which is slow, and I want to tune it and am looking for factors that might affect its performance.

I know there are many other factors to look into for performance.


thanks
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

It's extra bytes in the pipeline potentially running over the network. You need to decide if eliminating unused columns provides the performance you need or if instantiating the query is better.

Kenneth Bland
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Post by raju_chvr »

Thanks, Mr. Bland, for your time!

Well, most of my source stages are UniData, so I think I have to go with eliminating unused columns.

Can you clarify this for me? I assume that DS Server first requests all of the columns present in the stage properties, and then later discards the ones that are not present in the output stages. Is my understanding of how DS Server works correct?
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Well, now that you said you're pulling from UniData, you have many other things to factor in:

1. Are you doing any normalization? This is a huge black hole to tune, because of one row having to explode into many.
2. Are you using F-correlatives or I-descriptors? This is again a black hole.
3. What's the UniData host like? Does it have a lot of CPUs, so that if you instantiated the job with a ranged query you could have more than one job pulling from the UniData file?
4. What's the load on the UniData host when you're doing your extracts?

#3 may give you the best performance, as it generally gives you multiples in return. For example, you could have 3 input streams coming from your UniData stage, each one doing a different range of the source data (or MOD(some integer column, such as a key, 3) = 0, 1 or 2), spool the results to 3 sequential files, then concatenate them together. This can hopefully make your job extract 3X as many rows in the same time as it used to.
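To sketch what #3 might look like (the table, column and file names are purely illustrative, and the exact MOD syntax depends on your source), each of the three job instances or input streams would carry a query restricted to its own slice of the keys:

    SELECT ORDER_NO, CUST_NO, ORDER_AMT FROM ORDERS WHERE MOD(ORDER_NO, 3) = 0
    SELECT ORDER_NO, CUST_NO, ORDER_AMT FROM ORDERS WHERE MOD(ORDER_NO, 3) = 1
    SELECT ORDER_NO, CUST_NO, ORDER_AMT FROM ORDERS WHERE MOD(ORDER_NO, 3) = 2

Each stream spools to its own sequential file (say ORDERS_0.txt, ORDERS_1.txt, ORDERS_2.txt), and a simple concatenation such as

    cat ORDERS_0.txt ORDERS_1.txt ORDERS_2.txt > ORDERS_ALL.txt

rebuilds the complete extract.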

Good luck!

Kenneth Bland
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Sequential files can only be processed sequentially. What this means in practice is that you must read past every byte in the file to get to the next byte. You cannot ignore columns in a sequential file; but you can discard them in a following Transformer stage.

The same is true of other stream-type inputs, such as the FTP stage or Merge stage (server jobs).

For every other kind of non-stream input, such as data from a database server, you can be selective about which columns to select. The fewer the better in general.

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Let's be quite clear how DataStage works with remote data sources (anything not a sequential file or hash file on the DS server).

In an ODBC, OCI, CLI, UV/ODBC, or UNI/ODBC stage, DataStage passes the SQL (in the case of U2, SELECT) statement to the data source and then receives the spooling result down a connection to the DS server. Let's pretend the output goes to a transformer stage. The data source has to prepare the source data, so if you are asking it for more columns than you actually care about, you will incur the overhead of those bytes traveling down the connection.

Whether you map those columns to your output file or not IS IRRELEVANT. The data source still had to prepare those columns and pass them down the connection. So, whatever is in your SQL or SELECT statement is what is computed and passed to DS. There is nothing resembling a predicate pushdown based on what you do with the result.
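As a hedged illustration of that point (the table and column names here are invented): if the loaded table definition generates the first statement below but the job only ever maps three of the columns downstream, it is trimming the SQL itself to the second statement that saves the work, not dropping the columns later in a Transformer stage.

    SELECT CUST_NO, CUST_NAME, ADDR1, ADDR2, CITY, STATE, ZIP, PHONE, FAX, NOTES FROM CUSTOMERS

    SELECT CUST_NO, CUST_NAME, STATE FROM CUSTOMERS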

Now, your question is not for everyone, because you state most of your sources are Unidata. If you don't work with Unidata and Universe, STOP READING.

Unidata and Universe have multivalued data which, when selected via their stages, have the ability to explode that multivalued data into many rows. So, if you select columns you don't want, Universe/Unidata is going through the exercise of exploding those extraneous columns, which can lead to significant numbers of bytes depending on your situation.
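A purely hypothetical illustration of that explosion: suppose an ORDERS record carries two associated multivalued attributes, PART.NO and QTY.

    Record 1001
      PART.NO:  A17 / B42 / C09
      QTY:      5 / 2 / 8

Normalized on that association, DataStage delivers three rows:

    1001, A17, 5
    1001, B42, 2
    1001, C09, 8

Every extra multivalued column you add to the selection is dragged through the same explosion, whether or not you ever map it downstream.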

IT IS MY RECOMMENDATION THAT ANYONE SOURCING UNIVERSE/UNIDATA SET UP AN ACCOUNT SPECIFIC TO DATASTAGE, WHEREBY YOU CAN CREATE F-POINTERS TO THE SOURCE FILES AND DICTIONARIES LOCAL TO THAT ACCOUNT. YOU THEN SHOULD CREATE "CLEAN" DICTIONARIES THAT ACCURATELY DESCRIBE YOUR PHYSICAL DATA. YOU SHOULD AVOID T-CORRELATIVES AND I-DESCRIPTORS, AS THEY OFTEN DO NOT DEPICT THE DATA WITHIN THE FILE.
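A rough sketch of that setup, assuming a UniVerse-style account (all names, paths and record layouts below are illustrative only, and the exact dictionary format differs slightly on UniData): an F-pointer in the DataStage account's VOC points at the production data file but at a local, cleaned-up dictionary, and that dictionary holds plain D-type items describing only the physical attributes.

    VOC item CUSTOMER (in the DataStage account)
    001: F
    002: /prod/data/CUSTOMER
    003: D_CUSTOMER

    D_CUSTOMER item CUST.NAME
    001: D
    002: 2
    003:
    004: Customer Name
    005: 30L
    006: S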

Feel free to disagree; that's just MY opinion. Your dictionaries are probably full of LOTS of junk, and if you are selecting T-correlatives and I-descriptors only to throw them away once they reach the DS server, you ARE WASTING CPU CYCLES on the source system.

Kenneth Bland
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Post by raju_chvr »

Wow, Kenneth, that stuff you posted about UniData is amazing. I have no idea about UniData at all.

thanks for that.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

On the other hand, you sometimes do want to be selective on values in the multi-valued fields, by adding WHEN clauses to the retrieval criteria. It's a matter of what you really need brought forward into your DataStage job. DataStage can handle multi-valued columns; it's then a case of whether the ETL target can!
Analyze this critically; the secret of success is, as with database tables, to retrieve only the columns and only the rows that your ETL job actually needs.
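For example (the file and field names are invented), a query of this shape, whether typed at TCL or reflected in the stage's selection criteria, brings forward only the multivalues that satisfy the WHEN clause rather than the whole association:

    LIST ORDERS ORDER.DATE PART.NO QTY WHEN QTY GT "10"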

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Post by raju_chvr »

Thanks Ray Wurlod, I am working on it now.