Inline/embedded SQL

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Inline/embedded SQL

Post by rupeshg »

Hi All,

I need to pull one column from a database table for which I have the following query

SELECT FORM_EDITION_DT
INTO   :PB90H-FORM-EDITION-DT
FROM   FORM_TAB_V A
WHERE  A.FRM_ST_CD = :PB90H-FRM-ST-CD
AND    A.FORM_ID   = :PB90H-FORM-ID
AND    A.FORM_EDITION_DT = (SELECT MAX(FORM_EDITION_DT)
                            FROM   FORM_TAB_V
                            WHERE  FRM_ST_CD = A.FRM_ST_CD
                            AND    FORM_ID   = A.FORM_ID
                            AND    FORM_EDITION_DT <= :PB90H-FORM-EDITION-DT
                            AND    POL_SYMBOL_CD   =  A.POL_SYMBOL_CD)
AND    DOCUMENT_CD = (SELECT MIN(DOCUMENT_CD)
                      FROM   FORM_TAB_V
                      WHERE  FRM_ST_CD = A.FRM_ST_CD
                      AND    FORM_ID   = A.FORM_ID
                      AND    FORM_EDITION_DT = A.FORM_EDITION_DT
                      AND    POL_SYMBOL_CD   = A.POL_SYMBOL_CD)
AND    A.POL_SYMBOL_CD = :PB90H-POL-SYMBOL-CD

Now, the variables prefixed with a colon (:) are host variables that will hold values at run time. The question is how do I handle this in a Relational stage, or is there a workaround using multiple stages? This is a tricky one.

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

The host variable values being specified in the query will have to come into the job as job parameters. Enticing DataStage to use these job parameters in the WHERE clause with anything other than a simple SELECT statement can be problematic.

How many total rows are there in FORM_TAB_V?

Mike
rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Post by rupeshg »

Mike,

Truthfully, I did not give enough clarity on what exactly my problem is.

Yes, there are many rows in the Form_Tab table.

But the issue is that in the query some of the variables take their values from the input being read. For example, the variables :PB90H-FORM-ID and :PB90H-POL-SYMBOL-CD would get their values from the input stream.

In this case, if I define them as job parameters I won't be able to set them dynamically for each incoming row.

To give you a picture, consider a flat file as input (which has Column A) followed by a Lookup stage that looks up a Relational stage. In the Relational stage I put an SQL query that should accept this Column A value in its WHERE clause. I guess I cannot use a job parameter for this, as I would need to set it for each row.
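
Sketching the flow I mean (just to illustrate):

Code:

                 Relational stage
    (SQL using Column A in its WHERE clause)
                       |
                       |
                      \|/
Flat file (Column A) -> Lookup -> output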

Hope I didn't confuse you. In plain words, I was worried about the inline SQL, but now it's more complicated. Let me know if you need more information.

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

More info needed: "Yes, there are many rows" doesn't tell me anything. :?

Are we talking 100s, 1000s, millions, billions?

You've now mentioned a flat file input source that contains values for at least two of your host variables. How many rows in this flat file?

Where are the values for your other two host variables (PB90H-FRM-ST-CD and PB90H-FORM-EDITION-DT) coming from?

Mike
rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Post by rupeshg »

Yes, we are talking about billions of rows coming from the input and thousands of rows in the lookup table.

The other two host variables are set to defaults: PB90H-FRM-ST-CD = 'UK' and PB90H-FORM-EDITION-DT = CURRENT_DATE.
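
So in the query those two host variables would simply become constants; the affected lines would read:

Code:

WHERE A.FRM_ST_CD = 'UK'
AND   FORM_EDITION_DT <= CURRENT DATE  -- in the subselect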

Hope that really helps.

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

OK, we're making progress on the specification, though I still don't have a clear enough picture to recommend design alternatives.

More clarification still needed:

Is the relational table, FORM_TAB_V, the input source with the billions of rows, or is it the input source with thousands of rows?

Is the flat file containing POL-SYMBOL-CD and FORM-ID the input source with the billions of rows, or is it the input source with thousands of rows?

What about the other host variables? You state that FRM-ST-CD and FORM-EDITION-DT are set to defaults ('UK' and CURRENT DATE). How/where are non-default values obtained? Or does each job run have a different set of defaults?

Mike
rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Post by rupeshg »

Mike,

As I mentioned earlier, "billions of rows coming from the input and thousands of rows in the lookup table" should explain it all.

But to clarify:

Is the relational table, FORM_TAB_V, the input source with the billions of rows, or is it the input source with thousands of rows?

Answer: Form_Tab_V is a relational table and has thousands of rows.

Is the flat file containing POL-SYMBOL-CD and FORM-ID the input source with the billions of rows, or is it the input source with thousands of rows?

Answer: The input flat file has billions of rows.

What about the other host variables? You state that FRM-ST-CD and FORM-EDITION-DT are set to defaults ('UK' and CURRENT DATE). How/where are non-default values obtained? Or does each job run have a different set of defaults?

Answer: To make it clearer, I would keep these two defaults hardcoded in the SQL query, as they are the same for every incoming row. In short, I won't keep them as variables.

I think I have given more clarity now, so I would appreciate your valuable input on this one.

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Thank you Rupesh.

Sorry for forcing you to spell things out for me. I just wanted to make sure I understood your specifications clearly. With an input source containing "billions" of rows, a poor design decision will have devastating performance implications.

First, a couple of points:
1) You do not want to do a direct lookup to the relational table. Issuing "billions" of SELECT statements is something to be avoided.
2) You do not want to sort a file with "billions" of rows. A design that avoids the need to sort will be a good thing.

Now for a design option:
I would suggest creating 2 jobs for this task:

1) A simple job that will extract the relational lookup data into a flat file. This flat file will be used as the reference input into a hash lookup in the second job.

Three stages:

Code:

relational stage -> transformer -> flat file

Code:

SELECT A.FORM_ID,
       A.POL_SYMBOL_CD,
       A.FORM_EDITION_DT 
FROM   FORM_TAB_V A 
WHERE  A.FRM_ST_CD       = 'UK' 
AND    A.FORM_EDITION_DT = (SELECT MAX(B.FORM_EDITION_DT) 
                            FROM   FORM_TAB_V B 
                            WHERE  B.FRM_ST_CD       =  A.FRM_ST_CD
                            AND    B.FORM_ID         =  A.FORM_ID 
                            AND    B.FORM_EDITION_DT <= CURRENT DATE 
                            AND    B.POL_SYMBOL_CD   =  A.POL_SYMBOL_CD) 
AND    A.DOCUMENT_CD     = (SELECT MIN(C.DOCUMENT_CD)
                            FROM   FORM_TAB_V C
                            WHERE  C.FRM_ST_CD       = A.FRM_ST_CD 
                            AND    C.FORM_ID         = A.FORM_ID 
                            AND    C.FORM_EDITION_DT = A.FORM_EDITION_DT 
                            AND    C.POL_SYMBOL_CD   = A.POL_SYMBOL_CD) 
2) A job that reads the input, does a conditional hash lookup, and writes the output.

Six stages:

Code:

                                       Flat File (from job 1)
                                          |
                                          |
                                         \|/
Flat file (billions) -> Transformer -> Lookup -> Transformer -> Flat file
The hash lookup (make sure you specify a hash lookup) will cause the lookup file to be read into memory at the start of the job. Lookups will take place at memory I/O speed instead of disk I/O speed. The conditional lookup can be used to skip the lookup operation when the lookup keys from the current row are the same as the keys from the previous input row. Normally I wouldn't bother with the conditional lookup on a hash lookup, but multiplying even a millisecond in savings by billions can be significant.
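
Purely as a sketch (the stage variable names and syntax are illustrative pseudocode, not actual generated code), the skip condition amounts to remembering the previous row's keys in stage variables and comparing:

Code:

* Pseudocode for the Transformer ahead of the Lookup (illustrative only)
svSameKeys     : (in.FORM_ID = svPrevFormId) AND (in.POL_SYMBOL_CD = svPrevSymbolCd)
svPrevFormId   : in.FORM_ID
svPrevSymbolCd : in.POL_SYMBOL_CD

When svSameKeys is true, skip the lookup and reuse the previous row's result; since stage variables evaluate top to bottom, the comparison sees the previous row's values before they are overwritten.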

Mike
rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Post by rupeshg »

Thanks a million Mike.

That really helped. So far I was following the same approach, but did not think of extracting the lookup data to a file first and then doing the lookup.

Code:

                     DB2 table with query
                             |
                             |
                             V
input----->Transformer---->lookup----->Transformer------>DB2 Loadready File.
Definitely there would be a performance hindrance if it were not split out this way.

Just out of curiosity, what if one of the variables were inside an inline query?

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

For performance reasons, you'll want to extract the lookup data to a flat file first. Attaching the relational stage directly to the lookup stage results in those "billions" of SELECT statements that you want to avoid.
"what if one of the variables were inside an inline query?"
This isn't possible with your query. Host variables cannot be directly coded into the SQL statement. The best that you can hope for is that the code optimizer will push a transformer constraint back into the SQL WHERE clause. The optimizer probably won't even attempt that with user-defined SQL (which is required due to the subselect clauses).

With a hash lookup, it won't make much difference whether you are matching on one key or two keys.

Note: I've made the assumption that when you said the lookup file contains thousands of rows, that was the total volume for the entire lookup table (and not the volume for a single combination of form id and symbol code). If the lookup file volume will be significantly higher than about 100,000 rows, we will likely have to revisit the design options.

Mike
rupeshg
Premium Member
Posts: 60
Joined: Thu Dec 22, 2005 6:02 am
Location: United Kingdom

Post by rupeshg »

Thanks Mike.

Actually, I didn't understand the part in the following quote:
"The best that you can hope for is that the code optimizer will push a transformer constraint back into the SQL WHERE clause."
Do you mean to say that one of the constraints in the WHERE clause can be put into a Transformer constraint?

And if yes, how are the subselect query's parameters taken care of?

What code optimizer are we talking about here?

Hope you understand my question.

Regards,
Rupesh
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

The code generation engine generates COBOL code and then performs an optimization pass on it. For example, columns that aren't referenced in a downstream stage won't be passed on. The SQL for a relational stage will be optimized as well. If you select a column that isn't used downstream, the SQL that is passed to DB2 won't include that column. If there is a simple constraint in a transformer immediately after the relational stage, then that constraint may be converted to a SQL WHERE clause during optimization.

In the relational stage, you can't code something like:

Code:

SELECT KEY1, COL1
FROM    TBL
WHERE  KEY1 = :JOBPARM
You can code this:

Code:

SELECT KEY1, COL1
FROM    TBL
After the relational stage, you can put a constraint in the transformer as something like KEY1 = JOBPARM.

The generated code should include "WHERE KEY1 = :JOBPARM" in the generated SQL statement.

Mike
dls
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

It has been a while since I last worked with DS390, but I think the Relational stage has been enhanced to allow the selection of job parameters in the Where grid.

Early on, as Mike said, the only way to accomplish this was through code optimization.
Mike
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

Dls,

Thanks for the update. That would be a really nice enhancement. I haven't touched a mainframe job in quite a while. You wouldn't happen to remember which version included that enhancement, would you? I don't think it was available as of 7.5.1.

Mike
dls
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

Mike, maybe someone with access to the MF version can confirm this. I'm working on the server platform right now.

My on-line documentation for v6.0 supports what I recall, though.