Querying a DSN from within a controlling job

spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Querying a DSN from within a controlling job

Post by spracht »

Dear all,

I sometimes want to read a single value from a database table into a variable within a controlling job, but I couldn't find anything on how this is done. I'm thinking of a routine that would just need the DSN name, user, password and the SQL statement. Any idea? Thank you very much in advance.

Stephan
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

You have two choices.

One is to code using BCI functions, which mimic ODBC functions. To do this from a UNIX-based DataStage server, you will need to have licensed your ODBC driver.
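
For what it's worth, a very rough sketch of such a routine using BCI calls might look like the following; treat it as pseudo-code and check every function signature and constant against the BASIC SQL Client Interface chapter of the BASIC guide before relying on it:

* Pseudo-code sketch only: Arg1 = DSN, Arg2 = user, Arg3 = password,
* Arg4 = SQL statement returning a single value.
$INCLUDE UNIVERSE.INCLUDE ODBC.H   ;* SQL.SUCCESS etc. - check the exact include name

Ans = ""
Status = SQLAllocEnv(hEnv)
Status = SQLAllocConnect(hEnv, hConn)
If SQLConnect(hConn, Arg1, Arg2, Arg3) = SQL.SUCCESS Then
   Status = SQLAllocStmt(hConn, hStmt)
   If SQLExecDirect(hStmt, Arg4) = SQL.SUCCESS Then
      Status = SQLBindCol(hStmt, 1, SQL.B.CHAR, Result)
      If SQLFetch(hStmt) = SQL.SUCCESS Then Ans = Result
   End
   Status = SQLDisconnect(hConn)
End
Status = SQLFreeConnect(hConn)
Status = SQLFreeEnv(hEnv)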

The other is to construct a small job that executes the SQL statement via ODBC and deposits the result in, for example, a text file or hashed file, from which that result can be recovered. The small job can be run from the control job, which can subsequently use code (via a Routine Activity if you want to preserve the graphical interface) to retrieve the required result.
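
For example, the retrieval code in the controlling job might look something like this (a minimal sketch; the job name and file path are made up, and it assumes the small job writes the single value as the first line of a text file):

* Sketch only - "GetMaxCustId" and the file path are hypothetical names.
$INCLUDE DSINCLUDE JOBCONTROL.H   ;* needed in a routine for the DSJ.* constants

hJob = DSAttachJob("GetMaxCustId", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
ErrCode = DSDetachJob(hJob)

TheValue = ""
OpenSeq "/tmp/max_custid.txt" To FileVar Then
   ReadSeq TheValue From FileVar Else TheValue = ""
   CloseSeq FileVar
End Else
   Call DSLogWarn("Could not open result file", "GetDBValue")
End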

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

Ray,

To get the ODBC drivers licensed (we are using the Merant drivers that came with our DS 5.2 server under HP-UX), who do we need to contact and what are the costs?

Thank you and kind regards
Stephan
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Create some simple reusable scripts for running SQL against remote databases. In your case, I'd write a generic shell script that can be handed a series of parameters (DSN, login, SQL script name, etc.) and runs that SQL script. You can capture the results and parse them. You don't have to mess with BCI, ODBC connections, whatever. Simply develop wrapper scripts around isql, sqlplus, dbaccess, whatever you need. You have to have those clients installed on the DataStage server anyway.
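
From the controlling job, calling such a wrapper might look roughly like this (the script name, path and variable names are invented for illustration):

* Sketch: run_sql.sh is a hypothetical wrapper around sqlplus/isql/dbaccess that
* takes a DSN, a login and a SQL script name, and prints the single result to stdout.
Command = "/dsadm/scripts/run_sql.sh ":DSNName:" ":DBLogin:" get_value.sql"
Call DSExecute("UNIX", Command, Output, SystemReturnCode)
If SystemReturnCode = 0 Then
   TheValue = Trim(Output<1>)   ;* first line of the captured output
End Else
   Call DSLogWarn("run_sql.sh failed: ":Output, "GetDBValue")
End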

Kenneth Bland
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

Yes, I'm currently doing it like that, but I thought I'd better do it with on-board means. A benefit I'm expecting is that a routine using BCI should work with any kind of DBMS?

Stephan
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Stephan

If you are reading one value into a variable, then take your select and output it to a hash file in a job that runs before this job. Then create a stage variable whose initial value calls the UtilityHashLookup routine and reads this value in. Like this:

Job1: (Outputs to hash file HashMaxKeys)
select max(CustId) from Customer -> "HashMaxKeys", "Customer"

Job2:
Stage Variable (Initial Value):
StartCustId -> UtilityHashLookup("HashMaxKeys", "Customer")

You have just called this routine once. It is extremely efficient. It is all contained in DataStage jobs and not routines. Jobs are visual and routines are not. DataStage is supposed to be a visual solution.

I think too many developers go out of their way to process data in shell scripts or call ODBC directly from a routine. This is not the best way to use DataStage. Sometimes it is necessary but most of the time it is not and creates unreadable and hard to maintain solutions. I hate to follow a developer who went out of his way to go around DataStage. Try to create solutions which stay within the Designer and avoid routines unless that is a lot faster or the only solution.

Ken's solution is: if you do not have an ODBC driver to use within DataStage, you can execute a shell script which writes its output to a sequential file, and that result is then read back into DataStage. Like:

Rtn = ExecSh("read_cust_max.sh", ErrCode)
OpenSeq "read_cust_max.txt" To MySeq Else Stop
ReadSeq MaxCustId From MySeq Else MaxCustId = 0
CloseSeq MySeq

---------------------------------------------------
read_cust_max.sh:
sqlplus -s user/password @read_cust_max.sql > read_cust_max.txt

---------------------------------------------------
read_cust_max.sql:
select max(CustId) from Customer;
exit;

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

For DataDirect drivers visit their website:
http://www.datadirect-technologies.com
(These are the drivers currently shipped with DataStage.)

For Merant drivers visit their website:
http://www.merant.com
(Are these guys still in the driver business?)

For OpenLink drivers visit their website:
http://www.openlinksw.com
(An alternative.)

Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

Thanks to all of you for your help! As I want to read the value into a variable within a controlling job, I stored the value in a hash file as Kim suggested and then used 'UtilityHashLookup' within the controlling job. After having learnt how to do that, I succeeded.

Thanks again and best regards
Stephan
calvinlo
Participant
Posts: 31
Joined: Thu Jul 17, 2003 2:55 am

Post by calvinlo »

Job1: (Outputs to hash file HashMaxKeys)
select max(CustId) from Customer -> "HashMaxKeys", "Customer"

Job2:
Stage Variable (Initial Value):
StartCustId -> UtilityHashLookup("HashMaxKeys", "Customer")

With this method, will there be concurrency problems if "CustId" is always changing? If so, how can they be solved?
Thank you

Cal
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

Cal

There are concurrency problems only if you have PX and are running multiple threads. If only one job is running then this is the preferred way to do it. The max value is the starting point. The new key = max key + @OUTROWNUM.

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Now you're getting into maintaining slowly changing dimensions, which has been covered more than once in this forum.

In summary, you pre-load the hashed file with the current maximum value of the surrogate key (which is independent of CustID).

Each time a CustID is processed that does not exist in the target database, one output link handles this; the new value of the surrogate key is derived on this link as the result of the lookup against the hashed file (which does not change) plus @OUTROWNUM.

A different output link handles CustID values that do already exist in the target database; in this case a new surrogate key value is not required. Typically, a separate hashed file is loaded with existing (current) CustID values and their corresponding surrogate key values. This hashed file may need to be maintained by the job itself, in which case you would choose "lock for updates" in the "pre-load file to memory" drop-down list, in order to prevent concurrency problems.
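
As a rough sketch of the derivations involved (the stage variable, link and file names here are only illustrative, following Kim's example above):

* Stage variable, initial value (evaluated once per job run):
*    MaxKey = UtilityHashLookup("HashMaxKeys", "Customer")
* Derivation on the "new CustID" output link:
*    SurrogateKey = MaxKey + @OUTROWNUM
* Derivation on the "existing CustID" output link:
*    SurrogateKey = lnkCurrentKeys.SurrogateKey   ;* value returned by the current-keys hashed file lookup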

The "solution" for PX does not involve hashed files. Unfortunately, if you're going to allocate unique values across all threads, it is unavoidable that you must single thread allocation. Once that point has been conceded there are several solutions; the best (if possible) is to use features of the target database, such as a sequence or a serial data type.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Ray's last paragraph is so important to PX users. It points out that slowly changing dimensions are difficult because of PX's "over-amped" approach. Because of shared-nothing parallel execution, how do you synchronize nodes to share a surrogate key sequence from an initial seed value?

Any solution that introduces a serialized assignment process (a DB-assigned SERIAL column) puts an intolerable restriction on your job's processing capabilities. This is why ETL solutions that do active reference queries against a database DO NOT SCALE. The more parallel jobs you have doing reference lookups against the same instance, the more likely you are to saturate your database with queries, and then no amount of instantiation will improve net processing. This is why you rehost data into hash files in Server and data sets in Parallel, so that your reference capabilities are not limited by database query saturation.

Well, when assigning surrogate keys you do not want to tie the database into these activities. In addition, PX only achieves its ultimate processing potential when it can stream data sets into the database, not actively query it while it is trying to transform. So you will want to think about how to give each node its own seed value, so that it can use that seed value as its base from which to assign. You end up with a solution where each node either has a known range from which to assign surrogates, or a seed value and node number from which to always assign unique surrogates using a mathematical expression.
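
For example, one illustrative interleaving expression (all names here are made up) that guarantees no two nodes ever assign the same value:

* NodeNumber runs from 1 to NumberOfNodes; @OUTROWNUM counts the rows on this node;
* Seed is the pre-fetched MAX(skey) from the target table.
SurrogateKey = Seed + NodeNumber + (@OUTROWNUM - 1) * NumberOfNodes
* e.g. Seed = 1000 on 4 nodes: node 1 assigns 1001, 1005, 1009, ... and node 2 assigns 1002, 1006, 1010, ...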


Kenneth Bland
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Here's a copy of a routine that a bunch of non-Ascential people and I collaborated on over the years for doing parallel surrogate key assignment. When you have instantiated Server jobs, you need to be able to assign surrogate keys in parallel without placing unnecessary burden on your jobs by managing a lock table or colliding on record locks in the hash file. So, this routine is passed 3 arguments (read the code to find out what they are) that deal with opening a hash file containing the seed value for a table (hopefully updated by a SELECT MAX(skey) FROM table job before running any jobs that use it). The routine then allocates blocks of numbers to the calling job, and when that job exhausts its internal block of numbers the routine locks and fetches the next range of values from the hash file. Very cool; you'll have to tune the block size for each job according to the job's characteristics. A smaller block size means more locks and file fetches but smaller gaps in the sequencing (who cares). A larger block size means fewer locks and file fetches but bigger gaps (who cares). If for some reason the common hash file stays locked for too long, the routine will time out and abort the job.

Enjoy!


COMMON /ParallelKeyAssignment/ LastSurrogateKeyUsed, PresizeChunk, KeysToUse, Initialized, F.FILE

FunctionName = "ParallelKeyAssignment"
LastSurrogateKeyFile = "ctl_MAX_SKEYS"

If Initialized # "INITIALIZED" OR KeysToUse = 0 Then
   If Initialized # "INITIALIZED" Then
      Call DSLogInfo("Initializing", FunctionName)
      OpenFailed = @TRUE
      OPEN LastSurrogateKeyFile TO F.FILE Then
         OpenFailed = @FALSE
      End Else
         TCL = "CREATE.FILE ":LastSurrogateKeyFile:" DYNAMIC"
         Call DSLogInfo("TCL: ":TCL, FunctionName)
         Call DSExecute("TCL", TCL, ScreenOutput, SystemReturnCode)
         Call DSLogInfo(ScreenOutput, FunctionName)
         OPEN LastSurrogateKeyFile TO F.FILE Then
            OpenFailed = @FALSE
         End
      End
      If OpenFailed Then
         Call DSLogFatal("Unable to open/create the file ":LastSurrogateKeyFile, FunctionName)
      End
      Initialized = "INITIALIZED"
      Call DSLogInfo("Initialization finished", FunctionName)
   End

   TableName = UPCASE(TRIM(Arg1))
   PresizeChunk = TRIM(Arg2)
   IncrementValue = TRIM(Arg3)
   If PresizeChunk = "" Then PresizeChunk = 5000

   Call DSLogInfo("Retrieving next surrogate key for ":TableName, FunctionName)

   RecordLocked = @FALSE
   TimeOut = 60 * 5 ; * 60 seconds * 5 minutes
   SleepPulse = 15
   Loop
      ReadU row From F.FILE, TableName LOCKED
         TimeOut -= 1
         Sleep 1
         SleepPulse -= 1
         If SleepPulse = 0 Then
            Call DSLogInfo("Waiting on last surrogate key for ":TableName, FunctionName)
            SleepPulse = 15
         End
         RecordLocked = @TRUE
      End Then
         RecordLocked = @FALSE
         CONVERT '"' TO '' IN row
      End Else
         RecordLocked = @FALSE
         row = ""
         row = 0
      End
      LastSurrogateKeyUsed = row
   Until TimeOut = 0 OR NOT(RecordLocked) Do Repeat

   If TimeOut > 0 Then
      Call DSLogInfo("Read last surrogate key ":LastSurrogateKeyUsed:" for ":TableName, FunctionName)
      NewLastSurrogateKeyUsed = LastSurrogateKeyUsed + PresizeChunk
      KeysToUse = PresizeChunk
      row = NewLastSurrogateKeyUsed
      WRITE row ON F.FILE, TableName

      LastSurrogateKeyUsed = LastSurrogateKeyUsed + IncrementValue
      KeysToUse = PresizeChunk - IncrementValue
      Ans = LastSurrogateKeyUsed
   End Else
      Call DSLogFatal("Unable to obtain lock on ":LastSurrogateKeyFile:" record ":TableName, FunctionName)
   End
End Else
   *
   * Increment the last surrogate key used
   *
   LastSurrogateKeyUsed = LastSurrogateKeyUsed + IncrementValue
   KeysToUse -= IncrementValue
   Ans = LastSurrogateKeyUsed
End
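
A transformer derivation calling the routine might look something like this (the table name, block size and increment are just example values):

NextSkey = ParallelKeyAssignment("CUSTOMER", 1000, 1)   ;* Arg1 = hash record key (table), Arg2 = block size, Arg3 = increment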



Kenneth Bland
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The final line of code ("Kenneth Bland") does not compile! [:D]
Sorry, couldn't resist.