Job reads its own link counts

PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Job reads its own link counts

Post by PhilHibbs »

I saw some "interesting" code yesterday that a customer had written. It had two Transformers, and the second Transformer called DSGetLinkInfo to get the link count of the input link to the first transformer, and compared the result to @INROWNUM. The intention here was to write out a summary record at the end of processing. I had a bit of an episode, but to my astonishment the job actually worked! I had a bit of a think about what was going on - was it picking up the result of the previous run? If so, it isn't reliable because the row count might change, and there's always the first run to worry about. Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row? If so, it would be unreliable on large files, and only "reliable" on small files if the server isn't busy.

I created a test case of my own, that read in a fairly large file, and compared the DSGetLinkInfo call to @INROWNUM in the constraint. When I ran it, it output two rows - one with an INROWNUM of 300, and one with an INROWNUM of 9667 which was the input row count. I spanked the developer soundly and sent him back to re-design his job.

Has anyone come across this dubious practice before? He said he'd done it loads of times on different projects (he's a contractor), so maybe it's one to watch out for. I hope he won't be doing it again.
Phil Hibbs | Capgemini
Technical Consultant
Cr.Cezon
Participant
Posts: 101
Joined: Mon Mar 05, 2007 4:59 am
Location: Madrid

Post by Cr.Cezon »

Yes, I use this practice a lot.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Job reads its own link counts

Post by chulett »

PhilHibbs wrote: Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?
There's really no such phenomenon going on. Unless you have row buffering turned on, a single record will go from beginning to end (or at least as far as the first 'blocking' stage) before the second record gets its turn. With it turned on, all bets are off as to what record will be where when, run over run. Unless there was a passive stage between the two Transformers?

There are much better ways to generate summary information.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

DSGetLinkInfo() will return a valid count only when called from the end-of-job.
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Re: Job reads its own link counts

Post by PhilHibbs »

chulett wrote:
PhilHibbs wrote: Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?
Unless you have row buffering turned on... With it turned on, all bets are off as to what record will be where when, run over run.

There are much better ways to generate summary information.
Yes, Row Buffering is turned on. I suggested that he make a first pass over the file, writing @INROWNUM to a hashed file under a single constant key so that only the final row count remains, and then reference that hashed file in a second pass over the file, comparing the stored count against @INROWNUM. He ended up using an Aggregator stage, but that's just a matter of personal preference. An Aggregator makes the intent clearer, actually, so I think he made the right choice.

Update: I just tried the same thing with Row Buffering turned off, and the link count never equals @INROWNUM. It starts off at 0 and stays 0 for the first 100 rows, then from row 101 on it is 100, until row 301, where it jumps up to 300, then at 359 it jumps to 358, and that's the pattern. Every few hundred rows, it jumps up to being one less than the @INROWNUM. It never catches up, not even at the end: for the last few rows the link count is a hundred or so behind.
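
To make the pattern in the update above concrete, here is a minimal Python sketch. It is not DataStage code and does not call DSGetLinkInfo. It assumes a hypothetical counter whose visible value is refreshed only every `publish_every` rows; the fixed interval of 100 and all of the names are illustrative assumptions, since the real refresh points are irregular (100, 300, 358, ...). The point is only that a count published periodically, and read while row N is being processed, always lags N, so an equality test against the current row number never fires.

```python
def published_counts(total_rows, publish_every=100):
    """Yield (row_number, published_count) pairs for a single-process pipeline.

    The true count advances on every row, but the value a reader can see
    (a stand-in for the link count) is refreshed only every `publish_every`
    rows. The interval and the names are assumptions for illustration.
    """
    published = 0
    for row_number in range(1, total_rows + 1):
        # While this row is being handled, the reader sees the last
        # published value, which cannot include the current row.
        yield row_number, published
        # Once the row is done, refresh the visible value at each boundary.
        if row_number % publish_every == 0:
            published = row_number


def rows_where_count_matches(total_rows, publish_every=100):
    """Return the row numbers at which the visible count equals the row number."""
    return [n for n, seen in published_counts(total_rows, publish_every) if seen == n]


if __name__ == "__main__":
    print("rows where the count matched:", rows_where_count_matches(9667))
    last_row, last_seen = list(published_counts(9667))[-1]
    print("last row:", last_row, "count visible at that point:", last_seen)
```

In this model the list of matching rows is empty for any input size, and the visible count is still behind on the final row, which is consistent with the behaviour observed with row buffering off.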

So, it can work for small files only if row buffering is turned on, but doesn't work for large files with or without row buffering. With buffering you risk duplicates, and without buffering the counts never match.
Phil Hibbs | Capgemini
Technical Consultant
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It never matches with row buffering off because (a) both Transformer stages are in the same process and (b) row counts are only updated periodically (you can see that by inspecting the generated source code).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

In other words...
ArndW wrote: DSGetLinkInfo() will return a valid count only when called from the end-of-job.
:wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

ArndW, chulett, I totally agree with what you're saying. I was just working out the mechanisms behind it. Interesting that it's periodic updates at the job end. I would have assumed it was latency in the database - an independent process updating the database and committing/flushing periodically, and the job process just picking the latest committed value from the database.

To Cr.Cezon, will you continue to use this technique? It's a tough call whether you should go back and redesign any jobs that do this ("if it ain't broke, don't fix it"), especially if it's in production code. What do you premium folks think? Where is this on the Richter scale of bad practice?
Phil Hibbs | Capgemini
Technical Consultant
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Actually, Ray's mention of 'periodic updates' was while the job was running. Arnd mentioned that the only safe place to get those stats was at 'end of job', meaning once it had completed, so I assume Ray was clarifying the why of the what.

I was curious exactly which 'practice' Cr.Cezon was claiming to use a lot, but hadn't gotten around to asking for clarification quite yet. Was waiting for all this to die down. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ag_ram
Premium Member
Posts: 524
Joined: Wed Feb 28, 2007 3:51 am

Post by ag_ram »

DataStage Release 7.5 Developer's Help wrote: DSGetLinkInfo Function:

Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle.
...
DSJ.LINKROWCOUNT: Integer - number of rows that have passed down a link so far.
I am curious about the 'periodic updates' of the link counts while the job is reading and writing. How often are they triggered? And how useful is this function for getting link counts for the job that is currently executing?
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK

Post by PhilHibbs »

I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process.
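
As a rough illustration of "count them up front as a separate process", here is a language-neutral sketch written in Python rather than as a DataStage job. The file layout (one row per line), the function names, and the shape of the summary record are all assumptions made for the example. The idea is that the total is fixed by a first pass before any row is processed, so the end-of-data test in the second pass does not depend on a counter that is still being updated.

```python
def count_rows(path):
    """First pass: count the rows in the file before any processing starts."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def process_with_summary(path):
    """Second pass: yield each row, then one summary record after the last row.

    The total comes from the up-front count, so the comparison against the
    current row number is made against a value that can no longer change.
    An empty file produces no rows and therefore no summary record.
    """
    total = count_rows(path)
    with open(path, "r", encoding="utf-8") as handle:
        for row_number, line in enumerate(handle, start=1):
            yield ("DATA", line.rstrip("\n"))
            if row_number == total:
                # Hypothetical summary format, chosen only for this example.
                yield ("SUMMARY", f"rows={total}")
```

In a Server job the same two-pass shape corresponds to the hashed file or Aggregator designs mentioned earlier in the thread: one pass establishes the count, and a later pass uses it.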
Phil Hibbs | Capgemini
Technical Consultant
sjutba
Participant
Posts: 2
Joined: Fri Jun 06, 2008 9:37 am
Location: Florida

Post by sjutba »

PhilHibbs wrote: I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process.
Hi,

Do you have a sample job of counting the number of rows in a file?
SamJutba
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

This is "hijacking the thread" - the question is unrelated to the subject of this thread. Please post a new topic.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.