Job reads its own link counts
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
I saw some "interesting" code yesterday that a customer had written. It had two Transformers, and the second Transformer called DSGetLinkInfo to get the link count of the input link to the first transformer, and compared the result to @INROWNUM. The intention here was to write out a summary record at the end of processing. I had a bit of an episode, but to my astonishment the job actually worked! I had a bit of a think about what was going on - was it picking up the result of the previous run? If so, it isn't reliable because the row count might change, and there's always the first run to worry about. Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row? If so, it would be unreliable on large files, and only "reliable" on small files if the server isn't busy.
I created a test case of my own, that read in a fairly large file, and compared the DSGetLinkInfo call to @INROWNUM in the constraint. When I ran it, it output two rows - one with an INROWNUM of 300, and one with an INROWNUM of 9667 which was the input row count. I spanked the developer soundly and sent him back to re-design his job.
Has anyone come across this dubious practice before? He said he'd done it loads of times on different projects (he's a contractor), so maybe it's one to watch out for. I hope he won't be doing it again.
Phil Hibbs | Capgemini
Technical Consultant
Re: Job reads its own link counts
PhilHibbs wrote: "Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?"

Unless you have row buffering turned on... With it turned on, all bets are off as to what record will be where when, run over run. There are much better ways to generate summary information.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Re: Job reads its own link counts
PhilHibbs wrote: "Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?"

chulett wrote: "Unless you have row buffering turned on... With it turned on, all bets are off as to what record will be where when, run over run."

Yes, Row Buffering is turned on. I suggested that he process the file once into a hashed file with a single key storing the @INROWNUM, and reference this hashed file, comparing against @INROWNUM, in a second pass of the file. He ended up using an Aggregator stage, but that's just a matter of personal preference. An Aggregator is more intentional, actually, so I think he made the right choice.
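For reference, the two-pass design described above can be sketched as Transformer derivations and a constraint. This is only an illustration - the hashed file, link, and column names are all invented:

```basic
* Pass 1 (Transformer -> Hashed File stage): one row per input row,
* always overwriting the same key, so the final write holds the total.
*   Key   column derivation:  "TOTAL"
*   Count column derivation:  @INROWNUM
*
* Pass 2 (Transformer with a reference lookup lnkCount on that hashed
* file, keyed on the constant "TOTAL"): the summary row fires only on
* the last record via a constraint such as:
*   @INROWNUM = lnkCount.Count
```

Unlike the DSGetLinkInfo trick, the hashed file is fully written before the second pass starts, so the comparison is deterministic.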
Update: I just tried the same thing with Row Buffering turned off, and the link count never equals @INROWNUM. It starts off at 0 and stays at 0 for the first 100 rows; from row 101 onwards it is 100, until row 301, where it jumps up to 300; then at row 359 it jumps to 358, and that's the pattern. Every few hundred rows, it jumps up to one less than the @INROWNUM. It never catches up, not even at the end; for the last few rows the count is a hundred or so behind.
So, it can work for small files only if row buffering is turned on, but doesn't work for large files with or without row buffering. With buffering you risk duplicates, and without buffering the counts never match.
Phil Hibbs | Capgemini
Technical Consultant
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It never matches with row buffering off because (a) both Transformer stages are in the same process and (b) row counts are only updated periodically (you can see that by inspecting the generated source code).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW, chulett, I totally agree with what you're saying. I was just working out the mechanisms behind it. Interesting that it's periodic updates at the job end. I would have assumed it was latency in the database - an independent process updating the database and committing/flushing periodically, and the job process just picking the latest committed value from the database.
To Cr.Cezon: will you continue to use this technique? It's a tough call whether you should go back and redesign any jobs that do this - "if it ain't broke, don't fix it" - especially if it's in production code. What do you premium folks think? Where is this on the Richter scale of bad practice?
Phil Hibbs | Capgemini
Technical Consultant
Actually, Ray's mention of 'periodic updates' was while the job was running. Arnd mentioned that the only safe place to get those stats was at 'end of job', meaning once it had completed, so I assume Ray was clarifying the why of the what.
I was curious exactly what 'practice' Cr.Cezon was claiming to have used so many times, but hadn't gotten around to asking for clarification quite yet. Was waiting for all this to die down.
-craig
"You can never have too many knives" -- Logan Nine Fingers
DataStage Release 7.5 Developer's Help wrote: "DSGetLinkInfo Function: Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle. ... DSJ.LINKROWCOUNT: Integer - number of rows that have passed down a link so far."

I am curious to know about the 'periodic updates' of the link counts for the read/write operations happening in the job. How often are they triggered? And how useful is this function with regard to link counts when dealing with the currently executing job?
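Per the help text quoted above, the place where DSJ.LINKROWCOUNT is dependable is job control, once the job being interrogated has finished. A minimal DataStage BASIC sketch - the job, stage, and link names here are illustrative, not from this thread:

```basic
* Run a job to completion, then read a final link row count.
hJob = DSAttachJob("MyJob", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

* Safe now: the job has finished, so the count is final.
RowCount = DSGetLinkInfo(hJob, "xfmProcess", "lnkOutput", DSJ.LINKROWCOUNT)
Call DSLogInfo("Rows down lnkOutput: " : RowCount, "RowCountCheck")

ErrCode = DSDetachJob(hJob)
```

Calling the same function from inside the running job, as the thread shows, races against however often the engine flushes its counters.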
I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process.
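As one way of counting "up front as a separate process": a before-job subroutine can shell out for the count before any rows flow. A hedged DataStage BASIC sketch, assuming a Unix engine tier; the file path is an invented example:

```basic
* Count the rows in the source file before the job processes it.
Call DSExecute("UNIX", "wc -l < /data/input/source.txt", Output, SysRet)
RowTotal = Trim(Output<1>)
Call DSLogInfo("Input file row count: " : RowTotal, "CountRows")
```

The total could then be passed into the job as a parameter and compared against @INROWNUM with no dependence on link statistics.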
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs wrote: "I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process."

Hi,
Do you have a sample job of counting the number of rows in a file?
SamJutba