Job reads its own link counts
-
- Premium Member
- Posts: 1044
- Joined: Wed Sep 29, 2004 3:30 am
- Location: Nottingham, UK
- Contact:
I saw some "interesting" code yesterday that a customer had written. It had two Transformers, and the second Transformer called DSGetLinkInfo to get the link count of the input link to the first transformer, and compared the result to @INROWNUM. The intention here was to write out a summary record at the end of processing. I had a bit of an episode, but to my astonishment the job actually worked! I had a bit of a think about what was going on - was it picking up the result of the previous run? If so, it isn't reliable because the row count might change, and there's always the first run to worry about. Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row? If so, it would be unreliable on large files, and only "reliable" on small files if the server isn't busy.
I created a test case of my own, that read in a fairly large file, and compared the DSGetLinkInfo call to @INROWNUM in the constraint. When I ran it, it output two rows - one with an INROWNUM of 300, and one with an INROWNUM of 9667 which was the input row count. I spanked the developer soundly and sent him back to re-design his job.
Has anyone come across this dubious practice before? He said he'd done it loads of times on different projects (he's a contractor), so maybe it's one to watch out for. I hope he won't be doing it again.
Phil Hibbs | Capgemini
Technical Consultant
Re: Job reads its own link counts
PhilHibbs wrote: "Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?"

Unless you have row buffering turned on... With it turned on, all bets are off as to what record will be where when, run over run. There are much better ways to generate summary information.
-craig
"You can never have too many knives" -- Logan Nine Fingers
Re: Job reads its own link counts
PhilHibbs wrote: "Was it relying on the fact that with a small row count, the first Transformer process will have finished reading the file before the second Transformer gets around to processing the first row?"

chulett wrote: "Unless you have row buffering turned on... With it turned on, all bets are off as to what record will be where when, run over run."

Yes, Row Buffering is turned on. I suggested that he process the file once into a hashed file with a single key storing the @INROWNUM, and reference this hashed file, comparing against @INROWNUM, in a second pass of the file. He ended up using an Aggregator stage, but that's just a matter of personal preference. An Aggregator is more intentional, actually, so I think he made the right choice.
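For reference, the two-pass design described above can be sketched as Transformer derivations and a constraint. This is only an illustration - the hashed file, link, and column names are all invented:

```basic
* Pass 1 (Transformer -> Hashed File stage): one row per input row,
* always overwriting the same key, so the final write holds the total.
*   Key   column derivation:  "TOTAL"
*   Count column derivation:  @INROWNUM
*
* Pass 2 (Transformer with a reference lookup lnkCount on that hashed
* file, keyed on the constant "TOTAL"): the summary row fires only on
* the last record via a constraint such as:
*   @INROWNUM = lnkCount.Count
```

Unlike the DSGetLinkInfo trick, the hashed file is fully written before the second pass starts, so the comparison is deterministic.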
Update: I just tried the same thing with Row Buffering turned off, and the link count never equals @INROWNUM. It starts off at 0 and stays at 0 for the first 100 rows; from row 101 onwards it is 100, until row 301, where it jumps up to 300; then at row 359 it jumps to 358, and that's the pattern. Every few hundred rows, it jumps up to one less than the @INROWNUM. It never catches up, not even at the end; for the last few rows the count is a hundred or so behind.
So, it can work for small files only if row buffering is turned on, but doesn't work for large files with or without row buffering. With buffering you risk duplicates, and without buffering the counts never match.
Phil Hibbs | Capgemini
Technical Consultant
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
It never matches with row buffering off because (a) both Transformer stages are in the same process and (b) row counts are only updated periodically (you can see that by inspecting the generated source code).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ArndW, chulett, I totally agree with what you're saying. I was just working out the mechanisms behind it. Interesting that it's periodic updates at the job end. I would have assumed it was latency in the database - an independent process updating the database and committing/flushing periodically, and the job process just picking the latest committed value from the database.
To Cr.Cezon: will you continue to use this technique? It's a tough call whether you should go back and redesign any jobs that do this - "if it ain't broke, don't fix it" - especially if it's in production code. What do you premium folks think? Where is this on the Richter scale of bad practice?
Phil Hibbs | Capgemini
Technical Consultant
Actually, Ray's mention of 'periodic updates' was while the job was running. Arnd mentioned that the only safe place to get those stats was at 'end of job', meaning once it had completed, so I assume Ray was clarifying the why of the what.
I was curious exactly what 'practice' Cr.Cezon was claiming to have used so many times, but hadn't gotten around to asking for clarification quite yet. Was waiting for all this to die down.
-craig
"You can never have too many knives" -- Logan Nine Fingers
DataStage Release 7.5 Developer's Help wrote: "DSGetLinkInfo Function: Provides a method of obtaining information about a link on an active stage, which can be used generally as well as for job control. This routine may reference either a controlled job or the current job, depending on the value of JobHandle. ... DSJ.LINKROWCOUNT: Integer - number of rows that have passed down a link so far."

I am curious to know about the 'periodic updates' of the link counts for the read/write operations happening in the job. How often are they triggered? And how useful is this function with regard to link counts when dealing with the currently executing job?
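Per the help text quoted above, the place where DSJ.LINKROWCOUNT is dependable is job control, once the job being interrogated has finished. A minimal DataStage BASIC sketch - the job, stage, and link names here are illustrative, not from this thread:

```basic
* Run a job to completion, then read a final link row count.
hJob = DSAttachJob("MyJob", DSJ.ERRFATAL)
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

* Safe now: the job has finished, so the count is final.
RowCount = DSGetLinkInfo(hJob, "xfmProcess", "lnkOutput", DSJ.LINKROWCOUNT)
Call DSLogInfo("Rows down lnkOutput: " : RowCount, "RowCountCheck")

ErrCode = DSDetachJob(hJob)
```

Calling the same function from inside the running job, as the thread shows, races against however often the engine flushes its counters.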
I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process.
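As one way of counting "up front as a separate process": a before-job subroutine can shell out for the count before any rows flow. A hedged DataStage BASIC sketch, assuming a Unix engine tier; the file path is an invented example:

```basic
* Count the rows in the source file before the job processes it.
Call DSExecute("UNIX", "wc -l < /data/input/source.txt", Output, SysRet)
RowTotal = Trim(Output<1>)
Call DSLogInfo("Input file row count: " : RowTotal, "CountRows")
```

The total could then be passed into the job as a parameter and compared against @INROWNUM with no dependence on link statistics.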
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs wrote: "I think the advice above is clear - do not rely on DSGetLinkInfo within a job. There may be exceptions to this, such as where there is a passive stage in between the link and the function call, but even then I would not rely on it. If you need the number of rows in a file or that match a condition, then count them up front as a separate process."

Hi,
Do you have a sample job of counting the number of rows in a file?
SamJutba