Number of readers per node

Satwika
Participant
Posts: 45
Joined: Mon Jan 02, 2012 11:29 pm

Post by Satwika »

ArndW wrote: Run your job in a 1-node configuration and see if the error remains. If it is still there with 1 node then your partitioning is not at the root of the problem.
Hi ArndW,

I don't have access to change the job to a single-node configuration. I can create the configuration file but am not able to use it in the job. Do you have any suggestions on this?

Thank you
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Yes, you should be able to add the parameter APT_CONFIG_FILE to your job and point it at a 1-node configuration at runtime.
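For reference, a minimal one-node configuration file looks like the sketch below; the fastname and the resource/scratch paths are placeholders that must be replaced with values valid for your own engine host:

Code: Select all

    {
        node "node1"
        {
            fastname "your_engine_host"
            pools ""
            resource disk "/path/to/resource" {pools ""}
            resource scratchdisk "/path/to/scratch" {pools ""}
        }
    }

Save the file somewhere the engine can read, add the environment variable $APT_CONFIG_FILE as a job parameter (Job Properties > Parameters > Add Environment Variable...), and point it at this file when you run the job.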
Satwika
Participant
Posts: 45
Joined: Mon Jan 02, 2012 11:29 pm

Post by Satwika »

Hi ArndW,

I tried with a single node, but the problem still exists. Any other suggestions, please?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

This sounds rather strange indeed.

If you run your job with 1 reader per node from the Designer with a 1-node $APT_CONFIG_FILE configuration, write down the number of records shown coming out of files 1, 2 and 3.

Then change the number of readers per node and re-run the job. Are the numbers displayed in the source stages now different or the same?
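If it is easier to capture the counts from the command line than from the Director monitor, the dsjob client can report per-link row counts; the project, job, stage and link names below are placeholders for your own:

Code: Select all

    # Report row-count information for one output link of the
    # Sequential File stage (substitute your own names):
    dsjob -linkinfo MyProject MyJob SeqFile_Source DSLink1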
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

*Note* This message was composed prior to the OP's last update, which indicates that partitioning is not the problem, so please treat the following as general advice on partitioning and not as a solution to the actual problem.
ArndW wrote: To keep things simple, hash both input links to the join on "Col1" and then sort both on Col1, Col2, Col3, Col5 and Col6.
Whether or not this is good advice depends on the cardinality of Col1, by which I mean how many distinct values the column contains, and also on how long the values in the column are. The ideal column to partition on is something short with very high cardinality. A postcode or ZIP code in a data set that covers a large geographical area is a good example: short, but with lots of different values. Account numbers and employee IDs are good hash partitioning keys; Country Name, not so good, especially if the only values are "USA" and "Canada". If one of Col1, Col2 and Col3 matches the "good" criteria I described, pick that as your partitioning key and sort by all three.

So if Col1 is "Country", Col2 is "ADDRESS_TEXT" and Col3 is "Zip" then Col3 is a good choice as long as it is populated for the large majority of your data set. Col1 is a reasonable choice if you have data for a lot of countries and the zip code is blank for a lot of your data. Col2 is the last resort as it is a long string value and thus expensive to calculate a hash value from, and should only be chosen if Col1 is low cardinality and Col3 is blank for a large portion of your data (more than, say, 10% at a guess).

You could partition on all three keys that the joins have in common, but in practice you only need a subset of those keys that provides high cardinality.
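As a rough illustration of why key cardinality matters (a Python sketch, not DataStage code; the values and the four-way partitioning are made up for the example):

Code: Select all

    from collections import Counter

    NUM_PARTITIONS = 4

    def partition_counts(keys):
        """Count how many rows a hash partitioner would send to each partition."""
        return Counter(hash(k) % NUM_PARTITIONS for k in keys)

    # Low cardinality: two distinct values, so at most two partitions get
    # rows and the rest sit idle.
    print(partition_counts(["USA", "Canada"] * 500_000))

    # High cardinality: a million distinct account numbers spread evenly,
    # roughly 250,000 rows per partition.
    print(partition_counts("ACC%08d" % i for i in range(1_000_000)))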
Last edited by PhilHibbs on Mon Oct 01, 2012 5:55 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

A quick-and-dirty way to run something on a single node is to add an integer column, set it always to 1, and partition on that. Particularly useful if there is just one Transformer in the middle of a job (or in a Shared Container) that you want to run on a single node without interfering with the rest of the job. Introduce the new column in one Transformer, partition on it on the way into the next Transformer (or Aggregator or whatever) and then partition on something more sensible on the way out. I used this in a Shared Container when I discovered that setting a Transformer to execute in Sequential mode was causing a downstream Aggregator stage to throw out a weird error message.

I prefer this to setting a node map constraint, as those are not visible on the canvas whereas the repartitioning icon is a visible indicator of what is happening, and the Transformer can be called something informative such as Tfm_AddDummyPartKey.

Again this isn't directly aimed at solving the actual problem, but is pertinent to the process of diagnosing issues like this.
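A sketch of why the dummy key works (again Python rather than DataStage; it assumes a hash-style partitioner over four nodes):

Code: Select all

    NUM_PARTITIONS = 4

    # Every row carries the same dummy key, so a hash partitioner maps all
    # of them to one partition: that stage effectively runs on a single
    # node while the rest of the job stays parallel.
    rows = [{"id": i, "dummy_part_key": 1} for i in range(10)]
    partitions = {hash(r["dummy_part_key"]) % NUM_PARTITIONS for r in rows}
    print(partitions)  # one partition number for all rows, e.g. {1}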
Last edited by PhilHibbs on Mon Oct 01, 2012 6:39 am, edited 3 times in total.
Phil Hibbs | Capgemini
Technical Consultant
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Phil - in his last post the OP stated that the problem remains even with a 1-node configuration, so the issue of hashing is a moot point (for the moment); while you are spot on regarding the cardinality of the hash keys, I believe the problems here are much more basic.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The column you titled "Output Records" is the output count from the Sequential File stage, correct? The number of records output from the Sequential File stage should be the same regardless of how many readers are used. Can you confirm that this is, or isn't, the case?