Number of readers per node

Satwika
Participant
Posts: 45
Joined: Mon Jan 02, 2012 11:29 pm

Post by Satwika »

ArndW wrote: Run your job in a 1-node configuration and see if the error remains. If it is still there with 1 node then your partitioning is not at the root of the problem.
Hi ArndW,

I don't have access to change the job to a single-node configuration. I can create the configuration file but am not able to use it in the job. Do you have any suggestions on this?

Thank you
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Yes, you should be able to add the parameter APT_CONFIG_FILE to your job and point it at a 1-node configuration at runtime.
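For reference, a minimal one-node configuration file looks like the sketch below; the fastname and the resource/scratch paths are placeholders that must be replaced with values valid for your own engine host:

Code: Select all

    {
        node "node1"
        {
            fastname "your_engine_host"
            pools ""
            resource disk "/path/to/resource" {pools ""}
            resource scratchdisk "/path/to/scratch" {pools ""}
        }
    }

Save the file somewhere the engine can read, add the environment variable $APT_CONFIG_FILE as a job parameter (Job Properties > Parameters > Add Environment Variable...), and point it at this file when you run the job.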
Satwika
Participant
Posts: 45
Joined: Mon Jan 02, 2012 11:29 pm

Post by Satwika »

Hi ArndW,

I tried with a single node, but the problem still exists. Any other suggestions, please?
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

This sounds rather strange indeed.

If you run your job with 1 reader per node from the Designer with a 1-node $APT_CONFIG_FILE configuration, write down the number of records shown coming out of files 1, 2 and 3.

Then change the number of readers per node and re-run the job. Are the numbers displayed in the source stages now different or the same?
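If it is easier to capture the counts from the command line than from the Director monitor, the dsjob client can report per-link row counts; the project, job, stage and link names below are placeholders for your own:

Code: Select all

    # Report row-count information for one output link of the
    # Sequential File stage (substitute your own names):
    dsjob -linkinfo MyProject MyJob SeqFile_Source DSLink1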
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

*Note* This message was composed prior to the OP's last update, which indicates that partitioning is not the problem, so please treat the following as general advice on partitioning and not as a solution to the actual problem.
ArndW wrote: To keep things simple, hash both input links to the join on "Col1" and then sort both on Col1, Col2, Col3, Col5 and Col6.
Whether or not this is good advice depends on the cardinality of Col1, by which I mean how many distinct values the column contains, and also on how long the values in the column are. The ideal column to partition on is something short with very high cardinality. A postcode or ZIP code in a data set that covers a large geographical area is a good example: short, but with lots of different values. Account numbers and employee IDs are good hash partitioning keys; Country Name, not so good, especially if the only values are "USA" and "Canada". If one of Col1, Col2 and Col3 matches the "good" criteria I described, pick that as your partitioning key and sort by all three.

So if Col1 is "Country", Col2 is "ADDRESS_TEXT" and Col3 is "Zip" then Col3 is a good choice as long as it is populated for the large majority of your data set. Col1 is a reasonable choice if you have data for a lot of countries and the zip code is blank for a lot of your data. Col2 is the last resort as it is a long string value and thus expensive to calculate a hash value from, and should only be chosen if Col1 is low cardinality and Col3 is blank for a large portion of your data (more than, say, 10% at a guess).

You could partition on all three keys that the joins have in common, but in practice you only need a subset of those keys that provides high cardinality.
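As a rough illustration of why key cardinality matters (a Python sketch, not DataStage code; the values and the four-way partitioning are made up for the example):

Code: Select all

    from collections import Counter

    NUM_PARTITIONS = 4

    def partition_counts(keys):
        """Count how many rows a hash partitioner would send to each partition."""
        return Counter(hash(k) % NUM_PARTITIONS for k in keys)

    # Low cardinality: two distinct values, so at most two partitions get
    # rows and the rest sit idle.
    print(partition_counts(["USA", "Canada"] * 500_000))

    # High cardinality: a million distinct account numbers spread evenly,
    # roughly 250,000 rows per partition.
    print(partition_counts("ACC%08d" % i for i in range(1_000_000)))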
Last edited by PhilHibbs on Mon Oct 01, 2012 5:55 am, edited 1 time in total.
Phil Hibbs | Capgemini
Technical Consultant
PhilHibbs
Premium Member
Posts: 1044
Joined: Wed Sep 29, 2004 3:30 am
Location: Nottingham, UK
Contact:

Post by PhilHibbs »

A quick-and-dirty way to run something on a single node is to add an integer column, set it always to 1, and partition on that. Particularly useful if there is just one Transformer in the middle of a job (or in a Shared Container) that you want to run on a single node without interfering with the rest of the job. Introduce the new column in one Transformer, partition on it on the way into the next Transformer (or Aggregator or whatever) and then partition on something more sensible on the way out. I used this in a Shared Container when I discovered that setting a Transformer to execute in Sequential mode was causing a downstream Aggregator stage to throw out a weird error message.

I prefer this to setting a node map constraint, as those are not visible on the canvas whereas the repartitioning icon is a visible indicator of what is happening, and the Transformer can be called something informative such as Tfm_AddDummyPartKey.

Again this isn't directly aimed at solving the actual problem, but is pertinent to the process of diagnosing issues like this.
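A sketch of why the dummy key works (again Python rather than DataStage; it assumes a hash-style partitioner over four nodes):

Code: Select all

    NUM_PARTITIONS = 4

    # Every row carries the same dummy key, so a hash partitioner maps all
    # of them to one partition: that stage effectively runs on a single
    # node while the rest of the job stays parallel.
    rows = [{"id": i, "dummy_part_key": 1} for i in range(10)]
    partitions = {hash(r["dummy_part_key"]) % NUM_PARTITIONS for r in rows}
    print(partitions)  # one partition number for all rows, e.g. {1}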
Last edited by PhilHibbs on Mon Oct 01, 2012 6:39 am, edited 3 times in total.
Phil Hibbs | Capgemini
Technical Consultant
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Phil - in his last post the OP stated that the problem remains even with a 1-node configuration, so the issue of hashing is a moot point (for the moment); while you are spot on regarding the cardinality of the hash keys, I believe the problems here are much more basic.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

The column you titled "Output Records" is the output count from the Sequential File stage, correct? The number of records output from the Sequential File stage should be the same regardless of how many readers are used. Can you confirm that this is, or isn't, the case?