Job running long: Changed Seq file to Dataset

vijayrc · Post by **vijayrc** » Mon Oct 23, 2006 1:13 pm

Hi,
Here's a scenario.
I designed a job with a Sequential File as input, and the job runs in a minute.
Now the input arrives as a Dataset, so I changed the Seq File to a Dataset, and the job takes 20 minutes.
[PS: I tried a simple job copying the Dataset to a Seq file, thinking that reading the Dataset was taking long, but it ran in a few seconds.]
No partitioning is involved. All runs use the same configuration.

I tried deleting the Seq File and its associated link and created a Dataset with a new link, but that didn't help.

Any light on this would be appreciated.
Thanks,
Vijay

ArndW · Post by **ArndW** » Mon Oct 23, 2006 1:38 pm

How many nodes in your configuration file? Do the data files reside on the same disk volume as your sequential file?

vijayrc · Post by **vijayrc** » Mon Oct 23, 2006 1:45 pm

ArndW wrote: How many nodes in your configuration file? Do the data files reside on the same disk volume as your sequential file? ...

Thanks.
[1] 4 nodes, and
[2] No. The Datasets and the Sequential File reside on different mount points [but on the same disk volume].

ArndW · Post by **ArndW** » Mon Oct 23, 2006 2:17 pm

What stage are you writing to in this job? Perhaps the dataset is repartitioning to match the output and therefore is taking longer.

vijayrc · Post by **vijayrc** » Mon Oct 23, 2006 2:30 pm

ArndW wrote:What stage are you writing to in this job? Perhaps the dataset is repartitioning to match the output and therefore is taking longer. ...

I have a Dataset as input, passed through Transformer, Filter, Sort and Aggregator stages, and finally funnelled through to an output Dataset.

kumar_s · Post by **kumar_s** » Mon Oct 23, 2006 7:00 pm

In the test you ran from Dataset to Sequential File, did it use the same input and output directories as your normal job?
As Arnd suggested, maintain 'Same' partitioning on as many stages as possible (ignore the warning for this case study).

vijayrc · Post by **vijayrc** » Tue Oct 24, 2006 7:34 am

kumar_s wrote: In the test you ran from Dataset to Sequential File, did it use the same input and output directories as your normal job?
As Arnd suggested, maintain 'Same' partitioning on as many stages as possible (ignore the warning for this case study).

Kumar, yes, the Datasets and Seq file have the same input and output directories. I have changed the partitioning to Same, and I still see the same effect. I'm trying with RCP off on a few stages and will keep you posted.
Thanks, Vijay

ArndW · Post by **ArndW** » Tue Oct 24, 2006 7:42 am

I am fairly certain that your job is doing a significant amount of I/O sorting and repartitioning and that the slowdown is due to these stages as opposed to the Dataset. Can you enable APT_DUMP_SCORE to see which processes you are actually running?

talk2shaanc · Post by **talk2shaanc** » Tue Oct 24, 2006 8:03 pm

Add the env variable APT_NO_SORT_INSERTION=true as a job parameter and test ur job, with partitioning as "Same" thru all the stages.

ray.wurlod · Post by **ray.wurlod** » Wed Oct 25, 2006 7:25 am

Why?

You have not seen a score, so you cannot assert that sorts have been inserted, and you do not know what partitioning has been used. In this job design, (Auto) should use the same partitioning right through, so forcing it to be Same achieves nothing.

talk2shaanc · Post by **talk2shaanc** » Wed Oct 25, 2006 11:12 am

ur job: Dataset as input, passed thru Transformer, Filter, Sort and Aggregator.

A few assumptions:
1. It's a linear flow, just one stream, as u have given.
2. In the Transformer u have some derivations, and in the next stage u are dropping/selecting some records.
3. Sort is used just to group the rows before aggregating.

I would design it as:
Step 1: While creating the dataset in JOB-A, I will sort the records on the keys which I will be using in the Aggregator for grouping, and hash partition on the highest-level grouping key.

Step 2: In JOB-B I will have
Dataset > Transformer > Aggregator > o/p stage
1. I will use "Same" partitioning throughout, so it wd stay hash partitioned thru out.
2. Combine the logic of the Transformer and Filter stages. This would eliminate the need for an extra stage (Filter).
3. Since the dataset is already sorted and partitioned on the Aggregator key, I don't have to insert a Sort stage before the Aggregator.
4. I will use APT_NO_SORT_INSERTION, as it's possible that DS inserts a sort before the Aggregator stage. You can check the score with APT_DUMP_SCORE before adding this.
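
To show why the pre-sorted, hash-partitioned dataset should make the extra sort unnecessary, here is a rough sketch in plain Python. It is not DataStage code: the function names, the `cust`/`amt` columns, the sample rows and the use of CRC32 as the hash are all made up for illustration, and the 4 partitions only mirror the 4 nodes mentioned earlier in the thread.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def hash_partition(rows, key, num_nodes):
    """Place each row in a partition chosen by hashing its grouping key.

    Rows with the same key value always land in the same partition.
    """
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        node = crc32(str(row[key]).encode()) % num_nodes
        partitions[node].append(row)
    return partitions


def aggregate_sorted(partition, key, value):
    """Sum `value` per `key` in a single pass.

    Only correct if the partition is already sorted on `key`, because
    groupby merges adjacent rows only.
    """
    return {
        k: sum(row[value] for row in group)
        for k, group in groupby(partition, key=itemgetter(key))
    }


# Hypothetical input rows.
rows = [
    {"cust": "A", "amt": 10},
    {"cust": "B", "amt": 7},
    {"cust": "A", "amt": 5},
    {"cust": "C", "amt": 3},
]

# "JOB-A": hash partition on the grouping key, then sort inside each partition.
partitions = [
    sorted(p, key=itemgetter("cust"))
    for p in hash_partition(rows, "cust", num_nodes=4)
]

# "JOB-B": keep the partitions as they are ("Same") and aggregate in one pass,
# with no further sort or repartition.
totals = {}
for p in partitions:
    totals.update(aggregate_sorted(p, "cust", "amt"))

print(totals)
```

Each key lives in exactly one partition, so the per-partition results can be merged without any key appearing twice. Whether the engine actually skips the sort in the real job is something only the score can confirm.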

ray.wurlod · Post by **ray.wurlod** » Wed Oct 25, 2006 11:37 am

Please write in English, otherwise we will send Borat to your site.

talk2shaanc · Post by **talk2shaanc** » Wed Oct 25, 2006 12:22 pm

ray.wurlod wrote:Please write in English, otherwise we will send Borat to your site.

There is no word in English called "Borat"

If it's slang, then please correct your English.

Secondly, we are here not to correct anybody's English but to correct DataStage understanding. If the language used by somebody is abusive or insulting, then we should raise a concern.

Thirdly, if you are against abbreviations, then protest and avoid using all the abbreviations in this world, even "won't", as "won't" is a contraction of "will not". **You never know, the words I am using now may become part of the dictionary tomorrow.**

ray.wurlod · Post by **ray.wurlod** » Wed Oct 25, 2006 1:28 pm

Borat, since it has a capital letter, is a proper noun (in both senses). It is a person's name, albeit a fictitious person, an alter ego of Sacha Baron Cohen. I leave the remaining research as an exercise for the reader.