
Job running long: Changed Seq file to Dataset

Posted: Mon Oct 23, 2006 1:13 pm
by vijayrc
Hi,
Here's a scenario.
Designed a job with a Sequential File stage as input; the job runs in a minute.
Now the input arrives as a Dataset, so I changed the Sequential File stage to a Dataset stage, and the job takes 20 minutes.
[PS: I tried a simple job copying the Dataset to a Sequential File, thinking that reading the dataset was the slow part, but it ran in a few seconds.]
No partitioning involved... all runs use the same configuration file.

Tried deleting the Sequential File stage and its associated link, and created a Dataset stage with a new link, but that didn't help.

Any light on this appreciated
Thanks,
Vijay

Posted: Mon Oct 23, 2006 1:38 pm
by ArndW
How many nodes in your configuration file? Do the data files reside on the same disk volume as your sequential file?

Posted: Mon Oct 23, 2006 1:45 pm
by vijayrc
ArndW wrote:How many nodes in your configuration file? Do the data files reside on the same disk volume as your sequential file? ...
Thanks.
[1] 4 nodes, and
[2] No - the Datasets and the Sequential File reside on different mount points [but on the same disk volume].
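For reference, a four-node configuration file is laid out along these lines (the host name and resource paths here are illustrative, not the actual values):

    {
        node "node1"
        {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/node1" {pools ""}
            resource scratchdisk "/scratch/ds/node1" {pools ""}
        }
        node "node2"
        {
            fastname "etlserver"
            pools ""
            resource disk "/data/ds/node2" {pools ""}
            resource scratchdisk "/scratch/ds/node2" {pools ""}
        }
        ... node3 and node4 defined the same way ...
    }

The data files of a Dataset are spread across the resource disk paths named for each node, whereas a Sequential File is written to whatever single path the stage names, which is why the mount point and disk volume question matters here.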

Posted: Mon Oct 23, 2006 2:17 pm
by ArndW
What stage are you writing to in this job? Perhaps the dataset is repartitioning to match the output and therefore is taking longer.

Posted: Mon Oct 23, 2006 2:30 pm
by vijayrc
ArndW wrote:What stage are you writing to in this job? Perhaps the dataset is repartitioning to match the output and therefore is taking longer. ...
I have a Dataset as input, passed through a Transformer, Filter, Sort and Aggregator, and finally funnelled through to an output Dataset.

Posted: Mon Oct 23, 2006 7:00 pm
by kumar_s
The test you made for Dataset to Sequential File - does it use the same input and output directories as your normal job does?
As Arnd suggested, maintain 'Same' partitioning on as many stages as possible (ignore the warnings for this case study).

Posted: Tue Oct 24, 2006 7:34 am
by vijayrc
kumar_s wrote:The test you made for Dataset to Sequential File - does it use the same input and output directories as your normal job does?
As Arnd suggested, maintain 'Same' partitioning on as many stages as possible (ignore the warnings for this case study).
Kumar, yes, the Datasets and the Sequential File use the same input and output directories. I have changed the partitioning to Same, and it still has the same effect. I'm trying with RCP off on a few stages... will keep you posted.
Thanks, Vijay

Posted: Tue Oct 24, 2006 7:42 am
by ArndW
I am fairly certain that your job is doing a significant amount of I/O, sorting and repartitioning, and that the slowdown is due to these stages rather than to the Dataset itself. Can you enable APT_DUMP_SCORE to see what processes you are actually running?
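To enable it, add the environment variable as a job parameter (or at the project level through the Administrator) and rerun the job:

    APT_DUMP_SCORE=True

The score is then written to the job log and shows the operators that were actually generated, the partitioning used on each link, and any tsort operators the engine inserted on its own.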

Posted: Tue Oct 24, 2006 8:03 pm
by talk2shaanc
Add the environment variable APT_NO_SORT_INSERTION=True as a job parameter and test your job... with partitioning set to Same through all the stages.

Posted: Wed Oct 25, 2006 7:25 am
by ray.wurlod
Why?

You have not seen a score, so you cannot assert that sorts have been inserted, and you do not know what partitioning has been used. In this job design, (Auto) should use the same partitioning right through, so forcing it to Same achieves nothing.

Posted: Wed Oct 25, 2006 11:12 am
by talk2shaanc
Your job: Dataset as input, passed through Transformer, Filter, Sort and Aggregator.

A few assumptions:
1. It's a linear flow, just one stream, as you have described.
2. In the Transformer you have some derivations, and in the next stage you are dropping/selecting some records.
3. Sort is used just to group the rows before aggregating.

I would design it as follows (roughly sketched below):
Step 1: While creating the dataset in JOB-A, I would sort the records on the keys that I will be using in the Aggregator for grouping, and hash partition on the highest-level grouping key.

Step 2: In JOB-B I would have
Dataset >> Transformer >> Aggregator >> output stage
1. I would use Same partitioning throughout, so the data stays hash partitioned all the way through.
2. Combine the logic of the Transformer and Filter stages; this eliminates the need for an extra stage (the Filter).
3. Since the dataset is already sorted and partitioned on the Aggregator keys, I don't have to insert a Sort stage before the Aggregator.
4. I would use APT_NO_SORT_INSERTION, as it's possible that DataStage inserts a sort before the Aggregator stage. You can check APT_DUMP_SCORE before adding this.
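Under those assumptions, the two-job layout would look roughly like this (stage names are illustrative):

    JOB-A:  source >> Sort (on the grouping keys, hash partitioned on the top-level key) >> Dataset

    JOB-B:  Dataset >> Transformer (derivations + filter constraint) >> Aggregator (Sort method) >> output Dataset
            job parameter: APT_NO_SORT_INSERTION=True

Because the dataset already carries the hash partitioning and the sort order, Same partitioning in JOB-B preserves both, and the Aggregator can group the rows without any further sorting.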

Posted: Wed Oct 25, 2006 11:37 am
by ray.wurlod
Please write in English, otherwise we will send Borat to your site.

Posted: Wed Oct 25, 2006 12:22 pm
by talk2shaanc
ray.wurlod wrote:Please write in English, otherwise we will send Borat to your site.
There is no word in English called "Borat". :shock: If it's slang, then please correct your English.

Secondly, we are here not to correct anybody's English but to correct their DataStage understanding. If the language somebody uses is abusive or insulting, then we should raise a concern.

Thirdly, if you are against abbreviations, then protest and avoid using all the abbreviations in this world, even "won't", since "won't" is a contraction of "will not". You never know; the words I am using now may become part of the dictionary tomorrow.

Posted: Wed Oct 25, 2006 1:28 pm
by ray.wurlod
Borat, since it has a capital letter, is a proper noun (in both senses). It is a person's name, albeit a fictitious person, an alter ego of Sacha Baron Cohen. I leave the remaining research as an exercise for the reader.