DSXchange

samyamkrishna

Has the job that creates this Dataset run successfully?
Try running it again, and then run the jobs that read it.

samyamkrishna

Sorry, Ray. I shouldn't have said sort.
That just confused everyone.

So will the Remove Duplicates stage in the first case partition again on the keys because it's in Auto mode?
That's what I meant to ask.

samyamkrishna

Just another thought on this. The job design is like this: ----> Sort (Partitioned: Hash) ----> RemoveDuplicate (Auto). In the above case, will the RemoveDuplicate stage sort again because it's on Auto? ----> Sort (Partitioned: Hash) ----> RemoveDuplicate (Partitioned: Same). In this case wi...
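
To make the two designs above concrete, here is a minimal Python sketch of the general idea only. It is not DataStage code and says nothing about what Auto actually does. It assumes a hash partition on the key followed by a sort within each partition; the function names and the `k`/`v` columns are made up for illustration. Once rows are hash-partitioned on the key, all duplicates of a key sit in the same partition, so a downstream de-duplication can work partition by partition without moving data again.

```python
from itertools import groupby
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen by a stable hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[crc32(str(row[key]).encode("utf-8")) % num_partitions].append(row)
    return parts


def sort_then_dedupe(rows, key, num_partitions):
    """Sort each partition on the key, then keep the first row per key.

    No repartitioning happens between the sort and the de-duplication,
    which is the effect of keeping the existing partitioning.
    """
    kept = []
    for part in hash_partition(rows, key, num_partitions):
        part.sort(key=lambda r: r[key])  # stable sort: input order kept per key
        for _, group in groupby(part, key=lambda r: r[key]):
            kept.append(next(group))  # first row of each run of equal keys
    return kept


rows = [{"k": k, "v": i} for i, k in enumerate([3, 1, 3, 2, 1, 3])]
deduped = sort_then_dedupe(rows, "k", num_partitions=4)
```

If the data were repartitioned on a different key between the two steps, duplicates could land in different partitions and the per-partition de-duplication would miss them.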

samyamkrishna

I have asked the Admins to monitor the resources while the job runs today.

It's a fixed-width file with 1300 bytes per record and 120 million records.
I have tried reading with multiple readers per node and reading from multiple nodes.
I don't really see any improvement from that.
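
For scale, the figures quoted in these posts can be turned into an implied average throughput. This is a rough sketch in Python, assuming the 1300 bytes is the record length and taking the "around 2 hours" run time at face value; the constant and function names are just for illustration.

```python
RECORD_BYTES = 1300          # fixed record width quoted in the post (assumed per record)
RECORDS = 120_000_000        # record count quoted in the post
RUN_SECONDS = 2 * 60 * 60    # "around 2 hours", so only approximate


def throughput(record_bytes, records, seconds):
    """Return total size and average read rates for a fixed-width file."""
    total_bytes = record_bytes * records
    return {
        "total_gb": total_bytes / 1e9,            # decimal gigabytes
        "rows_per_sec": records / seconds,
        "mb_per_sec": total_bytes / seconds / 1e6,
    }


stats = throughput(RECORD_BYTES, RECORDS, RUN_SECONDS)
for name, value in stats.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the file works out to about 156 GB, and a 2-hour run averages roughly 16,700 rows/sec, or about 21.7 MB/s. Comparing that number with what the disk or file system can sustain is one way to tell whether the read is I/O bound.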

samyamkrishna

Thanks Ray and Stuart...

samyamkrishna

Hi Ray/chulett,

That's great to know.
But it doesn't answer sg33's question or clear up my confusion.

Why do we need to partition if DS is intelligent enough to do it on its own?

samyamkrishna

If you specify the partitioning, DS doesn't have to spend time and effort working out which method is most efficient. It can just do what you have asked, which saves time.

Hope this helps.

samyamkrishna

Hi,

If the data you are joining is big, it's better to use hash partitioning with the join keys as the partitioning keys, especially if your jobs are running on multiple nodes. This will also improve performance.
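
Why hash partitioning on the join keys works can be shown with a small Python sketch. This is a generic illustration, not DataStage code; the function names and sample columns are made up. Both inputs are partitioned with the same hash on the join key, so rows with equal keys always land in the same partition and each partition can be joined on its own.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition using a stable hash of its key."""
    parts = defaultdict(list)
    for row in rows:
        parts[crc32(str(row[key]).encode("utf-8")) % num_partitions].append(row)
    return parts


def partitioned_join(left, right, key, num_partitions):
    """Inner join done partition by partition; no partition sees another's rows."""
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    joined = []
    for p in range(num_partitions):
        # Index the right-hand rows of this partition by key.
        index = defaultdict(list)
        for r in right_parts.get(p, []):
            index[r[key]].append(r)
        # Match each left-hand row against the index of the same partition.
        for l in left_parts.get(p, []):
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined


left = [{"id": i % 5, "a": i} for i in range(20)]
right = [{"id": i, "b": i * 10} for i in range(5)]
result = partitioned_join(left, right, "id", num_partitions=4)
```

The result contains the same rows as an unpartitioned join, whatever the number of partitions. If the two inputs were partitioned on different keys, matching rows could end up on different nodes and would be missed.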

samyamkrishna

I tried reading four 40 GB files instead of one 160 GB file.
Same result.

I also tried cat on the 160 GB file: same result, maybe 10 minutes faster.
Not sure what to do.

What's a good read time for a 160 GB file?

samyamkrishna

Thanks for your suggestions. qt_ky, I have a local server admin. What should I ask him to look for? And ArndW, the test job has only the sequential file and a Peek. It starts very fast but slows down within 5 minutes and still takes 2 hours. Is there anything else I can try? I am also planning to ...

samyamkrishna

I don't have access to Director on Prod.
I'm trying to get access.

I will post my findings once I get hold of the logs.

samyamkrishna

Another thing to add.

When I read from multiple nodes or with multiple readers, it starts out really fast, at 190,000 to 200,000 rows/sec,
but after a while it slows right down to 50,000 rows/sec.

This trial job only has a Sequential File stage ------> Peek.

samyamkrishna

After the read it's just doing a column import and writing into a dataset. In the log as well, the Sequential File stage takes about 2 hours to read the whole file. After another 10 minutes the column import finishes and the job completes. I have also tried reading the same file with just...

samyamkrishna

Hi All,

The job reads a fixed-width sequential file of around 150 GB.
It runs for around 2 hours.

I have tried multiple nodes and multiple readers. That doesn't seem to help.

Is there anything else I can do to improve the performance?

Regards,
Samyam

samyamkrishna

Stuart, I am worried about the execution time. Thanks for the hints on what to look at. I will go through them to reach a conclusion. rjdickson, yes, there are overrides, but I am not sure if they apply to all the columns. I will check that too. The question is not based on curiosity. We are having issue...

Search found 258 matches

Re: The partition was evidently corrupted

Re: Confusion on Partitioning for JOin stage

Seq File Performance