Seq File Performance

Posted: Fri Dec 04, 2015 1:28 pm
by samyamkrishna
Hi All,

The job reads a fixed-width sequential file of around 150 GB.
It runs for around 2 hours.

I have tried multiple nodes and multiple readers; it doesn't seem to help.

Is there anything else I can do to improve the performance?

Regards,
Samyam

Posted: Fri Dec 04, 2015 2:04 pm
by chulett
I would imagine it's as much about what happens after the read as the reading itself. What makes you think the reading of the file is the bottleneck in your job?

Posted: Fri Dec 04, 2015 2:12 pm
by samyamkrishna
After the read it's just doing a Column Import and writing to a dataset.

In the log as well, the Sequential File stage takes about 2 hours to complete reading the whole file.

After another 10 minutes the Column Import finishes and the job completes.

I have also tried reading the same file with just a Peek stage after the Sequential File stage. It still takes 2 hours.

Posted: Fri Dec 04, 2015 2:48 pm
by samyamkrishna
Another thing to add.

When I read from multiple nodes or readers, it starts reading really fast initially, at 190,000 to 200,000 rows/sec, but after a while it slows down to 50,000 rows/sec.

This trial job only has a Sequential File stage ----> Peek.

Posted: Fri Dec 04, 2015 7:21 pm
by qt_ky
So, your read throughput is around 21 MB/s? That sounds pretty slow. Do you have a local server admin you can work with on this?
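For anyone checking that number, a quick sketch of the arithmetic in Python, using the 150 GB size and 2-hour runtime from the earlier posts:

    size_mb = 150 * 1024      # 150 GB expressed in MB
    seconds = 2 * 3600        # 2-hour runtime
    print(size_mb / seconds)  # ~21.3 MB/s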

Posted: Mon Dec 07, 2015 5:37 am
by ArndW
Make a copy of your job that just reads the file into a Peek stage and check the speed. If it remains slow in this test job, that narrows the problem down to the sequential read itself.

Posted: Mon Dec 07, 2015 8:56 am
by samyamkrishna
Thanks for your suggestions.

qt_ky,

I have a local server admin. What should I be looking for, or asking him to do?

and ArndW,

The test job has only the Sequential File stage and a Peek. It starts very fast but slows down within 5 minutes, and still takes 2 hours overall.

Is there anything else I can try?
I am also planning to split the file into smaller chunks of 40 GB and read them in parallel.

Posted: Mon Dec 07, 2015 9:58 am
by ArndW
Does a "cat <file> > /dev/null" go any faster?

Posted: Mon Dec 07, 2015 10:28 am
by samyamkrishna
I tried reading 4 files of 40 GB instead of one 160 GB file.
Same result.

cat on the 160 GB file gives the same result, maybe 10 minutes faster.
Not sure what to do.

What's a good read time for a 160 GB file?

Posted: Mon Dec 07, 2015 11:04 am
by chulett
I'm not sure there's an answer for that question as there are SO many factors that go into it. Never mind that one man's good is another man's great and yet another man's crap. :wink:

Posted: Tue Dec 08, 2015 10:47 am
by ArndW
If the "cat" took almost as long as the DataStage read, then the problem isn't in DataStage and nothing you will do there will significantly increase your speed.

Is there a SAN involved? What filesystem is used? Does the speed change if you copy the files to another partition (e.g. /tmp) and try to process them from there?

Posted: Wed Dec 09, 2015 9:21 am
by qt_ky
Ask your local admin to monitor server resources and performance during your tests, help identify the bottleneck, and see if anything may be changed to improve it.
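A minimal sketch of what that monitoring could look like, assuming the third-party psutil package is available on the server (Ctrl-C to stop):

    import time
    import psutil

    # Sample system-wide disk read throughput every 5 seconds while the job runs.
    prev = psutil.disk_io_counters().read_bytes
    while True:
        time.sleep(5)
        cur = psutil.disk_io_counters().read_bytes
        print("%.1f MB/s read" % ((cur - prev) / 5 / 1024**2))
        prev = cur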

Is it a delimited file or fixed width? How many columns? How many records?

Posted: Wed Dec 09, 2015 12:57 pm
by samyamkrishna
I have asked the Admins to monitor the resources while the job runs today.

It's a fixed-width file with a 1300-byte record length and 120 million records.
I have tried reading with multiple readers per node and reading from multiple nodes.
I don't really see any improvement from either.
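As a sanity check on the sizes quoted, a quick bit of Python using the 1300-byte record length and 120 million records above:

    record_bytes = 1300
    records = 120_000_000
    print(record_bytes * records / 1024**3)  # ~145 GiB (~156 GB), consistent with the 150-160 GB quoted earlier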