Seq File Performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Seq File Performance

Post by samyamkrishna »

Hi All,

The job reads a fixed-width sequential file of around 150 GB.
It runs for around 2 hours.

I have tried multiple nodes and multiple readers per node. It doesn't seem to help.

Is there anything else I can do to improve the performance?

Regards,
Samyam
Cheers,
Samyam
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would imagine it's as much about what happens after the read as the reading itself. What makes you think the reading of the file is the bottleneck in your job?
-craig

"You can never have too many knives" -- Logan Nine Fingers
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

After the read it is just doing a Column Import and writing to a dataset.

In the log as well, the Sequential File stage takes about 2 hours to complete reading the whole file.

After another 10 minutes the Column Import finishes and the job completes.

I have also tried reading the same file with just a Peek stage after the Sequential File stage. That also takes 2 hours.
Cheers,
Samyam
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Another thing to add: when I read from multiple nodes or with multiple readers, it starts reading really fast initially, at 190,000 to 200,000 rows/sec, but after a while it slows down to around 50,000 rows/sec.

This trial job only has a Sequential File stage ------> Peek.
Cheers,
Samyam
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

So, your read throughput is around 21 MB/s? That sounds pretty slow. Do you have a local server admin you can work with on this?
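For reference, that figure just comes from the numbers already posted: 150 GB is roughly 153,600 MB, and 2 hours is 7,200 seconds, so 153,600 / 7,200 works out to about 21 MB/s.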
Choose a job you love, and you will never have to work a day in your life. - Confucius
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Make a copy of your job that just reads the file and puts it into a PEEK stage and see what the speed is. That will help you narrow down the potential problem to the sequential read itself if the speed remains slow in this test job.
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Thanks for your suggestions.

qt_ky,

I have a local server admin. What should I be asking him to look for or do?

And ArndW,

The test job has only the Sequential File stage and a Peek. It starts very fast but slows down after 5 minutes and still takes 2 hours.

Is there anything else I can try out?
I am also planning to split the file into smaller chunks of 40 GB and read them in parallel.
Cheers,
Samyam
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Does a "cat <file> > /dev/null" go any faster?
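For example, something along these lines will time the raw read outside of DataStage (the file path is just a placeholder for your 150 GB file):

    # Time a raw sequential read, bypassing DataStage entirely
    time cat /path/to/your/file > /dev/null

    # Or read in large blocks; GNU dd on Linux prints a throughput figure when it finishes
    dd if=/path/to/your/file of=/dev/null bs=1M

If that on its own takes close to two hours, the bottleneck is the disk or filesystem rather than the job design.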
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

I tried to read 4 files of 40 GB instead of one 160 GB file.
Same result.

Cat of the 160 GB file gave the same result, maybe 10 minutes faster.
Not sure what to do.

What's a good read time for a 160 GB file?
Last edited by samyamkrishna on Mon Dec 07, 2015 12:16 pm, edited 1 time in total.
Cheers,
Samyam
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm not sure there's an answer for that question as there are SO many factors that go into it. Never mind that one man's good is another man's great and yet another man's crap. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

If the "cat" took almost as long as the DataStage read, then the problem isn't in DataStage and nothing you will do there will significantly increase your speed.

Is a SAN involved? What filesystem is used? Does the speed change if you copy the files to another partition (e.g. /tmp) and process them from there?
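For example (the paths are placeholders, and this only makes sense if the target partition actually has room for a file that size):

    # Copy the file to a different partition and time a raw read there
    cp /data/source/bigfile.dat /tmp/bigfile.dat
    time cat /tmp/bigfile.dat > /dev/null

If the copy on the other partition reads noticeably faster, that points at the SAN or filesystem the original file lives on rather than at DataStage.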
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Ask your local admin to monitor server resources and performance during your tests, help identify the bottleneck, and see if anything may be changed to improve it.

Is it a delimited file or fixed width? How many columns? How many records?
Choose a job you love, and you will never have to work a day in your life. - Confucius
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

I have asked the admins to monitor the resources while the job runs today.

It's a fixed-width file: 1,300 bytes per record, 120 million records.
I have tried reading with multiple readers per node and reading from multiple nodes.
I don't really see any improvement from that.
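For what it's worth, those numbers line up with the file sizes quoted earlier: 1,300 bytes per record × 120 million records is about 156 GB, i.e. roughly the 150-160 GB mentioned above.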
Cheers,
Samyam