Seq File Performance

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Seq File Performance

Post by samyamkrishna »

Hi All,

The job reads a fixed-width sequential file of around 150 GB.
It runs for around 2 hours.

I have tried multiple nodes and multiple readers per node. It doesn't seem to help.

Is there anything else I can do to improve the performance?

Regards,
Samyam
Cheers,
Samyam
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I would imagine it's as much about what happens after the read as the reading itself. What makes you think the reading of the file is the bottleneck in your job?
-craig

"You can never have too many knives" -- Logan Nine Fingers
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

After the read it is just doing a Column Import and writing to a dataset.

In the log as well, the Sequential File stage takes about 2 hours to complete reading the whole file.

After another 10 minutes the Column Import finishes and the job completes.

I have also tried reading the same file with just a Peek stage after the Sequential File stage. That also takes 2 hours.
Cheers,
Samyam
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Another thing to add: when I read from multiple nodes or with multiple readers, it starts reading really fast initially, at 190,000 to 200,000 rows/sec, but after a while it slows down to around 50,000 rows/sec.

This trial job only has a Sequential File stage ------> Peek.
Cheers,
Samyam
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

So, your read throughput is around 21 MB/s? That sounds pretty slow. Do you have a local server admin you can work with on this?
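For reference, that figure just comes from the numbers already posted: 150 GB is roughly 153,600 MB, and 2 hours is 7,200 seconds, so 153,600 / 7,200 works out to about 21 MB/s.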
Choose a job you love, and you will never have to work a day in your life. - Confucius
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Make a copy of your job that just reads the file and puts it into a PEEK stage and see what the speed is. That will help you narrow down the potential problem to the sequential read itself if the speed remains slow in this test job.
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Thanks for your suggestions.

qt_ky,

I have a local server admin. What should I be asking him to look for or do?

And ArndW,

The test job has only the Sequential File stage and a Peek. It starts very fast but slows down after 5 minutes and still takes 2 hours.

Is there anything else I can try out?
I am also planning to split the file into smaller chunks of 40 GB and read them in parallel.
Cheers,
Samyam
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Does a "cat <file> > /dev/null" go any faster?
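For example, something along these lines will time the raw read outside of DataStage (the file path is just a placeholder for your 150 GB file):

    # Time a raw sequential read, bypassing DataStage entirely
    time cat /path/to/your/file > /dev/null

    # Or read in large blocks; GNU dd on Linux prints a throughput figure when it finishes
    dd if=/path/to/your/file of=/dev/null bs=1M

If that on its own takes close to two hours, the bottleneck is the disk or filesystem rather than the job design.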
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

I tried to read 4 files of 40 GB instead of one 160 GB file.
Same result.

Cat of the 160 GB file gave the same result, maybe 10 minutes faster.
Not sure what to do.

What's a good read time for a 160 GB file?
Last edited by samyamkrishna on Mon Dec 07, 2015 12:16 pm, edited 1 time in total.
Cheers,
Samyam
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'm not sure there's an answer for that question as there are SO many factors that go into it. Never mind that one man's good is another man's great and yet another man's crap. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

If the "cat" took almost as long as the DataStage read, then the problem isn't in DataStage and nothing you will do there will significantly increase your speed.

Is a SAN involved? What filesystem is used? Does the speed change if you copy the files to another partition (e.g. /tmp) and process them from there?
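For example (the paths are placeholders, and this only makes sense if the target partition actually has room for a file that size):

    # Copy the file to a different partition and time a raw read there
    cp /data/source/bigfile.dat /tmp/bigfile.dat
    time cat /tmp/bigfile.dat > /dev/null

If the copy on the other partition reads noticeably faster, that points at the SAN or filesystem the original file lives on rather than at DataStage.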
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Ask your local admin to monitor server resources and performance during your tests, help identify the bottleneck, and see if anything may be changed to improve it.

Is it a delimited file or fixed width? How many columns? How many records?
Choose a job you love, and you will never have to work a day in your life. - Confucius
samyamkrishna
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

I have asked the admins to monitor the resources while the job runs today.

It's a fixed-width file: 1,300 bytes per record, 120 million records.
I have tried reading with multiple readers per node and reading from multiple nodes.
I don't really see any improvement from that.
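For what it's worth, those numbers line up with the file sizes quoted earlier: 1,300 bytes per record × 120 million records is about 156 GB, i.e. roughly the 150-160 GB mentioned above.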
Cheers,
Samyam