Performance issue in reading 6GB file
Moderators: chulett, rschirm, roy
Hi all,
I have a mainframe file as source; it is 6 GB in size and has around 3,500 fields. I tried to read this file with the Sequential File stage using the Schema File and RCP options, passing it through a Transformer to a Data Set. Reading the data from that dataset in another job (say, a 2nd job) takes around 25 minutes, so in the first job I wrote to a File Set instead of a Data Set and tried reading from the File Set in the 2nd job. There was no improvement. Is there any other way to improve the performance?
I used the same jobs to read another mainframe file with ~3,800 fields and they completed in 30 seconds, but that file was much smaller, not even 1 GB of data.
FYI, I also tried reading with CFF, but that takes 45 minutes.
I'd be grateful if anyone could help me out with this.
Thanks,
Ethel.
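For context, the throughput implied by the figures in the post can be estimated with a quick back-of-the-envelope calculation (the 6 GB in 25 minutes case works out to roughly 4 MB/s, which is very slow for a sequential read):

```python
# Rough effective read throughput implied by the reported timings.
size_gb = 6          # file size from the post
minutes = 25         # elapsed time from the post
mb_per_sec = size_gb * 1024 / (minutes * 60)
print(f"{mb_per_sec:.1f} MB/s")  # roughly 4.1 MB/s
```

A modern disk subsystem should sustain far more than this, which is why the replies below focus on parallelism and on where the bottleneck actually is.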
You have two problem areas, and I'm not sure which one you mean:
a) reading from flat file and writing to a dataset
b) reading the dataset.
Is your mainframe file fixed length? If so, you might be able to improve your read performance by using multiple readers and most likely improve your dataset write performance by choosing an optimal APT_CONFIG_FILE configuration.
The second problem might also be addressed by using more parallel nodes. It does depend upon your DataStage server hardware setup, though.
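For readers unfamiliar with APT_CONFIG_FILE: it points to a parallel configuration file that defines the processing nodes. A minimal two-node sketch is shown below; the node names, `fastname`, and the disk paths are all assumptions and must match your actual server (the same pattern is repeated for each additional node):

```
{
    node "node1"
    {
        fastname "dsserver"
        pools ""
        resource disk "/ds/data/node1" {pools ""}
        resource scratchdisk "/ds/scratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "dsserver"
        pools ""
        resource disk "/ds/data/node2" {pools ""}
        resource scratchdisk "/ds/scratch/node2" {pools ""}
    }
}
```

Spreading the `resource disk` entries across separate physical spindles or controllers is what typically improves dataset write performance.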
My mainframe file is fixed length. Here is my design, for better understanding.
ArndW wrote: You have two problem areas, and I'm not sure which you consider:
a) reading from flat file and writing to a dataset
b) reading the dataset.
Is your mainframe file fixed length? If so, you might ...
I have a mainframe fixed-width file and I used the Sequential File stage to read it (the options I used are already in my earlier post). It takes ~25 minutes to read. That is why I split the job in two: reading the data from the Sequential File stage and writing it to a Data Set, then using that dataset as the source for further processing.
Please correct me if I'm wrong anywhere.
Thanks.
How many CPUs does your system have, and how many nodes are in your APT_CONFIG_FILE? Can you experiment with timings using multiple readers per node to see if that speeds up reading your file? (To test this, write to a Peek stage instead of to the dataset.)
We have 4 nodes in the config file. I've raised the readers per node to 4 when using the Sequential File stage.
ArndW wrote: How many CPUs does your system have and how many nodes are in your APT_CONFIG_FILE? Can you experiment on timing using multiple readers per node to see if you speed up reading your file (in order to test this, write to a PEEK stage instead of to the dataset)
This issue is probably more closely related to how you have defined the record layout and the output record. If you include the GROUP fields etc., then it will take a long time to read the data, especially with rows that wide. I just had this issue with an 18 GB file and 1,800 columns, and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
Mike Hester
mhester@petra-ps.com
I would try turning on "Read from multiple Nodes".
Last edited by ArndW on Mon Oct 11, 2010 7:27 am, edited 2 times in total.
Do you also have "Read from multiple Nodes" turned on?
I did not include the GROUP in the schema file (I manually expanded it so as to avoid the subrecord creation). Also, my source is a Sequential File stage reading the mainframe fixed file, passed to a Transformer, then to a Column Import stage, and then to a target dataset. The Column Import stage is RCP enabled.
mhester wrote: This issue is probably more closely related to how you have defined the record layout and the output record. If you include the Group etc... then it will take a bunch of time to read the data especially with rows that wide. I just had this issue with an 18gb file and 1800 columns and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
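As a hedged illustration of what "manually expanding the group" might look like in a schema file, the fragment below flattens a nested subrecord into top-level fields (the field names and types here are invented for the example and are not from the original job):

```
// Instead of a nested subrecord such as:
//   record ( ADDRESS: subrec (STREET: string[30]; CITY: string[20];); )
// the group is flattened into top-level fixed-width fields:
record
(
    ADDRESS_STREET: string[30];
    ADDRESS_CITY: string[20];
)
```

Flattening avoids the per-row cost of assembling subrecord structures, which matters when a record has thousands of fields.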
What is your CPU usage during the job run? At the moment you don't know whether total I/O, per-process I/O, or CPU usage is the bottleneck.
Can you please update us on how you did that in your job?
mhester wrote: This issue is probably more closely related to how you have defined the record layout and the output record. If you include the Group etc... then it will take a bunch of time to read the data especially with rows that wide. I just had this issue with an 18gb file and 1800 columns and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
ethelvina wrote: ...I checked with the DS admin for the I/O during that time and it was quite minimal, it seems, and so was the CPU usage.
Something isn't correct here: either your CPU or your I/O bandwidth should max out in this case. Could your disk be on a SAN, so that heavy disk use might show up in the form of network I/O?
I'm sorry, I meant to say: "I checked with the DS admin for the I/O during that time, and it was quite minimal, it seems, and also the CPU usage."
ArndW wrote: Something isn't correct here - either your CPU or your ...