Performance issue in reading 6GB file
Moderators: chulett, rschirm, roy
Hi all,
I have a mainframe file as source; it is 6 GB in size and has around 3,500 fields. I tried to read this file with the Sequential File stage using the Schema File and RCP options, passing it through a Transformer to a Data Set. Reading the data from that dataset in another job (say, a 2nd job) takes around 25 minutes, so in the first job I wrote to a File Set instead of a Data Set and tried reading from the File Set in the 2nd job. There was no improvement. Is there any other way to improve the performance?
I used the same jobs to read another mainframe file with ~3,800 fields and they completed in 30 seconds, but that file was much smaller, not even 1 GB of data.
FYI, I also tried reading with CFF, but that takes 45 minutes.
I'd be grateful if anyone could help me out with this.
Thanks,
Ethel.
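For context, the throughput implied by the figures in the post can be estimated with a quick back-of-the-envelope calculation (the 6 GB in 25 minutes case works out to roughly 4 MB/s, which is very slow for a sequential read):

```python
# Rough effective read throughput implied by the reported timings.
size_gb = 6          # file size from the post
minutes = 25         # elapsed time from the post
mb_per_sec = size_gb * 1024 / (minutes * 60)
print(f"{mb_per_sec:.1f} MB/s")  # roughly 4.1 MB/s
```

A modern disk subsystem should sustain far more than this, which is why the replies below focus on parallelism and on where the bottleneck actually is.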
You have two problem areas, and I'm not sure which one you mean:
a) reading from flat file and writing to a dataset
b) reading the dataset.
Is your mainframe file fixed length? If so, you might be able to improve your read performance by using multiple readers and most likely improve your dataset write performance by choosing an optimal APT_CONFIG_FILE configuration.
The second problem might also be addressed by using more parallel nodes. It does depend upon your DataStage server hardware setup, though.
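For readers unfamiliar with APT_CONFIG_FILE: it points to a parallel configuration file that defines the processing nodes. A minimal two-node sketch is shown below; the node names, `fastname`, and the disk paths are all assumptions and must match your actual server (the same pattern is repeated for each additional node):

```
{
    node "node1"
    {
        fastname "dsserver"
        pools ""
        resource disk "/ds/data/node1" {pools ""}
        resource scratchdisk "/ds/scratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "dsserver"
        pools ""
        resource disk "/ds/data/node2" {pools ""}
        resource scratchdisk "/ds/scratch/node2" {pools ""}
    }
}
```

Spreading the `resource disk` entries across separate physical spindles or controllers is what typically improves dataset write performance.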
My mainframe file is fixed length. Here is my design, for better understanding.
ArndW wrote: You have two problem areas, and I'm not sure which you consider:
a) reading from flat file and writing to a dataset
b) reading the dataset.
Is your mainframe file fixed length? If so, you might ...
I have a mainframe fixed-width file and I used the Sequential File stage to read it (the options I used are already in my earlier post). It takes ~25 minutes to read. That is why I split the job in two: reading the data from the Sequential File stage and writing it to a Data Set, then using that dataset as the source for further processing.
Please correct me if I'm wrong anywhere.
Thanks.
How many CPUs does your system have, and how many nodes are in your APT_CONFIG_FILE? Can you experiment with timings using multiple readers per node to see if that speeds up reading your file? (To test this, write to a Peek stage instead of to the dataset.)
We have 4 nodes in the config file. I've raised the readers per node to 4 when using the Sequential File stage.
ArndW wrote: How many CPUs does your system have and how many nodes are in your APT_CONFIG_FILE? Can you experiment on timing using multiple readers per node to see if you speed up reading your file (in order to test this, write to a PEEK stage instead of to the dataset)
This issue is probably more closely related to how you have defined the record layout and the output record. If you include the GROUP fields etc., then it will take a long time to read the data, especially with rows that wide. I just had this issue with an 18 GB file and 1,800 columns, and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
Mike Hester
mhester@petra-ps.com
I would try turning on "Read from multiple Nodes".
Last edited by ArndW on Mon Oct 11, 2010 7:27 am, edited 2 times in total.
Do you also have "Read from multiple Nodes" turned on?
I did not include the GROUP in the schema file (I manually expanded it so as to avoid the subrecord creation). Also, my source is a Sequential File stage reading the mainframe fixed file, passed to a Transformer, then to a Column Import stage, and then to a target dataset. The Column Import stage is RCP enabled.
mhester wrote: This issue is probably more closely related to how you have defined the record layout and the output record. If you include the Group etc... then it will take a bunch of time to read the data especially with rows that wide. I just had this issue with an 18gb file and 1800 columns and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
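As a hedged illustration of what "manually expanding the group" might look like in a schema file, the fragment below flattens a nested subrecord into top-level fields (the field names and types here are invented for the example and are not from the original job):

```
// Instead of a nested subrecord such as:
//   record ( ADDRESS: subrec (STREET: string[30]; CITY: string[20];); )
// the group is flattened into top-level fixed-width fields:
record
(
    ADDRESS_STREET: string[30];
    ADDRESS_CITY: string[20];
)
```

Flattening avoids the per-row cost of assembling subrecord structures, which matters when a record has thousands of fields.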
What is your CPU usage during the job run? At the moment you don't know whether total I/O, per-process I/O, or CPU usage is the bottleneck.
Can you please update us on how you did that in your job?
mhester wrote: This issue is probably more closely related to how you have defined the record layout and the output record. If you include the Group etc... then it will take a bunch of time to read the data especially with rows that wide. I just had this issue with an 18gb file and 1800 columns and I was able to get it to read very quickly. I need to look at what I did and pass that along to you....
Stand by :D
ethelvina wrote: ...I checked with the DS admin for the I/O during that time and it was quite minimal, it seems, and so was the CPU usage.
Something isn't correct here: either your CPU or your I/O bandwidth should max out in this case. Could your disk be on a SAN, so that heavy disk use might show up in the form of network I/O?
I'm sorry, I meant to say: "I checked with the DS admin for the I/O during that time, and it was quite minimal, it seems, and also the CPU usage."
ArndW wrote: Something isn't correct here - either your CPU or your ...