Performance benchmarks on the External Source stage
Moderators: chulett, rschirm, roy
Hello,
We have done some performance benchmarks on the External Source stage.
Process 1:
Sequence 1:
Execute Command activity (Perl script writing to a file) --> Job activity
Job:
Sequential File stage --> Peek stage
Process 2:
External Source stage --> Peek stage
Command in the External Source stage:
*:#$USR_DR_SCRIPT#/txl_preprocessor.pl #$USR_DR_PREPROC_FROM# #$USR_DR_PREPROC_TO#
The Perl script in Process 1 and Process 2 is the same.
The Perl script takes a prerequisite set of files (around 10,000 of them), cleans hex values from the start of each record, and writes the data to STDOUT. In Process 1 we redirect that output to one single file and read it with a Sequential File stage.
In Process 2 we run the Perl command from the External Source stage and stream its STDOUT directly into DataStage.
The test results contradict the general rule of thumb that writing to a file and reading it back is a more costly operation than reading directly from STDOUT. Here are the test results.
STREAMING (Process 2)
Files per Batch   Start Time   End Time   Elapsed Time (mm:ss)   # of Records
20000 15:16:38 15:20:18 3:40 13028722
15:28:45 15:32:16 3:31 13028722
15:33:03 15:36:33 3:30 13028722
16:05:21 16:08:54 3:33 13028722
16:09:48 16:13:19 3:31 13028722
WRITING TO FILE (Process 1)
Files per Batch   Start Time   End Time   Elapsed Time (mm:ss)   # of Records
15:23:00 15:26:10 3:10 13028722
15:48:35 15:51:39 3:04 13028722
15:52:51 15:55:56 3:05 13028722
15:57:03 16:00:09 3:06 13028722
16:00:44 16:03:49 3:05 13028722
Both processes are running in a 2-node configuration. The External Source stage is made to run on one node, because otherwise it runs two instances of the Perl script at the back end and duplicates the same data.
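For reference, here is a stripped-down, DataStage-free version of the same comparison in Python (Linux-only; the payload and sizes are synthetic stand-ins for the preprocessor output, chosen just for illustration). It times landing the bytes in a file and reading them back versus streaming them through a pipe from a child process:

```python
import os
import tempfile
import time

payload = b"cleaned record\n" * 100_000  # stand-in for the preprocessor's STDOUT

# Process 1 analogue: land the output in a file, then read the file back.
t0 = time.perf_counter()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name
with open(path, "rb") as f:
    via_file = f.read()
os.unlink(path)
t_file = time.perf_counter() - t0

# Process 2 analogue: a child process streams the same bytes through a pipe.
t0 = time.perf_counter()
r, w = os.pipe()
pid = os.fork()  # Unix-only, matching the Linux test server
if pid == 0:
    os.close(r)
    os.write(w, payload)  # blocks whenever the kernel pipe buffer fills
    os._exit(0)
os.close(w)
parts = []
while chunk := os.read(r, 1 << 16):
    parts.append(chunk)
os.close(r)
os.waitpid(pid, 0)
via_pipe = b"".join(parts)
t_pipe = time.perf_counter() - t0

assert via_file == via_pipe == payload
print(f"file: {t_file:.4f}s  pipe: {t_pipe:.4f}s")
```

Note that with a warm page cache the file round trip rarely touches the physical disk before the read happens, which is one reason the "disk I/O is always costly" intuition can fail for short-lived staging files.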
Server details where the tests were conducted:
[wicdsadp@linux5960 ~]$ uname -a
Linux linux5960 2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[wicdsadp@linux5960 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256724
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
max user processes (-u) 20480
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Question: Is there any technical reason why writing to a file and reading it back is faster than reading from STDOUT? I am ready to provide additional information. Can I do any tuning in the External Source stage?
Thank You!
Pavan
Re: Performance benchmarks on the External Source stage
pavan5035 wrote: "Test results are contradicting to general hard and fast rule that writing to a file and reading from a file is a costly operation than reading directly from STDOUT."
It's an interesting discussion and test, but to be honest I've never heard of this "general hard and fast rule". For whatever that is worth.
Oh, and welcome.
-craig
"You can never have too many knives" -- Logan Nine Fingers
@chulett
So is it correct to say my assumption was wrong: that two separate processes, one writing to a file and the other reading from that file, can be faster than one process passing the data to the other through memory? I was under the impression that disk I/O is always costlier than passing data between processes in memory.
You are in any case reading the files in the Perl script and sending the data to stdout, it seems. Correct me if that assumption is wrong. So the streaming approach requires the same file reads plus one additional step; even if Perl were generating the records on the fly, it would still require additional processing.
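One concrete source of that additional processing (a rough sketch, assuming Linux defaults, nothing DataStage-specific): a single read() on a pipe can return at most what is sitting in the kernel pipe buffer, so the consumer issues many more system calls than one large buffered file read, even for identical bytes:

```python
import os
import tempfile

payload = b"x" * (1 << 20)  # 1 MiB of synthetic data

# Reading the whole megabyte from a file takes a single read() call.
fd, path = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)
file_calls = 0
with open(path, "rb") as f:
    while f.read(1 << 20):
        file_calls += 1
os.unlink(path)

# Reading it from a pipe takes many calls: each read() is capped by the
# kernel pipe buffer (64 KiB by default on Linux).
r, w = os.pipe()
pid = os.fork()  # Unix-only
if pid == 0:
    os.close(r)
    os.write(w, payload)  # blocks whenever the pipe buffer fills
    os._exit(0)
os.close(w)
pipe_calls = 0
while os.read(r, 1 << 20):
    pipe_calls += 1
os.close(r)
os.waitpid(pid, 0)

print("file read() calls:", file_calls, " pipe read() calls:", pipe_calls)
```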
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
This all sounds like a comparison that doesn't really make sense; there are far too many variables. Certainly, we always try to avoid disk I/O when we can, but how the data is written, what is in it, how big it is, whether it is encrypted, record lengths, memory-passing strategies, what program is doing the writing, how much time you actually have for the movement, and much more all come into play. There isn't a "general rule" you can apply here. Take a big-picture look at the Job and see if the overall task is being approached in the right way for the best performance... and then you can decide where you can make the most gains, if it isn't moving data as fast as you would like, or if you even need to.
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
The subject of your Premium Membership was split off to its own topic, and any further conversation on it needs to happen there. Let's leave this one for your External Source performance topic.
-craig
"You can never have too many knives" -- Logan Nine Fingers