Performance benchmarks on the External Source stage
Moderators: chulett, rschirm, roy
Hello,
We have done some performance benchmarks on the External Source stage.
Process 1:
Sequence 1:
Execute Command activity (Perl script writing to a file) --> Job activity
Job:
Sequential File stage --> Peek stage
Process 2:
External Source stage --> Peek stage
Command in the External Source stage:
*:#$USR_DR_SCRIPT#/txl_preprocessor.pl #$USR_DR_PREPROC_FROM# #$USR_DR_PREPROC_TO#
The Perl script in Process 1 and Process 2 is the same.
The Perl script takes a prerequisite set of files (around 10,000 of them), cleans hex values from the start of each record, and writes the data to STDOUT. In Process 1 we redirect that output to one single file and read it with a Sequential File stage.
In Process 2 we run the Perl command from the External Source stage and stream its STDOUT directly into DataStage.
The test results contradict the general rule of thumb that writing to a file and reading it back is a more costly operation than reading directly from STDOUT. Here are the test results.
STREAMING (Process 2)
Files per Batch   Start Time   End Time   Elapsed Time (mm:ss)   # of Records
20000 15:16:38 15:20:18 3:40 13028722
15:28:45 15:32:16 3:31 13028722
15:33:03 15:36:33 3:30 13028722
16:05:21 16:08:54 3:33 13028722
16:09:48 16:13:19 3:31 13028722
WRITING TO FILE (Process 1)
Files per Batch   Start Time   End Time   Elapsed Time (mm:ss)   # of Records
15:23:00 15:26:10 3:10 13028722
15:48:35 15:51:39 3:04 13028722
15:52:51 15:55:56 3:05 13028722
15:57:03 16:00:09 3:06 13028722
16:00:44 16:03:49 3:05 13028722
Both processes are running in a 2-node configuration. The External Source stage is made to run on one node, because otherwise it runs two instances of the Perl script at the back end and duplicates the same data.
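For reference, here is a stripped-down, DataStage-free version of the same comparison in Python (Linux-only; the payload and sizes are synthetic stand-ins for the preprocessor output, chosen just for illustration). It times landing the bytes in a file and reading them back versus streaming them through a pipe from a child process:

```python
import os
import tempfile
import time

payload = b"cleaned record\n" * 100_000  # stand-in for the preprocessor's STDOUT

# Process 1 analogue: land the output in a file, then read the file back.
t0 = time.perf_counter()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name
with open(path, "rb") as f:
    via_file = f.read()
os.unlink(path)
t_file = time.perf_counter() - t0

# Process 2 analogue: a child process streams the same bytes through a pipe.
t0 = time.perf_counter()
r, w = os.pipe()
pid = os.fork()  # Unix-only, matching the Linux test server
if pid == 0:
    os.close(r)
    os.write(w, payload)  # blocks whenever the kernel pipe buffer fills
    os._exit(0)
os.close(w)
parts = []
while chunk := os.read(r, 1 << 16):
    parts.append(chunk)
os.close(r)
os.waitpid(pid, 0)
via_pipe = b"".join(parts)
t_pipe = time.perf_counter() - t0

assert via_file == via_pipe == payload
print(f"file: {t_file:.4f}s  pipe: {t_pipe:.4f}s")
```

Note that with a warm page cache the file round trip rarely touches the physical disk before the read happens, which is one reason the "disk I/O is always costly" intuition can fail for short-lived staging files.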
Server details where the tests were conducted:
[wicdsadp@linux5960 ~]$ uname -a
Linux linux5960 2.6.32-431.20.3.el6.x86_64 #1 SMP Fri Jun 6 18:30:54 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux
[wicdsadp@linux5960 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256724
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 32768
cpu time (seconds, -t) unlimited
max user processes (-u) 20480
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Question: Is there any technical reason why writing to a file and reading it back is faster than reading from STDOUT? I am ready to provide additional information. Can I do any tuning in the External Source stage?
Thank You!
Pavan
Re: Performance benchmarks on the External Source stage
pavan5035 wrote: "Test results are contradicting to general hard and fast rule that writing to a file and reading from a file is a costly operation than reading directly from STDOUT."
It's an interesting discussion and test, but to be honest I've never heard of this "general hard and fast rule". For whatever that is worth.
Oh, and welcome.
-craig
"You can never have too many knives" -- Logan Nine Fingers
@chulett
So is it correct to say my assumption was wrong: that two separate processes, one writing to a file and the other reading from that file, can be faster than one process passing the data to the other through memory? I was under the impression that disk I/O is always costlier than passing data between processes in memory.
You are in any case reading the files in the Perl script and sending the data to stdout, it seems. Correct me if that assumption is wrong. So the streaming approach requires the same file reads plus one additional step; even if Perl were generating the records on the fly, it would still require additional processing.
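One concrete source of that additional processing (a rough sketch, assuming Linux defaults, nothing DataStage-specific): a single read() on a pipe can return at most what is sitting in the kernel pipe buffer, so the consumer issues many more system calls than one large buffered file read, even for identical bytes:

```python
import os
import tempfile

payload = b"x" * (1 << 20)  # 1 MiB of synthetic data

# Reading the whole megabyte from a file takes a single read() call.
fd, path = tempfile.mkstemp()
os.write(fd, payload)
os.close(fd)
file_calls = 0
with open(path, "rb") as f:
    while f.read(1 << 20):
        file_calls += 1
os.unlink(path)

# Reading it from a pipe takes many calls: each read() is capped by the
# kernel pipe buffer (64 KiB by default on Linux).
r, w = os.pipe()
pid = os.fork()  # Unix-only
if pid == 0:
    os.close(r)
    os.write(w, payload)  # blocks whenever the pipe buffer fills
    os._exit(0)
os.close(w)
pipe_calls = 0
while os.read(r, 1 << 20):
    pipe_calls += 1
os.close(r)
os.waitpid(pid, 0)

print("file read() calls:", file_calls, " pipe read() calls:", pipe_calls)
```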
Priyadarshi Kunal
Genius may have its limitations, but stupidity is not thus handicapped.
This all sounds like a comparison that doesn't really make sense; there are far too many variables. Certainly, we always try to avoid disk I/O when we can, but how the data is written, what is in it, how big it is, whether it is encrypted, record lengths, memory-passing strategies, what program is doing the writing, how much time you actually have for the movement, and much more all come into play. There isn't a "general rule" you can apply here. Take a big-picture look at the Job and see if the overall task is being approached in the right way for the best performance... and then you can decide where you can make the most gains, if it isn't moving data as fast as you would like, or if you even need to.
Ernie
Ernie Ostic
blogit!
<a href="https://dsrealtime.wordpress.com/2015/0 ... ere/">Open IGC is Here!</a>
The subject of your Premium Membership was split off to its own topic, and any further conversation on it needs to happen there. Let's leave this one for your External Source performance topic.
-craig
"You can never have too many knives" -- Logan Nine Fingers