Real-time file processing - Design question

Post questions here related to DataStage Server Edition in such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

sri1dhar
Charter Member
Posts: 54
Joined: Mon Nov 03, 2003 3:57 pm

Post by sri1dhar »

Hi,

We have a requirement to process data files received from external vendors in real time, as soon as they arrive. Until now we have been batch loading these files at pre-scheduled times using the AutoSys scheduling tool.

A couple of options we are considering:

Option 1:
----------
Use an AutoSys file watcher to sense the arrival of a new file and call a DataStage job sequence. That means one instance of the job sequence for each file. The issues we have with this option are monitoring and scalability.

Monitoring - With our current batch load framework, AutoSys executes the DataStage job sequence (through dsjob -run) and waits until it is completed to get the status back, so the job status can be monitored and the production support team can be notified of failures. Real-time processing means the AutoSys job can no longer wait for the DS job to finish, as it needs to continue with other files. That would mean we need another mechanism for monitoring the jobs started by AutoSys.

Scalability - What if we get 50 files at the same time? That would mean 50 instances of the job sequence and the subsequent load jobs running at the same time.
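For illustration, a minimal sketch of how the watcher side of Option 1 could fire the sequence without waiting and leave status checks to a separate monitor. The project, sequence, and parameter names are hypothetical, and the sequence is assumed to be multi-instance (invoked as job.invocationid):

    #!/bin/sh
    # Hypothetical wrapper called by the AutoSys file watcher;
    # MyProject, SeqLoadFile, and SourceFile are made-up names.
    FILE=$1
    ID=`basename $FILE .dat`   # one invocation ID per file

    # No -wait/-jobstatus, so dsjob returns as soon as the sequence
    # starts and AutoSys is free to move on to the next file.
    dsjob -run -param SourceFile=$FILE MyProject SeqLoadFile.$ID

    # A separate monitor can then poll each running instance, e.g.:
    #   dsjob -jobinfo MyProject SeqLoadFile.$ID
    # and notify production support when a status comes back Aborted.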

Option 2:
----------
RTI. Is RTI suitable for this kind of processing?
Most of my knowledge of RTI comes from the manuals and from this forum. I understand there are limitations on which jobs can be RTI-enabled, especially job sequences and jobs with input/output links to passive stages. Moreover, we haven't licensed RTI, but we are willing to if I can truly convince management that RTI is the right solution.

Any feedback or additional options would be welcome.

Regards
Sri
eostic
Premium Member
Posts: 3838
Joined: Mon Oct 17, 2005 9:34 am

Post by eostic »

Hi Sri....

From what you've outlined so far, the multi-instance job feature of DataStage was designed for exactly this kind of goal. Because it allows you to use the same metadata (a single job) concurrently with different sources/targets, you don't have to process all 50 files serially.

It may be problematic to maintain the 50+ instances of running jobs, but it's probably the most creative solution.

RTI is not going to help here, except maybe as an alternative way of kicking off the multiple instances (using a SOAP client that detects the arrival of files instead of AutoSys)... but for this type of application, AutoSys is probably the far better solution.

One other possibility might be to use a message queue, such as MQ Series, and have some application (DataStage, perhaps) start up when the file arrives and put the file into a message queue that a "single" DataStage job is listening to. When the file appears in the queue, it will be pulled in and processed, and then the DS job will go back to waiting on the queue. Provided the file is fixed length and has CRLFs to separate the rows, you could write a DS job that reads in one MQ message and spits out "n" rows for further processing.

The only limits here are what you are doing in the job: with MQ jobs you have to be careful about certain blocking stages, such as the aggregator. Anyway, this would allow you to use only "one" main job, with another job (or some other application program) putting the file into the queue. You might also have to be concerned with the maximum message size for the queue; on some platforms and releases MQ is still limited to 100 MB messages.
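A rough sketch of the watcher side of this idea follows. Everything in it is hypothetical: the paths and queue names are made up, and put_file_to_queue stands in for a small custom MQ API program that writes a whole file as a single message (the IBM-supplied amqsput sample puts one message per line of input, so it wouldn't preserve the one-message-per-file design, and the maximum message length mentioned above still applies):

    #!/bin/sh
    # Hypothetical watcher loop: push each arriving file into the queue
    # that the single listening DS job reads from. All names are made up.
    LANDING=/data/landing
    QMGR=QM1
    QUEUE=VENDOR.FILES.IN

    for f in $LANDING/*.dat
    do
        [ -f "$f" ] || continue
        # put_file_to_queue is a stand-in for a custom MQPUT program
        # that sends the whole file as one message (mind the max length).
        if put_file_to_queue $QMGR $QUEUE "$f"
        then
            mv "$f" $LANDING/done/
        fi
    done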

Ernie
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

No matter what, there's latency involved. Between file arrival, detection, process startup, commit points, etc., there's a lot of "play" in the processing stream.

Depending on volume, you may find that you get a faster start-to-finish with micro-batch processing, because of volume-based processing, larger commit points, bulk loading, etc., than by trickling data into the target tables. As an analogy, take 10K rows processed either by DML with a commit after every row or by a bulk load - the bulk-style processing is probably going to win every time. Scalability is also an issue.

I'd simply change the DataStage jobstream from being single-file centric to being multiple-file centric. Run the jobstream continuously all day long, but instead of processing a single file, have it process all new files concatenated together. If the jobstream's runtime is 10 minutes for 10 concatenated files, then the longest a new file will wait is 10 minutes.

You could also consider having the startup logic limit itself to processing only up to 10 files, so that the process doesn't go beyond its tuning boundaries, or use a row-count based limiter.
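As a concrete illustration of this micro-batch approach, here is a minimal sketch of a driver loop; the paths, project, sequence, and parameter names are all hypothetical:

    #!/bin/sh
    # Hypothetical micro-batch driver: concatenate up to 10 new files
    # and run one jobstream execution per batch. Names are made up.
    LANDING=/data/landing
    BATCH=/data/work/batch.dat

    while true
    do
        # Cap each cycle at 10 files; leftovers are picked up on the
        # next cycle, as described above.
        FILES=`ls $LANDING/*.dat 2>/dev/null | head -10`
        if [ -n "$FILES" ]
        then
            cat $FILES > $BATCH
            # -jobstatus makes dsjob wait and reflect the job's finishing
            # status in its exit code, so failures can be reported here.
            dsjob -run -jobstatus -param SourceFile=$BATCH MyProject SeqLoadBatch
            for f in $FILES; do mv $f /data/archive/; done
        else
            sleep 60
        fi
    done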

You might see that 50 simultaneous executions of the jobstream take longer than one large execution over 50 concatenated files, because of resource contention, thread management, log management, and everything else. You'll probably get better throughput if you figure out where the "sweet spot" configuration lies.

Maybe 10 executions of 5 files each is your optimal performance, given things like database resources, CPU utilization, hashed file sizes, networking, temp and rollback space, etc. Then you simply have your jobstream pick up the latest files, up to the number that keeps you within optimal processing tolerances, and get the leftovers on the next processing cycle.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
sri1dhar
Charter Member
Posts: 54
Joined: Mon Nov 03, 2003 3:57 pm

Post by sri1dhar »

Ernie, Kenneth,

Thanks for the responses. One thing I didn't make clear is that the record format of each of these files can be different, which rules out the concatenation option. We already have 17 file types and expect to add more. Based on the file name, the job sequence determines which load job to call.
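For what it's worth, that file-name routing can also be sketched outside the sequence, in the script that kicks it off; the patterns, project, and job names below are entirely hypothetical:

    #!/bin/sh
    # Hypothetical dispatcher: pick the load sequence from the file's
    # naming pattern. Patterns and job names are made up.
    FILE=$1
    case `basename $FILE` in
        vendora_*) JOB=SeqLoadVendorA ;;
        vendorb_*) JOB=SeqLoadVendorB ;;
        *) echo "No load job defined for $FILE" >&2; exit 1 ;;
    esac
    dsjob -run -param SourceFile=$FILE MyProject $JOB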

Regards
Sri