Running multi-instance job for two files

manishmaheshwari2608
Participant
Posts: 5
Joined: Fri Dec 30, 2011 2:51 am

Running multi-instance job for two files

Post by manishmaheshwari2608 »

Hi,

I have a multi-instance job which runs for multiple files that arrive in various directories.
How should I handle the scenario of running the job as multiple instances when two files arrive in the same directory?
I have to implement this without a loop: a loop would process the files one after another, but my requirement is to run the sequence in parallel for both files.
For example, if two files named abc.txt and abc1.txt arrive in the source directory, I have to execute my jobs in parallel to process both files.

Please suggest the best way.

Thanks
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

a) Are you passing the file names as parameters? If so, there should be no issues.

b) Do you want to implement some wait-for-file mechanism? Once the files are in the directory, should the job process them in parallel?
pandeeswaran
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

The whole concept of a multi-instance job is to be able to run the same process in parallel, but on different sets of data. To accomplish the parallel run you need to parameterize the file name and run the job with different invocation IDs.
I guess you don't have any dependency across the processing of the different files.
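For instance, a controlling script could start one instance of the job per file, along these lines (a rough sketch only; FILE_NAME, MyProject and ProcessFile are placeholder names, not anything from this thread):

# one dsjob call per file, each with its own invocation id, started in the background
dsjob -run -wait -param FILE_NAME=abc.txt MyProject ProcessFile.abc &
dsjob -run -wait -param FILE_NAME=abc1.txt MyProject ProcessFile.abc1 &
wait   # continue only after both instances have finished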
- Zulfi
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

What about using a Sequential File stage with a file pattern? It doesn't have to be a multi-instance job in that case.
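As an illustration only (the pattern is made up, and #SOURCE_DIR# stands for a hypothetical job parameter holding the directory), the stage's File Pattern property would contain something like:

#SOURCE_DIR#/abc*.txt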
Choose a job you love, and you will never have to work a day in your life. - Confucius
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

The files will be processed one by one with the file pattern method.
But he wants to process all the files in parallel.
pandeeswaran
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Good point, Pandeesh.

You can run two instances of the multi-instance job at once on different files in the same directory. Have you tried it, and are you finding any problem?
Choose a job you love, and you will never have to work a day in your life. - Confucius
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

pandeesh wrote: The files will be processed one by one with the file pattern method.
But he wants to process all the files in parallel.

I believe that using a file pattern to read more than one file processes the data in parallel. If there are 5 files, all of them are read in parallel and passed to the downstream stages. With this method you may need to add some extra logic to know which records belong to which file.
Nag
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

As per the documentation:

File pattern
Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.

So, if there are 3 files, 1.txt, 2.txt and 3.txt, how will they get processed with the file pattern method? Does each file get processed separately, or are all the records combined (e.g. cat 1.txt 2.txt 3.txt > final.txt) and the result (final.txt) processed in a single run?

Can anyone shed some light on this?

Thanks
pandeeswaran
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

All the records in the 3 files are combined, read in parallel, and processed by the downstream stages.
Nag
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Using a file pattern won't concatenate all the files together before reading. As far as I know, each matching file name will be opened for reading. I had assumed each file is read in parallel, but I'm not really sure.
Choose a job you love, and you will never have to work a day in your life. - Confucius
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Eric,
I didn't mean to say they would be combined before reading. All 3 files are read simultaneously, and after that all the records from these 3 files are passed to the later stages. You need to include additional logic if you want to find out which record belongs to which file.
Nag
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

nagarjuna wrote: You need to include additional logic if you want to find out which record belongs to which file.

The additional logic has also been discussed recently
(the source file name property and $APT_IMPORT_PATTERN_USES_FILESET=TRUE).
pandeeswaran
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

By default, a list of files, whether generated by a file pattern, a command's output, or hardcoded, will be concatenated together and read by the Sequential File stage sequentially. In order to read the files in parallel with each other, add the APT_IMPORT_PATTERN_USES_FILESET environment variable mentioned by pandeesh. This variable is discussed in the product documentation.
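A minimal sketch of setting it (whether it goes in dsenv, the project-level environment in Administrator, or a job-level environment variable parameter is up to your environment):

# read each file matched by the pattern in parallel,
# instead of concatenating them into one sequential stream
export APT_IMPORT_PATTERN_USES_FILESET=TRUE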

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Suppose you are trying to read all the files whose names start with datastage_file; you would specify datastage_file*. I think that if you do not set the variable APT_IMPORT_PATTERN_USES_FILESET and use the file pattern, it will look for a single file literally named 'datastage_file*'.
You need to set that variable to make it work as expected. Please correct me if I am wrong.
Nag
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

James is correct. By default, without the environment variable set, it will expand any wildcards into the matching file names and cat those files together (a sequential read). With the environment variable set, if there are multiple files, it will process the files in parallel.

Either way, it expands the wildcard and reads the same files.
Choose a job you love, and you will never have to work a day in your life. - Confucius