Running a multi-instance job for two files

Posted: Fri Jan 06, 2012 12:28 pm
by manishmaheshwari2608
Hi,

I have a multi-instance job that runs for multiple files arriving in various directories.
But how should I handle running the job in multi-instance mode when two files arrive in the same directory?
I have to implement this without a loop: a loop would process the files one after another, but my requirement is to run the sequence in parallel for both files.
For example, if two files named abc.txt and abc1.txt arrive in a source directory, I have to execute my jobs in parallel to process both files.

Please suggest the best way.

Thanks

Posted: Fri Jan 06, 2012 12:55 pm
by pandeesh
a) Are you passing the file names as parameters? If so, there should be no issues.

b) Do you want to implement some wait-for-file mechanism? Once the files are in the directory, should the job process them in parallel?

Posted: Fri Jan 06, 2012 1:08 pm
by zulfi123786
The whole concept of multi-instance is to be able to run the same process in parallel on different sets of data. To get the parallel run, you need to parameterize the file name and run the job with different invocation IDs.
I guess you don't have any dependency between the processing of the different files.
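
A rough sketch of that idea in Python, in case it helps: build one dsjob command per file, each with its own invocation ID, and start them all at once. The project name MyProject, the job name LoadFileJob and the parameter name SourceFile are made-up placeholders, and the -run, -param and -jobstatus options are used as I understand the dsjob command line, so check them against your own installation.

```python
import re
import subprocess
from pathlib import Path


def invocation_id(file_path):
    """Derive an invocation ID from the file name.

    Assumption: IDs made of letters, digits and underscores are acceptable,
    so every other character is replaced with an underscore.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", Path(file_path).stem)


def build_commands(project, job, param_name, file_paths):
    """Return one dsjob argument list per file, each with its own invocation ID."""
    commands = []
    for path in file_paths:
        commands.append([
            "dsjob", "-run",
            "-param", f"{param_name}={path}",    # file name passed as a job parameter
            "-jobstatus",                        # wait and report the job status
            project,
            f"{job}.{invocation_id(path)}",      # jobname.invocationid
        ])
    return commands


def run_in_parallel(commands):
    """Start every command without waiting, then collect all exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [process.wait() for process in processes]
```

Calling build_commands("MyProject", "LoadFileJob", "SourceFile", ["/data/in/abc.txt", "/data/in/abc1.txt"]) and passing the result to run_in_parallel would start the instances LoadFileJob.abc and LoadFileJob.abc1 side by side. Two files with the same base name but different extensions would get the same ID here, so a real script would need a stronger naming rule.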

Posted: Fri Jan 06, 2012 3:35 pm
by qt_ky
What about using a sequential file stage with a file pattern? It doesn't have to be a multi-instance job in that case.

Posted: Fri Jan 06, 2012 11:17 pm
by pandeesh
With the file pattern method, the files will be processed one by one.
But he wants to process all the files in parallel.

Posted: Sat Jan 07, 2012 9:16 am
by qt_ky
Good point, Pandeesh.

You can run two instances of the multi-instance job at once on different files in the same directory. Have you tried it, and did you run into any problem?

Posted: Sat Jan 07, 2012 12:48 pm
by nagarjuna
pandeesh wrote:The files will be processed one by one in case of file pattern method.
But,he wants to process all the files in parallel.
I believe that using a file pattern to read more than one file processes the data in parallel. If there are 5 files, all of them are read in parallel and passed to the downstream stages. With this method you may need additional logic to know which records belong to which file.

Posted: Sat Jan 07, 2012 1:06 pm
by pandeesh
As per the documentation:
File pattern
Specifies a group of files to import. Specify file containing a list of files or a job parameter representing
the file. The file could also contain be any valid shell expression, in Bourne shell syntax, that generates a
list of file names.
So if there are 3 files, 1.txt, 2.txt and 3.txt, how will they get processed with the file pattern method? Does each file get processed separately, or do all the records get combined (e.g. cat 1.txt 2.txt 3.txt > final.txt) and final.txt processed in a single run?

Can anyone shed some light on this?

Thanks

Posted: Sat Jan 07, 2012 1:20 pm
by nagarjuna
All the records in the 3 files are combined, read in parallel, and processed by the downstream stage.

Posted: Sat Jan 07, 2012 1:54 pm
by qt_ky
Using a file pattern won't concatenate all the files together before reading. As far as I know, each matching file name will be opened for reading. I had assumed each file is read in parallel, but I'm not really sure.

Posted: Sat Jan 07, 2012 2:24 pm
by nagarjuna
Eric,
I didn't mean to say they would be combined before reading. All 3 files are read simultaneously, and after that all the records from these 3 files are passed to the later stages. You need to include additional logic if you want to find out which record belongs to which file.

Posted: Sun Jan 08, 2012 6:48 am
by pandeesh
nagarjuna wrote:You need to include additional logic if at all you want to findout which record belongs to which file
That additional logic has also been discussed recently
(Source file name property and $APT_IMPORT_PATTERN_USES_FILESET=TRUE).

Posted: Mon Jan 09, 2012 12:30 am
by jwiles
By default, a list of files, whether generated by a file pattern, produced by command output or hardcoded, will be concatenated together and read into the Sequential File stage sequentially. To read the files in parallel with each other, add the APT_IMPORT_PATTERN_USES_FILESET environment variable mentioned by pandeesh. This variable is discussed in the product documentation.

Regards,

Posted: Mon Jan 09, 2012 4:59 am
by nagarjuna
Suppose you are trying to read all the files whose names start with datastage_file; you would specify datastage_file*. I think that if you use the file pattern without setting the variable APT_IMPORT_PATTERN_USES_FILESET, it will look for a file literally named 'datastage_file*'.
You need to set that variable to make it work as expected. Please correct me if I am wrong.

Posted: Sun Apr 27, 2014 9:11 am
by qt_ky
James is correct. By default, without the env var set, it will expand any wildcards into the matching filenames and cat those files together (sequential). With the env var set, if there are multiple files, then it will process the files in parallel.

Either way, it will expand the wildcard and read the same files.
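
To picture the difference, here is a plain Python analogy. It is not DataStage code, and the function names are invented for illustration. Both functions expand the same wildcard; one reads the matching files one after another, the other reads them concurrently. Each record is tagged with its source file, which is the kind of additional logic mentioned earlier in the thread.

```python
import glob
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read one file and tag every record with the file it came from."""
    with open(path, encoding="utf-8") as handle:
        return [(path, line.rstrip("\n")) for line in handle]


def read_sequential(pattern):
    """Expand the wildcard, then read the matching files one after another."""
    records = []
    for path in sorted(glob.glob(pattern)):
        records.extend(read_file(path))
    return records


def read_parallel(pattern, workers=4):
    """Expand the same wildcard, but read the matching files concurrently."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so output order is stable.
        per_file = list(pool.map(read_file, paths))
    return [record for records in per_file for record in records]
```

Both functions return the same records for a pattern such as datastage_file*; only the way the files are read differs.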