Running multi-instance job for two files

manishmaheshwari2608
Participant
Posts: 5
Joined: Fri Dec 30, 2011 2:51 am

Running multi-instance job for two files

Post by manishmaheshwari2608 »

Hi,

I have a multi-instance job which runs for multiple files that arrive in various directories.
How should I handle the scenario of running the job as multiple instances when two files arrive in the same directory?
I have to implement this without a loop: a loop would process the files one after another, but my requirement is to run the sequence in parallel for both files.
For example, if two files named abc.txt and abc1.txt arrive in the source directory, I have to execute my jobs in parallel to process both files.

Please suggest the best way.

Thanks
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

a) Are you passing the file names as parameters? If so, there should be no issues.

b) Do you want to implement some wait-for-file mechanism? Once the files are in the directory, should the job process them in parallel?
pandeeswaran
zulfi123786
Premium Member
Posts: 730
Joined: Tue Nov 04, 2008 10:14 am
Location: Bangalore

Post by zulfi123786 »

The whole concept of a multi-instance job is to be able to run the same process in parallel, but on different sets of data. To accomplish the parallel run you need to parameterize the file name and run the job with different invocation IDs.
I guess you don't have any dependency across the processing of the different files.
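For instance, a controlling script could start one instance of the job per file, along these lines (a rough sketch only; FILE_NAME, MyProject and ProcessFile are placeholder names, not anything from this thread):

# one dsjob call per file, each with its own invocation id, started in the background
dsjob -run -wait -param FILE_NAME=abc.txt MyProject ProcessFile.abc &
dsjob -run -wait -param FILE_NAME=abc1.txt MyProject ProcessFile.abc1 &
wait   # continue only after both instances have finished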
- Zulfi
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

What about using a Sequential File stage with a file pattern? It doesn't have to be a multi-instance job in that case.
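As an illustration only (the pattern is made up, and #SOURCE_DIR# stands for a hypothetical job parameter holding the directory), the stage's File Pattern property would contain something like:

#SOURCE_DIR#/abc*.txt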
Choose a job you love, and you will never have to work a day in your life. - Confucius
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

The files will be processed one by one with the file pattern method.
But he wants to process all the files in parallel.
pandeeswaran
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Good point, Pandeesh.

You can run two instances of the multi-instance job at once on different files in the same directory. Have you tried it, and are you finding any problem?
Choose a job you love, and you will never have to work a day in your life. - Confucius
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

pandeesh wrote: The files will be processed one by one with the file pattern method.
But he wants to process all the files in parallel.

I believe that using a file pattern to read more than one file processes the data in parallel. If there are 5 files, all of them are read in parallel and passed to the downstream stages. With this method you may need to add some extra logic to know which records belong to which file.
Nag
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

As per the documentation:

File pattern
Specifies a group of files to import. Specify a file containing a list of files or a job parameter representing the file. The file could also contain any valid shell expression, in Bourne shell syntax, that generates a list of file names.

So, if there are 3 files, 1.txt, 2.txt and 3.txt, how will they get processed with the file pattern method? Does each file get processed separately, or are all the records combined (e.g. cat 1.txt 2.txt 3.txt > final.txt) and the result (final.txt) processed in a single run?

Can anyone shed some light on this?

Thanks
pandeeswaran
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

All the records in the 3 files are combined, read in parallel, and processed by the downstream stages.
Nag
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Using a file pattern won't concatenate all the files together before reading. As far as I know, each matching file name will be opened for reading. I had assumed each file is read in parallel, but I'm not really sure.
Choose a job you love, and you will never have to work a day in your life. - Confucius
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Eric,
I didn't mean to say they would be combined before reading. All 3 files are read simultaneously, and after that all the records from these 3 files are passed to the later stages. You need to include additional logic if you want to find out which record belongs to which file.
Nag
pandeesh
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

nagarjuna wrote: You need to include additional logic if you want to find out which record belongs to which file.

The additional logic has also been discussed recently
(the source file name property and $APT_IMPORT_PATTERN_USES_FILESET=TRUE).
pandeeswaran
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

By default, a list of files, whether generated by a file pattern, a command's output, or hardcoded, will be concatenated together and read by the Sequential File stage sequentially. In order to read the files in parallel with each other, add the APT_IMPORT_PATTERN_USES_FILESET environment variable mentioned by pandeesh. This variable is discussed in the product documentation.
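A minimal sketch of setting it (whether it goes in dsenv, the project-level environment in Administrator, or a job-level environment variable parameter is up to your environment):

# read each file matched by the pattern in parallel,
# instead of concatenating them into one sequential stream
export APT_IMPORT_PATTERN_USES_FILESET=TRUE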

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
nagarjuna
Premium Member
Posts: 533
Joined: Fri Jun 27, 2008 9:11 pm
Location: Chicago

Post by nagarjuna »

Suppose you are trying to read all the files whose names start with datastage_file; you would specify datastage_file*. I think that if you do not set the variable APT_IMPORT_PATTERN_USES_FILESET and use the file pattern, it will look for a single file literally named 'datastage_file*'.
You need to set that variable to make it work as expected. Please correct me if I am wrong.
Nag
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

James is correct. By default, without the environment variable set, it will expand any wildcards into the matching file names and cat those files together (a sequential read). With the environment variable set, if there are multiple files, it will process the files in parallel.

Either way, it expands the wildcard and reads the same files.
Choose a job you love, and you will never have to work a day in your life. - Confucius