Sequential file name capture in a job

trammohan · Post by **trammohan** » Wed Apr 26, 2006 4:01 pm

Hi,

I am using a Sequential File stage to read data from *XYZ*.txt files. When I select the File Name Column option, the file name in the output comes through as *XYZ*.txt.
My question is how to get the actual file name while reading the data from the sequential file.

Thanks in advance....
trm

roy · Post by **roy** » Thu Apr 27, 2006 1:15 am

Hi,
In your output link, change the Read Method to File Pattern;
this will let you use wildcards for the file name.

IHTH,

trammohan · Post by **trammohan** » Thu Apr 27, 2006 6:58 am

Hi Roy,

Thanks. I want to include the file name in the column list while reading the data. I want the actual file name, not the file name with the wildcard characters.

Thanks
trm

ray.wurlod · Post by **ray.wurlod** » Thu Apr 27, 2006 4:01 pm

As far as I am aware this is not possible, but would be happy to be proven wrong. Each row in the stream of rows that is being processed may have come from any of the files. Reading from the files that match a pattern is like using cat to make a single stream in a filter, except that you can get some parallelism happening.
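The cat analogy can be made concrete with a small sketch. This is plain Python, not DataStage code; the function name `read_like_cat` and the demo file names are made up for illustration. It only shows why the file name is lost once the matching files are merged into one stream:

```python
import glob
import os
import tempfile


def read_like_cat(pattern):
    """Yield every line of every file matching pattern as one stream."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                # Only the line text is passed on; the path is dropped here.
                yield line.rstrip("\n")


# Hypothetical demo files, created in a throwaway directory.
workdir = tempfile.mkdtemp()
for name, rows in [("trm1.txt", ["a", "b"]), ("trm2.txt", ["c"])]:
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write("\n".join(rows) + "\n")

merged = list(read_like_cat(os.path.join(workdir, "trm*.txt")))
print(merged)
```

Nothing in `merged` says which file a given line came from, which is the situation described above.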

thebird · Post by **thebird** » Sun Apr 30, 2006 1:46 am

Hi,

There is an environment variable, APT_IMPORT_PATTERN_USES_FILESET, which, when set to TRUE, returns the exact file name from which the record is being read.

There was a post regarding this in the Developer net forum, which was answered by Danny Owen.

I have used this in one scenario, and it does work fine with the File Pattern option. But there was one issue: if no files match the specified pattern, the job aborts.

Hope this helps.

Regards,

The Bird.

trammohan · Post by **trammohan** » Sun Apr 30, 2006 9:03 am

Hi The Bird,

When I set APT_IMPORT_PATTERN_USES_FILESET to TRUE, it prints the output file name, not the input file name.

Is there any other parameter to set for the input file name?

trm

thebird · Post by **thebird** » Mon May 01, 2006 2:31 am

Hi trm,

There are no other parameters or variables that you have to set for this. If this variable is set and:

1. the File Pattern option is set in your source Sequential File stage to read the multiple source files, and

2. the File Name Column option is chosen in the source Sequential File stage and the additional column (for the source file name) is defined in the Columns tab,

you should be able to see the corresponding source file name from which each record is read when you do a View Data on the source stage. You should also be able to carry this column forward to the downstream stages.

Hope this solves your problem.

Regards,

The Bird.

trammohan · Post by **trammohan** » Mon May 01, 2006 8:54 am

Hi Bird,

I have 2 input files (trm1.txt and trm2.txt). It is picking up the trm2.txt file name and putting it in the file_name column even for the trm1.txt records.
trm

anton · Post by **anton** » Thu Dec 28, 2006 4:09 pm

this is precisely my experience as well - APT_IMPORT_PATTERN_USES_FILESET causes each node from apt config to pick up a file name and use that for all the files it processes.

so in my case i have two nodes and 200 files. as a result (if i have APT_IMPORT_PATTERN_USES_FILESET set to true for the job) i get a file name column populated by the sequential stage, but there are only two unique values in it instead of 200.

file1.dat,data1,data2
file2.dat,data3,data4
file1.dat,data5,data6 <-- this actually came from file3.dat
file2.dat,data7,data8 <-- this actually came from file4.dat
...

alternatively, if i naively assume that things work in a "common sense" way (without setting any variables), specify the file name column in the sequential stage, set the read method to file pattern, and feed it the wildcard corresponding to my files, every single row gets my wildcard, not the actual expanded file name.

*.dat,data1,data2 <-- this actually came from file1.dat
*.dat,data3,data4 <-- this actually came from file2.dat
*.dat,data5,data6 <-- this actually came from file3.dat
*.dat,data7,data8 <-- this actually came from file4.dat

therefore the file name column option in the sequential file stage is pretty much useless and misleading, as is the APT_IMPORT_PATTERN_USES_FILESET variable.

so, the question remains - is there a simple (config-time) option to preserve the file name for files read by pattern in the sequential file stage?

thank you.
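For comparison, the behaviour being asked for (every row carrying the name of the file it was actually read from) looks like this outside DataStage. This is only a Python sketch of the expected result, or of a possible pre-processing step before the job; it is not a DataStage option, and `rows_with_source` and the sample files are hypothetical:

```python
import csv
import glob
import os
import tempfile


def rows_with_source(pattern):
    """Yield each CSV row prefixed with the name of the file it came from."""
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                yield [name] + row


# Hypothetical demo data in a throwaway directory.
workdir = tempfile.mkdtemp()
samples = {"file1.dat": "data1,data2", "file2.dat": "data3,data4"}
for name, line in samples.items():
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(line + "\n")

tagged = list(rows_with_source(os.path.join(workdir, "*.dat")))
for row in tagged:
    print(",".join(row))
```

Here the wildcard is expanded first and the name is attached per file, so the column can never hold the pattern itself or another file's name.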

ray.wurlod · Post by **ray.wurlod** » Thu Dec 28, 2006 4:53 pm

The Sequential File stage can generate two additional columns, one containing the file name of the file currently being read, the other containing the line number within that file of the record currently being read.

But, as noted, it may be wise to set APT_IMPORT_PATTERN_USES_FILESET to False. Or at the very least to experiment. That reported behaviour suggests a small bug.

anton · Post by **anton** » Thu Dec 28, 2006 5:02 pm

thank you for your response, but i am afraid you did not read my post correctly or the post i was replying to.

let me try again.

given in sequential file stage:
- "file name column" is set under "options"
- file pattern is set to /dir/file*
- read method is set to "file pattern"

APT_IMPORT_PATTERN_USES_FILESET is not present or explicitly set to false:
- i get /dir/file* as the value of the file name column for all records in every file

APT_IMPORT_PATTERN_USES_FILESET is set to true
- i get just one unique file name as the value of the file name column for all records in every file. if i run under 2-node configuration, i get two unique file names, etc. so if i have 100 different files, only two file names will ever be used.

once again, both "file name column" and APT_IMPORT_PATTERN_USES_FILESET do not work in this situation.

ray.wurlod · Post by **ray.wurlod** » Thu Dec 28, 2006 8:21 pm

The "file name column" property must refer to a column (type VarChar probably) that is defined on the output link. Is this the case with your design?

The same is true for the file row number property, if you use that.

anton · Post by **anton** » Fri Dec 29, 2006 9:02 am

yes, and, as i mentioned, it gets populated - just with the wrong data.

ray.wurlod · Post by **ray.wurlod** » Fri Dec 29, 2006 3:36 pm

If that's the case (I have not had a chance to check yet) you need to report the bug through your support provider. They will also demand a reproducible case, so have that ready so they can't stall you.
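One way to prepare such a reproducible case is to generate input files in which every row already contains the name of its own file, so any mismatch with the stage's file name column is easy to spot. The sketch below is an assumption about how one might do that in Python; `make_test_files`, `find_mislabeled`, and the simulated `job_output` rows are hypothetical and do not come from DataStage:

```python
import os
import tempfile


def make_test_files(directory, count, rows_per_file):
    """Write `count` files whose rows each carry their own file name."""
    names = []
    for i in range(1, count + 1):
        name = f"file{i}.dat"
        with open(os.path.join(directory, name), "w") as handle:
            for r in range(rows_per_file):
                handle.write(f"{name},row{r}\n")
        names.append(name)
    return names


def find_mislabeled(rows):
    """Return rows whose reported file name differs from the embedded one.

    Each row is (reported_name, embedded_name, payload); reported_name is
    the value of the file name column and may include a directory.
    """
    return [row for row in rows if os.path.basename(row[0]) != row[1]]


workdir = tempfile.mkdtemp()
names = make_test_files(workdir, count=4, rows_per_file=2)

# Simulated job output mimicking the two-node symptom reported in this thread.
job_output = [
    ("file1.dat", "file1.dat", "row0"),
    ("file2.dat", "file2.dat", "row0"),
    ("file1.dat", "file3.dat", "row0"),
    ("file2.dat", "file4.dat", "row0"),
]
bad = find_mislabeled(job_output)
print(len(bad), "mislabeled rows")
```

In practice `job_output` would be replaced by the rows the job actually writes; the generated files and the list of mislabeled rows can then be attached to the support case.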

anton · Post by **anton** » Tue Jan 02, 2007 10:34 am

according to IBM this is fixed in patch 96576 for DS EE 7.5.1A; we are yet to try it in our environment.