Print Filenames from Seq File using FilePattern

VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Hi,
I would appreciate your input on how to capture the file names picked up by a Sequential File stage and write them to an output (a Peek stage for now).

Here is what I am trying to do in a test job. This is an RCP-enabled job.

SequentialFileStage ==> Peek
The Sequential File stage is set to read using "File Pattern" for the "Read Method" property.
The "File Name Column" property is set to fileNameColumn.
The column definition has been updated to VarChar(128).

The "File Pattern" property is a job parameter so that I can supply different folder names at run time.

When I run this job with an input value of D:/Temp/TestFiles/*.txt, a folder that holds five ".txt" files, the log shows a completely different count.
The file names are not logged; instead I see

D:/Temp/TestFiles/*.txt
D:/Temp/TestFiles/*.txt
D:/Temp/TestFiles/*.txt
....
....
....
10 times, as the Peek stage in my job is set to 10 rows by default.
I have left the remaining settings of the Sequential File stage at their defaults.

Is this how it is supposed to work, or am I missing an additional setting or some logic?

When I use this logic in a job that loads text files into a database, the data are loaded correctly. It is just that I am unable to print the file name portion.

Let me know if you need any additional details to help you help me.

Thanks in advance for your time
-V
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

OK, it works fine in 7.5.2 without RCP. So is it version 8, or is it RCP? Can you run a test (maybe a separate job) that does not use RCP? If that works, then RCP is the culprit; if it fails the same way, it's a "feature" they seem to have introduced in the new version.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
JoshGeorge
Participant
Posts: 612
Joined: Thu May 03, 2007 4:59 am
Location: Melbourne

Post by JoshGeorge »

Set the environment variable APT_IMPORT_PATTERN_USES_FILESET to TRUE, write your output to another Sequential File stage, and see the result.
Joshy George
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Ray,
Thanks for the input. I tried this job without RCP and had the same outcome. I then tried the non-RCP design on a 7.5.2 box and could not get the file names there either. So something is wrong in the way I have set up the stage.

The only changes I made to the source Sequential File stage after dragging it from the palette onto the canvas are:
1. Added the File Pattern property; the rest of the properties are left at their defaults.
2. Added a new column, "fileNameColumn", to the output link.

Upon reviewing the Director log, I found the following warnings on the 7.5.2 instance.

Sequential_File_0,0: Import consumed only 0bytes of the record's 9 bytes (no further warnings will be generated from this partition)
Sequential_File_0,0: Import warning at record 0.


JoshGeorge,
Thanks for your input as well. I removed the Peek stage and put a Sequential File stage in for the output. I added APT_IMPORT_PATTERN_USES_FILESET and set it to TRUE. This was done on my 8.0.1 installation.

However, when I tried to execute the job, the following errors were reported:
main_program: For createFilesetFromPattern(), could not find any available nodes in node pool "".
Sequential_File_0: At least one filename or data source must be set in APT_FileImportOperator before use.
main_program: Could not check all operators because of previous error(s)


While searching the forum for the error message about APT_FileImportOperator, I stumbled upon a post (viewtopic.php?t=112970&highlight=APT_FileImportOperator) where Ray and others had helped another poster. Based on that, I set up the path name as a parameter. When I executed the job, it still failed with the same warnings and errors as stated above.

Interestingly, with APT_IMPORT_PATTERN_USES_FILESET added to the job, I was able to use "View Data" on the source Sequential File stage. However, View Data showed the same file name repeated 10 times, as the default row count was 10. When I increased this to 24, I was able to see more files: the first 10 rows showed one file name, the second 10 again showed a single file name, and the remaining rows showed a few more files that matched the pattern.

P.S. Even when I removed the suggested environment variable, the source Sequential File stage was able to do a "View Data" (with some file names repeating several times), but only if the path name was supplied via a job parameter.

Are there any other settings that one should look for in the source input file stage?

Thanks again
-V
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

No surprise about 10 rows yielding 10 identical file names: you look at the first file first, so unless it has fewer than 10 rows, that's exactly what I'd expect to see. You could change the sampling and skip settings on View Data to see what I mean.
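
The behaviour described here can be imitated outside DataStage. The following is only a rough Python sketch, not how the Sequential File stage is implemented: it assumes the matched files are read one after another and that each row is tagged with the file it came from. The helper names and the sample files (file0.txt and so on) are made up for illustration.

```python
import glob
import os
import tempfile
from itertools import islice


def read_by_pattern(pattern):
    """Yield (line, file_name) for every line of every file matching pattern.

    Assumption: files are read one after another, so every row of the
    first file comes out before any row of the second.
    """
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield line.rstrip("\n"), os.path.basename(path)


def make_sample_files(folder, file_count=3, rows_per_file=12):
    """Create hypothetical test files; the names are illustrative only."""
    for i in range(file_count):
        path = os.path.join(folder, f"file{i}.txt")
        with open(path, "w", encoding="utf-8") as out:
            for r in range(rows_per_file):
                out.write(f"row{r}\n")


with tempfile.TemporaryDirectory() as folder:
    make_sample_files(folder)
    rows = read_by_pattern(os.path.join(folder, "*.txt"))
    first_ten = list(islice(rows, 10))  # like a 10-row Peek or View Data
    rows.close()  # release the open file before the folder is removed

names_in_first_ten = {name for _, name in first_ten}
print(names_in_first_ten)  # {'file0.txt'}
```

Because the first file already holds more than 10 rows, the 10-row sample never reaches the second file, which matches the repeated file name seen in the thread.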
VCInDSX
Premium Member
Posts: 223
Joined: Fri Apr 13, 2007 10:02 am
Location: US

Post by VCInDSX »

Thanks for the follow-up, Ray. I adjusted the values for "Skip" and "Period" and verified the outcome. Thanks for the insight; it makes sense.
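
For anyone reading later, here is a small Python sketch of why changing "Skip" and "Period" brings the other file names into view. It assumes that Skip drops that many leading rows and that Period keeps every Pth row after that; this is my reading of the View Data options, not a description of the product internals. The function name and the sample rows are hypothetical.

```python
from itertools import islice


def sample_rows(rows, skip=0, period=1, limit=10):
    """Drop `skip` rows, then keep every `period`-th row, up to `limit` rows."""
    return list(islice(islice(rows, skip, None, period), limit))


# Hypothetical input: 3 files of 12 rows each, read one file after another.
rows = [(f"row{r}", f"file{f}.txt") for f in range(3) for r in range(12)]

default_view = sample_rows(rows)            # first 10 rows only
sampled_view = sample_rows(rows, period=4)  # every 4th row of the input

print(sorted({name for _, name in default_view}))
# ['file0.txt']
print(sorted({name for _, name in sampled_view}))
# ['file0.txt', 'file1.txt', 'file2.txt']
```

With the defaults, the sample is used up inside the first file; spreading the same number of rows across the input with a period reaches the later files.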

Also, since our server is in NLS mode, I see from the logged schema that fileNameColumn is generated as ustring.
I enhanced my job to do the following:

SeqFile ==> Copy ==> PEEK
            ||
            ||
            \/ 
        Seq File
The Copy stage copies to target columns of type VarChar(128), Unicode (Extended). With this setting, I see the following warning:
Copy_3: When checking operator: On output data set 0: When binding output schema variable "outRec": When binding output interface field "fileNameColumn" to field "fileNameColumn": Implicit conversion from source type "ustring" to result type "ustring[max=128]": Possible truncation of variable length string. [api\interface_rep.C:6177]
Is there any way to configure the internal ustring to accept a size?
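
The warning only says truncation is possible, so one way to judge the risk is to check the matched names against the 128 limit outside the job. Below is a minimal Python sketch, assuming the column receives the full matched path and that the bound counts characters; neither assumption is confirmed in this thread. The function name too_long is made up for illustration.

```python
import glob

MAX_LENGTH = 128  # bound declared on the target column in this thread


def too_long(paths, max_length=MAX_LENGTH):
    """Return the paths whose character count exceeds max_length."""
    return [path for path in paths if len(path) > max_length]


# Pattern quoted earlier in the thread; glob returns an empty list
# if nothing matches on the machine where this is run.
pattern = "D:/Temp/TestFiles/*.txt"
matched = sorted(glob.glob(pattern))

print(too_long(matched))
```

An empty list means none of the matched paths would be cut at 128 characters under the assumptions above.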

In another instance of this job, I added an ODBC stage after the Copy stage to check whether I could write the file name into a table. I got the following error:

ODBC_Enterprise_7,0: Failure during execution of operator logic. [api\operator_rep.C:376]
ODBC_Enterprise_7,0: Fatal Error: Not bounded length. [type\basic\string.C:167]
Copy_3,0: Internal Error: (shbuf): iomgr\iomgr.C: 1880
node_node1: Player 4 terminated unexpectedly. [processmgr\player.C:157]

Again suspecting the data type in the Copy and ODBC stages, I changed the data type of this column to NVarChar(128), both in the DB table and in the Copy stage for this (ODBC) link.

All is well now, and I can view the file name in the DB table.

Thanks a lot for your invaluable time and input.
-V
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Well done on completing the diagnosis! :D

Now please mark the thread as resolved.