How to read a text file of 5GB

stefanfrost1 · Post by **stefanfrost1** » Mon Aug 17, 2009 3:01 am

Rajee,

Try what ArndW is suggesting.

How many columns does the file have, and how long does it take to read the file when the only other stage in your job is a Copy stage (or a Peek stage)?

That will show how long the actual file reader stage takes to execute. If it is indeed 35 min, then you know for sure that the sff-stage is what causes your performance problem.
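For comparison, it can also help to know how fast the operating system alone can stream the file, independent of any DataStage stage. The following is only a minimal sketch of such a baseline measurement; the function name `time_raw_read` and the `chunk_size` parameter are hypothetical, and the number it produces says nothing about record parsing or column conversion.

```python
import time

def time_raw_read(path, chunk_size=8 * 1024 * 1024):
    """Read a file sequentially in binary chunks.

    Returns (total_bytes_read, elapsed_seconds). This measures raw I/O only,
    with no record parsing, so it is a lower bound for any reader stage.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total, elapsed
```

If the raw read takes a small fraction of the job's run time, the disk is unlikely to be the bottleneck and the stage configuration is the more probable cause.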

ArndW: I've heard that in versions 7.5.3 and up it is indeed possible to read variable field length files with multiple nodes if your schema is correct. I haven't tried it myself, and I would gladly be corrected if you know otherwise.

chulett · Post by **chulett** » Mon Aug 17, 2009 6:29 am

stefanfrost1 wrote:ArndW: I've heard that in versions 7.5.3 and up it is indeed possible to read variable field length files with multiple nodes if your schema is correct. I haven't tried it myself, and I would gladly be corrected if you know otherwise.

Ray corrected me recently and said that our 'fixed-width only' statement was 'no longer true' but I didn't get a response to my follow-up query of 'no longer true since when?'. The answer could very well be since 7.5.3, however.

ArndW · Post by **ArndW** » Mon Aug 17, 2009 6:55 am

I tested it after the post a couple of weeks ago (running 8) and the restriction is still there. But perhaps I didn't get it quite right:

(A) One can define multiple readers on a single node with variable length records. I played around today and saw that one can increase read speed by specifying multiple readers (assuming the other stages are fast as well).

(B) One cannot define multiple nodes on a variable length file, i.e. one is restricted to a single node with n-readers. If one tries to change that, the following error message is displayed at runtime:

Error executing View Data command:
##E IIS-DSEE-TOIX-00172 14:43:37(007) <Sequential_File_0> The multinode option requires fixed length records.

chulett · Post by **chulett** » Mon Aug 17, 2009 7:03 am

Ah... perhaps that's the distinction being made. Multiple readers on a single node are allowed for variable length records but multiple nodes requires a fixed-width file.
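To illustrate why several readers can share one variable-length file at all, here is a minimal sketch of a common general technique: give each reader a byte range and let it align itself to the next record delimiter. This is an assumption about how such a scheme can work in principle, not a description of how DataStage implements its readers; the function names `reader_ranges` and `read_range` are hypothetical, and records are assumed to be newline-terminated.

```python
def reader_ranges(file_size, n_readers):
    """Split [0, file_size) into n_readers contiguous byte ranges."""
    step = file_size // n_readers
    bounds = [i * step for i in range(n_readers)] + [file_size]
    return list(zip(bounds[:-1], bounds[1:]))

def read_range(path, start, end):
    """Yield the records whose first byte lies in [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Look at the byte just before `start`. If it is a newline,
            # readline() consumes only that byte and a record begins at
            # `start`. Otherwise readline() skips the partial record,
            # which belongs to the previous reader.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:  # end of file
                break
            yield line.rstrip(b"\n")
```

Each record is read by exactly one reader because ownership is decided by where the record starts. A reader may read past its `end` offset to finish its last record, which is why no fixed record length is needed.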

ArndW · Post by **ArndW** » Mon Aug 17, 2009 7:22 am

I should have added that I tested it today with a 3GB file (a project export .dsx file). If it can handle 3GB, then 5GB should not be an issue.

chulett · Post by **chulett** » Mon Aug 17, 2009 7:31 am

The only "barrier to size" that I'm aware of is your operating system.

ArndW · Post by **ArndW** » Mon Aug 17, 2009 8:54 am

Yep, usually that magical 2GB limit, but the OP opined that DS couldn't do this.
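As a short worked example, and assuming the limit meant here is the classic one that comes from storing file offsets in a signed 32-bit integer (as on systems without large-file support), the arithmetic looks like this. The constant and function names are hypothetical.

```python
# Largest value a signed 32-bit integer can hold, i.e. the largest
# byte offset addressable without large-file support.
MAX_SIGNED_32 = 2**31 - 1
GIB = 2**30

def fits_in_32bit_offset(size_bytes):
    """True if every byte of a file this size is addressable with a signed 32-bit offset."""
    return size_bytes <= MAX_SIGNED_32

print(MAX_SIGNED_32)                  # 2147483647, one byte short of 2 GiB
print(fits_in_32bit_offset(3 * GIB))  # False
print(fits_in_32bit_offset(5 * GIB))  # False
```

Under that assumption, both the 3GB test file and the 5GB file in question exceed the limit, so reading either one depends on the operating system and file system supporting large files.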

dxk9 · Post by **dxk9** » Wed Aug 19, 2009 2:34 am

I have used the Sequential File stage to read more than 2GB of data. I'm not sure of the maximum size, though.

stefanfrost1 · Post by **stefanfrost1** » Thu Aug 20, 2009 12:59 am

ArndW wrote:I tested it after the post a couple of weeks ago (running 8) and the restriction is still there. But perhaps I didn't get it quite right:

(A) One can define multiple readers on a single node with variable length records. I played around today and saw that one can increase read speed by specifying multiple readers (assuming the other stages are fast as well).

(B) One cannot define multiple nodes on a variable length file, i.e. one is restricted to a single node with n-readers. If one tries to change that, the following error message is displayed at runtime:
Error executing View Data command: 
##E IIS-DSEE-TOIX-00172 14:43:37(007) <Sequential_File_0> The multinode option requires fixed length records.

I've been playing around with a variable-length, semicolon-separated file in 7.5.3 on AIX. I've found that I need to use Number of Readers Per Node, and I set it to 10. According to the monitor, my data is partitioned across 10 nodes and I can preserve that partitioning throughout the flow. My (small) test showed a 6 times faster read using 10 nodes than using 1 node.

My file only had 22M rows at a total size of 3GB.

Furthermore, the size limitation that you, Rajee, are experiencing could be at your lookup if you're not partitioning it properly, since each node (at least in 7.5.x) has an OP limit of 2GB memory.

zhzhs · Post by **zhzhs** » Fri Aug 21, 2009 12:43 am

I just want to say that you can create a job that simply converts the sequential file into a Data Set file.