How to read a text file of 5GB

stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

Rajee,

Try what ArndW is suggesting.
How many columns does the file have, and how long does it take to read the file when the only other stage in your job is a Copy stage (or a Peek stage)?
That shows how long the file reader stage itself takes to execute. If it is indeed 35 minutes, then you know for sure that the sff-stage is what causes your performance problem.
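
For a baseline outside DataStage, it can also help to time a plain sequential read of the same file at the OS level. The following is only a minimal sketch: the function name `time_raw_read` and the 1 MiB block size are assumptions made for illustration, not anything DataStage provides. If the raw read is already slow, the disk or file system is the likely bottleneck rather than the job design.

```python
import time


def time_raw_read(path, block_size=1024 * 1024):
    """Read a file sequentially in binary blocks.

    Returns (bytes_read, elapsed_seconds). The data is discarded;
    only the I/O time is of interest.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:  # empty bytes object means end of file
                break
            total += len(block)
    return total, time.perf_counter() - start
```

Calling `time_raw_read("/path/to/file")` (the path is a placeholder) gives a byte count and a duration that can be compared with the elapsed time of the reader-plus-Copy test job.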

ArndW: I've heard that in versions 7.5.3 and up it is indeed possible to read variable field length files with multiple nodes if your schema is correct. I haven't tried it myself, and I would gladly be corrected if you know the truth.
-------------------------------------
http://it.toolbox.com/blogs/bi-aj
my blog on delivering business intelligence using agile principles
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

stefanfrost1 wrote:ArndW: I've heard that in versions 7.5.3 and up it is indeed possible to read variable field length files with multiple nodes if your schema is correct. I haven't tried it myself, and I would gladly be corrected if you know the truth.
Ray corrected me recently and said that our 'fixed-width only' statement was 'no longer true' but I didn't get a response to my follow-up query of 'no longer true since when?'. The answer could very well be since 7.5.3, however.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I tested it after the post a couple of weeks ago (running version 8); the restriction is still there. But perhaps I didn't get it quite right:

(A) One can define multiple readers on a single node with variable length records. I played around today and saw that one can increase read speed by specifying multiple readers (assuming the other stages are fast as well).

(B) One cannot define multiple nodes on a variable length file, i.e. one is restricted to a single node with n-readers. If one tries to change that, the following error message is displayed at runtime:
Error executing View Data command:
##E IIS-DSEE-TOIX-00172 14:43:37(007) <Sequential_File_0> The multinode option requires fixed length records.
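
To illustrate why several readers can share one delimited file even when record lengths vary, here is a minimal Python sketch of the general byte-range technique. It is an assumption-laden illustration, not a description of how DataStage implements its readers; the names `reader_ranges` and `read_range` are hypothetical. Each reader gets a byte range, and a reader that lands in the middle of a record skips ahead to the next newline, leaving that record to the reader of the previous range.

```python
import os


def reader_ranges(path, readers):
    """Split a file into `readers` contiguous byte ranges of similar size."""
    size = os.path.getsize(path)
    step = size // readers
    bounds = [i * step for i in range(readers)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))


def read_range(path, start, end):
    """Return the records that begin inside the byte range [start, end).

    A record begins at offset 0 or directly after a newline.
    """
    records = []
    with open(path, "rb") as handle:
        if start > 0:
            # Step back one byte and read to the next newline. If start is
            # already a record boundary this consumes only that newline;
            # otherwise it skips the tail of a record owned by another reader.
            handle.seek(start - 1)
            handle.readline()
        while handle.tell() < end:
            line = handle.readline()
            if not line:  # end of file
                break
            records.append(line.rstrip(b"\r\n"))
    return records
```

Reading every range from `reader_ranges(path, n)` with `read_range` and concatenating the results in order yields each record exactly once, which is the property that lets the ranges be processed in parallel.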
Last edited by ArndW on Mon Aug 17, 2009 7:12 am, edited 1 time in total.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Ah... perhaps that's the distinction being made. Multiple readers on a single node are allowed for variable length records but multiple nodes requires a fixed-width file.
-craig
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I should have added that I tested it today with a 3GB file (a project export .dsx file). If it can handle 3GB, then 5GB should not be an issue.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

The only "barrier to size" that I'm aware of is your operating system.
-craig
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Yep, usually that magical 2GB limit, but the OP opined that DS couldn't do this.
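
For reference, the 2GB figure commonly comes from file offsets being stored as signed 32-bit integers. A quick arithmetic check in Python (purely illustrative; the helper name is made up and nothing here is DataStage-specific):

```python
# Largest byte offset that fits in a signed 32-bit integer.
MAX_SIGNED_32 = 2**31 - 1


def exceeds_32bit_offset(size_bytes):
    """True if a file of this size cannot be addressed with a signed 32-bit offset."""
    return size_bytes > MAX_SIGNED_32


GIB = 1024**3
print(MAX_SIGNED_32)                  # 2147483647
print(exceeds_32bit_offset(3 * GIB))  # True
print(exceeds_32bit_offset(5 * GIB))  # True
```

So both the 3GB test file and the 5GB file in question only work where the operating system and file system support large files.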
dxk9
Participant
Posts: 105
Joined: Wed Aug 19, 2009 12:46 am
Location: Chennai, Tamil Nadu

Post by dxk9 »

I have used the Sequential File stage to read files larger than 2GB. I'm not sure of the maximum size, though.
stefanfrost1
Premium Member
Posts: 99
Joined: Mon Sep 03, 2007 7:49 am
Location: Stockholm, Sweden

Post by stefanfrost1 »

ArndW wrote:I tested it after the post a couple of weeks ago (running version 8); the restriction is still there. But perhaps I didn't get it quite right:

(A) One can define multiple readers on a single node with variable length records. I played around today and saw that one can increase read speed by specifying multiple readers (assuming the other stages are fast as well).

(B) One cannot define multiple nodes on a variable length file, i.e. one is restricted to a single node with n-readers. If one tries to change that, the following error message is displayed at runtime:

Error executing View Data command:
##E IIS-DSEE-TOIX-00172 14:43:37(007) <Sequential_File_0> The multinode option requires fixed length records.

I've been playing around with a variable-length, semicolon-separated file in 7.5.3 on AIX. I've found that I need to use Number of Readers Per Node, and I set it to 10. According to the monitor, the data is partitioned across 10 nodes, and I can preserve that partitioning throughout the flow. My (small) test showed a read about 6 times faster using 10 nodes than using 1 node.

My file only had 22M rows at a total size of 3GB.

Furthermore, the size limitation that you, Rajee, are experiencing could be at your lookup if you're not partitioning it properly, since each node (at least in 7.5.x) has an OP limit of 2GB of memory.
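
A back-of-the-envelope check of that point, as a small Python sketch. It rests on loud assumptions: rows are spread evenly across nodes when the reference data is partitioned, the in-memory size equals the figure passed in, and an unpartitioned lookup places a full copy on every node. The function names and the 3 GiB example figure are hypothetical.

```python
LIMIT_BYTES = 2 * 1024**3  # the 2GB per-node figure mentioned above


def per_node_bytes(reference_bytes, nodes, partitioned):
    """Estimate how much reference data each node has to hold.

    partitioned=True  -> data is split evenly across nodes (rounded up).
    partitioned=False -> every node holds the complete reference data.
    """
    if partitioned:
        return -(-reference_bytes // nodes)  # ceiling division
    return reference_bytes


def fits(reference_bytes, nodes, partitioned, limit=LIMIT_BYTES):
    """True if the estimated per-node share stays within the limit."""
    return per_node_bytes(reference_bytes, nodes, partitioned) <= limit


reference = 3 * 1024**3  # hypothetical 3 GiB of reference data
print(fits(reference, 4, partitioned=True))   # True
print(fits(reference, 4, partitioned=False))  # False
```

Under these assumptions, the same reference data that fits comfortably when split over four nodes exceeds the limit as soon as each node needs its own full copy.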
zhzhs
Participant
Posts: 13
Joined: Mon Nov 13, 2006 10:40 pm
Location: china

Post by zhzhs »

I just want to say:
you can create one job that simply converts the sequential file to a dataset file.
I have a baby