Dataset retrieval

jamshid_kunhalu
Participant
Posts: 5
Joined: Wed Aug 31, 2005 9:19 pm
Location: Mumbai

Dataset retrieval

Post by jamshid_kunhalu »

Hi all,

This is regarding the dataset stage. I am not able to read the source data file from the server with a dataset stage; it shows an Orchestrate error. The same source file can be read with a sequential file stage.

Also, when I run a separate job, SEQ ----> Dataset, the job runs successfully. I saved the target dataset as target.ds.

When I used target.ds as the source dataset in the earlier job, it worked fine. What I want to know is whether only a file with a .ds extension can be loaded by the dataset stage. If not, what is the alternative for using a dataset as the source stage instead of a sequential file?


jamshid
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Hello jamshid,

Datasets in PX can be given any filename and can be located anywhere on the system. The dataset file itself contains only the schema and other information pointing to the actual data files, and is thus small. I don't quite understand your problem or question, especially where you are getting an Orchestrate error. What is the error?

You cannot take a sequential file, rename it as a .ds file and then read it using the PX dataset file type; but I am not sure if that is your question.
RAJEEV KATTA
Participant
Posts: 103
Joined: Wed Jul 06, 2005 12:29 am

Re: Dataset retrieval

Post by RAJEEV KATTA »

Hi Jamshid,
Can you tell us more precisely what exactly you are trying to do?

Cheers,
Rajeev.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

A Data Set contains data in internal (binary) format. A persistent Data Set (one that is on disk) must have been created with a Data Set stage - there is no other way. There are one or more data files on each processing node; the control file (the one whose name ends in ".ds") describes the location and number of these data files (each is at most 2GB). The control file for a virtual Data Set (which is in memory) has a name ending in ".v"; you can see the use of these by inspecting the generated osh script.
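
To illustrate the control-file idea described above, here is a minimal Python sketch. It is a toy model only: the JSON descriptor, the ".partN" file names and the functions write_toy_dataset and read_toy_dataset are hypothetical and are not the real Data Set format, which is internal and binary. It only shows why a small descriptor plus separate data files cannot be replaced by a flat file that has been renamed to ".ds".

```python
import json
import os
import tempfile


def write_toy_dataset(descriptor_path, partitions):
    """Write one data file per partition plus a small descriptor listing them.

    Hypothetical layout for illustration; not the DataStage on-disk format.
    """
    segment_paths = []
    for index, rows in enumerate(partitions):
        segment_path = f"{descriptor_path}.part{index}"
        with open(segment_path, "w", encoding="utf-8") as segment:
            for row in rows:
                segment.write(row + "\n")
        segment_paths.append(segment_path)
    # The descriptor holds only pointers to the data, so it stays small.
    with open(descriptor_path, "w", encoding="utf-8") as descriptor:
        json.dump({"segments": segment_paths}, descriptor)


def read_toy_dataset(descriptor_path):
    """Follow the descriptor to its data files and return all rows."""
    with open(descriptor_path, encoding="utf-8") as descriptor_file:
        # A renamed flat file is not a descriptor, so parsing fails here.
        descriptor = json.load(descriptor_file)
    rows = []
    for segment_path in descriptor["segments"]:
        with open(segment_path, encoding="utf-8") as segment:
            rows.extend(line.rstrip("\n") for line in segment)
    return rows


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    target = os.path.join(work_dir, "target.ds")
    write_toy_dataset(target, [["IN,India", "DE,Germany"], ["AU,Australia"]])
    print(read_toy_dataset(target))
```

In this toy model, pointing read_toy_dataset at a plain text file raises an error while parsing the descriptor, which is loosely analogous to the Data Set stage rejecting a flat file.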
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jamshid_kunhalu
Participant
Posts: 5
Joined: Wed Aug 31, 2005 9:19 pm
Location: Mumbai

Post by jamshid_kunhalu »

Hi,

I am trying to load a file from my Unix server (/auto/user/jstage/country.dat) with a dataset stage, which is the source in my job. When I use the dataset stage directly by giving the above path in the stage, it shows the Orchestrate error while running the job. It says that the path mentioned is missing from the Orchestrate framework.

But the same file can be read using a sequential file stage, and the job runs successfully. When I use the dataset stage instead of the sequential file stage as my source, it shows the above problem.



thanx

jamshid
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Jamshid,

you cannot read a flat file using the dataset stage. Read it using the sequential file stage and write it to a dataset stage.
jamshid_kunhalu
Participant
Posts: 5
Joined: Wed Aug 31, 2005 9:19 pm
Location: Mumbai

Post by jamshid_kunhalu »

ArndW wrote:Jamshid,

you cannot read a flat file using the dataset stage. Read it using the sequential file stage and write it to a dataset stage.

Hi Arnd,

So in exactly which scenarios do we need to use dataset stages? When exactly do they help in increasing performance?

And if I stick to the sequential file stage for my job, will it affect my performance? Or do I need to go for some other alternative?



thanx
jamshid
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

If your source is a sequential file of variable length records then you will not experience any gains by first writing it to a dataset and then processing it. The PX speed comes from its ability to do things in parallel - but a sequential file cannot be read in parallel (unless it is of fixed record length).
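
As a rough illustration of why fixed record length makes a parallel read possible, here is a minimal Python sketch. It is not how PX is implemented; the function names (reader_ranges, read_range, parallel_read) are hypothetical. The point is only that with a known record length each reader can compute its own starting byte offset without scanning for record delimiters, which is not possible with variable length records.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def reader_ranges(file_size, record_length, readers):
    """Return one (byte_offset, record_count) pair per reader."""
    if file_size % record_length != 0:
        raise ValueError("file size is not a multiple of the record length")
    total_records = file_size // record_length
    base, extra = divmod(total_records, readers)
    ranges = []
    first_record = 0
    for reader in range(readers):
        # Spread any remainder over the first few readers.
        count = base + (1 if reader < extra else 0)
        ranges.append((first_record * record_length, count))
        first_record += count
    return ranges


def read_range(path, offset, count, record_length):
    """Seek straight to the offset and read `count` fixed-length records."""
    with open(path, "rb") as source:
        source.seek(offset)
        data = source.read(count * record_length)
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


def parallel_read(path, record_length, readers):
    """Read the whole file with several readers, keeping record order."""
    ranges = reader_ranges(os.path.getsize(path), record_length, readers)
    with ThreadPoolExecutor(max_workers=readers) as pool:
        chunks = pool.map(
            lambda r: read_range(path, r[0], r[1], record_length), ranges
        )
        return [record for chunk in chunks for record in chunk]


if __name__ == "__main__":
    handle, sample = tempfile.mkstemp()
    with os.fdopen(handle, "wb") as out:
        for number in range(10):
            out.write(f"REC{number:05d}".encode("ascii"))  # 8 bytes each
    print(parallel_read(sample, record_length=8, readers=3))
    os.remove(sample)
```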

Datasets are used to store and read data in the way a sequential file would be used in Server jobs. These files can be read and written very quickly in PX. They can also be used directly as lookups. Think of datasets as parallel sequential files and try to use them where possible instead of sequential files.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Persistent (on disk) Data Sets are intended to be used for those occasions where one parallel job prepares and stages data for a subsequent parallel job to use.

If no staging is required, persistent Data Sets are not required; the data can be passed to and fro between virtual (in memory) Data Sets.
gbusson
Participant
Posts: 98
Joined: Fri Oct 07, 2005 2:50 am
Location: France

Post by gbusson »

You can improve performance by setting the "Number of Readers Per Node" option in the Sequential File stage to a value greater than 1.