Dataset retrieval

jamshid_kunhalu · Post by **jamshid_kunhalu** » Tue Oct 25, 2005 12:38 am

hi all

This is regarding the Data Set stage. I am not able to read the source data file from the server with a Data Set stage; it shows an Orchestrate error. The same source file can be read with a Sequential File stage.

Also, when I run a separate job, SEQ ----> Dataset, the job runs successfully. I saved the target Data Set as target.ds.

When I used target.ds as the source for the Data Set stage in the earlier job, it worked fine. What I want to know is whether only files with a .ds extension can be loaded by the Data Set stage. If not, what is the alternative for using a Data Set as the source stage instead of a sequential file?

jamshid

ArndW · Post by **ArndW** » Tue Oct 25, 2005 12:46 am

Hello jamshid,

Datasets in PX can have any filename and can be located anywhere on the system. They contain only the schema and other information pointing to the actual data files, and are therefore small. I don't quite understand your problem or question, especially where you are getting an Orchestrate error. What is the error?

You cannot take a sequential file, rename it as a .ds file and then read it using the PX dataset file type; but I am not sure if that is your question.

RAJEEV KATTA · Post by **RAJEEV KATTA** » Tue Oct 25, 2005 12:49 am

Hi Jamshid,
Can you tell us more precisely what exactly you are trying to do?

Cheers,
Rajeev.

ray.wurlod · Post by **ray.wurlod** » Tue Oct 25, 2005 1:15 am

A Data Set contains data in internal (binary) format. A persistent Data Set (one that is on disk) must have been created with a Data Set stage - there is no other way. There are one or more data files on each processing node; the control file (the one whose name ends in ".ds") describes the location and number of these data files (each is max 2GB). The control file for a virtual Data Set (which is in memory) has a name ending in ".v"; you can see the use of these by inspecting the generated osh script.
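
To illustrate the control-file idea described above, here is a toy model in Python. It is only a sketch and not the real format: the thread does not describe the layout of a ".ds" file, and real Data Sets hold binary data, whereas this example uses JSON for readability. The names `write_toy_dataset`, `read_toy_dataset`, and the `.part` suffix are hypothetical. The point is that the reader expects a small descriptor pointing at separate data files, so a flat file that has merely been renamed cannot be read this way.

```python
import json
import os
import tempfile


def write_toy_dataset(descriptor_path, schema, partitions):
    """Write one data file per partition plus a small descriptor pointing to them.

    Toy model only: JSON stands in for the real (binary, undocumented here) format.
    """
    data_files = []
    for i, rows in enumerate(partitions):
        data_path = f"{descriptor_path}.part{i}"  # hypothetical naming scheme
        with open(data_path, "w", encoding="utf-8") as f:
            json.dump(rows, f)
        data_files.append(data_path)
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump({"schema": schema, "data_files": data_files}, f)


def read_toy_dataset(descriptor_path):
    """Read the descriptor first, then every data file it points to."""
    with open(descriptor_path, encoding="utf-8") as f:
        try:
            descriptor = json.load(f)
        except json.JSONDecodeError as exc:
            raise ValueError(f"{descriptor_path} is not a descriptor") from exc
    if not isinstance(descriptor, dict) or "data_files" not in descriptor:
        raise ValueError(f"{descriptor_path} is not a descriptor")
    rows = []
    for data_path in descriptor["data_files"]:
        with open(data_path, encoding="utf-8") as f:
            rows.extend(json.load(f))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        good = os.path.join(workdir, "target.ds")
        write_toy_dataset(good, ["code", "name"], [[["IN", "India"]], [["FR", "France"]]])
        print(read_toy_dataset(good))

        # A flat file renamed to .ds has no descriptor content, so reading fails.
        renamed = os.path.join(workdir, "country.ds")
        with open(renamed, "w", encoding="utf-8") as f:
            f.write("IN,India\nFR,France\n")
        try:
            read_toy_dataset(renamed)
        except ValueError as err:
            print("rejected:", err)
```

In this model the descriptor stays small no matter how much data the partitions hold, which matches the description of the control file as only describing where the data files are.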

jamshid_kunhalu · Post by **jamshid_kunhalu** » Tue Oct 25, 2005 1:34 am

hi

I am trying to load a file from my UNIX server (/auto/user/jstage/country.dat) with a Data Set stage, which is the source in my job. When I use the Data Set stage directly by specifying the above path in the stage, it shows an Orchestrate error while running the job, saying that the path mentioned is missing from the Orchestrate framework.

The same file can be read using a Sequential File stage, and the job runs successfully. But when I use the Data Set stage instead of the Sequential File stage as my source, it shows the above problem.

thanx

jamshid

ArndW · Post by **ArndW** » Tue Oct 25, 2005 1:37 am

Jamshid,

you cannot read a flat file using the dataset stage. Read it using the sequential file stage and write it to a dataset stage.

jamshid_kunhalu · Post by **jamshid_kunhalu** » Tue Oct 25, 2005 1:53 am

ArndW wrote: Jamshid,

you cannot read a flat file using the dataset stage. Read it using the sequential file stage and write it to a dataset stage.

hi Arnd

So in exactly which scenarios do we need to use Data Set stages? When exactly do they help in increasing performance?

And if I stick with the Sequential File stage for my job, will it affect my performance, or do I need to go for some other alternative?

thanx
jamshid

ArndW · Post by **ArndW** » Tue Oct 25, 2005 2:12 am

If your source is a sequential file of variable-length records then you will not experience any gains by first writing it to a dataset and then processing it. The PX speed performance comes from its ability to do things in parallel - but a sequential file read cannot be processed in parallel (unless it is of fixed record length).
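
The point about fixed record length can be sketched in plain Python. This is an illustration of the general idea only, not how PX implements it: with a fixed record length, each reader's starting byte offset can be computed from the file size alone, so several readers can seek straight to their own slice without scanning for record delimiters. The record length, the reader count, and the function names below are assumptions made for the example.

```python
import os
import tempfile

RECORD_LEN = 8  # assumed fixed record length in bytes (hypothetical)


def partition_ranges(file_size, record_len, readers):
    """Return one (byte_offset, record_count) pair per reader.

    Works only because every record has the same length, so record
    boundaries are known without reading the file.
    """
    total_records = file_size // record_len
    base, extra = divmod(total_records, readers)
    ranges = []
    start_record = 0
    for i in range(readers):
        count = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start_record * record_len, count))
        start_record += count
    return ranges


def read_range(path, offset, count, record_len):
    """Seek directly to a reader's slice and read its records."""
    with open(path, "rb") as f:
        f.seek(offset)
        return [f.read(record_len) for _ in range(count)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "fixed.dat")
        records = [f"REC{i:05d}".encode("ascii") for i in range(10)]
        with open(path, "wb") as f:
            f.write(b"".join(records))

        # Each slice could be handed to a separate reader process.
        for offset, count in partition_ranges(os.path.getsize(path), RECORD_LEN, 3):
            print(offset, read_range(path, offset, count, RECORD_LEN))
```

With variable-length records the offsets cannot be computed this way, because a reader does not know where a record starts without looking at the data before it.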

Datasets are used to store and read data that, in Server jobs, would be held in a sequential file. These files can be read and written very quickly in PX. They can also be used directly as lookups. Think of datasets as parallel sequential files and try to use them where possible instead of sequential files.

ray.wurlod · Post by **ray.wurlod** » Tue Oct 25, 2005 4:48 am

Persistent (on disk) Data Sets are intended to be used for those occasions where one parallel job prepares and stages data for a subsequent parallel job to use.

If no staging is required, persistent Data Sets are not required; the data can be passed to and fro between virtual (in memory) Data Sets.

gbusson · Post by **gbusson** » Tue Oct 25, 2005 8:45 am

You can improve performance if you set the "Number of readers per node" option in the Sequential File stage to a value greater than 1.

DSXchange
