Dataset and schemas
Posted: Wed Jun 01, 2011 12:48 am
I was wondering whether I can somehow use a schema to define the input from a data set.
I was planning to use data sets to pass the data between my jobs, as the documentation suggests "using data sets wisely can be key to good performance in a set of linked jobs".
However, I can't find in the documentation how to actually do it.
Job 1 looks like this:
SeqFile --> ColumnImport -> Transformer --> Dataset
\
\--> SeqFileOut
ColumnImport and SeqFileOut use the same schema file. RCP is enabled for all output links. This job works fine. I can't verify exactly what is stored in the dataset, but the contents of SeqFileOut have all the expected columns and values.
Job 2 should be able to read in the dataset created by Job 1. Something like this:
Dataset --> <Some processing> --> DBTable
I'm not sure how I can retrieve the data from the dataset created in Job 1 using the same schema. Can I instruct the Dataset stage to use a schema at all?
I suspect that it might not be possible, as I can't find any such option in the Dataset stage.
Are there any alternative ways to read the dataset using a schema file?