using reference and output as the same file

dnat · Post by **dnat** » Tue Feb 10, 2009 3:14 am

Hi,

We have a server job that uses the same file as both the reference and the output.

For example:

We have the following records in the file

key1,rectype,update
1111,AA,1
1111,BB,2
1112,BB,3
1113,AA,3

In this file, record type (rectype) AA is an insert record and BB is an update record.

Once we process this file, we apply it to a DB.

The record 1111,BB,2 follows an AA record with the same key (1111,AA,1), so the update from 1111,BB,2 is overlaid on that first record.
For the third record (1112,BB,3), the job checks whether a record is already present in the DB, since it is a BB type (update record). If it is present, the record is updated; otherwise the job errors out.
The fourth record (1113,AA,3) is a direct insert.

So in the DB we can find the following values:

key1,update
1111,2
1112,3 (assuming that a record was already present in the DB)
1113,3

We used a hashed file as both the reference and the output.
That is, once an insert record is written to the output, it is also present in the reference and is considered in the lookup for any update record that follows.
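
The processing described above can be sketched roughly as follows. This is only an illustration in Python, not DataStage code: the function name `apply_records` is hypothetical, a plain dict stands in for the hashed file, the pre-existing value for key 1112 is made up, and treating an unknown record type as an error is an assumption.

```python
import csv
import io


def apply_records(text, db):
    """Apply insert (AA) and update (BB) records in file order.

    `db` is a plain dict standing in for the hashed file: it is read as
    the reference and written as the output within the same pass, so an
    insert is visible to any update that follows it.
    Returns the keys of the records that were rejected.
    """
    errors = []
    for row in csv.DictReader(io.StringIO(text)):
        key, rectype, value = row["key1"], row["rectype"], row["update"]
        if rectype == "AA":
            # Insert record: written straight to the output.
            db[key] = value
        elif rectype == "BB":
            # Update record: only valid if the key is already present,
            # either from the DB or from an earlier AA in this file.
            if key in db:
                db[key] = value
            else:
                errors.append(key)
        else:
            # Assumption: any other record type is rejected.
            errors.append(key)
    return errors


SAMPLE = """key1,rectype,update
1111,AA,1
1111,BB,2
1112,BB,3
1113,AA,3
"""

# Assumption: key 1112 already exists in the DB before the run.
db = {"1112": "0"}
errors = apply_records(SAMPLE, db)
print(db)
print(errors)
```

The sketch depends on records being processed strictly in file order against a single shared store, which is the behaviour the question is about.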

Can we implement the same logic in a parallel job? Is there any way to do it?

ray.wurlod · Post by **ray.wurlod** » Tue Feb 10, 2009 5:06 am

This is called a "blocking operation". It is entirely forbidden in parallel jobs, as it disrupts pipeline parallelism.

kiran259 · Post by **kiran259** » Tue Feb 10, 2009 10:03 pm

How about splitting this into two jobs and running them as a sequence? A lookup reads all the reference data into memory before it starts processing. I am not sure, but this is not recommended for huge amounts of data.

uegodawa · Post by **uegodawa** » Wed Feb 11, 2009 1:14 pm

The same business logic can also be achieved by creating a server job, where the same hash file is used for both reference and output.

Input ---- Transformer ---- Hash File
                |
                |
            Hash File

ray.wurlod · Post by **ray.wurlod** » Wed Feb 11, 2009 4:34 pm

This fact was mentioned in the original post.