Server-to-Parallel question on aggregate-and-lookup

PhilHibbs · Post by **PhilHibbs** » Mon Jul 26, 2010 10:33 am

In a Server job, if I want to aggregate a data set and then compare the original data set against the aggregate, I could have a link from a Sequential File going through an Aggregator into a Hashed File, and then have another link from another Sequential File stage that actually points to the same source file, and pull in a reference link from the hashed file to do the look-up.

In the Land of Parallel, what would be the canonical solution to this requirement? Very similar, one job reading the same file twice but with a Lookup Data Set instead of a Hashed File? Two jobs, one loading a Lookup Data Set much like the Hashed File creation part of the Server Job, and then a second job doing the Lookup? Some other solution involving one job that does it all in parallel by some magic?

kris007 · Post by **kris007** » Mon Jul 26, 2010 11:59 am

You can use a Copy stage after your Sequential File stage and define two output links from the Copy stage: one for the lookup and one for the aggregation.
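As a rough illustration of that data flow (not DataStage code), the fork-aggregate-lookup pattern might be sketched in Python as below. The function name, the `region` and `amount` columns, and the `group_total` output column are all hypothetical, chosen only for the example.

```python
from collections import defaultdict


def aggregate_and_lookup(rows, key, value):
    """Read the input once, aggregate one branch, look up from the other."""
    rows = list(rows)  # stands in for the Copy stage: one read, two consumers

    # Branch 1 (Aggregator): sum `value` for each distinct `key`.
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]

    # Branch 2 (Lookup): attach the group total to every original row.
    return [{**row, "group_total": totals[row[key]]} for row in rows]


# Hypothetical sample data.
rows = [
    {"region": "north", "amount": 10.0},
    {"region": "south", "amount": 5.0},
    {"region": "north", "amount": 30.0},
]
result = aggregate_and_lookup(rows, key="region", value="amount")
```

The point of the sketch is only that the source is read once and the aggregate is complete before any original row is matched against it.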

creatingfusion · Post by **creatingfusion** » Mon Jul 26, 2010 1:47 pm

Adding a Copy stage with two output links is appropriate here, as kris007 mentioned. You also need to replace the Hashed File stage with a Data Set if you want to reuse the same data later as a data dictionary; otherwise you can take the link from the Aggregator directly into the Lookup stage.

Thanks
Abhijit.

PhilHibbs · Post by **PhilHibbs** » Tue Jul 27, 2010 2:52 am

creatingfusion wrote: Adding a Copy stage with two output links is appropriate here, as kris007 mentioned. You also need to replace the Hashed File stage with a Data Set if you want to reuse the same data later as a data dictionary; otherwise you can take the link from the Aggregator directly into the Lookup stage.

Interesting. How does that work? It has to process the entire data set (or at least, the entire subset for any given lookup key) before it can start doing the lookups. Is that just part of the magic of Enterprise Edition, that it knows how to cache the data until the aggregation is done, which Server Jobs can't do?

priyadarshikunal · Post by **priyadarshikunal** » Tue Jul 27, 2010 3:19 am

PhilHibbs wrote: Interesting. How does that work? It has to process the entire data set (or at least, the entire subset for any given lookup key) before it can start doing the lookups. Is that just part of the magic of Enterprise Edition, that it knows how to cache the data until the aggregation is done, which Server Jobs can't do?

Yes, you are on the right track. The Lookup stage won't start processing input rows until it has fetched all the records from the reference link.

ray.wurlod · Post by **ray.wurlod** » Tue Jul 27, 2010 3:53 am

Actually server jobs can do it, if you specify use of the read cache.

In parallel jobs it's very obvious what's going on if you look at the score. A Lookup stage generates a composite operator containing the two operators LUT_CreateOp (which loads the reference data set into memory and creates an index on the key), and LUT_ProcessOp (which actually performs the lookups).
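To make the two-phase behaviour concrete, here is a minimal Python sketch of the idea, not the engine's actual implementation. The function names `create_lookup_table` and `process_lookups` are hypothetical stand-ins for the roles of LUT_CreateOp and LUT_ProcessOp, the sample columns are invented, and dropping unmatched rows is just one assumed lookup-failure policy.

```python
def create_lookup_table(reference_rows, key):
    """Create phase: consume the whole reference link and index it by key."""
    table = {}
    for row in reference_rows:
        table[row[key]] = row  # simplification: the last row wins on duplicate keys
    return table


def process_lookups(stream_rows, table, key):
    """Process phase: probe the in-memory table once per stream row."""
    for row in stream_rows:
        match = table.get(row[key])
        if match is not None:  # assumed policy: unmatched rows are dropped
            yield {**row, **match}


# Hypothetical sample data: the reference link carries the aggregated totals.
reference = [
    {"region": "north", "total": 40.0},
    {"region": "south", "total": 5.0},
]
stream = [
    {"region": "north", "amount": 10.0},
    {"region": "east", "amount": 1.0},
]

# The table must be fully built before the first stream row is processed.
table = create_lookup_table(reference, key="region")
matched = list(process_lookups(stream, table, key="region"))
```

Because the create phase returns only after the reference input is exhausted, the stream side cannot see a partially built table, which is the behaviour described above.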

DSXchange
