40 Lookups for each row

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

40 Lookups for each row

Post by endyl13 »

Dear all,

I have a source table with 60 fields and 4.6 million rows; 40 of those fields need to be looked up against a single lookup table. If a value from the source table is not found during the lookup, it should be written to a reject table.
I loaded the lookup table into a hash file and then wrote a Routine that looks up the hash file and also writes any rejected value to a second hash file, which is loaded into a table after the job finishes.
All 40 fields use this routine, which means each row calls it 40 times, reads the same hash file 40 times and (possibly) writes a reject value to the reject hash file 40 times.
On our AIX machine (only one processor), the performance is quite slow, only 60 rows/second, so the whole process will take around 21 hours.
I've also tried writing the source table to a hash file first, but that did not improve things much.
Is there a way to improve the performance? Perhaps by loading the whole hash file into memory somehow?
I cannot change the structure of the source data because it comes from the source application, which is out of my control.

Thanks in advance...

Regards

ND
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

Why do your rejects have to go to a hash file? Why not a sequential file, where there's no hashing overhead and DataStage can simply append a row? Did I misunderstand you, or does a column that fails the validation lookup/routine go to the exception file, so that a row with 40 failed lookups produces 40 exception rows? Why are you using a routine lookup to the hash file for all 40 columns? How did you write this routine? Is the routine opening (and closing) the hash file each time it is called, or did you put the hash file handle into COMMON?

You have one CPU, so you cannot instantiate your job to spread the work across many CPUs. Is this a development server, where in production you would have more CPUs? If so, you should make the job instantiate-able, and then you can expect a linear improvement (8 CPUs in production = 8X faster).

How big is the lookup table? Are there tens, hundreds, thousands or millions of rows? Should I be asking about preload to memory, or is the lookup too big for that? Why use a function at all? You have eliminated the point of a graphical metaphor for describing the logic. Why, in fact, even use DataStage, when you could write what you're doing in perl, C or anything else? Why not put this into a table somewhere and join?

I'd first change your process to write a single exception record per source row, which can contain all of that row's exceptions. Using stage variables, you can build an exception output string that accumulates every exception derived from the one source row. Because each derived exception is concatenated onto the string, the one output row you write to a sequential text file is really a delimited series of exception rows, so you get the same effect I think you get now.
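
A minimal sketch of that stage-variable idea, assuming the hash reference lookups described in the next paragraph and hypothetical link/column names (In for the input link, LkpCode and LkpType for two of the 40 reference links):

* Stage variable svExceptions -- builds one delimited exception string per
* source row; NOTFOUND is the link variable set by each reference lookup:
   (If LkpCode.NOTFOUND Then In.KEY1 : "," : In.KEY2 : "," : In.CODE : Char(10) Else "") :
   (If LkpType.NOTFOUND Then In.KEY1 : "," : In.KEY2 : "," : In.TYPE : Char(10) Else "")
* ...repeat one term for each of the remaining looked-up columns...
* Constraint on the exceptions output link:   Len(svExceptions) > 0
* Derivation of the single exceptions column: svExceptions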

Then, I'd switch your lookups to DataStage hash lookups using the DataStage graphical design metaphor and choose preload to memory. I'd do 40 joins in the Designer and live with the ugly job. With a routine lookup I don't see how you can be using hash caching, so a commonly looked-up row is fetched each and every time.

Lastly, I would stop reading from the source database table directly into transformation processing and instead extract the source table to a sequential file first. That way you can see the performance of the database spooling to file on its own. Switch your transform job to read that file, and then you can see the performance of the lookup processing without having to wait on database spooling.



Kenneth Bland
endyl13
Premium Member
Posts: 44
Joined: Tue Jun 25, 2002 12:47 am
Location: Qatar

Post by endyl13 »

Kenneth,

Actually, I previously tried writing to one sequential file using the graphical design. The problem was that I had to define the same file 40 times in the design, and the output in the seq file only came from the last column written to it, so instead of getting a row for each of the 40 columns I only got 1 row.
That's why I changed the design to use a hash file instead: the hash file has key fields that keep all the rows unique. All the rejected values are then written to a Reject Table in the database so that the users can view them from their reporting tool.

Right now the routine looks something like this:

COMMON /Lookup/ LUFile, RejFile

* Arg         = the value to look up
* Key1 & Key2 = the key fields from the source row

Open "Field_LU" To LUFile Else Ans = -1
Open "RjctFile" To RejFile Else Ans = -1

Read Output From LUFile, Arg Else
   Write Arg To RejFile, Key1 : Key2 Else Ans = -1
   Output = "Invalid Input"
End
Ans = Output

My idea in doing the lookup with a routine was to simplify the job design, and so that all the other developers can follow the same standard when doing lookups. The lookup is a hash file with around 10,000 rows, a copy of a slowly changing dimension table.
In my job I also write all the rows from the source table to a seq file, and then, before writing to the target, I write to another seq file.

As for our current machine, it is a development machine. Our production machine will have 4 processors, and each processor will also be twice the speed of our dev machine's, so I expect the job will run faster in production. But since the production machine is not here yet, I cannot test it.

The other thing is that the source table will only have 4.6 million rows at the initial load; after that, the daily run will only carry the changes, around 10,000 rows, so the daily process will be very quick.

Kindly advise...[:)]
And thank you for the link you sent me [:D]



Regards

ND
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
If I understood correctly, you have 1 lookup which 40 columns depend on.
In that case you could have 1 lookup against a preloaded-to-memory hash file (in the hash file stage's output, change pre-load to enabled), with all 40 columns lookup-derived.
The regular output's constraint should be lookuplink.NOTFOUND = @FALSE,
and the other output should have the reject box checked so it automatically gets the rows which did not pass the other link's constraint.
This should be simple enough to implement and would work fine.
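
A quick sketch of how that might look in the Transformer, with hypothetical names (In for the input link, lookuplink for the single hash reference link):

* Reference link key derivation:           lookuplink.KEY  <-  In.KEY1 : In.KEY2
* Derivations of the 40 output columns, all taken from the reference link:
*    Out.FIELD01 = lookuplink.FIELD01  ...  Out.FIELD40 = lookuplink.FIELD40
* Constraint on the regular output link:   lookuplink.NOTFOUND = @FALSE
* On the reject output link, tick the Reject Row box so rows failing the
* constraint above are routed there automatically.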

If you have 40 lookups to different files, the same logic can apply with a somewhat more complex constraint (and you can stretch the lookup stage out so it looks nicer with the mass of links on it, in case you need 40 different lookups to the same file).

In any case, I think the graphical solution is the best-optimized approach for your situation.

I hope this helps,

Roy R.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

It sounds like you are doing bread-and-butter foreign surrogate key substitution. You cannot possibly be using any caching because of your choice of a function call rather than a hash stage. You're also in dangerous waters if you're setting this as a standard. Your ability to instantiate this job is limited, so even having 4 CPUs is the same as having 1 when your job design limits you to 1, unless you do very special things.

I repeat that you should use a sequential output file as your exceptions file. It is easily browse-able, you can email it, and you can format it for anywhere. If you put the exceptions into a hash file, you still have to get the data out into a presentation format.

As for your function, I know you typed that up quickly as pseudo-code. I hope you are checking to see whether the COMMON file handles are initialized and, if not, THEN opening them and setting the variables. If your posted logic does not include this (if you put the code in a post, Ray Wurlod is the wizard and he can correct any mistakes), then you are opening and closing the hash file on every single call.
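
A minimal sketch of that initialisation pattern, reusing the handle names from the routine posted above (the Initialized flag in COMMON is an assumed addition):

COMMON /Lookup/ LUFile, RejFile, Initialized

If Not(Initialized) Then
   * Open the hash files on the first call only; later calls reuse the
   * handles held in COMMON.
   Open "Field_LU" To LUFile Else Ans = -1
   Open "RjctFile" To RejFile Else Ans = -1
   Initialized = @TRUE
End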

I'm not sure I understand your description of what the job is supposed to do (40 output files?). Can someone help me understand, because I think you're doing something you probably don't have to do. I've NEVER resorted to using function-based lookups, and I like to consider myself pretty knowledgeable about DataStage and its capabilities. I think I've seen just about every job type, and if you have to output 40 files (or N files, for that matter), then it sounds more like a pivoting exercise.

Can you please describe the source data (paste a row), define what a column has to do in the lookup (paste a lookup row), and describe the output row (paste a row)? I think we can then help best if you also tell us your DataStage version and whether you are on Unix or NT.

Thanks,


Kenneth Bland