
Flat file versus database for lookups

Posted: Tue Mar 04, 2008 7:30 am
by whenry6000
All,
I am doing a lookup where I have one table with around 50,000 rows, and I need to get values out of another table that potentially has about 1,000,000 rows.

It was suggested that it would be faster to write the 1,000,000 rows to a flat file and do the lookup from there, since DataStage (or any other ETL tool) processes large amounts of data faster from a file than from a database.

The person suggested that the flat file would be read into memory for the lookup, and that the read itself would be faster than reading from a database table because the flat file could be read in parallel threads. Also, once the data was in memory, the lookup would be faster.
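As I understand it, the suggestion amounts to something like this (a rough Python sketch of the idea, not our actual DataStage job; the file name, delimiter and two-column layout are made up):

Code:
import csv

def load_reference(path):
    # Read the ~1,000,000-row reference file once into an in-memory dict.
    lookup = {}
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="|"):
            key, value = row[0], row[1]   # assumed two-column layout
            lookup[key] = value
    return lookup

def enrich(input_rows, lookup):
    # Probe the in-memory table for each of the ~50,000 input rows.
    for key, payload in input_rows:
        yield key, payload, lookup.get(key)   # hash probe, no database round trip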

Does this sound correct?

Posted: Tue Mar 04, 2008 7:48 am
by ArndW
If you limit yourself to a Lookup stage, then you have two phases: the first is getting the data from the source system into DataStage, and the second is applying the lookup to that result set. Phase 2 will be identical regardless of the source (assuming you don't try a sparse lookup).

The sequential file read is most likely going to be faster. But most of the time the data is not going to be in a file but in a table; plus, if you have a complex SQL query, it will run much faster against a table.
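To make the two phases concrete, here is a rough Python sketch (not DataStage itself; the driver, SQL and column names are only placeholders):

Code:
import csv

def rows_from_file(path):
    # Phase 1, variant A: stream key/value pairs out of a sequential file.
    with open(path, newline="") as f:
        for key, value in csv.reader(f, delimiter="|"):
            yield key, value

def rows_from_table(cursor):
    # Phase 1, variant B: stream the same pairs out of a database cursor.
    cursor.execute("SELECT ref_key, ref_value FROM reference_table")  # placeholder SQL
    for key, value in cursor:
        yield key, value

def build_lookup(rows):
    # Phase 2: identical no matter where the rows came from.
    return dict(rows)

The point is that build_lookup never knows or cares whether the rows came from a sequential file or a table; only phase 1 changes.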

Posted: Tue Mar 04, 2008 9:30 am
by whenry6000
If the data is already in a table, is there any benefit to sending it to a flat file just for the lookup? I can't think of one. I can write the flat file as I create my table (I need the table for purposes other than just the lookup), but is there a point to doing so?

Also, this person was saying that a sorted file works faster. I had some concerns about having to update/insert on a sorted flat file every day with the new records that will be added.

If the data already came in a flat file, then I would be all for it, but I don't see any point in creating a flat file just for the purpose of performing a lookup.
ArndW wrote:If you limit yourself to a Lookup stage, then you have two phases: the first is getting the data from the source system into DataStage, and the second is applying the lookup to that result set. Phase 2 will be identical regardless of the source (assuming you don't try a sparse lookup).

The sequential file read is most likely going to be faster. But most of the time the data is not going to be in a file but in a table; plus, if you have a complex SQL query, it will run much faster against a table.

Posted: Tue Mar 04, 2008 10:16 am
by ArndW
There is no benefit in duplicating the data in a flat file. Also, if your table is partitioned you might actually get better throughput from the parallel database access than by using a single-threaded sequential file.

I don't think that sorting would make a difference here.

Luckily, all of these questions are easily answered by writing a test job with test data and doing empirical studies.
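Something as simple as this kind of timing harness will give you real numbers for your own volumes (a Python stand-in for a proper test job; the loader functions and file/table names are hypothetical, taken from the sketch in my earlier reply):

Code:
import time

def time_load(label, row_source):
    # row_source is any iterable of (key, value) pairs - a file reader,
    # a database cursor, whatever the test job is meant to exercise.
    start = time.perf_counter()
    lookup = dict(row_source)
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(lookup)} reference rows loaded in {elapsed:.2f}s")
    return lookup

# e.g. time_load("flat file", rows_from_file("reference.dat"))
#      time_load("database",  rows_from_table(connection.cursor()))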

Posted: Tue Mar 04, 2008 12:04 pm
by whenry6000
ArndW wrote:There is no benefit in duplicating the data in a flat file. Also, if your table is partitioned you might actually get better throughput from the parallel database access than by using a single-threaded sequential file.

I don't think that sorting would make a difference here.

Luckily, all of these questions are easily answered by writing a test job with test data and doing empirical studies.
Yes, you are correct. I just figured someone had a quick answer. I was under the impression you were right, based on my experience, but the person in question won't take my word for it. I will have to do the test and show him.

Thanks!

Posted: Tue Mar 04, 2008 1:47 pm
by jmarutz
We've been using the Lookup File Set stage instead of doing lookups against our databases and we're seeing an increase in performance (this is not using a flat file for the lookup, though). You can also partition the lookup file set if the lookup is large, to keep your memory usage down. For large lookups that are used often, the gain has been substantial.
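Roughly what the partitioning buys you, sketched in Python (this is only an illustration of the idea, not the actual Lookup File Set internals; the partition count and hashing are arbitrary):

Code:
def partition_of(key, num_partitions):
    # Hash the lookup key so each partition owns only its share of the rows.
    return hash(key) % num_partitions

def split_reference(pairs, num_partitions=4):
    # Spread the reference rows across partitions to keep per-node memory down.
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_of(key, num_partitions)][key] = value
    return partitions

def probe(partitions, key):
    # Only the owning partition is searched, so each node holds a smaller table.
    return partitions[partition_of(key, len(partitions))].get(key)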