I have a job that bulk loads to a SQL Server. The flat file is 8 million+ rows. If I do this through SQL Server it takes 20 minutes. In DS it's taken 14 hours! Why is it taking so long to do this?
Jim
Long running SQL server bulk load
Sure I need help... but who doesn't?
Because the DS bulk loader stages have performance issues in that they re-verify the data content while spooling the load file. The stages are meant to make life simple for those not experienced with bulk loaders.
If you have the skill to ensure a proper load file, then skip using the Stage and just write out a sequential file. Then initiate the bulk load from a script and be done with it.
Your runtime comparisons are right in line with what one would expect. You're discovering that the tool, in its efforts to be all encompassing and do all things, sacrifices performance in tragic ways sometimes.
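The approach above can be sketched as an after-job script. Everything here is hypothetical -- server name, table, file paths, and delimiter are stand-ins for your own environment -- and the `bcp` command is only echoed, not executed, so the sketch is safe to run without a SQL Server available:

```shell
# Sketch of bypassing the bulk-loader stage: the job writes load.dat
# through a plain Sequential File stage, and this script would hand it
# straight to SQL Server's bcp utility. All names are hypothetical.
DATAFILE=/tmp/load.dat
TABLE=MyDB.dbo.MyTable

# -c character mode, -t field terminator, -b batch (commit) size,
# -e error file, -T trusted connection.
CMD="bcp $TABLE in $DATAFILE -S MYSERVER -T -c -t'|' -b50000 -e /tmp/load.err"
echo "$CMD"
```

In a real after-job routine you would run the command instead of echoing it, and check `bcp`'s exit status and error file yourself -- which is exactly the control the Stage takes away from you.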
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Ken, "right in line"? Give me a break. Are you seriously suggesting that DataStage would actually bulk load the data at only 185 rows a second, as opposed to the native bulk loader's 6000+ rows a second? There is obviously something else going on here, since we are loading some pretty wide tables at 2600 rows a second and some narrower ones at nearly 9000 rows a second.
First of all there are two bulk load stages for SQL Server and you should use the SQL Bulk and not the BCP. I would also look at the parameters that you are running with and play around with the commit points.
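To illustrate the commit-point tuning being suggested, here is a hedged sketch in T-SQL terms: `BATCHSIZE` in a `BULK INSERT` statement plays the role of the stage's commit/batch parameter. Table and file names are hypothetical; the script is only generated and printed here, not submitted to a server:

```shell
# Generate a BULK INSERT script whose BATCHSIZE is the knob that
# corresponds to the stage's commit point. All names are hypothetical.
cat <<'SQL' > /tmp/bulk_load.sql
BULK INSERT MyDB.dbo.MyTable
FROM 'C:\load\load.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 50000,  -- rows committed per batch
    TABLOCK                   -- table lock allows minimally logged load
);
SQL
cat /tmp/bulk_load.sql
# Would then be run with something like: sqlcmd -S MYSERVER -i /tmp/bulk_load.sql
```

A batch size that is too small forces frequent commits and log flushes; too large inflates the transaction log and the cost of a failed batch -- hence the advice to experiment with it.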
Shawn Ramsey
"It is a mistake to think you can solve any major problems just with potatoes."
-- Douglas Adams
Shawn, let's be quite clear to everyone out there that DataStage does not bulk load data. Its bulk loader stages either invoke the respective database bulk loader command to perform the load, or write the necessary control files to use with the command line bulk loader.
In either case, DataStage spools the output link data into a sequential file that will be used by the command line bulk loader invocation operation. It is in the act of spooling that data that DataStage logic inserts itself and puts tremendous drag on that flow. This is where data is re-typed, scaled, in some cases the dates are messed with (internalized), etc. This drag is so significant that Ascential Technical Support has for years recommended to people to actually spool their data to another sequential file, and limit the spooling to the bulk loader stage to just one row. This way, you get the control file generated, but the data is in another file. You then do an after-job edit operation to the control file to point to the full data file, rather than the 1-row datafile generated by the bulk loader stage.
I made no assumptions regarding which stage was being used. I was simply pointing out that as developers get more experienced with DataStage and more familiar with bulk loading, they become dissatisfied with the things DataStage does and, in my experience, end up spooling the data themselves and writing their own bulk loader scripts, invoking the process directly so they can handle all the erroring, logging, and reporting themselves.
From a performance standpoint, when someone says that DataStage jobs built with the generic design the salesman showed them are 10-20X slower than doing some manual intervention at key points and just blasting the data themselves, I say "Right on!" There are times to step outside the tool and do what seems right, as opposed to always staying within the tool. Hope this clarifies my point.
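The one-row trick described above can be sketched as an after-job step. The control-file format shown is a generic stand-in (not the exact file any particular stage emits), and all file names are hypothetical:

```shell
# The bulk-loader stage is fed a single row, so it emits a control file
# pointing at a tiny one-row data file; the real millions of rows are
# spooled to full.dat by a plain Sequential File stage. This after-job
# step repoints the control file at the full data file.
printf 'datafile=/tmp/onerow.dat\ndelimiter=|\n' > /tmp/load.ctl   # stand-in for the generated control file
sed 's|/tmp/onerow.dat|/tmp/full.dat|' /tmp/load.ctl > /tmp/load.ctl.new
mv /tmp/load.ctl.new /tmp/load.ctl
cat /tmp/load.ctl
```

The net effect: you pay DataStage's per-row spooling overhead on one row instead of millions, but still get a generated control file to feed the command-line loader.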
Kenneth Bland
And this is called "programming around the DataStage limitations" instead of "programming with DataStage".
And the more you work with DataStage, the more you realize there's an awful lot of things to work around. After a few years you sometimes just wonder why you even bother programming it via DataStage.
We could probably do it cheaper, faster (in design and execution), and more reliably with a good Perl framework. It's just that Ascential is so good at locking in customers.
The ranting for this evening is finished. :D
Ogmios
ogmios wrote: After a few years you sometimes just wonder why you even bother programming it via DataStage.

Now I did not even come close to suggesting this. I think a perspective must be kept, and that not everything has to be inside the tool. Your toolbelt should have as many tools in it as possible, and your judgement should be exercised when you decide which tool(s) are utilized. You should emphasize the strengths of every tool in your belt, and accommodate the weaknesses.
However, as I've said in the past...The worst programmers can make the best tool look like crap. The worst tool can look great in the hands of a skilled programmer/architect.
Kenneth Bland
ROFL :D, you've got to admit that in some areas DataStage itself is not of the best quality.
I once overheard someone describe DataStage as an "overgrown PacBase for SQL with featuritis of which not all features work as they should".
But now we're getting very off-topic. E.g., about the bulk loaders: what prevents Ascential from making the bulk-loader stages work the way they should? But they would rather implement some high-tech new feature they can show off on their flyers.
Ogmios
I shall reserve my comments for private communications. However, your term featuritis is what I call "shiny objects". They dazzle the eye and distract the beholder. Wow! Auto-column derivation wizard! Cool!
Kenneth Bland