DB2 Bulk load filename creation




DB2 Bulk load filename creation

Post by denzilsyb »

Example:

Code:


JobName: Job1
Table to be loaded: CUSTOMER

SOURCE ------- TFM -------- DB2BULK

In the DB2BULK stage, the .sh, .clp and .dat files created are:

uload.sh
cmd.clp
inpdata.dat

Code:


JobName: Job2
Table to be loaded: PARTNER

SOURCE ------- TFM -------- DB2BULK

In the DB2BULK stage, the .sh, .clp and .dat files created are:

uload.sh
cmd.clp
inpdata.dat


The problem, guys, is that if I run these in parallel, Job1 or Job2 complains that the files being created are locked; and if I run them in series, they overwrite the existing files.

Is there a way I can parameterise these names? With other bulk load stages the .dat files etc. are called TABLENAME.dat, which I believe is a good way of doing things.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson

Re: DB2 Bulk load filename creation

Post by ogmios »

I remember something about putting them in another directory: same names, different directory.
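
A rough sketch of that idea (the directory root and table names below are made up - the point is only that each job gets its own working directory for the bulk loader's fixed-name files):

Code:

#!/bin/sh
# Sketch only: one working directory per table/job, so parallel runs of
# the DB2 bulk load stage cannot collide on uload.sh, cmd.clp and
# inpdata.dat. LOADROOT and the table names are hypothetical.
LOADROOT=/data/bulkload

for TABLE in CUSTOMER PARTNER
do
    mkdir -p "$LOADROOT/$TABLE"     # e.g. /data/bulkload/CUSTOMER
done

# Point Job1's DB2 bulk load stage at $LOADROOT/CUSTOMER and Job2's at
# $LOADROOT/PARTNER (for example via a job parameter). Both jobs still
# write the same three file names, but into separate directories.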

I don't use the DB2 bulk load anymore; I just have the file to be loaded created via a Sequential File stage and then fire off my own DB2 load scripts. This makes for a big, BIG boost in performance every time.

Ogmios.

Post by denzilsyb »

Thanks ogmios

I was reading about performance-related issues with the bulk load stage. I am going to suggest, then, that we write out sequential files and use a generic load script, passing the .dat file and table names to the script and having the script load the data.
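
Something along these lines is what I have in mind - only a sketch, with a made-up database name, schema, credentials and delimiter; the real script would need our actual file format and proper error handling:

Code:

#!/bin/sh
# Generic DB2 load script (sketch). Usage: genload.sh TABLE DATFILE
# MYDB, MYSCHEMA, the credentials and the | delimiter are placeholders.
TABLE=$1
DATFILE=$2

if [ -z "$TABLE" ] || [ -z "$DATFILE" ]; then
    echo "Usage: $0 TABLE DATFILE" >&2
    exit 1
fi

db2 connect to MYDB user loaduser using loadpwd

# Delimited load straight into the target table; NONRECOVERABLE avoids
# leaving the tablespace in backup-pending state after the load.
db2 "LOAD FROM $DATFILE OF DEL MODIFIED BY COLDEL| INSERT INTO MYSCHEMA.$TABLE NONRECOVERABLE"
RC=$?

db2 connect reset
exit $RC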

I wonder if it was a DB2 issue that they decided to name the files the same regardless of the table name.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson

Post by ogmios »

denzilsyb wrote: I wonder if it was a DB2 issue that they decided to name the files the same regardless of the table name.
My guess... this part was developed on Monday mornings and Friday afternoons :wink: You'll be amazed at how much faster your own load script runs.

I also don't use the FTP stage, the Pivot stage, and a few others I forget. I only use hashed files when I'm really, really stuck... I let most of the work get done by the database.

I don't even use sequencers; I like my old BASIC scheduling jobs too much.

Ogmios

Post by denzilsyb »

ogmios wrote:I like my old BASIC scheduling jobs too much.
The code is great, but if you develop it and then leave the poor client with BASIC code that is complicated to the untrained eye, you aren't doing them any favours.

I've seen a pretty complicated sequencer a colleague of mine put together, and it worked. I guess it's just a matter of changing how we die-hards think about the sequencer.

You say that you let the DB do just about everything. Why do you think DS performs so badly on your side (apart from the Aggregator and DB2 load)?
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson

Post by ogmios »

denzilsyb wrote: The code is great, but if you develop it and then leave the poor client with BASIC code that is complicated to the untrained eye, you aren't doing them any favours.
We will have to agree to disagree on that one. I'm absolutely for KISS. We've got one routine to run jobs, a couple of utility functions to create/delete trigger files, and that's it. The scheduling jobs just consist of a sequence of "run job A", "run job B"; it doesn't get any simpler than that. If you want to add a new job, you copy and paste 5 lines of code and change the name of the job to execute.
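
This isn't the BASIC routine itself - just the same "run job A, then run job B" pattern sketched with the dsjob command-line interface, so you can see how little there is to it. The project and job names are made up, and the exit-code meaning should be checked against the dsjob documentation for your release.

Code:

#!/bin/sh
# Run a fixed sequence of jobs one after the other (sketch only).
# "dwproj", Job1 and Job2 are invented names.
PROJECT=dwproj

for JOB in Job1 Job2
do
    echo "Starting $JOB"
    dsjob -run -jobstatus "$PROJECT" "$JOB"    # -jobstatus waits for completion
    RC=$?
    echo "$JOB finished, dsjob exit code $RC"
    # Interpret $RC per the dsjob documentation for your release before
    # deciding whether to continue with the next job.
done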
denzilsyb wrote: I've seen a pretty complicated sequencer a colleague of mine put together, and it worked. I guess it's just a matter of changing how we die-hards think about the sequencer.
I've seen sequencers that were also very complicated, and then after one upgrade they just didn't work as they should anymore. :cry: Some of the things we now do couldn't be done with sequencers anyway (or would make the sequencers enormously complex).
denzilsyb wrote: You say that you let the DB do just about everything. Why do you think DS performs so badly on your side (apart from the Aggregator and DB2 load)?
Never said performance was bad. It isn't, because we do most of our work on the database side. If, for example, you select 1,000,000 rows from the database and use a simple constraint in DataStage to keep 50,000 rows, versus just selecting the 50,000 rows in the database to begin with, I can assure you the latter is faster :wink:
Aggregating in DataStage or in the database... I don't have to think two seconds about it.
The way I write my jobs, most of the time I don't even use lookups or hashed files... I sandbox my dimensions with business key/surrogate ID in the staging area and just "left outer join" away in the SQL input query... 90% of the time it's faster than using a hashed file lookup. So you could say I use DataStage as a glorified data mover. :P

Besides, the major databases (Oracle/DB2/SQL Server) have a lot more customers than Ascential, so as a side effect they contain a lot fewer bugs/quirks (before I install them, anyway :o).

I did grow kind of used to DataStage and its quirks, but if tomorrow someone wants me to implement a data mart/data warehouse in Informatica or DecisionBase, I won't lose sleep over not using DataStage. DataStage is just a tool; the important thing, which some developers seem to forget, is the overall design of your ETL.

My 2 cents
Ogmios

Post by denzilsyb »

ogmios wrote: I've seen sequencers that were also very complicated, and then after one upgrade they just didn't work as they should anymore. :cry: Some of the things we now do couldn't be done with sequencers anyway (or would make the sequencers enormously complex).
I'll keep my eyes open for this one - the sequencer will have to be well documented, because if change is imminent, you want to know what you are returning to when the client is upgrading.
ogmios wrote: If, for example, you select 1,000,000 rows from the database and use a simple constraint in DataStage to keep 50,000 rows, versus just selecting the 50,000 rows in the database to begin with, I can assure you the latter is faster :wink:
Aggregating in DataStage or in the database... I don't have to think two seconds about it.
It all depends on the scenario, I guess - how many times we need to aggregate, where the data resides and what we are trying to do. For example, we are taking daily input files and aggregating them (daily) within DataStage using hashed file stages, because the current setup doing exactly the same thing on the database has serious performance issues. DS is winning hands down in this case. At the end of the month we take the aggregated values, pump them to the DB, and the data is ready for the users. Same process, different methods, different results.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson

Post by ogmios »

denzilsyb wrote: It all depends on the scenario, I guess - how many times we need to aggregate, where the data resides and what we are trying to do. For example, we are taking daily input files and aggregating them (daily) within DataStage using hashed file stages, because the current setup doing exactly the same thing on the database has serious performance issues. DS is winning hands down in this case. At the end of the month we take the aggregated values, pump them to the DB, and the data is ready for the users. Same process, different methods, different results.
What I would probably do if I were writing your job: every day, load the input file into a database table, and put a materialized view (in Oracle) or an automatic summary table (in DB2) in place to do the aggregation. And look ma... no hands... aggregated values at the end of the month.
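
On the DB2 side that would look roughly like the sketch below. All table and column names are made up, REFRESH DEFERRED is used to keep the example simple, and DB2 puts restrictions on what the summary query may contain, so check the documentation before lifting this.

Code:

#!/bin/sh
# Sketch of the DB2 summary-table idea. DAILY_SALES, DAILY_SALES_SUM
# and the columns are hypothetical names.
db2 connect to MYDB

# One-off: define the summary table over the daily detail table.
db2 "CREATE TABLE DAILY_SALES_SUM AS
     (SELECT load_date, SUM(amount) AS total_amount, COUNT(*) AS row_cnt
        FROM DAILY_SALES
       GROUP BY load_date)
     DATA INITIALLY DEFERRED REFRESH DEFERRED"

# Each day, after the detail load, bring the aggregate up to date.
db2 "REFRESH TABLE DAILY_SALES_SUM"

db2 connect reset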

I agree the database job may be a bit slower if, for example, your database server is in Sydney and your DataStage server is in Holland, because of slow data transport.

I don't use hashed files to hold data which needs to be kept longer than one cycle. I've had a few problems with that in the past.

Ogmios

Post by denzilsyb »

ogmios wrote: What I would probably do if I were writing your job: every day, load the input file into a database table, and put a materialized view (in Oracle) or an automatic summary table (in DB2) in place to do the aggregation. And look ma... no hands... aggregated values at the end of the month.
Trust me - I would do this if we had the processing power on the server (which they do have, but cannot utilise), so we needed to think outside the box.
ogmios wrote: I don't use hashed files to hold data which needs to be kept longer than one cycle. I've had a few problems with that in the past.
I was a little worried about this as well. The hashed file size limit, the speed of hashed file generation and aggregation, and the security of the data were all concerns. On size, we can avoid 32-bit hashed files and use 64-bit hashed files, which removes the 2GB size limit. The speed of hashed file generation is also not a problem - just this morning we hit 40,000 rows per second. The final issue, the security of the data, required a bit of out-of-the-box thinking again, but we have managed to secure the data once it has been processed; a bit of a workaround to compensate for the DB's limitations at this time, but working nonetheless.
dnzl
"what the thinker thinks, the prover proves" - Robert Anton Wilson