Reading DB2 sequences during Bulk Load....

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Reading DB2 sequences during Bulk Load....

Post by kaps »

We have a job which does inserts to a table reading from a file. Job design is this...

SeqFile----LookupStage----Xformer----DB2APIStage

This job is failing because of the error...
Lukup,0: Could not map table file "/datastage/prd/Datasets/lookuptable.20110318.feldurb (size 1134819352 bytes)": Not enough space
Error finalizing / saving table /datastage/tmp/dynLUT3199226b80c2818
We have enough space and memory. We have faced the same issue on similar jobs, in which I replaced the Lookup stage with a Join stage and
they ran fine. But I want to use the Bulk Load stage for this, for 2 reasons.

1. Bulk load will be faster.
2. If I use the Join stage in too many jobs then it will hurt other running jobs for memory.

I don't want to do a sparse lookup as the input will be around 2 million records.

The lookup will have around 17 million records. This is for restartability.


We use DB2 sequences to generate our surrogate keys and we use the nextval expression in our insert statement, but if I use Bulk Load I can't do that. How do I handle this situation?

I can query the sequence's max value and then add @OUTROWNUM to it in the Transformer, but then I have to reset the DB2 sequence after the job is done. So I was wondering if there is any other way...
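The max-value-plus-row-number plan can be sketched as follows (a toy Python illustration of the arithmetic only; `assign_surrogate_keys` is a hypothetical helper, not a DataStage or DB2 API):

```python
# Sketch (hypothetical helper, not a DataStage API): mimic the
# "sequence max value + @OUTROWNUM" surrogate-key plan from the post.

def assign_surrogate_keys(rows, start_after):
    """Attach start_after + row number (1-based, like @OUTROWNUM) to each row."""
    keyed = [(start_after + i, row) for i, row in enumerate(rows, start=1)]
    # The value the DB2 sequence would need to be restarted with after
    # the load, e.g. ALTER SEQUENCE myseq RESTART WITH <next_value>
    next_value = start_after + len(rows) + 1
    return keyed, next_value

keyed, restart_with = assign_surrogate_keys(["a", "b", "c"], start_after=100)
# keyed -> [(101, 'a'), (102, 'b'), (103, 'c')], restart_with -> 104
```

The resetting step is exactly the part the poster wants to avoid, since any concurrent consumer of the sequence during the load would make the restart value wrong.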

Appreciate any input...

Thanks
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Do an exact search for "Could not map table file"; it's been discussed here quite a bit. Your lookup is "too big" and I'll wager you're on a 32-bit system, which is where I've seen this message in the past.
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Why do you think that using the Join stage will hurt your memory usage?

The error is telling you that the Lookup stage is running out of space trying to handle a reference table that is 1GB in size. That Lookup is attempting to use much more memory than the equivalent join stage would.

When you're running the job, how much free memory is available on the system, and how much space is available in the mount points where /datastage/prd/Datasets and /datastage/tmp are located? One option is to run the lookup partitioned (reference and data hash-partitioned on the lookup key, or a portion of it) rather than having the reference table Entire-partitioned. That is, if you feel you must avoid Joins at all costs.
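The partitioned-lookup idea can be sketched like this (a toy Python illustration of hash partitioning, not DataStage internals):

```python
# Toy sketch of hash partitioning: what happens when the reference link
# is Hash-partitioned on the lookup key instead of Entire-partitioned.

def hash_partition(rows, key_index, num_partitions):
    """Route each row to a partition by hashing its key column."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(row[key_index]) % num_partitions].append(row)
    return parts

reference = [(k, "payload-%d" % k) for k in range(1000)]
parts = hash_partition(reference, key_index=0, num_partitions=4)

# Every row lands in exactly one partition, so each node's lookup table
# holds roughly a quarter of the data rather than a full copy.
assert sum(len(p) for p in parts) == len(reference)
```

Because the same hash function routes the stream input, any stream row and its matching reference row end up on the same node, so correctness is preserved while each node's lookup table shrinks.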

Bulk Load and Join/Lookup don't have any real relationship.

Not sure about the DB sequence with DS 7.x. The stage interface (per the docs) doesn't support a DB sequence as the source, like it does in 8.x.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Craig/James - We are on 64-bit AIX (SMP) and have around 380 GB of free space on the /datastage/ node and around 8 GB of memory. The lookup has 4 fields and 17 million records, so it's hard to believe it's out of space or memory.

The reason I am worried about using lookup/join in a lot of the jobs is that they will try to load the lookup table into memory, and that may cause problems for other jobs which do need memory.

Is there a way we can trace the memory available when this job runs?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Just to be anal - are you running a 64bit version of DataStage as well?
-craig

"You can never have too many knives" -- Logan Nine Fingers
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Joins will not load the tables into memory...Lookups will, although they will do memory mapping against files when possible in most cases.

Typical Unix/AIX system monitoring: vmstat, iostat, nmon, and sar (if available) will help monitor available memory, CPU usage and so on. What type of usage does the server typically see? Number of simultaneous jobs, etc.? Is it used for any other (non-DataStage) applications?

4 fields in your lookup table, but what are the sizes of the fields? That, plus the hashing overhead, may add up to be larger than you imagine. The error itself indicates at least a 1GB lookup table.
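A rough back-of-envelope estimate shows how 17 million "small" rows can still exceed 1 GB once per-entry overhead is counted (the 24-byte overhead figure below is an assumption for illustration, not a measured DataStage value):

```python
# Back-of-envelope estimate with illustrative overhead figures:
# why 17M rows of ~50 chars can still top 1 GB in a lookup table.

rows = 17_000_000
data_bytes_per_row = 50       # the ~50 characters reported in the thread
overhead_per_row = 24         # assumed hash-table/pointer overhead per entry

total_bytes = rows * (data_bytes_per_row + overhead_per_row)
total_gb = total_bytes / 2**30
# 17M * 74 bytes ~ 1.17 GB, in the same ballpark as the
# 1,134,819,352 bytes quoted in the job's error message.
```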

Is everything under /datastage/ on the same mount point, or do other mounts supply /datastage/tmp and/or /datastage/prd/datasets?
- james wiles


All generalizations are false, including this one - Mark Twain.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Craig - How do I find out if I am running 32-bit or 64-bit DataStage?

James - Yes, everything is on the same mount point. All 4 fields together may add up to a max of 50 characters... I am suspecting the number of simultaneous jobs.

It's good to know that Join does not store the data in memory. So does it store the data on disk on the ETL server? Does it use the resource disk or the resource scratch disk mentioned in the config file? If it goes to disk then there will always be I/O, correct? Then why does the documentation say it does I/O only if the data is big?
A join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential.
Thanks for all the valuable input...
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Join does not store data anywhere. It relies on the fact that both inputs are sorted to be able to process just one key value at a time.
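That one-key-at-a-time matching can be sketched in Python (a simplified inner join over two inputs pre-sorted on the key, not the actual join operator):

```python
def merge_join(left, right):
    """Inner-join two row streams sorted on their first field,
    advancing through one key value at a time."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1                     # left key too small: advance left
        elif lk > rk:
            j += 1                     # right key too small: advance right
        else:
            # gather the group of rows sharing this key on each side
            li = i
            while i < len(left) and left[i][0] == lk:
                i += 1
            rj = j
            while j < len(right) and right[j][0] == rk:
                j += 1
            for l in left[li:i]:
                for r in right[rj:j]:
                    out.append(l + r[1:])
    return out

left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
right = [(2, "X"), (3, "Y"), (4, "Z")]
# -> [(2, 'b', 'X'), (2, 'c', 'X'), (4, 'd', 'Z')]
```

Note that only the current key group needs to be held at once; the rest of the inputs are just streamed through, which is why Join's memory footprint stays small.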
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

Looking back at the docs, that reference probably could use a rewrite to more accurately describe the difference between Joins and Lookups and how to choose between them.

As Ray indicates, a Join stage requires that the data be sorted (and, if running in parallel, partitioned) correctly in order for the matching process to work. Unfortunately, the document you quoted implied that Join performs the sorting, which is incorrect: the Sort operator performs the sorting. Orchestrate will insert sort and partition operators (unless directed otherwise) where it decides they are necessary to meet the requirements of certain stages (join/merge/aggregator and a few others).

Each instance of a sort operator uses an allotment of memory as a work area; 20MB is the default. If the data being sorted fits in half of that work area, no disk I/O is performed. Otherwise, when the first half is full, it is written to disk and the sort continues processing incoming data. The other half of the memory is primarily used as a scratch area as data is shuffled around. When all the data has been processed, the files are read back and merged together. That write/read is where the I/O comes in. The majority of sort products on most platforms work in a very similar fashion.
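The sort-runs-then-merge behaviour described above can be sketched in Python (a simplified external merge sort; the "work area" here is a record count rather than 20MB of memory, and the runs stay in memory instead of going to scratch disk):

```python
import heapq

def external_sort(records, work_area):
    """Sort more data than fits in the work area: sort fixed-size runs,
    then merge the sorted runs. In a real external sort the runs are
    written to scratch disk, which is where the extra I/O comes from."""
    runs = [sorted(records[i:i + work_area])
            for i in range(0, len(records), work_area)]
    return list(heapq.merge(*runs))

data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
assert external_sort(data, work_area=3) == sorted(data)
```

If the whole input fits in a single run, no merge pass over spilled files is needed, which matches the "no disk I/O if it fits in half the work area" point above.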
- james wiles


All generalizations are false, including this one - Mark Twain.
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

Thanks for explaining how the sort works. You mentioned in your previous post
Joins will not load the tables into memory...Lookups will, although they will do memory mapping against files when possible in most cases
Can you please elaborate on memory mapping against files?

If the Join stage does not store the data anywhere, does that mean it acts like a sparse lookup? Does every input record issue a select statement to the DB server? If so, then what's the difference between the DB2 EE stage and this, other than that the data does not need to be sorted...

I appreciate the patience shown in answering the questions...
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

I think we are beginning to push the boundaries of your understanding of how DataStage EE works, and this forum is not the appropriate medium for in-depth education on DataStage. There are other avenues for that, both here on DSXchange.com and through training available from IBM and other parties, as well as coworkers.

The join operator does not interact directly with any physical data source, be it a database, a flat file, a parallel dataset or so on. It simply matches two or more streams of input records on a set of defined key columns, and writes the results of the matches to a single output stream. Where the input data comes from, the Join stage doesn't know and doesn't care, so long as it's sorted and partitioned correctly.

For information on memory mapping, enter "unix file memory mapping" into Google and read through the results.
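For a quick feel of what memory mapping is, here is a minimal Python sketch using the standard mmap module (illustrative only; this is not how the Lookup stage is implemented):

```python
import mmap
import os
import tempfile

# Write a small file, then map it into the process's address space.
# The file's bytes become directly addressable, without read() calls
# copying them into a separate heap buffer; the OS pages them in on
# demand. This is the mechanism a lookup table file can lean on.
path = os.path.join(tempfile.mkdtemp(), "lookup.dat")
with open(path, "wb") as f:
    f.write(b"key1:value1\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first_key = mm[:4]            # byte slicing, no explicit read()
        value_offset = mm.find(b"value1")
```

The "Not enough space" error earlier in the thread is exactly this mechanism failing: the process could not find enough contiguous address space (or backing resources) to map a ~1.1 GB table file.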

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

kaps wrote:Craig - How do I find out If I am running 32 bit/64 bit Datastage ?
Never mind, it seems to be a moot point. I went back and reread my old post where I was seeing the same error message; there you can see a message from IBM. In my case it was because we had a 32-bit O/S in production. It worked fine in our 64-bit dev environment. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
kaps
Participant
Posts: 452
Joined: Tue May 10, 2005 12:36 pm

Post by kaps »

James - Thanks for the reply. Your reply brought up more questions for me about how stages work internally in DataStage. IBM training did not help as they taught only the basics, and we are new to EE so co-workers can't help.

Where can I get more details about training on DSXchange?
jwiles
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm
Contact:

Post by jwiles »

On DSXchange there is the Learning Center, which I think has a few videos, including one from Ray on transitioning from Server to Parallel. You need a premium membership to receive the full videos, IIRC.

I expect that you've probably taken the DataStage Essentials course from IBM. There is also an Advanced course available.

A good reference for you would be the IBM Redbook "InfoSphere DataStage Parallel Framework Standard Practices", which you can download as a PDF from IBM.com, as well as the Redbook "IBM InfoSphere DataStage Data Flow and Job Design". Both are written from the perspective of Information Server 8.x, but the job design methodology is applicable to 7.x as well.

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.