Restartable ETL Jobs

chulett · Post by **chulett** » Mon Mar 28, 2011 12:46 pm

Restartability of an ETL job can also mean it picks up where it left off. And it's a workflow in Informatica, not a workload. The other difference is that a workflow is required to run a single mapping, whereas a sequence job isn't.

olgc · Post by **olgc** » Mon Mar 28, 2011 1:35 pm

Yes, it's a workflow in Informatica; thanks for the correction.

Does "it picks up where it left off" mean the same as "it restarts from the failed point"? That is only one part of control job restartability; another part is "it restarts from a designated point", which is harder to implement than restarting from the failed point. Both apply only to control jobs, not to ETL jobs.

Thanks,

chulett · Post by **chulett** » Mon Mar 28, 2011 3:02 pm

My "pick up where it left off" comment was specifically directed to ETL jobs, not at the job control level. It may not be typical but it can certainly be done.

olgc · Post by **olgc** » Tue Mar 29, 2011 6:44 am

chulett wrote: My "pick up where it left off" comment was specifically directed to ETL jobs, not at the job control level. It may not be typical but it can certainly be done. ...

That's interesting, very interesting. Let's look at an example so I can understand how you implement "pick up where it left off" ETL jobs. If I understand correctly, this is about a concrete ETL job. If a load job fails while loading the 123rd record and the transaction size is 50, can you show me how you determine which record to pick up from, continue the job, and finish loading the remaining records? Let's say the entire load contains 100,134 records.

Thanks,

chulett · Post by **chulett** » Tue Mar 29, 2011 6:58 am

High level... first you need a static source. After that it is a matter of marking your progress in the job, typically at each commit point, so you know the last successful one. That 'marker' row count gets set to zero at the end of a successful run. Each time the job runs, the marker is passed in as a parameter and that number of rows are read but constrained / filtered from passing to the output.

Multi-node PX jobs severely complicate this, as you could imagine.
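
The marker approach described above can be sketched roughly as follows. This is a single-process Python illustration, not DataStage code: `load_with_restart`, `write_batch`, and the JSON marker file are hypothetical names chosen for the example, and it assumes a static source that returns rows in the same order on every run.

```python
import json
import os
from typing import Callable, Iterable, List


def read_marker(path: str) -> int:
    """Return the row count recorded at the last successful commit (0 if none)."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return int(json.load(fh)["rows_committed"])


def write_marker(path: str, rows_committed: int) -> None:
    """Persist the marker; write to a temp file first so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"rows_committed": rows_committed}, fh)
    os.replace(tmp, path)


def load_with_restart(
    source: Iterable[object],
    write_batch: Callable[[List[object]], None],  # hypothetical: writes and commits one batch
    marker_path: str,
    commit_size: int = 50,
) -> int:
    """Load `source`, skipping rows already committed by a previous failed run."""
    skip = read_marker(marker_path)
    committed = skip
    batch: List[object] = []
    for row_number, row in enumerate(source, start=1):
        if row_number <= skip:
            continue  # row is read but filtered from passing to the output
        batch.append(row)
        if len(batch) == commit_size:
            write_batch(batch)
            committed += len(batch)
            # Assumption: in a real job the marker would be updated in the same
            # transaction as the batch, otherwise a crash here reloads one batch.
            write_marker(marker_path, committed)
            batch = []
    if batch:  # final partial batch
        write_batch(batch)
        committed += len(batch)
    write_marker(marker_path, 0)  # successful run: reset the marker to zero
    return committed
```

With a commit size of 50, a failure at record 123 leaves the marker at 100, so the next run reads and discards the first 100 rows and loads from row 101 onward.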

olgc · Post by **olgc** » Tue Mar 29, 2011 9:15 am

Okay, that sounds complicated. Absolutely, designing a restartable ETL job is a very sophisticated and difficult issue. It's worth an entire chapter of a book, if not a whole book dedicated to it. Here is an article on it: www.uiis.net/etl/index.php. Any comments and feedback are appreciated.

Thanks,

ray.wurlod · Post by **ray.wurlod** » Wed Mar 30, 2011 2:03 pm

I disagree with the assertion about "most" important. I believe that prevention is better than cure.

vmcburney · Post by **vmcburney** » Wed Mar 30, 2011 5:58 pm

You can get restartability in a DataStage job against a dynamic, not static, source if you combine DataStage with InfoSphere CDC. The CDC bookmark functions let you compare a source table to a target table to keep them in sync, and DataStage can be the engine for transforming and writing the data. This takes care of the complications of the DataStage parallel engine. Both products benefit: CDC can be slow when synchronizing a table initially or for a large volume, so DataStage makes CDC more scalable, while CDC gives DataStage restart and delta capabilities.

olgc · Post by **olgc** » Thu Mar 31, 2011 7:01 am

Very good point, vmcburney, I like this. I'll add it to the article as an approach to restartable ETL jobs, many thanks. But CDC is only used to handle slowly changing dimension tables. If it's used for other tables, such as fact tables, the performance could be unbearable unless your fact table is small. And for others, CDC may be too pricey.
Please check www.uiis.net/etl/index.php for Design Restartable ETL jobs

olgc · Post by **olgc** » Thu Mar 31, 2011 7:24 am

Thanks, ray.wurlod. Are we talking about the same thing here? I have a gut feeling we aren't.

ray.wurlod · Post by **ray.wurlod** » Thu Mar 31, 2011 3:31 pm

Probably not. I'm talking about eliminating the need for restartability within jobs.

ray.wurlod · Post by **ray.wurlod** » Fri Apr 01, 2011 2:00 pm

Avoid timeout errors by controlling the number (actually the workload) of jobs that can be running simultaneously, taking heed of other workload on the machine.
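
As a rough illustration of capping simultaneous jobs, here is a Python sketch using a fixed-size thread pool. It is not how DataStage or any particular scheduler does it: `run_jobs_with_limit` is a hypothetical helper, and it limits the job count only, which is a simplification of limiting the actual workload.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_jobs_with_limit(
    jobs: Dict[str, Callable[[], object]],
    max_concurrent: int,
) -> Dict[str, object]:
    """Run every job, but never more than `max_concurrent` at the same time."""
    # The pool size is the concurrency cap; extra jobs wait in the pool's queue.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {name: pool.submit(job) for name, job in jobs.items()}
        # result() re-raises any exception the job raised, so failures surface here.
        return {name: future.result() for name, future in futures.items()}
```

The cap would normally be chosen with the machine's other workload in mind rather than hard-coded.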

Avoid locking errors by good design.

I agree that a network or database outage looks hard, but it's easily handled before a job starts (a small job to "test the connection" before the main job starts). Losing power/network/database while the job is running is handled by the usual high availability techniques such as uninterruptible power supplies, redundant components, and so on.
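
A "test the connection" step of the kind mentioned above might look like the following Python sketch. The helper names `can_connect` and `run_after_preflight` are hypothetical, and the check only proves that a TCP port accepts connections; a real pre-flight job would more likely log in and run a trivial query.

```python
import socket
from typing import Callable, Iterable, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def run_after_preflight(
    endpoints: Iterable[Tuple[str, int]],
    main_job: Callable[[], object],
) -> object:
    """Run `main_job` only if every endpoint is reachable; otherwise fail fast."""
    unreachable = [(host, port) for host, port in endpoints if not can_connect(host, port)]
    if unreachable:
        raise ConnectionError(f"pre-flight check failed for: {unreachable}")
    return main_job()
```

Failing before the main job starts means nothing has been written yet, so no restart logic is needed for this class of error.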

ray.wurlod · Post by **ray.wurlod** » Sat Apr 02, 2011 4:58 pm

Neither of those affected any of my Information Server installations. Even one in Tokyo (which has alternate servers in Switzerland and Australia) was able to keep going, even with some staff relocating to other cities farther west in Japan and working remotely.

And no doomsaying will affect my belief that prevention is better than cure.

Most of the sites in which I'm involved have had no unscheduled downtime in that period. We always set up communication channels with DBAs, system administrators, etc., so that we're advised about their plans for downtime. So we don't do any processing in those times.