Testing methodologies for DS jobs

Post questions here related to DataStage Server Edition in areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Testing methodologies for DS jobs

Post by raju_chvr »

I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej each have their own ways of testing DS jobs before exporting/moving them to the Testing or Production boxes.

Care to share them here?


Thanks
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Re: Testing methodologies for DS jobs

Post by Teej »

raju_chvr wrote:I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej each have their own ways of testing DS jobs before exporting/moving them to the Testing or Production boxes.
Hmmm....

Open Director. Run Job. Check results.

The problem here is that we have PX on 6.0, which does not have its own debug mode in the GUI. 7.0 would obviously be different for us -- the ability to step through the rows and stages would be very helpful.

I use reject links on the Transformer stages during testing -- if anything is rejected, it usually means that I forgot to put an "If IsNull()" check somewhere.

An Ascential consultant suggested we do an incremental build, where we add a stage, add the requirement, test intensively, then add another stage... Honestly, given how easy it is to build jobs in PX, I find that a massive waste of time.

I do wish some things were 'noticed' at compile time. For example -- let's say you are using a Join stage, and you are hashing/sorting on the key fields. Wouldn't it make sense for it to notice whether or not you are hashing the right fields? That tripped me up one time while testing something: I was hashing on "PARENT_WHOLESALER_NAME" and merging on "PARENT_WHOLESALER_NUMBER". No results found on a 4-node run, naturally.

We found over time that the error messages are about 90% our fault, 9.999% the data's fault and 0.001% DataStage's fault. Right now, we have one of those 0.001% issues. :) This is the kind of error you DON'T want to see at all. But this kind of guideline helps us stay focused when we are debugging our programs -- the GUI is so easy, yet so yucky in some areas, that we can easily miss things. For example -- why is the Transformer stage always split perfectly down the middle when the input fields are narrow and the output fields are wide? Spread it out! Why doesn't the Transformer stage word-wrap the derivations? As a result, it's easy to overlook an error like:

Code: Select all

If IsNull(Input.W_KEY_FIELD) Then 
    SetNull() 
Else 
    Input.A_KEY_FIELD
If it were word-wrapped, this would be easier to spot. I only noticed it because, when I checked the links from the input stages, it was pointing where it wasn't supposed to be pointing.

Color coding the links would be so NICE! Instead of using ---->, have a RED bold ===>.

But I wouldn't give this up and go back to C++ only. Ever since I started working on ETL, C++ has become fun for me once again, but only in good small doses. :) Plus it's easier to debug in C++ than in DataStage PX. :)

So... testing. Really, it all varies -- I trust the tool to do its job, and I compare its output with a direct SQL query on a small dataset.
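
For example, one way to make that comparison quicker (just a rough sketch -- the stage and link names are made-up placeholders) is an after-job routine that logs the output link's row count so it can be checked against a SELECT COUNT(*) on the target:

Code: Select all

* After-job routine sketch. "xfmLoad" and "lnkTarget" are placeholder names.
* Logs the rows written so the figure can be checked against a SQL count.
Subroutine LogTargetRowCount(InputArg, ErrorCode)
   ErrorCode = 0   ;* zero means carry on; non-zero would stop the job
   RowCount = DSGetLinkInfo(DSJ.ME, "xfmLoad", "lnkTarget", DSJ.LINKROWCOUNT)
   DSLogInfo("Rows written to target: " : RowCount, "LogTargetRowCount")
Return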

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I don't know how many times I have said this to clients.

An ETL tool can never replace a proper SDLC (Software Development Life Cycle) methodology. Part of that methodology should be comprehensive regression testing. Just because you have an ETL tool doesn't mean the rules of application development change.

Whether it's a program, a job, a script, or a manual effort, test cases have to be constructed as part of a test plan.

The main thing you have to figure out in testing is how to break down job execution so that test cases can be exercised in a way that makes it easy for the QA team. For example, an all-in-one job that reads from a source table and writes to a target table complicates the mechanisms by which a tester can repeat the same test multiple times. You have this problem no matter what; the issues specific to ETL for a test team are the execution of individual jobs and the execution of the job stream.

If the job stream is a scheduler-operated activity, then you have to be able to replicate the production schedule as part of the final round of user acceptance testing. You will also have to figure out how to test specific jobs as part of unit testing, then system testing.

There are no easy answers. Depending on how you architected your framework, different scenarios apply. I prefer to architect an ETL framework with testing built into the design from the start.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I tried to figure out how to show a project plan here, but this will have to do. I have yet to speak on this forum about managing a release lifecycle in a data warehouse environment; I shall be posting on that later.

In the meantime, I wanted to show how I manage a release lifecycle. Since I firmly believe a warehouse should be managed the same as any application development project, it will undergo periodic revision and enhancement. The tasks of expanding functionality, adding new tables, and fixing existing data issues need to be handled in a controlled, methodical, planned manner. Releases tend to be 6-8 weeks in length, with a few minor bug fixes in between releases. This allows you to plan for having appropriate test data created in a test database. It also allows you to plan a migration from a development environment to a system/user acceptance test environment.

The steps in a release lifecycle that I manage are usually as follows (sorry, couldn't fit in parallel streams, predecessor information, etc., but you should get the idea; if anyone wants a .mpp file I'll send it to you):

Code: Select all

ID	Task_Name
1	 Planning (Rolled up task)
2	    Release Scoping
3	    Project Plan Creation
4	 Design (Rolled up task)
5	    Technical Design
6	    Technical Design Review Meeting
7	    System Test Plan Preparation
8	    UAT Database Function Requirements Documented
9	    Implementation Plan Created
10	   Release Kickoff Meeting
11	Development (Rolled up task)
12	   Formal DDL Handoff
13	   Development DDL Implementation
14	   Development Environment Creation
15	   Development
16	   UAT Database Preparation
17	System Test (Rolled up task)
18	   Pre-System Test Meeting
19	   Development Post-Implementation DDL Executed
20	   UAT Release Packaging
21	   UAT Release Migration
22	   UAT Pre-Implementation DDL Executed
23	   UAT Pre-Implementation Processes
24	   UAT Implementation DDL Executed
25	   UAT Implementation Processes
26	   System Test
27	Production Readiness (Rolled up task)
28	   Production Pre-Implementation DDL Executed
29	   Production Pre-Implementation Processes
30	   Production Pre-Implementation Operations Setup
31	   Production Pre-Implementation BO Setup
32	   Release Pre-Implementation Reminder
33	   UAT Post-Implementation DDL Executed
34	   User Acceptance Test
35	   Production Release Migration
36	   Release Implementation Meeting
37	   Release Implementation Reminder
38	Production Implementation (Rolled up task)
39	   Production Implementation DDL Executed
40	   Production Implementation Processes
41	   Production Release Activation
42	   Post-Implementation Test
43	   Operations Schedule Change Activation
44	   Production Implementation BO Tasks
45	   Release Implementation Announcement
46	Post Implementation (Rolled up task)
47	   Production Post-Implementation DDL Executed
48	   Production Post-Implementation Processes
49	   Production Post-Implementation Operations Tasks
50	   Production Post-Implementation BO Tasks
51	   Post-Implementation Review Meeting
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Teej
Participant
Posts: 677
Joined: Fri Aug 08, 2003 9:26 am
Location: USA

Post by Teej »

kcbland wrote:An ETL tool can never replace a proper SDLC (Software Development Life Cycle) methodology.
I agree with that. I was more focused on what to do within the ETL world. Naturally, one has to build a Unit Test Plan and have it thoroughly reviewed by those who know and understand the project's requirements.

But within the ETL tool, especially PX, there is not much more to do than run the jobs and check the results. Peek stages can be used, but they run the risk of errors being introduced during changes.

-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
raju_chvr
Premium Member
Posts: 165
Joined: Sat Sep 27, 2003 9:19 am
Location: USA

Post by raju_chvr »

Appreciate your input, Kenneth.
trobinson
Participant
Posts: 208
Joined: Thu Apr 11, 2002 6:02 am
Location: Saint Louis

Post by trobinson »

Two types of testing were already mentioned: Unit and Regression. There is also Integration and User Acceptance. The most important, to my mind, is regression -- that is, automated regression. We are using DataStage itself to help automate regression testing. We've built jobs and sequences -- the Test Harness -- into which we plug the actual ETL jobs we are testing. One of the tasks of development is to continually add to the test cases. The whole test suite is run automatically at given intervals and an automated report of pass/fail tests is generated. ALL must pass, every time. I'm not sure this has relevance to the batch world, but it works for the real-time world. I think I would continually build and add to input data so that every link, constraint, routine, and transform of every job has data that directly tests the code. It should yield a readily identifiable result in the target. It should be automated, quick, and easy to use.
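
A very rough sketch of what the job-control piece of such a harness could look like in DataStage BASIC -- the job, stage, and link names and the expected row count are placeholders, and a real harness would read its test cases from a file or table and write the results out to the pass/fail report:

Code: Select all

* Test harness sketch: run one ETL job under test and report pass/fail.
* "TestJob_LoadCustomer", "xfmTarget", "lnkOut" and ExpectedRows are
* placeholders; a real harness would loop over a table of test cases.
ExpectedRows = 100
JobHandle = DSAttachJob("TestJob_LoadCustomer", DSJ.ERRFATAL)
ErrCode = DSRunJob(JobHandle, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(JobHandle)
Status = DSGetJobInfo(JobHandle, DSJ.JOBSTATUS)
Rows = DSGetLinkInfo(JobHandle, "xfmTarget", "lnkOut", DSJ.LINKROWCOUNT)
If Status = DSJS.RUNOK And Rows = ExpectedRows Then
   DSLogInfo("PASS: TestJob_LoadCustomer", "TestHarness")
End Else
   DSLogWarn("FAIL: TestJob_LoadCustomer, status " : Status : ", rows " : Rows, "TestHarness")
End
ErrCode = DSDetachJob(JobHandle)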

To echo Teej, I often picture myself taking a #10 ballpeen hammer to the transformer stage and just whacking the bejeezus out of it.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

I think these are all good answers, but I worked for Blue Cross for eight years, and twice a year we had a system audit. If you were written up on the same thing twice, you could lose your job. In a system audit your numbers have to match. I have never used Quality Manager, but I am told this is what it was designed to do. If you have 3 source systems and there are 10,000 claims total across all three, then there had better be 10,000 claims in the data warehouse. If your claims expense total was 10 million in the source systems, then the warehouse should match. So once a week, run a series of reports to total critical data in all fact tables. I would also do row counts on dimension tables to see if they match. Dimensions may get combined, so they may be hard to audit: if a hospital exists in 2 source systems, then only one dimension row should exist, and their numbers should be combined.

Post these audit reports where all users can get to them; it will build their confidence. Run these reports weekly, or daily if need be. If you are doing change data capture, then these reports are even more critical. Also, all aggregated data should be audited. Automate this in Quality Manager and you will be way ahead of most EDW environments.
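
A trivial sketch of the kind of "numbers have to match" rule each report could apply -- this is just a placeholder routine I made up, not anything Quality Manager provides out of the box:

Code: Select all

* Reconciliation rule sketch: compare a source-system total with the
* warehouse total and flag a mismatch. Names here are placeholders.
Function AuditTotalsMatch(AuditName, SourceTotal, WarehouseTotal)
   If SourceTotal = WarehouseTotal Then
      Ans = "PASS"
   End Else
      Ans = "FAIL"
      DSLogWarn(AuditName : ": source " : SourceTotal : " <> warehouse " : WarehouseTotal, "AuditTotalsMatch")
   End
Return(Ans)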

Kim.
Mamu Kim