I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej may have their own ways of testing DS before exporting/moving them to Testing or Production boxes.
Care to share them here?
Thanks
Testing methodologies for DS jobs
Moderators: chulett, rschirm, roy
Re: Testing methodologies for DS jobs
Hmmm....

raju_chvr wrote: I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej may have their own ways of testing DS before exporting/moving them to Testing or Production boxes.
Open Director. Run Job. Check results.
The problem here -- we are on PX 6.0, which does not have its own debug mode in the GUI. 7.0 would obviously be different for us -- the ability to step through the rows and stages would be very helpful.
I use reject links on the Transform stages during testing -- if anything is rejected, it usually means that I forgot to put an "If IsNull()" somewhere.
An Ascential consultant suggested we do an incremental build: add a stage, add the requirement, test intensively, then add another stage... Honestly, given how easy it is to build jobs in PX, I find it a massive waste of time.
I do wish that some things were noticed at compile time. For example -- let's say you are using a Join stage, and you are hashing/sorting the key fields. Wouldn't it make sense for it to notice whether or not you are hashing the right fields? That tripped me up one time while testing something. I was hashing on "PARENT_WHOLESALER_NAME" and merging on "PARENT_WHOLESALER_NUMBER". No results found on a 4-node run, naturally.
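The mismatched-key trap above is easy to reproduce outside DataStage. Here is a minimal Python sketch (the field names come from the post; the data and the `inner_join` helper are invented for illustration) of a naive hash join: keying one side on NAME while probing with NUMBER runs without complaint but silently matches nothing.

```python
def inner_join(left, right, left_key, right_key):
    """Naive hash join: build a lookup on left_key, probe with right_key."""
    lookup = {}
    for row in left:
        lookup.setdefault(row[left_key], []).append(row)
    joined = []
    for row in right:
        for match in lookup.get(row[right_key], []):
            joined.append({**match, **row})
    return joined

parents = [
    {"PARENT_WHOLESALER_NUMBER": 101, "PARENT_WHOLESALER_NAME": "Acme"},
    {"PARENT_WHOLESALER_NUMBER": 102, "PARENT_WHOLESALER_NAME": "Globex"},
]
orders = [
    {"PARENT_WHOLESALER_NUMBER": 101, "AMOUNT": 500},
    {"PARENT_WHOLESALER_NUMBER": 102, "AMOUNT": 750},
]

# Hashing on NAME but probing on NUMBER: no error, just zero rows out.
bad = inner_join(parents, orders, "PARENT_WHOLESALER_NAME", "PARENT_WHOLESALER_NUMBER")
good = inner_join(parents, orders, "PARENT_WHOLESALER_NUMBER", "PARENT_WHOLESALER_NUMBER")
print(len(bad), len(good))  # 0 2
```

Nothing in the types distinguishes the two calls, which is exactly why this class of mistake survives compilation.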
We found over time that the errors are about 90% our fault, 9.999% the data's fault and 0.001% DataStage's fault. Right now, we have one of those 0.001% issues. This is the kind of error you DON'T want to see at all. But this kind of guideline helps us stay focused when we are debugging our programs -- the GUI is so easy, yet so yucky in some areas, that we can easily miss things. For example -- why is the Transformer stage always split down the middle perfectly, when the input fields are narrow and the output fields are wide? Spread it out! Why doesn't the Transformer stage word-wrap the transformations? As a result, it's easy to overlook an error like:
Code:
If IsNull(Input.W_KEY_FIELD) Then
SetNull()
Else
Input.A_KEY_FIELD
Color coding the links would be so NICE! Instead of using ---->, have a RED bold ===>.
But I wouldn't give this up and go back to C++ only. Ever since I started working on ETL, C++ became fun for me once again, but only in good small doses. Plus, it's easier to debug in C++ than in DataStage PX.
So... testing. Really, it all varies -- I trust the tool to do its job, and I compare its output with a direct SQL query on a small dataset.
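That compare-against-SQL check can be sketched in a few lines. This is a hypothetical Python illustration using an in-memory SQLite table -- nothing DataStage-specific -- where the same small dataset is aggregated two ways and the results are diffed (the table, columns, and totals are all invented):

```python
import sqlite3

# Simulated "job output": totals the hypothetical ETL produced per region.
etl_output = {"EAST": 1500, "WEST": 900}

# Load the same small source dataset into SQLite and aggregate directly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EAST", 1000), ("EAST", 500), ("WEST", 900)])
sql_output = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Any region where the two aggregations disagree is a defect in the job.
mismatches = {k for k in etl_output if etl_output[k] != sql_output.get(k)}
print("PASS" if not mismatches else f"FAIL: {mismatches}")
```

On a small, hand-checkable dataset this kind of side-by-side query is often the fastest way to validate a job's aggregation logic.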
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
I don't know how many times I have said this to clients.
An ETL tool can never replace proper SDLC (Software Development Life Cycle) methodology. Part of that methodology should be comprehensive regression testing. Just because you have an ETL tool, the rules of application development don't change.
Whether it's a program, a job, a script, or a manual effort, the test cases have to be constructed as part of a test plan.
The only issue you have to figure out in testing is how to break down job execution so that test cases can be run in a manner that makes it easy for the QA team. For example, an all-in-one job that reads from a source table and writes to a target table complicates the mechanisms by which a tester can repeat the same test multiple times. You have this problem no matter what. The issue specific to ETL for a test team is the execution of individual jobs, and the execution of the job stream.
If the job stream is a scheduler-operated activity, then you have to be able to replicate the production schedule as part of the final round of user acceptance testing. You will also have to figure out how to test specific jobs as part of unit, then system testing.
There are no easy answers. Depending on how you architected your framework, different types of scenarios apply. I prefer to architect an ETL framework that builds testing into the design itself.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I tried to figure out how to show a project plan, but this will have to do. I have yet to speak on this forum about managing a release lifecycle in a data warehouse environment. I shall be posing those questions later.
In the meantime, I wanted to show how I manage a release lifecycle. Since I firmly believe a warehouse should be managed the same as any application development project, it will undergo periodic revision and enhancement. The tasks of expanding functionality, adding new tables, and fixing existing data issues need to be handled in a controlled, methodical, planned manner. Releases tend to be 6-8 weeks in length, with a few minor bug fixes in between releases. This allows you to plan for having appropriate test data created in a test database. It also allows you to plan a migration from a development environment to a system/user acceptance test environment.
The steps in a release lifecycle that I manage usually are (sorry, I couldn't fit in parallel streams, predecessor information, etc., but you should get the idea; if anyone wants a .mpp file I'll send it to you):
Code:
ID Task_Name
1 Planning (Rolled up task)
2 Release Scoping
3 Project Plan Creation
4 Design (Rolled up task)
5 Technical Design
6 Technical Design Review Meeting
7 System Test Plan Preparation
8 UAT Database Function Requirements Documented
9 Implementation Plan Created
10 Release Kickoff Meeting
11 Development (Rolled up task)
12 Formal DDL Handoff
13 Development DDL Implementation
14 Development Environment Creation
15 Development
16 UAT Database Preparation
17 System Test (Rolled up task)
18 Pre-System Test Meeting
19 Development Post-Implementation DDL Executed
20 UAT Release Packaging
21 UAT Release Migration
22 UAT Pre-Implementation DDL Executed
23 UAT Pre-Implementation Processes
24 UAT Implementation DDL Executed
25 UAT Implementation Processes
26 System Test
27 Production Readiness (Rolled up task)
28 Production Pre-Implementation DDL Executed
29 Production Pre-Implementation Processes
30 Production Pre-Implementation Operations Setup
31 Production Pre-Implementation BO Setup
32 Release Pre-Implementation Reminder
33 UAT Post-Implementation DDL Executed
34 User Acceptance Test
35 Production Release Migration
36 Release Implementation Meeting
37 Release Implementation Reminder
38 Production Implementation (Rolled up task)
39 Production Implementation DDL Executed
40 Production Implementation Processes
41 Production Release Activation
42 Post-Implementation Test
43 Operations Schedule Change Activation
44 Production Implementation BO Tasks
45 Release Implementation Announcement
46 Post Implementation (Rolled up task)
47 Production Post-Implementation DDL Executed
48 Production Post-Implementation Processes
49 Production Post-Implementation Operations Tasks
50 Production Post-Implementation BO Tasks
51 Post-Implementation Review Meeting
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland wrote: An ETL tool can never replace proper SDLC methodology.

I agree with that. I was more focused on what to do within the ETL world. Naturally, one has to build a Unit Test Plan, and have it thoroughly reviewed by those who know and understand the project's requirements.
But within the ETL tool, especially PX, there is not much more than running the jobs and checking the results. Peek stages can be used, but they run the risk of introducing errors when changes are made.
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Two types of testing were already mentioned: Unit and Regression. There are also Integration and User Acceptance. The most important, to my mind, is regression -- that is, automated regression. We are using DataStage itself to help with the automation of regression testing. We've built jobs and sequences -- the Test Harness -- into which we plug the actual ETL jobs we are testing. One of the tasks of development is to continually add to the test cases. The whole test suite is run automatically at given intervals and an automated report of pass/fail tests is generated. ALL must pass, every time. I'm not sure this has relevance to the batch world, but it works for the real-time world. I would continually build and add to input data so that every link, constraint, routine, and transform of every job has input data that directly tests the code. It should yield a readily identifiable result in the target. It should be automated, quick, and easy to use.
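The test-harness idea above can be sketched outside DataStage as well. Here is a hypothetical Python sketch (the job logic, case names, and expected outputs are all invented for illustration): each case pairs canned input rows with the output the job under test must produce, the whole suite runs in one pass, and every case must pass.

```python
def job_under_test(rows):
    # Stand-in for the real ETL job: uppercase names, drop rows with null keys.
    return [{**r, "NAME": r["NAME"].upper()} for r in rows if r["KEY"] is not None]

# Each case: (input rows, rows the job must produce).
test_cases = {
    "uppercases_names": ([{"KEY": 1, "NAME": "abc"}], [{"KEY": 1, "NAME": "ABC"}]),
    "drops_null_keys":  ([{"KEY": None, "NAME": "x"}], []),
}

# Run every case and collect pass/fail, mimicking the automated report.
results = {name: job_under_test(inp) == expected
           for name, (inp, expected) in test_cases.items()}
failures = [name for name, ok in results.items() if not ok]
report = "ALL PASS" if not failures else "FAILED: " + ", ".join(failures)
print(report)  # ALL PASS
```

The design choice worth copying is that the cases live in data, not in code, so "continually adding to the test cases" means appending entries rather than editing the harness.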
To echo Teej, I often picture myself taking a #10 ballpeen hammer to the transformer stage and just whacking the bejeezus out of it.
I think these are all good answers, but I worked for Blue Cross for eight years, and twice a year we had a system audit. If you were written up on the same thing twice, you could lose your job. In a system audit your numbers have to match. I have never used Quality Manager, but I am told this is what it was designed to do. If you have 3 source systems and there are 10,000 claims total across all three, then there had better be 10,000 claims in the data warehouse. If your claims expense total was 10 million in the source systems, then the warehouse should match. So once a week, run a series of reports to total critical data in all fact tables. I would also do row counts in dimension tables to see if they match. Dimensions may get combined, so they may be hard to audit. If a hospital exists in 2 source systems then only one dimension should exist, and their numbers should be combined.
Post these audit reports where all users can get to them. It will build their confidence. Run these reports weekly, or daily if need be. If you are doing change data capture, then these reports are even more critical. All aggregated data should also be audited. Automate this in Quality Manager and you will be way ahead of most EDW environments.
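The source-to-warehouse reconciliation described above can be sketched as a small script. This is a hypothetical Python illustration (the system names, metrics, and figures are invented, and it is not Quality Manager itself): sum each critical metric across the source systems and flag any metric the warehouse fails to match.

```python
# Invented source-system totals and warehouse totals for illustration.
source_systems = {
    "SYS_A": {"claims": 4000, "expense": 4_000_000},
    "SYS_B": {"claims": 3500, "expense": 3_500_000},
    "SYS_C": {"claims": 2500, "expense": 2_500_000},
}
warehouse = {"claims": 10_000, "expense": 10_000_000}

def audit(sources, target):
    """Sum each metric across all sources; return metrics the target misses."""
    totals = {}
    for metrics in sources.values():
        for metric, value in metrics.items():
            totals[metric] = totals.get(metric, 0) + value
    return {m: (totals[m], target.get(m))
            for m in totals if totals[m] != target.get(m)}

discrepancies = audit(source_systems, warehouse)
print("AUDIT PASS" if not discrepancies else f"AUDIT FAIL: {discrepancies}")
```

Publishing the pass/fail output of a script like this on a schedule is the automated equivalent of the weekly audit reports described above.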
Kim.
Mamu Kim