I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej may have their own ways of testing DS before exporting/moving them to Testing or Production boxes.
Care to share them here?
Thanks
Testing methodologies for DS jobs
Moderators: chulett, rschirm, roy
Re: Testing methodologies for DS jobs
Hmmm....

raju_chvr wrote: I would like to know if any of you have ever come across or come up with testing methodologies for DS jobs. I am sure Kim, Kenneth, Ray and Teej may have their own ways of testing DS before exporting/moving them to Testing or Production boxes.
Open Director. Run Job. Check results.
The problem here -- we are on PX 6.0, which does not have its own debug mode in the GUI. 7.0 would obviously be different for us -- the ability to step through the rows and stages would be very helpful.
I use reject links on the Transform stages during testing -- if anything is rejected, it usually means that I forgot to put an "If IsNull()" somewhere.
An Ascential consultant suggested we do an incremental build: add a stage, add the requirement, test intensively, then add another stage... Honestly, given how easy it is to build jobs in PX, I find it a massive waste of time.
I do wish that some things were noticed at compile time. For example -- let's say you are using a Join stage, and you are hashing/sorting the key fields. Wouldn't it make sense for it to notice whether or not you are hashing the right fields? That tripped me up one time while testing something. I was hashing on "PARENT_WHOLESALER_NAME" and merging on "PARENT_WHOLESALER_NUMBER". No results found on a 4-node run, naturally.
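The mismatched-key trap above is easy to reproduce outside DataStage. Here is a minimal Python sketch (the field names come from the post; the data and the `inner_join` helper are invented for illustration) of a naive hash join: keying one side on NAME while probing with NUMBER runs without complaint but silently matches nothing.

```python
def inner_join(left, right, left_key, right_key):
    """Naive hash join: build a lookup on left_key, probe with right_key."""
    lookup = {}
    for row in left:
        lookup.setdefault(row[left_key], []).append(row)
    joined = []
    for row in right:
        for match in lookup.get(row[right_key], []):
            joined.append({**match, **row})
    return joined

parents = [
    {"PARENT_WHOLESALER_NUMBER": 101, "PARENT_WHOLESALER_NAME": "Acme"},
    {"PARENT_WHOLESALER_NUMBER": 102, "PARENT_WHOLESALER_NAME": "Globex"},
]
orders = [
    {"PARENT_WHOLESALER_NUMBER": 101, "AMOUNT": 500},
    {"PARENT_WHOLESALER_NUMBER": 102, "AMOUNT": 750},
]

# Hashing on NAME but probing on NUMBER: no error, just zero rows out.
bad = inner_join(parents, orders, "PARENT_WHOLESALER_NAME", "PARENT_WHOLESALER_NUMBER")
good = inner_join(parents, orders, "PARENT_WHOLESALER_NUMBER", "PARENT_WHOLESALER_NUMBER")
print(len(bad), len(good))  # 0 2
```

Nothing in the types distinguishes the two calls, which is exactly why this class of mistake survives compilation.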
We found over time that the errors are about 90% our fault, 9.999% the data's fault and 0.001% DataStage's fault. Right now, we have one of those 0.001% issues. This is the kind of error you DON'T want to see at all. But this kind of guideline helps us stay focused when we are debugging our programs -- the GUI is so easy, yet so yucky in some areas, that we can easily miss things. For example -- why is the Transformer stage always split down the middle perfectly, when the input fields are narrow and the output fields are wide? Spread it out! Why doesn't the Transformer stage word-wrap the transformations? As a result, it's easy to overlook an error like:
Code:
If IsNull(Input.W_KEY_FIELD) Then
SetNull()
Else
Input.A_KEY_FIELD
Color coding the links would be so NICE! Instead of using ---->, have a RED bold ===>.
But I wouldn't give this up and go back to C++ only. Ever since I started working on ETL, C++ became fun for me once again, but only in good small doses. Plus, it's easier to debug in C++ than in DataStage PX.
So... testing. Really, it all varies -- I trust the tool to do its job, and I compare its output with a direct SQL query on a small dataset.
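That compare-against-SQL check can be sketched in a few lines. This is a hypothetical Python illustration using an in-memory SQLite table -- nothing DataStage-specific -- where the same small dataset is aggregated two ways and the results are diffed (the table, columns, and totals are all invented):

```python
import sqlite3

# Simulated "job output": totals the hypothetical ETL produced per region.
etl_output = {"EAST": 1500, "WEST": 900}

# Load the same small source dataset into SQLite and aggregate directly.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EAST", 1000), ("EAST", 500), ("WEST", 900)])
sql_output = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Any region where the two aggregations disagree is a defect in the job.
mismatches = {k for k in etl_output if etl_output[k] != sql_output.get(k)}
print("PASS" if not mismatches else f"FAIL: {mismatches}")
```

On a small, hand-checkable dataset this kind of side-by-side query is often the fastest way to validate a job's aggregation logic.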
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
I don't know how many times I have said this to clients.
An ETL tool can never replace proper SDLC (Software Development Life Cycle) methodology. Part of that methodology should be comprehensive regression testing. Just because you have an ETL tool, the rules of application development don't change.
Whether it's a program, a job, a script, or a manual effort, the test cases have to be constructed as part of a test plan.
The only issue you have to figure out in testing is how to break down job execution so that test cases can be run in a manner that makes it easy for the QA team. For example, an all-in-one job that reads from a source table and writes to a target table complicates the mechanisms by which a tester can repeat the same test multiple times. You have this problem no matter what. The issue specific to ETL for a test team is the execution of individual jobs, and the execution of the job stream.
If the job stream is a scheduler-operated activity, then you have to be able to replicate the production schedule as part of the final round of user acceptance testing. You will also have to figure out how to test specific jobs as part of unit, then system testing.
There are no easy answers. Depending on how you architected your framework, different types of scenarios apply. I prefer to architect an ETL framework that builds testing into the design itself.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
I tried to figure out how to show a project plan, but this will have to do. I have yet to speak on this forum about managing a release lifecycle in a data warehouse environment. I shall be posing those questions later.
In the meantime, I wanted to show how I manage a release lifecycle. Since I firmly believe a warehouse should be managed the same as any application development project, it will undergo periodic revision and enhancement. The tasks of expanding functionality, adding new tables, and fixing existing data issues need to be handled in a controlled, methodical, planned manner. Releases tend to be 6-8 weeks in length, with a few minor bug fixes in between releases. This allows you to plan for having appropriate test data created in a test database. It also allows you to plan a migration from a development environment to a system/user acceptance test environment.
The steps in a release lifecycle that I manage usually are (sorry, I couldn't fit in parallel streams, predecessor information, etc., but you should get the idea; if anyone wants a .mpp file I'll send it to you):
Code:
ID Task_Name
1 Planning (Rolled up task)
2 Release Scoping
3 Project Plan Creation
4 Design (Rolled up task)
5 Technical Design
6 Technical Design Review Meeting
7 System Test Plan Preparation
8 UAT Database Function Requirements Documented
9 Implementation Plan Created
10 Release Kickoff Meeting
11 Development (Rolled up task)
12 Formal DDL Handoff
13 Development DDL Implementation
14 Development Environment Creation
15 Development
16 UAT Database Preparation
17 System Test (Rolled up task)
18 Pre-System Test Meeting
19 Development Post-Implementation DDL Executed
20 UAT Release Packaging
21 UAT Release Migration
22 UAT Pre-Implementation DDL Executed
23 UAT Pre-Implementation Processes
24 UAT Implementation DDL Executed
25 UAT Implementation Processes
26 System Test
27 Production Readiness (Rolled up task)
28 Production Pre-Implementation DDL Executed
29 Production Pre-Implementation Processes
30 Production Pre-Implementation Operations Setup
31 Production Pre-Implementation BO Setup
32 Release Pre-Implementation Reminder
33 UAT Post-Implementation DDL Executed
34 User Acceptance Test
35 Production Release Migration
36 Release Implementation Meeting
37 Release Implementation Reminder
38 Production Implementation (Rolled up task)
39 Production Implementation DDL Executed
40 Production Implementation Processes
41 Production Release Activation
42 Post-Implementation Test
43 Operations Schedule Change Activation
44 Production Implementation BO Tasks
45 Release Implementation Announcement
46 Post Implementation (Rolled up task)
47 Production Post-Implementation DDL Executed
48 Production Post-Implementation Processes
49 Production Post-Implementation Operations Tasks
50 Production Post-Implementation BO Tasks
51 Post-Implementation Review Meeting
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
kcbland wrote: An ETL tool can never replace proper SDLC methodology.

I agree with that. I was more focused on what to do within the ETL world. Naturally, one has to build a Unit Test Plan, and have it thoroughly reviewed by those who know and understand the project's requirements.
But within the ETL tool, especially PX, there is not much more than running the jobs and checking the results. Peek stages can be used, but they run the risk of introducing errors when changes are made.
-T.J.
Developer of DataStage Parallel Engine (Orchestrate).
Two types of testing were already mentioned: Unit and Regression. There are also Integration and User Acceptance. The most important, to my mind, is regression -- that is, automated regression. We are using DataStage itself to help with the automation of regression testing. We've built jobs and sequences -- the Test Harness -- into which we plug the actual ETL jobs we are testing. One of the tasks of development is to continually add to the test cases. The whole test suite is run automatically at given intervals and an automated report of pass/fail tests is generated. ALL must pass, every time. I'm not sure this has relevance to the batch world, but it works for the real-time world. I would continually build and add to input data so that every link, constraint, routine, and transform of every job has input data that directly tests the code. It should yield a readily identifiable result in the target. It should be automated, quick, and easy to use.
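The test-harness idea above can be sketched outside DataStage as well. Here is a hypothetical Python sketch (the job logic, case names, and expected outputs are all invented for illustration): each case pairs canned input rows with the output the job under test must produce, the whole suite runs in one pass, and every case must pass.

```python
def job_under_test(rows):
    # Stand-in for the real ETL job: uppercase names, drop rows with null keys.
    return [{**r, "NAME": r["NAME"].upper()} for r in rows if r["KEY"] is not None]

# Each case: (input rows, rows the job must produce).
test_cases = {
    "uppercases_names": ([{"KEY": 1, "NAME": "abc"}], [{"KEY": 1, "NAME": "ABC"}]),
    "drops_null_keys":  ([{"KEY": None, "NAME": "x"}], []),
}

# Run every case and collect pass/fail, mimicking the automated report.
results = {name: job_under_test(inp) == expected
           for name, (inp, expected) in test_cases.items()}
failures = [name for name, ok in results.items() if not ok]
report = "ALL PASS" if not failures else "FAILED: " + ", ".join(failures)
print(report)  # ALL PASS
```

The design choice worth copying is that the cases live in data, not in code, so "continually adding to the test cases" means appending entries rather than editing the harness.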
To echo Teej, I often picture myself taking a #10 ballpeen hammer to the transformer stage and just whacking the bejeezus out of it.
I think these are all good answers, but I worked for Blue Cross for eight years, and twice a year we had a system audit. If you were written up on the same thing twice, you could lose your job. In a system audit your numbers have to match. I have never used Quality Manager, but I am told this is what it was designed to do. If you have 3 source systems and there are 10,000 claims total across all three, then there had better be 10,000 claims in the data warehouse. If your claims expense total was 10 million in the source systems, then the warehouse should match. So once a week, run a series of reports to total critical data in all fact tables. I would also do row counts in dimension tables to see if they match. Dimensions may get combined, so they may be hard to audit. If a hospital exists in 2 source systems then only one dimension should exist, and their numbers should be combined.
Post these audit reports where all users can get to them. It will build their confidence. Run these reports weekly, or daily if need be. If you are doing change data capture, then these reports are even more critical. All aggregated data should also be audited. Automate this in Quality Manager and you will be way ahead of most EDW environments.
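The source-to-warehouse reconciliation described above can be sketched as a small script. This is a hypothetical Python illustration (the system names, metrics, and figures are invented, and it is not Quality Manager itself): sum each critical metric across the source systems and flag any metric the warehouse fails to match.

```python
# Invented source-system totals and warehouse totals for illustration.
source_systems = {
    "SYS_A": {"claims": 4000, "expense": 4_000_000},
    "SYS_B": {"claims": 3500, "expense": 3_500_000},
    "SYS_C": {"claims": 2500, "expense": 2_500_000},
}
warehouse = {"claims": 10_000, "expense": 10_000_000}

def audit(sources, target):
    """Sum each metric across all sources; return metrics the target misses."""
    totals = {}
    for metrics in sources.values():
        for metric, value in metrics.items():
            totals[metric] = totals.get(metric, 0) + value
    return {m: (totals[m], target.get(m))
            for m in totals if totals[m] != target.get(m)}

discrepancies = audit(source_systems, warehouse)
print("AUDIT PASS" if not discrepancies else f"AUDIT FAIL: {discrepancies}")
```

Publishing the pass/fail output of a script like this on a schedule is the automated equivalent of the weekly audit reports described above.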
Kim.
Mamu Kim