c-routines: how to develop?

jasper
Participant
Posts: 111
Joined: Mon May 06, 2002 1:25 am
Location: Belgium

c-routines: how to develop?

Post by jasper »

Until now we've mostly been using the BASIC Transformer in parallel jobs (mostly because of our old BASIC routines). We are now trying to convert all routines to C because the use of the BASIC Transformer is discouraged.
I'm not a C expert, but from the example C routine on the install CD I gather that we have to do the following steps (sketched below):
- create the C routine
- compile it
- create a Parallel Routine definition (which is really just a link to the compiled C routine)
- use that routine in a job.
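Just to make the question concrete, here is a rough sketch of what I understand such a routine to be (the name, logic and compile command are only placeholders, not the install-CD example):

    // blank_check.cpp - placeholder example, not the routine from the install CD.
    // An external function to be referenced by a Parallel Routine definition
    // (type External Function). extern "C" keeps the symbol name unmangled so it
    // matches the routine name entered in the definition.
    extern "C" int is_all_blank(const char *s)
    {
        if (s == 0) return 1;            // treat a missing value as blank
        for (; *s != '\0'; s++) {
            if (*s != ' ') return 0;     // found a non-space character
        }
        return 1;
    }

    // Compile to an object file with whatever C++ compiler the PX install is
    // configured to use (see APT_COMPILER / APT_COMPILEOPT), for example:
    //   g++ -c blank_check.cpp -o blank_check.o
    // then point the Parallel Routine definition at blank_check.o and call
    // is_all_blank() from a parallel Transformer.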

This is a very long cycle during testing. Does anyone have experience with this kind of development, and how do you handle it?

A half-related question: I can find a lot of standard BASIC routines, but there seem to be very few C routines available. Would anyone be willing to share a basic set of routines?
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Good luck. I have had very little help with this from Ascential support, and there isn't much on this forum, in the Devnet uploads, or in the documentation. Ascential expect people to take on the C programming language for advanced transformations; however, the support in the form of documentation and samples is lacking.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

BuildOps are actually pretty easy, and you don't have to do them completely by hand (manual compile, etc).

You can build them in either Manager or Designer. Right-click on Stage Types and go to New Parallel Stage/Build.... This opens a new window for building your new stage. There are four main tabs, but I tend to ignore all but the first and last (General, for naming your stage, and Build, for the coding).

In the Build tab there are lots of options as well. However, if you are simply encapsulating ETL-type logic, you really only need the Interfaces and Logic tabs.

In the Build/Interfaces tab, you assign an input schema/table definition describing what the record looks like coming into the stage, and an output definition describing what it looks like coming out. These are just normal table definitions, not much different from the ones you put together when importing a file. They are assigned in the Input and Output tabs, respectively. We hardly ever use the Transfer tab.

In the Build/Logic tab you do your coding. Definitions: put any temporary variables you'll need here, so they don't get redefined every time you process a record, along with any helper functions you may want (for example, we have a function that converts non-printable hex data in a string to spaces; see the sketch below).
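Roughly, the Definitions box might hold something like this (a simplified sketch, not our actual code; the names are made up):

    // Build/Logic/Definitions - a simplified sketch. Anything declared here is
    // defined once, not per record.
    int rec_count;                        // example temporary variable

    // Example helper: replace non-printable characters in a null-terminated
    // string with spaces.
    void clean_nonprintable(char *s)
    {
        for (; *s != '\0'; s++) {
            if (*s < 0x20 || *s > 0x7e)   // outside the printable ASCII range
                *s = ' ';
        }
    }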

In the Pre-loop tab, add code that you want to run before any records are processed, i.e. initialization-type stuff. Conversely, use the Post-loop tab for code that runs after all records are processed. To be honest, we have never used these, but you could use Pre-loop to initialize a counter and Post-loop to print it to the log, as sketched below.
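If you did want the counter, it would look something like this (rec_count being declared in Definitions, as in the sketch above):

    // Pre-loop tab - runs once, before the first record:
    rec_count = 0;

    // The Per-Record tab would then include:
    rec_count++;

    // Post-loop tab - runs once, after the last record. Exactly how you write
    // it out is up to you; conceptually something like:
    printf("build stage processed %d records\n", rec_count);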

The Per-Record tab is the most important: this is what you want to happen to each record that is processed. Each field of your input schema can be referenced as in.fieldname, and the same goes for output: out.outputfield. A trivial example follows.
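For instance (the field names are made up; this assumes int32 fields price_am and qty_ct on the input definition and an int32 ext_price_am added on the output):

    // Per-Record tab - a made-up example.
    out.ext_price_am = in.price_am * in.qty_ct;   // simple derived field
    // With Auto Read / Auto Write left on, the stage reads the next record and
    // writes this one for us; otherwise you would use the build stage macros
    // (readRecord / writeRecord) covered in the docs mentioned below.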

Most ETL-type functions you'll need are provided, but you'll definitely need to read the documentation (Parallel Job Developer's Guide, chapter 'Specifying Your Own Parallel Stage', sections 'Defining Build Stages' and 'Build Stage Macros', plus the functions in Appendix B).

We have a lot of programmers here who had little or no C background, and they have picked it up pretty quickly. Hopefully you will have a similar experience. We get much better performance from BuildOps than we do from Transformers, so we use BuildOps exclusively. Works great!
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Does this mean that in any BuildOp you need to hard-code your input and output table definitions? Most parallel stages have dynamic metadata: you can choose your input and output columns and your relevant key or value columns, and use the same stage for many types of data. Is there a way of building a custom stage that can process changing column definitions? Or perhaps one with a small number of named columns and then any number of additional unnamed columns?
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

No, they don't need to be hard-coded. Our shop started out using Torrent, the predecessor to DataStage PX, and back then it was much easier to write the BuildOps with hard-coded schemas because we generated everything.

In DataStage it is much easier to simply propagate fields, so we are hard-coding the schemas a lot less now. However, in some cases we have enormous records coming in but small records going out. There are numerous ways to get rid of unwanted fields, and using a BuildOp with hard-coded schemas is one of them.

On the other hand, when only a fraction of the input fields need a transformation and the rest just pass through, why define the entire schema? You only need to specify the input fields that will be referenced in the code, and the output fields that are created within the code.

For example, say you are constructing an identifier field cust_acct_id that is comprised of 3 input fields plus some calculated value. You would define the 3 input fields in your input schema; the calculated value is probably a temporary variable (created in the Build/Logic/Definitions tab). The per-record code then constructs your output field, out.cust_acct_id, which is defined in your output schema.
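To sketch it (all the names and the packing scheme are made up; assume int32 inputs region_cd, branch_cd and acct_seq_nb small enough to pack into one int32 output, and a check_digit temporary declared in Definitions):

    // Per-Record tab - rough sketch of the cust_acct_id example.
    check_digit = (in.region_cd + in.branch_cd + in.acct_seq_nb) % 10;
    out.cust_acct_id = in.region_cd   * 10000000
                     + in.branch_cd   * 10000
                     + in.acct_seq_nb * 10
                     + check_digit;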

Alternatively, you may analyze one input field, acct_balance_am, and populate 5 new fields in the output. Your input schema would contain acct_balance_am and your output schema would contain the definitions of the 5 new fields, for example:
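(Again, made-up names, and assuming acct_balance_am comes in as a dfloat.)

    // Per-Record tab - one input field fanned out into several derived flags.
    out.is_overdrawn_fl  = (in.acct_balance_am < 0)        ? 1 : 0;
    out.is_zero_fl       = (in.acct_balance_am == 0)       ? 1 : 0;
    out.is_high_value_fl = (in.acct_balance_am > 100000.0) ? 1 : 0;
    // ...and so on for the remaining output fields.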

Is that what you were looking for?