Processing large numbers of input files

Post questions here related to DataStage Server Edition, for areas such as Server job design, DS BASIC, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Processing large numbers of input files

Post by peternolan9 »

Hi All,
I was wondering if anyone had any hints about how to do this.

We will have large numbers of Call Detail Record (CDR) files coming at us each day: we get a file per switch each time a switch fills up its 12MB file. We will put the files for each switch type into its own directory. We have written a C++ decoder for the switches (the files come at us in an unformatted binary format). The decoder takes the name of an input file, the switch type, and the name of the output file, and decodes the file.

We have written another program that searches a directory (passed as a parameter) and feeds the names of the input files it finds to the decoder. I'm considering combining the two programs into one, so we have a single program that reads all the files in the directory, decodes them, and writes the results to the decoded-file directory.

After this we must run DataStage jobs to pick up the decoded files and load them into the staging area.

To do this last bit I was thinking I could write C++ code to call the DataStage job via the DS C++ API. It all looks eminently doable, and learning the DS C++ API would be no bad thing.

But the thought just struck me (some days I am slow ;-) ) that someone here might have done something like this before: the whole business of processing large numbers of input files. I recall a job I worked on 4 years ago that processed a lot of files, but it did not call a prerequisite C++ program.

Has anyone else out there done something like this? That is, accepted large volumes of CDR files (or any other type of file), decoded them using something other than DS, then loaded the decoded files using a DS job?

I'd be very interested to know if anyone is willing to share something like this.

Thanks
Best Regards
Peter Nolan
www.peternolan.com
hemant
Participant
Posts: 67
Joined: Mon Dec 15, 2003 6:43 am

Re: Processing large numbers of input files

Post by hemant »

Hi!

For decoding we have a TTI tool, so I can't say much about that part, but once decoding is done you can write a UNIX shell script to load the data into your repository.
Then, using the before/after subroutines in the job properties, you can load the data into the staging area. You know the rest better than I do. Please get back to me; I don't know whether I've got to the depth of the problem or not, but if I'm on the right track, kindly revert.

Regards
Hemant
holgi02
Participant
Posts: 20
Joined: Tue Apr 22, 2003 3:17 am
Location: UK

Post by holgi02 »

Why not have a look at using the Folder stage, and use before/after subroutines to call the decode software? That way round you have DS in control, calling your decode software, instead of attempting to call DS from a C++ environment. If you want to automate the whole thing even further, try using a Job Sequence with a Wait For File activity to pull in the incoming CDR files from the directory that gets loaded from the switch.
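For illustration, a minimal before-job subroutine along those lines might look something like this (a sketch only: the decoder path and the use of InputArg are assumptions, not tested code):

Code: Select all

Subroutine RunDecoder(InputArg, ErrorCode)
* Hypothetical before-job subroutine. InputArg is assumed to carry the
* decoder's arguments, e.g. "/cdr/incoming/ericsson ericsson /cdr/decoded".
* The decoder path below is invented for illustration.
   Command = "/usr/local/bin/cdr_decode " : InputArg
   Call DSExecute("UNIX", Command, Output, SystemReturnCode)
   If SystemReturnCode <> 0 Then
      Call DSLogWarn("Decoder failed: " : Output, "RunDecoder")
      ErrorCode = SystemReturnCode ; * non-zero stops the job
   End Else
      ErrorCode = 0
   End
Return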

There must be hundreds of ways of skinning this cat! :wink:
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
you can use the API; however, IMHO, it will be simpler to use a DS routine to invoke the decoder and, when it's done, simply continue to process the files.
IMHO a BASIC job control will be far more flexible than a before/after routine.

Good Luck :)
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Processing large numbers of input files

Post by peternolan9 »

Hi All,
thanks for the input. I'm playing around with a few ways....and I'll get Tom to try out some of your other suggestions on Saturday (that's a working day here...).

Currently I have a C++ program working that scans the input folders, performs the decoding, zips, and saves; it has a wait-for-file semaphore as well as a kill semaphore.....if it calls DS to load the formatted file, that's all I will need.....when we try a few things we'll pick the one we like most....there are a lot of ways to skin this cat....we also have downstream dependencies, because while the CDRs are being extracted from the staging area we do not want any more being added....


Anyway, one other quick question.

To try out the C++ API I am cutting/pasting the demo program from Appendix A of the Server Job Developer's Guide. But when I cut/paste from the PDF I lose all the formatting...I started out reformatting everything, but it's quite a pain.....has anybody already reformatted this program and still got the source? And is willing to share?

Thanks
Best Regards
Peter Nolan
www.peternolan.com
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: Processing large numbers of input files

Post by peternolan9 »

Hi All,
I forgot to paste what we finally did....

The problem was that the telco has 10 different switch input formats, many of which are binary, some of which have varying-length records in them, and almost none of which have newlines.......Each switch in the network (hundreds of them) generates files and throws them to mediation, and we get them from mediation...

So the problem was to catch these incoming files, decode them, and get them into the Oracle staging area....

Being the good DWA person I am, I wanted to do as much as possible in DS.

The solution we finished with was to write a C++ switch decoder of which we could start N instances, telling each instance the switch type it was to process. From the switch type it would figure out where to look for the files and which jobs to submit.

We also added the ability to put any single instance to sleep, using files as semaphores, so that the DS job that takes the switch records from the staging area and puts them into the DW can stop loads happening into the staging area while the switch records are extracted and flagged as updated....
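As a rough illustration of the semaphore idea (a sketch only: the paths and routine names below are invented, not our actual code):

Code: Select all

Subroutine PauseLoads(InputArg, ErrorCode)
* Hypothetical before-job subroutine for the extract job: raise a pause
* flag that each decoder instance polls before starting a new load.
   Call DSExecute("UNIX", "touch /cdr/semaphores/pause.loads", Output, SysRet)
   ErrorCode = SysRet
Return

Subroutine ResumeLoads(InputArg, ErrorCode)
* Hypothetical after-job subroutine: lower the flag so loads resume.
   Call DSExecute("UNIX", "rm -f /cdr/semaphores/pause.loads", Output, SysRet)
   ErrorCode = SysRet
Return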

All in all, a very neat solution where all pieces work together very nicely.

If anyone else is doing the same, please feel free to contact me on peter@peternolan.com.


Best Regards
Peter Nolan
www.peternolan.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

peternolan9 wrote: The problem was the telco has 10 different switch input formats
You haven't encountered the problem when they change the CDR format without telling anyone, then? It will happen! :evil:

DataStage BASIC can treat a directory as if it were a table, the operating system file names being treated as "keys".

Code: Select all

OpenPath dirpath To dir.fvar Then
   Select dir.fvar To 9 ; * build a select list of the file names
   Loop
   While ReadNext TheFileName From 9
      TheJobName = "MyJobName." : TheFileName ; * multi-instance: file name becomes the invocation id
      hJob = DSAttachJob(TheJobName, DSJ.ERRNONE)
      ErrCode = DSSetParam(hJob, "FileName", TheFileName)
      ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
      * etc
   Repeat
   Close dir.fvar
End
Error checking omitted for clarity.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

Hi Ray,
yep, it will happen, and it will crash and they will wonder why....

In fact, the customer actually asked me, quoted as closely as I remember:

"if an operational system changes it's data structure, can we get datastage to automatically understand the change in the data structures so that we don't need to make any changes in datastage?"

(Presumably they also feel the database should understand that if some new data is in a load file it should automatically figure out how to interpret the new data and load it....and then the BI tools should do the same... ;-) )

By the way, in the code above, as I read it (and you know I can't read this stuff for nuts), it looks like you are putting the file name in the job name. This would not work for me, as the file names have sequential numbers in them to tell you which occurrence of the file has come from the switch..

e.g. switch001NNN, where NNN is a sequential number that wraps around at 999......

But thanks for the tip on DS BASIC...I really must learn it one day if I'm going to have to keep writing DS code.....nowadays I just use C++ for everything, except when I need VB.NET...

Hope all is great with you!!!

Best Regards
Peter Nolan
www.peternolan.com
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Because it follows the "." character in the job name, the file name is being used as the invocation id (unique identifier) of a multi-instance job. This will work in your case, particularly if you mv or rm the file once it's processed.
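For instance, the loop in the earlier code could be continued along these lines (a sketch only; the archive directory is an assumed path):

Code: Select all

* After DSRunJob in the loop above: wait for the instance, detach,
* then move the file aside so it is never picked up twice.
ErrCode = DSWaitForJob(hJob)
ErrCode = DSDetachJob(hJob)
Call DSExecute("UNIX", "mv " : dirpath : "/" : TheFileName : " /cdr/processed/", Output, SysRet)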

MetaStage's post-and-notify mechanism for propagating knowledge of changes will help, but it's impossible (for all practical purposes) to have any ETL tool be "automagically" aware of metadata changes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Post by peternolan9 »

Hi Ray,
ah, so that's how you do that!!!!! Tom Nel and I were trying to figure out how to split an incoming batch of CDRs into, say, 8 streams (we have an 8-CPU DS license) and then process those 8 streams using job invocation IDs.....what we couldn't figure out was how to write the job so that it knew which file to open, and how to make the files inside the job unique so they would not be corrupted by multiple jobs writing to them...Tom tested everything he could think of, but we never knew you could use the incoming file name as the invocation ID...we thought that was a number we had to assign...
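For anyone else trying the same split, here is a rough job-control sketch of the pattern, building on Ray's loop (the job name "LoadCDR", the parameter name "FileName", and the 8-stream cap are assumptions for illustration; error checking omitted):

Code: Select all

* Sketch only: dispatch files to at most MaxStreams concurrent instances
* of a multi-instance job, using each file name as the invocation id.
MaxStreams = 8
Running = 0
Dim hJobs(8)
OpenPath dirpath To dir.fvar Then
   Select dir.fvar To 9
   Loop
   While ReadNext TheFileName From 9
      Running = Running + 1
      hJobs(Running) = DSAttachJob("LoadCDR." : TheFileName, DSJ.ERRNONE)
      ErrCode = DSSetParam(hJobs(Running), "FileName", TheFileName)
      ErrCode = DSRunJob(hJobs(Running), DSJ.RUNNORMAL)
      If Running = MaxStreams Then
         For I = 1 To Running ; * batch full: wait for it to drain
            ErrCode = DSWaitForJob(hJobs(I))
            ErrCode = DSDetachJob(hJobs(I))
         Next I
         Running = 0
      End
   Repeat
   For I = 1 To Running ; * wait for the final partial batch
      ErrCode = DSWaitForJob(hJobs(I))
      ErrCode = DSDetachJob(hJobs(I))
   Next I
   Close dir.fvar
End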

So, thanks very much for this...Tom will be a very happy camper!!!!


Best Regards
Peter Nolan
www.peternolan.com