writing data to a sequential file

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

dlvsrinivas
Participant
Posts: 18
Joined: Mon Jul 07, 2003 12:56 am
Location: India

writing data to a sequential file

Post by dlvsrinivas »

Hi,
I would like to create a server job in DataStage 6.0 on Sun Solaris which takes fields from a sequential file, processes them using different flows, and produces output fields that all belong to a single record. Could anybody suggest how I can merge all these fields back into a single sequential file? The combination of all these fields will form a record of the sequential file, and the sequential file will have a standard column definition.
spracht
Participant
Posts: 105
Joined: Tue Apr 15, 2003 11:30 pm
Location: Germany

Post by spracht »

If I get you right, you have several outputs, but you want them all to write to one sequential file. In my experience, it is better to write to separate files first and then concatenate all these files into one, e.g. using something like 'cat file1 file2 file3 > file4' in an after-job subroutine.
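In case it's useful, here's a rough sketch of what the body of such an after-job subroutine could look like in DS BASIC. It's only an illustration: a custom Before/After subroutine receives two arguments, InputArg and ErrorCode, and here InputArg is assumed to carry the whole shell command; 'ConcatFiles' is just a made-up name for the routine.

* Body of a custom Before/After subroutine (arguments: InputArg, ErrorCode).
* InputArg is assumed to hold the command, e.g. cat file1 file2 file3 > file4
ErrorCode = 0

* Run the command through the Unix shell.
Call DSExecute("UNIX", InputArg, Output, SystemReturnCode)

If SystemReturnCode <> 0 Then
   Call DSLogWarn("Concatenation failed: " : Output, "ConcatFiles")
   ErrorCode = SystemReturnCode ;* non-zero tells DataStage the routine failed
End

You would then name the routine in the job's after-job subroutine property and type the cat command into its input value field.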

Stephan
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You'll either need to do as spracht suggests, as multiple streams *cannot* write to a single sequential file simultaneously, or have an intermediate target (hash file, Oracle) that can handle that and then pull the result out to your sequential file after it has been populated.

-craig
ariear
Participant
Posts: 237
Joined: Thu Dec 26, 2002 2:19 pm

Post by ariear »

dlvsrinivas,
I think you want to compose a single record from separate flows. To do that, you probably have the same key in every flow. You can decide that one flow will result in a sequential file and all the others in hash files, and then join them using lookups. Or use a series of Merge jobs.
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX

Post by kduke »

dlvsrinivas

I think you can use stage variables to store a complete record. I think your data stream looks like this:

1,John
2,Smith
3,111 E. 1st
4,Dallas
5,TX
6,77777
1,Jim
2,Jones
...

Where the first field tells you which field follows, so 1=FirstName and 2=LastName. All your stage variables are If expressions that check the field type.

Type = Field(InLine, ",", 1)
FirstName = OldFirstName
OldFirstName = If Type = 1 Then Field(InLine, ",", 2) Else OldFirstName
LastName = If Type = 2 Then Field(InLine, ",", 2) Else LastName
Addr = If Type = 3 Then Field(InLine, ",", 2) Else Addr
...

Constraint:
FirstName <> OldFirstName

You get a new record when the first field changes. Your output record is all the stage variables.

There is a trick to get the last record. This trick is on the Tricks and Tips CD that Ascential gives out. I can look it up or maybe Ray or Ken knows it.

Kim.

Kim Duke
DwNav - ETL Navigator
www.Duke-Consulting.com
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL

Post by kcbland »

I'm trying to figure out why you need to do what you are doing. To take a row and split it by columns essentially means pivoting a single wide row into many narrow rows (key columns + 1 attribute column?). I can only figure out one "conventional" rationale behind this, and that would be to improve job processing performance. If you have a single wide row, you have a lot of business rules to perform against many columns, and that equates to slow throughput.

Most people, when tuning DataStage jobs, use rows/second as the measurement of performance. This is COMPLETELY wrong! Bytes/second is more accurate. If you are trying to improve performance by decomposing a row, processing columns in parallel, and then recomposing the row, then there's a better way.

You should keep your row together, and simply use job instantiation to achieve parallel processing of jobs. Instead of 1 slow job, you can have N number of slow jobs. At least you're N times faster and you don't have to deal with what you originally requested.

Let's say you took a job with 50 attribute columns and split the processing into two jobs, the first one taking the key columns and attribute columns 1-25, the second one taking the key columns and attribute columns 26-50. At best you're theoretically going to be twice as fast, and only if you split the processing load perfectly. You then have to add a job to recombine the rows, probably using a hash file to park one of the halves of the data for reference lookup/rejoin processing. If you split the row into 5 jobs, each with 10 columns, your recombination is nastier and more costly. You are really heading down the wrong path here. Just how many CPUs do you have for doing this? How scalable is this approach?

Now, if you take your job and process a row with all of its columns in its entirety, you don't have any recombination to do. Use job instantiation techniques and create N instances of the job to scale to N times the net rows processed. Ideally, match N to the number of CPUs on your server, provided you have not mixed external data stores into your job and are dealing with local host hash files and sequential files. This will be the fastest solution.
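To make the instantiation idea concrete, here's a rough DS BASIC job-control sketch of what starting N instances might look like. The job name 'LoadCustomer', the parameter 'PartitionNumber' and the name 'RunInstances' are made up for illustration; the idea is that each instance uses the parameter to pick off its own slice of the data.

$INCLUDE DSINCLUDE JOBCONTROL.H

* Start 4 instances of a hypothetical job "LoadCustomer", each with its own
* invocation id, so that they run in parallel.
N = 4
Dim hJob(4)

For I = 1 To N
   hJob(I) = DSAttachJob("LoadCustomer." : I, DSJ.ERRFATAL)
   ErrCode = DSSetParam(hJob(I), "PartitionNumber", I) ;* hypothetical parameter
   ErrCode = DSRunJob(hJob(I), DSJ.RUNNORMAL)
Next I

* Wait for every instance to finish and check its status.
For I = 1 To N
   ErrCode = DSWaitForJob(hJob(I))
   If DSGetJobInfo(hJob(I), DSJ.JOBSTATUS) = DSJS.RUNFAILED Then
      Call DSLogWarn("Instance " : I : " failed", "RunInstances")
   End
   ErrCode = DSDetachJob(hJob(I))
Next I

Each instance would then typically constrain its input to its own slice, for example with something like Mod(key, N) + 1 = PartitionNumber in a Transformer constraint, but that part depends entirely on your data.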

If the reason you are processing each column separately is because of extremely complex business rules where each column has multiple lookups for it, ultimately driving your job design to a high number of stages, then I would look to address the business rules. If you have a high degree of commonality in the column processing, you should look to containers to encapsulate and share common processing. You could also use containers to simplify the job design so that it is more manageable.

Good luck!

Kenneth Bland
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Even Megabytes per minute rather than bytes per second, as this smooths out statistical wrinkles. Ultimately, though, the key performance indicator in ETL is finishing within the allocated time window (that is, elapsed time).

There are a couple of ways to get the last line. One is to pre-process the file and add a dummy line to the end of it (if it's a sequential file, of course). Something like echo ",,,,," >> filename.
Another method is not to constrain the output, but to pass it through an Aggregator (the input is sorted, right?) and use the set function Last on the column in which the list is built.

There is a white paper on this very topic (building lists in stage variables) queued for publication on DSX once legal approval has been obtained from Ascential for mention of their trademark.


Ray Wurlod
Education and Consulting Services
ABN 57 092 448 518
dlvsrinivas
Participant
Posts: 18
Joined: Mon Jul 07, 2003 12:56 am
Location: India

Post by dlvsrinivas »

Dear All,

Thanks for your valuable suggestions. I am revising my design based on your inputs.
Also, I have some new queries.

1) I would like to know when the combination of Link Partitioner and Link Collector is used. Can it be used for the scenario I have?

2) If I proceed with processing the row with all of its columns in its entirety, there are some cases in which, at a certain Transformer stage, I need to write constraints that split the flow into two or more paths. Considering that I am carrying all the information of the row along each path, how will I write all these records into a single sequential file again? If I use a Sequential File stage at the end of each path, each pointing to a different file in the file system, could somebody suggest how the records of all these files can be written to a single sequential file?

Thanks in advance,
Srinivas.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Stephan has already answered this...

The simple answer to #2 is to use an after-job routine. That routine can call a simple script to concatenate the sequential files together. For example, for three files imaginatively named File1, File2 and File3 with the end result going to TargetFile:

Unix - "cat File1 File2 File3 > TargetFile"
DOS - "copy File1+File2+File3 TargetFile"

-craig