extract data from a file with a different number of fields in each row

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

extract data from a file with a different number of fields in each row

Post by Lucky »

Hi,

I have a situation where I have to extract data from a file in which each row contains an independent set of fields.

ex:

row1 -> rec_type, id,name,qualification

row2 -> rec_type,country,state,place,location

row3 -> rec_type,num_of_rows,file_date,file_time

Depending on the rec_type, I have to load different tables: if rec_type = 'H' (header), the row should go to Table1; if rec_type = 'D', the row should be inserted into Table2; and if rec_type = 'T', the row has to go to Table3.

H -> Header
D -> Detail
T -> Trailer

Please let me know what options should be used in DataStage.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

Read the file with a Sequential File stage with 5 columns specified in the metadata.

Drag it into a Transformer and create three output links pointing to the three different tables.

In the constraint of each link, specify the condition on the rec_type,
and you are done.
Success consists of getting up just one more time than you fall.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi,

The issue is that when you take 5 columns into the Transformer, if one of the records consists of fewer or more than 5 columns, DataStage issues an error. I am looking for an option where DataStage can read rows consisting of different numbers of columns from a file.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

Read the file with two columns only: the first as rec_type and the second as AllCols. The second should have type VarChar, with length equal to the maximum record length across H, D and T.

Then in the Transformer, create 3 links depending on the RecType. Each of these links will carry the AllCols column. On these links put a Row Splitter stage, which will create the required columns from the AllCols input column.

So for a header row, the Row Splitter will take AllCols as input and create rec_type, id, name and qualification as output.
Follow the same for all and you will get what you want, i.e. H, D and T hitting 3 different tables.
Regards,
S. Kirtikumar.
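Outside DataStage, Kirtikumar's read-then-split idea can be sketched in Python. This is only an illustration of the technique, not the Row Splitter stage itself; the field layouts are taken from the example rows in the original post, and `LAYOUTS`/`split_rows` are hypothetical names.

```python
# Sketch of the "read rec_type + AllCols, then split per record type" approach.
# Layouts per record type come from the example rows in the original post;
# adjust them to the real file.

LAYOUTS = {
    "H": ["rec_type", "id", "name", "qualification"],
    "D": ["rec_type", "country", "state", "place", "location"],
    "T": ["rec_type", "num_of_rows", "file_date", "file_time"],
}

def split_rows(lines):
    """Route each comma-delimited row to a bucket keyed by record type."""
    out = {"H": [], "D": [], "T": []}
    for line in lines:
        # First field is the record type; the rest is the "AllCols" string.
        rec_type, _, all_cols = line.rstrip("\n").partition(",")
        layout = LAYOUTS.get(rec_type)
        if layout is None:
            continue  # unknown record type; a real job might reject it
        fields = [rec_type] + all_cols.split(",")
        out[rec_type].append(dict(zip(layout, fields)))
    return out
```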
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

I think I should have stressed reading the data from the file more than the logic that follows after reading it.

The issue I am facing now is reading data from a file whose rows consist of different numbers of fields.

H -> 7 Fields
D -> 4 Fields
T -> 8 Fields

Please let me know if there are any suggestions for implementing this logic.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

You could read the sequential file with a Sequential File stage with one column. Set this column to have a delimiter of 000 and a quote of 000.

Connect it to a transformer and select the first character of each row into a stage variable.

Have 3 outputs to sequential files, each containing the one row and set the constraints for each output to be the appropriate value of the stage variable.

Assuming your input is comma-delimited,
e.g. Stage Variable: RecType = input.col[',',1,1]
Constraint for Header output = RecType='H'

That should work (I think).
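Shane's stage-variable-and-constraint idea amounts to the following Python sketch (the `input.col[',',1,1]` field expression becomes a split on the first comma; `route_by_rec_type` is a hypothetical name):

```python
def route_by_rec_type(lines):
    """Split whole rows into header/detail/trailer lists by their first
    field, keeping each row in its original format -- as the three
    sequential-file outputs with constraints do in the DataStage job."""
    header, detail, trailer = [], [], []
    targets = {"H": header, "D": detail, "T": trailer}
    for line in lines:
        rec_type = line.split(",", 1)[0]  # equivalent of input.col[',',1,1]
        bucket = targets.get(rec_type)
        if bucket is not None:
            bucket.append(line)
    return header, detail, trailer
```

Rows with an unrecognized first field are simply dropped here; in the job they would fall through the constraints.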
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

ShaneMuir wrote:You could read the sequential file with a Sequential File stage with one column. Set this column to have a delimiter of 000 and a quote of 000.

Connect it to a transformer and select the first character of each row into a stage variable.

Have 3 outputs to sequential files, each containing the one row and set the constraints for each output to be the appropriate value of the stage variable.

Assuming your input is comma-delimited,
e.g. Stage Variable: RecType = input.col[',',1,1]
Constraint for Header output = RecType='H'

That should work (I think).
I should mention that the output files retain the original format of the input file, but separated into header, detail and trailer.

You can then use each file as required.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi Shane, Kirti, loveojha2,

Thanks a lot for your quick responses.

I have got the job to perform what I wanted.

Shane: I simply followed your solution, included my logic, and it worked straight away.

Thank you all for your time.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi All,

There is some more logic to be implemented in this job.

1. Need to check that both Trailer (T) and Header (H) records are present in the file.
2. The Detail (D) records should be processed only if the file contains both Header and Trailer records.
3. The Trailer will have date, time and ID fields (in addition to the summary fields) which have to be concatenated as ID_date_time, and this value has to be used as a separate field in each Detail (D) row.

Please give me ideas to achieve this solution.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How about you give us your thoughts first?

How many outputs does this job have? What is the layout of each?

Can you do it in a single job? If not, why not?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Lucky wrote:Hi All,

There is some more logic to be implemented in this job.

1. Need to check if both Trailer (T) and Header (H) records are present in the file.
2. The Detail (D) records should be processed only if the file consists of both Header and Trailer records.
3. The Trailer will have date, time and ID fields (in addition to the summary fields) which will have to be concatenated as ID_date_time and this value has to be used as a separate field in each Detail (D) row.

Please give me ideas to achieve this solution.
Are there any indicators which link a header record to a trailer record for each set? i.e. is the ID field you mention an identifier for each subset of records?
Or is it based solely on the input order, i.e. are you expecting an input of:
  • H,info
    D,info
    D,info
    ...
    T,info
for each subset? And if there was no 'T' before the next 'H', would you ignore that portion of the data?
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi Ray/Shane,

I have just finished working on this job.

The logic I have followed:

1. Read all the columns in a row as a string into a Transformer (as suggested in the earlier posts by Shane).
2. Use a stage variable to get the Trailer/Detail record types (the first field in each string record is an identifier which helps determine whether the row is a Header/Detail/Trailer).
3. Use two Sort stages and a hash file stage after the Transformer. The constraints used are:
a. RecType = 'D' for the 1st Sort stage
b. RecType = 'T' for the 2nd Sort stage
c. RecType = 'T' for the hash file stage
4. Join the hash file and the Detail Sort stage data by source file (passed as a parameter to the job) and insert into Table1 (a constraint makes sure data is inserted only if there is an entry in the hash file).
5. I used the Sort stage to accumulate data in the buffer rather than processing rows one at a time. This way I can avoid processing Detail records before the Trailer record.

I hope I have made my ideas clear.
Please correct me if I have to follow a much better approach.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

By "Sorter" do you mean a Sort stage?

For efficiency I'd be inclined to pre-process this file to count the rows beginning with "H" and the rows beginning with "T" to make sure that they're the same. This could be done quite quickly in a shell script. Of course, it does not guarantee that there are matching pairs of header and trailer records, but it does let you abandon processing quickly where the counts are not identical.
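A pre-check along the lines Ray suggests could look like this as a shell function (assuming a comma-delimited file whose first field is the record type; the function name is illustrative):

```shell
# Abandon processing early when header and trailer counts differ.
# Assumes the first comma-delimited field of each row is the record type.
check_header_trailer() {
    file="$1"
    h_count=$(grep -c '^H,' "$file")
    t_count=$(grep -c '^T,' "$file")
    if [ "$h_count" -ne "$t_count" ]; then
        echo "MISMATCH: $h_count header(s), $t_count trailer(s)" >&2
        return 1
    fi
    echo "OK: $h_count header(s), $t_count trailer(s)"
}
```

As Ray notes, equal counts do not guarantee matching header/trailer pairs; this only lets you reject obviously bad files quickly.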

Since you have not specified your required output, it is impossible to determine whether the remainder of your approach is optimal or not. What do you mean by "buffer"? What happens to the detail records?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Lucky wrote: I hope I have made my ideas clear.
Please correct me if I have to follow a much better approach.
Not really that clear, unfortunately.
I am assuming that you are writing the Trailer record to a hash file and then using some sort of key value to match that trailer record to a detail record. If there is no match, no records are written to your target?
I am also assuming that this gives the desired result?

Something like

Code: Select all

SeqFile ----> TRF ---> SeqFileDetail  ----> TRF ----> Target
               |                             ^
               |                             |
               V                             |
            HshFileTrailer ------------------+
This would seem like the easiest solution?
Where does the header information come in? Do you require the header and trailer information for purposes other than identifying the start and end of a data set?
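The hash-file lookup in that diagram behaves like a keyed dictionary: trailer rows are written once, and each detail row looks up its trailer by a shared key (the source file name, as Lucky later confirms). A rough Python sketch, with illustrative names:

```python
def join_details_to_trailer(detail_rows, trailer_rows, key_field="src_file"):
    """Emulate the hash-file lookup: details with a matching trailer pass
    through; the rest go to a reject list. Rows are dicts; key_field is
    an illustrative column name."""
    # Hash file semantics: one row per key, last write wins.
    lookup = {t[key_field]: t for t in trailer_rows}
    matched, rejects = [], []
    for d in detail_rows:
        if d[key_field] in lookup:
            matched.append(d)
        else:
            rejects.append(d)
    return matched, rejects
```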
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Ray,

Yes, I meant the Sort stage. I agree with your suggestion about the performance issues if we use a Sort stage, but I am very new to DataStage, still getting used to it and still exploring the options available in DS. I have worked on Informatica, so I am using that knowledge to create jobs in DataStage. Having said that, each tool is different in its own way, as you know.

The important thing is that the file will have only one Header and one Trailer record. So now I have to make sure that when I process a file, it contains a header and a trailer with detail records in between. Otherwise I should not process the file and should reject the entire file.

Can you please let me know how we can pre-process the file to check the headers and trailers?

Shane: Your pictorial representation is absolutely correct as far as processing the detail records is concerned, except that there should be a Sort stage before the SeqFileDetail stage, and yes, this gives me the desired result. If there is no match (the match is done on SRCE_FILE) then I write these records to the reject file. I do not need the header information except to check that the file contains Header and Trailer records. But I do need the Trailer information for reconciliation purposes, as it contains the summary of the Detail records. The trailer record in a file will go to Table2 and all the valid detail records to Table1.

Purpose of checking Header and Trailer records: To make sure all the detail records are in a single file.
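Pulling the thread's final requirements together, the whole validate-then-enrich pass can be sketched in Python (assuming exactly one H and one T row per file, as Lucky states; the trailer field positions for ID, date and time are illustrative, and `process_file` is a hypothetical name):

```python
def process_file(lines):
    """Reject the file unless exactly one header and one trailer are
    present; otherwise stamp each detail row with the trailer's
    ID_date_time key. Assumed trailer layout: T,id,date,time,...
    (field positions are illustrative)."""
    headers = [l for l in lines if l.startswith("H,")]
    trailers = [l for l in lines if l.startswith("T,")]
    if len(headers) != 1 or len(trailers) != 1:
        raise ValueError("file rejected: need exactly one header and one trailer")
    t_fields = trailers[0].rstrip("\n").split(",")
    key = "{}_{}_{}".format(t_fields[1], t_fields[2], t_fields[3])  # ID_date_time
    # Append the concatenated key as an extra field on every detail row.
    return [l.rstrip("\n") + "," + key for l in lines if l.startswith("D,")]
```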