extract data from a file with a different number of fields in each row

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

extract data from a file with a different number of fields in each row

Post by Lucky »

Hi,

I have a situation where I have to extract data from a file in which each row contains an independent set of fields.

ex:

row1 -> rec_type, id,name,qualification

row2 -> rec_type,country,state,place,location

row3 -> rec_type,num_of_rows,file_date,file_time

Depending on the rec_type, I have to load different tables: if rec_type = 'H' (header), the row should go to Table1; if rec_type = 'D', the row should be inserted into Table2; and if rec_type = 'T', the row has to go to Table3.

H -> Header
D -> Detail
T -> Trailer

Please let me know what options should be used in DataStage.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

Read the file with a Sequential File stage with 5 columns specified in the metadata.

Drag it into a Transformer and create three output links pointing to the three different tables.

In the constraint of each link, specify the condition on the rec_type,
and you are done.
Success consists of getting up just one more time than you fall.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi,

The issue is that when you take 5 columns into the Transformer, if one of the records consists of fewer or more than 5 columns, DataStage issues an error. I am looking for an option where DataStage can read rows consisting of different numbers of columns from a file.
Kirtikumar
Participant
Posts: 437
Joined: Fri Oct 15, 2004 6:13 am
Location: Pune, India

Post by Kirtikumar »

Read the file with two columns only: the first as rec_type and the second as AllCols. The second should have type VarChar, with length equal to the maximum record length across H, D and T.

Then in the Transformer, create 3 links depending on the RecType. Each of these links will carry the AllCols column. On these links put a Row Splitter stage, which will create the required columns from the AllCols input column.

So for a header row, the Row Splitter will take AllCols as input and create rec_type, id, name and qualification as output.
Follow the same for all and you will get what you want, i.e. H, D and T hitting 3 different tables.
Regards,
S. Kirtikumar.
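Outside DataStage, Kirtikumar's read-then-split idea can be sketched in Python. This is only an illustration of the technique, not the Row Splitter stage itself; the field layouts are taken from the example rows in the original post, and `LAYOUTS`/`split_rows` are hypothetical names.

```python
# Sketch of the "read rec_type + AllCols, then split per record type" approach.
# Layouts per record type come from the example rows in the original post;
# adjust them to the real file.

LAYOUTS = {
    "H": ["rec_type", "id", "name", "qualification"],
    "D": ["rec_type", "country", "state", "place", "location"],
    "T": ["rec_type", "num_of_rows", "file_date", "file_time"],
}

def split_rows(lines):
    """Route each comma-delimited row to a bucket keyed by record type."""
    out = {"H": [], "D": [], "T": []}
    for line in lines:
        # First field is the record type; the rest is the "AllCols" string.
        rec_type, _, all_cols = line.rstrip("\n").partition(",")
        layout = LAYOUTS.get(rec_type)
        if layout is None:
            continue  # unknown record type; a real job might reject it
        fields = [rec_type] + all_cols.split(",")
        out[rec_type].append(dict(zip(layout, fields)))
    return out
```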
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

I think I should have stressed reading the data from the file more than the logic that follows after reading it.

The issue I am facing now is reading data from a file whose rows consist of different numbers of fields.

H -> 7 Fields
D -> 4 Fields
T -> 8 Fields

Please let me know if there are any suggestions for implementing this logic.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

You could read the sequential file with a Sequential File stage with one column. Set this column to have a delimiter of 000 and a quote of 000.

Connect it to a transformer and select the first character of each row into a stage variable.

Have 3 outputs to sequential files, each containing the one row and set the constraints for each output to be the appropriate value of the stage variable.

Assuming your input is comma-delimited,
e.g. Stage Variable: RecType = input.col[',',1,1]
Constraint for Header output = RecType='H'

That should work (I think).
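Shane's stage-variable-and-constraint idea amounts to the following Python sketch (the `input.col[',',1,1]` field expression becomes a split on the first comma; `route_by_rec_type` is a hypothetical name):

```python
def route_by_rec_type(lines):
    """Split whole rows into header/detail/trailer lists by their first
    field, keeping each row in its original format -- as the three
    sequential-file outputs with constraints do in the DataStage job."""
    header, detail, trailer = [], [], []
    targets = {"H": header, "D": detail, "T": trailer}
    for line in lines:
        rec_type = line.split(",", 1)[0]  # equivalent of input.col[',',1,1]
        bucket = targets.get(rec_type)
        if bucket is not None:
            bucket.append(line)
    return header, detail, trailer
```

Rows with an unrecognized first field are simply dropped here; in the job they would fall through the constraints.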
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

ShaneMuir wrote:You could read the sequential file with a Sequential File stage with one column. Set this column to have a delimiter of 000 and a quote of 000.

Connect it to a transformer and select the first character of each row into a stage variable.

Have 3 outputs to sequential files, each containing the one row and set the constraints for each output to be the appropriate value of the stage variable.

Assuming your input is comma-delimited,
e.g. Stage Variable: RecType = input.col[',',1,1]
Constraint for Header output = RecType='H'

That should work (I think).
I should mention that the output files retain the original format of the input file, but separated into header, detail and trailer.

You can then use each file as required.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi Shane, Kirti, loveojha2,

Thanks a lot for your quick responses.

I have got the job to perform what I wanted.

Shane: I simply followed your solution, included my logic, and it worked straight away.

Thank you all for your time.
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi All,

There is some more logic to be implemented in this job.

1. Need to check that both Trailer (T) and Header (H) records are present in the file.
2. The Detail (D) records should be processed only if the file contains both Header and Trailer records.
3. The Trailer will have date, time and ID fields (in addition to the summary fields) which have to be concatenated as ID_date_time, and this value has to be used as a separate field in each Detail (D) row.

Please give me ideas to achieve this solution.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How about you give us your thoughts first?

How many outputs does this job have? What is the layout of each?

Can you do it in a single job? If not, why not?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Lucky wrote:Hi All,

There is some more logic to be implemented in this job.

1. Need to check if both Trailer (T) and Header (H) records are present in the file.
2. The Detail (D) records should be processed only if the file consists of both Header and Trailer records.
3. The Trailer will have date, time and ID fields (in addition to the summary fields) which will have to be concatenated as ID_date_time and this value has to be used as a separate field in each Detail (D) row.

Please give me ideas to achieve this solution.
Are there any indicators which link a header record to a trailer record for each set? i.e. is the ID field you mention an identifier for each subset of records?
Or is it based solely on the input order, i.e. are you expecting an input of:
  • H,info
    D,info
    D,info
    ...
    T,info
for each subset? And if there was no 'T' before the next 'H', would you ignore that portion of the data?
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Hi Ray/Shane,

I have just finished working on this job.

The logic I have followed:

1. Read all the columns in a row as a string into a Transformer (as suggested in the earlier posts by Shane).
2. Use a stage variable to get the Trailer/Detail record types (the first field in each string record is an identifier which helps determine whether the row is a Header/Detail/Trailer).
3. Use two Sort stages and a hash file stage after the Transformer. The constraints used are:
a. RecType = 'D' for the 1st Sort stage
b. RecType = 'T' for the 2nd Sort stage
c. RecType = 'T' for the hash file stage
4. Join the hash file and the Detail Sort stage data by source file (passed as a parameter to the job) and insert into Table1 (a constraint makes sure data is inserted only if there is an entry in the hash file).
5. I used the Sort stage to accumulate data in the buffer rather than processing rows one at a time. This way I can avoid processing Detail records before the Trailer record.

I hope I have made my ideas clear.
Please correct me if I have to follow a much better approach.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

By "Sorter" do you mean a Sort stage?

For efficiency I'd be inclined to pre-process this file to count the rows beginning with "H" and the rows beginning with "T" to make sure that they're the same. This could be done quite quickly in a shell script. Of course, it does not guarantee that there are matching pairs of header and trailer records, but it does let you abandon processing quickly where the counts are not identical.
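A pre-check along the lines Ray suggests could look like this as a shell function (assuming a comma-delimited file whose first field is the record type; the function name is illustrative):

```shell
# Abandon processing early when header and trailer counts differ.
# Assumes the first comma-delimited field of each row is the record type.
check_header_trailer() {
    file="$1"
    h_count=$(grep -c '^H,' "$file")
    t_count=$(grep -c '^T,' "$file")
    if [ "$h_count" -ne "$t_count" ]; then
        echo "MISMATCH: $h_count header(s), $t_count trailer(s)" >&2
        return 1
    fi
    echo "OK: $h_count header(s), $t_count trailer(s)"
}
```

As Ray notes, equal counts do not guarantee matching header/trailer pairs; this only lets you reject obviously bad files quickly.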

Since you have not specified your required output, it is impossible to determine whether the remainder of your approach is optimal or not. What do you mean by "buffer"? What happens to the detail records?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ShaneMuir
Premium Member
Posts: 508
Joined: Tue Jun 15, 2004 5:00 am
Location: London

Post by ShaneMuir »

Lucky wrote: I hope I have made my ideas clear.
Please correct me if I have to follow a much better approach.
Not really that clear, unfortunately.
I am assuming that you are writing the Trailer record to a hash file and then using some sort of key value to match that trailer record to a detail record. If there is no match, no records are written to your target?
I am also assuming that this gives the desired result?

Something like

Code: Select all

SeqFile ----> TRF ---> SeqFileDetail  ----> TRF ----> Target
               |                             ^
               |                             |
               V                             |
            HshFileTrailer ------------------+
This would seem like the easiest solution?
Where does the header information come in? Do you require the header and trailer information for purposes other than identifying the start and end of a data set?
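The hash-file lookup in that diagram behaves like a keyed dictionary: trailer rows are written once, and each detail row looks up its trailer by a shared key (the source file name, as Lucky later confirms). A rough Python sketch, with illustrative names:

```python
def join_details_to_trailer(detail_rows, trailer_rows, key_field="src_file"):
    """Emulate the hash-file lookup: details with a matching trailer pass
    through; the rest go to a reject list. Rows are dicts; key_field is
    an illustrative column name."""
    # Hash file semantics: one row per key, last write wins.
    lookup = {t[key_field]: t for t in trailer_rows}
    matched, rejects = [], []
    for d in detail_rows:
        if d[key_field] in lookup:
            matched.append(d)
        else:
            rejects.append(d)
    return matched, rejects
```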
Lucky
Participant
Posts: 31
Joined: Mon Oct 10, 2005 11:05 pm

Post by Lucky »

Ray,

Yes, I meant the Sort stage. I agree with your suggestion about the performance issues if we use a Sort stage, but I am very new to DataStage, still getting used to it and still exploring the options available in DS. I have worked on Informatica, so I am using that knowledge to create jobs in DataStage. Having said that, each tool is different in its own way, as you know.

The important thing is that the file will have only one Header and one Trailer record. So now I have to make sure that when I process a file, it contains a header and a trailer with detail records in between. Otherwise I should not process the file and should reject the entire file.

Can you please let me know how we can pre-process the file to check the headers and trailers?

Shane: Your pictorial representation is absolutely correct as far as processing the detail records is concerned, except that there should be a Sort stage before the SeqFileDetail stage, and yes, this gives me the desired result. If there is no match (the match is done on SRCE_FILE) then I write these records to the reject file. I do not need the header information except to check that the file contains Header and Trailer records. But I do need the Trailer information for reconciliation purposes, as it contains the summary of the Detail records. The trailer record in a file will go to Table2 and all the valid detail records to Table1.

Purpose of checking Header and Trailer records: To make sure all the detail records are in a single file.
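Pulling the thread's final requirements together, the whole validate-then-enrich pass can be sketched in Python (assuming exactly one H and one T row per file, as Lucky states; the trailer field positions for ID, date and time are illustrative, and `process_file` is a hypothetical name):

```python
def process_file(lines):
    """Reject the file unless exactly one header and one trailer are
    present; otherwise stamp each detail row with the trailer's
    ID_date_time key. Assumed trailer layout: T,id,date,time,...
    (field positions are illustrative)."""
    headers = [l for l in lines if l.startswith("H,")]
    trailers = [l for l in lines if l.startswith("T,")]
    if len(headers) != 1 or len(trailers) != 1:
        raise ValueError("file rejected: need exactly one header and one trailer")
    t_fields = trailers[0].rstrip("\n").split(",")
    key = "{}_{}_{}".format(t_fields[1], t_fields[2], t_fields[3])  # ID_date_time
    # Append the concatenated key as an extra field on every detail row.
    return [l.rstrip("\n") + "," + key for l in lines if l.startswith("D,")]
```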