Merge multiple files into single data set

suneeth · Post by **suneeth** » Tue Jun 22, 2004 8:32 pm

How would one merge multiple files into a single data set, with the columns from each file becoming new columns in the merged data set?

kcbland · Post by **kcbland** » Tue Jun 22, 2004 9:06 pm

Lots of ways:

1. Load each dataset into a separate table and use a join with one table as the driver.
2. Load each dataset into a separate table and use a UNION with a group-by with MIN/MAX operations on all columns mapping each table into a discrete set of columns. This gives you a full outer join effect across all four datasets.
3. Load 3 datasets into hash files and use a reference lookup for each dataset with one dataset as the primary input stream.
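
To make option 2 concrete, here is a minimal sketch of the UNION plus group-by idea, using Python's built-in `sqlite3` module purely as a stand-in database. The table names (`file_a`, `file_b`, `file_c`), the key column `k`, and the single `val` column are assumptions for illustration and are not from the original question. Each table is mapped into its own output column, padded with NULLs elsewhere, and `MAX` then collapses the rows for each key into one.

```python
import sqlite3


def merge_by_union(conn):
    """Full-outer-join effect: UNION ALL the tables, then GROUP BY the key."""
    sql = """
        SELECT k,
               MAX(a_val) AS a_val,
               MAX(b_val) AS b_val,
               MAX(c_val) AS c_val
        FROM (
            -- Each table fills only its own column; the rest are NULL.
            SELECT k, val AS a_val, NULL AS b_val, NULL AS c_val FROM file_a
            UNION ALL
            SELECT k, NULL, val, NULL FROM file_b
            UNION ALL
            SELECT k, NULL, NULL, val FROM file_c
        )
        GROUP BY k
        ORDER BY k
    """
    return conn.execute(sql).fetchall()


# Hypothetical sample data: three "files" loaded into three tables.
conn = sqlite3.connect(":memory:")
for name in ("file_a", "file_b", "file_c"):
    conn.execute(f"CREATE TABLE {name} (k INTEGER, val TEXT)")
conn.executemany("INSERT INTO file_a VALUES (?, ?)", [(1, "a1"), (2, "a2")])
conn.executemany("INSERT INTO file_b VALUES (?, ?)", [(2, "b2"), (3, "b3")])
conn.executemany("INSERT INTO file_c VALUES (?, ?)", [(1, "c1"), (3, "c3")])

rows = merge_by_union(conn)
for row in rows:
    print(row)
```

A key that is missing from one table simply comes out with NULL in that table's column, which is what gives the full outer join effect. The sketch assumes each key appears at most once per table; with duplicates, `MAX` would silently keep only one value.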

suneeth · Post by **suneeth** » Tue Jun 22, 2004 10:14 pm

Hi Kenneth Bland,
Many thanks for the quick response.
Mainly I am looking at the different approaches for achieving that:
1. Load into tables and join them.
2. Use the Merge stage.
Are there any other approaches?

cheers,
suneeth--


vmcburney · Post by **vmcburney** » Wed Jun 23, 2004 12:45 am

Ken's probably hit the sack by now, but fortunately for you it's a 24-hour forum. The merge stage is like a database union statement: it does not merge rows, it merges files. Is this what you are trying to achieve? There is also the link collector stage, which does a similar merge/union.

If you want to merge rows from different sources into a single row, in effect performing a join, you need to revisit the approaches outlined by Ken.

If you have multiple files then I would favour putting the smallest files into hash files and then processing the largest file as an input. Use hash file lookups to join the lookup files in a transformer and create a combined output row.
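
Outside DataStage, the same pattern can be sketched in plain Python, with dictionaries standing in for the hash files. This is only an illustration of the idea, not how a transformer is implemented; the file contents, the `cust_id` key, and the helper names `load_lookup` and `merge_stream` are all made up for the example.

```python
import csv
import io


def load_lookup(lines, key):
    """Read a small file fully into a dict keyed on `key` (the 'hash file')."""
    return {row[key]: row for row in csv.DictReader(lines)}


def merge_stream(primary_lines, lookups, key):
    """Stream the large file row by row and add columns from each lookup.

    `lookups` is a list of (lookup_dict, column_names) pairs. A key with no
    match in a lookup gets None for that lookup's columns.
    """
    for row in csv.DictReader(primary_lines):
        out = dict(row)
        for lookup, columns in lookups:
            match = lookup.get(row[key], {})
            for col in columns:
                out[col] = match.get(col)
        yield out


# Hypothetical sample files; in practice these would be opened from disk.
orders = io.StringIO("cust_id,amount\n1,10\n2,20\n3,30\n")
names = io.StringIO("cust_id,name\n1,Ann\n2,Bob\n")
regions = io.StringIO("cust_id,region\n2,East\n3,West\n")

lookups = [
    (load_lookup(names, "cust_id"), ["name"]),
    (load_lookup(regions, "cust_id"), ["region"]),
]
merged = list(merge_stream(orders, lookups, "cust_id"))
for row in merged:
    print(row)
```

Only the small files are held in memory, while the large file is read once as a stream. Note that this behaves like a left outer join driven by the primary file: keys that exist only in a lookup file never reach the output.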

roy · Post by **roy** » Wed Jun 23, 2004 1:05 am

Hi,
Vincent, the merge stage really does combine separate lines from 2 files into 1 line in your output file, performing a join much as you can in a DB (it even has a complete set option, equal to a full outer join).
But there are some disadvantages to using the merge stage (in no particular order):
1. It is not user friendly (especially when you need to make changes).
2. I recently came across a problem with large files (over 10 GB each).
3. It can only merge 2 files per merge stage.
4. It has no input link available, forcing you to make different jobs for dependent merge files.
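
For anyone unsure what joining two files' lines on a key looks like, here is a rough sketch in plain Python. It is not the merge stage's implementation or its option names: the function `merge_two`, the `how` values, and the sample rows are hypothetical, and it assumes keys are unique within each file and that non-key column names do not collide.

```python
def merge_two(left_rows, right_rows, key, how="full"):
    """Join two lists of row dicts on `key`, one output row per key."""
    left = {row[key]: row for row in left_rows}
    right = {row[key]: row for row in right_rows}

    if how == "full":        # every key from either file (full outer join)
        keys = set(left) | set(right)
    elif how == "inner":     # only keys present in both files
        keys = set(left) & set(right)
    else:
        raise ValueError(f"unsupported join type: {how}")

    # Every non-key column seen on either side becomes an output column.
    columns = sorted({c for row in left_rows + right_rows for c in row if c != key})

    merged = []
    for k in sorted(keys):
        combined = {**left.get(k, {}), **right.get(k, {})}
        # Columns from a file that lacks this key come out as None.
        merged.append({key: k, **{c: combined.get(c) for c in columns}})
    return merged


# Hypothetical sample rows standing in for two sequential files.
file_a = [{"id": 1, "x": "x1"}, {"id": 2, "x": "x2"}]
file_b = [{"id": 2, "y": "y2"}, {"id": 3, "y": "y3"}]

full = merge_two(file_a, file_b, "id", how="full")
inner = merge_two(file_a, file_b, "id", how="inner")
print(full)
print(inner)
```

Because the function takes exactly two inputs, bringing in a third file means calling it again on the result, which is the same chaining you end up with under point 3.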

IHTH,

vmcburney · Post by **vmcburney** » Wed Jun 23, 2004 4:30 pm

Thanks Roy, I must be getting my stages mixed up; it does indeed merge the input sequential file rows.

chulett · Post by **chulett** » Wed Jun 23, 2004 5:54 pm

500+ posts and getting your sig wrong, too!

elavenil · Post by **elavenil** » Wed Jun 23, 2004 11:11 pm

Hi

Merging two datasets (a few columns from the 1st DS and a few columns from the 2nd DS) can be done a few ways in PX:

1. Use the Merge stage.
2. Use the Join stage.
3. Use a lookup, with the primary stream as the source and the secondary source as the lookup.

Hope this helps.

Regards
Saravanan