Merge multiple files into single data set
Moderators: chulett, rschirm, roy
Merge multiple files into single data set
How would one merge multiple files into single data set with columns from each file becoming a new column in the merged data set?
Lots of ways:
1. Load each dataset into a separate table and use a join with one table as the driver.
2. Load each dataset into a separate table and use a UNION with a group-by with MIN/MAX operations on all columns mapping each table into a discrete set of columns. This gives you a full outer join effect across all four datasets.
3. Load 3 datasets into hash files and use a reference lookup for each dataset with one dataset as the primary input stream.
1. Load each dataset into a separate table and use a join with one table as the driver.
2. Load each dataset into a separate table and use a UNION with a group-by with MIN/MAX operations on all columns mapping each table into a discrete set of columns. This gives you a full outer join effect across all four datasets.
3. Load 3 datasets into hash files and use a reference lookup for each dataset with one dataset as the primary input stream.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hi Kenneth Bland,
Many thanks for the quicky response.
Mainly I am looking at different approaches in achieving that.
1. Load into tables & join them.
2. Merge stage can be used.
Or do we have any other approaches.
cheers,
suneeth--
Many thanks for the quicky response.
Mainly I am looking at different approaches in achieving that.
1. Load into tables & join them.
2. Merge stage can be used.
Or do we have any other approaches.
cheers,
suneeth--
kcbland wrote:Lots of ways:
1. Load each dataset into a separate table and use a join with one table as the driver.
2. Load each dataset into a separate table and use a UNION with a group-by with MIN/MAX operations on all columns mapping each table into a discrete set of columns. This gives you a full outer join effect across all four datasets.
3. Load 3 datasets into hash files and use a reference lookup for each dataset with one dataset as the primary input stream.
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
Ken's probably hit the sack by now but fortunately for you it's a 24hr forum. The merge stage is like a database union statement, it does not merge rows, it merges files. Is this what you are trying to achieve? There is also the link collector stage that does a similar merge/union.
If you want to merge rows from different sources into a single row, in effect performing a join, you need to revisit the approaches outlined by Ken.
If you have multiple files then I would favour putting the smallest files into hash files and then processing the largest file as an input. Use hash file lookups to join the lookup files in a transformer and create a combined output row.
If you want to merge rows from different sources into a single row, in effect performing a join, you need to revisit the approaches outlined by Ken.
If you have multiple files then I would favour putting the smallest files into hash files and then processing the largest file as an input. Use hash file lookups to join the lookup files in a transformer and create a combined output row.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Hi,
Vincent, the merge stage really does combine 2 files seperate lines to 1 line in your output file and makes a join as you can in a DB (it even has a complete set equal to a full outer join).
but there are some disadvantages to using the merge stage(in no particular order):
1. not user freindly (especially when you need to make changes)
2. I've recently came across a problem in large files (over 10 GB each)
3. it can only merge 2 files per merge stage.
4. it has no input link available, forcing you to make different jobs for dependant merge files.
IHTH,
Vincent, the merge stage really does combine 2 files seperate lines to 1 line in your output file and makes a join as you can in a DB (it even has a complete set equal to a full outer join).
but there are some disadvantages to using the merge stage(in no particular order):
1. not user freindly (especially when you need to make changes)
2. I've recently came across a problem in large files (over 10 GB each)
3. it can only merge 2 files per merge stage.
4. it has no input link available, forcing you to make different jobs for dependant merge files.
IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
![Image](http://www.worldcommunitygrid.org/images/logo.gif)
Time is money but when you don't have money time is all you can afford.
Search before posting:)
Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
![Image](http://www.worldcommunitygrid.org/images/logo.gif)
-
- Participant
- Posts: 3593
- Joined: Thu Jan 23, 2003 5:25 pm
- Location: Australia, Melbourne
- Contact:
Thanks Roy, I must be getting my stages mixed up, it does indeed merge the input sequential file rows.
Certus Solutions
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn
Blog: Tooling Around in the InfoSphere
Twitter: @vmcburney
LinkedIn:Vincent McBurney LinkedIn