First occurence out of a group of records.
Moderators: chulett, rschirm, roy
First occurence out of a group of records.
Hi,
How can I get the first occurence out of a group of records?
For instance, I would like to get only the first record where ID1 = 100 and ID2 = 110 (I would get the record with code XYZ, ignore the next two records, get the fourth record, and so on....
ID1 ID2 CODE DESC
----------------------------------
100 110 XYZ TEXT1
100 110 ABC TEXT2
100 110 MNO TEXT3
200 120 EFG TXT1
300 150 OPQ TXT2
......
Thanks,
Max.
How can I get the first occurence out of a group of records?
For instance, I would like to get only the first record where ID1 = 100 and ID2 = 110 (I would get the record with code XYZ, ignore the next two records, get the fourth record, and so on....
ID1 ID2 CODE DESC
----------------------------------
100 110 XYZ TEXT1
100 110 ABC TEXT2
100 110 MNO TEXT3
200 120 EFG TXT1
300 150 OPQ TXT2
......
Thanks,
Max.
Lots of ways. You have to give more information, like are you processing thousands or millions of rows. Is your data sorted already, or are you sorting it. Is it coming to you in a file or a is it in a table.
You can see that we can go different routes on solutions: sorting, aggregating, SQL, unix commands, etc.
You can see that we can go different routes on solutions: sorting, aggregating, SQL, unix commands, etc.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
First of all, tables do not store data ordered.
5000 rows means you can do it anyway you want. But, you will need the data ordered EXPLICITLY using SQL. There is no debate on this matter.
Per your requirement NOT to use SQL to give you the first row in each ordered group, your pure DS solution NOT using any unix command assistance, leaves you with a couple of design options.
1. Write a job that SQL selects w/ ORDER BY and write each row to a hash file using ID1 and ID2 as the primary key. In the transformer stage that writes to the hash file, have a separate reference to that hash file (must be a different hash file stage icon to do this). If the lookup "sees" the row, then you already have gotten the first row and need to throw away that row, otherwise insert it into the hash file. You will have to disable all read and write caching, as well as pre-create the hash file.
2. Write a job that SQL selects w/ ORDER BY and write each row to a aggregator stage using ID1 and ID2 as the grouping keys and use the FIRST aggregator function on all remaining attribute columns. Indicate that the data is sorted by the ID1 and ID2 columns.
5000 rows means you can do it anyway you want. But, you will need the data ordered EXPLICITLY using SQL. There is no debate on this matter.
Per your requirement NOT to use SQL to give you the first row in each ordered group, your pure DS solution NOT using any unix command assistance, leaves you with a couple of design options.
1. Write a job that SQL selects w/ ORDER BY and write each row to a hash file using ID1 and ID2 as the primary key. In the transformer stage that writes to the hash file, have a separate reference to that hash file (must be a different hash file stage icon to do this). If the lookup "sees" the row, then you already have gotten the first row and need to throw away that row, otherwise insert it into the hash file. You will have to disable all read and write caching, as well as pre-create the hash file.
2. Write a job that SQL selects w/ ORDER BY and write each row to a aggregator stage using ID1 and ID2 as the grouping keys and use the FIRST aggregator function on all remaining attribute columns. Indicate that the data is sorted by the ID1 and ID2 columns.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Yes, you can place data into a table in an ordered method. But retrieving date without explicit ordering does not guarantee it is returned to you in the same order entered. I mispoke, you CAN accelerate loading performance by having data already ordered and partitioned in order to direct the data more quickly into a table.
If you have a partitioned database such as DB2 or Informix, or have partitioned tables such as Oracle, then the parallel query paradigm allows for data order to be indeterminate because of the vast variables in play. Simply using a hash partitioned table and loading in an ordered sequence does not mean that a "select * from table" will give you the data back in the same order it went in. When you make the query more complex with joins, to can lose the ordering.
If you have a partitioned database such as DB2 or Informix, or have partitioned tables such as Oracle, then the parallel query paradigm allows for data order to be indeterminate because of the vast variables in play. Simply using a hash partitioned table and loading in an ordered sequence does not mean that a "select * from table" will give you the data back in the same order it went in. When you make the query more complex with joins, to can lose the ordering.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Hello AlvarezM,
now that the Database ordered stuff is out of the way
I'd like to chime in. With a simple group change of the type you've shown I would use a transformer stage with 2 variables that look something like this:
( Please note the order that you declare these *IS* important AND your SQL must order the data so that it comes into the stage as sorted on the ID1 and ID2)
Then just use the Variable "GroupChange" in your output constraint, i.e.
"NOT(GroupChange)" in the constraint means that the current row has the same ID1 as the previous, and can be ignored.
now that the Database ordered stuff is out of the way
![Cool 8)](./images/smilies/icon_cool.gif)
Code: Select all
Variable "GroupChange", set to "IF In.ID1=LastID THEN 0 ELSE 1
Variable "LastID", set to "In.ID1"
Then just use the Variable "GroupChange" in your output constraint, i.e.
"NOT(GroupChange)" in the constraint means that the current row has the same ID1 as the previous, and can be ignored.
clshore - wow; I wouldn't have thought of doing it that way. I like it!! a couple of extra WRITEs to the hash, but it makes the procedure bulletproof. I'm going to file your suggestion away and I can guarantee I'll use it sometime soon. Thanks!
Alvarez-M; I think you should use clshore's approach; it is straightforward, clear & simple. Much more approachable than my suggestion, which will work but might be classified as "tricky" and it will also be difficult for outsiders to understand.
-Arnd.
Alvarez-M; I think you should use clshore's approach; it is straightforward, clear & simple. Much more approachable than my suggestion, which will work but might be classified as "tricky" and it will also be difficult for outsiders to understand.
-Arnd.
Just in case it matters, Carters solution is simple and elegant, but you have to deal with the fact that the data in the hash file is now effectively randomized. If you need to preserve the original selection order, but only take the first of each group, then you MUST resort the data coming out of the hash file. Since the poster did not want to use unix, the sort command is ruled out after dumping the hash file to a text file. The next choice is to use a VOC entry qualified hash file (either create it in the project or use SETFILE to register the VOC entry) and trick the SELECT tab in the Hash file tab to let you BY ID1, ID2 or switch to a UV/ODBC stage to run SQL.
Kenneth Bland
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
In my case, I am bulk loading the output to an Oracle table. It wouldn't matter to reverse the "order by clause" because the table has relatively few records. However, notice that usually, there is an index created for ID columns and such index is usually in Ascending order, therefore by reversing the "order by clause" we would not be taking advantage of the index for performance. If that is what you mean by reversing the "order by clause".