First occurence out of a group of records.

alvarez-m · Post by **alvarez-m** » Wed Jan 05, 2005 9:51 am

Hi,

How can I get the first occurence out of a group of records?

For instance, I would like to get only the first record where ID1 = 100 and ID2 = 110 (I would get the record with code XYZ, ignore the next two records, get the fourth record, and so on....

ID1 ID2 CODE DESC
----------------------------------
100 110 XYZ TEXT1
100 110 ABC TEXT2
100 110 MNO TEXT3
200 120 EFG TXT1
300 150 OPQ TXT2
......

Thanks,
Max.

kcbland · Post by **kcbland** » Wed Jan 05, 2005 9:54 am

Lots of ways. You have to give more information, like are you processing thousands or millions of rows. Is your data sorted already, or are you sorting it. Is it coming to you in a file or a is it in a table.

You can see that we can go different routes on solutions: sorting, aggregating, SQL, unix commands, etc.

alvarez-m · Post by **alvarez-m** » Wed Jan 05, 2005 10:14 am

The oracle table has less than 5,000 rows. The table is already in order (ID1 and ID2 ascending). My preference would be to utilize data stage capabilities or SQL rather than OS commands.
Thanks.

kcbland · Post by **kcbland** » Wed Jan 05, 2005 10:25 am

First of all, tables do not store data ordered.

5000 rows means you can do it anyway you want. But, you will need the data ordered EXPLICITLY using SQL. There is no debate on this matter.

Per your requirement NOT to use SQL to give you the first row in each ordered group, your pure DS solution NOT using any unix command assistance, leaves you with a couple of design options.

1. Write a job that SQL selects w/ ORDER BY and write each row to a hash file using ID1 and ID2 as the primary key. In the transformer stage that writes to the hash file, have a separate reference to that hash file (must be a different hash file stage icon to do this). If the lookup "sees" the row, then you already have gotten the first row and need to throw away that row, otherwise insert it into the hash file. You will have to disable all read and write caching, as well as pre-create the hash file.

2. Write a job that SQL selects w/ ORDER BY and write each row to a aggregator stage using ID1 and ID2 as the grouping keys and use the FIRST aggregator function on all remaining attribute columns. Indicate that the data is sorted by the ID1 and ID2 columns.

alvarez-m · Post by **alvarez-m** » Wed Jan 05, 2005 10:59 am

Perhaps in database theory tables do not store data ordered. However, as a former dba I can tell you that in real life there are ways to have your records ordered in a table: at loading time, primary keys, indexes.

Thanks for your help.

kcbland · Post by **kcbland** » Wed Jan 05, 2005 11:20 am

Yes, you can place data into a table in an ordered method. But retrieving date without explicit ordering does not guarantee it is returned to you in the same order entered. I mispoke, you CAN accelerate loading performance by having data already ordered and partitioned in order to direct the data more quickly into a table.

If you have a partitioned database such as DB2 or Informix, or have partitioned tables such as Oracle, then the parallel query paradigm allows for data order to be indeterminate because of the vast variables in play. Simply using a hash partitioned table and loading in an ordered sequence does not mean that a "select * from table" will give you the data back in the same order it went in. When you make the query more complex with joins, to can lose the ordering.

ArndW · Post by **ArndW** » Wed Jan 05, 2005 12:19 pm

Hello AlvarezM,

now that the Database ordered stuff is out of the way

I'd like to chime in. With a simple group change of the type you've shown I would use a transformer stage with 2 variables that look something like this:

Code: Select all

Variable "GroupChange", set to "IF In.ID1=LastID THEN 0 ELSE 1
Variable "LastID", set to "In.ID1"

( Please note the order that you declare these *IS* important AND your SQL must order the data so that it comes into the stage as sorted on the ID1 and ID2)

Then just use the Variable "GroupChange" in your output constraint, i.e.
"NOT(GroupChange)" in the constraint means that the current row has the same ID1 as the previous, and can be ignored.

alvarez-m · Post by **alvarez-m** » Wed Jan 05, 2005 1:59 pm

Thank you folks. I have learned something new from your postings.
I have decided to use the group change technique that is simple and effective.

clshore · Post by **clshore** » Wed Jan 05, 2005 3:25 pm

Do as Ken suggested, writing the query output into a hash, but reverse the ORDER by clause so the the LAST record written into the hash is the FIRST one that you want. When the query completes, what remains in the hash will be all of the FIRST records.

Carter

ArndW · Post by **ArndW** » Wed Jan 05, 2005 3:29 pm

clshore - wow; I wouldn't have thought of doing it that way. I like it!! a couple of extra WRITEs to the hash, but it makes the procedure bulletproof. I'm going to file your suggestion away and I can guarantee I'll use it sometime soon. Thanks!

Alvarez-M; I think you should use clshore's approach; it is straightforward, clear & simple. Much more approachable than my suggestion, which will work but might be classified as "tricky" and it will also be difficult for outsiders to understand.

-Arnd.

kcbland · Post by **kcbland** » Wed Jan 05, 2005 3:41 pm

Just in case it matters, Carters solution is simple and elegant, but you have to deal with the fact that the data in the hash file is now effectively randomized. If you need to preserve the original selection order, but only take the first of each group, then you MUST resort the data coming out of the hash file. Since the poster did not want to use unix, the sort command is ruled out after dumping the hash file to a text file. The next choice is to use a VOC entry qualified hash file (either create it in the project or use SETFILE to register the VOC entry) and trick the SELECT tab in the Hash file tab to let you BY ID1, ID2 or switch to a UV/ODBC stage to run SQL.

alvarez-m · Post by **alvarez-m** » Wed Jan 05, 2005 4:05 pm

In my case, I am bulk loading the output to an Oracle table. It wouldn't matter to reverse the "order by clause" because the table has relatively few records. However, notice that usually, there is an index created for ID columns and such index is usually in Ascending order, therefore by reversing the "order by clause" we would not be taking advantage of the index for performance. If that is what you mean by reversing the "order by clause".