First occurence out of a group of records.

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
alvarez-m
Participant
Posts: 13
Joined: Tue Nov 16, 2004 3:12 pm

First occurence out of a group of records.

Post by alvarez-m »

Hi,

How can I get the first occurence out of a group of records?

For instance, I would like to get only the first record where ID1 = 100 and ID2 = 110 (I would get the record with code XYZ, ignore the next two records, get the fourth record, and so on....

ID1 ID2 CODE DESC
----------------------------------
100 110 XYZ TEXT1
100 110 ABC TEXT2
100 110 MNO TEXT3
200 120 EFG TXT1
300 150 OPQ TXT2
......


Thanks,
Max.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Lots of ways. You have to give more information, like are you processing thousands or millions of rows. Is your data sorted already, or are you sorting it. Is it coming to you in a file or a is it in a table.

You can see that we can go different routes on solutions: sorting, aggregating, SQL, unix commands, etc.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
alvarez-m
Participant
Posts: 13
Joined: Tue Nov 16, 2004 3:12 pm

Post by alvarez-m »

The oracle table has less than 5,000 rows. The table is already in order (ID1 and ID2 ascending). My preference would be to utilize data stage capabilities or SQL rather than OS commands.
Thanks.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

First of all, tables do not store data ordered.

5000 rows means you can do it anyway you want. But, you will need the data ordered EXPLICITLY using SQL. There is no debate on this matter.

Per your requirement NOT to use SQL to give you the first row in each ordered group, your pure DS solution NOT using any unix command assistance, leaves you with a couple of design options.

1. Write a job that SQL selects w/ ORDER BY and write each row to a hash file using ID1 and ID2 as the primary key. In the transformer stage that writes to the hash file, have a separate reference to that hash file (must be a different hash file stage icon to do this). If the lookup "sees" the row, then you already have gotten the first row and need to throw away that row, otherwise insert it into the hash file. You will have to disable all read and write caching, as well as pre-create the hash file.

2. Write a job that SQL selects w/ ORDER BY and write each row to a aggregator stage using ID1 and ID2 as the grouping keys and use the FIRST aggregator function on all remaining attribute columns. Indicate that the data is sorted by the ID1 and ID2 columns.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
alvarez-m
Participant
Posts: 13
Joined: Tue Nov 16, 2004 3:12 pm

Post by alvarez-m »

Perhaps in database theory tables do not store data ordered. However, as a former dba I can tell you that in real life there are ways to have your records ordered in a table: at loading time, primary keys, indexes.

Thanks for your help.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Yes, you can place data into a table in an ordered method. But retrieving date without explicit ordering does not guarantee it is returned to you in the same order entered. I mispoke, you CAN accelerate loading performance by having data already ordered and partitioned in order to direct the data more quickly into a table.

If you have a partitioned database such as DB2 or Informix, or have partitioned tables such as Oracle, then the parallel query paradigm allows for data order to be indeterminate because of the vast variables in play. Simply using a hash partitioned table and loading in an ordered sequence does not mean that a "select * from table" will give you the data back in the same order it went in. When you make the query more complex with joins, to can lose the ordering.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

Hello AlvarezM,

now that the Database ordered stuff is out of the way 8) I'd like to chime in. With a simple group change of the type you've shown I would use a transformer stage with 2 variables that look something like this:

Code: Select all

Variable "GroupChange", set to "IF In.ID1=LastID THEN 0 ELSE 1
Variable "LastID", set to "In.ID1"
( Please note the order that you declare these *IS* important AND your SQL must order the data so that it comes into the stage as sorted on the ID1 and ID2)

Then just use the Variable "GroupChange" in your output constraint, i.e.
"NOT(GroupChange)" in the constraint means that the current row has the same ID1 as the previous, and can be ignored.
alvarez-m
Participant
Posts: 13
Joined: Tue Nov 16, 2004 3:12 pm

Post by alvarez-m »

Thank you folks. I have learned something new from your postings.
I have decided to use the group change technique that is simple and effective.
clshore
Charter Member
Charter Member
Posts: 115
Joined: Tue Oct 21, 2003 11:45 am

Post by clshore »

Do as Ken suggested, writing the query output into a hash, but reverse the ORDER by clause so the the LAST record written into the hash is the FIRST one that you want. When the query completes, what remains in the hash will be all of the FIRST records.

Carter
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany
Contact:

Post by ArndW »

clshore - wow; I wouldn't have thought of doing it that way. I like it!! a couple of extra WRITEs to the hash, but it makes the procedure bulletproof. I'm going to file your suggestion away and I can guarantee I'll use it sometime soon. Thanks!

Alvarez-M; I think you should use clshore's approach; it is straightforward, clear & simple. Much more approachable than my suggestion, which will work but might be classified as "tricky" and it will also be difficult for outsiders to understand.

-Arnd.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Just in case it matters, Carters solution is simple and elegant, but you have to deal with the fact that the data in the hash file is now effectively randomized. If you need to preserve the original selection order, but only take the first of each group, then you MUST resort the data coming out of the hash file. Since the poster did not want to use unix, the sort command is ruled out after dumping the hash file to a text file. The next choice is to use a VOC entry qualified hash file (either create it in the project or use SETFILE to register the VOC entry) and trick the SELECT tab in the Hash file tab to let you BY ID1, ID2 or switch to a UV/ODBC stage to run SQL.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
alvarez-m
Participant
Posts: 13
Joined: Tue Nov 16, 2004 3:12 pm

Post by alvarez-m »

In my case, I am bulk loading the output to an Oracle table. It wouldn't matter to reverse the "order by clause" because the table has relatively few records. However, notice that usually, there is an index created for ID columns and such index is usually in Ascending order, therefore by reversing the "order by clause" we would not be taking advantage of the index for performance. If that is what you mean by reversing the "order by clause".
Post Reply