Look for available number - routine question

kudaka · Post by **kudaka** » Thu Oct 07, 2004 6:30 pm

This may be crazy. But, I am trying to solve the following problem.
This is what I need to do:
1. I will call a routine from DataStage passing a number (int1)
2. The routine has to open a seq/hash file consisting of numbers.
3. check if "int1" is in this file.
4. If not found, return "int1"
5. If found, add 1 to "int1" search the file again
6. Repeat step 5 until not found then return "int1"

The seq/hash file is very big (14mil) consisting of one column of numbers.
Don't ask me why do I want to do this. It is customer's requirement.

What is the approach to do this?

Thanks.
Kud

kduke · Post by **kduke** » Thu Oct 07, 2004 6:54 pm

Put it in a hash file. It would run lots faster than looping through a sequential file over and over.

vmcburney · Post by **vmcburney** » Thu Oct 07, 2004 7:07 pm

A hash file could be difficult to create as it sounds like you have one non-unique column of numbers. Writing to a hash file with that column as a key removes duplicates, writing it without a key takes away the lookup speed of hashing.

How are you using this routine? Are you calling it from DataStage jobs?

If you wanted to turn this into a derivation routine to be called from a transformer I would pre-build a hash file with colunn 1 - the lookup number (a key field), column 2 - count (the number of instances of that number). The hash file would be created by sending the file through an aggregation stage into a hash file. You could then do a lookup for the number and get back either a NULL (number does not exist) or the number of times it exists.

If it is a standalone routine, for example run from a sequence job, I would consider a Unix awk script that did the search and count or a DataStage server job that searched for the number and kept a count in a transformer and outputted those rows that matched that number. I've found processing a large text file via a server job is faster then the routine file processing commands as it has better row buffering and memory management.

kudaka · Post by **kudaka** » Thu Oct 07, 2004 7:30 pm

The file consists of one column of unique numbers.

The routine has to return a number that is not in the file.

vmcburney · Post by **vmcburney** » Thu Oct 07, 2004 8:32 pm

Then a hash file would be perfect and step 5 does not make any sense.

5. If found, add 1 to "int1" search the file again

kcbland · Post by **kcbland** » Thu Oct 07, 2004 8:37 pm

It sounds like what you are doing is "filling in the gaps" in a surrogate key assignment sequence. Methinks you have a table who's primary key has reached it's maximum assignment value, so you have to find the unused numbers.

If I'm right, you have quite a good little problem here.

I'd turn the problem around. How about you build a sorted list of AVAILABLE numbers. Therefore, all you do is keep fetching the next available number off a stack. This is very easy to do.

Now, if your table has the keys assigned, you need to pick out the unused numbers. There's a lot of ways to do this with SQL. For your next available assignment function, all you need to do is write a DS BASIC function to read the text file of available numbers into a COMMON memory variable array and a indexing variable on initialization. Then, simply keep returning the next available number incrementing the indexing variable with each call.

ray.wurlod · Post by **ray.wurlod** » Thu Oct 07, 2004 9:00 pm

I'm with Ken. Alternatively, re-educate your customer.

Surrogate keys (assuming that's what you're doing) are SOLELY about uniqueness; it doesn't matter if there are gaps.

At the very least have them justify their requirement. Be prepared to tell them that the requirement is nonsensical. Or stupid, if you're passionate about it. Being prepared to do this has won as many deals as it has lost.

kudaka · Post by **kudaka** » Fri Oct 08, 2004 7:18 am

kcbland, You are correct. I am trying to find the unused numbers.
The system is AS400, we are using Oracle plug-in to bridge to it and read AS400 files.
What is the method to prepare a list of unused numbers?
As ray.wurlod said this does not make any sense to handle this problem in datastage, but customer does not want to do any enhancements to AS400.(Also they do not have people to support it). Soon this system will go away.
Please help me building a list of unused numbers.
Thanks.

kcbland · Post by **kcbland** » Fri Oct 08, 2004 8:56 am

You can do this many ways. One method may be to select all of the primary keys in use in the table and write it to a hash file. Then, use a pre-generated file containing one row of all numbers from 1 to the max in your table. Use a DataStage job to read that file of numbers, referencing the hash file, and writing output to another file of all numbers not in the hash file. This output file would be your file of unused number from which your function should assign.

To optimize this process, you will have to periodicaly recreate your sequential series file from some number other than 1 so that you don't constantly rescan from 1 when you've already filled in gaps up to 1000000. You're going to have to think about this.

Another method would be to create a table of that sequential series file of numbers from 1 to the maximum allowed on the table. Now do an outer join to this table of your target, and constrain the query to return all rows not in the target table and spool this to your surrogate key unused number file.

I'm sure others could contribute some thoughts as well.

kudaka · Post by **kudaka** » Fri Oct 08, 2004 10:58 am

Creating pre-generated file containing numbers in a column from 1 to the max number has become a challenge for me.
Any ideas? Can we make a simple sql stmt that return 1 thru max number?
Or any shell script?
Please share any ideas.
Thanks.

kcbland · Post by **kcbland** » Fri Oct 08, 2004 11:03 am

Sure, just find a big table somewhere and select the rownumber from it. In Oracle, I'd do:

Code: Select all

select rownum from bigtable where rownum <= 10000000

This will return a list of numbers from 1 to 10000000.

You could actually get the list of unused numbers in a SQL statement like this:

Code: Select all

select rownum from bigtable where rownum not in (select key from target)

Any big table will do, as long as it has more rows in it than what's allowed in your target table.

kudaka · Post by **kudaka** » Fri Oct 08, 2004 11:06 am

I came up with this script. Is it good?
number=1
while [ $number -le 100 ]
do
echo $number >> a.txt
number=`expr $number + 1`
done

Thanks

kcbland · Post by **kcbland** » Fri Oct 08, 2004 11:08 am

Sure but it's slow.

kudaka · Post by **kudaka** » Fri Oct 08, 2004 12:25 pm

Yes it is vey slow. On the other hand we do not have any big table.

If I am success in building unused number file, how to fetch each number to assign it to my record? How to build stack mechanism in to data stage?
Looks like I am lost.

kcbland · Post by **kcbland** » Fri Oct 08, 2004 12:34 pm

Okay, you don't have a big table. How about a medium table? This creates a virtual big table:

Code: Select all

select rownum from
   (select rownum from mediumtable
    union all
    select rownum from mediumtable
    union all
    select rownum from mediumtable
    union all
    select rownum from mediumtable
   )
where rownum not in (select key from target)