Unique keys stored in hashed file less than select

clarcombe · Post by **clarcombe** » Fri Aug 17, 2007 8:35 am

I have created a hashed file from a union of 3 select queries i.e. SELECT A UNION SELECT B UNION SELECT C

The union of these three selects returns 3380694 rows.

However, the hashed file only contains 3380652 rows.

So I checked to see if the keys really were unique for the hashed file

So I did a

Code: Select all

Select <KEYS> from (SELECT KEYS FROM A UNION SELECT KEYS FROM B UNION SELECT KEYS FROM C) GROUP BY KEYS HAVING COUNT(*) > 1

but this returned no rows.

None of the key fields are nullable and they consist of Varchars and Decimals.

From the Director Monitor I can see 3380694 out of the Xfm to HF and 3380652 from HF to XF.

Question
Does anyone have any idea how to find out where the problem lies with this ?

The hashed file is only being used to separate two xfm. Should I use a flat file instead ?

Thanks

chulett · Post by **chulett** » Fri Aug 17, 2007 8:40 am

You could try splitting off to a flat file before the hashed file and then another afterwards, then you could compare the two and see who is 'missing'. That would probably go a long way towards figuring out the problem.

If I may, why do you 'need' to 'separate' two transformers?

clarcombe · Post by **clarcombe** » Fri Aug 17, 2007 8:45 am

You could try splitting off to a flat file before the hashed file and then another afterwards, then you could compare the two and see who is 'missing'. That would probably go a long way towards figuring out the problem.

True, however the job takes over 5 hours to run. I want to try and avoid rerunning at this point and see if another non-run idea comes up

Code: Select all

If I may, why do you 'need' to 'separate' two transformers?

Good question.
When I run the job without an intermediate stage between the two xfms it hangs.

I have only added hashed files as necessary (well to make the thing run anyway)

chulett · Post by **chulett** » Fri Aug 17, 2007 8:51 am

If don't recall ever needing to add a process break-point between two transformers to solve a 'job hang' before... did you try just adding an IPC stage instead?

As a tangential discussion, I'd be curious what your full job design is to see if we can get a better idea why doing that would solve your problem. But that can wait until after your hashed file question gets answered.

Abu@0403 · Post by **Abu@0403** » Sun Aug 19, 2007 9:38 pm

One of the option is to use UNION ALL to check in the query. If the count matches with that of UNION then there is a problem. If not then there are duplicates in keys.

For performance no two transformers should be continuous. It will decrease the performance and sometimes the job even aborts if the data involved is too high. I too would prefer to use IPC stage in between the transformers.

clarcombe · Post by **clarcombe** » Tue Aug 21, 2007 1:20 am

I have found a solution.

I knew I could be sure that Datastage was correctly deleting the duplicates in the hashed file. This meant that my SQL query was not working correctly.

I broke the query down into smaller queries in the following manner.

Select * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.COLUMNS = B.COLUMNS)

and did this for all of the tables i.e. A in B, A in C, C in B etc and found my missing rows.

What I don't understand though is why the query in the first post failed to identify the duplicate rows. Ah well, the mysteries of SQL.

ray.wurlod · Post by **ray.wurlod** » Tue Aug 21, 2007 1:59 am

The query in the first post needed two UNION ALL operators in the inner query. UNION alone removes duplicates.

clarcombe · Post by **clarcombe** » Tue Aug 21, 2007 2:10 am

Thanks Ray. Its always good to receive enlightenment

ray.wurlod · Post by **ray.wurlod** » Tue Aug 21, 2007 2:33 am

You're welcome.