Unique keys stored in hashed file less than select

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Unique keys stored in hashed file less than select

Post by clarcombe »

I have created a hashed file from a union of 3 select queries i.e. SELECT A UNION SELECT B UNION SELECT C

The union of these three selects returns 3380694 rows.

However, the hashed file only contains 3380652 rows.

So I checked to see if the keys really were unique for the hashed file

So I did a

Code: Select all

Select <KEYS> from (SELECT KEYS FROM A UNION SELECT KEYS FROM B UNION SELECT KEYS FROM C) GROUP BY KEYS HAVING COUNT(*) > 1
but this returned no rows.

None of the key fields are nullable and they consist of Varchars and Decimals.

From the Director Monitor I can see 3380694 out of the Xfm to HF and 3380652 from HF to XF.

Question
Does anyone have any idea how to find out where the problem lies with this ?

The hashed file is only being used to separate two xfm. Should I use a flat file instead ?

Thanks
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You could try splitting off to a flat file before the hashed file and then another afterwards, then you could compare the two and see who is 'missing'. That would probably go a long way towards figuring out the problem.

If I may, why do you 'need' to 'separate' two transformers? :?
-craig

"You can never have too many knives" -- Logan Nine Fingers
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

You could try splitting off to a flat file before the hashed file and then another afterwards, then you could compare the two and see who is 'missing'. That would probably go a long way towards figuring out the problem.
True, however the job takes over 5 hours to run. I want to try and avoid rerunning at this point and see if another non-run idea comes up

Code: Select all

If I may, why do you 'need' to 'separate' two transformers?
Good question.
When I run the job without an intermediate stage between the two xfms it hangs.

I have only added hashed files as necessary (well to make the thing run anyway)
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

If don't recall ever needing to add a process break-point between two transformers to solve a 'job hang' before... did you try just adding an IPC stage instead? :?

As a tangential discussion, I'd be curious what your full job design is to see if we can get a better idea why doing that would solve your problem. But that can wait until after your hashed file question gets answered.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Abu@0403
Participant
Posts: 32
Joined: Wed Aug 08, 2007 11:21 pm

Post by Abu@0403 »

One of the option is to use UNION ALL to check in the query. If the count matches with that of UNION then there is a problem. If not then there are duplicates in keys.

For performance no two transformers should be continuous. It will decrease the performance and sometimes the job even aborts if the data involved is too high. I too would prefer to use IPC stage in between the transformers.
----------------
Abu
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

I have found a solution.

I knew I could be sure that Datastage was correctly deleting the duplicates in the hashed file. This meant that my SQL query was not working correctly.

I broke the query down into smaller queries in the following manner.

Select * FROM A WHERE EXISTS (SELECT 1 FROM B WHERE A.COLUMNS = B.COLUMNS)

and did this for all of the tables i.e. A in B, A in C, C in B etc and found my missing rows.

What I don't understand though is why the query in the first post failed to identify the duplicate rows. Ah well, the mysteries of SQL.
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The query in the first post needed two UNION ALL operators in the inner query. UNION alone removes duplicates.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
clarcombe
Premium Member
Premium Member
Posts: 515
Joined: Wed Jun 08, 2005 9:54 am
Location: Europe

Post by clarcombe »

Thanks Ray. Its always good to receive enlightenment :)
Colin Larcombe
-------------------

Certified IBM Infosphere Datastage Developer
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You're welcome.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply