where not exists implementation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
kavuri
Premium Member
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm

where not exists implementation

Post by kavuri »

Hi,
I am working to implement a SQL scenario in datastage.

select a.fld1,fld2,fld3 from TableA a
where not exists
(select b.fld1,fld2,fld3 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)
and a.fld2 in (select fld2 from TableB)

My tables are some what huge (A table is about 50million records and b table is about 500,000 records).

I want to implement this with good performance. I know I can use lookup and get the rejects. Can anyone explain me a good design?

Thanks
Kavuri
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: where not exists implementation

Post by kwwilliams »

A good design would look at the totality of the work you need to accomplish and design for the whole set of work instead of just a part of it. Is there anything else that this design is going to accomplish?
kavuri
Premium Member
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm

Re: where not exists implementation

Post by kavuri »

I wanted to see this piece alone...
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Re: where not exists implementation

Post by kwwilliams »

You have SQL that works. Lookup with reject link. Left outer join filter for null fields only. One will fit with you overall design hopefully. Good luck with designing one piece at a time.
creatingfusion
Participant
Posts: 46
Joined: Tue Jul 20, 2010 1:26 pm
Location: USA
Contact:

Re: where not exists implementation

Post by creatingfusion »

select a.fld1,fld2,fld3 from TableA a
where not exists
(select b.fld1,fld2,fld3 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)

For large number of records as like in Datawarehouses it is advised to create staging table containing only the key fields and then perform yours SQL query on that and get the output result. further perform a look up to fetch all data from the table.

In yours case create StgTableA with fields fld1 and fld2 for the larger table. Define indexes on the tables appropriately and then perform lookup with TableB to fetch only the records not existing in tableB. Dump the records into dataset or another staging table. This being yours key fields of yours output and can be used further to get all other fields of the table using look-up.
The staging tables would be full refresh tables and gets truncated as soon as its use is no more.
Although this being complex process but for large data volume it takes less buffer space in query execution and even fetches the output faster.
Abhijit
IBM Certified Solution Developer Infosphere DataStage
udayk_2007
Participant
Posts: 72
Joined: Wed Dec 12, 2007 2:29 am

Post by udayk_2007 »

As the table you are using is huge in size,one way could be to create a look fileset first from the table, with including only the required columns and then using this lookup fileset in the lookup operation.

Regards
Ulhas
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Did you try running your SQL in your database? For me, it does not appear performance efficient.

Main reason being your are selecting 3 columns for not-exists whereas you are looking for a 2 key combination.

If I were in your place, I will start by looking into possible matches by field2. If this is less than 20% (something small in matches and distinct values), I will include this as a lookup first and then proceed to match by 2 columns for rejects.

Also I will look into vertical pivot of possible field2 values into a single MV attribute with field1 as key and do both check in a single lookup. Whichever way, lookup for 500k with just 2 fields must not be a huge load.

Btw, why not include this query in your source and extract relevant records ?
kwwilliams
Participant
Posts: 437
Joined: Fri Oct 21, 2005 10:00 pm

Post by kwwilliams »

Main reason being your are selecting 3 columns for not-exists whereas you are looking for a 2 key c

His query should read

select a.fld1,fld2,fld3 from TableA a
where not exists
(select 1 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)
and a.fld2 in (select fld2 from TableB)

exists clause in sql doesn't use the data returned so you are correct this is not very effecient. My point to this excercise is that the poster is not willing to describe the entire job flow so this is an academic discussion about how to find results without knowing what else needs to be done with them. Its the understanding of the full requirement that will lead to the correct job design. There is no best way to do this type of work that will fit every situation. There is a most appropriate way to do the work, that best fits the overall need.
kavuri
Premium Member
Premium Member
Posts: 161
Joined: Mon Apr 16, 2007 2:56 pm

Post by kavuri »

Hi,
There is a huge SQL which i need to convert this into a DataStage job. Here below is the code. I may not able to give the exact code because of privacy limitations with the client. But I had converted the code as is.

SELECT B.col1, B.col2, B.col3, B.col4, B.col5
FROM
(SELECT e.*
FROM tablee e
,(SELECT DISTINCT t2.col3, t2.col2, t2.col6, t2.col7, t2.col8t, t2.col9
FROM tablea a
,tablet1 t1
,tablet2 t2
WHERE a.col3 = t1.col3
AND t1.col3 = t2.col3
AND a.col2 = t1.col2
AND t1.col8 = t2.col8) t
WHERE e.col3 = t.col3
AND e.col2 = t.col2
AND e.col10 = 6
AND e.col1 <> 188
AND e.col1 <> 28) B
WHERE NOT EXISTS
(SELECT A.col1, A.col2, A.col3, A.col4, A.col5
FROM tableA A
WHERE B.col3 = A.col3
AND B.col1 = A.col1
AND B.col4 = A.col4
AND B.col2 = A.col2)
AND B.col4 in (SELECT A.col4 FROM tableA A)
AND B.col3 in (SELECT A.col3 FROM tableA A)
AND B.col1 in
(SELECT col1
FROM tablex
WHERE col3 IN (SELECT col3 FROM tabley)
AND col1 = col12
AND col13 = 17
AND col14 = 1)

By the way I want to iterate my Database is Netezza. So i dont have any concept of indexes. I know if I run this SQL is Netezza it will run better than DataStage (If the distribution si proper), but thats not in my hand.

Even there are 2 more unions and a left outer join outside this query whcih i can handle. When it comes to design, everyone will have their own ideas and I am sure I can understand all of them to generate the optimim results.

Thanks for your suggestions.
have a great day
Kavuri
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

kavuri wrote: WHERE NOT EXISTS
(SELECT A.col1, A.col2, A.col3, A.col4, A.col5
FROM tableA A
WHERE B.col3 = A.col3
AND B.col1 = A.col1
AND B.col4 = A.col4
AND B.col2 = A.col2)
AND B.col4 in (SELECT A.col4 FROM tableA A)
AND B.col3 in (SELECT A.col3 FROM tableA A)
AND
Why would anybody have a subquery doing the same condition as the main query and compare the values !?

I know it is a deformed copy of existing SQL, but it must not be so disturbingly difficult for others to understand.

Best will be to write down the logic in steps and start from there.
Post Reply