Page 1 of 1

where not exists implementation

Posted: Mon Jul 26, 2010 12:43 pm
by kavuri
Hi,
I am working to implement a SQL scenario in datastage.

select a.fld1,fld2,fld3 from TableA a
where not exists
(select b.fld1,fld2,fld3 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)
and a.fld2 in (select fld2 from TableB)

My tables are some what huge (A table is about 50million records and b table is about 500,000 records).

I want to implement this with good performance. I know I can use lookup and get the rejects. Can anyone explain me a good design?

Thanks
Kavuri

Re: where not exists implementation

Posted: Mon Jul 26, 2010 1:25 pm
by kwwilliams
A good design would look at the totality of the work you need to accomplish and design for the whole set of work instead of just a part of it. Is there anything else that this design is going to accomplish?

Re: where not exists implementation

Posted: Mon Jul 26, 2010 1:27 pm
by kavuri
I wanted to see this piece alone...

Re: where not exists implementation

Posted: Mon Jul 26, 2010 1:33 pm
by kwwilliams
You have SQL that works. Lookup with reject link. Left outer join filter for null fields only. One will fit with you overall design hopefully. Good luck with designing one piece at a time.

Re: where not exists implementation

Posted: Mon Jul 26, 2010 2:46 pm
by creatingfusion
select a.fld1,fld2,fld3 from TableA a
where not exists
(select b.fld1,fld2,fld3 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)

For large number of records as like in Datawarehouses it is advised to create staging table containing only the key fields and then perform yours SQL query on that and get the output result. further perform a look up to fetch all data from the table.

In yours case create StgTableA with fields fld1 and fld2 for the larger table. Define indexes on the tables appropriately and then perform lookup with TableB to fetch only the records not existing in tableB. Dump the records into dataset or another staging table. This being yours key fields of yours output and can be used further to get all other fields of the table using look-up.
The staging tables would be full refresh tables and gets truncated as soon as its use is no more.
Although this being complex process but for large data volume it takes less buffer space in query execution and even fetches the output faster.

Posted: Mon Jul 26, 2010 10:46 pm
by udayk_2007
As the table you are using is huge in size,one way could be to create a look fileset first from the table, with including only the required columns and then using this lookup fileset in the lookup operation.

Regards
Ulhas

Posted: Tue Jul 27, 2010 2:02 am
by Sainath.Srinivasan
Did you try running your SQL in your database? For me, it does not appear performance efficient.

Main reason being your are selecting 3 columns for not-exists whereas you are looking for a 2 key combination.

If I were in your place, I will start by looking into possible matches by field2. If this is less than 20% (something small in matches and distinct values), I will include this as a lookup first and then proceed to match by 2 columns for rejects.

Also I will look into vertical pivot of possible field2 values into a single MV attribute with field1 as key and do both check in a single lookup. Whichever way, lookup for 500k with just 2 fields must not be a huge load.

Btw, why not include this query in your source and extract relevant records ?

Posted: Tue Jul 27, 2010 6:59 am
by kwwilliams
Main reason being your are selecting 3 columns for not-exists whereas you are looking for a 2 key c

His query should read

select a.fld1,fld2,fld3 from TableA a
where not exists
(select 1 from TableB b
where b.fld1=a.fld1
and b.fld2=a.fld2
)
and a.fld2 in (select fld2 from TableB)

exists clause in sql doesn't use the data returned so you are correct this is not very effecient. My point to this excercise is that the poster is not willing to describe the entire job flow so this is an academic discussion about how to find results without knowing what else needs to be done with them. Its the understanding of the full requirement that will lead to the correct job design. There is no best way to do this type of work that will fit every situation. There is a most appropriate way to do the work, that best fits the overall need.

Posted: Tue Jul 27, 2010 7:50 am
by kavuri
Hi,
There is a huge SQL which i need to convert this into a DataStage job. Here below is the code. I may not able to give the exact code because of privacy limitations with the client. But I had converted the code as is.

SELECT B.col1, B.col2, B.col3, B.col4, B.col5
FROM
(SELECT e.*
FROM tablee e
,(SELECT DISTINCT t2.col3, t2.col2, t2.col6, t2.col7, t2.col8t, t2.col9
FROM tablea a
,tablet1 t1
,tablet2 t2
WHERE a.col3 = t1.col3
AND t1.col3 = t2.col3
AND a.col2 = t1.col2
AND t1.col8 = t2.col8) t
WHERE e.col3 = t.col3
AND e.col2 = t.col2
AND e.col10 = 6
AND e.col1 <> 188
AND e.col1 <> 28) B
WHERE NOT EXISTS
(SELECT A.col1, A.col2, A.col3, A.col4, A.col5
FROM tableA A
WHERE B.col3 = A.col3
AND B.col1 = A.col1
AND B.col4 = A.col4
AND B.col2 = A.col2)
AND B.col4 in (SELECT A.col4 FROM tableA A)
AND B.col3 in (SELECT A.col3 FROM tableA A)
AND B.col1 in
(SELECT col1
FROM tablex
WHERE col3 IN (SELECT col3 FROM tabley)
AND col1 = col12
AND col13 = 17
AND col14 = 1)

By the way I want to iterate my Database is Netezza. So i dont have any concept of indexes. I know if I run this SQL is Netezza it will run better than DataStage (If the distribution si proper), but thats not in my hand.

Even there are 2 more unions and a left outer join outside this query whcih i can handle. When it comes to design, everyone will have their own ideas and I am sure I can understand all of them to generate the optimim results.

Thanks for your suggestions.
have a great day
Kavuri

Posted: Tue Jul 27, 2010 8:13 am
by Sainath.Srinivasan
kavuri wrote: WHERE NOT EXISTS
(SELECT A.col1, A.col2, A.col3, A.col4, A.col5
FROM tableA A
WHERE B.col3 = A.col3
AND B.col1 = A.col1
AND B.col4 = A.col4
AND B.col2 = A.col2)
AND B.col4 in (SELECT A.col4 FROM tableA A)
AND B.col3 in (SELECT A.col3 FROM tableA A)
AND
Why would anybody have a subquery doing the same condition as the main query and compare the values !?

I know it is a deformed copy of existing SQL, but it must not be so disturbingly difficult for others to understand.

Best will be to write down the logic in steps and start from there.