Re: How to get a distinct count of result set of multople table joins?



On Feb 27, 10:03 pm, Charles Hooper <hooperc2...@xxxxxxxxx> wrote:
On Feb 27, 5:43 pm, "Chris ( Val )" <chris...@xxxxxxxxx> wrote:





On Feb 28, 3:27 am, Charles Hooper <hooperc2...@xxxxxxxxx> wrote:
I suspect that Oracle is taking a long time to remove the duplicate
rows due to a combination of the number of rows, the number of columns
(you are specifying to retrieve all columns from the tables), and the
amount of memory available for sorting unique (or hash unique) the
rows to produce a unique list of rows.  Are you able to better define
the columns that are interest, and eliminate any columns that are
common between the various tables.  

Unfortunatley I did not write the SQL, and I don't understand the
data requirements enough to manipulate it. I can ask for it to be
changed, but that can take ages to happen :)

I have found in some cases on
Oracle 10.2.0.2 that listing all columns in a GROUP BY allows Oracle
to retrieve a distinct list of rows more quickly than by using the
DISTINCT syntax.

In the second of your examples, I do not see where you are using the
UNION syntax - are you trying to do something different in that
example?  

Sorry, I meant that I tried to use the same query as I just posted to
Mark,
but with the COUNT statement added as in my first post.

I am having trouble getting the right count returned.

Explain plan, or better yet a DBMS Xplan showing the predicted and
actual results would be helpful, as requested by Ed and Mark.  If the
explain plan shows nothing useful, turn on a 10046 trace at level 8
and see what is happening behind the scenes.

Even though this is a DEV database, I am not allowed to go to those
extreams.

My job is to automate a process of extracting data, of which this SQL
was handed
to me to use in my script.

Thanks for your help.

--
Chris

Looking at the plan, 13 tables were full table scanned producing an
estimated 19,222,632 bytes in 24,271 rows, using 12 hash joins for
each half of the UNION when the duplicates are removed.  You indicate
that Oracle is actually returning 700,000 rows before the unique count
is applied.  This seems to imply that Oracle is actually dealing with
roughly 554,400,000 bytes for each half of the UNION when the
duplicates are removed.  Your SQL statement is likely making
significant use of the TEMP tablespace.  You may be able to improve
performance by increasing the memory available for sorting and hash
joins.  Assuming that WORKAREA_SIZE_POLICY is set to AUTO, you might
be able to improve performance a little by having the DBA increase the
PGA_AGGREGATE_TARGET.

To see if the above estimates are close, you might try running
DBMS_XPLAN for the query.  See the following link for an example:
 http://jonathanlewis.wordpress.com/2006/11/09/dbms_xplan-in-10g/

I think that I now understand what you originally posted.  The two
queries should be returning the same number of rows before the count,
since the UNION syntax removes duplicate rows automatically.  Note
the:
  "SORT (UNIQUE)" in your execution plan.

Here is a test case that demonstates the performance improvement
suggestions that I made earlier:
CREATE TABLE T1 (
  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

CREATE TABLE T2 (
  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

CREATE TABLE T3 (
  C1 VARCHAR2(30),
  C2 VARCHAR2(30),
  C3 VARCHAR2(30));

INSERT INTO
  T1
SELECT
  TO_CHAR(SIN(ROWNUM*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=400000;

INSERT INTO
  T2
SELECT
  TO_CHAR(SIN(ROWNUM*4*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*4*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*4*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=100000;

INSERT INTO
  T3
SELECT
  TO_CHAR(SIN(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
  TO_CHAR(COS(ROWNUM*7.2*3.1415/180)*10000,'0000000000'),
  TO_CHAR(TAN(ROWNUM*7.2*3.1415/180)*10000,'0000000000')
FROM
  DUAL
CONNECT BY
  LEVEL<=70000;

COMMIT;

EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T1',CASCADE=>TRUE);

EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T2',CASCADE=>TRUE);

EXEC
DBMS_STATS.GATHER_TABLE_STATS(OWNNAME=>USER,TABNAME=>'T3',CASCADE=>TRUE);

The first test is similar to your first query:
SELECT
  COUNT(*)
FROM
(SELECT
  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+)
UNION
SELECT
  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+));

  COUNT(*)
----------
   2994198

The DBMS Xplan (watch for wordwrapping):
---------------------------------------------------------------------------­-------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows | A-Rows
|   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
Used-Tmp|
---------------------------------------------------------------------------­-------------------------------------------------------------------------
|   1 |  SORT AGGREGATE            |      |      1 |      1 |      1 |
00:03:16.27 |    5392 |  73199 |  73199 |       |       |
|         |
|   2 |   VIEW                     |      |      1 |   2806K|   2994K|
00:03:09.16 |    5392 |  73199 |  73199 |       |       |
|         |
|   3 |    SORT UNIQUE             |      |      1 |   2806K|   2994K|
00:03:03.17 |    5392 |  73199 |  73199 |   269M|  8322K|   37M
(1)|     572K|
|   4 |     UNION-ALL              |      |      1 |        |   7089K|
00:01:32.59 |    5342 |      0 |      0 |       |       |
|         |
|*  5 |      HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  178K
(0)|         |
|   6 |       TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
00:00:00.01 |       3 |      0 |      0 |       |       |
|         |
|*  7 |       HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7035K
(0)|         |
|   8 |        TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
00:00:00.07 |     402 |      0 |      0 |       |       |
|         |
|   9 |        TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
00:00:00.40 |    2266 |      0 |      0 |       |       |
|         |
|* 10 |      HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  207K
(0)|         |
|  11 |       TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
00:00:00.01 |       3 |      0 |      0 |       |       |
|         |
|* 12 |       HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7049K
(0)|         |
|  13 |        TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
00:00:00.07 |     402 |      0 |      0 |       |       |
|         |
|  14 |        TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
00:00:00.40 |    2266 |      0 |      0 |       |       |
|         |
---------------------------------------------------------------------------­-------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   5 - access("T1"."C1"="T2"."C1")
   7 - access("T1"."C1"="T3"."C1")
  10 - access("T1"."C1"="T2"."C1")
  12 - access("T1"."C1"="T3"."C1")

The above took 3 minutes and 16 seconds on Oracle 11.1.0.6

Sample 2 returns the same result, just significantly faster:
SELECT
  COUNT(*)
FROM
(SELECT DISTINCT
  *
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND T1.C1=T3.C1(+));

  COUNT(*)
----------
   2994198

---------------------------------------------------------------------------­------------------------------------------------------------------------
| Id  | Operation                 | Name | Starts | E-Rows | A-Rows
|   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem |
Used-Tmp|
---------------------------------------------------------------------------­------------------------------------------------------------------------
|   1 |  SORT AGGREGATE           |      |      1 |      1 |      1 |
00:01:18.14 |    2671 |  40365 |  40365 |       |       |
|         |
|   2 |   VIEW                    |      |      1 |   1403K|   2994K|
00:01:11.67 |    2671 |  40365 |  40365 |       |       |
|         |
|   3 |    HASH UNIQUE            |      |      1 |   1403K|   2994K|
00:01:05.68 |    2671 |  40365 |  40365 |   351M|    15M|   46M
(1)|     337K|
|*  4 |     HASH JOIN RIGHT OUTER |      |      1 |   1403K|   3544K|
00:00:28.78 |    2671 |      0 |      0 |  1102K|  1102K|  181K
(0)|         |
|   5 |      TABLE ACCESS FULL    | T2   |      1 |      1 |      0 |
00:00:00.01 |       3 |      0 |      0 |       |       |
|         |
|*  6 |      HASH JOIN RIGHT OUTER|      |      1 |   1403K|   3544K|
00:00:07.51 |    2668 |      0 |      0 |  4768K|  2062K| 7049K
(0)|         |
|   7 |       TABLE ACCESS FULL   | T3   |      1 |  70000 |  70000 |
00:00:00.07 |     402 |      0 |      0 |       |       |
|         |
|   8 |       TABLE ACCESS FULL   | T1   |      1 |    400K|    400K|
00:00:00.40 |    2266 |      0 |      0 |       |       |
|         |
---------------------------------------------------------------------------­------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("T1"."C1"="T2"."C1")
   6 - access("T1"."C1"="T3"."C1")

The third example is one of the performance enhancements that I
suggested, using GROUP BY, rather than DISTINCT:
SELECT
  COUNT(*)
FROM
(SELECT
  T1.C1,
  T1.C2,
  T1.C3,
  T2.C1,
  T2.C2,
  T2.C3,
  T3.C1,
  T3.C2,
  T3.C3
FROM
  T1,
  T2,
  T3
WHERE
  T1.C1=T2.C1(+)
  AND
...

read more »- Hide quoted text -

- Show quoted text -

I was under the impression that performance is not an issue but rather
it is the fact that the two queries in the initial post are not
returning the same result set.

The base query is an outer join. Since an outer join says return all
rows from the base table plus the matching data from the joined table
or null values when there is not matching row, duplicate output would
potentially be expected but duplicate output could also indicate that
a required join condition has not been specified.

The join should be exaimined to be sure all join conditions have been
properly specified. The query should be checked to make sure it is
the right query for the desired output.

Then we are back to if

select query
union
select same_query

and

select count(*) from ( select distinct query)

should produce the same result set counts.

Note that getting the same results set does potentially depend on none
of the base tables in the join being updated during the time the two
queries are ran else the view of the data will be different.

-- Mark D Powell --



.



Relevant Pages


Loading