Re: SQL: Working with huge tables of chronological data
- From: "charely" <nospam@xxxxxxxxx>
- Date: Sun, 8 Apr 2007 22:45:42 +0200
"Charles Hooper" <hooperc2000@xxxxxxxxx> schreef in bericht
news:1176044631.189804.86380@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Apr 8, 4:08 am, "charely" <nos...@xxxxxxxxx> wrote:
"John" <acide.ascorbi...@xxxxxxxxx> schreef in
berichtnews:1175788930.059725.19220@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Here are the technical details and the query I've been using so far.
TableA is ~100 millions row and contains (timestamp, evtA)
TableB is ~30 millions row and contains (timestamp, evtB)
The following query took ~60h (on a private but quite slow server) to
compute. ~1h is what I'm aiming to.
select TA1_B.evtA, TA2.evtA
from
(
select TA1.evtA, TA1.timestamp timeA1, TB.evtB, min(TB.timestamp)
min_timeB
from tableA TA1 left outer join tableB TB on (TA1.timestamp <
TB.timestamp)
group by TA1.evtA, TA1.timestamp, TB.evtB
) TA1_B,
tableA TA2
where
TA1_B.timeA1 < TA2.timestamp
and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
and TA1_B.evtA <> TA2.evtA;
Thanks!
John
What about
select ta1.timestamp , ta2.timestamp from tablea ta1 , tablea ta2
where ta2.timestamp > ta1.timestamp and ta2.timestamp <=
nvl((select min(timestamp) from tableb b where b.timestamp >
ta1.timestamp) ,
(select max(timestamp) from tablea))
The nvl function is only needed to also catch events where no later
event
in b exist.
I have not tested this for performance , but assuming you have indexes
on
the timestamp
columns ( or using IOTs for the tables) , the optimizer will probably
use
range scans on
those indexes.
I don't that that the SQL statement you provided will offer the data
that the OP was wanting. Assume the following:
TABLEA (T1)
05-APR-2007 00:07:00
05-APR-2007 00:10:00
05-APR-2007 00:12:00
05-APR-2007 00:15:00
05-APR-2007 00:17:00
05-APR-2007 00:20:00
TABLEB (T2)
05-APR-2007 00:09:00
05-APR-2007 00:18:00
The OP wanted to retrieve pairs of data from two rows of TABLEA (T1)
where a value from TABLEB (T2) does not fall between the pairs from
TABLEA (T1). In this case, we need to report all pairs of values
between the time values 05-APR-2007 00:10:00, 05-APR-2007 00:12:00, 05-
APR-2007 00:15:00, and 05-APR-2007 00:17:00 since those fall between
the two values from TABLEB (T2). That should yield the following
list:
00:10-00:12
00:12-00:15
00:10-00:15
00:15-00:17
00:12-00:17
00:10-00:17
Let's reformat the query that you provided so that it can use the
sample tables and indexes that I provided previously, in order to see
if it is an efficient starting point for the OP:
SELECT /*+ GATHER_PLAN_STATISTICS */
TA1.V1,
TA2.V1
FROM
T1 TA1,
T2 TA2
WHERE
TA2.V1 > TA1.V1
AND TA2.V1 <= NVL(
(SELECT
MIN(V1)
FROM
T2 B
WHERE
B.V1 > TA1.V1),
(SELECT
MAX(V1)
FROM
T1));
SELECT
*
FROM
TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));
| Id | Operation | Name | Starts | E-Rows | A-
Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|* 1 | FILTER | | 1 | |
1000 |00:00:00.47 | 79 | | | |
| 2 | MERGE JOIN | | 1 | 15000
| 167K|00:00:01.17 | 8 | | | |
| 3 | SORT JOIN | | 1 | 300
| 300 |00:00:00.01 | 1 | 11264 | 11264 |10240 (0)|
| 4 | INDEX FULL SCAN | T2_IND1 | 1 | 300
| 300 |00:00:00.01 | 1 | | | |
|* 5 | SORT JOIN | | 300 | 1000
| 167K|00:00:00.50 | 7 | 36864 | 36864 |32768 (0)|
| 6 | INDEX FAST FULL SCAN | T1_IND1 | 1 | 1000 |
1000 |00:00:00.01 | 7 | | | |
| 7 | SORT AGGREGATE | | 71341 | 1 |
71341 |00:00:02.09 | 69 | | | |
| 8 | FIRST ROW | | 71341 | 15 |
71341 |00:00:01.21 | 69 | | | |
|* 9 | INDEX RANGE SCAN (MIN/MAX) | T2_IND1 | 71341 | 15 |
71341 |00:00:00.47 | 69 | | | |
| 10 | SORT AGGREGATE | | 1 | 1
| 1 |00:00:00.01 | 2 | | | |
| 11 | INDEX FULL SCAN (MIN/MAX)| T1_IND1 | 1 | 1000
| 1 |00:00:00.01 | 2 | | | |
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TA2"."V1"<=NVL(,))
5 -
access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
9 - access("B"."V1">:B1)
Notice the Starts column in the above - that is the number of times
that portion of the plan was executed. Also note the significant
difference between estimated and actual rows, as well as what is
reported in the actual time column. Where did that bind variable in
the predicate information for ID 9 come from (Oracle optimization)?
Here is the plan for the final SQL statement that I provided, just for
comparison:
| Id | Operation | Name | Starts | E-
Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-
Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|* 1 | CONNECT BY WITHOUT FILTERING | | 1
| | 1652 |00:00:00.06 | 23 | 12 | |
| |
| 2 | VIEW | | 1 |
1035 | 1194 |00:00:00.07 | 23 | 12 | |
| |
| 3 | WINDOW NOSORT | | 1 |
1035 | 1194 |00:00:00.06 | 23 | 12 | 73728 | 73728
| |
| 4 | VIEW | | 1 |
1035 | 1194 |00:00:00.06 | 23 | 12 | |
| |
| 5 | WINDOW SORT | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | 43008 | 43008 |38912
(0)|
| 6 | VIEW | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | |
| |
| 7 | SORT ORDER BY | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | 57344 | 57344 |51200
(0)|
| 8 | VIEW | | 1 |
1035 | 1194 |00:00:00.07 | 23 | 12 | |
| |
| 9 | UNION-ALL | | 1
| | 1194 |00:00:00.06 | 23 | 12 | |
| |
|* 10 | HASH JOIN RIGHT OUTER | | 1 |
1000 | 1000 |00:00:00.05 | 14 | 12 | 898K| 898K| 1132K
(0)|
| 11 | TABLE ACCESS FULL | T2 | 1 |
300 | 300 |00:00:00.04 | 7 | 6 | |
| |
| 12 | TABLE ACCESS FULL | T1 | 1 |
1000 | 1000 |00:00:00.01 | 7 | 6 | |
| |
| 13 | MERGE JOIN ANTI | | 1 |
35 | 194 |00:00:00.01 | 9 | 0 | | |
|
| 14 | TABLE ACCESS BY INDEX ROWID| T2 | 1 |
300 | 300 |00:00:00.01 | 2 | 0 | |
| |
| 15 | INDEX FULL SCAN | T2_IND1 | 1 |
300 | 300 |00:00:00.01 | 1 | 0 | |
| |
|* 16 | SORT UNIQUE | | 300 |
1000 | 106 |00:00:00.01 | 7 | 0 | 36864 | 36864 |32768
(0)|
| 17 | INDEX FAST FULL SCAN | T1_IND1 | 1 |
1000 | 1000 |00:00:00.01 | 7 | 0 | |
| |
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("TS"+1=PRIOR NULL)
10 - access("T1"."V1"="T2"."V1")
16 - access("T1"."V1"="T2"."V1")
filter("T1"."V1"="T2"."V1")
If nothing else, this shows how efficient analytic functions can be
when compared to other methods.
Regarding Mladen Gogala suggestion to use LAG(V1,1), LAG(V1,2),
LAG(V1,3), ... LAG(V1, n) - I looked at using that method initially.
The problems that I encountered: #1 what should the value of n be so
that I do not miss any potential matches, #2 how do I create a new
result row for each of the matches, since all of the matches will be
returned on the same result row. In other words, you will only be
able to return one set of matches per result row when using LAG,
unless you can find some way to uncoil all of the resulting matches
into new result rows. That may be the reason that Jonathan Lewis
stated "the option using the analytic lag() would only give you the
adjacent pairs."
I would also suggest not retrieving all rows from the two tables
(TableA is ~100 millions rows, TableB is ~30 millions rows) and
processing the data client side, since you would then also need to
consider latency caused by network traffic when the 130 million rows
are retrieved from the database and sent to the client for processing.
Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc.
As far as I can tell, the provided query does match the OP's request,
but I do agree that the CBO has generated a poor execution plan.
I got a better plan with
select /*+USE_NL(ta2 ta1)*/ ta1.v1 , ta2.v1 from t1 ta1 , t1 ta2
where ta2.v1 > ta1.v1 and ta2.v1 <=
nvl((select min(v1) from t2 b where b.v1 > ta1.v1) ,
(select max(v1) from t1))
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost
(%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 25 | 650 | 7514K
(1)| 25:02:57 |
| 1 | NESTED LOOPS | | 249K| 6346K| 20258
(2)| 00:04:04 |
| 2 | TABLE ACCESS FULL | T1 | 9999 | 126K| 9
(0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | IX_T1 | 25 | 325 | 2
(0)| 00:00:01 |
| 4 | SORT AGGREGATE | | 1 | 13 |
| |
| 5 | FIRST ROW | | 500 | 6500 | 2
(0)| 00:00:01 |
|* 6 | INDEX RANGE SCAN (MIN/MAX) | IX_T2 | 500 | 6500 | 2
(0)| 00:00:01 |
| 7 | SORT AGGREGATE | | 1 | 13 |
| |
| 8 | INDEX FULL SCAN (MIN/MAX)| IX_T1 | 9999 | 126K|
| |
-----------------------------------------------------------------------------------------
the plan still does two index range scans for every row in T1 ,
but avoids sorting the union of T1 and T2.
.
- Follow-Ups:
- Re: SQL: Working with huge tables of chronological data
- From: Maxim Demenko
- Re: SQL: Working with huge tables of chronological data
- References:
- SQL: Working with huge tables of chronological data
- From: John
- Re: SQL: Working with huge tables of chronological data
- From: Ana C. Dent
- Re: SQL: Working with huge tables of chronological data
- From: John
- Re: SQL: Working with huge tables of chronological data
- From: charely
- Re: SQL: Working with huge tables of chronological data
- From: Charles Hooper
- SQL: Working with huge tables of chronological data
- Prev by Date: Re: Configuration question
- Next by Date: Re: Configuration question
- Previous by thread: Re: SQL: Working with huge tables of chronological data
- Next by thread: Re: SQL: Working with huge tables of chronological data
- Index(es):
Relevant Pages
|