Re: SQL: Working with huge tables of chronological data



On Apr 8, 4:08 am, "charely" <nos...@xxxxxxxxx> wrote:
"John" <acide.ascorbi...@xxxxxxxxx> schreef in berichtnews:1175788930.059725.19220@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Here are the technical details and the query I've been using so far.

TableA is ~100 millions row and contains (timestamp, evtA)
TableB is ~30 millions row and contains (timestamp, evtB)

The following query took ~60h (on a private but quite slow server) to
compute. ~1h is what I'm aiming to.

select TA1_B.evtA, TA2.evtA
from
(
select TA1.evtA, TA1.timestamp timeA1, TB.evtB, min(TB.timestamp)
min_timeB
from tableA TA1 left outer join tableB TB on (TA1.timestamp <
TB.timestamp)
group by TA1.evtA, TA1.timestamp, TB.evtB
) TA1_B,
tableA TA2
where
TA1_B.timeA1 < TA2.timestamp
and (TA2.timestamp < TA1_B.min_timeB or TA1_B.min_timeB is null)
and TA1_B.evtA <> TA2.evtA;

Thanks!

John

What about

select ta1.timestamp , ta2.timestamp from tablea ta1 , tablea ta2
where ta2.timestamp > ta1.timestamp and ta2.timestamp <=
nvl((select min(timestamp) from tableb b where b.timestamp >
ta1.timestamp) ,
(select max(timestamp) from tablea))

The nvl function is only needed to also catch events where no later event
in b exist.

I have not tested this for performance , but assuming you have indexes on
the timestamp
columns ( or using IOTs for the tables) , the optimizer will probably use
range scans on
those indexes.

I don't that that the SQL statement you provided will offer the data
that the OP was wanting. Assume the following:
TABLEA (T1)
05-APR-2007 00:07:00
05-APR-2007 00:10:00
05-APR-2007 00:12:00
05-APR-2007 00:15:00
05-APR-2007 00:17:00
05-APR-2007 00:20:00

TABLEB (T2)
05-APR-2007 00:09:00
05-APR-2007 00:18:00

The OP wanted to retrieve pairs of data from two rows of TABLEA (T1)
where a value from TABLEB (T2) does not fall between the pairs from
TABLEA (T1). In this case, we need to report all pairs of values
between the time values 05-APR-2007 00:10:00, 05-APR-2007 00:12:00, 05-
APR-2007 00:15:00, and 05-APR-2007 00:17:00 since those fall between
the two values from TABLEB (T2). That should yield the following
list:
00:10-00:12
00:12-00:15
00:10-00:15
00:15-00:17
00:12-00:17
00:10-00:17

Let's reformat the query that you provided so that it can use the
sample tables and indexes that I provided previously, in order to see
if it is an efficient starting point for the OP:
SELECT /*+ GATHER_PLAN_STATISTICS */
TA1.V1,
TA2.V1
FROM
T1 TA1,
T2 TA2
WHERE
TA2.V1 > TA1.V1
AND TA2.V1 <= NVL(
(SELECT
MIN(V1)
FROM
T2 B
WHERE
B.V1 > TA1.V1),
(SELECT
MAX(V1)
FROM
T1));

SELECT
*
FROM
TABLE (DBMS_XPLAN.DISPLAY_CURSOR(NULL,NULL,'ALLSTATS LAST'));

| Id | Operation | Name | Starts | E-Rows | A-
Rows | A-Time | Buffers | OMem | 1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------
|* 1 | FILTER | | 1 | |
1000 |00:00:00.47 | 79 | | | |
| 2 | MERGE JOIN | | 1 | 15000
| 167K|00:00:01.17 | 8 | | | |
| 3 | SORT JOIN | | 1 | 300
| 300 |00:00:00.01 | 1 | 11264 | 11264 |10240 (0)|
| 4 | INDEX FULL SCAN | T2_IND1 | 1 | 300
| 300 |00:00:00.01 | 1 | | | |
|* 5 | SORT JOIN | | 300 | 1000
| 167K|00:00:00.50 | 7 | 36864 | 36864 |32768 (0)|
| 6 | INDEX FAST FULL SCAN | T1_IND1 | 1 | 1000 |
1000 |00:00:00.01 | 7 | | | |
| 7 | SORT AGGREGATE | | 71341 | 1 |
71341 |00:00:02.09 | 69 | | | |
| 8 | FIRST ROW | | 71341 | 15 |
71341 |00:00:01.21 | 69 | | | |
|* 9 | INDEX RANGE SCAN (MIN/MAX) | T2_IND1 | 71341 | 15 |
71341 |00:00:00.47 | 69 | | | |
| 10 | SORT AGGREGATE | | 1 | 1
| 1 |00:00:00.01 | 2 | | | |
| 11 | INDEX FULL SCAN (MIN/MAX)| T1_IND1 | 1 | 1000
| 1 |00:00:00.01 | 2 | | | |
-------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TA2"."V1"<=NVL(,))
5 -
access(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))

filter(INTERNAL_FUNCTION("TA2"."V1")>INTERNAL_FUNCTION("TA1"."V1"))
9 - access("B"."V1">:B1)

Notice the Starts column in the above - that is the number of times
that portion of the plan was executed. Also note the significant
difference between estimated and actual rows, as well as what is
reported in the actual time column. Where did that bind variable in
the predicate information for ID 9 come from (Oracle optimization)?

Here is the plan for the final SQL statement that I provided, just for
comparison:
| Id | Operation | Name | Starts | E-
Rows | A-Rows | A-Time | Buffers | Reads | OMem | 1Mem | Used-
Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------
|* 1 | CONNECT BY WITHOUT FILTERING | | 1
| | 1652 |00:00:00.06 | 23 | 12 | |
| |
| 2 | VIEW | | 1 |
1035 | 1194 |00:00:00.07 | 23 | 12 | |
| |
| 3 | WINDOW NOSORT | | 1 |
1035 | 1194 |00:00:00.06 | 23 | 12 | 73728 | 73728
| |
| 4 | VIEW | | 1 |
1035 | 1194 |00:00:00.06 | 23 | 12 | |
| |
| 5 | WINDOW SORT | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | 43008 | 43008 |38912
(0)|
| 6 | VIEW | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | |
| |
| 7 | SORT ORDER BY | | 1 |
1035 | 1194 |00:00:00.05 | 23 | 12 | 57344 | 57344 |51200
(0)|
| 8 | VIEW | | 1 |
1035 | 1194 |00:00:00.07 | 23 | 12 | |
| |
| 9 | UNION-ALL | | 1
| | 1194 |00:00:00.06 | 23 | 12 | |
| |
|* 10 | HASH JOIN RIGHT OUTER | | 1 |
1000 | 1000 |00:00:00.05 | 14 | 12 | 898K| 898K| 1132K
(0)|
| 11 | TABLE ACCESS FULL | T2 | 1 |
300 | 300 |00:00:00.04 | 7 | 6 | |
| |
| 12 | TABLE ACCESS FULL | T1 | 1 |
1000 | 1000 |00:00:00.01 | 7 | 6 | |
| |
| 13 | MERGE JOIN ANTI | | 1 |
35 | 194 |00:00:00.01 | 9 | 0 | | |
|
| 14 | TABLE ACCESS BY INDEX ROWID| T2 | 1 |
300 | 300 |00:00:00.01 | 2 | 0 | |
| |
| 15 | INDEX FULL SCAN | T2_IND1 | 1 |
300 | 300 |00:00:00.01 | 1 | 0 | |
| |
|* 16 | SORT UNIQUE | | 300 |
1000 | 106 |00:00:00.01 | 7 | 0 | 36864 | 36864 |32768
(0)|
| 17 | INDEX FAST FULL SCAN | T1_IND1 | 1 |
1000 | 1000 |00:00:00.01 | 7 | 0 | |
| |

Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("TS"+1=PRIOR NULL)
10 - access("T1"."V1"="T2"."V1")
16 - access("T1"."V1"="T2"."V1")
filter("T1"."V1"="T2"."V1")

If nothing else, this shows how efficient analytic functions can be
when compared to other methods.

Regarding Mladen Gogala suggestion to use LAG(V1,1), LAG(V1,2),
LAG(V1,3), ... LAG(V1, n) - I looked at using that method initially.
The problems that I encountered: #1 what should the value of n be so
that I do not miss any potential matches, #2 how do I create a new
result row for each of the matches, since all of the matches will be
returned on the same result row. In other words, you will only be
able to return one set of matches per result row when using LAG,
unless you can find some way to uncoil all of the resulting matches
into new result rows. That may be the reason that Jonathan Lewis
stated "the option using the analytic lag() would only give you the
adjacent pairs."

I would also suggest not retrieving all rows from the two tables
(TableA is ~100 millions rows, TableB is ~30 millions rows) and
processing the data client side, since you would then also need to
consider latency caused by network traffic when the 130 million rows
are retrieved from the database and sent to the client for processing.

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc.

.



Relevant Pages

  • Re: SQL: Working with huge tables of chronological data
    ... TableA is ~100 millions row and contains (timestamp, ... from tableA TA1 left outer join tableB TB on (TA1.timestamp < ... that portion of the plan was executed. ...
    (comp.databases.oracle.server)
  • Re: Cursors (again)
    ... What does it matter if TableB is temporary? ... or tableA. ... open cTableB ...
    (microsoft.public.sqlserver.programming)
  • Re: SQL: Working with huge tables of chronological data
    ... TableB is ~30 millions row and contains (timestamp, ... from tableA TA1 left outer join tableB TB on (TA1.timestamp < ... that portion of the plan was executed. ...
    (comp.databases.oracle.server)
  • Re: Dazed and Confused
    ... related to the records in TableB, *and* how EACH record in TableB is related ... to the records in TableA. ... Purchase Order Details Table to a Purchase Order Table ... Junction table to the Purchase Requisition table. ...
    (microsoft.public.access.tablesdbdesign)
  • Re: Subselect Query Problem
    ... tableA LEFT JOIN tableB ... the result (of the inner join), it is logically re-introduced into the ... result, but since nothing match in tableB, whatever comes from tableB is ... > is Table B that has the first three fields as the primary key, ...
    (microsoft.public.access.queries)