Re: A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation



On May 3, 4:33 am, Cimode <cim...@xxxxxxxxxxx> wrote:
On May 3, 10:12 am, "Brian Selzer" <b...@xxxxxxxxxxxxxxxxxxx> wrote:> "Cimode" <cim...@xxxxxxxxxxx> wrote in message

news:1178173985.960152.227740@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[snip]

Eliminating self-joins is beneficial regardless of the implementation.
Beneficial to what ? Performance ? What performance ? Response
time ? concurrency ? cost of administration ? (Please answer these
precise questions). Do you realize that despite some lengthy (maybe
worthy) attempts at clarifying your point, some people here have no
clue what you are talking about.

I'll try to be as concise as possible. Eliminating multiple self-joins
reduces the number of passes through the data required to answer a query.

So you are taking about physical IO's. Right? Can you explain to me
how you reduce physical IO's when itterating on the basis of number of
lines multiplied by the number of operations required. Can you
explain to me how

This reduces (1) execution time,

How ? Could you establish that execution time is reduced in any
concurrent environment.

(2) the duration that locks are held,
How ? What kind of lock ? What isolation level are you refering to ?

(3) the amount of I/O required,

What IO's ? physical ? logical IO?

(4) CPU utilization, and

Do you mean percentage of usage ?

(5) the amount of
memory required. The reduction in execution time improves response time.

I am not sure I understand I understand this last sentence. Are you
aware that most of the time used in a process is mostly (compile +
binding). Execution is a minor part of the resources used.

The reduction in the duration that locks are held improves concurrency.

What locks ? How does the duration of a lock improve concurrency in
most modern systems. Are your familiar with lock sharing ? Do you
realize that most SQL DBMS do mutualize object physical accesses in a
multi transactional context?

The
reduction in I/O required, CPU utilization, and memory required improves
overall throughput. I would venture a guess that the cost of administration
or at least of maintenance would be reduced as well: I personally think it's
a lot easier to follow a simple fetch loop followed by a simple UPDATE FROM
than a complicated set-based solution with four outer theta-joins, an outer
equijoin, an inner equijoin, an aggregation and a union.

I still do not understand how having one view would be more cumbersome
to maintain then several operations (update, creation of additional
objects)...

Cursors are the cause of most timeouts I face daily. (whether due to
dealock, excessive HD/RAM swapping or exlusive object locking).

Regards...



[snip]- Hide quoted text -

- Show quoted text -

Not sure if it advances the conversation, but I thought I'd look into
this a bit further.
I decided to look into this a bit further. Obviously, no by-hand
iteration-based code is going to beat a declarative solution executed
by a "sufficiently intelligent" optimizer. Yet, I think the query can
be expressed much more concisely (allowing the optimizer a much better
chance at an efficient execution) than what Cimode posted.

To test this theory, I created an N-row table in Oracle called K1:

K1 (n int, d float, t time) -- "Station <n> is <d> miles from previous
station and recorded the train at time <t>"

The table was populated as follows (given some number N):
<n = sequence (2..N-1),
d = random float between 0 and 1,
t = "coin-toss", if "heads" then current_time + <n>minutes +
random(0,1) minutes, else NULL>
.... plus 2 boundary tuples:
<1,0,current_time>
<N, 1, current_time+N minutes>

A sample N=10 table:
1 0 5/2/2007 12:50:13 PM
2 0.47
3 0.94 5/2/2007 12:53:38 PM
4 0.65 5/2/2007 12:54:51 PM
5 0.64
6 0.98
7 0.6
8 0.65
9 0.12
10 1 5/2/2007 1:00:13 PM

I then wrote a query which produced this output for the above K1
(which can be checked for accuracy):
1 0 5/2/2007 12:50:13 PM
2 0.47 5/2/2007 12:51:21 PM
3 0.94 5/2/2007 12:53:38 PM
4 0.65 5/2/2007 12:54:51 PM
5 0.64 5/2/2007 12:55:43 PM
6 0.98 5/2/2007 12:57:02 PM
7 0.6 5/2/2007 12:57:50 PM
8 0.65 5/2/2007 12:58:43 PM
9 0.12 5/2/2007 12:58:52 PM
10 1 5/2/2007 1:00:13 PM

I then loaded K1 with N=1000000 (1 million), and submitted the query,
storing all results into a new table, K2. The index was built on
K1(n) in 5 seconds, and the 1-million-row query ran in 1 minute 42
seconds, giving a total runtime of 1 minute 47 seconds. Based on 50k,
100k, and 500k tests, the performance is O(N log N), so you can
project this to any K1 tablesize you'd like.


I'll forgo a complete proof of correctness, and will simply note that
Card(K2) = 1000000, and that a spotcheck comparison of K1 and the
resulting K2 using the 10 ID's between 400000 and 400009 yields:

In K1:
400000 0.22
400001 0.53 2/4/2008 8:10:19 AM
400002 0.17
400003 0.27 2/4/2008 8:11:50 AM
400004 1 2/4/2008 8:12:57 AM
400005 0.63
400006 0.97
400007 0.37
400008 0.56
400009 0.36 2/4/2008 8:18:06 AM

and in K2:
400000 0.22 2/4/2008 8:08:43 AM
400001 0.53 2/4/2008 8:10:19 AM
400002 0.17 2/4/2008 8:10:54 AM
400003 0.27 2/4/2008 8:11:50 AM
400004 1 2/4/2008 8:12:57 AM
400005 0.63 2/4/2008 8:14:04 AM
400006 0.97 2/4/2008 8:15:48 AM
400007 0.37 2/4/2008 8:16:28 AM
400008 0.56 2/4/2008 8:17:28 AM
400009 0.36 2/4/2008 8:18:06 AM


2 caveats before posting the query.
Caveat1: Remember my words, "sufficiently intelligent" optimizer? The
Oracle optimizer (aka CBO) is good, but comes up short in approaching
theta-joins. Here, the CBO consistently fails to realize that joining
K1 (n) against some relation R (n_start, n_end, ...) with criteria
"WHERE n BETWEEN n_start AND n_end" is performed in O(r_n log k1_n) by
indexing K1(n) then scanning R and finding matching rows in K1 using
the index.
Lacking this insight, the Oracle CBO consistently chose to join K1
with R using a O(r_n*k_n) approach: match each row in K1 with every
row in R where n >= n_start, then scan the results for all rows where
n <=n_end. Thus, I needed to manually build an index on K1(n), and
force the CBO to use the O(n logn) algorithm via "hints" (embedded
comments that force the CBO's hand), e.g. /*+ ordered use_nl(md)
index(md) */.

Caveat2:
W/ respect to the window-fucntion "RANK() over(ORDER BY n)", I'd be
curious to see a solution that returns the same results as in the "hd"
subquery, yet runs in O(n log n), using only standard relational
operators. I could only come up with the following O(n^2) version:
select k1.n, k1.d, k1.t, count(*) rnk
from k1, k1 t2
where k1.t is not null
and t2.t is not null
and t2.n <= k1.n
group by k1.n, k1.d, k1.t;


That said, here is the query:
--------------------------------------------------------------
with hd as (select n, d, t, rank() over (order by n) rnk
from k1
where t is not null),
rng as (select /*+ ordered use_nl(k1) */
s.n ns, e.n ne, s.t ts, e.t te, sum(k1.d) rd
from hd s, hd e, k1
where e.rnk = s.rnk+1
and k1.n > s.n and k1.n <= e.n
group by s.n , e.n , s.t , e.t
)
select /*+ ordered use_nl(md) use_nl(prev) */
md.n, md.d, ts+(sum(prev.d) / rd)*(te-ts) t
from rng, k1 md, k1 prev
where md.n > rng.ns and md.n < rng.ne
and prev.n > rng.ns and prev.n <= md.n
group by md.n, md.d, rd, ts, te
union
select n, d, t from hd;
--------------------------------------------------------------

I should add another bonus of declarative programming - the ease of
which I can take advantage of concurrency. The query took 107
seconds. Using a simple parallel hint of degree 8 the query ran in 45
seconds (indexing K1 with parallel degree 8 took about 2 seconds).
Degree 8 happens to be the max concurrency of the dev box used for
this test; but that's a fraction of what our production machine could
do (were I to implement this in our production system with full
allocation of all parallel resources, it'd conceivably run in 15
seconds or less).

Brian - does this compare favorably with your procedural approach?


.



Relevant Pages

  • Re: index bloat?
    ... execution plan is, how can I get the query optimizer to generate it - *without the hints*? ... Because some users of the DB are using report designing software that will just generate the join query with no hints, and I don't want them sitting there for hours when they don't have to. ... Does the optimizer not choose the merge join because it requires a bookmark lookup? ...
    (comp.databases.ms-sqlserver)
  • Re: Adding to Plan_Table
    ... Looking at default goal of Optimizer (All ... > First Rows and the query is now using indexes - Good. ... > store this into plan table. ... and in no way affects the execution of the query. ...
    (comp.databases.oracle.server)
  • Re: OpenForm Does Not Update Embedded Query
    ... It opens the form. ... RecordSource is a query, ... The optimizer populates a Solution ... But I think it's released after execution. ...
    (microsoft.public.access.modulesdaovba)
  • RE: Xlocking with a select statement
    ... named query expression, order clause, update clause, lock option ... A result table or the underlying base tables are updateable if the query ... A lock can be requested for the ...
    (microsoft.public.sqlserver.programming)
  • Re: Finally which ORM tool?
    ... Also, if you pass a variable to the query, the value the ... you have a linq query and by changing the variable's value, ... q is affected if I change foo AFTER this query and BEFORE execution. ... And it is a declaration, but one which captures the variables. ...
    (microsoft.public.dotnet.languages.csharp)