Re: Self Joins and optimization
- From: "Brian Selzer" <brian@xxxxxxxxxxxxxxxxxxx>
- Date: Sat, 12 May 2007 22:30:00 GMT
"David Cressey" <cressey73@xxxxxxxxxxx> wrote in message
news:Lci1i.4920$1X1.767@xxxxxxxxxxx
"Brian Selzer" <brian@xxxxxxxxxxxxxxxxxxx> wrote in message
news:ej81i.8350$rO7.3542@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
[big snip]
Don't take this the wrong way, I understand that this is a first attemptat
a very complicated problem (I spent several weeks on it back in 2002.),but
I really don't think the above solution is a move in the right direction.that
In fact, it is very similar to what I moved away from. It is essentially
the same file layout that existed in the old DOS-based Btrieve database
I replaced, except that ....
Brian, don't take this the wrong way, either.
I understand that you solved this problem some time ago, and that your
solution is satisfactory to you. That's not the question I'm addressing
at
all. What I'm interested in exploring is your assertion of the
non-existence of a set oriented solution to the problem, where the set
oriented solution runs at approximately the same speed as your sequential
processing solution.
I'm not convinced that multiple self joins are in fact required in order
to
produce a set oriented solution. In particular, I'm interested in
finding
a solution that bypasses one of the steps in the set oriented solution you
posted at my request. It's the step where each TX ON is matched with its
corresponding TX OFF. That step seems unnecessary to me, although I
can't
prove it, yet.
Unfortunately, it is necessary. How else could you tell which activities an
employee was performing during an interval bounded by events adjacent in
time? Each row in TX contains only half of the information required to
determine that; each row also contains only half of the information required
to determine the intervals between events.
Consider:
t1....t2....t3....t4....t5....t6
a1: ------------------------
a2: --------------
a3: -----
During interval t3..t4, the employee is performing activities a1, a2 and a3,
but the pair of rows that define the interval t3..t4 do not reference a1 or
a2 at all, nor do the adjacent rows reference a1. The entire interval for
each activity must be calculated in order to determine whether each
interval, t1..t2, t2..t3, t3..t4, t4..t5, t5..t6, overlaps it.
If there is any point at issue between you and me, its your assertion,
in
the thread that gave rise to this thread, that Cimode would have been
better off using cursors in his approach to interpolation. I doubt that
assertion, and I don't think your example, even if it stands up, proves
that assertion.
The proof is in the pudding. Cimode has yet to provide additional
information about the frequency of nulls in the data so that apples can be
compared to apples. But after further reflection, I think his query can be
answered with two simultaneous passes through the data, provided an index
exists on ID. (It is the key, so it follows that there should be an index.)
In Sql Server, a table variable would be needed to cache the results,
requiring an additional write and read per row making the total passes four.
I haven't played with them much, but in Oracle, a pipelined table function
could be used that would stream the results directly, eliminating the
write/read overhead. Furthermore, since the second cursor follows closely
behind the first (see below), leaving only rows with null between, it is
highly likely that little if any additional I/O would be required for the
second pass through the data. I challenge anybody to find a set-based
solution to Cimode's problem that would consistently require fewer than two
passes through the table without increasing overhead in other ways, such as
the cost of maintaining an additional index or a materialized view.
Here's pseudocode for the meat of a pipelined table function:
fetch x
fetch y
if x%FOUND then
-- skip over stuff that can't be interpolated (initial null rows)
while x.ArrivalTime is null and x%FOUND loop
fetch x
fetch y
end loop
end if
if y%FOUND then
fetch y
-- if the last y.ArrivalTime is null, then we're done (skip terminal null
rows)
while y%FOUND loop
-- skip ahead while accumulating total distance
while y.ArrivalTime is null and y%FOUND loop
accumulate TotalDistance
fetch y
end loop
if y%FOUND then
-- catch up while interpolating ArrivalTime
while x.ArrivalTime is null loop
calculate ArrivalTime
adjust TotalDistance
pipe row
fetch x
end loop
reset TotalDistance
-- don't need to interpolate
ArrivalTime := x.ArrivalTime
pipe row
fetch x
fetch y
end if
end loop
end if
In Sql Server, each pipe row above would require an insert. In either case,
the entire solution should execute in linear time.
The key here is that there is nothing happening within the fetch loop that
touches the database. Two streams are attached as input to the function and
the resulting stream is attached to the enclosing select which presents the
results. This operates much the same as a merge join optimizer step, which
expects its two input streams to already be sorted and produces one output
stream.
Your assertion has enormous implications, both practical and theoretical,
if it turns out to be true. I can comment better on the practical
implications than on the theoretical ones.
In practice, many database consultants and/or developers have made a
minor
industry out of weeding cursor oriented code out of database applications
and substituing well engineered set oriented soultions. If your assertion
is correct, nearly all of these improvements were substantially less than
optimal, at least from a speed point of view. For example, December 28,
2006 Kevin Kirkpatrick (I think) outlined a case where he wrote about
300
lines of SQL that did, in 45 minutes, what would have taken about 300
hours using the procedural solution he replaced. Kevin's example is more
startling than yours. And, IMO, more convincing.
I don't dispute the benefits of well engineered set-oriented code. I am not
asserting that procedure-oriented solutions are always better. I want to
make that absolutely clear. But there *are* occasions when a
procedure-oriented solution will outperform a set-oriented one. I have now
provided two examples, and the key to both is the elimination of self-joins.
What I object to is the unfounded and unsupported blanket claim that
procedure-oriented solutions are always slower, always reduce concurrency,
and always reduce scalability when compared to equivalent set-based
solutions.
I hasten to add that there ARE cases where I would adopt a procedural
approach, but I would NOT make the performance claims that you made. If
you're a practical person, you have more tools than just a hammer in your
tool kit.
I'll leave the theoretical implications of your assertion that cursors are
better to the theoreticians in this ng.
.
- References:
- Self Joins and optimization
- From: David Cressey
- Re: Self Joins and optimization
- From: Brian Selzer
- Re: Self Joins and optimization
- From: David Cressey
- Re: Self Joins and optimization
- From: Brian Selzer
- Re: Self Joins and optimization
- From: David Cressey
- Re: Self Joins and optimization
- From: Brian Selzer
- Re: Self Joins and optimization
- From: David Cressey
- Self Joins and optimization
- Prev by Date: Re: more closed-world chatter
- Next by Date: Re: Self Joins and optimization
- Previous by thread: Re: Self Joins and optimization
- Next by thread: Re: Self Joins and optimization
- Index(es):
Relevant Pages
|