Re: Self Joins and optimization




"David Cressey" <cressey73@xxxxxxxxxxx> wrote in message
news:ZE%_h.163$D52.68@xxxxxxxxxxx
In the "interpolation" thread, Brian has been expounding on the idea that
he can discover algorithms that he knows to be correct, and that
outperform
anything an optimizer can generate. He's mentioned "self joins" and the
idea that he can get a job doen with only one read, where the automatic
solution generated in response to declarative code generates multiple
passes
through the data.

My experience leads me to the opposite conclusion from that.

Here's the questions I'd like the newsgroup to consider:

What, if any, algorithms that are demonstrably correct are going to be
overlooked by the optimizer, although discoverable by a human coder? Why
would an optimizer fail to discover the possibility of eliminating
self-joins? Are there any undecidability issues here?

Here's a maybe not-so-simple example:

Given a table with the columns,

EMPLOYEE#, JOB#, STEP#, MACHINE#, TX_TIME, TX_TYPE

where TX_TYPE can be either 'ON' or 'OFF' and the key is the entire heading,

compute the total labor performed to date for each job step.

Notes: (1) an employee cannot be perfoming the same job step on the same
machine at the same time, but (2) intervals can overlap, and when they do,
labor should be distributed between them evenly. For example, an employee
may be monitoring two or more machines during the same period of time, so
his labor during that period should be divided evenly between the job steps
on each machine, but that same employee cannot be monitoring the same job
step on the same machine more than once during the same period of time.

A procedural approach would require two passes through the data: One to
compute labor per job step per employee, and one to compute the aggregate,
labor per job step.

A set-based approach would require a minimum of six passes through the data:
Determining the intervals during which each employee was performing a set of
job steps requires a left outer self-join, or two passes through the data.
Computing the cardinality of each of those sets so that labor can be
distributed evenly requires an aggregation and then a join--three additional
passes. Computing the aggregate, labor per job step, requires a final pass.

Since the final pass is common to both approaches, the procedural approach
effectively replaces five passes through the data with a single pass.

Why would the optimizer fail? Because I don't think it's possible for the
optimizer to detect every intrarelational dependency in a table, nor do I
think it would be advisable, unless it were possible to enumerate them in
the schema.

[snip]


.



Relevant Pages

  • Re: Self Joins and optimization
    ... anything an optimizer can generate. ... labor should be distributed between them evenly. ... In order to make things easier to follow, instead of a single query, I split ...
    (comp.databases.theory)
  • Re: Self Joins and optimization
    ... anything an optimizer can generate. ... labor should be distributed between them evenly. ... The execution plan generated for the supplied query includes ten ...
    (comp.databases.theory)
  • Re: Self Joins and optimization
    ... anything an optimizer can generate. ... In order to make things easier to follow, instead of a single query, I ... EMPLOYEE# INT NOT NULL, ... I built a table, called "segments" ...
    (comp.databases.theory)
  • Re: Self Joins and optimization
    ... anything an optimizer can generate. ... In order to make things easier to follow, instead of a single query, I ... EMPLOYEE# INT NOT NULL, ... I built a table, called "segments" ...
    (comp.databases.theory)
  • Self Joins and optimization
    ... he can discover algorithms that he knows to be correct, ... anything an optimizer can generate. ... Oracle CBO, although the Oracle CBO appears to have been conceptually ...
    (comp.databases.theory)