A new proof of the superiority of set oriented approaches: numerical/time serie linear interpolation



Among the least explored opportunities RM has to offer comes the issue
of numerical / time series linear interpolation of values. Recently,
I have solved the following problem on a community board which I
encountered this problem a few years ago with different facts while
working for the Frech railroad company. I thought about using the
recent example (better for pedagogical purposes) to illustrate the
work done a few years ago.

To make sure the heavier traffic lines were weel maintained and fit
for high speed service, each 2 weeks, the company launched trains
equipped with receptors that ran across lines and recorded some data
such as passage time and distance. The data was later pourred in a
statistical database for analysis and decision were made that some
line required additional maintaince (replacement of line segment).

Without getting into too much detail, the process consists of the
following segment of reality captured: A train goes from a start train
station at a specific start time and ends its ittinerary in another
station at a end time. Between these stations, the train crosses
several stations where some data was *not* collected for some
reason(mainly the receptor was not working). As a consequence, on
direct image systems (SQL Server, ORACLE or DB2), NULL pointers were
placed instead of the values of passage time and a row was to be added
at each station passage. So, we have a *train_passage* table like
this (I supposed a SK ID)

ID Distance ArrivalTime
24 0.289 2000-01-01 06:29:00.000
25 0.193 NULL --> 1st value to be interpolated
26 0.299 NULL --> 2nd value to be interpolated
27 0.131 NULL --> 3rd value to be interpolated
28 0.444 NULL --> 4th value to be interpolated
29 0.16 NULL --> 5th value to be interpolated
30 0.665 NULL --> 6th value to be interpolated
31 0.186 2000-01-01 06:33:00.000

The point of the problem was to interpolate all NULL ArrivalTime
values, based on the following formula.(A static linear formula is
used for the moment and there are probably better formulas but the
point of the problem is not about the efficiency of the formula):

ArrivalTime for ID = 25:
(Distance for 25 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)
ArrivalTime for ID = 26:
(Distance for 26 / Sum Distance for IDs 25-30) * (ArrivalTime for ID
31-ArrivalTime for ID 24)

and so forth....

In most cases I have encountered, practionners and developpers use a
procedural approach (namely cursors) to solve this problem. For each
row with unknown datetime values, they executed a function. As a
consequence on several million records and because each row performs
heavy calculation, the resource consumtion simply grows
exponentially. (Mainly because of non separation between the physical
and logical layer in direct image systems). So, every two weeks, they
had to run a 10 hour batch process to interpolate the missing data.
In pseudo code the initial procedural looked like this:

FOR ALL ROWS
START
IF ROW HAD ALL DATA
STORE NEW START TIME
STORE NEW END TIME
WRITE ALL COMPUTED INTERPOLATED VALUE FROM
PREVIOUS SEGMENT OF LINE INTO FINAL INTERPOLATED TABLE
WRITE NEW ROW IN FINAL INTERPOLATED TABLE
END IF

IF ROW HAD MISSING DATA
COMPUTE INTERPOLATED DATA
STORE COMPUTED VALUE IN TEMP
END IF

GO TO NEXT ROW DATA
END

SHOW ALL VALUES IN FINAL INTERPOLATED TABLE


Because of the procedural approach,the process locked up the database
in exclusive mode and prevented any concurrent process and had to be
launched several time because it sometime simply timed out due to a
lack of resources. This is an example that demonstrate that set
operations are light years from procedural approaches. I reexpressed
the interpolation as a complex UNION operation between two relations
and ended up with following SQL statement....(note: this is T-SQL,
dateadd is a function that adds a certain numer of time perdiod to a
specific datetime)

(
select E.id, E.Distance, dateadd(s, (E.Distance/G.distance) * G.gen,
E.start) arrivaltime
from yourtable F inner join
(
select A.ID, A.Distance, B.arrivaltime start,
datediff(s,B.arrivaltime, C.arrivaltime) as gen
from yourtable A
left outer join
(
select id, distance, arrivaltime from yourtable
where ArrivalTime is not null
) B
on B.id < A.id
left outer join
(
select id, distance, arrivaltime from yourtable
where ArrivalTime is not null
) C
on C.id > A.id
where A.arrivaltime is null
) E
on E.ID = F.ID

left outer join
(
select sum(A.Distance) distance, datediff(s,B.arrivaltime,
C.arrivaltime) as gen
from yourtable A
left outer join
(
select id, distance, arrivaltime from yourtable
where ArrivalTime is not null
) B
on B.id < A.id
left outer join
(
select id, distance, arrivaltime from yourtable
where ArrivalTime is not null
) C
on C.id > A.id
where A.arrivaltime is null
group by
datediff(s,B.arrivaltime, C.arrivaltime)
) G
on E.gen = G.gen
)
union
(
select ID, Distance, Arrivaltime from yourtable where ArrivalTime is
not null
)


As a result, I obtained the following interpolation...

ID Distance ArrivalTime
24 0.289 2000-01-01 06:29:00.000
25 0.193 2000-01-01 06:29:24.000
26 0.299 2000-01-01 06:29:37.000
27 0.131 2000-01-01 06:29:16.000
28 0.444 2000-01-01 06:29:56.000
29 0.160 2000-01-01 06:29:20.000
30 0.665 2000-01-01 06:30:24.000
31 0.186 2000-01-01 06:33:00.000

The execution time went from 10 hours to 30 minutes and the database
(in fact the table) was not in exclusive mode anymore (conccurent
queries could be ran during the execution of the
interlpolation)...Most of all, no more time outs were to be noted and
we had the assurance that the process would not stop at the 9th hour.

This example establishes once more the superiority of set oriented
operations over procedural approaches but it also brings some
questions...If we can interpolate values from relations, would not we
imagine a TRDBMS with a systematic method (based on interpolation) to
deal with missing numeric/datetime data. Would not that be a possible
systematic method to handle missing data in the spirit of what Codd
imagined (I would give a lot to be able to ask him this question if he
was alive)...

I would be grateful for your comments, ideas on this subject....

.



Relevant Pages