Re: strange behaviour of ntp peerstats entries.



mayer@xxxxxxxxxxx (Danny Mayer) writes:

Unruh wrote:
mayer@xxxxxxxxxxx (Danny Mayer) writes:

Unruh wrote:
Brian Utterback <brian.utterback@xxxxxxx> writes:

Unruh wrote:
"David L. Mills" <mills@xxxxxxxx> writes:
You might not have noticed a couple of crucial issues in the clock
filter code.
I did notice them all. Thus my caveat. However, throwing away 80% of the
precious data you have seems excessive.
Note that the situation can arise in which one waits many more than 8
samples for another usable one. Say sample i is a good one and remains the
best for the next 7 tries. Sample i+7 is slightly worse than sample i and thus
it is not picked as it comes in. But the next 7 samples are all worse than
it. Thus it remains the filtered one, but is never used because it was not
the best when it came in. This situation could keep going for a long time,
meaning that ntp suddenly has no data to do anything with for many, many
poll intervals. Surely using sample i+7 is far better than not using any
data for that length of time.
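
To make the scenario above concrete, here is a minimal sketch of the selection rule as it is described in this thread: keep the last 8 samples, pick the one with the lowest delay, and treat a poll as producing usable data only when the pick is the packet that just arrived. It is a simplified model under that assumption, not ntpd's actual clock filter code; the function name `min_delay_filter` and the tuple layout are hypothetical.

```python
from collections import deque


def min_delay_filter(samples, stages=8):
    """Apply a sliding minimum-delay selection to (offset, delay) samples.

    Returns one (offset, delay, is_newest) tuple per poll, where is_newest
    says whether the selected sample is the one that just arrived.
    """
    window = deque(maxlen=stages)  # holds the most recent `stages` samples
    results = []
    for i, (offset, delay) in enumerate(samples):
        window.append((i, offset, delay))
        # Lowest delay wins; on ties min() returns the oldest entry.
        best_i, best_offset, best_delay = min(window, key=lambda s: s[2])
        results.append((best_offset, best_delay, best_i == i))
    return results


# Scenario from the post: sample 0 is good, sample 7 is slightly worse,
# and everything else is worse than both.
samples = [(0.0, 1.0)] + [(0.0, 3.0)] * 6 + [(0.0, 2.0)] + [(0.0, 3.0)] * 7
for poll, (_, delay, is_newest) in enumerate(min_delay_filter(samples)):
    print(poll, delay, is_newest)
```

In this run sample 7 (delay 2.0) becomes the selected sample from poll 8 onward, but it is never the newest packet at the moment it is selected, so under the stated rule no poll after the first yields fresh data.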

On the contrary, it's better not to use the data at all if it's suspect.
ntpd is designed to continue to work well even in the event of losing
all access to external sources for extended periods.

And this could happen again. Now, since the
delays are presumably random variables, the chances of this happening are
not great (although under a condition of a gradually worsening network the
chances are not that small), but since one is running ntp for millions or
billions of samples, the chance of this happening sometime becomes large.


There are quite a few ntpd servers which are isolated and once an hour
use ACTS to fetch good time samples. This is not rare at all.

And then promptly throw them away because they do not satisfy the minimum
condition? No, it is not "best" to throw away data no matter how suspect.
Data is a precious commodity and should be thrown away only if you are damn
sure it cannot help you. For example, let's say that the change in delay is
.1 of the variance of the clock. The max extra noise that delay can cause
is about .01. Yet NTP will chuck it. Now if the delay is 100 times the
variance, sure, chuck it. It probably cannot help you. The delay is a random
process, non-gaussian admittedly, and its effect on the time is also a
random process, usually much closer to gaussian. And why was the figure of
8 chosen (the best of the last 8 tries)? Why not 10000? Or 3? I suspect it
came off the top of someone's head: let's not throw away too much stuff,
since it would make ntp unusable, but let's throw away some to feel
virtuous. Sorry for being sarcastic, but I would really like to know what
the justification was for throwing so much data away.

No, 8 was chosen after a lot of experimentation to ensure the best
results over a wide range of configurations. Dave has adjusted these
numbers over the years and he's the person to ask.


OK. The usual comment is that you throw away about 40% of the data using
the median filter (e.g. looking at the shm refclock program, where that
40% figure is attributed to him, and in ntp as well). But here one is throwing
away over 80% (i.e. keeping less than 1/6 of the data).
Running a very quick test on one system on my LAN, I find that this changes
the variance of the offsets by about 10%, i.e. it makes only a marginal
difference to the variance. (And yes, there is a fair amount of
correlation between the offset fluctuation and the delay fluctuation:
the correlation coefficient is .5.) Actually, the main thing this seems to
do is to make the variance in the delay times small, not the variance in the
offset.

I am also a little bit surprised that it is the delay that is used and not
the total round-trip time. As I read it, the delay is (t4-t3+t2-t1),
i.e. it does not take into account the delay within the far machine (as the
total round trip t4-t1 would), but only the propagation delay. I would
expect that the former might even be more important than the latter, but
that is a pure guess, i.e. no measurements on even one system to back it up.
Now it may be that on that rocky road to Manila the propagation delay is
by far the most important, but on a modern LAN, especially with a low
propagation delay of hundreds of usec rather than 100s of msec, I wonder.
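
As a small worked illustration of the distinction drawn above, the sketch below computes the delay from the four timestamps using the expression quoted in the post, (t4-t3+t2-t1), alongside the standard NTP offset estimate and the total round-trip time t4-t1. The function name and the numbers are made up for the example: a server clock 500 usec ahead, 100 usec of propagation each way, and 1000 usec spent inside the far machine.

```python
def ntp_delay_offset(t1, t2, t3, t4):
    """Return (delay, offset, total_round_trip) from the four NTP timestamps.

    t1: client transmit, t2: server receive,
    t3: server transmit, t4: client receive.
    """
    server_time = t3 - t2           # time the packet spent in the far machine
    total_round_trip = t4 - t1      # what the client's own clock observes
    delay = total_round_trip - server_time   # same as t4 - t3 + t2 - t1
    offset = ((t2 - t1) + (t3 - t4)) / 2     # standard NTP offset estimate
    return delay, offset, total_round_trip


# Timestamps in microseconds for the made-up case described above.
delay, offset, total = ntp_delay_offset(t1=0, t2=600, t3=1600, t4=1200)
print(delay, offset, total)  # 200 500.0 1200
```

Here the delay comes out as 200 usec while the total round trip is 1200 usec, so a filter keyed on delay never sees the 1000 usec spent in the server, which is the point being questioned.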

I munged ntp's record_peer_stats to also print out p_off and p_del (i.e.
the immediate offset and delay of the current packet) and counted up in the
output how often peer->off and p_off differ from each other,
indicating a thrown-away packet of data. They differed 83% of the time.
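
For comparison with that measured figure, here is a rough simulation sketch of the same count under a simplified model: a packet is counted as thrown away whenever some older sample among the last 8 has a lower delay. The exponential delay distribution, the seed, and the function name `discarded_fraction` are assumptions chosen for illustration; this is not ntpd code and does not read real peerstats output.

```python
import random
from collections import deque


def discarded_fraction(delays, stages=8):
    """Fraction of polls where the newest delay is not the window minimum."""
    window = deque(maxlen=stages)  # the most recent `stages` delays
    discarded = 0
    for d in delays:
        window.append(d)
        if min(window) < d:  # an older sample in the window beats this one
            discarded += 1
    return discarded / len(delays)


# Assumed delay model: independent, exponentially distributed delays.
rng = random.Random(1)
delays = [rng.expovariate(1.0) for _ in range(20000)]
print(discarded_fraction(delays))
```

With independent, identically distributed delays the newest of 8 samples is the minimum one time in 8, so the fraction should come out near 7/8. Real network delays are correlated, which is one reason a measured value can differ from that.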

