Re: Surrogate Keys: an Implementation Issue




"JOG" <jog@xxxxxxxxxxxxx> wrote in message
news:1154435294.728381.182060@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Brian Selzer wrote:
"JOG" <jog@xxxxxxxxxxxxx> wrote in message
news:1154375986.924741.168930@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Brian Selzer wrote:
I prefer to think of a surrogate as more of a temporal attribute.
While
its
value identifies a proposition in the database, it also indicates that
a
new
proposition became known by the database, allowing it to become part
of
the
discussion.

A logfile will indicate when a relvar has a different relation value in
it. One doesn't need a surrogate for that.

What do I mean by that? Surrogates make it possible to
determine whether totally new information is being added to the body
of
knowledge or that existing information has had to be altered to
reflect
changes in the circumstances underpinning that information.

Totally new information? All updates add totally new information. It's
a totally new relation each time you make an update. You are confusing
RT with the entity/relationship model. The RT is not concerned with
entities but statements of fact. I recommend having another look at the
ORM as it helped me shift out of that mindset.


Yes, totally new. I perceive a significant difference between a
proposition
that reflects new circumstances and a proposition that reflects changes
in
existing circumstances. For example, if a new employee, John Smith, is
hired, then information about him is totally new and now an additional
tuple
must exist to reflect that information; on the other hand, if employee
Joyce
Miller gets married and becomes Joyce Harper, then information about her
isn't totally new--much of it was already known by the database prior to
the
update, and now a replacement tuple must exist to reflect that
information.
In practice, it is often necessary to define extra constraints on added
information.

Well I doubt we're going to agree then Brian, because according to
relational theory all updates/inserts are equally new information. We
are replacing one relvar value with another and there is no "link"
between the two and this is a fundamental part of the underlying
mathematics.

For example, several tasks may need to be performed by
Personnel for a new employee, requiring that the several attributes in
the
Employee relation that denote whether or not those tasks have been
completed
be set a certain way for a new employee. If all updates add totally new
information, then RT is not sufficient to meet that need, and we must
rely
on implementation-specific extensions to define such constraints.

The Relational Model is insufficient in that regard because it does
not
require that
primary keys be immutable.

I think that surrogates should be hidden from end-users, but that is
more
of
a best practice than a requirement and should be imposed upon and
enforced
by applications and database administrators. I disagree with the
notion
that surrogates should "leak" into the real world. They should be
accessible to applications in order to prevent the update problems
I've
described in an earlier post, but their values should never appear on
a
data
entry screen or document. There are many reasons for this, here are a
couple:

(1) Anything that an end-user can see is subject to a change order.
For
example, a customer may object to being referred to by the number
"666."
(2) It may be necessary to take a database offline and change the
physical
representation or values for surrogates, such as is the case when
databases
are merged. If a user has written down the key value, he'll complain
that
he can't find the information anymore.

If you hide it, it is a row address. A pointer, if one at the logical
level.


No, it's not.

Hey, have the courtesy to expand glib responses. I've explained why imo
a hidden surrogate is no more than an artificial entity ID - and
essentially what one is doing with hidden surrogates is no different to
utilising CODASYL record ID's. (and we all know about that right?).
Where's the theory behind them? Convince me via the math.



There should be a second candidate key on every relation. The
difference
between OIDs and surrogates is that OIDs are pointers that reference
an
instance of an object while a surrogate indicates that a statement of
fact
became known.

OID's reference an instance of a class, not an object,

True.

... but you are
correct that they are pointers. Yet if you hide a surrogate, then there
is little difference, it becomes a pointer to a proposition - the field
itself is no part of that real world proposition, it merely addresses
it. Again I honestly believe it is thinking in terms of entities that
generates the confusion.


Again, the database shouldn't hide them, the applications and DBA should.

There is a significant difference. Yes, a surrogate identifies a
proposition, just like any other candidate key, so in one sense it
"points"
to a proposition in the same sense that any candidate key "points" to a
proposition, but the difference manifests itself at the theoretical
level,
rather than the implementational level. At the theoretical level,
surrogates enable the propositions in one database state to be correlated
with the propositions in the next, making it possible to differentiate
between added, replaced and retired information during an update,
permitting
targeted constraints to be defined.

Look, if one is using a surrogate in this way one is saying that no
candidate keys are sufficient to identify a proposition for one reason
or another. You are saying that somehow, in the face of the accepted
fundamentals of liebniz equality, there is a situation when every
single one of an item's attributes have changed and yet it is still
somehow the same thing. You're certainly going to have to provide more
justification for a kludge of this nature and its huge philosophical
break with accepted logic. But hey, I'm all ears if you can do that
formally.


Not exactly. Any candidate key value is sufficient to identify a
proposition within a single database state, but that doesn't mean that it is
sufficient across multiple successive states. While it's true that if two
things are indistinguishible, they are the same thing, the reverse is not
necessarily true: it's not a caterpillar that emerges from a chrysalis, it's
a butterfly.


This means that in an object store there can be object
instances that are identical with exception of the object identifier.

And with a hidden surrogate one can record identical real world
propositions, with the exception of that row identifier. Its the same
mistake imo.


No, one can't. A surrogate is not a surrogate without a second candidate
key; therefore, it is not possible to record identical real world
propositions with a surrogate.

While it may be interesting to be able to count the number of
identical
object
instances, the same proposition can't have become known more than once
within the same database state. Relations are sets of propositions,
so
it
doesn't make any sense to allow duplicates, since something is either
known
to be true or not.

Neither does it make sense to permit duplicate statements of fact, by
using a hidden surrogate.


A surrogate does not allow duplicate statements of fact. See my previous
comment.

Therefore, I disagree with the idea of using a surrogate
without a second candidate key, even if that key includes every other
attribute in the relation. A surrogate is not a surrogate without a
second
candidate key.

"JOG" <jog@xxxxxxxxxxxxx> wrote in message
news:1154262656.521112.118530@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
This discussion has illuminated me, and I would like to share that.
A
row is of course merely a proposition, and is identified by the
predicate fields that make each proposition in that set unique. A
surrogate key similar helps us identify a particular row, so there
initially seems no problem there, and its not an addressing issue.
Rather the issue is that the surrogate _does not exist_ in the real
world, it is initially an artifice of the database, and is not part
of
the original proposition.

So whats it doing there? Consider two people with the same names:

Ex ( Name(x, "Sarah Jones") & Marital_status(x, married) ) &
Ey ( Name(y, "Sarah Jones") & Marital_status(y, married) ) &
x != y

We don't have sufficient information to distinguish x and y (note
this
specifically refers to the propositions, not the conceptual entities
involved). Perhaps this distinguishing information is impossible to
record, or simply cost ineffective to do so. Anyhow, If we had a
mechanism that seperated propositions x and y (an OID for example),
then how on earth would we know which one to update when the time
came?
(Obviously we couldn't, and thats part of the problem with OID's.)

But are surrogates in RM any different? My initial opinion was that
the
answer was NO, that they seem to be no different to OID's, except in
their encoding. However gradually my opinion has changed. The reason
for this is that while an OID is always physical, a surrogate key
leaks
into the real world almost as soon as it is generated. While it is
initially an artifice, it quickly becomes a real attribute, and that
validates it's use as part of the logical model. Hence the rows
become:

Ex ( EmpID(x, 1) & Name(x, "Sarah Jones") & status(x, married) ) &
Ey ( EmpID(x, 2) & Name(y, "Sarah Jones") & status(y, married) )

Noone can tell you what their OID in some database is, but they may
well be able to tell you what their employee/payroll number is.

It hence seems to me that it is the leak in to the real world that
is
absolutely _vital_ to validate surrogate use.



In the end, maybe there is no way we will agree, as you view the world
in somehow definably permanent entities with some divine temporal
consistency on their identity. This is a very common and
understsandable view, but it is an artifice and imo unhelpful to the
progress of accurate information recording.



.