Re: Guidelines to a decent support of surrogate key implementation
- From: -CELKO- <jcelko212@xxxxxxxxxxxxx>
- Date: 25 May 2007 07:04:48 -0700
I have a taxonomy of key types. We have a lot of problems with
terminology on this one, so let me get that out of the way.
There is no such thing as a "universal, one-size-fits-all" key. Just
as no two sets of entities are the same, the attributes that make them
unique have to be found in the reality of the data. Here is my
classification of types of keys (needs a monofont):
natural artificial exposed surrogate
==================================================================
Constructed from reality |
of the data model | Y N N Y
|
verifiable in reality | Y N N N
|
verifiable in itself | Y Y N N
|
visible to the user | Y Y Y N
1) A natural key is a subset of attributes which occur in a table and
act as a unique identifier. They are seen by the user. You can go to
the external reality and verify them. you would also like to have
some validation rule. Example: UPC codes on consumer goods (read the
package barcode) and validate them with a check digit or a
manufacturer's website, geographical co-ordinates (get a GPS).
2) An artificial key is an extra attribute added to the table which is
seen by the user. It does not exist in the external reality, but can
be verified for syntax or check digits inside itself. It is up to the
DBA to maintain a trusted source for them. Example: the open codes in
the UPC scheme which a user can assign to his own stuff. The check
digits still work, but you have to verify them inside your own
enterprise.
If you have to construct a key yourself, it takes time to deisgn them,
to invent a validation rule, set up audit trails, etc.
3) An "exposed physical locator" is not based on attributes in the
data model and is exposed to user. There is no reasonable way to
predict it or verify it, since it usually comes from the physical
state of the hardware at the time of data insertion. The system
obtains a value thru some physical process in the storage hardware
totally unrelated to the logical data model. Example: IDENTITY
columns, other proprietary, non-relaitonal auto-numbering devices.
Technically, these are not really keys at all, since they are
attributes of the PHYSICAL storage and are not even part of the
LOGICAL data model. But they are handy for lazy, non-RDBMS
programmers who don't want to research or think! This is the worst
way to program in SQL.
4) A surrogate key is system generated to replace the actual key
behind the covers where the user never sees it. It is based on
attributes in the table. Example: Teradata hashing algorithms,
indexes, pointer chains, ADABASE numbers, etc.
The fact that you can never see it or use it for DELETE and UPDATE or
create it for INSERT is vital. When users can get to them, they will
screw up the data integrity by getting the real keys and these
physical locators out of synch. The system must maintain them.
** Notice that people get "exposed physical locator" and surrogate
mixed up; they are totally different concepts. **
An appeal to authority, with a quote from Dr. Codd: "..Database users
may cause the system to generate or delete a surrogate, but they have
no control over its value, nor is its value ever displayed to
them ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979),
Extending the database relational model to capture more meaning. ACM
Transactions on Database Systems, 4(4). pp. 397-434.
This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means
never used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.
(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed.).
(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.
(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it
has ceased to have one (e.g. and applicant for a job and a retiree).
These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).
References
Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp.
397-434
The steps for finding a key are
1) Look for an industry standard and the trusted external source that
maintains and verifies it. I count this as a natural key, but you
could argue that it is artificial.
2) Look for a natural key in the attributes. Example: (longitude,
latitude) makes a good key for a geographical location. A GPS can be
used to verify it.
3) If you must design a new identifier, plan it carefully --
especially if people will see and use it. You have to be able to
verify it in application programs, so you should have a regular
expression, other syntax rules, check digits. You have to be able to
be verify in the reality of the model or with a trusted source that
you maintain.
Validation means the format is good -- "This could one of our invoice
numbers because it is 7 digits long, passes a Bull code check digit
and begins with { '01', '02', '07', '99'}"
Verification means that it references a real entity -- "This is a real
invoice because I can look it up in Accounts Payable and trace its
approval back to Cindy Lu Who on 2005-02-12."
This is hard work. I have a few chapters in SQL PROGRAMMING STYLE on
scales, measurements and how to design encoding schemes.
.
- References:
- Prev by Date: Re: Databasescheme with UML
- Next by Date: Re: Question about "linking tables"
- Previous by thread: Re: Guidelines to a decent support of surrogate key implementation
- Next by thread: Little design mistakes that can be easily avoided (2): Listenning to CELKO (and CELKO alikes)
- Index(es):
Relevant Pages
|