Re: UUIDs (Was: Software implementing Gentech Genealogical Data Model)



On Wed, 10 Jun 2009 12:32:58 -0600, Bob Melson <amia9018@xxxxxxxxxxx>
wrote:

But I'm really NOT trying to prescribe an infallible method at this point.
The data points I include are for the point of discussion only. I don't
know what would be necessary to arrive at a unique description. What I do
suggest is that an identifier based on some concatenation of data elements
would allow me to publish a snapshot of my individual and allow others
with exactly the same information to establish that we're talking about
the same person and the same information relating to him. It doesn't
remove the need for research and doesn't even mean that the information is
correct - it just means that the snapshots we've "taken" are the same.
And that might be a worthwhile objective.

I really don't know that what I've suggested is either meaningful or
practical. Matter of fact, I somewhat doubt it is, save, perhaps, as a
confidence indicator of sorts. And even that is doubtful, seems to me.


Although I must admit that yours might succeed and even overshoot: I'll
bet you'll never find two records without spelling mistakes in any of
these fields

<snip>

As I've said repeatedly, what I've been looking for is an explanation of
the uids currently in use in, e.g., phpGedView and frequently seen on
RootsWeb. The discussion has, if you will, degenerated into an argument
about word meanings: what DOES the phrase universally unique identifier
really mean in a genealogical context? And from that, it's become what
you're replying to. In large part that's because I've not explained
myself well, but it's also because others either don't understand MD5
hashing or RFC 4122 or choose to ignore one or both or are just being
contrarian (like me, I admit).

Let me see if I can state the "problem" more clearly. In a genealogical
context, there is a universe of strings, concatenated from an agreed set
of data elements which describe what's known about a "person" at a
particular point in time. Given the set of data elements is properly
selected, those strings may be assumed to be unique within their universe.
Those strings will, when passed to a hash function, result in the return
of a unique number. In the case of the MD5 hash function, that unique
number will be a 32-digit hexadecimal number. Coincidentally (?), the
data fields of an RFC 4122 UID are 32 digits in length. While there are
multiple types of UID described in RFC 4122, the one of interest is that
described in paragraph 4.3 of that document, which permits the use of
identifiers derived from unique strings. (It occurs to me, however, that
the RFC 4122 definition is unnecessary for our purpose - all we need is an
agreement that a genealogical uuid consists of a 32-digit hex number
derived in the manner stated.)
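
A minimal sketch of the derivation described above, offered only as an illustration. The field list, the "|" separator and the "genealogy.example" namespace are assumptions of mine; the thread never agrees on any of them. Python's standard uuid.uuid3 is the MD5 name-based form from RFC 4122 section 4.3. Note that it overwrites a few bits with version and variant markers, so its output is not the raw MD5 digest, and that MD5 yields a 128-bit value that is only very unlikely, not guaranteed, to differ for different strings.

```python
import hashlib
import uuid

# Hypothetical "agreed set of data elements"; the thread does not fix one.
FIELDS = ("surname", "given", "birth_date", "birth_place")

# Hypothetical namespace for name-based UUIDs (any fixed UUID would do).
GENEALOGY_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "genealogy.example")


def snapshot_string(person):
    """Concatenate the agreed fields in a fixed order; missing fields are empty."""
    return "|".join(person.get(field, "") for field in FIELDS)


def snapshot_md5(person):
    """Plain MD5 of the snapshot string: a 32-digit hexadecimal number."""
    return hashlib.md5(snapshot_string(person).encode("utf-8")).hexdigest()


def snapshot_uuid3(person):
    """RFC 4122 section 4.3 name-based UUID (version 3, MD5) of the same string."""
    return uuid.uuid3(GENEALOGY_NS, snapshot_string(person))


# Made-up sample record, for illustration only.
mine = {"surname": "Smith", "given": "John", "birth_date": "1830"}
yours = dict(mine)                    # exactly the same information
typo = dict(mine, surname="Smyth")    # one spelling difference
```

Two researchers holding exactly the same snapshot get the same value from either function; a single spelling difference, as in `typo`, gives an unrelated value.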

Now, the only time such a uuid would have any significance, or so it seems
to me, is when the record from which it's derived is published. And, even
then, its utility is problematic. It's certainly not a guarantee that the
information published is correct or complete. It's really little more
than a signal to others with exactly identical information that the
information and, therefore, the person to which it's applied, is the same
as the person/information in their own collection. It says nothing about
origin or accuracy or sources or completeness, it doesn't take subsequent
changes into account - it merely says that, at the time the snapshot was
taken, the records were identical. Further uses are, as they say,
left as an exercise for the reader.

Stupefied Ol' Bob

Let me state what I think you are saying:

Your goal is to generate a Unique ID Number, such that:
1) Every person who ever existed is given a distinct number.
2) A given person has exactly one ID number assigned.
3) You want to make it so that when you publish your information,
everyone is assigned one of these numbers.
4) If someone else publishes data about the same people, they will
most likely get the same ID number for the same people.

To meet requirement 1, you need to build the number from enough data
that it will be unique for each person.
To meet requirement 2, for a given person, you (and anyone else) will
need to use exactly the same set of data to generate this number.
To meet requirement 3, then for everyone in your database, you need to
know all of that info (otherwise you can't generate the number).

To begin with, I am not sure about your research, but many of the
people in mine do not have all the information that is apt to be
needed to make this number. Since everyone needs a number, something
has to give. We will need to be able to generate a number from less
information, which means we will create a different number for
this person than we should, so requirement 2 is lost. We also need to
be very careful about losing requirement 1: if all we know is that we
have a John Smith born c 1830, the father of Jane Smith, then we
probably don't have enough to produce a unique number. Requirement 1
is an absolute requirement, or data from different people will be
mixed, since a key purpose of these IDs is to serve as keys for our
database.
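
Both failure modes can be shown in a few lines. This is only a sketch: the snapshot_id helper, the choice of fields and the sample values (including the birthplace) are made up for illustration.

```python
import hashlib


def snapshot_id(*fields):
    """Hypothetical ID: MD5 of the known fields joined with '|'."""
    return hashlib.md5("|".join(fields).encode("utf-8")).hexdigest()


# Requirement 1 fails: two different men about whom we know equally little
# end up with the same ID.
john_in_tree_a = snapshot_id("Smith", "John", "c1830", "")
john_in_tree_b = snapshot_id("Smith", "John", "c1830", "")
distinct_people_share_id = john_in_tree_a == john_in_tree_b

# Requirement 2 fails: the same man gets a different ID as soon as one
# researcher knows one more fact (the birthplace here is invented).
sparse = snapshot_id("Smith", "John", "c1830", "")
fuller = snapshot_id("Smith", "John", "c1830", "Ohio")
same_person_gets_two_ids = sparse != fuller

print(distinct_people_share_id, same_person_gets_two_ids)
```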

Now if we get to the other part of your question, what about the
number that, for example, Rootsweb will attach to a person? First,
they don't need an algorithmic way to generate these numbers for
people; the numbers can just be given out sequentially as they identify
matches among all the trees uploaded to them (and the people in the
trees can all be given UUIDs by combining the ID number of the tree
with the ID number within that tree of the person). Each "person" in
the global tree probably has links to each of the trees supporting
that information. If they find they made a mistake and gave two numbers
to what they now think is one person, they can choose one and point all
the records to that person (and maybe save the old number and have
it point to that person too). If they find they confused two people
and had merged them, they create a new number and divide the info
between the appropriate IDs.
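
A rough sketch of that bookkeeping, under my own assumptions: I have no idea how Rootsweb actually stores any of this, and the class, method names and record keys below are invented. Each uploaded record is identified by a (tree ID, ID within the tree) pair, and person numbers are handed out sequentially.

```python
import itertools


class PersonRegistry:
    """Hypothetical global registry: sequential person IDs with merge and split."""

    def __init__(self):
        self._next = itertools.count(1)
        self._redirect = {}  # retired person ID -> surviving person ID
        self.records = {}    # person ID -> set of (tree_id, local_id) keys

    def new_person(self, *record_keys):
        """Hand out the next sequential number for a newly identified person."""
        pid = next(self._next)
        self.records[pid] = set(record_keys)
        return pid

    def resolve(self, pid):
        """Follow redirects so that old numbers still find the current person."""
        while pid in self._redirect:
            pid = self._redirect[pid]
        return pid

    def merge(self, keep, drop):
        """Two numbers turned out to be one person: keep one, redirect the other."""
        keep, drop = self.resolve(keep), self.resolve(drop)
        if keep != drop:
            self.records[keep] |= self.records.pop(drop)
            self._redirect[drop] = keep
        return keep

    def split(self, pid, moved_keys):
        """One number turned out to be two people: move some records to a new number."""
        pid = self.resolve(pid)
        moved = set(moved_keys) & self.records[pid]
        self.records[pid] -= moved
        return self.new_person(*moved)


reg = PersonRegistry()
a = reg.new_person(("tree1", "I12"))
b = reg.new_person(("tree2", "I7"))
reg.merge(a, b)                        # b now points at a
c = reg.split(a, [("tree2", "I7")])    # later judged a mistake: new number c
```

After a split the old redirect from b still leads to a, not c; whether that is the right behaviour is a policy decision this sketch does not try to settle.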

Creating these ID numbers by just trying to combine the known data of
the person is not going to be very productive. The odds that two
researchers will generate the same number are fairly small, so it
doesn't help for that, while the odds that duplicates appear for
distinct people get large, especially for people with little data. A
much better solution to this is to bring the other data into the
database as a totally distinct set of data, and then run a matching
algorithm on the data, finding chunks that look similar enough to be
considered a match, and adding matching tag info. If the database uses
the UUIDs that I was talking about, then when it sees a UUID that it
already has, it knows that this is data it has already seen and can
use that info to reduce the work needed.
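
To make the idea concrete, here is a minimal sketch, not a real matching algorithm. The record layout, the "tree:local" ID strings, the 0.85 threshold and the two-year slack are all assumptions, and the similarity measure is just difflib.SequenceMatcher from the Python standard library, standing in for whatever a real matcher would use.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude name similarity in [0, 1]; a real matcher would do much more."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def years_compatible(a, b, slack):
    """Unknown years (None) never rule a match out."""
    return a is None or b is None or abs(a - b) <= slack


def import_records(db, incoming, threshold=0.85, year_slack=2):
    """Add incoming records to db and return (new_id, known_id) match tags.

    db and incoming map a record UUID (here a hypothetical 'tree:local'
    string) to a dict with 'name' and 'birth_year'.
    """
    existing = list(db.items())  # compare only against what was already there
    matches = []
    for rec_id, rec in incoming.items():
        if rec_id in db:
            continue  # UUID already seen: no matching work needed
        for known_id, known in existing:
            if (similarity(rec["name"], known["name"]) >= threshold
                    and years_compatible(rec["birth_year"],
                                         known["birth_year"], year_slack)):
                matches.append((rec_id, known_id))
        db[rec_id] = rec  # stored as distinct data; matches are only tags
    return matches


db = {"t1:I1": {"name": "John Smith", "birth_year": 1830}}
incoming = {
    "t1:I1": {"name": "John Smith", "birth_year": 1830},   # already in db
    "t2:I9": {"name": "Jon Smith", "birth_year": 1831},    # probable match
    "t2:I10": {"name": "Mary Jones", "birth_year": 1830},  # unrelated
}
matches = import_records(db, incoming)
```

The imported records are never folded into existing ones; a match is only a tag, so a wrong guess can be withdrawn without untangling merged data.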
