Re: UUIDs (Was: Software implementing Gentech Genealogical Data Model)



On Wednesday 10 June 2009 02:04, Gordon Burditt (gordonb.w05qx@xxxxxxxxxxx)
opined:

But, back to what I've been saying, the concept of universally unique
IDs implies such an id is valid for the same entity, whatever that
entity is, everywhere that entity is found - in multiple databases on
_my_ machine, on Alice's machine, clear through to Zeke's machine.

I disagree. UUIDs have a large number space such that every
genealogist can give every person in the world that ever lived a
UUID, without having any duplicates. A given person, or birth,
death, or marriage record, may have billions of UUIDs referring to
the same person or fact.

OK, I understand RFC 4122 UUIDs .. I think. What you suggest above
implies that I want to slap what's essentially a label on a record, that
label being generated essentially randomly and using one specific type of
RFC 4122 uid.

Actually, this is a great way of keeping track of your *SOURCES* of
information, something which has been bothering me for a while.

But what I've posited elsewhere is an identifier generated by a
known mechanism, such as MD5, from some combination of data points for an
individual. Such an identifier falls within the definition in the RFC of
another form/type of uid.

Yes, but that other type of uid involves using strings that are already
unique, such as URLs, domain names, and OIDs, which already have a
method of keeping them unique.

For the sake of argument, let's say that the
string we submit to MD5 is a concatenation of surname, first/christian
name, day, month and year of birth, city, county, state and country of
birth, and maybe father's christian name and mother's maiden name.

It would be nice if I had all that info for even 20% of my data, but
I don't.

You seem to be attempting to create a Universal Social Security Number.
One of the biggest problems in genealogy is trying to prove that the
person referred to in this record and that record are the same person.
Don't automate it by a trivial process lightly.

Couple of points here: (1) What I've been trying to do is come up with
some sort of rationale for the uuids currently found/in use in several
genealogy programs, including my favorite, phpGedView; (2) the datapoints
I suggest are not intended to be anything but an example of what MIGHT be
used in the generation of the id; (3) so what if some part of the information
is incomplete or missing? - such data as there is will still uniquely
hash.

And, yeah, I quite understand that the identifier won't remain constant
over time if I continue to add information to the individual's record, or
modify the information I already have.

Now, I've implemented a process on my data that can try to match
up two records that *might* be the same, based on matches between
data such as you describe. I use a point score for matches (e.g.
1 point for sex matching, 10 points for birth year matching, -50
for birth year (or estimate) being more than 30 years apart, 20
points for birth month, day, and year matching, 10 points for first
name matching, -10 points for both first names known and they DON'T
match, etc.) Those pairs with the highest scores are examined for
possibly being the same person. But it's a pairwise comparison.
Also done in this process is estimation of birth, death, and marriage
dates, based on relationships between people. It puts people in
the approximate century, to avoid false match attempts between
someone and their great-great-grandfather.
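A minimal Python sketch of that kind of pairwise point scoring. The Person
fields and the function name are hypothetical, and it assumes the 20 points
for a full date match are added on top of the 10 for the year; the actual
program may well differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    # Hypothetical record layout; None means "not known".
    sex: Optional[str] = None
    first_name: Optional[str] = None
    birth_year: Optional[int] = None
    birth_month: Optional[int] = None
    birth_day: Optional[int] = None


def match_score(a: Person, b: Person) -> int:
    """Score how likely two records describe the same person."""
    score = 0
    if a.sex and b.sex and a.sex == b.sex:
        score += 1
    if a.birth_year is not None and b.birth_year is not None:
        if a.birth_year == b.birth_year:
            score += 10
        elif abs(a.birth_year - b.birth_year) > 30:
            score -= 50
    date_a = (a.birth_year, a.birth_month, a.birth_day)
    date_b = (b.birth_year, b.birth_month, b.birth_day)
    if None not in date_a and date_a == date_b:
        score += 20  # assumption: stacks with the year-only points
    if a.first_name and b.first_name:
        if a.first_name.lower() == b.first_name.lower():
            score += 10
        else:
            score -= 10  # both names known and they do not match
    return score


same = match_score(Person("M", "Robert", 1615, 3, 2),
                   Person("M", "Robert", 1615, 3, 2))
different = match_score(Person("M", "Robert", 1615, 3, 2),
                        Person("M", "John", 1700))
print(same, different)
```

Missing fields simply contribute nothing, so a sparse record scores near
zero rather than being ruled in or out.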

There are lots of practical problems. Missing data, for one.
Spelling, for another. My oldest known ancestor along the male
line is Robert Burditt of Malden, Massachusetts. Or Robert Burden.
Or one of several other variant spellings of that last name.
He came from someplace in England, but I haven't found any sources
that claim to say from where. Date of birth is a range of years.

Often I'll have two records of a person: their birth record, which
might have most of the data above, and their marriage or their
child's birth, which likely won't. It would be nice to be able to
match those up with confidence, but you can't.

Now, about country of birth? Is that country *at the time of birth*,
or what country that place is now? What about disputed territories?
That's not an issue for Robert Burditt, but people are born in areas
where you can get murdered for attaching the wrong country name to
certain birthplaces in the Middle East.

Years are a problem also. How about a birth date of 1745/6? No,
that doesn't mean 1745 or 1746. There was this disagreement over
when the year began in some parts of the world (in this case, it
comes from English colonies that became the New England states
in the USA).
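A rough Python sketch of reading such a dual-dated year, assuming the usual
convention that the first number is the year under the old reckoning (year
starting 25 March) and the second is the same year counted from 1 January.
The function name and the accepted input format are hypothetical.

```python
import re


def parse_dual_year(text: str) -> tuple[int, int]:
    """Split a dual year like '1745/6' into (old_style, new_style)."""
    m = re.fullmatch(r"(\d{4})/(\d{1,4})", text.strip())
    if m is None:
        raise ValueError(f"not a dual-dated year: {text!r}")
    first, suffix = m.group(1), m.group(2)
    old_style = int(first)
    # The suffix replaces the trailing digits of the first year.
    new_style = int(first[: 4 - len(suffix)] + suffix)
    if new_style < old_style:
        # Rollover such as 1699/00, which means 1699 and 1700.
        new_style += 10 ** len(suffix)
    return old_style, new_style


print(parse_dual_year("1745/6"))
```

Either number may turn up alone in another record, so a matcher would
probably want to treat both as candidates for the same year.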

I really don't know how to say this differently: For the sake of argument
and example, let's agree that the string to be hashed is a concatenation
of those elements I suggested last night. Now, granting that dates and
places of birth/death can be, and frequently are, incomplete, let's also
agree that we will use as much of that information as we have. Believe
me, I'm not trying to prescribe what MUST go into that string, I'm just
trying to come up with an example of what I've been talking about. There
ARE difficulties with what I've suggested, I know that and knew it going
in. I really don't know what should go into that string, OK?
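For the sake of the example only, here is a minimal Python sketch of such
a string and its hash. The field names, their order, the "|" separator and
the lower-casing are all assumptions made for illustration; nothing here
is prescribed by the thread or by any genealogy program.

```python
import hashlib

# Hypothetical field list, following the example elements named above.
FIELDS = (
    "surname", "given_name",
    "birth_day", "birth_month", "birth_year",
    "city", "county", "state", "country",
    "father_given_name", "mother_maiden_name",
)


def canonical_string(record: dict) -> str:
    """Join whatever fields are known; missing ones become empty."""
    parts = [str(record.get(f) or "").strip().lower() for f in FIELDS]
    return "|".join(parts)


def person_id(record: dict) -> str:
    """MD5 of the canonical string, as 32 hexadecimal digits."""
    data = canonical_string(record).encode("utf-8")
    return hashlib.md5(data).hexdigest()


record = {"surname": "Smith", "given_name": "John", "birth_year": 1950}
fuller = dict(record, city="El Paso")

print(person_id(record))
print(person_id(fuller))  # more data, different id
```

Adding or changing any field changes the id, which is the "snapshot"
behaviour discussed in this thread.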


Now, a
string composed of those elements will hash using MD5 to a particular
32-digit hexadecimal value which is theoretically unique.

No, it's not. Consider the Smith family where twins, both named
John (and no middle initial), were born. Same date, same parents,
same birthplace. I had a problem believing that anyone would
actually DO that, until a specific example was pointed out to me.
Ok, I don't remember the exact names, but it did happen. It was
within my lifetime, too, not way back in colonial days. I suspect
a review of past TV episodes of the "Jerry Springer show" and "Maury
Povich" would reveal a few of these, plus strangeness like "I am
my own grandfather".

Please remember, the elements I suggested make up the string to be hashed
were just for the sake of the discussion. I don't know that those
elements are or would be sufficient to describe an individual and didn't
mean to suggest that they were. As for MD5, its algorithm is supposed to
guarantee that no two strings will hash to the same value UNLESS they are
absolutely identical - the string "aaaa" will always and everywhere hash
to the same 32-digit hex value. If the string changes, say to "aaab",
that new string will hash to a different and unique 32-digit hex value.
I'm much too lazy to do the math, but it's pretty easy to see that the
number of possible 32-digit hex numbers is, for all intents and purposes,
infinite.
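The math is short enough to let Python do it. This sketch only counts the
32-digit hex values and checks the "aaaa"/"aaab" example. The count is
16^32 = 2^128: very large, but finite. Since strings of any length map
into that space, distinct strings can in principle share a value; it is
just extremely unlikely to happen by accident.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as 32 hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Number of distinct 32-digit hexadecimal values.
space = 16 ** 32
print(space)             # 340282366920938463463374607431768211456
print(space == 2 ** 128)

# Same input, same digest; a one-letter change gives another digest.
print(md5_hex("aaaa") == md5_hex("aaaa"))
print(md5_hex("aaaa") != md5_hex("aaab"))
print(len(md5_hex("aaaa")))
```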

If and only if
your information relating to that individual exactly matches mine will
the identifier you generate match mine - otherwise, we'd get different,
but still unique, values. If we were both to publish our data, tagged with
that id, we'd be reasonably assured, if the ids were identical, that we
were dealing with the same person and, in point of fact, with identical
information relating to that individual, with whatever consequences that
might suggest.

No, you're not assured of that. There's a place in my family tree where
there are two cousins, born a couple of years apart, with the same name.
One of my ancestors is the son of one of those two cousins. But WHICH
ONE? I've encountered research from different places listing both
possibilities. Both sides giving me their version swear their version is
the correct one. Both agree on the name of my ancestor's mother, but it's
unclear which cousin married that woman. Two descriptions of my ancestor,
one having his father as cousin A, the other having his father as cousin B,
would still end up with the same ID. The two cousins lived close together,
so the birth place of my ancestor, if known, isn't enough to resolve the
problem, unless perhaps you can get the birth place down to a specific
house.

But, see above. The only way you're going to get the same ID is if the
strings describing the cousins are exactly the same. If there is ANY
difference in their records, in the elements used - thus, in the strings
submitted to the hashing algorithm - they will not result in the same hash
value, guaranteed by the md5 algorithm. And, as I said, the data elements
I suggested for inclusion in the string-to-be-hashed were for the sake of
example - nothing else.

And that's what I _thought_, maybe better said hoped, the uids/uuids
generated by phpGedView or associated with some records on RootsWeb might
be. Clearly, though, they are not as I had hoped/thought and I'm left
wondering what value, if any, they have beyond being just another label.

A universal label that identifies the source of the information would
be of considerable use, but it doesn't remove the work of genealogy.

And I don't disagree with the latter part of the above. The only use I can
think of for uuids is as a flag that would tell me that we're both working
on the same person and that we've arrived at the same point in our
research WRT that person. That might be useful as a measure of confidence
but it certainly doesn't reduce the requirement for (further) research.
It really just says that our snapshots are the same.

Understood that
that's based strictly on the meanings of the words universal and
unique.

universal, adjective, modifies unique. Universal does *not* modify
"identifier".
However, if an identifier is universally unique ... Given the conditions
for generating an identifier outlined above, it would be both unique and
universal, insofar as no other similarly generated identifier would
exactly match it - there would be NO collisions, in other words, within
the universe of similarly generated identifiers.
<snip>
Now, granting my understanding is deficient, would somebody please
define just what IS meant by the phrase universally unique id? And, if it

It's (universally unique) ID, not (universal ID) and (unique ID).


On my computer, I can get a UUID by typing "uuidgen". It takes no
arguments. It makes no difference whether I intend to attach this
UUID to my father's birth certificate, to my father, to a USENET
news posting, or to one of my socks.

Technically, the UUID is generated from a MAC address of the computer,
a very fine-grained time stamp, a counter, and some other stuff. There
are other procedures to use if the computer doesn't have a MAC address.
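For comparison, a small sketch using Python's standard uuid module rather
than the uuidgen command. uuid1() is the time-and-node variety described
above and uuid4() is the purely random one; which of these a given uuidgen
produces depends on the system, so treat the pairing as an assumption.

```python
import uuid

# Version 1: timestamp, clock sequence and node (normally the MAC).
time_id = uuid.uuid1()

# Version 4: random bits. Like uuidgen, it takes no arguments and
# knows nothing about what the id will be attached to.
random_id = uuid.uuid4()

print(time_id, time_id.version)
print(random_id, random_id.version)
print(hex(time_id.node))  # 48-bit node value embedded in the id

# Two fresh ids differ, whatever they end up labelling.
another_id = uuid.uuid4()
print(random_id != another_id)
```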

See RFC 4122 for a complete description. The uuid you describe is just
one variety - in fact, the first of several. See particularly Para 4.3,
which discusses a uuid generated from an "arbitrary" string.

Yes, but it's supposed to be generated from a string that's already
unique - examples given include domain names, URLs, OIDs, etc.

You miss the point, I think. You first said - see your quote above - that
a uuid was comprised of certain information, implying that that was the
ONLY definition. And that's clearly not the case, as I point out. As for
the uniqueness of the strings, isn't that part of what we're talking
about? What data elements do we select to form the string which will be
hashed to give us our uuid?
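For reference, the name-based variety from RFC 4122 section 4.3 is
available in Python's standard uuid module as uuid3() (MD5) and uuid5()
(SHA-1). A minimal sketch follows; the "genealogy.example" namespace and
the sample strings are hypothetical, and the choice of data elements is
exactly the open question.

```python
import uuid

# Hypothetical namespace, derived from a made-up domain name.
GENEALOGY_NS = uuid.uuid3(uuid.NAMESPACE_DNS, "genealogy.example")


def name_based_id(canonical: str) -> uuid.UUID:
    """Version 3 (MD5) uuid for a string within the namespace."""
    return uuid.uuid3(GENEALOGY_NS, canonical)


first = name_based_id("smith|john|1950")
again = name_based_id("smith|john|1950")
other = name_based_id("smith|john|1951")

print(first, first.version)
print(first == again)  # same string, same uuid, on any machine
print(first != other)  # any change in the string changes the uuid
```

The namespace keeps these ids apart from ids that somebody else derives
from the same strings for an unrelated purpose.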

Senescent Ol' Bob

--
Robert G. Melson | Rio Grande MicroSolutions | El Paso, Texas
-----
A government big enough to give you everything you want is big
enough to take away everything you have. Thomas Jefferson