Re: Capturing genealogy data from websites



singhals wrote:
I dunno. I'd "view with alarm" something that made it easier for someone to -- ummm, appropriate? the content of one of my websites. Seems a little like helping oneself before being asked?

It would depend on the format of your website but this does raise the question of "why a website?".

I was brought up in the old-fashioned academic environment that took the view that you published stuff, at least in part, because you want to share it with others. If you don't want to share, don't publish.

Take for instance http://familytree.dearnley.com/reports/ian_goddard_research1.htm (Mark is being overgenerous in that quite a lot of it is his work or duplicates what he'd done). I did quite a lot of this before deciding that my James Dearnley (http://familytree.dearnley.com/reports/ui63.htm) didn't fit into it (any info on the origins of the James at the root of this would be gratefully welcomed BTW). However it seemed a shame not to share that and other stuff on some of Mark's pages. Likewise stuff I've posted to various Genforums family pages.

Nevertheless it would be difficult to import anything from pages such as those I've just referenced; you'd have to try and parse the HTML.

It's also the case that a site like this doesn't set out to be a systematic presentation of original registers etc. To use it you'd have to go to the likes of Familysearch, FreeBMD, Ancestry or whatever & from there to the originals. In other words what Mark's site and discussion forums share isn't so much raw data as thinking.

OTOH Familysearch, FreeBMD, Ancestry or whatever do set out to provide raw data. Inevitably it will have shortcomings in transcription, in geographic background (e.g. /every/ Almondbury and Kirkburton baptism of potential interest has to be checked to see if it was really there or at a chapel), etc. But ISTM that it's better to expend one's efforts in refining electronically transferred copies than in introducing extra errors by retyping. That requires some means of passing the grunt work to the machine. In the past Familysearch has offered Gedcom downloads to do that, which brings us to...

And, for the record, nFS apparently DOES have a way to d/l stuff. I haven't tried it, I haven't even looked for it too hard, but I understand It's In There.

I asked about this & was told:

The following is a copy of the policy on Gedcom files.
“FamilySearch Website (www.familysearch.org): December 2010
Uploading and Downloading GEDCOM Files
Currently, the FamilySearch website does not support the uploading and downloading of GEDCOM files. The uploading of GEDCOM files is planned for a future release. There are currently no plans to allow the downloading of GEDCOM files.”

They didn't mention downloading in any other format so I'm assuming, maybe wrongly, that if they were planning some other download format they'd have mentioned that.

And, not to throw off on your programming skills or anything, chances are good that a certain percentage of the users of your program won't like the way it works any more than they like the way GEDCOM works.

That's why some of us like open source S/W and the way in which, at least in theory, a community of users can help shape it.

If nothing else, your plan is predicated upon the premise that the data will be CONSISTENTLY presented in a certain format. That pretty much assumes that format will fit all scenarios...one of the major complaints about GEDCOM.

The only requirement of the data capture itself is that the data is in the form of repeated lines of

name_of_some_field : value_of_that_field

I like to make a minimum number of assumptions and that's about the bare minimum. Given that, it will produce well-formed, i.e. syntactically correct, XML. Capturing, say, FreeBMD data would be out of scope as would capturing the 1881 British census household records from nFS. Nevertheless, even if it only covered nFS BMD records that could save a lot of work.
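To make the idea concrete, here is a minimal sketch in Python of turning such lines into XML. It is an illustration under stated assumptions, not the program under discussion: the function names, the "record" root element, the rule for building element names and the sample values are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET


def element_name(field):
    """Turn a field label into a usable XML element name (assumed rule)."""
    name = re.sub(r"[^A-Za-z0-9_]", "", field)
    # XML names may not start with a digit, so prefix those (and empty names).
    if not name or name[0].isdigit():
        name = "_" + name
    return name


def capture(text, root_tag="record"):
    """Build one XML element from repeated 'field : value' lines."""
    root = ET.Element(root_tag)
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        field, sep, value = line.partition(":")
        if not sep or not field.strip():
            continue  # not a field line; skip it
        child = ET.SubElement(root, element_name(field.strip()))
        child.text = value.strip()
    return root


# Made-up sample values, purely for illustration.
sample = """Name : John Smith
Event : Baptism
Father's name : William Smith"""

print(ET.tostring(capture(sample), encoding="unicode"))
```

Because ElementTree does the serialising, the output is well-formed whatever the values contain; characters such as & and < in a value are escaped automatically.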

The reasons for choosing XML as the output are (a) it becomes an easy job and (b) the technology (XSLT) exists for converting it into other formats including other XML formats.

Clearly births/baptisms, marriages & deaths/burials all have their individual requirements & the XSL stylesheets will have to handle that variation.

Because the original data format is quite flat, variations in field order should, AFAICS, be tolerated by the XSL. Some variations in spelling of field names will be flattened out, e.g. Father's name, Fathers name and Father's Name will all become fathersname as an element name. At least they will as soon as I go back to the code to make it cast them all to lower case. A format which replaced Father's name by simply Father as a field name would have to be explicitly handled. If some site introduced a new concept, say military rank, this would be silently ignored unless the stylesheet were altered.
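A rough Python sketch of that flattening, again only as an illustration. The alias table and the set of handled names are hypothetical stand-ins for what a stylesheet would actually recognise; nothing here is taken from the real code.

```python
import re

# Hypothetical alias table: labels that differ by more than case or
# punctuation have to be mapped by hand.
ALIASES = {"father": "fathersname"}

# Hypothetical set of element names the stylesheet knows about.
HANDLED = {"name", "fathersname", "mothersname"}


def normalise(label):
    """Lower-case a field label and drop everything except a-z and 0-9."""
    key = re.sub(r"[^a-z0-9]", "", label.lower())
    return ALIASES.get(key, key)


def handled_fields(record):
    """Keep only fields whose normalised name is in HANDLED; drop the rest."""
    result = {}
    for label, value in record.items():
        key = normalise(label)
        if key in HANDLED:
            result[key] = value
    return result


for label in ("Father's name", "Fathers name", "Father's Name", "Father"):
    print(label, "->", normalise(label))
```

Note that "Father" only ends up as fathersname because of the explicit alias, and a field such as "Military rank" would simply vanish from the output of handled_fields until someone added it to the handled set.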

One of the real problems is that the nFS format combines forenames & surnames. Differentiating between multiple forenames & double-barrelled surnames is never going to go right. Another is going to be variations of date format between source and target.
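Both problems can be seen in a small Python sketch. The list of date formats is an assumption (real sources may use others), the last-word rule for surnames is a deliberately naive guess, and the month-name formats rely on an English locale.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; a real source may use others.
DATE_FORMATS = ("%d %b %Y", "%d %B %Y", "%Y-%m-%d", "%d/%m/%Y")


def to_iso(text):
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_name(full_name):
    """Guess (forenames, surname) by treating the last word as the surname."""
    parts = full_name.split()
    if not parts:
        return ("", "")
    return (" ".join(parts[:-1]), parts[-1])


print(to_iso("12 Mar 1823"))
print(to_iso("about 1823"))
print(split_name("Mary Ann Smith"))
print(split_name("John Lloyd George"))
```

The last call shows the difficulty: the guess comes back as forenames "John Lloyd" and surname "George", and nothing in the combined string says whether "Lloyd George" was meant as an unhyphenated double-barrelled surname. Similarly a date such as 03/04/1823 parses happily as day/month/year here whether or not the source meant month/day.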

Nevertheless experience has taught me that this overall type of architecture can be surprisingly flexible.

On a philosophical and/or theoretical level, it'll be an interesting project, though.

I refer the Hon. lady to my earlier answer re open source.

(g) Lemmeno when you need someone to crash it.

Watch this space.

--
Ian

The Hotmail address is my spam-bin. Real mail address is iang at austonley org uk.


