Re: SEO technology for Copyright Patrol?



Roy Schestowitz wrote:

[snip]

I am not talking about unasked for links to a site, or garbled scraping,
merely direct, unauthorized copying of whole articles / major portions
of articles (expressed as a percentage, e.g. "URL xyz contains 67% text
identical to your URL jkl.")
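
One rough way to put a number like that "67%" on two pages is to compare overlapping runs of words ("shingles"). This is only a sketch under my own assumptions: the function names and the five-word window are made up for illustration, and the score says nothing about who copied whom or whether the shared text is a legitimate quotation.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping word runs ("shingles") of the given size."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def percent_copied(original, suspect, size=5):
    """Percentage of the suspect page's shingles that also occur in the original."""
    suspect_shingles = shingles(suspect, size)
    if not suspect_shingles:
        return 0.0
    shared = suspect_shingles & shingles(original, size)
    return 100.0 * len(shared) / len(suspect_shingles)
```

A larger window makes the measure stricter (fewer accidental matches on common phrases), a smaller one makes it more forgiving of light rewording.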

You cannot quantify such things easily, just as you cannot merge two
pieces of similar text. Try, for example, to forge together two 'forks' of
text which have been worked on by different individuals. To use a familiar
example, have you ever mistakenly edited some older version of a text that
you worked on, only to *later* realize that you had worked on an out-of-date
version? This is not the case with syntactic code, for instance, as
it can often be merged (CVS-like tool), much like isolated paragraphs in
text, which benefit from tools like 'diff'. Been there, (colleagues) done
that.

Okay, i see i was not clear enough. I am looking for a service that
provides a semi-automated version of what i have successfully done by hand.

VERSION ONE -- GOOGLE-BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. You may set me a limit of numbers of pages i can protect for
a given fee. Let's say you allow me 100 pages. My 250 keywords,
100 URLs, and the accompanying 100 check phrases are permanently logged
at your site (but can be changed by an "edit" function). The check
phrase for each URL is MY responsibility to choose and must be way
unique. Like, say (real example):

"contingent of spiritually-inclined folks who will not use common"

which is from
http://www.luckymojo.com/candle,agic.html

2) Your service bot goes to google (or -- see below for VERSION TWO, in
which it does not go to google, but rather to a google-generated
"personal cache") and it searches on the 100 check-phrases. In the real
life example above, my check-phrase turns up 4 matches. Two are at my
own domain (one is a weirdly garbled URL that i have no idea what it's
about, but probably some wacky symbolic link thingie that my husband
screwed around with) and thus are eliminated -- and the other 2 are not
at my domain and thus are potential cases of illegal copyright
infringement, and are logged at your web-based interface so i can view them.

3) The bot does a whois lookup on the two infringing domains -- in this
real-life example:

ausetkmt. com

freewill.tzo. com/~callista

and it logs the data in your web-based interface so i can view it.

4) The bot obtains, from a cache, three copies of a customizable
"friendly" (stage one) complaint letter and drafts them to each domain:
one to the owner, one to the owner's tech contact (in my experience
owners who plagiarize often claim inability to delete files as a reason
to avoid action; this works around their excuse-making), and one to the
domain's isp.

5) The bot generates a web-page based alert, displaying all information
about the infringing sites and notifying me that the draft "friendly"
complaints are ready to be sent.

6) At your web site, i can perform a personal check of the pages --
similar to the Wikipedia "diff" function and displayed the same way
(side by side) -- before i commit to sending the "friendly" complaint
letters or abort the send.

7) If i decide to send the "friendly" complaint letters, this action is
logged and dated and displayed at your web interface for my future
reference.

8) There is a "tickler" function that makes a re-check of any site to
which i have sent a complaint at one-week intervals. This informs me at
the web site whether the infringement is still up.

9) Decision fork:

9A) If the infringing page is gone, it is marked (in red) "Page No
Longer Online" but it stays in the system for access anytime i wish to
re-check my "History" with that domain (or my "History" in general).

9B) If the infringing page is still there, I am offered the option to
send a strongly worded "legal" (stage two) complaint and to print two
hard copies to be sent to the contact addresses for domain owner and
isp. (Subsidiary idea: keep the snail-mail copyright department
addresses of major isps -- and any isps ever contacted by the system --
on file, for they are usually difficult to track down and it would save
the client time having to look them up.)

10, 11, 12) Repeat steps 7, 8, and 9 for the "legal" complaint.

13) If the "legal" complaint generates no response, i am given the
option of sending a fully documented letter (with all relevant date
stamps and so forth from your service's history records) to google
informing them of the infringement and requesting them to de-list the
offending URL (or domain) from their SERPs. (Side-note: if the service
is well-publicized, google will probably agree to honour their
complaints. If three such services exist, they can form an Association
and google will definitely have to deal with them.)

14) This ends the service's responsibilities. For any further actions, i
must hire a lawyer.
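
To make steps 2 and 9 above concrete, here is a minimal sketch of the two pieces of logic involved: dropping hits at the client's own domain, and choosing what to offer after a weekly re-check. The function names, stage labels, and return strings are hypothetical, invented only for this illustration. The search itself, the whois lookup, and the letter drafting are left out, and the URLs are assumed to carry a scheme such as http://.

```python
from urllib.parse import urlparse

def external_hits(result_urls, own_domain):
    """Step 2: keep only hits that are NOT hosted at the client's own domain."""
    own = own_domain.lower()
    suspects = []
    for url in result_urls:
        # hostname is None when the URL has no scheme; treat that as external.
        host = (urlparse(url).hostname or "").lower()
        if host == own or host.endswith("." + own):
            continue  # the client's own page (or a subdomain of it)
        suspects.append(url)
    return suspects

def next_action(stage, page_still_online):
    """Step 9: decide what the interface offers after a re-check.

    `stage` is the last complaint sent: "friendly" or "legal" (assumed labels).
    """
    if not page_still_online:
        return "mark 'Page No Longer Online'"
    offers = {
        "friendly": "offer 'legal' (stage two) complaint",
        "legal": "offer documented letter to google",
    }
    return offers.get(stage, "no further action; client hires a lawyer")
```

Nothing here sends anything by itself: each return value is only an option shown to the client, which matches the point that the letters are committed manually.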

The point of my babbling is that you can never measure such things
reliably, let alone know their meaning. Statistics have their flaws. What
if in your text you cited (and linked to) an article and then provided
some long quote? A second site could do the same and unintentionally
assimilate to your content. The issue of copyrights and intellectual
property suffers tremendously nowadays. Bear in mind that apart from
Google Groups, there are at least half a dozen Web sites that copy the
*entire* content of this newsgroup, making it public.

This is all true, but not relevant. I am talking about webmasters who
build sites competing with my site's SERPs by deliberate copyright
infringement of my own copyright protected web pages. See above
scenario.

As for attention by a human, i would expect a design that offered me the
option to send automated cease and desist letters (customizable) to the
domain owner / tech rep and isp host copyright rep. Why would abuse@isp
. net become involved?

I was referring to the people employed by ISP to deal with abuse reports.
If they began to receive automated mail, there would be no barrier on the
amount of workload. This would also cast a shadow on abuse reports which
are submitted manually.

The letters would be submitted manually. See above.

[Google and Evil discussion tabled for another thread -- an interesting
subject and one worthy of conversation, but off-topic here.]

I would pay a yearly fee for such a service.

Does it exist?

I doubt it.

Could you design and market it?

*smile* I am not a businessman.

Could you design it?

If not, why not? (And can those restrictions be overcome?)

Such a tool would need to hammer a search engine quite heavily. How
would the search engine feel about it and what does the search engine
have to earn?

Well, a large (e.g. google) search engine could charge money for
the service.

This raises further questions. If that was the case:

-Could Google benefit from permitting plagiarism nests to exist?

No.

-Would people truly waste and invest money in fighting evil?

Authors and businesses invest in fighting copyright and trademark
infringement all the time. I spend many hours per year at the task. A
semi-automated web-based system would save me 100-plus hours per year
and a great deal of frustration. I would pay 250 dollars per year to
subscribe, maybe more. A sliding scale of pricing could allow for
different levels of examination based on varying the number of client
keywords / number of client pages handled.

This reminds me of the idea of pay-per-E-mail as means of preventing spam.

I don't see the similarity. I am talking about a web-based service to
which i could subscribe that would allow me to patrol the web for
copyright infringements.

Or, perhaps you could design a search engine to handle it in a way that
does not hammer google.

For instance, my field is occultism / religion / spirituality folklore.
I supply your bot with keyword terms -- say 250 of them -- from my site.
Your bot goes to google and collects the URLs for all sites ranking in
the top 200 for all those terms.

Then i submit my domain name to your bot. Your bot takes 1 page at a
time from my domain and searches all cached URLs it had retrieved
earlier from google. It then moves to my next page and repeats the
search.

*smile* You got greedy.

,----[ Quote ]
| Your bot takes 1 page at a time from my
| domain and searches all cached URLs
`----

The practicality of search engines is based on the fact that you index
sites off-line. You can't just go linearly searching for duplicates. The
least you can do is find pointers to potential culprits by using the
indices. I guess I have missed your point though. If you are talking about
surveying and analysing top pages for a given search phrase, how far
should you go? There are infinitely many search phrases.
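
The difference between searching linearly and "using the indices" can be sketched with a toy inverted index: each run of ten consecutive words is mapped, once and off-line, to the pages containing it, so that a ten-word check phrase becomes a single dictionary lookup. This is an illustration under my own assumptions (hypothetical function names, pages held in memory as plain text), not a description of how any real search engine stores its index.

```python
import re
from collections import defaultdict

def words(text):
    """Lowercase word list; punctuation and hyphens act as separators."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(pages, size=10):
    """Off-line step: map every run of `size` words to the URLs containing it.

    `pages` is a dict of {url: page_text}.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        w = words(text)
        for i in range(len(w) - size + 1):
            index[tuple(w[i:i + size])].add(url)
    return index

def lookup(index, phrase):
    """On-line step: URLs containing the phrase, found without rescanning pages.

    The phrase must have exactly as many words as the index window size.
    """
    return set(index.get(tuple(words(phrase)), set()))
```

The cost moves to building the index, which grows with the total amount of text indexed; each later lookup is cheap no matter how many pages are covered.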

That is true -- and that is why, when you spoke of "hammering google," i
theorized another, less google-intensive way to do the job. Here is how
i envision it working with a non-google-hammering web interface, relying
only peripherally on google to generate the initial batch of
information.

VERSION TWO -- PERSONAL CACHE BASED

1) I submit my top 250 keywords to your web interface. I also submit one
ten-word sentence fragment (a "check phrase") for each URL i am
protecting. My 250 keywords, 100 URLs, and the accompanying 100 check
phrases are permanently logged at your site (but can be changed by an
"edit" function). The check phrase for each URL is MY responsibility
to choose and must be way unique.

2) Your bot goes to google only ONCE for each of those 250 keywords, finds
the top 200 results for each keyword, and caches them offline. 200 x 250
= 50,000 pages -- but there will be duplications of common terms, so,
with duplication eliminated, we might theorize that those 50,000
potential URLs will actually reduce down to 25,000 pages. Whatever the
number, that would be my personal index cache at your service.

3) If a trial proved that the above numbers were unworkable, we could
limit my input to 100 keywords x top 100 results at google per keyword.
This would result in 10,000 pages, which, with duplication eliminated,
might reduce to 5,000 pages.

4) Levels of payment could be arranged for a 100 / 100 search or a 250 /
200 search or whatever other arrangements you deemed feasible. Thus
clients would pay for the amount of breadth and depth of search -- and
the amount of cache space at your end -- that they required.

5) I could, at a specified interval -- say once a month -- rewrite my
250 (or 100) keywords. In any case, the 200 (or 100) top results for
each keyword would be automatically updated at google once a month (or
every three months, if that is easier.)

6) When i submit my ten-word check-phrases, your bot does not return to
google, but rather searches my personal index cache.

7) I believe that this system would be sufficient (and better than
hammering google) because my MAJOR goal is to eliminate successful
competitors for SERPs, and other, less successful plagiarists, are of far
lesser concern. A button at the web site that initiates a once-a-year
sweep of all google cached pages (as opposed to all of my personal
indexed cache pages at your service) would be sufficient to eliminate
the low-level plagiarists.
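
Steps 2 and 6 of this version can be sketched as two small functions: one merges the per-keyword result lists into a single de-duplicated "personal index cache", the other scans the cached text for each check phrase without going back to google. The names and data shapes are assumptions made for the sketch. Fetching the results and page text is not shown, and how far 50,000 raw results really shrink after de-duplication can only be found by trial.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def merge_results(results_by_keyword):
    """Step 2: union of all per-keyword result lists (duplicates collapse)."""
    urls = set()
    for hits in results_by_keyword.values():
        urls.update(hits)
    return urls

def scan_cache(cached_text_by_url, check_phrases):
    """Step 6: return {protected_url: [cached URLs containing its check phrase]}.

    `check_phrases` maps each protected URL to its check phrase.
    """
    matches = {}
    for protected_url, phrase in check_phrases.items():
        needle = normalize(phrase)
        found = [url for url, text in cached_text_by_url.items()
                 if needle in normalize(text)]
        matches[protected_url] = sorted(found)
    return matches

# Upper bound on cache size before de-duplication: keywords x results each.
RAW_UPPER_BOUND = 250 * 200  # 50,000
```

Hits at the client's own domain would still need to be filtered out of the matches, exactly as in VERSION ONE.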

Your bot also updates its cache from google's top 200 once a month and i
can change the keywords i want it to cache as well, on its next update.

That leaves gaps for misuse. Any control that is given to the user over
indexing, keywords and the like is bound to break. This must be the reason
why search engines ignore meta data and will never have second thoughts.

I disagree. This is a service that the user pays for and as long as the
interface is clear, clean, and functional, it is the service's
responsibility to automate certain tasks and the user's responsibility to
authorize the implementation of the semi-automated tasks.

I really do think this is a useful commercial service just waiting to
happen. I look forward to your further comments, as you are one of the
few people i know in the world who can discuss these matters at all, as
well as being kind to those who, like me, are merely logical thinkers
and not actually computer programmers.

cat yronwode