Re: Client-side search engine capable of indexing .pdf files is needed.



On 9/4/09 1:11 AM, Stefan Weiss wrote:
On 03/09/09 22:15, SAM wrote:
On 9/3/09 3:43 PM, Stefan Weiss wrote:
On 03/09/09 14:56, SAM wrote:
I am not sure I understand what you mean by indexing files.
If it is only a matter of listing the names of the PDF files stored in a folder (on the CD), the browser must be able to display that.
Then in that window there is certainly a search button, no?
I doubt gordom was interested in a list of file names. Creating a full
text index is quite a bit more complex than simply listing directory
contents. <http://fr.wikipedia.org/wiki/Indexation>
Well, who will choose the terms to index?
Who will build, for each file, its own array of terms?
Who will build the links for each term (to the files and inside them)?

The indexer will do all of that.

Once the data are complete and stored in an object (or a simple array), I suppose that most of the job is done.

Not necessarily. You need both parts for an efficient search engine: the
index and the lookup algorithm. The index lookup needs to be fast, and
able to sort the results in a meaningful way.

<http://cjoint.com/?jdvO4bUE6Q> 1500 items
(without index ... not in SANstore)

| var liste = [
| '00.htm',
| '000.htm',
| '0000000000000001.txt',
| '001.htm',
| '12-1.gif',
| '20-100_100tre.htm',
| '20-100_100tre2.htm',

That's just a list of file names again, not a full text index. It has
only 1500 entries, which isn't even close to what we're dealing with here.

It has 1500 entries; will the CD contain more than 1500 files?
These simple entries could just as well have been lines of a CSV file, each line being a record for one file with its name, date, list of indexed terms, short introduction ...

I didn't understand the "not in SANstore" part - how is that relevant?

I don't have a more complicated example in stock (in store? in SAM's shop).
If you have one, I'll be glad to see it.

Searching for one or more terms in this list is very fast because we only have to keep each line containing one of the terms: a single loop over the 1500 lines (or entries). The resulting list of files, expected to be relatively short, can then easily be manipulated to show what is wanted.
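
To make the single-loop idea concrete, here is a minimal sketch. It assumes a `liste` array like the one quoted above (only the seven entries shown in the thread are reproduced) and a hypothetical helper named `searchList`; matching is a plain case-insensitive substring test, which is an assumption and not necessarily what the linked page does.

```javascript
// Sample entries copied from the list quoted in the thread;
// the real list is said to hold about 1500 items.
var liste = [
  '00.htm',
  '000.htm',
  '0000000000000001.txt',
  '001.htm',
  '12-1.gif',
  '20-100_100tre.htm',
  '20-100_100tre2.htm'
];

// Hypothetical helper: keep every entry that contains at least one
// of the whitespace-separated search terms (case-insensitive).
function searchList(entries, query) {
  var terms = query.toLowerCase().split(/\s+/).filter(function (t) {
    return t.length > 0;
  });
  if (terms.length === 0) {
    return [];
  }
  // One pass over the entries, as described above.
  return entries.filter(function (entry) {
    var lower = entry.toLowerCase();
    return terms.some(function (t) {
      return lower.indexOf(t) !== -1;
    });
  });
}
```

Calling `searchList(liste, 'gif txt')` keeps the `.txt` and `.gif` entries in their original order. The cost grows with the number of entries times the number of terms, which is fine for a list of file names but not for full text.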

As for indexing the terms found in the files, I suppose we can have an array of them:
terms = [
'add 12 125 956',
'addition 1 8 274 315 977 1235',
...
where the numbers are the indexes of the matching files, which are stored in another array.
This method ought to be faster.
Maybe it takes more room in memory? Not sure.
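
A rough sketch of how such a term array could be used, under several assumptions: the `files` array and its names are invented for illustration, the file numbers are smaller than in the example above, and `buildIndex` and `lookup` are hypothetical helper names. The lookup requires every word of the query to be present (AND semantics), which is a design choice and not something settled in this thread.

```javascript
// Hypothetical data: file names, and 'term n1 n2 ...' lines whose
// numbers are positions in the files array.
var files = ['intro.pdf', 'sums.pdf', 'tables.pdf', 'notes.pdf'];
var terms = [
  'add 0 2',
  'addition 1 2 3',
  'table 2'
];

// Parse each line once into a lookup table: term -> array of file numbers.
function buildIndex(lines) {
  var index = Object.create(null);
  lines.forEach(function (line) {
    var parts = line.split(' ');
    index[parts[0]] = parts.slice(1).map(Number);
  });
  return index;
}

// Return the names of the files that contain every word of the query.
function lookup(index, fileNames, query) {
  var words = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (words.length === 0) {
    return [];
  }
  var hits = null;
  words.forEach(function (word) {
    var ids = index[word] || [];
    hits = hits === null
      ? ids.slice()
      : hits.filter(function (id) { return ids.indexOf(id) !== -1; });
  });
  return hits.map(function (id) { return fileNames[id]; });
}
```

With `var index = buildIndex(terms);`, a call such as `lookup(index, files, 'add addition')` only touches the two matching term entries instead of looping over every file. The memory cost is one entry per distinct term plus one number per (term, file) pair.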

Regarding your other post: Spotlight is only available on OSX, and
(AFAIK) doesn't have a JavaScript front-end. It may be possible to burn
its index to a CD, but without the Spotlight executable, that won't
help much.

At least that could be a solution for a specific environment ;-)
<http://www.apple.com/downloads/macosx/home_learning/deliciouslibrary.html>

TNO's suggestion has a similar problem: it requires WSH to be installed
and accessible from an HTML page (unlikely). It will be awfully slow as
well, because each search will have to read the complete contents of the

I suppose that it would be better to have all the content loaded into memory.

CD. And then it probably won't find "à bientôt" because the source
encoding doesn't match the search encoding.

Once regular expressions stop treating \w as ASCII characters only and cover more complete character sets, perhaps we will be able to match non-English words more seriously (or easily), even if the search functions were written by an illiterate guy from the US.
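
The \w limitation can be seen in a short sketch. The second pattern relies on Unicode property escapes with the `u` flag, which are an assumption here: they exist in current JavaScript engines but were not available when this thread was written. The sample sentence is invented, and `normalize('NFC')` is used so that accented letters are single code points whatever the source encoding produced.

```javascript
// Invented sample text, normalized so that "à" and "ô" are single code points.
var text = 'Salut et à bientôt'.normalize('NFC');

// \w only covers [A-Za-z0-9_], so accented letters break words apart.
var asciiWords = text.match(/\w+/g);

// With the "u" flag, \p{L} matches any Unicode letter and \p{N} any digit.
var unicodeWords = text.match(/[\p{L}\p{N}_]+/gu);
```

Here `asciiWords` splits "bientôt" into "bient" and "t" and drops "à" altogether, while `unicodeWords` keeps both words whole. Normalizing the query the same way as the indexed text avoids one kind of encoding mismatch, though it does not help if the files were decoded with the wrong charset in the first place.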

JSSINDEX still looks like the way to go (didn't test it, though). BTW, I
just checked, Lush is available as Debian and Ubuntu packages. If there
aren't any other requirements, getting the indexer to work should be a
piece of cake.

Something in Ruby?
<http://books.google.fr/books?id=OBhAuww-OokC&pg=PA137&lpg=PA137&dq=ruby+file+indexer&source=bl&ots=2yh2lSt1bK&sig=0vjYl4cMJ-3PxayHwg0YJOGYnbk&hl=fr&ei=t1ugSr24Ac74-QaGqsD0Dw&sa=X&oi=book_result&ct=result&resnum=8#v=onepage&q=&f=false>

--
sm