[time-nuts] Marrison's 1948 article

Sat Aug 6 11:22:12 UTC 2011

John Miles <jmiles at pop.net> wrote:

> None of the articles appear to be text-searchable, unfortunately, so that'll
> take a few kilowatt-hours of CPU time to fix.

On that subject, what do you use for that?

Personally I do something like this:
- convert the PDFs with pdftohtml
- index the HTML pages with mnogosearch
- put the result on a server
- the PDFs are now searchable through a web interface (and from the command line, obviously)
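
The conversion step above can be sketched roughly as follows. This is a minimal sketch, assuming poppler's pdftohtml is on the PATH; the helper names (pdftohtml_command, convert_all) and the directory layout are made up for illustration, and the mnogosearch indexing step is left out.

```python
import subprocess
from pathlib import Path


def pdftohtml_command(pdf: Path, out_dir: Path) -> list[str]:
    """Build the pdftohtml command line for one PDF (does not run it)."""
    # -noframes: write one HTML file instead of a frameset
    # -i: skip image extraction, since only the text gets indexed
    # -q: suppress messages
    # The last argument is the output base name; pdftohtml adds ".html".
    return ["pdftohtml", "-noframes", "-i", "-q", str(pdf), str(out_dir / pdf.stem)]


def convert_all(pdf_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every *.pdf in pdf_dir (hypothetical layout) to HTML in out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        subprocess.run(pdftohtml_command(pdf, out_dir), check=True)
        converted.append(out_dir / (pdf.stem + ".html"))
    return converted
```

Something like convert_all(Path("pdfs"), Path("html")) would then produce the HTML tree that the indexer gets pointed at.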

This works fine for PDFs that have embedded text, but is obviously a no-go for scanned PDFs that need OCR first.
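
To sort PDFs into "has embedded text" and "needs OCR", one rough approach is to run pdftotext on the first few pages and see whether anything comes out. This is only a sketch, assuming poppler's pdftotext is installed; the function names and the 50-character threshold are arbitrary choices for illustration.

```python
import subprocess


def looks_searchable(extracted: str, min_chars: int = 50) -> bool:
    """Guess whether text extracted from a PDF is real embedded text."""
    # Scanned PDFs typically yield only whitespace and form feeds,
    # so count the non-whitespace characters against a threshold.
    return sum(not ch.isspace() for ch in extracted) >= min_chars


def extract_text(pdf_path: str, last_page: int = 3) -> str:
    """Run pdftotext on the first few pages and return the text."""
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-q", "-l", str(last_page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A file for which looks_searchable(extract_text(path)) is false would go to the OCR pile.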

So basically the question is: do you know of any good open source OCR software for the job?
In the absence of better options I'll probably give tesseract-ocr a spin and see if it's any good for this.
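
For what it's worth, a tesseract trial run could look something like the following. This is an untested sketch, assuming pdftoppm (poppler) and tesseract are both on the PATH; the helper names, the 300 dpi setting and the "eng" language are assumptions, not recommendations.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    """Build the pdftoppm command that renders each page to a grayscale PNG."""
    return ["pdftoppm", "-r", str(dpi), "-gray", "-png", str(pdf), str(prefix)]


def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    """Build the tesseract command for one page image."""
    # The second argument is the output base; tesseract appends ".txt".
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf: Path, work_dir: Path, lang: str = "eng") -> str:
    """Rasterize a scanned PDF and OCR it page by page, returning the text."""
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, work_dir / pdf.stem), check=True)
    pages = []
    # pdftoppm names its output <prefix>-<page>.png with zero-padded page
    # numbers, so a plain sort keeps the pages in order.
    for image in sorted(work_dir.glob(pdf.stem + "-*.png")):
        subprocess.run(ocr_command(image, lang), check=True)
        text_file = image.with_suffix(".txt")
        pages.append(text_file.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(pages)
```

The returned text could be written out as HTML or plain text and fed to the same indexer as the pdftohtml output.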

regards,
Fred