[time-nuts] Marrisons 1948 article

John Miles jmiles at pop.net
Sat Aug 6 13:07:12 UTC 2011


> On that subject, what do you use for that?
> 
> Personally I do something like this:
> - pdftohtml
> - index the html pages with mnogosearch
> - dump on server
> - the pdf's are now searchable through a web interface (and from command
> line obviously)
> 
> This works fine for pdf's that have embedded text, but it's obviously a
> no-go for scans that need OCR.
> 
> So basically the question is: do you know of any good open-source OCR
> software for the job?
> In the absence of better options I'll probably give tesseract-ocr a spin
> and see if it's any good for this.

I've been using a commercial package (http://pdftransformer.abbyy.com/) and
have been really happy with it in general.  It's slow, but it does a good job
even on marginally readable text.  I don't think I've ever needed to use it
in batch mode, but I believe there's a way to make that happen, and it will
be necessary since every article is in its own .PDF file.
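For the open-source route, a batch OCR loop built on tesseract and poppler's
pdftoppm might look roughly like this.  This is only a sketch, not something
tested against the actual archive; the directory layout and file names are
illustrative.

```shell
# OCR every scanned PDF in a directory into a sidecar .txt file.
# Assumes poppler-utils (pdftoppm) and tesseract are installed;
# paths and names are illustrative.
ocr_dir() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue             # no PDFs here: nothing to do
        base=${pdf%.pdf}
        pdftoppm -r 300 -gray "$pdf" "$base"  # rasterize pages at 300 dpi
        for page in "$base"-*.pgm; do
            [ -e "$page" ] || continue
            tesseract "$page" "${page%.pgm}"  # writes <page>.txt
        done
        cat "$base"-*.txt > "$base.txt"       # one text file per article
    done
}

ocr_dir .    # run over the current directory
```

Rasterizing at 300 dpi is a common starting point for OCR; scans of older
journal pages may need a higher resolution or some cleanup first.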

The wget process copies the HTML index pages as well as the .PDFs and fixes
up the links to point to the local copies, so that part is pretty easy to
deal with.  For my own copy of the archive, I'll probably merge all of those
index pages into one document so that all of the article titles can be
browsed on a single page.
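The mirroring step maps onto standard wget options; a sketch, shown as a dry
run so the command can be inspected (the URL is a placeholder, not the actual
archive location):

```shell
# --convert-links rewrites the saved pages so their links point at the
# local copies; --accept limits the crawl to index pages and the PDFs.
# URL is a placeholder for the real archive address.
cmd='wget --mirror --no-parent --convert-links --accept pdf,html http://example.org/archive/'
echo "$cmd"    # dry run: print the command instead of hitting the network
```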
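The index-merging step could be as simple as pulling the article links out of
every mirrored index page into one file.  A crude sketch, with a generated
sample page standing in for the real mirrored indexes:

```shell
# Merge the mirrored index pages into one browsable page by keeping
# only the article links.  A generated sample page stands in for the
# real mirrored files here.
work=$(mktemp -d)
cat > "$work/index1.html" <<'EOF'
<html><body><ul>
<li><a href="art1.pdf">Article one</a></li>
<li><a href="art2.pdf">Article two</a></li>
</ul></body></html>
EOF

{
    echo '<html><body><ul>'
    grep -h '<a href=' "$work"/index*.html   # crude: keep link lines only
    echo '</ul></body></html>'
} > "$work/all-articles.html"

grep -c 'a href' "$work/all-articles.html"   # prints 2
```

Line-based grep only works when each link sits on its own line, which is how
pdftohtml and most archive index pages lay things out; anything fancier would
want a real HTML parser.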

-- john, KE5FX




