Download a public domain book spread out over 53 HTML pages as JPEGs

8 replies
Trying to download a public domain book spread out over 53 HTML pages as JPEGs.

I came up with a clunky way to gather up the JPEGs so that I can then run them through OCR, but thought I'd check here to see if anyone knows of a quick, easy solution.

Unfortunately, I don't always take the most direct path when I'm trying to do things; hopefully some of you do.
  • DebraConrad
    The only “quick, easy” solution for this is to pay someone else to do it ; )

    It’s gonna take some labor no matter which way you go.

    First, I would check to make sure the book isn’t available elsewhere in a format that’s easier to work with.

    If not, you have to download all the JPEG page scans no matter what. You could run OCR on each page, but that is very time consuming.

    Here’s the way I used to do it; it’s much faster and makes the computer do most of the work.

    1) Download all of the jpeg page scans to a folder, make sure they are titled in numerical page order

    2) Select all images, right click, and choose “Combine in Adobe Acrobat”. This will convert all of the images into one PDF with all of the pages in order from beginning to end (you could stop right there and have an ebook if you didn’t want to do any editing)

    3) Then run the PDF through ABBYY PDF Transformer; this will perform OCR and output the book in Microsoft Word format

    4) Proofread and correct the Word document

    5) Convert back to pdf, done
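
Step 1 above depends on the scans sorting in page order, and plain alphabetical sorting puts "scan10" before "scan2". A minimal sketch of one way to handle that, assuming each file name contains its page number as the last run of digits; the `page_` prefix, the three-digit padding, and the function names are illustrative choices, not anything from the thread:

```python
import re
from pathlib import Path


def page_number(name):
    """Return the last run of digits in the file name as an int, or None."""
    digits = re.findall(r"\d+", Path(name).stem)
    return int(digits[-1]) if digits else None


def padded_names(names, width=3, prefix="page_"):
    """Map each original name to a zero-padded name, listed in page order.

    Names without any digits are skipped rather than guessed at.
    """
    numbered = [(page_number(n), n) for n in names if page_number(n) is not None]
    return {
        old: f"{prefix}{num:0{width}d}{Path(old).suffix.lower()}"
        for num, old in sorted(numbered)
    }


def rename_scans(folder, width=3):
    """Rename every .jpg/.jpeg in `folder` so file managers sort it by page.

    Assumes no existing file already uses one of the target names.
    """
    folder = Path(folder)
    names = [p.name for p in folder.iterdir()
             if p.suffix.lower() in {".jpg", ".jpeg"}]
    for old, new in padded_names(names, width).items():
        (folder / old).rename(folder / new)
```

Calling `rename_scans("scans")` on a hypothetical `scans` folder would leave files such as `page_001.jpg`, `page_002.jpg`, and so on, which then combine in the right order in step 2.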
  • Doug Slaton
    Pay someone...that's the smart choice for sure.

    Will check to see if it's 'out there' somewhere.

    Hadn't thought too much about the conversion; I was more challenged by downloading all the different JPEGs across all the different pages. Thought I'd write a macro to do that, but wanted to avoid reinventing the wheel if possible.

    Experimented with HTTrack but never did get the filtering quite right, so I would've ended up downloading way too much of the site. Guess that's not all bad, since the site's stuffed with public domain goodies.

    I like the Acrobat idea; that seems quick and painless.

    Thanks again for your suggestions.
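
As an alternative to tuning site-mirroring filters, a short script can pull only the image links out of each page. This is a rough sketch using only the Python standard library. It assumes each HTML page embeds its scan in an ordinary `<img>` tag whose `src` ends in `.jpg` or `.jpeg`, which the thread does not confirm; `JpegCollector`, `jpeg_urls`, and `download` are hypothetical names, and the base URL in the usage note is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class JpegCollector(HTMLParser):
    """Collect absolute URLs of <img> tags that point at JPEG files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Ignore any query string when checking the file extension.
        if src and src.lower().split("?")[0].endswith((".jpg", ".jpeg")):
            # Resolve relative paths against the page's own URL.
            self.urls.append(urljoin(self.base_url, src))


def jpeg_urls(html, base_url):
    """Return the JPEG URLs found in one page of HTML."""
    parser = JpegCollector(base_url)
    parser.feed(html)
    return parser.urls


def download(url, dest_path):
    """Save one image to disk (network access required)."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        out.write(response.read())
```

For example, `jpeg_urls(page_html, "https://example.org/book/pg.1/")` would return only the JPEG links from that page, so nothing else on the site gets downloaded.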
  • Jesus Perez
    Check out Amazon Mechanical Turk (MTurk). Upload each JPG for manual transcription for a small fee, then compile the results.

    Or stick to OCR.

  • Doug Slaton
    @Chris Kent
    Here's the 'starting gate' for plenty o' public domain:
    Rare Book, Manuscript, and Special Collections Library
    Category Descriptions

    The item I'm interested in is here:

    but the pics for it are here:

    Their directory structure is non-standard, or at least not laid out the way I would normally expect. When I saw their structure, it seemed like a piece of cake to dive into the images folder, but not so: I get access denied all along the directory tree. The only way to a pic is with the full URL.

    @Jesus Perez
    Hadn't thought about Mturk - good idea there.
  • Doug Slaton
    The book is mine at this point, thanks to everyone in the thread, but I'm still curious about how the image URLs were gathered.

    It is the same book - thank you. Internet Archive should've been my first stop. Just like Ms. Conrad said: "First, I would check to make sure the book isn’t available elsewhere."

    @Chris Kent
    And now the million dollar question: How did you gather up those image URLs? It's great you snagged them all; did you find an automated or semi-automated way to get 'em?

    I tried 'ripping' them from the page source, but all I saw there was page numbers, like this: /eaa.Q0042/pg.x/
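
If the pages really do follow the `/eaa.Q0042/pg.x/` pattern, the list of page addresses can be generated rather than scraped. The sketch below is a guess at that approach, not how the URLs in this thread were actually gathered. It assumes `x` is a plain page number running from 1 to 53; the host name is a placeholder, since the real site address is not given here.

```python
def page_urls(base, item="eaa.Q0042", pages=53):
    """Build one URL per page from the /<item>/pg.<n>/ pattern.

    `base` is the site's root URL; numbering from 1 is an assumption.
    """
    root = base.rstrip("/")
    return [f"{root}/{item}/pg.{n}/" for n in range(1, pages + 1)]


if __name__ == "__main__":
    # Placeholder host; substitute the real site root before use.
    for url in page_urls("https://example.org"):
        print(url)
```

Each generated address could then be fetched and searched for its full image URL, which fits the observation that the images are only reachable by their complete address.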

    Once again, thanks to all who posted.
