Did Google Just Give a Million Reasons for Publishers to Opt Out?
In a way, it worked to say there were 500k free books out there for users of the BN Store or the Sony devices. Both are hard to use and sort through, and the idea never met the reality.
Err, now you can access those Google files from anywhere.
They’re not good. They’re not just not good in comparison with, say, the last 20k Project Gutenberg files released, they’re also not good in comparison with the last half-million or so Archive editions. And it’s why they’re not good that’s so worrisome.
Neither the Archive’s files nor Google’s can be read natively. You have to do what we’ve all been doing since Abbyy FineReader 6: dump the .pdf into a bunch of page images, run OCR, proof the thing. Quick ‘n dirty for oneself, or something more thorough for you people.
For post-processing purposes, the initial quality of Google’s scans is so off. Just so–
–all right, disclaimer. Right now. I live near Washington, DC. In the DC area, there are places that seek consulting assistance on scanning projects for stuff that you can’t just send off to the Philippines. The offices have names like “Document Control,” are required by various regulations, and involve things like high-end workstations and servers cabled down without internet access. For the record, I like money, and there are people less qualified than me you could talk to about spending hours scanning things in your basement, but I don’t go seeking to work for other people, and most of what I’ve seen has been overkill on security, considering…
–anyway, there are basic things Google could have done with their scans so that, apart from the disaster pages, books wouldn’t lose every period after a t And quotes and di and numbers and training and–I could have told them these things. More to the point, Project Gutenberg could have told them these things. Charles Franks, Juliet Sutherland? They’d've loved to help, but they probably wouldn’t have passed the 11-interview hiring process or whatever it is now. Just… tiny things to make a source scan 10 times better.
And, instead of doing mundane, low-rent things to save a ton of work with each scan, Google’s got ideas about improving the OCR software (really? source image is key), and this thing with patents where it senses slight impressions on the paper (really?) to yield perfect results (really???)
Don’t get me wrong, a lot of innovations are happening in book processing these days that will bring the cost of good-quality digitization way down (at least for, ahem, literary titles–smut’s smut), but Google’s going live with a product that clearly isn’t ready, or even close to being ready, while asking publishers to sign on to a deal where Google gets to sell their books looking like… ick.
Personally, I’m sorta rooting for wrapped OEBPS files with Google DRM–you know those’ll have the best logo. But when you think about it, GBS’s change over time mirrors that of Epub itself: mission creep that the product is wholly unready for.
I mean, you have OEBPS that converts great to whatever proprietary format for your device. You hand it off in folders. OK, now put a wrapper on it! Yay, everyone agrees, but Adobe’s having trouble making it work. Well, fine, now, let’s make that wrapper the end product…. it’ll work… IT’LL WORK!
Google Book Search starts as a way for Google to maybe make some money off powerful datamining software with some ads. Smart, cool, genius. Err, money’s not there. OK, let’s scan in-copyright books. Err, major objections, lawsuits! Oh, and money’s not there. Fine, we’ll print our scans! All of them! (Trust me, money’s not there. Or, just ask Kessinger.) Well, we’ll give away ebooks (money’s not there), and, later, sell ebooks! (Your money’s not gonna be there if the books-for-sale look like that.)
Expect new objections in the settlement to be filed on quality issues.
Tags: .Epub, BN, Charles Franks, Devices, DRM, Ebooks, Google, Kindle, Scanning, Sony
August 27th, 2009 at 1:16 am
david said:
> I could have told them these things.
> More to the point, Project Gutenberg
> could have told them these things.
> Charles Franks, Juliet Sutherland?
charles franks?
juliet sutherland?
oh dear, david, you haven’t kept up, at all, have you?
i have documented quite clearly — and repeatedly –
how woeful the distributed proofreaders workflow is.
_especially_ in the arena of cleaning the o.c.r. results.
juliet stuck her head in the sand to avoid hearing it.
(and goodness, charles hasn’t been around in years.)
the thing is, you can improve that shitty o.c.r. nicely,
with after-the-fact processing, and it’s not even hard.
it’s complicated -- there are lots of steps involved --
but none of them are difficult or challenging steps…
so there’s no question google will be able to do it.
and probably already has. (that’s my best guess.)
but no, of course they’re not gonna give us good text.
not yet, anyway. this is just an interim step for them,
another “good deed” they hope will garner support for
their settlement, so they can rip us off big-time later.
-bowerbird
August 27th, 2009 at 9:03 am
Err, I’m working with books from ‘08-’09 now on Google’s site. Mostly, it’s been Leblanc. The best-quality scans for OCR I’ve seen have come from Harvard’s library. We’re not getting handprints on the images anymore, but it’s still horrible compared to anyone else…
August 27th, 2009 at 10:43 am
perhaps i didn’t speak clearly enough.
i believe google knows how to fix that o.c.r., but
they are deliberately giving us the unfixed text…
(they’re also providing us low-resolution copies,
instead of their high-resolution originals, which
might — or might not — impact o.c.r. quality.)
-bowerbird
August 27th, 2009 at 12:54 pm
Hey, unclear’s my department.
.PDFs of images from Google run 3-10 megs, a little smaller than the Archive’s, but not by much anymore. I’ve gone as high as 1800 DPI on some scans, working w/ libraries and so forth. 300’s the norm; 400 used to be better. Higher rez helps… in a few cases… but only so much.
The software itself seems behind the curve, so maybe they’re hiding something, but, they’re the ones trying to convince publishers to sign on to a complete capture of the market… it’s more likely WYSIWYG.
Ever get bored, read Gawker and Valleywag for what they say about GBS staffing…
August 27th, 2009 at 11:37 pm
david-
> The software itself seems behind the curve,
> so maybe they’re hiding something
they’re definitely keeping some things to themselves.
> Ever get bored, read Gawker and Valleywag
> for what they say about GBS staffing…
if i ever get bored enough to read either of those,
i will shoot myself in the head instead, but thanks.
once g.b.s. announced that they were aiming at
“original intent x.m.l. markup”, i knew they were
gonna be fattening up the profit pig for slaughter.
there are worse ways to waste money (e.g., giving
bailouts to bankers), but not many, when compared
to doing x.m.l. markup on millions of scanned books.
-bowerbird
August 27th, 2009 at 11:57 pm
> if i ever get bored enough to read either of those,
> i will shoot myself in the head instead, but thanks.
Well, if you’re just sitting around scanning books, you gotta find mindless stuff to do.