Arnaud Valle
Technical leader & Full-stack engineer
Crafting web solutions that are reliable, simple, and impactful
How a 3€ book ended up costing me several hundred euros
Wed Dec 28 2022
Backstory
I finally made the jump to an e-reader about a year ago (I know, I'm very late to the party, as usual) and I really like the experience so far.
The first reason I like it is that I’m constantly battling for space in my flat. So, having several dozen books fit into a tiny tablet is a big win for the sort of minimalism I am aiming for.
Being able to organise all those books into neat lists and getting reading statistics is also a plus. Controlling the brightness or screen colour (depending on the time of day) is also a nice touch.
Syncing my Pocket articles is even better.
But the real killer feature for me is this: being able to choose my font face, font size, font weight, line height, margin etc... This is a frontend web developer's dreamland 😊
So, a few weeks back, when I bought an e-book and realised (too late) it was no more than a PDF scan of the original book (you could actually see through some of the pages and guess what was written on the other side… 🤯), I got a bit irritated.
The main problem was that, since the pages of the book were essentially just images, all I could do was zoom in on my page (and that’s really annoying to do - e-readers weren’t really built with great hand gesture support in mind). Then, every time I turned a page, I had to reposition the image so it fitted ok on the screen.
Overall, I went from a very comfortable reading experience with fine-grained settings to a complete nightmare of a user experience 😱
Also, on a side note, the book was massive: over 200MB when the average book I have bought so far is about 2MB (in epub format).
Research and proof of concept
So, the first thing I did was jump to my favourite search engine and try to figure out if anybody had built a magic tool to convert my scanned book to some other format my e-reader could handle (and recover all the sweet features I liked so much): HTML, RTF or, hell, even a good old TXT file would do!
Online tools
I tried some online tools, but they would generally only handle files up to 15MB - quite far off the size of my book. So that looked like a dead end.
Along the way, I did learn what OCR (Optical Character Recognition) was though: it’s basically the technology that allows you to convert a printed document into a digital one. In my case, it would be the difference between a PDF with an image in it (one of the pages of my book) and a PDF in which you can select or copy/paste text.
Software
Then I tried software like Wondershare’s PDF Element, but, again, the free version was a bit limited: it would only handle a few pages and, when I ran it on my book, the end result was read-only. I couldn’t export the files I had generated via the software using the free version so I’d eventually have to pay for something I’m not even sure would be what I’m after in the end…
npm and GitHub to the rescue
Next up, I moved on to my beloved npm and GitHub and started searching for some libraries or packages that would do what I was after. There’s always somebody way cleverer than me who has figured out a nice solution to the problem.
I tried a few variations of searches and finally landed on tesseract.js: “Pure Javascript OCR for more than 100 Languages” they say. Ok, cool, that seems nice. Tesseract is an OCR engine and this is a JavaScript port of it. My day-to-day job involves quite a bit of JavaScript, so I could probably work with that… Finally, I was making some progress 🙂 The plan started to take shape in my head:
- take the book in its ‘image’ format
- run it through OCR to extract its text
- convert it to a better ‘text’ format of some sort so my e-reader could read it
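As a rough illustration of the middle step, here is a minimal sketch of running OCR over a list of page images. It is not taken from the prototype: `extractPages` is a hypothetical helper, and the OCR function is passed in as a parameter so the sketch does not depend on any particular library. The commented lines show how tesseract.js could be plugged in, assuming the package is installed and the book is in English.

```javascript
// Hypothetical helper: runs OCR on each page image, in order,
// and collects the extracted text of every page.
// `recognize` is any async function that takes an image path and returns its text.
async function extractPages(imagePaths, recognize) {
  const pages = [];
  for (const path of imagePaths) {
    // Process one page at a time so a large book does not
    // spawn hundreds of OCR jobs at once.
    const text = await recognize(path);
    pages.push(text.trim());
  }
  return pages;
}

// With tesseract.js, the recognizer could look like this
// (assumption: package installed, English text):
//   const Tesseract = require('tesseract.js');
//   const recognize = (path) =>
//     Tesseract.recognize(path, 'eng').then(({ data }) => data.text);
```

Keeping the recognizer as a parameter also makes it easy to try the loop with a fake function before running real OCR, which is slow on a book-sized input.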
Putting it all together
Although I ended up adding a few extra steps, that is essentially very close to my final workflow. Here is the detailed process I came up with:
- Use Automator to split the original PDF book into several documents (1 document per page). I’ve always loved Automator but I keep forgetting it exists somehow. It has saved me quite a lot of time over the years…
- Use Automator to transform each PDF page into an image: tesseract.js expects images, so I’ll feed it images
- Run each image (= page of the book) through tesseract.js to extract its text
- Create a text file with the extracted content of each image
- Merge all text files into one file
- Format the text file to HTML: the final result just looks better on my e-reader compared to a basic txt file
Here is the repository of my prototype
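The prototype's code isn't reproduced in this post, so the following is only a sketch of what the last two steps of the workflow (merging the page texts and formatting them as HTML) might look like. The names `escapeHtml` and `pagesToHtml` are hypothetical, and treating blank lines as paragraph breaks is an assumption about the OCR output, not something the prototype is known to do.

```javascript
// Escape the characters that have a special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Merge the per-page texts into one string, then wrap each
// blank-line-separated chunk in a <p> element.
function pagesToHtml(pages, title) {
  const merged = pages.join('\n\n');
  const paragraphs = merged
    .split(/\n\s*\n/)
    // Rejoin lines that were only broken by the width of the printed page.
    .map((p) => p.replace(/\s*\n\s*/g, ' ').trim())
    .filter((p) => p.length > 0)
    .map((p) => `<p>${escapeHtml(p)}</p>`);

  return [
    '<!DOCTYPE html>',
    '<html>',
    `<head><meta charset="utf-8"><title>${escapeHtml(title)}</title></head>`,
    '<body>',
    ...paragraphs,
    '</body>',
    '</html>',
  ].join('\n');
}
```

The returned string can then be saved to a file (for instance with Node's `fs.writeFileSync`) and copied to the e-reader, which is free to apply its own font and margin settings to plain paragraphs.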
Conclusion
So between the time spent researching and implementing my prototype, that’s how a 3€ e-book ended up costing me several hundred euros. I love it when code can help me solve a real-world problem. Sure, I could have just bought the book again in a different file format from some other online platform and that would have been it. But where is the fun in that?
Note: I did search a fair bit but, since my book is actually a very old one, I didn’t find any other version of it, so I was stuck with the scan.
But, along the way of this little project, I learnt quite a few things on a subject I knew little about and discovered a whole new world of terms, ecosystem etc… that, who knows, may just help me in the future 🤷‍♂️
Oh and the funny thing is, after all those hours and effort spent converting that e-book, well… I still haven’t read it… 😂