In this multi-part series, I am going to talk about my recent discoveries in the realm of document binarization. I will start with the basics and background and then build up an explanation of Nicholas Howe's remarkable algorithm for doing it piece by piece.
I have been obsessing about this algorithm over the past week, picking it apart and trying to improve upon it. So far my efforts to improve it have been fruitless. But these efforts have given me a newfound respect for just how hard the binarization problem is.
For those who want to skip ahead and read about the algorithm from primary sources, you can find details here:
- The first paper describes the base algorithm.
- A second paper shows a method for automatically tuning parameters for the base algorithm. This idea is what really piqued my initial interest because all of these algorithms have parameters that drastically impact their effectiveness and finding the right parameters is time consuming work.
- Nicholas Howe has also released his Matlab code and it is by converting and tweaking the code that I really came to understand how it all works.
Binarization is the process of turning a grayscale image into a black and white image. Every pixel is converted from a gray pixel into either a completely white or a completely black pixel.
One reason that people are interested in binarization is that it is an essential step for OCR algorithms that convert a document into plain text and other image processing algorithms. Another reason is that a binarized document can be compressed down a lot more, an order of magnitude at least.
The latter is what I am most interested in. With binarization and the best compression techniques, you a scanned 400 page book can be compressed down to about 10 MB. That is small enough to carry thousands of books on any modern tablet. Once I scan library, I want to be able to carry the whole kit and kaboodle with me all the time.
One issue with binarization is that there is no single right way to binarize every image. It inevitably means a loss of information. There are dozens of methods and each one will work well for different purposes or with different source material.
I want to binarize documents rather than photographs or illustrations. My materials will be relatively evenly lit and printed with ink. I want to maintain readability and preserve small scale features like serifs and thin strokes in a letter. And I want an automated process. If I have to spend a lot of manual labor on each page, then books with hundreds of pages are not feasible.