Open Access News points us to the Digital Library of India, which has as its goal the digitization of significant literary, artistic, and scientific works for free distribution and appreciation. Unlike the Carnegie-Mellon Universal Library (which helps to coordinate the DLI), the DLI is focusing on works primarily in Indian languages. Books are digitized by scanning (and readable as page images), and are made available as text via optical character recognition, or OCR. This presents some interesting challenges:
- There are 1500 spoken Indian languages and 17 scripts.
- Unlike English, where the number of characters to be recognized is less than 100, Indian scripts have several hundred characters to be recognized.
- Non-uniformity in the spacing of the characters within a word because of the presence of Consonant Conjuncts (vowel + consonant) makes OCR more difficult. Also, the presence of Consonant Conjuncts results in improper line segmentation. Programs will have to do further processing to segment the lines.
- Consonants take modified shapes when attached to vowels. Vowel modifiers can appear to the right, on the top or at the bottom of the base consonant. Such consonant-vowel combinations are called modified characters. In addition, two, three or four characters can combine to generate new complex shapes called compound characters. These characters are very difficult for a machine to recognize.
- In scripts like Bangla and Devnagari, all the characters in a word are connected by a unique line called shirorekha (also called head line). In these scripts, character segmentation is especially difficult.
- In south Indian scripts, vowels occur only at the beginning of a word as against the vowels in Oriya, where they occur anywhere within a word. So, the language morphology for some groups of scripts is different from the others.
- There is no universally acceptable standard encoding scheme for Indian scripts. This necessitates a scheme where the output labels from the OCR system can be mapped to the labels used by the typesetter through a mapping table.
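To make the last point concrete, here is a minimal sketch of what such a mapping table might look like. Everything in it is an assumption for illustration: the OCR label names (`KA`, `SIGN_I`, and so on) are hypothetical, Unicode Devanagari code points stand in for whatever labels a typesetter actually uses, and the labels are assumed to arrive already in reading order. It is not the DLI's actual scheme.

```python
# Hypothetical mapping table: OCR output labels -> typesetter labels.
# Unicode Devanagari code points are used here as a stand-in target encoding.
OCR_TO_TYPESETTER = {
    "KA": "\u0915",      # DEVANAGARI LETTER KA
    "SSA": "\u0937",     # DEVANAGARI LETTER SSA
    "SIGN_I": "\u093F",  # DEVANAGARI VOWEL SIGN I
    "VIRAMA": "\u094D",  # DEVANAGARI SIGN VIRAMA (joins consonants)
}

UNKNOWN = "\uFFFD"  # Unicode replacement character for unmapped labels


def map_labels(labels, table=OCR_TO_TYPESETTER, unknown=UNKNOWN):
    """Translate a sequence of OCR labels into typesetter characters.

    Labels missing from the table become the replacement character
    so that gaps in the table stay visible in the output.
    """
    return "".join(table.get(label, unknown) for label in labels)


if __name__ == "__main__":
    # A consonant followed by a vowel modifier (a "modified character").
    print(map_labels(["KA", "SIGN_I"]))
    # Two consonants joined by a virama (a "compound character").
    print(map_labels(["KA", "VIRAMA", "SSA"]))
```

Keeping the table as plain data means a different target encoding could be supported by swapping in another dictionary, without touching the recognizer itself.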
At this point, they've scanned about 100,000 books -- 10% of their eventual goal, a million books available to anyone, anywhere, with a web connection.