Participation in the Development of FineReader XIX
ATAPY participated in the development of ABBYY FineReader XIX, an OCR system for converting old European books to “modern” digital formats.
Meta-E is a collaborative initiative established by a consortium of 14 universities from 7 European countries and the US and co-funded by the European Union. The project focuses on providing a technology base for the digitization and web publishing of valuable old printed sources spanning several centuries of European history. This required an OCR system capable of recognizing historical texts from the period 1800 to 1938, including those printed in Frakturschrift (an old black-letter typeface that was prevalent at the time). No omnifont Frakturschrift systems were available: every OCR product had to be trained on each individual book before processing it. The Meta-E coordinators therefore started looking for a high-quality OCR package that could be extended to meet their requirements. ABBYY FineReader was chosen for its unrivalled recognition accuracy, support for 176 modern languages, and user-friendliness. ABBYY Software House, the international manufacturer of FineReader products, began work as a direct contractor to develop the omnifont element of the project (introducing the Frakturschrift graphics to FineReader). The linguistic part of the project was subcontracted to ATAPY Software, ABBYY's long-term partner in OCR and linguistic development.
ATAPY’s role in the Meta-E project was to construct Old Language Models (LMs) for 5 European languages: English, French, German, Italian, and Spanish. An LM is a computer database that describes the vocabulary of a language. FineReader uses LMs during recognition to build OCR hypotheses and for spell-checking. An LM is not simply a full list of words in all possible grammatical forms, because such a database would be enormous and hard to manage. Instead, FineReader’s LMs store only the stem of each word and describe the grammar as a set of flexible rules (paradigms). Each stem is assigned a list of paradigms; applying them to the stem produces all possible forms of the word. ATAPY studied a large number of authentic dictionaries and original old European texts from the targeted time period, reviewed the word stock, added words that have since fallen out of use, and corrected the paradigm assignments to bring the LMs in line with the grammatical practices of the time.
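The stem-and-paradigm idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not FineReader's actual data format: the `Paradigm` class, the `LEXICON` table, and the two English verb paradigms are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paradigm:
    """A named set of endings that can be attached to a stem."""
    name: str
    endings: tuple


# Hypothetical paradigms, invented for illustration only.
REGULAR_VERB = Paradigm("regular_verb", ("", "s", "ed", "ing"))
ARCHAIC_VERB = Paradigm("archaic_verb", ("est", "eth"))

# The model stores stems only; each stem lists the paradigms assigned to it.
LEXICON = {
    "walk": (REGULAR_VERB, ARCHAIC_VERB),
    "talk": (REGULAR_VERB,),
}


def expand(stem, paradigms):
    """Apply every assigned paradigm to the stem and collect the word forms."""
    return {stem + ending for paradigm in paradigms for ending in paradigm.endings}


def is_known(word, lexicon):
    """Check whether some stem and paradigm combination produces the word."""
    return any(word in expand(stem, paradigms) for stem, paradigms in lexicon.items())


print(sorted(expand("walk", LEXICON["walk"])))
print(is_known("walketh", LEXICON), is_known("talketh", LEXICON))
```

In this toy model, "walketh" is accepted because the stem "walk" carries the archaic paradigm, while "talketh" is rejected until the same paradigm is assigned to "talk". Correcting a paradigm assignment is therefore a small change to one entry rather than an edit to a full word list.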
To complete this task, ATAPY’s linguists carefully selected 10 dictionaries published between 1808 and 1930 that reflected the state of the 5 languages. ATAPY also thoroughly analyzed 105 authentic books from that period, comprising more than 50 MB of text. The next step was to build the FineReader LMs. ATAPY’s linguists manually compared the information from the authentic dictionaries and texts (about 500,000 entries overall) with the existing FineReader vocabularies. The work covered a total of 458,767 words, of which 61% remained unchanged and 36% were added to the vocabularies from the analyzed sources. About 3% of the words had their paradigms corrected according to XVIII to early XX century grammar rules; to make this correction, the linguists added 159 historical grammar paradigms that were missing from the contemporary models.
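For a sense of scale, the percentages above can be turned into approximate word counts. The sketch below assumes that the three shares apply to the stated total of 458,767 words; because the published percentages are rounded, the resulting counts are estimates only.

```python
TOTAL_WORDS = 458_767

# Shares in whole percent, as quoted in the text (rounded figures).
SHARES_PERCENT = {
    "unchanged": 61,
    "added": 36,
    "paradigms_corrected": 3,
}


def approximate_counts(total, shares_percent):
    """Convert rounded percentage shares into approximate absolute counts."""
    return {label: round(total * pct / 100) for label, pct in shares_percent.items()}


counts = approximate_counts(TOTAL_WORDS, SHARES_PERCENT)
for label, count in counts.items():
    print(f"{label}: about {count:,} words")
```

Under this assumption, roughly 280,000 words were kept as they were, about 165,000 were added, and close to 14,000 had their paradigms corrected.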
Finally, the LMs were compiled and tested on the control text corpus, with the following results:
98.91% vocabulary coverage for the Old English language
99.16% vocabulary coverage for the Old French language
96.58% vocabulary coverage for the Old German language
98.58% vocabulary coverage for the Old Italian language
98.79% vocabulary coverage for the Old Spanish language
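The text does not say exactly how vocabulary coverage was measured. One common reading, assumed here, is the share of word tokens in a control text that the vocabulary recognizes. The sketch below uses that definition; the `tokenize` helper, the sample sentence, and the tiny vocabulary are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters only, Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_coverage(text, vocabulary):
    """Return the percentage of tokens in the text that occur in the vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vocabulary)
    return 100.0 * known / len(tokens)


# Hypothetical control sentence and vocabulary, for illustration only.
sample_vocabulary = {"die", "alterthumskunde", "ist", "eine", "wissenschaft"}
sample_text = "Die Alterthumskunde ist eine alte Wissenschaft"

print(f"{vocabulary_coverage(sample_text, sample_vocabulary):.2f}% coverage")
```

Here five of the six tokens are known, so the coverage is about 83%. A real measurement would run over the whole control corpus and would check each token against the forms generated from stems and paradigms rather than against a flat word set.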
To illustrate, consider two examples where the regular FineReader package, or any other contemporary OCR system, is likely to make mistakes. In the first fragment, ‘Alterthumskunde’ may become ‘Allerlhumskunde’; in the second, ‘UEBERSICHT’ (‘Übersicht’ in modern German) is recognized as two words, ‘UEBER SICHT’. These mistakes happen for two reasons. The first is poor printing quality, which cannot be improved at this stage. The second is the old spelling used in the incorrectly recognized words: existing OCR systems are targeted at modern texts and therefore know only modern spelling.
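The effect of the vocabulary on recognition can be sketched as follows. This is a simplified illustration and not FineReader's actual algorithm: it assumes the recognizer produces several candidate readings with image-match scores and prefers candidates that the vocabulary knows. The function `pick_hypothesis`, the scores, and the one-word vocabularies are hypothetical.

```python
def pick_hypothesis(candidates, vocabulary):
    """Choose a reading from (text, image_score) pairs.

    Candidates found in the vocabulary are preferred; among the remaining
    pool, the one with the highest image score wins.
    """
    known = [c for c in candidates if c[0].lower() in vocabulary]
    pool = known or candidates  # fall back to all candidates if none is known
    return max(pool, key=lambda c: c[1])[0]


# Hypothetical candidates for a poorly printed word; the scores are invented.
candidates = [("Allerlhumskunde", 0.58), ("Alterthumskunde", 0.55)]

modern_vocabulary = {"altertumskunde"}      # modern spelling only
historical_vocabulary = {"alterthumskunde"}  # old spelling included

print(pick_hypothesis(candidates, modern_vocabulary))
print(pick_hypothesis(candidates, historical_vocabulary))
```

With only the modern spelling available, neither candidate is known, so the slightly better image match, a misreading, is chosen. Once the historical spelling is in the vocabulary, the correct reading is selected even though its image score is lower.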
Once the 5 LMs were merged into the FineReader 7 shell, ABBYY was able to offer a specialized product that "knows" the spelling specifics of old European languages: FineReader XIX. This product is much less likely to make mistakes of the kind described above. Users can now OCR old texts with higher quality and save much of the time that was previously spent on error correction.
ABBYY FineReader XIX has become a powerful tool assisting the Meta-E consortium in its large-scale digitization work. In addition, as the industry’s first boxed OCR product to recognize Renaissance and Late Medieval sources, it is specially targeted at European libraries and public organizations engaged in the preservation and publication of cultural assets.
‘I’ve got FineReader XIX installed here on my computer. The Frakturschrift recognition is very good. Even though old text recognition is not a large and growing market, I am sure all the service bureaus here in Germany will be ordering 1 or 2 copies and have it run 7×24.’
ABBYY Europe GmbH
About ABBYY Europe GmbH:
ABBYY Europe GmbH is a European department of ABBYY Software House, based in Munich, Germany. ABBYY Software House is a manufacturer of FineReader OCR products — one of the world’s best optical character recognition technologies.