Pulling out secrets: New technologies help military find intelligence in degraded documents
September 25, 2009
Recent research has produced practical multilingual text processing systems that help today's warfighters immediately read captured paper documents that are in poor condition, hard to read, and written in a foreign language.
Modern warfighting emphasizes the development of intelligence to obtain strategic and tactical advantage. Recent technological advances have broadened and accelerated this trend enormously and, today, it is finally practical to make use of captured paper documents for intelligence purposes while still on the battlefield.
Today’s soldiers need to determine the gist of paper documents that are in poor condition, hard to read, and written in a foreign language – immediately. Because soldier-linguists are in very short supply, it is a challenge to quickly assess a document’s intelligence relevance and obtain actionable information from it to support the warfighter.
An emerging solution to the challenge of battlefield document exploitation employs field-based systems that integrate advanced forms of document image capture, multilingual Optical Character Recognition (OCR), multilingual machine translation, and multilingual word or phrase spotting. Of particular interest are the recent developments in multilingual OCR and multilingual word spotting that make field-based exploitation systems practical.
The current military operations in Iraq and Afghanistan underscore the need for OCR systems that effectively transcribe Middle Eastern and Asian languages. While OCR software for Latin-script languages has long existed, systems that can recognize languages such as Arabic, Persian, Pashto, and Urdu as well as Chinese, Japanese, and Korean are just emerging. Recently developed multilingual OCR systems address these militarily significant languages and have a number of unusual capabilities that directly fulfill the needs of battlefield systems. Prominent among these are accurate recognition of degraded documents and identification of words and phrases signifying people, places, or things.
Multilingual text transcription and word spotting
Accurate transcription of Middle Eastern and Asian text is difficult. Degraded document images must be enhanced before accurate recognition is possible. Figure 1 displays an example of the extent to which images can be enhanced and readable information recovered. Since degraded document images are the norm in battlefield situations, the integration of sophisticated enhancement capabilities with a recognition system makes military document exploitation much more successful.
Another key to accurate recognition of degraded documents lies in the recognition process itself. One superior method uses the concept of “over-segmentation,” which overcomes many of the limitations of traditional character segmentation when working with low-resolution imagery or connected scripts (for example, Arabic). The goal of over-segmentation is to split, or segment, an image of text into primitives: pieces containing an individual character or a portion of a character. Then the task of correctly assembling the primitives into recognized characters can be performed and language-specific constraints incorporated to achieve maximum accuracy regardless of whether the source document is degraded, cursive, or even handwritten.
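The over-segmentation idea can be illustrated with a minimal Python sketch. It is not the method used by any particular product; it assumes a line of text has already been reduced to a per-column ink count, cuts that profile wherever the stroke is thin, and then uses dynamic programming to decide which consecutive primitives belong together as one character. The names over_segment, assemble, toy_score, and TOY_SCORES are hypothetical, and the score table stands in for a real character classifier. Language-specific constraints, which a real system would add to the score, are omitted.

```python
from math import inf
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (first column, one past the last column)


def over_segment(profile: List[int], max_ink: int = 1) -> List[Span]:
    """Cut the column-ink profile wherever the stroke is thin.

    Every run of columns with more than max_ink dark pixels becomes one
    primitive, so a single character may be split into several pieces.
    """
    spans: List[Span] = []
    start = None
    for x, ink in enumerate(profile):
        if ink > max_ink and start is None:
            start = x                      # a primitive begins
        elif ink <= max_ink and start is not None:
            spans.append((start, x))       # a thin column ends the primitive
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def assemble(
    spans: List[Span],
    score: Callable[[int, int], Tuple[str, float]],
    max_group: int = 3,
) -> List[str]:
    """Group consecutive primitives into characters with the best total score.

    score(start, end) is an assumed classifier interface that returns a
    (label, log-probability) pair for the image region start..end.
    """
    n = len(spans)
    best: List[Tuple[float, List[str]]] = [(-inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for size in range(1, min(max_group, end) + 1):
            begin = end - size
            prev_score, prev_labels = best[begin]
            if prev_score == -inf:
                continue
            label, s = score(spans[begin][0], spans[end - 1][1])
            if prev_score + s > best[end][0]:
                best[end] = (prev_score + s, prev_labels + [label])
    return best[n][1]


# Toy data: an invented ink profile and an invented classifier table.
profile = [0, 3, 4, 1, 5, 5, 0, 2, 6, 1, 4, 0]
TOY_SCORES = {(1, 6): ("A", -0.2), (7, 9): ("B", -0.3), (10, 11): ("C", -0.1)}


def toy_score(start: int, end: int) -> Tuple[str, float]:
    # Regions the toy classifier does not know get a poor score.
    return TOY_SCORES.get((start, end), ("?", -5.0))


primitives = over_segment(profile)
characters = assemble(primitives, toy_score)
print(primitives, characters)
```

In this toy run the profile is cut into four primitives, and the assembly step merges the first two into a single character because the merged region scores better than either piece alone. The cut step is deliberately allowed to split too often, and the assembly step repairs the damage.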
Figure 1: Before – After
High word-spotting accuracy is obtained through the use of query-time OCR. A typical general-purpose OCR lexicon is designed to cover the most frequently used words in the target language to maximize recognition performance without making any assumptions about document content. While this strategy provides the best generic recognition, it is not ideal for word spotting, or equivalently, search queries, because queries against a document (or a corpus) are almost always concerned with less frequent words representing entities such as people, places, or things. Since these types of words occur only in specialized contexts, they are not usually included in a general-purpose lexicon. Consequently, they are more likely to be incorrectly recognized by a generic OCR engine, particularly in the case of low-quality document imagery where word-spotting accuracies are significantly decreased.
Query-time OCR is implemented by constructing a supplemental lexicon from the keywords of each query and providing it to the OCR when word spotting is performed. Though not obvious, query-time OCR turns out to be a very practical approach to word spotting that results in accuracy improvements of up to 15 percent compared to conventional methods.
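A rough Python sketch of the supplemental-lexicon idea follows. It makes a simplifying assumption: a real engine would apply the lexicon during recognition, whereas this sketch approximates that by snapping each raw OCR word to the closest lexicon entry after the fact. The function names (edit_distance, decode, spot_words), the distance threshold, and the sample words are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def decode(raw: str, lexicon: Set[str], max_dist: int = 2) -> str:
    """Snap a raw OCR word to the nearest lexicon entry, if one is close enough."""
    if not lexicon:
        return raw
    best = min(sorted(lexicon), key=lambda w: edit_distance(raw, w))
    return best if edit_distance(raw, best) <= max_dist else raw


def spot_words(
    raw_words: Iterable[str],
    query: str,
    general_lexicon: Iterable[str],
    use_query_lexicon: bool = True,
) -> List[int]:
    """Return the positions of words that match any keyword in the query."""
    keywords = {k.lower() for k in query.split()}
    lexicon = set(general_lexicon)
    if use_query_lexicon:
        lexicon |= keywords        # supplemental lexicon built from the query
    decoded = [decode(w.lower(), lexicon) for w in raw_words]
    return [i for i, w in enumerate(decoded) if w in keywords]


# Invented example: the place name is misread and is absent from the general lexicon.
RAW = ["convoy", "near", "kandahor", "road"]
GENERAL = {"convoy", "near", "road", "candor", "the"}

print(spot_words(RAW, "Kandahar", GENERAL))                           # with query lexicon
print(spot_words(RAW, "Kandahar", GENERAL, use_query_lexicon=False))  # general lexicon only
```

With the query keyword added to the lexicon, the misread word is corrected and found at position 2. With the general lexicon alone it stays misread and the search returns nothing.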
Practical tools for document exploitation
The rise of dissimilar enemies calls for new tools that let the warfighter obtain relevant intelligence from degraded foreign-language documents while on the battlefield. Modern document exploitation systems aim to provide that support by integrating software components that enable image capture, transcription, translation, and search (word spotting). VERUS, a product of NovoDynamics Inc., incorporates the OCR advances described for multilingual recognition of degraded Middle Eastern and Asian language documents. The aforementioned query-time OCR module has been proven in a laboratory setting and can be integrated with VERUS when needed. The result: highly accurate, rapid readings of degraded documents, to help the military find actionable intelligence.
Dr. Steven Schlosser is senior scientist at NovoDynamics, Inc., an In-Q-Tel portfolio company headquartered in Ann Arbor, MI with offices in Vienna, VA. He holds a Ph.D. in Mathematics from SUNY at Buffalo and a B.S. in Physics from Rensselaer Polytechnic Institute. He can be contacted at [email protected].
NovoDynamics, Inc. 734-205-9126 www.novodynamics.com