Java OCR components. The toolkit consists of two main components. The first is a Java wrapper for the Tesseract OCR engine, which converts image files (faxes or scans) to text files. The OCR engine is free and released under the Apache 2 license. The second is a document parser, which processes those text files, extracts business information from them, and produces either a Java object or an XML file. The parser can interpret the information even if the OCR engine did not accurately decode every character. The software also includes a ready-to-use servlet for web environments.
Keywords: java, ocr, document, parser, scan
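
The description says the parser tolerates characters that the OCR engine misread. The toolkit's actual technique is not documented here, so the following is only an illustrative sketch of one common approach: comparing an OCR token against an expected field label using Levenshtein edit distance. The class name FuzzyLabelMatcher, its methods, and the sample labels are hypothetical and are not part of the toolkit's API.

```java
import java.util.Locale;

// Hypothetical illustration only; not the toolkit's own parser code.
public class FuzzyLabelMatcher {

    // Levenshtein edit distance (insert, delete, substitute), computed with two rows.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting a's prefix entirely
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True if the OCR token is within maxEdits of the expected label, ignoring case.
    public static boolean matches(String ocrToken, String label, int maxEdits) {
        return distance(ocrToken.toLowerCase(Locale.ROOT),
                        label.toLowerCase(Locale.ROOT)) <= maxEdits;
    }

    public static void main(String[] args) {
        // "lnvoice" and "T0tal" mimic typical OCR confusions (I/l and o/0).
        System.out.println(matches("lnvoice", "Invoice", 1));
        System.out.println(matches("T0tal", "Total", 1));
        System.out.println(matches("Date", "Total", 1));
    }
}
```

With a tolerance of one edit, the misread tokens "lnvoice" and "T0tal" are still recognized as their labels, while an unrelated word such as "Date" is not matched to "Total". A real parser would likely tune the tolerance to the label length to avoid false matches on short words.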