
How to Use AWS Textract OCR to Extract Text and Data from Documents – CloudSavvy IT


Many companies rely on human workers to manually enter data from forms, applications, and other physical documents. While this is very accurate, it is slow and costly. AWS Textract uses machine learning to automate the process.

Why use AWS Textract?

Textract is by no means the only optical character recognition (OCR) tool; there are plenty of free, open-source solutions available, such as Tesseract OCR. You can read our guide to using it for more information.

Textract is much more than simple OCR, however, as it is designed for analyzing and extracting data from forms, tables, and other documents. It can extract key-value pairs, tables, and other important strings, which makes it genuinely useful as an interface between scanned documents and a database (although you have to set up that automation yourself).

The other allure is that Textract makes OCR available as a fully managed cloud service. You don't need to set up your own application servers to perform OCR and interpret the output; just configure Textract, send it some documents, and it will output the results.

For companies that still enter data manually, Textract can save a lot of money, both by reducing the hours spent typing on a keyboard and by handling many documents at once, which greatly increases the speed of data entry.

In terms of price, Textract is cheapest for plain text, such as scanned pages of books: that costs only $1.50 per 1,000 pages. Analyzing tables costs $15.00 per 1,000 pages, and extracting key-value pairs costs $50.00 per 1,000 pages. While that's not exactly free, it's definitely cheaper than paying a human to do it manually.
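To get a feel for what those rates mean for a real workload, here is a minimal Python sketch that multiplies a page count by the per-1,000-page prices quoted above. The function name `estimate_cost` and the feature labels are made up for illustration, and the rates are simply the ones from this article, so check current AWS pricing before relying on the numbers.

```python
# Per-1,000-page rates as quoted in this article; actual AWS pricing may differ.
RATES_PER_1000_PAGES = {
    "text": 1.50,     # plain text detection
    "tables": 15.00,  # table analysis
    "forms": 50.00,   # key-value pair extraction
}


def estimate_cost(pages: int, feature: str) -> float:
    """Return the estimated cost in dollars for processing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    rate = RATES_PER_1000_PAGES[feature]  # KeyError for an unknown feature
    return round(pages / 1000 * rate, 2)


if __name__ == "__main__":
    # Compare the three feature tiers for a 25,000-page backlog.
    for feature in RATES_PER_1000_PAGES:
        print(feature, estimate_cost(25_000, feature))
```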

Textract is fairly accurate, but if you're concerned about the machine getting something wrong, AWS has a solution for that too. You can set up Textract to use Amazon's Augmented AI workflow, which automatically routes low-confidence results to humans for review.
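The idea behind that kind of review step can be illustrated without Augmented AI itself. Textract attaches a `Confidence` score (0 to 100) to the blocks it returns, so a rough sketch is to split results at a threshold and send the low scorers to a person. The helper name `split_by_confidence`, the threshold of 90, and the sample data below are all assumptions for illustration; this is not how Augmented AI is configured.

```python
def split_by_confidence(blocks, threshold=90.0):
    """Split blocks into (accepted, needs_review) by their Confidence score."""
    accepted, needs_review = [], []
    for block in blocks:
        # Blocks with no score are sent to review to stay on the safe side.
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review


# Hand-written sample data, not real Textract output.
sample_blocks = [
    {"BlockType": "WORD", "Text": "Wages", "Confidence": 99.2},
    {"BlockType": "WORD", "Text": "4B,210", "Confidence": 61.5},
]

if __name__ == "__main__":
    accepted, needs_review = split_by_confidence(sample_blocks)
    print("accepted:", [b["Text"] for b in accepted])
    print("needs review:", [b["Text"] for b in needs_review])
```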

Using Textract

Go to the Textract Management Console and click "Get started". If you are using the console manually, you can upload documents with the upload button.

Textract will process the document immediately. You will soon see what makes Textract so useful; in this example, a W-2 form, it knew which bits of text were important, which were part of key-value pairs, which were part of tables, and which it could discard.

On the right side you will find the output, which lists all the raw strings found, the key-value pairs, and any data tables. Note that these are not mutually exclusive, as in this case key-value pairs were found that were also parts of tables.
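The same structure shows up in the JSON that the API returns: a flat list of blocks, where `KEY_VALUE_SET` blocks point to their value and to their child `WORD` blocks through `Relationships`. The following Python sketch walks that structure to rebuild key-value pairs. The helper names and the tiny `SAMPLE_RESPONSE` are invented for illustration, and the sketch ignores details such as checkboxes (selection elements), so treat it as a starting point rather than a complete parser.

```python
def block_text(block, blocks_by_id):
    """Join the text of a block's CHILD blocks that are WORDs."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return a dict of key text -> value text from a Textract-style response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        # A KEY block links to its VALUE block(s) via a VALUE relationship.
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(values)
    return pairs


# Hand-written, heavily trimmed sample; real responses carry many more fields.
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
    {"Id": "w2", "BlockType": "WORD", "Text": "name"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Doe"},
]}

if __name__ == "__main__":
    print(extract_key_values(SAMPLE_RESPONSE))
```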

You can download the results and you will find a CSV file with all tables and key / value pairs, as well as a text file with plain text output.

If you want to automate Textract, you must use the AWS CLI or API. Textract has its own set of commands for working with it from the command line.

You can either pass the document as base64-encoded bytes, or upload it to S3 and give Textract the bucket and object key. Then you can run analyze-document to analyze it:

aws textract analyze-document --document '{"S3Object":{"Bucket":"bucket","Name":"document"}}' --feature-types '["TABLES","FORMS"]'
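If you are scripting this, the JSON passed to --document can be built programmatically instead of typed by hand. The sketch below covers both options, an S3 reference or inline bytes. The function name `build_document_arg` is hypothetical, and whether the CLI expects the `Bytes` field as base64 text can depend on your CLI version and its binary-format setting, so treat the inline-bytes branch as an assumption to verify.

```python
import base64
import json


def build_document_arg(*, bucket=None, name=None, data=None):
    """Build the JSON string for --document from an S3 reference or raw bytes."""
    if data is not None:
        if bucket is not None or name is not None:
            raise ValueError("pass either data or bucket/name, not both")
        # Assumption: the CLI accepts the document bytes as base64 text.
        encoded = base64.b64encode(data).decode("ascii")
        return json.dumps({"Bytes": encoded})
    if bucket is None or name is None:
        raise ValueError("bucket and name are both required for an S3 document")
    return json.dumps({"S3Object": {"Bucket": bucket, "Name": name}})


if __name__ == "__main__":
    # Same S3 reference as in the command above.
    print(build_document_arg(bucket="bucket", name="document"))
```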

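This is a synchronous operation, so the results come back directly in the response. You can also analyze asynchronously by starting a job with start-document-analysis, which returns a job ID, and then retrieving the results yourself: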

aws textract get-document-analysis --job-id df7cf32ebbd2a5de113535fcf4d921926a701b09b4e7d089f3aebadb41e0712b --max-results 1000
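In a script, retrieving asynchronous results usually means two loops: polling until the job leaves the `IN_PROGRESS` state, then following `NextToken` until every page of blocks has been fetched. Here is a rough Python sketch of that pattern. It assumes `client` is something like `boto3.client("textract")`, which is not imported here so the snippet stays dependency-free; the function name, the five-second poll interval, and the choice to treat anything other than `SUCCEEDED` as an error are illustrative assumptions.

```python
import time


def collect_analysis_blocks(client, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Wait for a Textract analysis job, then return all of its blocks."""
    # Poll until the job is no longer in progress.
    while True:
        page = client.get_document_analysis(JobId=job_id, MaxResults=1000)
        status = page["JobStatus"]
        if status != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    # Assumption: anything other than SUCCEEDED is treated as a failure here.
    if status != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Follow NextToken until every page of results has been collected.
    blocks = list(page.get("Blocks", []))
    while page.get("NextToken"):
        page = client.get_document_analysis(
            JobId=job_id, MaxResults=1000, NextToken=page["NextToken"]
        )
        blocks.extend(page.get("Blocks", []))
    return blocks
```

For anything beyond a quick script, a fixed polling loop like this is the simplest option but not the only one; a completion notification would avoid the repeated calls.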
