Technical Debt and Technical Investment

03/03/2022 in Innovation, Software Development

 

“Too much technical debt” is one of the most common complaints you will hear from software developers working on any long-lived project. In this article, we will talk about technical debt and technical investment and learn that both can lead to positive or negative outcomes depending on how they are managed.

Technical Debt describes any suboptimal implementation of technology which requires the maintainers to pay “interest” in the form of additional work fixing problems and dealing with bad performance.

Technical debt is a powerful term because it’s so easy to understand, even amongst non-technical stakeholders. The analogy to financial debt is so powerful that it’s easy to imagine the software team burdened by tech debt feeling like the cash-strapped individual who has just maxed out their final credit card.

The problem with most analogies is that they get pushed way beyond their applicability, but in the case of technical debt the similarity is so strong it often doesn’t get pushed far enough. Technical debt is mostly talked about in purely negative terms, but just as taking on financial debt in the form of a mortgage can be a very sound decision, so too can taking on technical debt, provided it is done for the right reasons.

It may surprise you that Ward Cunningham, the person who coined the term “technical debt” (and also the inventor of the wiki), did so in a positive context rather than a purely negative way:

“With borrowed money, you can do something sooner than you might otherwise, but then until you pay back that money you’ll be paying interest. I thought borrowing money was a good idea, I thought that rushing software out the door to get some experience with it was a good idea, but that of course, you would eventually go back and as you learned things about that software you would repay that loan by refactoring the program to reflect your experience as you acquired it.”

 

Technical Debt Antipatterns

Before we look at some of the ways in which technical debt can be a benefit, let’s consider some of the classic technical debt antipatterns:

Accidental Complexity

Accidental complexity is the kind of technical debt that occurs when code is written by people who lack the skills necessary to produce an efficient solution. This can happen when code is written by inexperienced developers, but also when working on novel or cutting-edge projects for which established patterns and tools are not readily available.

Accidental complexity can be an especially nasty kind of technical debt because the people creating it may not know that they have created it, or what to do about it.

Constant Crunch

In the software industry it’s very common for a project to go into a “crunch” before an important deadline, where everyone is working especially hard and is focused on delivery. A team in crunch mode will typically not have much time for planning or elegant solutions, and as a result technical debt will increase. Ideally crunch should be avoided, but if it happens occasionally and the technical debt that was created is noted and paid down in a timely manner, then quality will not suffer too much. The problem occurs if a team is in constant crunch and no time is allocated to paying down the debt. This usually happens if the team is under-resourced, or the expectations placed on it are too high. A visible example of this is the games industry, where crunch before a release is very common and patching after release day to fix major bugs is a frequent occurrence. The recent failed launch of the game Cyberpunk 2077 shows what happens when there is too much crunch in a project.

 

What happens when technical debt is too high?

This is another area where the analogy to financial debt is very apt. When paying off a loan, part of the monthly payment goes to servicing the interest and part to paying off the principal. If the interest becomes too high, it becomes impossible to pay back the debt.

The same is true of technical debt. As technical debt increases, more of the effort going into the project is spent dealing with the problems the codebase itself causes.

The common experience of a codebase like this is that even simple changes require modifying code in many classes (or across multiple services in a microservice environment), or that fixing one thing breaks something else. This is especially problematic if the code lacks sufficient tests.

Eventually progress grinds to a halt and more time is spent dealing with problems than developing new functionality. 

Considered Technical Debt

We’ve considered some cases where technical debt can hurt an organization, but let’s now think about occasions where it makes sense to create technical debt deliberately.

Prototype / MVP

Often the most critical question isn’t how a piece of software should be made, but whether it should be made at all. In those circumstances it makes sense to build something as quickly as possible to support the product discovery process. When prototyping you will typically take any shortcuts necessary to get to working software and learn more about the product before committing significant resources.

Where something is extremely technically challenging and the right mechanism to implement the solution isn’t clear, an engineering prototype can also be used for technical de-risking of the project. “The Mythical Man-Month” author Fred Brooks described this as writing one to throw away.

Engineers are often very skeptical about creating proof-of-concept prototypes with high technical debt because they are concerned that they may be required to productionize them without suitable refactoring time being allocated. To paraphrase Milton Friedman, “there is nothing more permanent than a temporary solution”.

Taking this into account, it is very important that stakeholders understand that a high-fidelity prototype they have just seen at a demo won’t be production-ready in a week’s time.

 

Reducing Time to Value

This differs from the prototype case because in this instance you are sure of the value of the software you are building, but you are prepared to take some shortcuts to get there quicker. This is the sense that was referred to by Ward Cunningham in the original coining of the term “Technical Debt”.

This might be because you are writing an application in a very competitive market where there is a strong element of “winner takes all” and launching a few weeks earlier offers a significant advantage, or because the software must be ready for a specific date that can’t move, for example a Black Friday sale or the launch of a movie.

In this situation it can make sense to deliberately take on technical debt to release more quickly. When doing this it’s essential that any technical debt is tracked and logged and time is set aside to pay it down later.

 

Technical Investment Antipatterns

The opposite of technical debt is technical investment. This is a less commonly used term, but means making technical improvements to support the future needs of the application. On the surface this would appear to be an activity with no downsides, but done without care this can have negative consequences.

Premature Optimisation

Creating optimisations which don’t reflect real performance problems, but make the code more complex and harder to support.

Premature Generalization 

Less talked about than premature optimisation but equally damaging, this means turning straightforward functionality into a generalised framework whose flexibility is never used, while the additional complexity makes the application harder to understand.

Useless Features

Adding features to an application that don’t help the users, but add complexity to the interface and more functionality to test and support.

 

Technical Investment Patterns

Clearly, some kinds of technical investment can be harmful, but what kinds can be relied upon to deliver benefits?

Improving Test Coverage

This is especially powerful if combined with refactoring the application to be more testable. A highly testable application is almost always a flexible and extensible application due to the layering and modularity that is required for effective tests.

Removing Unused Features / Code

Simplifying and streamlining a complex application helps users and engineers better understand the product. Using another financial analogy, unused code should be considered a liability rather than an asset.

Replacing and Refactoring

In the talk “Software That Fits in Your Head”, Dan North talks about the concept of “software half-life”: code that has existed for a while should either be hardened or replaced. Adopting a microservice architecture provides clearly defined boundaries between systems, which allows a problematic or underperforming component to be replaced without risking a knock-on effect on the system as a whole.

Hopefully this piece has highlighted the power and subtlety of technical debt, and how, managed pragmatically, it can be used to derive maximum value from the software delivery process.

At Curvestone we love helping our customers to deliver value fast without sacrificing quality. If you have a problem we can help you with please get in touch.

Intro to packaging and dependency management for python with poetry

02/02/2022 in AI, Innovation

Intro

Python has an amazing learning curve! You don’t need to skip over multiple files in a complex project structure to find and run your code. Instead, you can just open up an interactive shell and start playing around. As soon as you reach the limits of typing into the Python interactive shell, just open up your favourite text editor, write your code, and save it as a .py file, and now you have a working Python script. What if your script gets a bit too long? Take some of your code and put it into a different file, then start importing. It just works.

But software engineering is not just the code you write, it’s also the collaboration that’s required to tackle the complexity and scale of modern development projects.

In most programming projects, you will want to use open source (and possibly internal or closed source) packages, document your code, test it, make sure it adheres to your (and community) standards and maybe even publish it either internally or to PyPI.

In many programming languages, a lot of tools for all of your collaboration needs are either a part of the core language ecosystem (like cargo in Rust) or are so ubiquitous that it’s really hard to learn the language without them (like npm in JavaScript). In Python… well, it’s a bit more complicated. There are standard tools and ways of doing things, like setuptools and requirements.txt, but a lot of them seem antiquated and lack so many important features that it’s really hard to recommend them. Some, like an autoformatter, are just entirely missing.

You might have gotten to the point at which you’re ready to abandon the relative simplicity of one file scripts, or maybe you’ve gotten tired of juggling between virtual environments to keep your dependencies clean and separate for every project, or you’re just curious about how to standardize your development in a modern and highly collaborative way. Looks like a great moment for you to be introduced to the python development stack we have at Curvestone.

Packaging python

In the age of cloud computing, you never know where your code will be deployed and what it will have to be compatible with. That’s why we think it is really important to have a standard structure for all of your code: one that is well defined, contains all of the specifications required to use it, and is recognized by as many tools and people as possible. For Python, that’s a package.

The problem is, python packaging standards are currently being redefined, and the tools for package management are lagging far behind their counterparts in other modern programming languages.

Out of many alternatives, we have decided on poetry. What convinced me was its similarity to Rust’s cargo (it also borrows a lot of features from tools like Javascript’s npm and yarn).

Poetry replaces all of the manual work involved in packaging python with handy automations and makes it so efficient and easy that you’ll want to put every little piece of your code into a package.

Your manually created requirements.txt files will all be replaced by completely automated poetry.lock files. You will never have to maintain your own development virtual environments; all you need to do is jump into the project directory, and it will already be there.

What about defining and updating dependencies? You will get a simple CLI with multiple useful commands, a fast resolver, and access to all of the modern ways of defining version constraints.

Creating a project

To install and understand the basics of poetry, follow the excellent instructions at https://python-poetry.org/docs/

For now, we’ll skip the basics and jump straight into the action with:

poetry new example-project --src

For the project name, use lowercase letters and dashes; poetry will make sure to create the main package with underscores, so it is importable.

With packaged python, the trick is to never import modules from your local directory. Instead, you install them first and then import them from your installed packages. That way the package structure and dependencies will be managed by either poetry or pip.

Adding --src makes poetry put our package code in a folder named src. As it is common to run your code from the project root directory, we’ll avoid the ambiguity of having the same-named packages both installed and in the local path.
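For reference, the generated layout looks roughly like this (the exact files, such as the README format, vary slightly between poetry versions):

example-project
├── pyproject.toml
├── README.rst
├── src
│   └── example_project
│       └── __init__.py
└── tests
    ├── __init__.py
    └── test_example_project.py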

If you jump straight into the newly created project folder, you’ll get a bunch of new superpowers for your shell, all provided by poetry.

Using a dev environment

poetry install will install your package into the complementary poetry environment that comes with every poetry project. It is just a Python virtual environment, and if you are already in one, poetry will use it instead. However, I recommend deactivating your environment and letting poetry do its magic. That way all of your dev environments will be separate and specific to the project you are currently working on.

What’s more, poetry will create a poetry.lock file, where all of the dependencies are pinned. That way, you’ll never end up experiencing the nausea of trying to understand why a slight version discrepancy in a dependency of your dependency breaks the code when your colleague tries to run it. If you’re already anxious about the micromanagement that requirements.txt usually requires, don’t worry. Poetry manages the lock file for you.

poetry shell is the simplest way to access your new environment. This works the same as activating a regular python virtual environment.

Instead of jumping into a poetry shell, you can also make a command use your env by prefixing it with poetry run. For example, poetry run python will open up a Python interactive shell with your poetry environment activated!

If you don’t have any code in your project yet, add some and try importing it from the interactive Python shell. You don’t have to reinstall your package via poetry; it has been installed in editable mode, so every change you make is immediately importable (you might need to restart your interactive environment or use importlib.reload).
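As a quick illustration (assuming the example_project package created above), picking up edits in an already running interpreter looks like this:

# inside `poetry run python` or `poetry shell`
import example_project               # importable immediately after `poetry install`

# ...edit something under src/example_project/ in your editor...

import importlib
importlib.reload(example_project)    # picks up the change without reinstalling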

Poetry in VSCode

If, like me, you’re using VSCode, you might wonder how to have your linter recognize the poetry virtual environment as its interpreter. As VSCode is now compatible with poetry, this should happen automatically.

If it doesn’t, make sure that your poetry project is open at the root of your VSCode project, and then try setting the interpreter for your project manually.

If you are in a monorepo, just add the project folder to your workspace. You don’t need to have a workspace already; it will be created automatically when you add the folder. Now you should be able to set the interpreter manually for each of your roots. Try this for monorepos with other programming languages too; this setup solves a lot of compatibility issues in VSCode!

Dependency management

We ML and data people love to import numpy. To do this we need to specify it as a dependency. In poetry, there are two ways to do this. You can go on and just modify the pyproject.toml file in the project directory, or use poetry commands:

poetry add numpy will add numpy as a regular dependency. It might fail, as numpy tends to require a specific range of Python versions that is not compatible with the caret-defined “^3.8”; just update the python version in pyproject.toml as required and try again.

But what is this caret requirement? Great question! This is one of the ways in which you can specify dependencies in poetry and many other modern tools. Caret requirements focus on compatibility and conciseness, and if your dependency follows semantic versioning, you can be almost sure that it will not break until its version no longer fits your specification.

How do caret requirements work? “^1.0.1” means that your code will be compatible with any version from “1.0.1” up to, but not including, the next major release, so “2.0.0” is the first version that will not match. This makes sense, as a change in major version will inevitably come with incompatibilities. Keep in mind that caret requirements work a bit differently for versions below “1.0.0”; make sure to have a look at the examples in the poetry docs.
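As a rough cheat sheet (following the poetry documentation):

^1.2.3  means  >=1.2.3, <2.0.0
^0.2.3  means  >=0.2.3, <0.3.0   # below 1.0.0 the minor version acts as the breaking boundary
^0.0.3  means  >=0.0.3, <0.0.4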

Don’t worry if you’re accustomed to requirements.txt or setuptools style of specifying dependencies and don’t want to change to using caret requirements. It still works!

If you instead decide to just modify the pyproject.toml, make sure to run poetry update afterwards, which will refresh your poetry.lock and install the new packages. You can also use it to update your poetry.lock in case your locked packages get outdated. If you don’t want to update all of your packages, or you don’t want to install them, have a look at the documentation for poetry update and poetry lock.
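For illustration, a hand-edited dependency section might look like the snippet below (the python constraint and the numpy version are placeholders; pick whatever your project actually needs):

[tool.poetry.dependencies]
python = ">=3.8,<3.11"
numpy = "^1.22"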

Dev dependencies

Sometimes you may want to add a dependency that won’t be used in production. Those are called dev dependencies. Some of the examples might be pytest – a python unit-testing package, black – a package that formats your code, or jupyterlab – often used to edit notebooks for data science and machine learning.

To add a dev dependency use a --dev flag with poetry add.

If you want to use jupyterlab, you will need to install ipykernel as well. You can do this by running:

poetry add jupyterlab ipykernel --dev

then start it with poetry run jupyter lab.

This will install jupyter in your dev environment. In case you don’t want it to be bloated when running tests, you might want to use extras instead.

Outdated dependencies

One of the most useful commands in poetry is:

poetry show -o

It will show all of the outdated (-o) dependencies in your project, including indirect dependencies. Feel free to pin those in your pyproject.toml if you need to update them due to, e.g., potential vulnerabilities.

To update a dependency in your pyproject.toml, either do it manually or use:

poetry add numpy@latest

Build and publish

If you want to install your code, you can do it via pip, thanks to PEP 517. Just run:

pip install path/to/your/project/.

This will, however, ignore your poetry.lock.

What you can do instead is install your package using poetry install in any virtual environment; just make sure it is activated. If you want to go back to using the poetry environment, just deactivate your env. Poetry will never install anything into the global Python environment; when in the global env, it will always use its own environments.

Remember to add the --no-dev flag when installing in production!

If your lock file is not available during installation, as is the case for packages installed from a package index, make sure that your code is tested against multiple versions of your dependencies. Tox is a great tool for this.

If you need a wheel (a standard format for installable python packages), poetry build will do the trick.
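For the example project above, you would expect poetry build to leave both a wheel and a source distribution in the dist/ directory, named roughly like this (exact names depend on your package name, version, and poetry version):

dist/example_project-0.1.0-py3-none-any.whl
dist/example_project-0.1.0.tar.gz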

For publishing use poetry publish or follow the instructions for publishing wheels in your package index of choice. We at Curvestone use Azure Artifact Feed with twine authentication.

Monorepo

What if you are trying to use poetry in a monorepo? Sadly, it’s not perfect. Functionality similar to cargo workspaces or yarn workspaces isn’t there; poetry.lock is always individual to every package. If you want a detailed example of how a monorepo using poetry can be organized, have a look at this great medium post from Opendoor. Here I’ll provide some basics:

You can use path dependencies in your pyproject.toml, but you will have to edit it manually, e.g.:

example-dependency = {path = "../example-dependency", develop = true}

The develop flag here is really important; it basically means editable.

The problem with this is that building and publishing now basically don’t work, as the requirements of the built package will contain those paths, which cannot be resolved when installing the resulting wheel.

If you want to publish your package and have it work with poetry install, you can set it up with both a versioned dependency and a path dev dependency.

[tool.poetry.dependencies]
...
example-dependency = "0.1.0"

[tool.poetry.dev-dependencies]
...
example-dependency = {path = "../example-dependency", develop = true}

This works almost perfectly. Installing with poetry install will work. Building a wheel will completely ignore the path dependency and work correctly. Just make sure to put all of your path dependencies into a wheelhouse (a directory with wheels) and install with pip, using -f path/to/your/wheelhouse.
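A hypothetical install from such a wheelhouse might look like this (paths and file names are illustrative):

# the wheels for all path dependencies were collected into ./wheelhouse beforehand
pip install dist/example_project-0.1.0-py3-none-any.whl -f ./wheelhouse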

Sadly, poetry install --no-dev will not see the path dependency at all, and if you have a path dependency that itself has different path dependencies, the setup will break. Nevertheless, it might be a helpful setup for someone who has a relatively small monorepo and is willing to pin path dependencies of path dependencies.

There is also a small poetry workspace plugin that we haven’t yet had the chance to experiment with. We really hope that poetry will provide more tooling around monorepos in the future.

Next steps

This was a short intro to poetry, but this is not where tools for collaborative code management end. I’ve mentioned tox, black, and pytest, but there are many more! My favourite best practice is to regularly review and evolve your tools and ways of working. If you have a gut feeling something can be automated, there’s probably a bunch of tools that do exactly that. Have fun learning!

Comparison of Optical Character Recognition (OCR) services from Microsoft Azure, AWS, & Google Cloud

24/01/2022 in AI, Document Intelligence, Innovation

Introduction

The recent developments in machine learning (ML) have enabled the automation of more complex tasks across multiple fields. For example, the explosion of deep learning has made extracting valuable information from data much more practical, flexible, and robust due to its ability to identify complex patterns in unstructured data. One very practical area where we’ve seen a lot of improvement recently is in so-called Intelligent Document Processing (IDP). The underlying goal of IDP is to extract relevant information from documents and convert it into structured data that can later be used either by a person or by another intelligent algorithm. This technology offers increased productivity, faster processes, reduced errors, and consequently brings more value to the customer.

For scanned or handwritten documents, extracting text accurately is a crucial step in IDP workflows, as it cannot be directly read from the file. Such extraction can be achieved with the use of Optical Character Recognition (OCR) – a technique that detects and classifies characters in an image to find the content of a document. Although OCR is one of the earliest computer vision tasks to be addressed, recent developments in deep learning-enabled algorithms have led to much greater accuracy of detected words and sentences. This trend developed in parallel with the popularisation of cloud computing, and as the Big 3 cloud platforms, Microsoft Azure, Google Cloud Platform and Amazon Web Services, competed for market share, integrating the latest ML capabilities became part and parcel.

The cost of using the OCR service from one of the Big 3 is fairly comparable and relatively cheap, so they are primarily competing on performance. However, surprisingly, we haven’t found any good comparisons, either from the vendors themselves or from a third party. In fact, the vendors do not even share the performance of their own services. This makes it difficult to evaluate which vendor to choose when integrating an OCR-related feature into a product. This post was created to provide a comparison across the major OCR services. Curvestone has conducted experiments to assess their performance over three different datasets, and we calculated metrics that indicate effectiveness both for the detection of word bounding boxes and for the correctness of detected words.

Methodology

To compare OCR services, we needed to unify the method used to benchmark them. We were particularly interested in the performance of detection and the correctness of detected words. The general approach to assess single service performance over a single dataset was:

1. Flatten the dataset in a way that each page is a separate image

2. For each page:

a. Detect words and their bounding boxes (using OCR service)

b. Normalize format of bounding boxes

c. Match detections with corresponding ground truth

3. Calculate metrics

4. Average per file metrics to assess service performance across the dataset

This section describes the approach to those problems and provides a list of compared OCR services.

Bounding boxes structure and matching

Different datasets and services use different bounding box formats. Before we matched detections, we needed to normalize their format. We transformed detections from each source so the resulting bounding boxes have the following format:

  • Each bounding box contained information about the word and its location on a page
  • We used the upper-left and lower-right corners to define the bounding box

To match detections with the ground truth, we calculated the Intersection over Union (IoU) [explained below] between all bounding box pairs and matched the ones with an IoU larger than a threshold (we found 0.5 to work well).
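A minimal sketch of this step, assuming each bounding box is an (x1, y1, x2, y2) tuple of upper-left and lower-right corners (the greedy matching shown here is a simplification of the pairwise matching described above):

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2): upper-left and lower-right corners.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def match_detections(detections, ground_truth, threshold=0.5):
    # Pair each detected box with the best-overlapping, still unmatched
    # ground-truth box, keeping only pairs above the IoU threshold.
    matches, used = [], set()
    for d_idx, det in enumerate(detections):
        best_gt, best_iou = None, threshold
        for g_idx, gt in enumerate(ground_truth):
            if g_idx in used:
                continue
            score = iou(det, gt)
            if score >= best_iou:
                best_gt, best_iou = g_idx, score
        if best_gt is not None:
            used.add(best_gt)
            matches.append((d_idx, best_gt, best_iou))
    return matches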

Metrics

We use the following metrics to compare OCR services:

  • Intersection over Union (IoU) – provides information about the overlap between two bounding boxes. It is calculated by dividing the area of intersection between the bounding boxes by the area of their union.
  • Accuracy – indicates the performance of the detection. It is calculated by dividing the number of matched bounding boxes by the total number of bounding boxes on a page.
  • Levenshtein distance – indicates the difference between the detected word and the ground truth. Its value is the number of single-character edits needed to turn one word into the other (see the sketch after this list).
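For completeness, a minimal sketch of the word-level metrics (in practice, an existing library implementation of the Levenshtein distance works just as well):

def levenshtein(a, b):
    # Number of single-character insertions, deletions, or substitutions
    # needed to turn string a into string b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + substitution_cost))  # substitution
        previous = current
    return previous[-1]

def page_accuracy(num_matched_boxes, num_total_boxes):
    # Fraction of bounding boxes on a page that were successfully matched.
    return num_matched_boxes / num_total_boxes if num_total_boxes else 0.0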

Providers and services

We wanted to provide a comparison between the three most widely used providers:

  • Microsoft Azure – cloud platform from Microsoft, it provides two OCR services:
      • Azure OCR – older service, presumably exists due to legacy reasons
      • Azure Read – newer service using state-of-the-art techniques
  • Google Cloud Platform
      • Cloud Vision API – only OCR service from Google, using state-of-the-art techniques
  • Amazon Web Services
      • Amazon Textract – only OCR service from Amazon, using state-of-the-art techniques

Experiments

The experiments were performed over three datasets to check the performance of the services on documents of varying quality. We used two datasets that are commonly used to benchmark Intelligent Document Processing tasks:

  • DocBank – The dataset is composed of documents created using LaTeX. It contains good-quality documents that we use to emulate digital documents (e.g., image-based PDF files).

Fig 1. Samples of documents from DocBank dataset.

  • FUNSD – The dataset contains noisy document scans; we use it to check the performance for poor-quality documents.

Fig 2. Samples of documents from FUNSD dataset.

From DocBank, we derived a synthetic dataset by scaling each image down and back up by a factor of 2 and introducing blur. We did this to emulate the scanning of physical documents.
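As a rough sketch of this degradation step using Pillow (the exact resampling filter and blur radius we used are not given here, so treat them as placeholders):

from PIL import Image, ImageFilter

def degrade(page):
    # Downscale and then upscale by a factor of 2 to lose fine detail,
    # then blur to emulate a low-quality scan of a physical document.
    width, height = page.size
    small = page.resize((width // 2, height // 2), Image.BILINEAR)
    restored = small.resize((width, height), Image.BILINEAR)
    return restored.filter(ImageFilter.GaussianBlur(radius=1))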

Fig 3. Comparison between real DocBank (left) and synthetically degraded DocBank dataset (right).

We refer to the datasets as:

  • Digital – DocBank
  • Degraded – DocBank with synthetically lowered quality
  • Noisy scan – FUNSD

Experiment 1 – Comparing services from Microsoft Azure

Microsoft Azure has two OCR services – Azure OCR and Azure Read. The former is an older service with an outdated model that does not perform as well as the latter, presumably still available to support legacy projects. Azure Read uses a state-of-the-art model for OCR, hence it should perform better. However, Microsoft does not share any insight into the difference in detection quality between these two services, so we decided to check whether upgrading is worth the effort.

The numerical results in Table 1 indicate that Azure Read outperforms Azure OCR for all datasets. The difference between the services increases as the quality of the input data decreases. The obtained results are in line with intuition – researchers are now focusing their efforts on developing networks that work well on data of increasingly poor quality.

Dataset     | Service    | Accuracy | Mean IoU | Levenshtein
------------|------------|----------|----------|------------
Digital     | Azure OCR  | 0.855    | 0.662    | 0.153
Digital     | Azure Read | 0.889    | 0.715    | 0.153
Degraded    | Azure OCR  | 0.852    | 0.679    | 0.186
Degraded    | Azure Read | 0.886    | 0.709    | 0.158
Noisy scan  | Azure OCR  | 0.575    | 0.647    | 0.584
Noisy scan  | Azure Read | 0.864    | 0.721    | 0.417

Table 1. Numerical comparison between performance of two Microsoft Azure OCR services. Azure Read outperforms the Azure OCR across all datasets, the difference between them increases with a decrease in the quality of the data.

The distribution of mean IoU is visualised in Fig 4. It presents information about how closely the detections match the ground truth. We can observe that Azure Read not only outperforms Azure OCR, but its results are also more consistent – its performance is more stable across a wide range of documents. In Fig 5 the distribution of the Levenshtein distance between detected words and the ground truth is visualised. This metric indicates the correctness of detections: the lower the distance, the closer the detection is to the ground truth. Again, Azure Read outperformed Azure OCR, especially for lower-quality documents. It is worth noting that Azure Read had a few outliers, but on average it performed much better.

Fig 4. The distribution of IoU for all files in a given dataset. Azure Read not only outperforms Azure OCR, but its results are also much more consistent.

Fig 5. The distribution of Levenshtein distance for all files in a given dataset. The difference between the two services increases with a decrease in data quality.

Experiment 2 – Comparing the providers

In the second experiment, we compared OCR services from the Big 3 providers: Azure Read from Microsoft Azure, Cloud Vision API from Google Cloud Platform and Amazon Textract from Amazon Web Services. We decided not to include Azure OCR as it does not use a state-of-the-art model to detect text and performs worse than Azure Read.

The numerical results for this experiment are gathered in Table 2. For digital and degraded documents none of the services clearly outperforms the others, although Cloud Vision API seems to work worse than the services from Amazon and Microsoft. For noisy scans, Amazon Textract turned out to be better in terms of the correctness of detected words; it also achieves moderately better accuracy and mean IoU.

Dataset     | Service                   | Accuracy | Mean IoU | Levenshtein
------------|---------------------------|----------|----------|------------
Digital     | Azure Read (Microsoft)    | 0.889    | 0.715    | 0.153
Digital     | Cloud Vision API (Google) | 0.876    | 0.754    | 0.384
Digital     | Amazon Textract (AWS)     | 0.877    | 0.724    | 0.150
Degraded    | Azure Read (Microsoft)    | 0.886    | 0.709    | 0.158
Degraded    | Cloud Vision API (Google) | 0.870    | 0.718    | 0.376
Degraded    | Amazon Textract (AWS)     | 0.872    | 0.740    | 0.184
Noisy scan  | Azure Read (Microsoft)    | 0.864    | 0.721    | 0.417
Noisy scan  | Cloud Vision API (Google) | 0.835    | 0.679    | 0.471
Noisy scan  | Amazon Textract (AWS)     | 0.870    | 0.751    | 0.370

Table 2. The numerical comparison between the performance of OCR services. Results are similar across all providers for digital and degraded datasets. For noisy scans, Amazon Textract outperforms Azure Read and Cloud Vision API for all metrics.

The distributions of IoU and Levenshtein distance are visualized in Fig. 6 and Fig. 7, respectively. Although Amazon Textract and Azure Read yield similar mean IoU results, the latter seems to be more consistent across documents. In terms of correctness, Cloud Vision API is lagging behind its competitors. Surprisingly, it yields the best mean IoU score and the worst Levenshtein distance for digital documents. This means that it locates words better than the other services, but makes more mistakes when reading them.

Fig 6. The distribution of IoU for all files in a given dataset.

Fig 7. The distribution of Levenshtein distance for all files in a given dataset.

Conclusions

We compared four Optical Character Recognition (OCR) services provided by the Big 3 cloud platforms: Microsoft, Google and Amazon. We used three document datasets (two unique, one synthesized) to analyse performance across a range of variability and emulate real-world situations. We found that the newer Azure Read, as expected, outperformed the older Azure OCR service, with the difference in performance increasing as document quality decreased. For consistently high-quality documents, a migration from the older Azure OCR to Azure Read might not be justified purely by performance, depending on the use case. It is also worth noting that there are other factors that might impact the decision, e.g., pricing and asynchronous processing.

In terms of the cross-vendor comparison, Amazon Textract moderately outperformed the other services on noisy scanned documents, both in terms of word transcription and word location. For good-quality documents, the differences between services were negligible and probably shouldn’t drive the decision to adopt. Instead, other factors such as the wider cloud platform services/ecosystem and pricing would be more important.