
AI software vendors routinely champion the superior accuracy of their tech vs. the equivalent human effort it seeks to replace. But what is “accuracy”? Likewise, when buyers ask “how accurate is the system at X?”, is that the correct question to consider?

As this article will explain, accuracy is not a single metric. In fact, accuracy is but one of four potential metrics.

The other three metrics are precision, recall and F1 score.

Each metric measures something different about the system’s performance. For this reason, it is often desirable to optimise, and therefore prioritise, one metric over the others. Which metric to optimise depends on the context and objectives of the system. Therefore – spoiler alert – asking “how accurate is the system at X?” is the wrong question to ponder! Whether the question, and any answer to it, is the right one depends on the context of the business problem.

It’s a tricky topic, but a critical one. This guide is written for non-technical readers, primarily lawyers at law firms, in-house legal teams and other legal service providers. That said, it should be useful for any business professional dabbling in artificial intelligence or data science (so we hope!).

You’ll learn:

  1. Why “accuracy” is not a single metric, but merely one of four useful metrics, and often the least insightful.
  2. What “accuracy”, “precision”, “recall” and “F1 score” mean, how they differ and what they tell us about a system’s performance.
  3. When to prioritise precision over recall and vice versa.
  4. How this applies to AI in legaltech, particularly AI-assisted contract review software such as Kira Systems, Seal, iManage Extract, Eigen Technologies, Luminance, eBrevia and Diligen, or their toolkit competitors such as Google Document Understanding AI.
  5. Why these performance measures are themselves only one variable in your ROI calculation.

But first, a non-legal but real-life example: AI-assisted tumour diagnosis for cancer detection. What follows is a breakdown of such a system (with dummy data), and why looks can be deceiving when it comes to a single, unexplained “% accuracy” claim.

The vendor’s product

A vendor offers an AI-powered tumour testing system. This system uses a type of supervised machine learning to build a classifier.

A classifier is an algorithm that learns how to detect whether something belongs to one class or another. In this case, whether a tumour scan is either an:

  1. Actual Positive, a malignant tumour (cancerous); or
  2. Actual Negative, a benign tumour (non-cancerous).

The AI learns this ability by training on (i.e. mathematically analysing) a dataset of tumour images. This dataset is known as the Training Dataset.

Each tumour image in the Training Dataset represents an input-output pair, i.e. an image of a tumour (input) plus a label (output). Each label – benign or malignant – was applied to each image by oncologists (cancer specialist doctors), and verified as true via further medical testing.

In other words, for each tumour image in the Training Dataset we know with certainty whether the tumour is benign or malignant: we simply look at the image’s label. This is the source of truth against which the system was trained.

The vendor then uses an additional, separate dataset to assess the system’s ability to replicate this behaviour, i.e. tumour classification. This further dataset is known as a Test Dataset. Like the Training Dataset, it is a collection of tumour images labelled by humans with their corresponding tumour classifications, i.e. benign or malignant. Unlike the Training Dataset, this data is held back and not used to train the system. A copy of the Test Dataset is fed into the system, but with the human-applied labels removed beforehand.

The sole purpose of the Test Dataset is to test the system. It’s essentially a blind test for the machine. After the system finishes classifying the unlabelled images from the Test Dataset, we compare the system’s classifications against the human-applied labels for the same images. In this way, we score the system, much like marking an exam paper – how many correct answers did it manage?
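For readers curious about the mechanics, here is a minimal sketch of that hold-out approach using scikit-learn. The data, features and variable names are invented purely for illustration; this is not the vendor’s pipeline.

```python
# A minimal sketch of a train/test split, assuming scikit-learn is installed.
# The data below is invented for illustration; it is not the vendor's.
from sklearn.model_selection import train_test_split

# Pretend each scan has been reduced to two numeric features, and each label
# is the oncologists' verified classification.
scans = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.7], [0.3, 0.5], [0.9, 0.1], [0.2, 0.6]]
labels = ["malignant", "benign", "malignant", "benign", "benign", "benign"]

# Hold back a third of the labelled data as the Test Dataset; the model never
# sees these examples during training.
X_train, X_test, y_train, y_test = train_test_split(
    scans, labels, test_size=1/3, random_state=42
)

# Training and blind-testing would then look roughly like:
#   model.fit(X_train, y_train)          # learn from the Training Dataset
#   predictions = model.predict(X_test)  # classify the held-back, unlabelled scans
#   ...and `predictions` is compared against y_test to score the system.
```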

Let’s explore these results.

The vendor’s claim

The vendor claims their system is “99% accurate”. Sounds great, but what does it mean? Let’s break this down with the aid of their data.

1. The Test Dataset

The “99%” figure is based on the system’s performance against the below Test Dataset. These are the actual labels, applied by human reviewers:

  • 1,000,000 tumours in total
  • 999,000 out of 1,000,000 are actually benign (Actual Negatives)
  • 1,000 out of 1,000,000 are actually malignant (Actual Positives)

This is the same Test Dataset we described above. To be clear, the Test Dataset is distinct from the much larger Training Dataset.

2. The vendor’s performance data

Having been fed the Test Dataset, the system produced the classifications (i.e. its exam results!) summarised in the grid below. The vendor’s “99% accurate” claim derives directly from these results:

                       Actually malignant    Actually benign
Predicted malignant    990                   9,990
Predicted benign       10                    989,010

A confusion (or confusing) matrix

The above table is a confusion matrix. A confusion matrix is used to describe the performance of a classifier on a set of test data for which the true values are known (i.e. the Actual Positives and Actual Negatives).

Two of the four cells (990 and 989,010) summarise the correct classifications made by the system; the other two (9,990 and 10) summarise the incorrect classifications made by the system.

Another way to represent these results is as follows:

                       Actually malignant        Actually benign
Predicted malignant    True Positive: 990        False Positive: 9,990
Predicted benign       False Negative: 10        True Negative: 989,010

As you can see we describe results in terms of the following:

  1. True Positive: tumour predicted malignant + actually malignant (i.e. Actual Positive)
  2. False Positive: tumour predicted malignant + actually benign (i.e. Actual Negative)
  3. True Negative: tumour predicted benign + actually benign (i.e. Actual Negative)
  4. False Negative: tumour predicted benign + actually malignant (i.e. Actual Positive)

Correct performance is when the system produces either True Positives (1) or True Negatives (3). Incorrect performance occurs when the system produces False Positives (2) or False Negatives (4).

In summary, we want the system to be correct at labelling positive results for cancer (malignant tumours) and negative results for cancer (benign tumours).

Anything else is incorrect.
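For the technically minded, here is a minimal sketch of how those four buckets are tallied from a list of actual labels and a list of system predictions. The example data is tiny and invented; it is not the vendor’s Test Dataset.

```python
# A minimal sketch of tallying a confusion matrix, with "malignant" as the
# positive class. The example data is invented, not the vendor's.
def confusion_counts(actual, predicted, positive="malignant"):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # True Positive: predicted malignant, actually malignant
        elif p == positive and a != positive:
            fp += 1   # False Positive: predicted malignant, actually benign
        elif p != positive and a != positive:
            tn += 1   # True Negative: predicted benign, actually benign
        else:
            fn += 1   # False Negative: predicted benign, actually malignant
    return tp, fp, tn, fn

actual    = ["malignant", "benign", "benign", "malignant", "benign"]
predicted = ["malignant", "malignant", "benign", "benign", "benign"]
print(confusion_counts(actual, predicted))  # (1, 1, 2, 1)
```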

3. So how does the system perform?

When AI vendors talk about “accuracy”, what they mean – or should mean – is the relationship and applicability of four metrics:

  1. Precision;
  2. Recall;
  3. Accuracy; and
  4. F1 Score.

Likewise, buyers and users of such systems should assess such systems in these terms.

Let’s break down each of these: first in general terms, then as applied to the cancer detection system, and lastly to legal scenarios.

3.1 Precision

Precision is the ratio of system generated results that correctly predicted positive observations (True Positives) to the system’s total predicted positive observations, both correct (True Positives) and incorrect (False Positives).

In other words, precision answers the following question:

How many of those tumours labelled by the system as malignant are actually malignant?

In formula the precision ratio is this:

Precision = True Positives / (True Positives + False Positives)

Or slightly simplified:

Precision = TP / (TP + FP)

Applied to our cancer example we get this result:

Precision = 990 / (990 + 9,990) = 990 / 10,980 = 0.09 = 9%

Ouch! 9% isn’t 99% and doesn’t sound great… but we will come back to this and explain why this isn’t a disaster nor evidence of a vendor falsehood.
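As a sanity check, the arithmetic can be reproduced in a couple of lines of Python using the dummy figures from the confusion matrix above:

```python
# Reproducing the precision figure from the vendor's (dummy) confusion matrix.
true_positives = 990      # predicted malignant, actually malignant
false_positives = 9_990   # predicted malignant, actually benign

precision = true_positives / (true_positives + false_positives)
print(f"{precision:.0%}")  # 9%
```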

3.2. Recall (aka Sensitivity)

Recall is the ratio of system generated results that correctly predicted positive observations (True Positives) to all observations in the actual malignant class (Actual Positives).

In other words, recall answers the following question:

Of all the tumours that are malignant, how many of those did the system correctly classify as malignant?

In formula the recall ratio is this:

Recall = True Positives / (True Positives + False Negatives)

Or slightly simplified:

Recall = TP / (TP + FN)

Applied to our cancer example we get this result:

Recall = 990 / (990 + 10) = 990 / 1,000 = 0.99 = 99%
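Again, the same calculation in Python, using the dummy figures from the confusion matrix above:

```python
# Reproducing the recall figure from the vendor's (dummy) confusion matrix.
true_positives = 990    # predicted malignant, actually malignant
false_negatives = 10    # predicted benign, actually malignant

recall = true_positives / (true_positives + false_negatives)
print(f"{recall:.0%}")  # 99%
```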

3.3. Accuracy

Accuracy is the most intuitive performance measure. It’s what most people are taught at school, often in isolation and without consideration of precision, recall and F1 score.

Accuracy is simply a ratio of the correctly predicted classifications (both True Positives + True Negatives) to the total Test Dataset.

In other words, accuracy answers the following question:

How many tumours did the system correctly classify (i.e. True Positives and True Negatives) out of all the tumours?

Accuracy is a great measure, but only when you have symmetric datasets. A dataset is symmetric when the split between Actual Negatives and Actual Positives approaches an even distribution. This is unlike our cancer example, where there are significantly more Actual Negatives than Actual Positives.

We’ll come back to this and explain why accuracy is potentially unhelpful for asymmetric datasets.

In formula the accuracy ratio is this:

Accuracy = (True Positives + True Negatives) / All classifications

Or slightly simplified:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Applied to our cancer example we get this result:

Accuracy = (990 + 989,010) / 1,000,000 = 0.99 = 99%
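And the same check for accuracy, again using the dummy figures:

```python
# Reproducing the accuracy figure from the vendor's (dummy) confusion matrix.
true_positives = 990        # correct "malignant" calls
true_negatives = 989_010    # correct "benign" calls
total = 1_000_000           # every tumour in the Test Dataset

accuracy = (true_positives + true_negatives) / total
print(f"{accuracy:.0%}")  # 99%
```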

3.4. F1 Score

The F1 Score is the harmonic mean of Precision and Recall. This score therefore takes both False Positives and False Negatives into account, striking a balance between Precision and Recall.

So what is the difference between F1 Score and Accuracy?

Returning to our earlier thread, an accuracy % can be disproportionately skewed by a large number of Actual Negatives. This is true of our cancer example, where the data is highly asymmetric, i.e. there is a large number of people who don’t have cancer. Hence the very high accuracy score of 99% (because accuracy assesses competence at identifying both Actual Positives and Actual Negatives) despite the low precision score of 9% (which assesses competence at identifying Actual Positives only).

To make this even clearer (hopefully), consider briefly this separate example:

[Confusion matrix for a second, made-up example: the system makes no False Positive errors, but misses half of the Actual Positives]

In this example, the system is 99.9% accurate. It’s also 100% precise. Sounds amazing. But its recall is only 50%. Hmm… what’s going on? A recall of 50% means the system missed half of the Actual Positives. But does that matter?

Well, it depends.

What if that missing 50% of Actual Positives represent terrorists or individuals carrying a zombie plague? The system starts to look pretty dire in this context. Just one False Negative and the cost could be gargantuan and irrecoverable.

And this is the crucial point: context is key.

In most business circumstances, we do not focus on True Negatives. Instead, we almost always care most about False Negatives and False Positives. This is because False Negatives and False Positives will have relative business costs (tangible and intangible).

Intuitively this makes sense – getting stuff wrong is costly. For this reason, we typically wish to minimise one over the other depending on which has the higher cost associated with it (more on this below).

Thus the F1 Score may be a better measure than accuracy when we need to strike a balance between Precision and Recall AND there is an uneven class distribution, e.g. a large number of Actual Negatives, as in both the mini example above and our cancer example. For completeness, the F1 Score for the mini example above is 67%.

In formula the F1 score ratio is this:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Returning to our cancer example we get this result:

F1 = 2 × (0.09 × 0.99) / (0.09 + 0.99) = 0.1782 / 1.08 = 0.165 = 16.5%
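Both F1 figures can be reproduced with a few lines of Python. (In practice, libraries such as scikit-learn provide ready-made precision_score, recall_score, accuracy_score and f1_score functions, so nobody computes these by hand.)

```python
# Reproducing the F1 scores quoted above from precision and recall.
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * (precision * recall) / (precision + recall)

# Cancer example: precision of roughly 9%, recall of 99%
print(f"{f1(990 / 10_980, 990 / 1_000):.1%}")  # 16.5%

# Mini example above: precision of 100%, recall of 50%
print(f"{f1(1.0, 0.5):.0%}")                   # 67%
```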

4. So what does all of the above mean?

Based on the above we can say the cancer diagnosis system has high recall and high accuracy, but low precision. If asked:

“What is the chance that a tumour identified by the system as malignant is in fact malignant?” – what would you say?

If you said 99%, you’d be wrong.

The answer is only 9%! But 9% sounds terrible… or does it? It means that the majority of tumours identified by the system as malignant will actually turn out to be benign. Surely the system is garbage?

Not so fast. Actually, the system is very good when considered in light of its business objective. This is because the system is designed to prioritise recall over precision. Another way of describing this is to say the system is deliberately over-inclusive, i.e. it statistically errs on the side of classifying a tumour as malignant even if it might, in fact, be benign.

But why might this make sense in the context of tumour classification? Let’s find out!

Which metric to prioritise depends on your business objective and the relative costs of False Positives vs. False Negatives. To reiterate the point: context is key.

When to prioritise Recall over Precision?

Recall should be optimised over precision when there is a high cost associated with a False Negative, i.e. the system predicts benign when the tumour is in fact malignant.

Our cancer detection scenario is a good example. If a patient’s tumour is actually malignant (Actual Positive) yet the system incorrectly predicted it as benign (False Negative), that patient may die if this means their cancer goes undetected and untreated.  There is a high – and in this case irreversible – cost to getting the diagnosis wrong (death). As such, we accept the lower cost of additional worry and further tests that more False Positives might entail before cancer is ruled out. In other words, we want to minimise False Negatives at the risk of increasing False Positives.

Applying this to Legal AI

When using AI-powered contract analysis software such as Kira Systems it is possible to adjust configurations and prioritise recall over precision and vice versa. Users will want to optimise recall over precision when using such tools for due diligence, i.e. to analyse contracts and flag clauses undesirable for their client, e.g. “indemnities” or “termination at will”.

This is because there is a high cost associated with a False Negative in such circumstances, i.e. the system failing to red flag such provisions, leading the lawyers to miss key information that adversely impacts their client’s position.

Another example is eDiscovery and predictive coding. There is a high cost associated with the system failing to identify something as responsive to the litigation when it is in fact responsive to the litigation (i.e. a False Negative). This is because the missed piece of evidence might be the smoking gun that wins the case. To avoid that, prioritise recall over precision and accept the lesser cost of having to wade through more False Positives (things labelled responsive to the litigation but in fact unresponsive).
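To make the trade-off concrete, here is a minimal sketch (with invented confidence scores and labels, not the output of any real product) showing how moving a flagging threshold shifts the balance between precision and recall:

```python
# A minimal sketch of the precision/recall trade-off. A review tool scores each
# document, and anything at or above the threshold is flagged as responsive.
# The scores and labels below are invented for illustration.
scores = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]             # model confidence
actually_responsive = [True, True, False, True, False, False]

def precision_recall(threshold):
    flagged = [s >= threshold for s in scores]
    tp = sum(f and a for f, a in zip(flagged, actually_responsive))
    fp = sum(f and not a for f, a in zip(flagged, actually_responsive))
    fn = sum(not f and a for f, a in zip(flagged, actually_responsive))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A high threshold favours precision; a low threshold favours recall.
for threshold in (0.9, 0.6, 0.2):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold}: precision {p:.0%}, recall {r:.0%}")
# threshold 0.9: precision 100%, recall 33%
# threshold 0.6: precision 67%, recall 67%
# threshold 0.2: precision 60%, recall 100%
```

Lowering the threshold catches more of the responsive documents (higher recall) at the cost of more False Positives to wade through (lower precision); raising it does the reverse, which is the precision-first configuration discussed next.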

When to prioritise Precision over Recall?

Precision should be optimised over recall when there is a high cost associated with a False Positive. A classic example is email spam detection.

In email spam detection, if an email is actually non-spam (Actual Negative) yet incorrectly predicted as spam (False Positive), that email is sent to the spam folder and / or possibly deleted without the addressee’s knowledge. If precision is too low for spam detection, the user may lose important emails and therefore we can say there is a high cost associated with a False Positive in this scenario.

Applying this to Legal AI

Returning to the AI-assisted contract analysis example, assume the software is instead being used to extract clauses for a clause library. The clause library will collate clauses verified by lawyers to be on market, that is clauses worded in a manner generally accepted by the legal market for that contract or transaction type.

Unlike in due diligence, here there is a high cost associated with a False Positive – a clause labelled as on market when it is in fact off-market.

The cost is high because a junior lawyer may rely on the off-market clause simply because it appears in the clause library. That could have adverse consequences, such as inserting a non-standard provision into a contract and inadvertently damaging the client’s position.

Accuracy, precision, recall and F1 score are critical to master. Nevertheless, they are only one variable in a much larger ROI calculation by which to assess AI solutions.

For this reason, it’s crucial to zoom out and ask this question:

In light of the performance metrics, correctly understood in their business context (e.g. whether you need to prioritise precision vs. recall or vice versa), does using the AI solution improve your ability to get stuff done, and done well?

Specifically, does using the AI solution improve:

  1. the quantity of usable outputs;
  2. the quality of usable outputs (including risk identification and minimisation where applicable, e.g. due diligence);
  3. the time to produce usable outputs (i.e. reduces it); or
  4. the cost to produce usable outputs (i.e. reduces it);

where “usable output” represents a satisfactory unit of work product for either: (a) the next step in a wider process; or (b) an internal or external stakeholder deliverable. If you can answer “yes” to these questions, the AI solution might be a goer. If not, best avoid.

For example, using AI-assisted contract analysis software reduces the time and, in a billable hour model, the cost. This may hold true even if the system is less than perfectly performant under the relevant performance metrics. The fact that the software can tackle only some tasks in the contract review (e.g. data extraction) but not others (e.g. the subsequent legal analysis) is still a benefit vs. the manual alternative, where every task happens by human hands alone. This often holds true even if the data extraction requires some level of second-pass human verification.

If you made it this far, give yourself a pat on the back. This topic isn’t a walk in the park. There are a lot of similar-sounding terms to take in, and even some basic maths. Hopefully, you now understand:

  1. Why “accuracy” is not a single metric, but one of four useful metrics
  2. What “accuracy”, “precision”, “recall” and “F1 score” mean and what they tell us about a system’s performance.
  3. When to prioritise precision over recall and vice versa.
  4. How this applies to AI in legaltech, particularly AI-assisted contract review software such as Kira Systems, Seal, iManage Extract, Eigen Technologies, Luminance, eBrevia and Diligen, or their toolkit competitors such as Google Document Understanding AI.
  5. Why these performance measures are themselves only one variable in your ROI calculation.

Remember: “accuracy” is in the eye of the beholder!

If you enjoyed this content, please share! Likewise feel free to get in touch if you have questions or comments