AI Dermatology: Part 2 - Granular Material

In the last post, I discussed the possible broader implications of Google’s recent foray into making an AI dermatology tool. In this follow-up post, I want to focus on the research behind the product announcement, bringing a slightly critical eye.

The main scientific paper associated with the proposed tool was published in Nature Medicine in May of 2020. (A second paper focuses on the consistency of diagnoses among non-experts when using this tool, and is of less relevance here.) The paper tries to evaluate how accurate the tool is for common dermatological conditions, in comparison to professional dermatologists and other medical professionals. To do so, the authors assemble a large dataset of images and patient information submitted to a teledermatology service. As is standard in machine learning, most of this data is used to train the model, and the remainder is used to evaluate it.

Generally speaking, the study seems very well done. Unfortunately, there are several factors which make it difficult to rigorously evaluate the claims and results. Paramount among these issues is that the data used for this study has not been released, so it is impossible to try to replicate the findings. Nor is the code available. Even in cases where the data is not available, having access to the code can still be a very useful way to discover the implementation details, many of which will never be reported in a paper. Here, however, the authors just point to the generic TensorFlow link as the framework they used to develop the model.

A broader issue, however, is that it’s actually surprisingly difficult to summarize the performance of a system in a domain such as dermatology using a single number. The authors are clearly aware of this, and so provide a range of analyses, trying to anticipate possible concerns. This is certainly useful, but not nearly as useful as having the data itself, which would allow a re-analysis.

Among the factors that make such an evaluation difficult are a) the set of conditions includes both very common but relatively benign conditions (e.g., acne), and relatively rare but extremely serious conditions (e.g., melanoma); b) we may care about certain conditions more than others; c) we may care more about certain types of errors (false positives vs false negatives); d) performance may vary across groups, especially in settings like this, where base rates are likely tied to skin color.

To further complicate matters, true gold standard labels are hard to come by in dermatology, and are in fact more or less absent from this particular study. The problem is that many conditions cannot be definitively diagnosed based purely on visual inspection, but most are not serious enough to justify an invasive diagnostic procedure, such as a biopsy (which involves removing a small piece of skin for inspection under the microscope). In fact, even determining how reliable dermatologists are is itself quite challenging.

The best source I could find on this is a Cochrane review focused specifically on melanoma (a type of skin cancer). It contains what I see as three key takeaways: 1) visual inspection alone is not good enough to be relied on for the diagnosis of melanoma; 2) the reliability of professional diagnosis from visual inspection is quite variable; 3) in-person diagnosis is much better than a diagnosis based on images alone.

In terms of how well Google’s system works as a diagnostic tool in comparison to humans, there are several reasons why it is hard to answer this seemingly simple question based on this paper.

A first issue is that performance will almost certainly vary across conditions. As such, the overall accuracy will depend on the prevalence of various conditions in the overall population. The sample of teledermatology cases used here is probably somewhat representative of this, at least for the United States, but the fact that the cases were sent to a teledermatology service at all suggests that they are perhaps atypical in some way.
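To make the dependence on prevalence concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration (none come from the paper), and the names overall_accuracy, mix_a, and mix_b are my own. The point is simply that the same per-condition accuracies produce quite different overall accuracies under two different mixes of cases.

```python
def overall_accuracy(per_condition_accuracy, prevalence):
    """Prevalence-weighted average of per-condition accuracy."""
    total = sum(prevalence.values())
    return sum(
        per_condition_accuracy[condition] * share / total
        for condition, share in prevalence.items()
    )


# Hypothetical per-condition accuracies (illustrative only, not from the paper)
accuracy = {"acne": 0.90, "melanoma": 0.20}

# Two hypothetical case mixes: one dominated by the common condition,
# one where the rare condition shows up more often
mix_a = {"acne": 0.95, "melanoma": 0.05}
mix_b = {"acne": 0.70, "melanoma": 0.30}

print(f"Overall accuracy under mix A: {overall_accuracy(accuracy, mix_a):.3f}")
print(f"Overall accuracy under mix B: {overall_accuracy(accuracy, mix_b):.3f}")
```

Nothing about the model changes between the two lines of output; only the population it is applied to does.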

Moreover, it seems reasonable to assume that even a representative sample would deviate from the distribution that might be observed if a system were to be rolled out as a phone-based app, as Google is proposing to do. As discussed in the last post, the ease and convenience of such a system might lead more people to request information about all kinds of marks on their skin, compared to those for which they would seek a professional medical opinion.

In addition, based on lessons learned from domains like face recognition, it seems plausible that the system might perform better on some patients than on others. In particular, skin color seems like a reasonable variable to investigate, given how badly some past work in computer vision has failed to take account of this. The authors here do make an effort to account for this, and report accuracy by group using the Fitzpatrick Skin Type scale (as well as by certain racial categories), showing relatively small differences between groups.

That being said, some concerns remain. First, there is still some variation across results for different skin types, in a way that suggests systematic variation, and the authors don’t discuss whether this is truly systematic or just noise. Second, there are extremely few data points from people with Skin Type I or VI, to the point where those assessments don’t seem robust. In fact, there was only a single case with Skin Type VI (darkest skin) included in the evaluation data, meaning there was effectively no meaningful evaluation of accuracy for that group.
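To get a feel for how little a handful of cases can tell us, here is a rough sketch using the Wilson score interval, a standard way of putting a confidence interval on a proportion. This is my own illustration, not an analysis from the paper, and the function name wilson_interval is just a label I chose.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# A single evaluation case, classified correctly or incorrectly
for correct in (1, 0):
    low, high = wilson_interval(correct, 1)
    print(f"{correct}/1 correct: interval from {low:.2f} to {high:.2f}")
```

With a single case, the interval runs from roughly 0.21 to 1.0 if the system got it right, and from roughly 0 to 0.79 if it got it wrong, which is another way of saying that one case tells us almost nothing about accuracy for that group.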

Third, and perhaps most importantly, there were unfortunately no gold standard labels available, in terms of definitive diagnoses. Because most skin conditions will only ever be diagnosed by visual inspection, the only target labels that were available for this study were based on the judgements of dermatologists. Given that decisions made by one group of dermatologists are being taken as the equivalent of gold standard labels, the fact that another group of expert dermatologists only gets around 70% accuracy is one indication that there is a lot of disagreement among experts.

Why should we take the judgments from the first group of dermatologists as gold labels, and the judgments of the evaluation group as less reliable? It seems to be primarily a matter of aggregation. For each case in the validation data, three diagnoses were obtained from a pool of fourteen dermatologists, and the results combined using a voting procedure. For the training data, labels were obtained from between 1 and 39 dermatologists. Although it would seem reasonable to augment these judgments with biopsy results when available, it is unclear if this was done.

Moreover, this study primarily evaluates on a subset of 26 conditions, grouping all others into a 27th category called “Other”. It seems that once they were most of the way through collecting annotations, the authors settled on the 26 most commonly appearing conditions, which accounted for about 80% of all cases, and took these as their primary set of targets. Using this set gave them at least 100 images for each condition in training and 25 in the evaluation data. The authors do provide results on a full set of over 400 conditions, but given that most of these will rarely if ever appear in the dataset, it’s not entirely credible to suggest that the system is truly being evaluated on all of these. My guess is that many of these would almost never be predicted by the system, and this is another place where having access to the raw data, or even some more comprehensive aggregate statistics, would be very useful.

To complicate matters further, the authors also chose to discard cases that were diagnosed with multiple conditions, as well as those that were not diagnosable, which led to the exclusion of 2% of images from the training data and almost 10% of images from the evaluation data. Although excluding these from the training data makes sense, excluding them from the evaluation data in some sense makes the task easier than it would be in the real world.

Indeed, a common issue with machine learning systems is that they are typically not very good at knowing when their own predictions are unreliable. As such, it would be extremely useful to know how this system would respond to the cases that the experts deemed to be non-diagnosable. Would it also reject these images, or would it just return the three most likely conditions according to its matching? (Presumably the latter.)

Finally, in evaluating the experts, it seems that some clinicians identified some cases as “contact dermatitis”, which was not specific enough, as it could refer to two different possible conditions. These were apparently converted to “other” and treated as mistakes, even if one of the two conditions would have been correct. This would seem to slightly punish the clinicians, though the authors comment that conclusions did not change when these conditions were excluded altogether.

Most of the above is basically proceeding as if all conditions were equally serious, but of course that is not the case. To some extent, this comes back again to the question of “what kind of system is this?” If the primary user base will be people who have some strange, transient, rash on their arm, and want to know what it is, then the above metrics may be quite relevant. However, it is hard to imagine that many users would not want to try to use this system to inspect various strange looking moles on their body and get an assessment as to whether or not they might be cancerous.

Here, the results are somewhat less encouraging. In the main paper, the authors do report the performance of their system on diagnosing benign vs malignant (a subset of categories), and show that it is again comparable to the experts. However, they only report this when using the top-3 accuracy (i.e., was the correct condition among its top 3 guesses). The authors themselves also point out that the top-3 number from experts is not an entirely fair comparison, as the experts sometimes failed to provide 3 guesses, even when prompted to do so.
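For readers unfamiliar with the metric, here is a small sketch of how top-k accuracy is typically computed. The cases and guesses below are invented for illustration and the function name top_k_accuracy is my own; this is not the authors’ evaluation code. Note how a rater who offers only one guess (the second case) gains nothing from moving to k=3.

```python
def top_k_accuracy(ranked_guesses, true_labels, k=3):
    """Fraction of cases whose true label is among the first k guesses."""
    hits = sum(
        1
        for guesses, truth in zip(ranked_guesses, true_labels)
        if truth in guesses[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked guesses for three cases (illustrative only)
guesses = [
    ["melanocytic nevus", "melanoma", "seborrheic keratosis"],
    ["acne"],  # only one guess was provided for this case
    ["eczema", "psoriasis", "tinea versicolor"],
]
truth = ["melanoma", "rosacea", "eczema"]

print(f"top-1 accuracy: {top_k_accuracy(guesses, truth, k=1):.2f}")
print(f"top-3 accuracy: {top_k_accuracy(guesses, truth, k=3):.2f}")
```

In this toy example the first case is a miss at top-1 but a hit at top-3, which is exactly the kind of gap that makes the choice of k matter for a condition like melanoma.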

The authors don’t specifically report the top-1 accuracy by condition, but looking at Extended Data Fig. 1(b), we can see that the system tends to have the greatest improvement over humans (in terms of sensitivity) on conditions that affect pigmentation, such as Melanocytic Nevus and Tinea Versicolor. (To be clear, I have no particular knowledge of dermatology; I am basing my claim that these conditions affect pigmentation on information in Wikipedia.) By contrast, the category where the system does least well in comparison to experts is melanoma, where the top-1 sensitivity of the system is only about 20%, meaning it would miss 80% of such cases (though of course the sensitivity is much higher for the top-3 prediction). The experts in this study still only had a top-1 sensitivity of around 40%, so this is clearly a difficult task, but the gap is still concerning.

In addition, I hate to pick on such small things, but it is unfortunate to see that there also seem to be minor arithmetical errors. For example, Table 1 lists 16,530 total cases, with 142 and 271 cases excluded (for having multiple conditions and being non-diagnosable, respectively), which should leave 16,117, and yet the table says only 16,114 cases were included in the study (and similarly for the validation data). Perhaps there was another reason why an additional 3 cases were excluded, but I don’t see this made explicit anywhere.
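The check is simple enough to spell out. The figures below are the ones quoted above from Table 1; the variable names are mine.

```python
# Training-set figures as quoted from Table 1 of the published paper
total_cases = 16_530
excluded_multiple_conditions = 142
excluded_non_diagnosable = 271
reported_included = 16_114

# What the table should report if these were the only exclusions
expected_included = (
    total_cases - excluded_multiple_conditions - excluded_non_diagnosable
)
unexplained = expected_included - reported_included

print(f"Expected included cases: {expected_included:,}")
print(f"Reported included cases: {reported_included:,}")
print(f"Unexplained difference:  {unexplained}")
```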

Finally, one minor oddity that deserves comment is that there are actually two versions of the paper. There is the one that was officially published in Nature Medicine, and there is a preprint on arXiv. Putting a preprint on arXiv is quite common in machine learning, but typically authors update the arXiv paper with the final version once it is published (although the original remains accessible). Here, for some reason, the authors haven’t done so. Comparing the arXiv version to the published version, the papers are nearly identical in terms of written content, but some of the numbers are strangely different.

Most of the changes are quite small, although some are curious. For example, in the arXiv version, the full list of conditions includes 421 conditions. In the published version “Dermatitis of anogenital region” and “Polymorphous light eruption” were apparently dropped, bringing the total down to 419. There is also a slight change to the number of cases (including those that were excluded), with one being added to the evaluation set and 9 being dropped from the full training set. It’s entirely possible that these were simply errors that got fixed, although it is still somewhat strange, especially given the minor arithmetical errors in the final published table.

There are also a few larger changes, however. First, the number of cases excluded due to having multiple conditions dropped from 1394 to 142 (in the training data). Similarly, 1124 were originally listed as non-diagnosable, and this dropped to 271 in the published version. As a result, the number of cases and images included in the usable training set increased considerably, from 14,021 cases with 56,134 images to 16,114 cases with 64,837 images. It also appears that the number of raters dropped from 39 to 38, so perhaps one bad rater was found and excluded, leading to the inclusion of more data? However, if this is the case, it seems like it should have been explained in the methodology.
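As a rough sanity check on these numbers, we can put the two versions side by side. The sketch below rests on one assumption of mine: that the 9 cases dropped between versions came out of the training total, so that the arXiv total was 16,530 + 9. That total is my inference from the figures quoted in this post, not a number I am quoting directly.

```python
# Published version (training data), as quoted in this post
published_total = 16_530
published_included = 16_114
published_excluded = 142 + 271

# arXiv version; the total is an ASSUMPTION (published total plus the 9 dropped cases)
arxiv_total = published_total + 9
arxiv_included_reported = 14_021
arxiv_excluded = 1_394 + 1_124

# Internal consistency of each version: total minus exclusions vs reported included
arxiv_gap = (arxiv_total - arxiv_excluded) - arxiv_included_reported
published_gap = (published_total - published_excluded) - published_included

# Size of the change between versions
extra_cases = published_included - arxiv_included_reported
extra_images = 64_837 - 56_134

print(f"arXiv gap: {arxiv_gap}, published gap: {published_gap}")
print(f"Additional cases: {extra_cases:,}, additional images: {extra_images:,}")
```

Under that assumption, the arXiv numbers add up exactly, whereas the published ones are off by three, and the usable training set grew by about two thousand cases between versions.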

These changes don’t seem to have had much impact on the system’s performance, although top-1 accuracy and top-1 sensitivity do seem to have fallen slightly in some evaluations. The only place where the results seem significantly different is in the evaluation of the image-only model (i.e., a model trained without the metadata), which seems to have improved dramatically for cases with large numbers of images.

The real concern, however, is more about the principle. The paper on arXiv includes evaluations on all the held out data. If the authors then went back and changed their training data to make new models and redo evaluations before final publication, this means that those evaluation data weren’t truly held out in the same way. It seems to have not made much difference in this case, but it raises the question of how carefully the data split was respected, and whether there were any other preliminary evaluations on the evaluation data before choosing the final model parameters. It goes without saying that evaluation data can only be counted on to provide a reliable estimate of performance if it is not used until after all choices have been finalized, and that does not seem to be the case with the published version of this paper.

In the end, as I emphasized in the previous post, it is entirely possible that this system will lead to real benefits in terms of morbidity and mortality, and it could even lead to dramatic changes in the medical system more broadly (though the full effects remain to be seen). Similarly, we should celebrate the fact that the team behind this system is carrying out such high quality research and doing a careful job of evaluating their system, including across dimensions such as skin type. However, there is always room for improvement in science, and a few things stand out here, including the difficulty of obtaining truly reliable gold standard labels, and questions about how things might have been different with different choices.

It also seems virtually certain that any commercial system will be based on a different or expanded training regime, and would likely be updated over time, meaning that the specific findings in this paper won’t necessarily tell us all we need to know about eventual performance. Nevertheless, it is unfortunate that the authors were unable to release their data in this case, or at least their code, which would have allowed for a more careful assessment of the research that has been published here.