Midway through last year Google announced a new foray into the medical technology space, sharing that it was developing an “AI-powered dermatology assist tool”—a phone-based app that would allow users to take photos of skin lesions and retrieve information about relevant medical conditions from the web. Similar apps already exist, but it’s fair to say that a comparable effort by Google is likely to have much more significant effects on how people interact with the medical system, their personal data, and even their own bodies.
In a follow-up post I’ll dig into the main scientific publication that this app is based on, and ask what we can conclude about how accurate it is likely to be, but here I want to start by discussing broader implications. A key point I want to emphasize, which I think is broadly true for a lot of things happening in the AI space, is that the likely effects of such a system have far less to do with how accurate it is compared to other systems, and far more to do with things related to affordances, perceptions, and power.
Medicine has long been a testbed and area of promise for AI (including expert systems like MYCIN from decades ago), but widespread deployment of actual medical AI systems has been relatively rare, at least in part due to issues related to regulation, fragmentation, liability, and vested interests. Nevertheless, it is surprising in many ways that there isn’t yet a massively popular phone-based app for diagnosing skin conditions.1 As Google points out, there is huge demand for such a service, based on how often people search for related information using traditional search engines. Moreover, websites which provide medical information, such as WebMD, represent a massively influential part of today’s biopolitical infrastructure—both enabling and encouraging individuals to proactively monitor their state of health, and funneling them into the more traditional medical system. A phone-based app which provides similar information using a more convenient interface seems likely to be popular.
Moreover, much like radiology, dermatology intuitively seems like it should be particularly amenable to automation; medical professionals make their initial diagnoses of skin conditions primarily based on visual inspection, so images should (at least in theory) be able to capture most of the relevant information. A key difference, however, is that whereas radiology requires hugely expensive technology such as MRI machines, the information relevant to dermatological diagnosis appears to lie literally on the surface. Relying on little more than the bright light on their smartphones, professional dermatologists give us a confident yes or no about serious health questions, based on just a brief visual inspection. If there is one area where we might seem justified in hoping for the equivalent of a tricorder (the perpetual dream of a perfectly reliable, all-purpose, non-invasive handheld medical diagnostic tool), preliminary diagnosis of skin conditions would seem to be it.
On the other hand, as in many areas of medicine, the science behind dermatology seems imperfect at best. We can diagnose skin conditions very reliably after a biopsy (a small excision of skin for inspection under a microscope), but correctly determining whether a mole is cancerous remains extremely difficult without removing tissue from the body. One would hope that there would be good data on how accurate dermatologists are in their diagnoses, but unfortunately such data seems to be lacking. In fact, the analysis of Google’s system in part helps to shed some light on the reliability of dermatologists more broadly, which I’ll discuss more in my next post.
On the technological side, Google’s proposed system seems like a fairly straightforward application of machine learning. A user would take a few photos of some part of their skin and upload them, along with other personal health information. The app would then process this information through a classifier, and return a score for each of the conditions that it has been trained to recognize.2
According to the press release, this system would then provide expert information and web-based search results for the highest scoring conditions (those that the system thinks are most probable). As with other apps in this space, Google is clear that they do not consider this tool to be providing a diagnosis, but rather something more like a specialized search engine (though the boundary between these two concepts seems somewhat blurry).
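The pipeline described above can be sketched in a few lines. This is a toy illustration, not Google’s implementation: the condition names, logit values, and the simple softmax-plus-ranking step are all hypothetical stand-ins for whatever the real trained model produces.

```python
import math

# Hypothetical condition labels; the real system reportedly covers 288.
CONDITIONS = ["melanoma", "psoriasis", "eczema", "acne"]

def softmax(logits):
    """Convert raw classifier outputs into scores that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_conditions(logits, k=3):
    """Return the k highest-scoring conditions, as the app is described to do."""
    scores = softmax(logits)
    ranked = sorted(zip(CONDITIONS, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Stand-in logits; in the real app these would come from a trained image
# model applied to the user's photos plus other health information.
example_logits = [2.1, 0.3, 1.2, -0.5]
print(top_conditions(example_logits, k=2))
```

The interesting design question is entirely in the last step: whether those top-k scores are presented as "matching conditions" for further search, or as something that reads to the user like a diagnosis.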
Whether it is officially providing a diagnosis or not, it seems likely that the affordances of the system will ultimately determine how people come to think of it, and how it is used. While many of us are likely reluctant to get an expert opinion on every strange-looking mark on our bodies—both because of the time and cost involved, and because of structural features that discourage us from asking too many such questions—the prospect that an app on our phone could take a look for us, and tell us whether or not we need to worry, might be too tempting to resist. Moreover, something that we can use in the privacy and comfort of our own homes has the potential to create a much more “user friendly” experience.
When it comes to medicine, however, that is clearly a double-edged sword. There are obviously many people who would benefit from being more proactive in seeking medical advice, but over-diagnosis is also a major concern, including in dermatology. Nearly all medical interventions involve some cost (both in terms of cost to the system, and risk to the individual), and in other areas we are seeing active efforts to reduce the amount of testing that is done, especially in cases where the cost of unnecessary interventions outweighs the potential benefit of positively identifying more cases. Having a phone-based app literally at one’s fingertips, especially one calling itself “AI” and branded with the clout of a name like Google, is almost certain to change people’s behaviour with respect to their own health practices.
Moreover, even if the app were to have identical or better performance than general practitioners at identifying potentially dangerous conditions for referral to a dermatologist, this could still lead to a massive increase in the demand for services, as far more people will be attempting self-diagnoses. This in turn could increase the amount of excess testing and treatment, much as demand from patients has contributed to a massive over-prescription of antibiotics. For those who obsess over such things, there may even be a morbid addictiveness to such an app, which will allow users to retrieve endless examples of similar lesions on other people’s bodies.
There will no doubt be many cases of people who are prompted by such an app to seek professional care, and who are then eventually diagnosed and successfully treated. Such individuals will understandably end up feeling like the app has saved their life (indeed, the website for SkinVision provides several such testimonials), but the counterfactual will be impossible to evaluate, including the possibility that the disease would have been caught through more traditional channels. More importantly, it will be far less clear what costs will be borne by those who end up spending time, money, and attention on unnecessary or ineffective diagnosis or treatment.
In principle, the greatest potential benefit could be to those who have limited access to the health care system, whether due to location, mobility, wealth, or other reasons; for them, a free phone-based app that can give an accurate preliminary diagnosis seems like a huge potential benefit. Again, however, for the most serious conditions, this may not matter much unless people are also able to access more rigorous diagnosis and professional treatment. In some cases, getting a recommendation to seek an expert opinion might only increase the anxiety experienced by someone who is unable to get one, and could even lead them to seek out unnecessary or ineffective treatments outside the mainstream medical system.
While the overall effects on people’s health will likely be complicated, the effects in terms of personal data seem fairly predictable. As with other technologies for self-monitoring, such as activity and sleep trackers, this kind of app seems likely to further normalize the idea that we can and should share our most personal information with the broader corporate-technological ecosystem. We already do this in numerous ways of course; people type questions into Google’s search engine that they would never ask another human being, and many of us have invited always-on recording devices to live in our homes. Those systems which do successfully convince us to begin actively sharing our personal data typically do so in a way that gives the impression of a closed world, where we somehow expect things to remain private, even as we know that our data is being aggregated and analyzed en masse in centralized systems.
Google has of course insisted that they will not sell people’s information from this app, or use it to target them with ads. The data-for-services model is familiar enough today that no one will be terribly surprised by Google’s plan to make it freely available to all users (unlike some of its competitors). More than ever, however, there is reason to think that there is potentially enormous value in the data that could be gathered through such a system, independent of individual ad targeting.
It has long been known that more and better data is key to developing more accurate machine learning models; mostly, however, the focus has been on collecting labeled data, which in this case would mean expert diagnoses, or, ideally, biopsy results. Work over the past few years, however, has revealed that having massive amounts of relevant but unlabeled data can be incredibly valuable—as discussed in the recent Foundation Models paper—as such data can be used to pretrain massive models (such as GPT-3) which can then be adapted to many purposes.
Having millions of people upload images of skin lesions, even without knowing what conditions they truly represent, would allow the aggregator of that data to train a massive dermatological foundation model. Based on how well this approach works in other domains, such as text and images more broadly, we would expect that such a model could then be fine-tuned to particular applications (such as particular skin conditions) using only a small amount of labeled data, in a way that would be far more accurate than using only the labeled data. Arguably, whoever is the first to achieve widespread adoption in a particular market, such as dermatology, could attain a permanent advantage, as the data they collect in the process would then enable them to create a far more accurate system, which would then be even more widely adopted, providing yet more data for the next iteration.
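The pretrain-then-fine-tune pattern described above can be illustrated with a deliberately simplified sketch. Everything here is hypothetical: real foundation models learn vastly richer representations than the feature normalization used below, and a real classifier would not be nearest-centroid. But the data flow is the point: a large pool of unlabeled examples shapes the representation first, and only a handful of labeled examples are needed afterwards.

```python
# Toy pretrain-then-fine-tune sketch. All names and data are hypothetical
# stand-ins; only the two-stage data flow mirrors the argument in the text.

def pretrain(unlabeled):
    """'Pretrain' by learning per-feature means and scales from unlabeled data."""
    n, dims = len(unlabeled), len(unlabeled[0])
    means = [sum(x[d] for x in unlabeled) / n for d in range(dims)]
    scales = [max(1e-9, (sum((x[d] - means[d]) ** 2 for x in unlabeled) / n) ** 0.5)
              for d in range(dims)]
    return means, scales

def embed(x, means, scales):
    """Map a raw input into the 'pretrained' normalized feature space."""
    return [(xi - m) / s for xi, m, s in zip(x, means, scales)]

def fine_tune(labeled, means, scales):
    """Fit a nearest-centroid classifier using only a few labeled examples."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(embed(x, means, scales))
    return {y: [sum(col) / len(pts) for col in zip(*pts)]
            for y, pts in groups.items()}

def predict(x, centroids, means, scales):
    z = embed(x, means, scales)
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(z, centroids[y])))

# Plenty of unlabeled examples (here, 2-D feature vectors standing in for
# uploaded images), but only two labeled ones.
unlabeled = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
             [5.0, 5.1], [4.9, 5.2], [5.1, 4.9]]
labeled = [([0.0, 0.0], "benign-looking"), ([5.0, 5.0], "needs-referral")]

means, scales = pretrain(unlabeled)
centroids = fine_tune(labeled, means, scales)
print(predict([0.15, 0.1], centroids, means, scales))
```

The asymmetry in the example is the business point: the unlabeled pool (which users supply for free) does most of the work, while the expensive labeled data (expert diagnoses, biopsy results) is only needed in small quantities at the end.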
I have no way of knowing if this is part of Google’s plans, but it would in many ways be a logical step for this domain, and represents a business model that we are likely to see deployed more and more in the future—getting users to do the work of generating relevant data for a particular domain, even if it is unlabeled, and then using it to bootstrap a system that will attract even more users.
Time will tell whether such a system will catch on, or if it will even be released. Although Google notes that it has been “CE marked as a Class I medical device in the EU”, a source quoted in WIRED explains that this is basically just a type of reporting and self-certification. There are clearly some other products pushing the personal data frontier which have (so far) been rejected by the public, such as glasses with embedded cameras. However, a phone-based app providing in-demand information, which people are able to use in private without having to reveal to others that they are doing so, may end up having far greater uptake.
What this means for the larger medical establishment is harder to predict. One can certainly imagine a world in which this work leads to a system which provides earlier and better diagnoses of life-threatening conditions, and overall reduces mortality and morbidity from skin conditions. On the other hand, a positive diagnosis is of little value without treatment, and numerous barriers remain in providing affordable access to health care for most of the world, a problem which is only partly related to an inadequate number of dermatologists. More generally, we can expect the medical establishment to try to ensure they maintain some control over revenue streams related to such conditions, but it is hard to imagine that they will be able to prevent Google or some competitor from garnering some part of it in the future.
Over the long term, it seems highly likely that a machine learning system will eventually be considerably more reliable at preliminary diagnosis than general practitioners across a wide range of skin types and conditions, but that there will nevertheless be a relatively low ceiling on how much can be predicted based on purely non-invasive techniques. In addition, one can imagine that a similar sort of tool will eventually become integrated into professional diagnoses, as part of standard practice, despite the many difficulties with integrating AI systems into human decision making.
The irony, of course, is that the broader medical system already likely has all the data required to make a much more accurate system (including extensive patient data tied to actual outcomes), but is incapable of acting in a coordinated fashion, both because of the fragmentary nature of the existing system, and because of real concerns about privacy (and how we treat medical data more broadly). In the end, it seems possible that a specialized phone-based search tool that categorically does not provide a diagnosis is the thing that ultimately changes the status quo and leads to greater consolidation of medical data, though whether this leads to a net benefit or not remains to be seen.
Coming up next: Exploring the evidence
The press release suggests the final system will include 288 conditions; in the accompanying paper, they trained a system on 419 conditions, but only evaluated it on the most common 26. More details on this coming in Part 2.