Journal of Investigative Dermatology, 2025 (SCI-Expanded)
In this study, we introduce MILK10k, a multimodal image dataset designed to enhance machine learning and artificial intelligence–driven applications for diagnosing skin lesions suspected to be malignant neoplasms. MILK10k includes both close-up and dermatoscopy images and builds on existing datasets with several enhancements. It extends disease coverage to 48 diagnoses matched to the International Skin Imaging Collaboration-Designated Diagnoses, encompassing neoplastic and non-neoplastic as well as pigmented and nonpigmented skin lesions. In addition, MILK10k incorporates images of rare skin diseases and images from individuals with diverse skin tones. The dataset includes 10,480 images from 5240 cases, retrospectively collected across 5 centers. Of these cases, 95.7% (n = 5016) were biopsied or excised, with histopathology serving as the ground truth. Accompanying metadata include age, sex, skin tone, anatomic site, and diagnosis at varying levels of granularity. In addition to the dataset, we provide results from a machine-learning pipeline that evaluates images on the basis of common, human-interpretable concepts such as pigmentation, ulceration, hair, and skin markings. Furthermore, we provide a benchmark test set of 948 multimodal images from the same sources and an online tool for assessing key metrics on this set, creating a platform for evaluating machine-learning models and for conducting human reader studies.