User:Datakeeper/pageskeleton

List of datasets for machine learning research

This is a list of noteworthy datasets for machine learning research. It is limited to high-quality datasets that have been used in peer-reviewed publications such as academic journals. The list is not exhaustive.

Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce. This list draws on multiple data repositories and aggregates high-quality datasets that have been shown to be of value to the machine learning research community, in order to provide greater coverage of the topic than is otherwise available.

Image data
Datasets consisting primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.

Text data
Datasets consisting primarily of text for tasks such as sentiment analysis, translation, and cluster analysis.

Sound data
Datasets of sounds and sound features.

Signal data
Datasets containing electrical signal information that requires signal processing for further analysis.

Multivariate data
Datasets consisting of rows of observations and columns of attributes characterizing those observations. They are typically used for regression analysis or classification, but other types of algorithms can also be used.
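
The row-and-column layout described above can be illustrated with a minimal sketch in Python. The table, the attribute names (height_cm, weight_kg), and the helper functions column and fit_line are hypothetical and are not taken from any dataset in this list. The sketch assumes a simple ordinary least squares fit on a single attribute as one example of regression analysis.

```python
from statistics import mean

# Hypothetical toy table: each row is one observation,
# each column is one attribute of that observation.
COLUMNS = ["height_cm", "weight_kg"]
ROWS = [
    [150.0, 50.0],
    [160.0, 56.0],
    [170.0, 62.0],
    [180.0, 68.0],
]


def column(rows, columns, name):
    """Return the values of the named attribute, one per observation."""
    idx = columns.index(name)
    return [row[idx] for row in rows]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Regress one attribute (weight) on another (height).
heights = column(ROWS, COLUMNS, "height_cm")
weights = column(ROWS, COLUMNS, "weight_kg")
slope, intercept = fit_line(heights, weights)
print(f"weight_kg = {slope:.3f} * height_cm + {intercept:.3f}")
```

For classification, one column would instead hold a categorical label and the remaining columns would serve as input attributes.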