Data-driven astronomy

Data-driven astronomy (DDA) refers to the use of data science in astronomy. Outputs of telescopic observations and sky surveys are analyzed, filtered, and normalized using data mining and big data management techniques. The resulting data sets are then used for classification, prediction, and anomaly detection by means of advanced statistical approaches, digital image processing, and machine learning. Astronomers and space scientists use the output of these processes to study and identify patterns, anomalies, and movements in outer space, and to draw conclusions about theories and discoveries in the cosmos.

History
In 2007, the Galaxy Zoo project was launched for morphological classification of a large number of galaxies. The project considered 900,000 images taken by the Sloan Digital Sky Survey (SDSS) over the preceding seven years. The task was to study each picture of a galaxy, classify it as elliptical or spiral, and determine whether it was spinning or not. A team of astrophysicists led by Kevin Schawinski at Oxford University was in charge of the project, and Schawinski and his colleague Chris Lintott estimated that it would take such a team 3–5 years to complete the work. They then came up with the idea of using machine learning and data science techniques to analyze and classify the images.

Methodology
The data retrieved from the sky surveys first undergo preprocessing, in which redundancies are removed and the data are filtered. Feature extraction is then performed on the filtered data set, and the extracted features are passed on to the subsequent analysis steps. Some of the renowned sky surveys are listed below:


 * The Palomar Digital Sky Survey (DPOSS)
 * The Two-Micron All Sky Survey (2MASS)
 * Green Bank Telescope (GBT)
 * The Galaxy Evolution Explorer (GALEX)
 * The Sloan Digital Sky Survey (SDSS)
 * SkyMapper Southern Sky Survey (SMSS)
 * The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS)
 * The Large Synoptic Survey Telescope (LSST)
 * The Square Kilometer Array (SKA)

The size of the data from the above-mentioned sky surveys ranges from 3 TB to almost 4.6 EB. The data mining tasks involved in managing and manipulating these data include classification, regression, clustering, anomaly detection, and time-series analysis. Several approaches and applications exist for each of these methods.
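
The preprocessing step described at the start of this section (removing redundancies, filtering, and normalizing) might look roughly like the following minimal sketch. The catalog layout, the object identifiers, and the function name `preprocess` are illustrative assumptions and do not come from any particular survey pipeline.

```python
def preprocess(records):
    """Drop duplicate and incomplete rows, then min-max scale each feature.

    `records` is a hypothetical list of (object_id, features) pairs.
    """
    seen = set()
    cleaned = []
    for obj_id, features in records:
        # Skip repeated object IDs and rows with missing measurements.
        if obj_id in seen or any(v is None for v in features):
            continue
        seen.add(obj_id)
        cleaned.append((obj_id, list(features)))
    if not cleaned:
        return []

    n_features = len(cleaned[0][1])
    lows = [min(f[j] for _, f in cleaned) for j in range(n_features)]
    highs = [max(f[j] for _, f in cleaned) for j in range(n_features)]

    scaled = []
    for obj_id, f in cleaned:
        # Map each feature to [0, 1]; constant features become 0.0.
        row = [
            (f[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ]
        scaled.append((obj_id, row))
    return scaled


# Toy catalog (made-up values): (magnitude, color index) per object.
catalog = [
    ("obj1", (18.0, 0.5)),
    ("obj2", (20.0, 1.5)),
    ("obj1", (18.0, 0.5)),   # duplicate entry
    ("obj3", (None, 1.0)),   # missing magnitude
    ("obj4", (22.0, 1.0)),
]
clean_catalog = preprocess(catalog)
```

In this sketch the duplicate and the incomplete row are discarded before scaling, so they do not influence the feature ranges.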

Classification
Classification is used to identify and categorize astronomical data, for example in spectral classification, photometric classification, morphological classification, and classification of solar activity. Classification approaches are listed below:


 * Artificial neural network (ANN)
 * Support vector machine (SVM)
 * Learning vector quantization (LVQ)
 * Decision tree
 * Random forest
 * k-nearest neighbors
 * Naïve Bayesian networks
 * Radial basis function network
 * Gaussian process
 * Decision table
 * Alternating decision tree (ADTree)
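
As an illustration of one approach from the list above, the following is a minimal sketch of a k-nearest neighbors classifier applied to a morphological classification task. The two features (imagined here as a color index and a light-concentration measure) and all numeric values are made-up assumptions for demonstration only.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled galaxies: ((feature_1, feature_2), morphology).
training_set = [
    ((0.9, 0.8), "elliptical"),
    ((1.0, 0.9), "elliptical"),
    ((0.8, 0.85), "elliptical"),
    ((0.3, 0.2), "spiral"),
    ((0.4, 0.3), "spiral"),
    ((0.35, 0.25), "spiral"),
]

predicted = knn_predict(training_set, (0.38, 0.28))
```

The query point lies among the "spiral" examples, so its three nearest neighbors all vote for that class.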

Regression
Regression is used to make predictions from the retrieved data through statistical trends and statistical modeling. Applications of this technique include estimating photometric redshifts and measuring physical parameters of stars. The approaches are listed below:


 * Artificial neural network (ANN)
 * Support vector regression (SVR)
 * Decision tree
 * Random forest
 * k-nearest neighbors regression
 * Kernel regression
 * Principal component regression (PCR)
 * Gaussian process
 * Least squares regression (LSR)
 * Partial least squares regression
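
The simplest of the approaches listed above, least squares regression, can be sketched for a single input feature as follows. The data are synthetic: a hypothetical color measurement is used to predict a redshift-like quantity, and the exact linear relation is an assumption chosen so that the fit is easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms of the closed-form solution.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic training data lying on the line y = 0.2 * x + 0.1.
colors = [0.0, 1.0, 2.0, 3.0]
redshifts = [0.1, 0.3, 0.5, 0.7]

slope, intercept = fit_line(colors, redshifts)
estimate = slope * 1.5 + intercept  # prediction for a new object
```

Real photometric redshift estimation typically uses several magnitudes or colors as inputs, which is where the multivariate methods in the list come in.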

Clustering
Clustering groups objects based on a similarity measure. It is used in astronomy for classification as well as for detection of special or rare objects. The approaches are listed below:


 * Principal component analysis (PCA)
 * DBSCAN
 * k-means clustering
 * OPTICS
 * Cobweb model
 * Self-organizing map (SOM)
 * Expectation Maximization
 * Hierarchical clustering
 * AutoClass
 * Gaussian Mixture Modeling (GMM)
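
To illustrate how clustering groups objects by similarity, here is a minimal sketch of k-means clustering (Lloyd's algorithm) using Euclidean distance. The starting centroids are passed in explicitly to keep the example deterministic; practical implementations usually choose them randomly or with a seeding heuristic. The points are made-up two-dimensional features.

```python
import math


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Two well-separated groups of hypothetical objects.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
    (5.0, 5.0), (5.2, 4.8), (4.8, 5.2),
]
labels, centers = k_means(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
```

Each point ends up with the label of the group it sits in, and the centroids settle near the two group means.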

Anomaly detection
Anomaly detection is used for detecting irregularities in a data set. In astronomy, the technique is used to detect rare or special objects. The following approaches are used:


 * Principal component analysis (PCA)
 * k-means clustering
 * Expectation Maximization
 * Hierarchical clustering
 * One-class SVM
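
The general idea behind these approaches, scoring each object by how far it lies from the bulk of the data, can be sketched with a much simpler distance-based rule. Note that this is not an implementation of any method in the list above: it is a simplified stand-in that measures distance from a single median center, and the threshold factor of 3 is an arbitrary assumption.

```python
import math
import statistics


def flag_outliers(points, factor=3.0):
    """Return indices of points unusually far from the median center."""
    # Coordinate-wise median is less affected by outliers than the mean.
    center = tuple(statistics.median(c) for c in zip(*points))
    distances = [math.dist(p, center) for p in points]
    # Flag points whose distance exceeds a multiple of the typical distance.
    cutoff = factor * statistics.median(distances)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Five similar hypothetical objects and one that differs strongly.
objects = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
    (1.0, 1.2), (1.2, 1.0), (10.0, 10.0),
]
rare = flag_outliers(objects)
```

Only the last object is flagged here; the listed methods refine this idea by modeling the structure of the data more carefully before measuring deviation from it.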

Time-series analysis
Time-series analysis helps in analyzing trends and predicting outputs over time. It is used for trend prediction and novelty detection (detection of previously unknown data). The approaches used here are:


 * Artificial neural network (ANN)
 * Support vector regression (SVR)
 * Decision tree
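
Regression models such as those listed above are commonly applied to time series by first converting the series into input/target pairs with a sliding window. The following minimal sketch shows only that conversion step, under the assumption of evenly spaced samples; the brightness values and the name `make_lagged_pairs` are made up for illustration.

```python
def make_lagged_pairs(series, window=3):
    """Turn a series into (previous `window` values, next value) pairs."""
    pairs = []
    for t in range(window, len(series)):
        # The inputs are the `window` samples immediately before time t.
        pairs.append((series[t - window:t], series[t]))
    return pairs


# Hypothetical brightness measurements of one object over time.
light_curve = [10.0, 10.2, 10.1, 9.8, 9.9, 10.3]
pairs = make_lagged_pairs(light_curve, window=3)
```

Each pair can then serve as one training example for a regressor, with the window of past values as features and the following value as the target.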