List of text mining methods

Text mining is the process of extracting information from unstructured text and finding patterns or relations in it. Different text mining methods are used depending on their suitability for a data set. Below is a list of text mining methodologies.


 * Centroid-based Clustering: Unsupervised learning method. Each cluster is represented by a central point (centroid), and data points are assigned to the cluster with the nearest centroid.
 * Fast Global KMeans: A variant designed to accelerate Global KMeans.
 * Global KMeans: An algorithm that begins with one cluster and then divides into more clusters until the required number is reached.
 * KMeans: An algorithm that requires two inputs: K (the number of clusters) and a set of data.
 * FW-KMeans: Used with the vector space model. Applies feature weighting to reduce noise.
 * Two-Level-KMeans: The regular KMeans algorithm runs first. Clusters are then selected for subdivision into subclasses if they do not reach the threshold.
 * Cluster Algorithm
 * Hierarchical Clustering
 * Agglomerative Clustering: Bottom-up approach. Each data point starts as its own small cluster, and clusters are then merged to form larger clusters.
 * Divisive Clustering: Top-down approach. Large clusters are split into smaller clusters.
 * Density-based Clustering: A structure is determined by the density of data points.
 * DBSCAN
 * Distribution-based Clustering: Clusters are formed based on mathematical methods from data.
 * Expectation-maximization algorithm
 * Collocation
 * Stemming Algorithm
 * Truncating Methods: Removing the suffix or prefix of a word.
 * Lovins Stemmer: Removes longest suffix.
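 * Porter's Stemmer: Strips suffixes by applying a sequence of rules in successive steps. The related Snowball framework allows programmers to define stemming rules based on their own criteria.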
 * Statistical Methods: Statistical procedure is involved and typically results in affixes being removed.
 * N-Gram Stemmer: Uses n-grams, sets of 'n' consecutive characters taken from a word.
 * Hidden Markov Model (HMM) Stemmer: Transitions between states are based on probability functions.
 * Yet Another Suffix Stripper (YASS) Stemmer: Hierarchical approach to creating clusters. Each cluster is then treated as a class of elements, and its centroid is the stem.
 * Inflectional & Derivational Methods
 * Krovetz Stemmer: Changes words to word stems that are valid English words.
 * Xerox Stemmer: Removes prefixes.
 * Term Frequency
 * Term Frequency Inverse Document Frequency
 * Topic Modeling
 * Latent Semantic Analysis (LSA)
 * Latent Dirichlet Allocation (LDA)
 * Non-Negative Matrix Factorization (NMF)
 * Bidirectional Encoder Representations from Transformers (BERT)
 * Wordscores: First estimates scores for word types based on a reference text. Then applies the word scores to a non-reference text to get a document score. Lastly, the scores of the non-reference documents are rescaled so that they can be compared with the reference text.
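
Three of the simpler entries in the list (character n-grams, term frequency, and term frequency inverse document frequency) can be illustrated with a minimal Python sketch. The function names, the sample documents, and the choice of idf = log(N / df) are assumptions made for illustration; other weighting variants are in common use, and this is not a reference implementation.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the consecutive n-character substrings of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def term_frequency(tokens):
    """Relative frequency of each term within one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def tf_idf(documents):
    """TF-IDF weights per document, assuming idf = log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = term_frequency(doc)
        weights.append(
            {term: freq * math.log(n_docs / doc_freq[term]) for term, freq in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "text mining finds patterns in text".split(),
    "clustering groups similar text".split(),
    "stemming reduces text to stems".split(),
]

print(char_ngrams("mining", 2))
print(term_frequency(docs[0])["text"])
print(tf_idf(docs)[0])
```

In this toy corpus the term "text" appears in every document, so its TF-IDF weight is zero, while a term such as "mining" that appears in only one document keeps a positive weight.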