User:Linaist/LDA

Research Report: Topic: Probabilistic Modeling

This week:

This week, I studied one of the generative probabilistic models, Latent Dirichlet Allocation (LDA), which is a three-level hierarchical Bayesian model, along with its variations in recent years. I also studied LDA-based ad-hoc retrieval. Besides that, I briefly scanned introductions to graphical models and Bayesian networks.

Generative probabilistic models are widely investigated and used in the field of text mining. Among these models, LDA and its variations attract the most interest from researchers. In general, LDA is a three-level hierarchical Bayesian model in which each item in a collection is modeled as a finite mixture over latent topics, and each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. Although LDA is promising in terms of finding latent topics in text corpora, it suffers from several problems: the bag-of-words assumption can make the model generate meaningless results; the directed graphical model can be difficult to manipulate when adding heterogeneous textual data; and exact posterior inference over the hidden topics is intractable.
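To make the three-level structure concrete, the following is a minimal sketch of the LDA generative process itself (sampling a toy corpus, not performing inference). All sizes and hyperparameter values here are illustrative choices, not from any of the papers discussed:

```python
import numpy as np

def sample_lda_corpus(n_docs=5, n_topics=3, vocab_size=20, doc_len=50,
                      alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process.

    For each topic k: draw a word distribution phi_k ~ Dirichlet(beta).
    For each document: draw a topic mixture theta ~ Dirichlet(alpha);
    for each word position: draw a topic z ~ Multinomial(theta),
    then a word w ~ Multinomial(phi_z).
    """
    rng = np.random.default_rng(seed)
    # One word distribution per topic (the "topics").
    phi = rng.dirichlet([beta] * vocab_size, size=n_topics)
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * n_topics)        # per-document topic mixture
        z = rng.choice(n_topics, size=doc_len, p=theta)  # latent topic per word
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in z]
        corpus.append(words)
    return corpus, phi
```

Note how the word draw depends only on the per-position topic, never on neighboring words; that independence is exactly the bag-of-words assumption criticized above.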

To address the above three problems, researchers are exploring mathematical methods and other models. To make the results generated by LDA more meaningful, one attempt is to combine syntactic and semantic generative models: the composite model uses an HMM to capture syntax and LDA to capture topics. Experimental results shed some light on eliminating function words in the process of detecting topics; however, the performance is similar to LDA with a stop list, and the latter method is easier to apply. Other researchers are exploring probabilistic language models in connection with LDA. Wallach first suggested using bigrams instead of unigrams in the third layer of LDA. Griffiths extended that work by adding a variable between bigrams, which is more natural for language processing. Wang extended Griffiths' work by improving the expression of that additional variable. Although Wang claims their algorithm can be taken as using n-grams, it is only valid under a first-order Markov model. Beyond the debate between bigram and n-gram models, Rosenfeld reviewed different statistical language modeling approaches and suggested that decision-tree models have lower complexity than n-gram models, at the cost of worse performance. Other explorations include using undirected graphical models instead of directed graphical models in the context of heterogeneous textual data.
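The first-order Markov assumption at issue in the bigram/n-gram debate can be illustrated with a minimal maximum-likelihood bigram model (a generic sketch, not the specific estimator used by Wallach, Griffiths, or Wang):

```python
from collections import Counter

def bigram_probs(tokens):
    """Maximum-likelihood bigram estimates:
    P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).

    Each word is conditioned only on the single preceding word, which is
    exactly the first-order Markov assumption: anything further back in
    the history is ignored.
    """
    unigram = Counter(tokens[:-1])                      # counts of context words
    bigram = Counter(zip(tokens[:-1], tokens[1:]))      # counts of adjacent pairs
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}
```

For example, in the sequence "the cat sat on the mat", the context "the" is followed once by "cat" and once by "mat", so each continuation gets probability 0.5.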

I also studied ad-hoc retrieval for LDA-based models. The basic idea is the query likelihood model, in which each document is scored by the likelihood that its language model generates the query Q. LDA-based retrieval extends that idea by adapting different probabilistic variables to the query likelihood model. Because LDA reduces the effective vocabulary significantly compared with traditional IR models, the data become sparser, or coarser. I read "A Study of Smoothing Methods for Language Models Applied to Information Retrieval" by Zhai, which reviews smoothing methods for probabilistic models such as LDA. LDA-based retrieval results are generally poor; one possible reason, as stated above, is that LDA's output is sparse. These studies all use standard test collections such as TREC for evaluation. We know that LDA is promising for its capacity to find latent topics. In a large heterogeneous text corpus, we can imagine that LDA's output would be very sparse; in contrast, in a specialized domain, I assume the output would be relatively denser and would thus benefit more from LDA models.
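The query likelihood model with one of the smoothing methods from Zhai's study (Dirichlet prior smoothing) can be sketched as follows; the value mu=2000 is a commonly cited default, and the scoring is over plain word counts rather than any LDA-derived representation:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000):
    """Score a document by its smoothed query log-likelihood, using
    Dirichlet prior smoothing:

        p(w | d) = (c(w, d) + mu * p(w | C)) / (|d| + mu)

    where p(w | C) is the background collection probability. Smoothing
    keeps p(w | d) nonzero for query words absent from the document,
    which matters all the more when the data are sparse.
    """
    d = Counter(doc)
    c = Counter(collection)
    n_c = len(collection)
    score = 0.0
    for w in query:
        p_wc = c[w] / n_c                              # collection probability
        p_wd = (d[w] + mu * p_wc) / (len(doc) + mu)    # smoothed document probability
        score += math.log(p_wd)                        # sum of logs avoids underflow
    return score
```

Documents are then ranked by this score; a document containing the query word should outscore one that does not, with the gap controlled by mu.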

Next Week:

Next week, my focus will be on the following three directions: 1. Further study n-gram LDA; try to understand every mathematical transformation in the articles and list their assumptions and simplifications. Look for gaps in the model that have not been developed. 2. Further study LDA-based retrieval. Look for mathematical methods for improvement. 3. Consider adapting LDA to ChemXseer. In particular, consider the attributes that reside exclusively in chemistry papers, consider how those attributes could be combined with an undirected or a directed graphical model, and try to estimate the performance.

Paper to Read:

Bingjun Sun, Prasenjit Mitra, C. Lee Giles, "Mining, Indexing, and Searching for Textual Chemical Molecule Information on the Web"