Word2vec

Approach
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus, and are therefore semantically and syntactically similar, are located close to one another in the space; more dissimilar words are located farther apart.

Word2vec can utilize either of two model architectures to produce these distributed representations of words: continuous bag-of-words (CBOW) or continuous skip-gram. In both architectures, word2vec iterates over the entire corpus, considering each individual word together with a context window of words surrounding it. In the continuous bag-of-words architecture, the model predicts the current word from the window of surrounding context words; the order of the context words does not influence prediction (the bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words, weighing nearby context words more heavily than more distant ones. According to the authors' note, CBOW is faster to train, while skip-gram does a better job for infrequent words.
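As a minimal sketch, both architectures can be selected through the Gensim Python library's Word2Vec class via its sg flag (Gensim is an assumption here, one implementation among several; the toy corpus and parameter values are likewise illustrative):

```python
from gensim.models import Word2Vec

# A toy corpus of tokenized sentences, for illustration only.
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "barks", "at", "the", "fox"],
]

# sg=0 selects the CBOW architecture: predict the current word from
# the window of surrounding context words.
cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

# sg=1 selects the skip-gram architecture: predict the surrounding
# context words from the current word.
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# Words that appear in similar contexts end up with nearby vectors.
print(skipgram.wv.most_similar("fox", topn=3))
```

Apart from the sg flag, the two architectures share the same training machinery, corpus iteration, and resulting vector space.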

Extensions
A variety of extensions follow the framework established by word2vec. One, called doc2vec, generates fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. doc2vec has been implemented in the C, Python[9][10] and Java/Scala[11] tools (see below), with the Java and Python versions also supporting inference of document embeddings on new, unseen documents.

doc2vec estimates distributed representations of documents much like how word2vec estimates representations of words. doc2vec utilizes either of two model architectures, both of which are analogues of the architectures used in word2vec. The first, Distributed Memory Model of Paragraph Vectors (PV-DM), is identical to CBOW except that it also provides a unique document identifier as an additional piece of context. The second architecture, Distributed Bag of Words version of Paragraph Vector (PV-DBOW), is identical to the skip-gram model except that it attempts to predict the window of surrounding context words from the paragraph identifier instead of the current word.
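As a sketch, both architectures can be selected through Gensim's Doc2Vec class via its dm flag (the library choice, toy documents, and parameter values are assumptions for the example):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy documents; each carries a unique identifier in its tags.
docs = [
    TaggedDocument(words=["machine", "learning", "with", "text"], tags=[0]),
    TaggedDocument(words=["word", "embeddings", "for", "documents"], tags=[1]),
]

# dm=1 selects PV-DM: CBOW-style prediction of the current word, with
# the document identifier supplied as an additional piece of context.
pv_dm = Doc2Vec(docs, vector_size=50, window=3, min_count=1, dm=1)

# dm=0 selects PV-DBOW: skip-gram-style, except the window of context
# words is predicted from the paragraph identifier rather than the
# current word.
pv_dbow = Doc2Vec(docs, vector_size=50, window=3, min_count=1, dm=0)

# Inference of an embedding for a new, unseen document, as supported
# by the Python implementation.
new_vector = pv_dm.infer_vector(["a", "new", "unseen", "document"])
```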

Another extension of word2vec is top2vec, which leverages both document and word embeddings to estimate distributed representations of topics. top2vec takes document embeddings learned from a doc2vec model and reduces them to a lower-dimensional space using UMAP. Clusters of similar documents are then found using HDBSCAN. The centroid of the documents in each cluster is taken as that cluster's topic vector. Finally, top2vec searches the semantic space for word embeddings located near the topic vector; these nearby words allow users to discern what a topic actually represents.
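A rough sketch of this pipeline, assuming the umap-learn and hdbscan Python packages and document embeddings from an already-trained doc2vec model; the helper function and all parameter values are illustrative assumptions, not the Top2Vec library's actual code or defaults:

```python
import umap
import hdbscan

def topic_vectors(doc_vectors):
    """doc_vectors: (n_docs, dim) NumPy array of doc2vec document embeddings."""
    # Reduce the document embeddings to a lower dimension with UMAP.
    reduced = umap.UMAP(n_neighbors=15, n_components=5,
                        metric="cosine").fit_transform(doc_vectors)
    # Find clusters of similar documents with HDBSCAN; the label -1
    # marks noise documents that belong to no cluster.
    labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
    # The centroid of the documents in each cluster is taken as that
    # cluster's topic vector.
    return {label: doc_vectors[labels == label].mean(axis=0)
            for label in set(labels) if label != -1}
```

Given a topic vector, nearby word embeddings can then be retrieved, for example with Gensim's most_similar, which accepts a raw vector (doc2vec_model.wv.most_similar(positive=[topic_vec])), provided the word and document embeddings share the same space, as when PV-DBOW is trained with word vectors alongside (dbow_words=1).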