Dimensions of word similarity and the preservation of semantic and syntactic relationships
The word embedding approach is able to capture multiple degrees of similarity between words. Mikolov et al (2013a) found that semantic and syntactic patterns can be reproduced using vector arithmetic. Patterns such as "Man is to Woman as Brother is to Sister" can be generated through algebraic operations on the vector representations of these words: the vector representation of "Brother" minus "Man" plus "Woman" produces a result which is closest to the vector representation of "Sister" in the model. Such relationships can be generated for a range of semantic relations (such as country and capital) as well as syntactic relations (e.g. present tense and past tense).
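
This analogy operation can be reproduced with the gensim library's word2vec implementation. The following is a minimal sketch, assuming the publicly available "word2vec-google-news-300" pretrained vectors are used (downloading them requires network access on first run):

    import gensim.downloader as api

    # Load pretrained word2vec vectors as a KeyedVectors object
    model = api.load("word2vec-google-news-300")

    # vector("brother") - vector("man") + vector("woman") ~ vector("sister")
    result = model.most_similar(positive=["brother", "woman"],
                                negative=["man"], topn=1)
    print(result)  # expected nearest neighbour: "sister"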

Assessing the quality of a model
Mikolov et al (2013b) develop an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above. They compile a set of 8869 semantic relations and 10675 syntactic relations which they use as a benchmark to test the accuracy of a model. When assessing the quality of a vector model, a user may draw on this accuracy test, which is implemented in word2vec, or develop their own test set which is meaningful to the corpus used to train the model. This approach offers a more challenging test than simply arguing that the words most similar to a given test word are intuitively plausible.
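
In gensim, for example, this benchmark can be run with the evaluate_word_analogies method. A minimal sketch, assuming the "questions-words.txt" analogy file that ships with gensim's test data is used as the test set:

    import gensim.downloader as api
    from gensim.test.utils import datapath

    # Load pretrained vectors (any KeyedVectors instance will do)
    model = api.load("word2vec-google-news-300")

    # Score the model against the analogy question set distributed
    # with gensim as "questions-words.txt"
    score, sections = model.evaluate_word_analogies(
        datapath("questions-words.txt"))
    print(f"Overall analogy accuracy: {score:.2%}")

The per-section results also separate semantic from syntactic questions, so a model's accuracy can be inspected for each relation type individually.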

Parameters and model quality
The choice of model parameters and the size of the training corpus can greatly affect the quality of a word2vec model. Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or Skip-Gram), increasing the size of the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes at the cost of increased computational complexity and therefore increased model generation time. In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases[ref]. However, CBOW is less computationally expensive and yields similar accuracy results. Overall, accuracy increases as the amount of training data increases and as the number of dimensions increases. Mikolov et al report that doubling the amount of training data increases computational complexity by about as much as doubling the number of vector dimensions.
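
These parameters map directly onto the arguments of most word2vec implementations. A minimal sketch using gensim (version 4.x argument names), where the toy corpus and the specific values chosen are illustrative assumptions rather than recommendations:

    from gensim.models import Word2Vec

    # Toy corpus: a list of tokenised sentences (stands in for real data)
    sentences = [["the", "brother", "saw", "the", "sister"],
                 ["man", "and", "woman", "walked", "home"]]

    model = Word2Vec(
        sentences,
        sg=1,             # architecture: 1 = Skip-Gram, 0 = CBOW
        vector_size=300,  # number of vector dimensions
        window=5,         # window size of context words considered
        min_count=1,      # keep all words in this tiny example
        workers=4,        # parallel training threads
    )

    # Trained vectors are exposed via model.wv (a KeyedVectors object)
    print(model.wv["sister"].shape)  # -> (300,)

Raising vector_size or window, or training on a larger corpus, illustrates the accuracy-versus-training-time trade-off described above.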