User:OrenBochman/Stylometrics

Experiment 1: Known author verification tasks
Canonicizers
 * 1) Punctuation Separator
 * 2) Unify case
Event Drivers
 * 1) Stanford part-of-speech n-grams, using the english-left3words-distsim model (faster but 1% less accurate)
Event Culling
 * 1) None
Analysis Methods
 * 1) Linear SVM
 * 2) Gaussian SVM
 * 3) JW Cross Entropy
 * 4) WEKA J48 Decision Tree Classifier
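
The first two stages of the configuration above can be illustrated with a minimal Python sketch. It is not the tool's own code: the function names are hypothetical, the punctuation rule is an assumed approximation of a punctuation separator, and the tag list is a hand-written stand-in for the output of a part-of-speech tagger (no tagger is called here).

```python
import re
from collections import Counter


def separate_punctuation(text):
    """Assumed canonicizer: surround each punctuation mark with spaces."""
    return re.sub(r"([^\w\s])", r" \1 ", text)


def unify_case(text):
    """Assumed canonicizer: fold everything to lower case."""
    return text.lower()


def canonicize(text):
    """Apply both canonicizers and collapse runs of whitespace."""
    return " ".join(unify_case(separate_punctuation(text)).split())


def pos_ngrams(tags, n):
    """Return the sliding-window n-grams over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_ngram_histogram(tags, n):
    """Count how often each POS n-gram occurs (the 'events' of one text)."""
    return Counter(pos_ngrams(tags, n))


if __name__ == "__main__":
    print(canonicize("Hello, World!"))
    # Hand-written tags standing in for real tagger output.
    tags = ["DT", "NN", "VBZ", "DT", "NN"]
    print(pos_ngram_histogram(tags, 2).most_common())
```

The resulting histograms are the kind of per-text feature vector that the analysis methods (SVM, decision tree, and so on) would then compare across authors.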

Results
Three of the four analysis methods gave the correct attribution for both of the known authors.

Conclusions
The results indicate that, under the given parameters, POS n-grams processed by the four methods above provide a sound basis for a Bayesian analyzer that accurately estimates the author of the given texts.

Further work is needed to identify the point at which significant stylistic signatures emerge, and how the methods depend on corpus size, the amount of author data, and the dimension of the data (2-gram vs. 4-gram).
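
One way to start probing the dependence on n-gram dimension is to measure how thinly the same text is spread over the event space as n grows. The sketch below is only an illustration under assumptions: the function names are hypothetical and the tag sequence is invented rather than taken from the experiment.

```python
from collections import Counter


def ngram_counts(tags, n):
    """Count sliding-window n-grams over a sequence of POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def sparsity_profile(tags, sizes=(2, 3, 4)):
    """For each n, report distinct n-grams, total n-grams, and mean count."""
    profile = {}
    for n in sizes:
        counts = ngram_counts(tags, n)
        total = sum(counts.values())
        distinct = len(counts)
        profile[n] = {
            "distinct": distinct,
            "total": total,
            # A mean count near 1 means almost every n-gram is seen only once.
            "mean_count": total / distinct if distinct else 0.0,
        }
    return profile


if __name__ == "__main__":
    # Invented tag sequence, not data from the experiment.
    tags = ["DT", "NN", "VBZ", "DT", "NN", "IN", "DT", "NN"]
    for n, row in sparsity_profile(tags).items():
        print(n, row)
```

A mean count that falls toward 1 as n increases suggests that longer n-grams need more text per author before their frequencies become informative.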

Additional Bibliography

 * Abbasi, A., & Chen, H. (2005). Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems, 20(5), 67-75.
 * Argamon, S. (2008). Interpreting Burrows’ Delta: Geometric and probabilistic foundations. Literary and Linguistic Computing, 23(2), 131-147.
 * Argamon, S., & Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution. In Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing.
 * Argamon, S., Saric, M., & Stein, S. (2003). Style mining of electronic messages for multiple authorship discrimination: First results. In Proceedings of the 9th ACM SIGKDD (pp. 475-480).
 * Argamon, S., Whitelaw, C., Chase, P., Hota, S.R., Garg, N., & Levitan, S. (2007). Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822.
 * Argamon-Engelson, S., Koppel, M., & Avneri, G. (1998). Style-based text categorization: What newspaper am I reading? In Proceedings of AAAI Workshop on Learning for Text Categorization (pp. 1-4).
 * Baayen, R., van Halteren, H., Neijt, A., & Tweedie, F. (2002). An experiment in authorship attribution. In Proceedings of JADT 2002: Sixth International Conference on Textual Data Statistical Analysis (pp. 29-37).
 * Baayen, R., van Halteren, H., & Tweedie, F. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121–131.
 * Benedetto, D., Caglioti, E., & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.
 * Binongo, J. (2003). Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2), 9-17.
 * Brank, J., Grobelnik, M., Milic-Frayling, N., & Mladenic, D. (2002). Interaction of feature selection methods and linear classification models. In Proceedings of the ICML-02 Workshop on Text Learning.
 * Burrows, J.F. (1987). Word patterns and story shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, 2, 61-70.
 * Burrows, J.F. (1992). Not unless you ask nicely: The interpretative nexus between analysis and information. Literary and Linguistic Computing, 7(2), 91–109.
 * Burrows, J.F. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3), 267-287.
 * Can, F., & Patton, J.M. (2004). Change of writing style with time. Computers and the Humanities, 38, 61-82.
 * Chaski, C.E. (2001). Empirical evaluations of language-based author identification techniques. Forensic Linguistics, 8(1), 1-65.
 * Chaski, C.E. (2005). Who’s at the keyboard? Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4(1).
 * Cilibrasi, R., & Vitanyi, P.M.B. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545.
 * Clement, R., & Sharp, D. (2003). Ngram and Bayesian classification of documents for topic and authorship. Literary and Linguistic Computing, 18(4), 423-447.
 * Collins, J., Kaufer, D., Vlachos, P., Butler, B., & Ishizaki, S. (2004). Detecting collaborations in text: Comparing the authors’ rhetorical language choices in the Federalist Papers. Computers and the Humanities, 38, 15-36.
 * Coyotl-Morales, R.M., Villaseñor-Pineda, L., Montes-y-Gómez, M., & Rosso, P. (2006). Authorship attribution using word sequences. In Proceedings of the 11th Iberoamerican Congress on Pattern Recognition (pp. 844-853) Springer.
 * Deerwester, S., Dumais, S., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
 * Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1/2), 109-123.
 * Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.
 * Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289-1305.
 * Forsyth, R., & Holmes, D. (1996). Feature-finding for text classification. Literary and Linguistic Computing, 11(4), 163-174.
 * Frantzeskou, G., Stamatatos, E., Gritzalis, S., & Katsikas, S. (2006). Effective identification of source code authors using byte-level information. In Proceedings of the 28th International Conference on Software Engineering (pp. 893-896).
 * Gamon, M. (2004). Linguistic correlates of style: Authorship classification with deep linguistic analysis features. In Proceedings of the 20th International Conference on Computational Linguistics (pp. 611-617).
 * Goodman, J. (2002). Extended comment on language trees and zipping. http://arxiv.org/abs/condmat/0202383.
 * Graham, N., Hirst, G., & Marthi, B. (2005). Segmenting documents by stylistic character. Journal of Natural Language Engineering, 11(4), 397-415.
 * Grant, T.D. (2007). Quantifying evidence for forensic authorship analysis. International Journal of Speech Language and the Law, 14(1), 1-25.
 * Grieve, J. (2007). Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3), 251-270.
 * Halliday, M.A.K. (1994). Introduction to functional grammar (2nd ed.). London: Arnold.
 * van Halteren, H. (2007). Author verification by linguistic profiling: An exploration of the parameter space. ACM Transactions on Speech and Language Processing, 4(1), 1-17.
 * Holmes, D.I. (1994). Authorship attribution. Computers and the Humanities, 28, 87–106.
 * Holmes, D.I. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3), 111-117.
 * Holmes, D.I., & Forsyth, R. (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2), 111-127.
 * Holmes, D.I., & Tweedie, F. J. (1995). Forensic stylometry: A review of the cusum controversy. In Revue Informatique et Statistique dans les Sciences Humaines. University of Liege (pp. 19-47).
 * Honore, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2), 172–177.
 * Hoover, D. (2004a). Testing Burrows’ Delta. Literary and Linguistic Computing, 19(4), 453-475.
 * Hoover, D. (2004b). Delta prime? Literary and Linguistic Computing, 19(4), 477-495.
 * Houvardas, J., & Stamatatos, E. (2006). N-gram feature selection for authorship identification. In Proceedings of the 12th International Conference on Artificial Intelligence: Methodology, Systems, Applications (pp. 77-86) Springer.
 * Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (pp. 137-142).
 * Juola, P. (2004). Ad-hoc authorship attribution competition. In Proceedings of the Joint Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (pp. 175-176).
 * Juola, P. (2006). Authorship attribution for electronic documents. In M. Olivier and S. Shenoi (eds.) Advances in Digital Forensics II (pp. 119-130) Springer.
 * Juola, P. (2007). Future trends in authorship attribution. In P. Craiger & S. Shenoi (eds.) Advances in Digital Forensics III (pp. 119-132) Springer.
 * Juola, P., & Baayen, R. (2005). A controlled-corpus experiment in authorship attribution by cross-entropy. Literary and Linguistic Computing, 20, 59-67.
 * Karlgren, J., & Eriksson, G. (2007). Authors, genre, and linguistic convention. In Proceedings of the SIGIR Workshop on Plagiarism Analysis, Authorship Attribution, and Near-Duplicate Detection (pp. 23-28).
 * Keselj, V., Peng, F., Cercone, N., & Thomas, C. (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the Pacific Association for Computational Linguistics (pp. 255-264).
 * Khmelev, D.V., & Teahan, W.J. (2003a). A repetition based measure for verification of text collections and for text categorization. In Proceedings of the 26th ACM SIGIR (pp. 104–110).
 * Khmelev, D.V., & Teahan, W.J. (2003b). Comment: “Language trees and zipping”. Physical Review Letters, 90, 089803.
 * Khosmood, F., & Levinson, R. (2006). Toward unification of source attribution processes and techniques. In Proceedings of the Fifth International Conference on Machine Learning and Cybernetics (pp. 4551-4556).
 * Kjell, B. (1994). Discrimination of authorship using visualization. Information Processing and Management, 30(1), 141-150.
 * Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273-324.
 * Koppel, M., Akiva, N., & Dagan, I. (2006). Feature instability as a criterion for selecting potential style markers. Journal of the American Society for Information Science and Technology, 57(11), 1519–1525.
 * Koppel, M., Argamon, S., & Shimoni, A.R. (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4), 401-412.
 * Koppel, M., & Schler, J. (2003). Exploiting stylistic idiosyncrasies for authorship attribution. In Proceedings of IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis (pp. 69-72).
 * Koppel, M., & Schler, J. (2004). Authorship verification as a one-class classification problem. In Proceedings of the 21st International Conference on Machine Learning.
 * Koppel, M., Schler, J., Argamon, S., & Messeri, E. (2006). Authorship attribution with thousands of candidate authors. In Proceedings of the 29th ACM SIGIR (pp. 659-660).
 * Koppel, M., Schler, J., & Bonchek-Dokow, E. (2007). Measuring differentiability: Unmasking pseudonymous authors. Journal of Machine Learning Research, 8, 1261-1276.
 * Kukushkina, O.V., Polikarpov, A.A., & Khmelev, D.V. (2001). Using literal and grammatical statistics for authorship attribution. Problems of Information Transmission, 37(2), 172-184.
 * Li, J., Zheng, R., & Chen, H. (2006). From fingerprint to writeprint. Communications of the ACM, 49(4), 76–82.
 * Luyckx, K., & Daelemans, W. (2005). Shallow text analysis and machine learning for authorship attribution. In Proceedings of the Fifteenth Meeting of Computational Linguistics in the Netherlands.
 * Madigan, D., Genkin, A., Lewis, D., Argamon, S., Fradkin, D., & Ye, L. (2005). Author identification on the large scale. In Proceedings of CSNA-05.
 * Marton, Y., Wu, N., & Hellerstein, L. (2005). On compression-based text classification. In Proceedings of the European Conference on Information Retrieval (pp. 300–314) Springer.
 * Matthews, R., & Merriam, T. (1993). Neural computation in stylometry: An application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing, 8(4), 203-209.
 * Matsuura, T., & Kanada, Y. (2000). Extraction of authors’ characteristics from Japanese modern sentences via n-gram distribution. In Proceedings of the 3rd International Conference on Discovery Science (pp. 315-319) Springer.
 * McCarthy, P.M., Lewis, G.A., Dufty, D.F., & McNamara, D.S. (2006). Analyzing writing styles with Coh-Metrix. In Proceedings of the Florida Artificial Intelligence Research Society International Conference (pp. 764-769).
 * Mendenhall, T. C. (1887). The characteristic curves of composition. Science, IX, 237–49.
 * Merriam, T., & Matthews, R. (1994). Neural computation in stylometry II: An application to the works of Shakespeare and Marlowe. Literary and Linguistic Computing, 9(1), 1-6.
 * Meyer zu Eissen, S., Stein, B., & Kulig, M. (2007). Plagiarism detection without reference collections. Advances in Data Analysis (pp. 359-366) Springer.
 * Mikros, G., & Argyri, E. (2007). Investigating topic influence in authorship attribution. In Proceedings of the International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (pp. 29-35).
 * Mitchell, T. (1997). Machine Learning. McGraw-Hill.
 * Morton, A.Q., & Michaelson, S. (1990). The qsum plot. Technical Report CSR-3-90, University of Edinburgh.
 * Mosteller, F., & Wallace, D.L. (1964). Inference and disputed authorship: The Federalist. Addison-Wesley.
 * Peng, F., Schuurmans, D., Keselj, V., & Wang, S. (2003). Language independent authorship attribution using character level language models. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (pp. 267-274).
 * Peng, F., Schuurmans, D., & Wang, S. (2004). Augmenting naive Bayes classifiers with statistical language models. Information Retrieval Journal, 7(1), 317-345.
 * Rudman, J. (1998). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351-365.
 * Sanderson, C., & Guenter, S. (2006). Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the International Conference on Empirical Methods in Natural Language Engineering (pp. 482-491).
 * Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1).
 * Stamatatos, E. (2006a). Authorship attribution based on feature set subspacing ensembles. International Journal on Artificial Intelligence Tools, 15(5), 823-838.
 * Stamatatos, E. (2006b). Ensemble-based author identification using character n-grams. In Proceedings of the 3rd International Workshop on Text-based Information Retrieval (pp. 41-46).
 * Stamatatos, E. (2007). Author identification using imbalanced and limited training texts. In Proceedings of the 4th International Workshop on Text-based Information Retrieval (pp. 237-241).
 * Stamatatos, E. (2008). Author identification: Using text sampling to handle the class imbalance problem. Information Processing and Management, 44(2), 790-799.
 * Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2000). Automatic text categorization in terms of genre and author. Computational Linguistics, 26(4), 471–495.
 * Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35(2), 193-214.
 * Stein, B., & Meyer zu Eissen, S. (2007). Intrinsic plagiarism analysis with meta learning. In Proceedings of the SIGIR Workshop on Plagiarism Analysis, Authorship Attribution, and Near-Duplicate Detection (pp. 45-50).
 * Tambouratzis, G., Markantonatou, S., Hairetakis, N., Vassiliou, M., Carayannis, G., & Tambouratzis, D. (2004). Discriminating the registers and styles in the Modern Greek language – Part 2: Extending the feature vector to optimize author discrimination. Literary and Linguistic Computing, 19(2), 221-242.
 * Teahan, W., & Harper, D. (2003). Using compression-based language models for text categorization. In W.B. Croft & J. Lafferty (eds) Language Modeling and Information Retrieval, 141–165.
 * Teng, G., Lai, M., Ma, J., & Li, Y. (2004). E-mail authorship mining based on SVM for computer forensic. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2 (pp. 1204-1207).
 * Tweedie, F., & Baayen, R. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5), 323–352.
 * Tweedie, F., Singh, S., & Holmes, D. (1996). Neural network applications in stylometry: The Federalist Papers. Computers and the Humanities, 30(1), 1-10.
 * Uzuner, O., & Katz, B. (2005). A comparative study of language models for book and author recognition. In Proceedings of the 2nd International Joint Conference on Natural Language Processing (pp. 969-980) Springer.
 * de Vel, O., Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. SIGMOD Record, 30(4), 55-64.
 * Yule, G.U. (1938). On sentence-length as a statistical characteristic of style in prose, with application to two cases of disputed authorship. Biometrika, 30, 363-390.
 * Yule, G.U. (1944). The statistical study of literary vocabulary. Cambridge University Press.
 * Zhang, D., & Lee, W.S. (2006). Extracting key-substring-group features for text classification. In Proceedings of the 12th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 474-483).
 * Zhao, Y., & Zobel, J. (2005). Effective and scalable authorship attribution using function words. In Proceedings of the 2nd Asia Information Retrieval Symposium.
 * Zhao, Y., & Zobel, J. (2007). Searching with style: Authorship attribution in classic literature. In Proceedings of the Thirtieth Australasian Computer Science Conference (pp. 59-68) ACM Press.
 * Zheng, R., Li, J., Chen, H., & Huang, Z. (2006). A framework for authorship identification of online messages: Writing style features and classification techniques. Journal of the American Society of Information Science and Technology, 57(3), 378-393.
 * Zipf, G.K. (1932). Selected studies of the principle of relative frequency in language. Harvard University Press, Cambridge, MA.