
1	Introduction

Academic publications can be considered products in the “knowledge market”. Researchers in the knowledge market are consumers of knowledge when they read papers, and become producers of knowledge when they publish papers. After a paper is published, it is publicly viewed by researchers, and its citations accumulate as researchers cite it in subsequently published papers. Corresponding to the total sales of products in other markets, a paper’s cumulative citation count can be viewed as the total sales of the paper in the knowledge market. However, compared with consumption in other markets, there is a dearth of research on consumption in the knowledge market, even though an increasing share of economic growth comes from the technological improvements and breakthroughs that are often reported in academic research. Price (1965) and Redner (1998) found that the numbers of citations received by academic papers were not evenly distributed, and that the distribution of paper citations followed a power law. Additionally, as this paper will show, the 10-year citation distribution of papers in top economics journals is also highly right-skewed, and the upper tail of the distribution is well approximated by a power law. However, this paper studies a deeper question: what are the drivers of paper citations? Previous studies have found that the ultimate impact of a paper is associated with its early citation history (Wang et al., 2013), and that citations received per year change over the “life cycle” of academic papers (Hamermesh, 2015; Ke et al., 2015; Anauati et al., 2016).
In addition, empirical evidence suggests that citation counts received by academic papers are associated with research field (Griffith et al., 2009; Card and DellaVigna, 2013; Linnemer and Visser, 2016; Angrist et al., 2017), the novelty and conventionality of paper topics (Uzzi et al., 2013), a “poetic” title (Mialon, 2010), position in a journal (Coupé et al., 2010), and the profile of authors (Smart and Waldfogel, 1996; Mialon, 2010; Card and DellaVigna, 2017). However, previous studies only use low dimensional features of papers and authors to explain the discrepancy in paper citations. This paper studies whether higher dimensional features of papers and authors can help expand our understanding of the drivers of paper citations. In addition to the studies on paper citations, Card and DellaVigna (2013) and Angrist et al. (2017) determined that the fields and style of economics papers evolve over time, and Ellison (2002) proposed a model of paper quality that explained the change in the presentation style of papers. Since paper citations reflect the response of academicians to published papers, investigating the drivers of citations may help us understand the evolution of economic research. Paper citations are not only useful for measuring scientific impact but also potentially useful in various decision-making processes in academia, especially when available resources are constrained. For instance, they could be used to help editors pre-screen or select publishable papers from large numbers of submissions, to assist reviewing committees in allocating scarce resources among competing projects in research funding applications, and to help universities and colleges make tenure and promotion decisions. Previous studies have developed citation indices to compare the productivity of researchers (Hirsch, 2005; Ellison, 2013; Perry and Reny, 2016).
Empirical studies have found that paper citations (or author citations) are associated with decision making in academia, including referee recommendations and editor decisions (Card and DellaVigna, 2017), National Science Foundation (NSF) review scores (Li and Agha, 2015), promotions at universities (Tuckman and Leahey, 1975; Hansen et al., 1978; Hilmer et al., 2015), and elections of fellows of the Econometric Society (Hamermesh and Schmidt, 2003; Hamermesh and Pfann, 2012). Hamermesh (2015) provides a survey on the use of citations in economics. The potential usefulness of paper citations raises the following questions: 1. Is it possible to predict citations of individual papers with the information available as of the year of publication? And 2. Can we use these predictions to help improve academic decision making, such as editorial decisions and promotions? This paper investigates the drivers of paper citations and uses machine learning methods to predict paper citations with the information available as of the year of publication. The research matches yearly citation data, full texts, and yearly author data of 4,482 papers published during 1990-2011 in the top 5 economics journals – The American Economic Review (AER), Econometrica (ECMA), Journal of Political Economy (JPE), The Quarterly Journal of Economics (QJE), and the Review of Economic Studies (RES) – based on bibliographic information of papers. The following facts can be seen from the data: 1. The 10-year citation distribution is highly right-skewed, and the upper tail of the distribution is well approximated by a power law. 2. The slopes of paper citation paths differ considerably from each other. 3. Highly cited papers differ widely in research fields, topic words, and author information.
Given the noticeable differences in paper citation paths and the diversity of features of highly cited papers, higher dimensional measurement of features of papers and authors may be useful for explaining the variation of paper citation paths. To measure features of papers and authors, I use dictionary-based textual analysis to parse unstructured paper data and author data. For each paper, I parse its full text to construct high dimensional vectors that measure its research fields, topic words, presentation style, and journal information. For authors of each paper, I parse their publication lists to construct high dimensional vectors of variables measuring their publication records, cumulative citations, and collaboration networks. The dictionary-based textual analysis used in this paper is consistent with other economic studies that use dictionary-based textual analysis, including the measurement of investor sentiment (Tetlock, 2007), media slant (Gentzkow and Shapiro, 2010), tone in financial text (Loughran and McDonald, 2011), and economic policy uncertainty (Baker et al., 2016). Gentzkow et al. (2017) provide a survey of economic research using text as data. The results of the textual analysis show the following facts: 1. Papers in ECMA and RES on average have higher “Mathematical and Quantitative Methods” and “Microeconomics” intensity and cover more topics in “Mathematical and Quantitative Methods” and “Microeconomics” than the other three journals. 2. Papers in QJE have the highest average coverage of popular topics, while papers in ECMA have the highest average paper complexity. 3. Average author cumulative citations of papers in AER and QJE are relatively higher than those of papers in the other three journals as of the year of publication. 4. Authors of papers in QJE have the strongest average collaboration network as of the year of publication. 5. A bigger proportion of authors in ECMA and RES are from institutions in the lower quantiles of rankings of economic research.

To investigate the paper effect and author effect on a paper’s long-term scientific impact, I estimated the coefficients of the measures of paper and author information as of the year of publication on the 10-year citations. The results show that a “QJE effect” exists for paper 10-year citations, meaning that papers in QJE get higher citations even after controlling for various paper and author information, though the QJE effect decreases after more control variables are added. One potential explanation for the QJE effect might be that QJE performed better at advertising its publications. It is also possible that editors of QJE preferred papers that would be highly cited, while editors of the other journals did not have strong preferences for highly cited papers. Another cause of the QJE effect could be differences in the pools of submitted manuscripts. However, the QJE effect becomes much less important in prediction models with many variables. For the prediction model with the smallest out-of-sample Mean Squared Error, adding the journal ID variable only marginally improves the prediction performance. The estimation results confirm the importance of a paper’s research field in determining paper citations, and papers with higher 10-year citation counts are associated with higher “Macro, Monetary Econ” intensity and lower “Micro” intensity. The results also show that higher 10-year citation counts are associated with higher popular topic coverage and lower paper complexity. In addition, papers with higher 10-year citation counts are associated with the appearance of some topic words and word pairs (e.g., GDP, correl, bank, school, (‘capital’, ‘share’), (‘product’, ‘develop’, ‘growth’)).
Within the variables measuring author information, the number of authors, the number of authors’ co-authored publications, and the total citations and numbers of top 5 publications of authors’ co-authors are positively correlated with the 10-year citation counts, while the numbers of authors’ top field publications and the numbers of top field publications of authors’ co-authors have negative coefficients. However, the coefficients of author experience and the numbers of authors’ publications are not significant in any of these regressions. To investigate the drivers of paper citation paths, I used a quadratic function to estimate the effects of variables measuring paper information and time-varying author information on papers’ cumulative citations. The results showed that a steeper slope of the paper citation path is associated with higher “Math, Quant Methods”, “Econ Development, Growth”, and “Econ Systems” intensity, popular topic coverage, number of pages, number of authors, and author cumulative citations. The analysis of paper citation paths reveals substantial heterogeneity among journals. Notably, papers in QJE have extremely large positive coefficients on “Econ Development, Growth” and “Econ Systems” intensity, papers in AER and QJE are the most negatively affected by higher paper complexity, and papers in JPE benefit the most from larger teams of highly cited authors. In addition, the heterogeneity among journals also appears in the adjusted R-squared. The regression for JPE gives the largest adjusted R-squared of 0.54, while the regression for RES gives the smallest adjusted R-squared of 0.16. However, the coefficients seem to have no clear trend across author groups. The estimation results could help deepen our understanding of the drivers of paper citations, while the low adjusted R-squared (less than 0.6 in all of these regressions) shows that simple regression models might be inadequate to model the variation of paper citation counts.
To better model the variation of the citation counts and predict out-of-sample, I turn to state-of-the-art machine learning methods that can use more covariates and higher-order interactions. The machine learning methods face two challenges: 1. Constructing a map from the measured features of papers and authors to paper citations. 2. Assigning a weight to each variable. Recent studies investigate the use of machine learning (including Ordinary Least Squares) in predicting human decisions and improving the performance of decision-making, including predicting at-risk youth (Chandler et al., 2011), hiring and promoting workers (Hoffman et al., 2015; Chalfin et al., 2016), and improving judge decisions (Kleinberg et al., 2017). Einav and Levin (2014), Varian (2014), and Mullainathan and Spiess (2017) provide surveys on the use of big data and machine learning in economic research. Compared to the other studies which use machine learning methods to predict human decisions, predicting paper citations is challenging due to the difficulty of measuring and assigning appropriate weights to a large number of features of papers and authors. I compare a variety of state-of-the-art machine learning methods, including regression shrinkage models (Lasso, Post-Lasso, Ridge, and Elastic Net) in Zou and Hastie (2005) and Belloni et al. (2012), Neural Network (Bishop, 1995; Abadi et al., 2016), Random Forest (Breiman, 2001), and Gradient Boosted Trees (Friedman, 2001, 2002), on their ability to predict papers’ 10-year citation deciles using the information available as of the year of publication. Based on my evaluation of these methods, I develop a hybrid method that combines variable construction by dictionary-based textual analysis, variable selection by regression shrinkage, and model fitting by gradient boosted trees for prediction with textual data.
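The two-stage shape of such a hybrid method (shrinkage for variable selection, boosted trees for model fitting) can be sketched as follows. This is a minimal illustration using scikit-learn on simulated data; the paper's actual feature set, target (citation deciles), and tuning are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))   # stand-in for textual-analysis predictors
# Only the first three columns truly matter in this simulation
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=500)

# Stage 1: regression shrinkage (cross-validated Lasso) selects a sparse predictor set
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)

# Stage 2: gradient boosted trees are fit on the selected predictors only
gbt = GradientBoostingRegressor(random_state=0).fit(X[:, selected], y)
```

Fitting the trees on the shrinkage-selected subset is what keeps the predictor count, and hence the data-collection cost for scoring a new paper, small.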
The Mean Squared Error (MSE) of the various prediction methods, except for Ordinary Least Squares (OLS), generally decreases after adding more predictors constructed by textual analysis. The Shrinkage-Gradient Boosted Trees Hybrid method proposed in this paper gives the smallest MSE in the 10-year citation out-of-sample prediction test, while only using a relatively small number of predictors compared to other machine learning methods. This property of the hybrid method significantly reduces the cost of data collection and computation when using it to predict the 10-year citation counts of a new paper. However, it seems hard for the prediction models to predict the citation counts of the most highly cited papers. Even the prediction model with the smallest MSE cannot predict the citation counts of the highest decile and the lowest decile well. In a test of applying these machine learning methods to the academic publishing process, the hybrid method predicts papers that are in the upper half of the citation distribution correctly 72.7% of the time and predicts papers that are in the lower half of the citation distribution correctly 76.7% of the time. In addition, within the papers predicted by the hybrid method to be “highly cited” (the top 30% of the distribution), 65.0% of them turn out to be “highly cited”, and only 4.7% of them turn out to be “lowly cited” (the bottom 30% of the distribution). Within the papers predicted to be “lowly cited”, 66.7% of them turn out to be “lowly cited”, and only 2.6% turn out to be “highly cited” 10 years after publication. Based on the prediction results, the hybrid method proposed in this paper may be helpful in identifying articles that will turn out to be lowly cited, enabling editors to reject a significant fraction of inappropriate submissions and thereby focus their scarce time on evaluating the more promising subset of submissions to their journals.
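Screening rates of the kind reported above (e.g., the share of predicted "highly cited" papers that turn out highly cited, versus the share that turn out lowly cited) can be computed as in the sketch below. The function name and the simulated data are illustrative only; the numbers it produces are not the paper's.

```python
import numpy as np

def screening_rates(actual, predicted, q_hi=0.7, q_lo=0.3):
    """Among papers predicted 'highly cited' (top 30% of predictions),
    return (share truly in the actual top 30%, share truly in the actual bottom 30%)."""
    hi_cut_a, lo_cut_a = np.quantile(actual, [q_hi, q_lo])
    hi_cut_p = np.quantile(predicted, q_hi)
    pred_hi = predicted >= hi_cut_p          # predicted "highly cited"
    true_hi = actual >= hi_cut_a             # actually "highly cited"
    true_lo = actual <= lo_cut_a             # actually "lowly cited"
    return true_hi[pred_hi].mean(), true_lo[pred_hi].mean()

rng = np.random.default_rng(1)
actual = rng.pareto(2.0, size=1000)                               # right-skewed citations
predicted = actual + rng.normal(scale=actual.std(), size=1000)    # noisy predictions
hit_rate, miss_rate = screening_rates(actual, predicted)
```

A useful screen is one where the first rate is much larger than the second, as with the 65.0% versus 4.7% figures reported above.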
In addition, its performance in identifying highly cited papers may be helpful in preventing the rejection of submissions that will turn out to be highly cited. Though the hybrid method cannot replace human referees and editors, it has the potential to be used as a first stage screening tool to more efficiently direct the scarce time of editors and referees. One concern about the hybrid prediction model might be that it discriminates against some types of authors or papers by assigning very large negative weights to a few features of authors or papers. However, since the preferred hybrid prediction model predicts paper citations using hundreds of features of authors and papers, and no feature has a dominant weight, it is not likely to severely discriminate against specific types of authors or papers. On the contrary, using the hybrid prediction model to help editors in editorial decision-making may attenuate possible discrimination in the academic publishing process, because it captures and “objectively” assigns weights to hundreds of features of papers and authors that human referees may ignore. This paper contributes to the existing literature in the following aspects. Firstly, it contributes to the research on the scientific impact of academic papers by investigating the factors that explain the variation of paper citations, as well as the factors that predict paper citations. The findings in this paper may not only deepen our understanding of the drivers of paper citations, but also contribute to the work on designing data-driven models to predict scientifically impactful work. Admittedly, the number of citations is not a perfect measure of a paper’s scientific impact. However, it is among the few quantitative indicators of a paper’s impact.
Secondly, the estimation and prediction strategy developed in this paper has the potential to be used to investigate the drivers of decision-making and to predict decision-making in other markets, where information in textual data is likely to be an important driver of decision-making. Thirdly, this paper makes extensive use of unstructured data collection techniques, textual analysis, and machine learning methods, which may offer new insights into the application of these techniques in other economic studies that use large-scale high dimensional data. In Section 2, data collection, textual analysis, and descriptive statistics are presented. Section 3 discusses the estimation and prediction strategy. Section 4 presents estimation results. Section 5 presents prediction results. Section 6 concludes.

2	Data Collection and Textual Analysis

2.1	Data collection

I collected and matched yearly citation data, full texts, and yearly author data of 4,482 papers in the top 5 economics journals. The main focus was on analyzing the data of papers in the top 5 economics journals for three reasons. First, the top 5 economics journals are top general interest economics journals, and the papers in these journals arguably represent the research fields and topics of economic research. Second, downloading a large number of full texts of papers from a broader range of journals from digital journal libraries may violate their terms and conditions. Third, it would take a considerable amount of time to collect and analyze a large amount of unstructured paper data and author data using the computing facilities available at the time of data collection and textual analysis. The data collection was conducted between January and August 2017. The details of data collection are documented in Appendix A.
2.1.1	Academic data

I collected academic data, including paper citation lists, paper information (journal of publication, publication date, title, abstract, and author name list), and author publication lists, from the Microsoft Academic (MA) database. An overview of the Microsoft Academic database is provided by Sinha et al. (2015). I used the MA database as the main source of academic data for this paper because of the abundance of its academic data and the efficiency of using the MA Application Programming Interface (API) to query and collect data from it. Due to the noisy nature of large-scale academic data, the raw academic data collected from the MA API might have measurement error in paper citations and author publications. To reduce measurement error in the academic data, I used preprocessing algorithms to check whether the retrieved author names matched the names on the paper, subtract duplicated citations, and remove mistakenly listed publications. The preprocessing algorithms are described in Appendix A.1.1 and Appendix B.2. Google Scholar (GS) is another online database of paper citation data. However, the built-in barriers requested by the publishers to foil automatic queries and the other constraints on data retrieval increase the difficulty of collecting a large amount of paper and author information from its database. A comparison of the MA database and the GS database is presented in Appendix A.1.2.

2.1.2	Paper full texts

The full texts of papers in the top 5 economics journals during 1990-2011 were collected from digital journal libraries (ScienceDirect and JSTOR).
For the top 5 economics journals, I excluded papers in The American Economic Review: Papers and Proceedings, as well as papers published as comments or replies, papers with fewer than 11 pages, papers that could not be recognized by the optical character recognition algorithm, and papers that did not return a correct paper ID from the MA database when I attempted to look up their academic data. After these exclusions, my dataset contained 4,482 papers, with 1,299 papers in The American Economic Review, 941 papers in Econometrica, 754 papers in Journal of Political Economy, 767 papers in The Quarterly Journal of Economics, and 721 papers published in The Review of Economic Studies.

2.1.3	Other data

Apart from the data sources above, I also collected the following data in the public domain: keyword lists under the JEL classification codes on the AEA website, a list of selected adjective words, a list of advanced words, location information of academic institutions from Open Street Map, and their economic research scores from The Tilburg University Economics Schools Research Ranking. These data were used in the paper and author information measurement.

2.2	Paper data analysis

For each paper, I counted its yearly citations, constructed categorical variables based on its journal information, and measured its research fields, topic words, and presentation style using the paper text in the first 10 sentences, the first 100 sentences, the first 200 sentences, and the full text.

2.2.1	Paper citations

Paper citation paths were constructed based on the data from the MA database. I searched for each paper’s paper ID in the MA database using the paper title. Then, I compared the author names in the matching entry in the MA database with the true author names to check whether the returned paper was correct or not.
If the returned paper had a matched title and authors, I used the publication year of each paper in the paper’s citation list to count the number of citations after t years of publication and constructed Ci,t: the cumulative citations of paper i after t years of publication. Figure 1 shows the average 10-year citations of papers by publication year in the top 5 economics journals. The average 10-year citations of papers in all of these journals increase over time, and the curve for QJE lies higher than those of the other four journals. Figure 10 shows the average cumulative citations of papers in the top 5 economics journals at the end of 2016. The trend shown in Figure 10 is consistent with the citation trend of a broader range of papers in the top 5 economics journals presented by Card and DellaVigna (2013) and Linnemer and Visser (2016). As shown in Figure 10, the papers published in recent years generally have fewer cumulative citations, partly because less time has passed since their publication. Figure 2 shows the empirical cumulative distribution function of the 10-year citations of papers with Ci,10 ≤ 1000. It can be seen that almost all of the papers published in the top 5 economics journals have less than 1,000 citations, and about 90% of them have less than 300 citations after 10 years of publication. The empirical cumulative distribution function of QJE is located to the right of the curves of the other journals, showing that the distribution of the 10-year citations of papers published in QJE is more spread out. Table 14 presents the citation statistics of the top 5 economics journals. Papers in QJE have the highest average citations, while the standard deviations of 10-year citations in these journals are all quite large. To model the upper tail of the distribution of 10-year citations, I use the Pareto (power law) distribution.
The survival function of Pareto distribution is shown in Equation 1, and the probability density function of Pareto distribution is shown in Equation 2.
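The referenced equations are presumably the standard Pareto forms; under one common convention (with x_min the lower cutoff and α the tail exponent; Clauset et al. (2009) instead parameterize the exponent of the density), the survival function and the probability density function are:

```latex
S(x) \;=\; \Pr(X \ge x) \;=\; \left(\frac{x_{\min}}{x}\right)^{\alpha}, \qquad x \ge x_{\min} \tag{1}

f(x) \;=\; \frac{\alpha\, x_{\min}^{\alpha}}{x^{\alpha + 1}}, \qquad x \ge x_{\min} \tag{2}
```

Under this convention, a smaller α corresponds to a heavier, more dispersed upper tail, consistent with the comparison of journals below.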

Figure 3 shows the complementary cumulative distribution function (CCDF) of paper 10-year citations with logarithmic horizontal and vertical axes. It can be seen that the CCDF of paper citations in all journals, except for ECMA, is well approximated by a straight line, indicating a power law distribution.

Figure 3: Complementary cumulative distribution function of paper 10-year citations

Table 1 presents the result of the Pareto distribution estimation. The P-value in each column is generated via 1,000 bootstrap replications using the algorithm in Clauset et al. (2009), and it quantifies the plausibility of the null hypothesis that the data is generated from a power law distribution. According to the hypothesis test, the null hypothesis that the data is generated from a power law distribution cannot be rejected for any journal except ECMA. To capture the pattern near the bulk of the distribution and the pattern of the upper tail separately, I partition papers into two groups based on their 10-year citations and set the cutoff point at 300 (the cutoff between roughly the top 10% and the other papers). For the papers with less than 300 citations (Ci,10 < 300), I use Gaussian kernel density estimation to fit the distribution. For the papers with at least 300 citations (Ci,10 ≥ 300), I use Pareto distribution estimation. Figure 4 shows the kernel density estimation of the 10-year citation distribution of papers with Ci,10 < 300. The citation distribution is highly right-skewed, meaning that while most of the published papers have relatively low citations, a small portion of them are very highly cited, producing the long tail to the right of the distribution. Table 2 presents the results of the Pareto distribution estimation for papers with Ci,10 ≥ 300. QJE has the smallest α among the top 5 economics journals, meaning that the 10-year citation distribution of the most highly cited papers in QJE is the most dispersed among the top 5 economics journals.
Figure 4: Kernel density estimation applied to papers with Ci,10 < 300
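For the tail above a fixed cutoff, the Pareto exponent has a closed-form maximum likelihood estimator. The sketch below (plain NumPy on simulated data) illustrates it with the cutoff of 300 used above; the paper's actual estimation follows Clauset et al. (2009), including the bootstrapped goodness-of-fit test, which is not reproduced here.

```python
import numpy as np

def pareto_alpha_mle(citations, x_min=300):
    """MLE of the Pareto tail exponent for observations at or above x_min:
    alpha_hat = n / sum(log(x_i / x_min)), with standard error alpha_hat / sqrt(n)."""
    tail = np.asarray([c for c in citations if c >= x_min], dtype=float)
    n = tail.size
    alpha_hat = n / np.log(tail / x_min).sum()
    return alpha_hat, alpha_hat / np.sqrt(n)

# Simulated tail with true exponent 1.5 and x_min = 300
rng = np.random.default_rng(42)
sample = 300 * (1 + rng.pareto(1.5, size=2000))
alpha_hat, se = pareto_alpha_mle(sample)
```

In this convention a smaller estimated α means a more dispersed upper tail, which is how the journal comparison in Table 2 is read.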

The papers in the top 5 economics journals are not only different in their 10-year citations, but also quite different in their citation paths. Figure 5 shows the citation paths of papers by 10-year citation quantiles. It can be seen that the slopes of paper citation paths are quite different from each other. The slope of citation path of papers in the top 25% is increasing over time, while the citation path of papers in the bottom 25% is almost a straight line with a lower slope. The discrepancy in paper citation paths raises the following questions: Is the discrepancy in paper citation paths caused by the differences in paper contents and author profiles? What are the common features of highly cited papers? Figure 5: Citation paths of papers in the top 5 economics journals 1990-2011

Table 15 lists the highly cited papers in the top 5 economics journals. The rank of papers is based on the citation counts in the MA database. Surprisingly, the top 5 highly cited papers cover different topics in different research fields, spreading from applied economics papers (e.g., the highly cited QJE paper by Paolo Mauro) to econometric theory papers (e.g., the highly cited RES paper by Charles Manski). In addition, the profiles and career stages of the authors of these papers as of the year of publication are also quite different. For instance, the highly cited JPE paper by Paul Krugman was published 14 years after he received his PhD degree, while the highly cited ECMA paper by Marc Melitz was his first publication. Thus, the determinants of paper citations may go far beyond simple measures of the research field and author profile of papers. In the remaining parts of this section, I describe the process of measuring high dimensional features of papers and authors using dictionary-based textual analysis. The dictionary-based textual analysis algorithms rely on the prior information supplied by the econometrician to create dictionaries that are used to parse textual data. If the prior information of the econometrician is reliable, the dictionary-based textual analysis will be able to capture potentially important features that might be hard for unsupervised learning methods to capture. In addition, compared to unsupervised learning methods, the variables constructed by dictionary-based textual analysis are generally easier to interpret.

2.2.2	Research fields of papers

I measured the research fields of papers by parsing paper texts using research field dictionaries based on the keywords listed under JEL classification codes. Firstly, I created a research field dummy variable that equals one if any keyword listed under a JEL classification code appears in the paper text.
Secondly, I calculated the percentage of research fields identified in the text being parsed and named it the coverage of research fields. Thirdly, I created research field intensity variables, which measure the frequency of keywords in each research field classified by JEL codes. Fourthly, I manually checked the research field measurement results of 20 randomly selected papers. The ranks of variables measuring the research fields of the 20 randomly selected papers were consistent with my impression of these papers, though the values of research field intensities were very similar for related fields (e.g., macroeconomics and financial economics). Compared to human coded research field measurement that only assigned one research field to each paper (e.g., Anauati et al. (2016) and Angrist et al. (2017)), this machine coded research field measurement “objectively” assigned multiple research fields to each paper, and created a continuous intensity measure for each research field. The vector of variables measuring a paper’s research fields, Fi, includes a dummy variable for each research field, coverage of research fields, and research field intensity in the first 10 sentences, the first 100 sentences, the first 200 sentences, and the full text of each paper. Table 3 presents the relative research field intensities of each journal. The relative intensities are calculated by journal, and the intensity of “Math, Quant Methods” of each journal is used as the benchmark. For each journal, if the intensity of some research field other than “Math, Quant Methods” is higher than 1, it means the frequency of keywords from that field is higher than the frequency of keywords from “Math, Quant Methods”.
It can be seen from Table 3 that papers in ECMA and RES have relatively higher “Mathematical and Quantitative Methods” and “Microeconomics” intensities than the intensities of the other research fields, while the research fields of papers in the other three journals are more evenly distributed.
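A stylized version of the three field measures (dummy, coverage, intensity) is sketched below. The two keyword lists are hypothetical stand-ins; the paper's dictionaries come from the keywords listed under the JEL classification codes.

```python
import re

# Toy dictionaries; the paper's dictionaries are built from JEL keyword lists.
FIELD_KEYWORDS = {
    "Math, Quant Methods": ["estimator", "asymptotic", "theorem"],
    "Macro, Monetary Econ": ["inflation", "monetary policy", "business cycle"],
}

def field_measures(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    n_tokens = max(len(tokens), 1)
    lowered = " ".join(tokens)
    dummies, intensity = {}, {}
    for field, keywords in FIELD_KEYWORDS.items():
        hits = sum(lowered.count(kw) for kw in keywords)
        dummies[field] = int(hits > 0)          # field dummy: any keyword appears
        intensity[field] = hits / n_tokens      # keyword frequency per token
    coverage = sum(dummies.values()) / len(FIELD_KEYWORDS)  # share of fields present
    return dummies, coverage, intensity

d, cov, inten = field_measures(
    "We derive the asymptotic distribution of the estimator; inflation is ignored."
)
```

Joining the tokens back into one string lets multi-word keywords like "monetary policy" match, which is why the intensity count works on `lowered` rather than on single tokens.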

Table 3: Paper research field intensities: “Math, Quant Methods” as benchmark

Note: The field intensity of “Math, Quant Methods” of each journal is set to 1 and is used as the benchmark. The field intensities of the other fields are the relative intensities compared with the field intensity of “Math, Quant Methods” of each journal. The measures of paper research fields are constructed by parsing papers’ full texts.

2.2.3	Topic words of papers

I used a dictionary based on the keywords under JEL classification codes to parse the texts in each section and created dummy variables for the appearance of highly frequent keywords and keyword pairs, as well as popular topic word and word pair coverage variables measuring the coverage of highly frequent keywords and keyword pairs. The algorithm for measuring topic words of papers is detailed in Appendix B.1.1. The vector of variables measuring a paper’s topic words, Wi, includes popular topic coverage, popular two-word pair coverage, popular three-word pair coverage, dummy variables for highly frequent topic words, dummy variables for highly frequent two-word pairs, and dummy variables for highly frequent three-word pairs in the first 10 sentences, the first 100 sentences, the first 200 sentences, and the full text of each paper.
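A word-pair dummy of the kind described above (e.g., (‘capital’, ‘share’)) can be illustrated as follows. The topic word list is a hypothetical stand-in, and the actual algorithm (Appendix B.1.1) is more involved; this sketch only encodes co-occurrence of both words anywhere in the parsed text.

```python
from itertools import combinations

# Hypothetical frequent topic words; in the paper these come from keyword counts.
TOPIC_WORDS = ["capital", "share", "growth", "bank"]

def pair_dummies(text):
    present = {w for w in TOPIC_WORDS if w in text.lower()}
    # A two-word pair dummy is 1 when both words appear in the parsed text
    return {pair: int(set(pair) <= present)
            for pair in combinations(TOPIC_WORDS, 2)}

dummies = pair_dummies("Capital's share of income and its growth over time.")
```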

Figure 6 presents the cloud of popular topic words in the top 5 economics journals. The size of each word measures the level of deviation from the average share of papers having that word in the top 5 economics journals. The colour of each word is determined by the journal that has the largest share of papers having that word. It can be seen that the area of AER is the smallest, due to the small number and small average size of its words, meaning that it publishes papers related to a wide range of research fields. ECMA publishes a relatively bigger share of papers having topic words related to microeconomics (e.g., “choice”) and mathematical and quantitative methods (e.g., “converge”, “probable”, “property”). JPE publishes a relatively bigger share of papers having topic words related to macroeconomics and public economics (e.g., “reserve”, “service”, “budget”). QJE publishes a relatively bigger share of papers having topic words related to some applied fields, including development (e.g., “technology”, “institute”), labour (e.g., “labour market”), and educational economics (e.g., “school”). RES is similar to ECMA in publishing a relatively bigger share of papers having topic words related to microeconomics and mathematical and quantitative methods, but leans toward different topic words (e.g., “equilibrium”, “dynamic”, “discount”).

Figure 6: Cloud of popular topic words in the top 5 economics journals 1990-2011
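The sizing and colouring rules used for the word cloud can be sketched as follows. The word shares below are invented for illustration; the real shares come from the parsed full texts:

```python
# Sketch of the Figure 6 rules: each word's size is the deviation of the
# maximal journal share from the cross-journal average share, and its colour
# is the journal with the largest share of papers containing that word.
shares = {  # share of papers containing the word, by journal (illustrative)
    "converge": {"AER": 0.10, "ECMA": 0.30, "JPE": 0.12, "QJE": 0.08, "RES": 0.20},
}

def cloud_entry(word):
    s = shares[word]
    avg = sum(s.values()) / len(s)
    journal = max(s, key=s.get)   # colour: journal with the largest share
    size = s[journal] - avg       # size: deviation from the average share
    return journal, size

journal, size = cloud_entry("converge")
```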

2.2.4	Presentation style of papers

I measured the presentation style of papers from several aspects: firstly, the frequency of selected adjective words, which measures the “descriptiveness” of a paper’s writing style; secondly, the frequency of selected advanced words, which measures the “richness” of a paper’s vocabulary; thirdly, the average length of sentences, which measures the “complexity” of a paper’s sentences; and fourthly, the number of words and the number of pages, which measure the length of the paper. I did not create a dummy variable for each adjective or advanced word because these words might be too noisy to be credible measures of the “inner quality” of a paper. For instance, the appearance of “innovative” does not necessarily mean a paper is as innovative as it claims. However, the frequencies of these words can be used as indicators of the presentation style of a paper. The algorithm for measuring presentation style of papers is presented in Appendix B.1.2. The vector of variables measuring a paper’s presentation style, Pi, includes the frequency of selected adjective words, the frequency of selected advanced words, the number of words, and the average length of sentences in the first 10 sentences, the first 100 sentences, the first 200 sentences, and the full text of each paper, as well as the number of pages of each paper. Table 4 presents the statistics of selected variables measuring paper topic information and paper presentation style information, and Figure 7 shows time series of the coverage of popular topics and paper complexity (measured by average sentence length) as of the year of publication. Notably, all of these journals are publishing papers covering more popular topics than they were in the 1990s; papers in QJE have the highest average popular topic coverage, while papers in ECMA have the lowest. In addition, papers in ECMA have the highest average paper complexity, and papers in QJE have the largest average number of pages.
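Two of the presentation-style measures can be sketched directly. The adjective list below is a hypothetical stand-in for the dictionary described in Appendix B.1.2:

```python
# Sketch of two presentation-style measures: "descriptiveness" as the
# frequency of selected adjectives, and "complexity" as the average
# sentence length in words.
import re

ADJECTIVES = {"novel", "innovative", "robust"}  # illustrative dictionary

def style_measures(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    adj_freq = sum(w in ADJECTIVES for w in words) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    return adj_freq, avg_sentence_len

adj_freq, complexity = style_measures("We propose a novel test. The test is robust.")
```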
2.2.5	Journal information of papers

I retrieved journal of publication information from digital journal libraries and retrieved publication year information from the MA database. The publication year of a paper in the MA database was assigned based on the first date when the paper was publicly accessible, which could be years ahead of or behind the journal date. I manually checked the publication years of 20 randomly selected papers and found that almost all of them were assigned the correct publication year. The vector of variables measuring a paper’s journal information, Ji, includes dummy variables for publication years and dummy variables for the top 5 economics journals.

2.3	Author data analysis

I measured author information using data from the MA database, as well as data from other sources linked to the author. Firstly, I measured each author’s academic experience by the number of years of the author’s citation record. Secondly, I calculated the number of the author’s publications and the author’s cumulative citations at the end of each year. Thirdly, I constructed variables measuring the publication records of the author’s co-authors, in order to measure the strength of the author’s collaboration network. Fourthly, I used the author’s institution information in the MA database and the economic research score linked to that institution to measure the economic research score and country of the author’s institution. The algorithms for measuring author information are documented in Appendix B.2. Table 5 summarizes the variables that were constructed to measure author information (including the author’s collaboration network information). For papers with more than one author, I used the average value of all of the authors’ information. The vector of variables measuring a paper’s author information is denoted as Ai,t.
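The averaging over a paper’s authors described above can be sketched as follows. The publication records here are invented for illustration, not drawn from the MA database:

```python
# Sketch of author-level aggregation: per-author publication count and
# cumulative citations, averaged over a paper's authors (as done for
# multi-authored papers).
pubs = {  # author -> list of (paper_id, citations); illustrative records
    "A": [("p1", 50), ("p2", 10)],
    "B": [("p1", 50), ("p3", 30)],
}

def author_vector(author):
    records = pubs[author]
    n_pubs = len(records)
    cum_cites = sum(c for _, c in records)
    return n_pubs, cum_cites

def paper_author_info(authors):
    vecs = [author_vector(a) for a in authors]
    n = len(vecs)
    return tuple(sum(v[i] for v in vecs) / n for i in range(2))  # average over authors

avg_pubs, avg_cites = paper_author_info(["A", "B"])
```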
Table 6 presents the statistics of selected variables measuring author publication information and authors’ collaboration network information as of the year of publication. As shown in Table 6, there are noticeable differences in author information among the top 5 economics journals: authors of papers in ECMA are the most experienced and have the highest average number of top publications, while papers in AER and QJE have more authors with higher average cumulative citations as of the year of publication. In addition, authors of papers in QJE have the strongest average collaboration networks, measured by the total number of top 5 publications and the total cumulative citations of papers written by co-authors of the author, as well as the highest author affiliation scores. Figure 8 shows time series of selected author information (author experience, author cumulative citations, number of co-authors, and total cumulative citations of papers written by co-authors of the author) as of the year of publication. Authors of papers published in the 2000s are more experienced, more highly cited, and have stronger collaboration networks as of the year of publication than authors of papers published in the 1990s.

3	Estimation and Prediction Strategy

In this section, I present the strategy for estimating paper and author effects on 10-year citations and citation paths, and the strategy for predicting a paper’s 10-year citations with the information available as of the year of publication.

3.1	Estimation strategy

Apart from estimating paper and author effects using low-dimensional measures of features of papers and authors, I also investigate the effect of the appearance of topic words and word pairs on a paper’s 10-year citations. The vectors of word and word pair dummies constructed by textual analysis are high dimensional (p > n), which makes Ordinary Least Squares inadequate for estimating the regression equation.
To estimate the model, I use the Post-Lasso method suggested by Belloni et al. (2012).

3.2	Prediction strategy

In this subsection, I present the strategy for predicting a paper’s 10-year citations with the information available as of the year of publication. The dictionary-based textual analysis presented in Section 2 constructed more than 6,000 variables measuring the features of papers and authors. Some of these variables might be powerful predictors of paper citations, but such high dimensional measures increase the risk of overfitting in traditional regression methods (e.g., Ordinary Least Squares). In addition, the large number of available variables makes it hard to construct interaction terms purely by intuition. For instance, it might make sense to construct interactions between some variables measuring paper information and some variables measuring author information, but simply including all pairwise interactions would produce more than 36,000,000 interaction terms, which would be infeasible in practice. Thus, I use machine learning methods that can handle high dimensional data and create the most predictive interactions between variables. Regression shrinkage methods can reduce the number of covariates and detect the most powerful predictors. However, the parameters in regression shrinkage methods are fitted jointly, which makes the computational burden of fitting non-linear models with many higher-order interactions between measures of papers and authors fairly high. Thus, other machine learning methods might be more suitable for adding higher-order interactions into the prediction model. One way to add higher-order interactions is to use a Neural Network. A Neural Network model is normally formed by one input layer, one output layer, and one or several hidden layers. Predictors enter the model via the input layer, and they are weighted, combined, and transformed by the activation function of each unit in the first hidden layer.
Then, the outputs of units in the first hidden layer are weighted, combined, and transformed by the activation functions of units in the second hidden layer. This process continues until the output layer is reached, where the outputs of units in the last hidden layer are weighted and combined to form the outputs of the Neural Network. The interactions between variables are created by combining the outputs of units in hidden layers, and nonlinearity is added by nonlinear activation functions. Since the Random Forest and Gradient Boosted Trees methods do not compare the predictive power of each variable before building regression trees, their prediction performance might be negatively affected by unpredictive variables. Thus, embedding a method that selects the variables to be used in the tree models might improve the prediction ability of Random Forest and Gradient Boosted Trees. Based on my evaluation of these machine learning methods, I develop a hybrid method that combines dictionary-based textual analysis, regression shrinkage, and gradient boosted trees, in order to combine the advantages and partially overcome the shortcomings of each method. This hybrid method uses dictionary-based textual analysis to construct variables measuring features of papers and authors, uses regression shrinkage to select variables that are powerful predictors of paper citations, and uses gradient boosted trees to fit a non-linear prediction model with high order interactions. The steps are:

Step 1: Variable Construction. Use dictionary-based textual analysis of unstructured data to construct high dimensional vectors of variables measuring paper and author information.

Step 2: Variable Selection. Use shrinkage methods to fit a variety of linear models predicting the 10-year citations of each paper, and select the model that gives the smallest out-of-sample Mean Squared Error (MSE).
Then, use the variables with non-zero coefficients as predictors in Step 3.

Step 3: Model Fitting. Add the predictors selected in Step 2 sequentially to a variety of Gradient Boosted Trees models predicting the 10-year citations of each paper, and select the Gradient Boosted Trees model that gives the smallest out-of-sample MSE.

In Section 4, I present the results of estimating paper and author effects on paper citations. In Section 5, I compare the proposed hybrid method with other machine learning methods in terms of MSE and out-of-sample fit.

4	Estimation Results and Discussion

4.1	Effects on paper’s 10-year citations

4.1.1	Paper effect and author effect

Table 7 presents the estimated coefficients of the variables measuring paper and author information as of the year of publication on papers’ 10-year citations. Papers in AER 1990 are used as the baseline. Compared with papers in the other journals, papers in QJE get more citations even after controlling for a variety of paper and author information, though the QJE effect decreases as more control variables are added. The coefficient of QJE in column (5) suggests that papers in QJE are predicted to get 2.33 more citations 10 years after publication, controlling for paper and author information. One potential explanation might be that QJE was better at publicizing its publications. It is also possible that editors of QJE preferred papers that would be highly cited, while editors of the other journals did not have strong preferences for highly cited papers. In addition, differences in the pools of submitted manuscripts could be another cause of the QJE effect. The estimation results confirm the importance of paper research field in determining paper citations: papers with higher “Macro, Monetary Econ” intensity generally get more citations, while papers with higher “Micro”, “Public Econ”, “Labor Econ”, and “Agricultural, Environmental Econ” intensity get fewer citations.
Specifically, the coefficients in column (5) suggest that a 0.25% higher “Macro, Monetary Econ” intensity leads to two more 10-year citations, while a 0.25% higher “Micro” intensity leads to one fewer 10-year citation. Paper topic and presentation style also determine a paper’s citations: papers that cover more popular topics and have more pages tend to have more citations, while papers with higher complexity get fewer citations. One potential explanation might be that a paper covering more popular topics is related to a broader range of subsequent studies that may cite it, while a complex paper takes longer for other researchers to read and cite in subsequent published papers. Among the variables measuring author information, the number of authors and the number of an author’s co-authored papers are positively correlated with citations, while the coefficients of author experience and the number of an author’s publications are not significant. In addition, the coefficients of some variables measuring the strength of an author’s collaboration network are also significant. Specifically, higher 10-year citation counts are associated with higher total citations and numbers of top 5 publications of authors’ co-authors, but with lower numbers of publications and numbers of top field publications of authors’ co-authors. Surprisingly, 10-year citation counts are negatively correlated with the number of authors’ top field publications, and the coefficient in column (5) suggests that papers by authors with 10 more top field publications are predicted to have 2.3 fewer 10-year citations. One possible explanation might be that for an author with more top field publications, the competition for citations between her/his publications in the same field might be fiercer, which reduces the average citations of her/his papers.
Another possible explanation might be that a higher number of top field publications sends a positive signal to editors about the quality of an author’s submissions, even though the inner quality could be low. Card and DellaVigna (2017) report a positive effect of the number of an author’s top publications on paper citation counts; however, this paper measures the number of an author’s top publications and some other paper/author information differently from their paper. The adjusted R-squared in all of these regressions is very low. The regression in column (5) gives the largest adjusted R-squared of 0.17, meaning that only 17% of the variation in 10-year citations can be explained by the variables measuring paper and author information as of the year of publication in a linear regression. Thus, to better model the variation in 10-year citations and predict out-of-sample, a model with more covariates and higher order interactions might be necessary.

4.1.2	Effect of topic word appearance

Table 7 has shown that some of the research field measures are significantly correlated with papers’ 10-year citation counts, and the coefficient of popular topic coverage is positive in all of these regressions. To investigate the effect of each popular topic word on papers’ 10-year citation counts, I estimate the effect of the appearance of each popular topic word and word pair on papers’ 10-year citation counts. Since adding topic word dummies, word pair dummies, and control variables to Equation 3 yields more than 6,000 variables, I use the Post-Lasso method to estimate the regression equation. Table 8 presents the top topic words ranked by the absolute values of their estimated coefficients.
Some topic words and topic word pairs related to macroeconomics (e.g., “GDP”, (‘capital’, ‘share’), (‘product’, ‘develop’, ‘growth’), and (‘distribute’, ‘income’)), quantitative methods (e.g., “correl”), financial economics (e.g., “bank”), and educational economics (e.g., “school”) have the largest positive coefficients, while some topic words and topic word pairs related to microeconomics (e.g., (‘ration’, ‘inform’)), law and economics (e.g., (‘prison’, ‘sentence’)), and transportation economics (e.g., (‘vehicle’, ‘drive’, ‘port’)) have the largest negative coefficients. However, some words with similar meaning have opposite estimated coefficients (e.g., “correl” and “regress”), and some topic words can represent multiple research topics (e.g., “optim” could be the stem of “optimal contract”, “optimal taxation”, or other research topic words). The coefficients in Table 8 show the main contributors to the positive coefficients of popular topic coverage and the intensities of some research fields. However, due to inevitable measurement error and multicollinearity, individual coefficients should be interpreted with caution. In addition, estimating the distribution of post-model-selection estimators is not trivial (for discussion of this issue, see Leeb and Pötscher (2006) and Leeb and Pötscher (2008)).

4.2	Effects on paper citation path

In this subsection, I present the results of estimating paper and author effects on the paper citation path with a panel of paper information and yearly author information. When discussing the estimation results, I focus on the coefficients of the linear term of the quadratic Equation 4, because it is the main determinant of the slope of the paper citation path.

4.2.1	Paper effect

Table 17 presents the estimated coefficients of a paper’s research field on its citation path.
The results suggest that a steeper slope of the paper citation path is associated with higher “Math, Quant Methods”, “Econ Development, Growth”, and “Econ Systems” intensity, and with lower “Macro, Monetary Econ”, “Public Econ”, “Labor Econ”, and “Industrial Organization” intensity. Notably, the negative effect of “Micro” on the slope of the citation path fades away after variables measuring author information are controlled for, which indicates that the negative effect of “Micro” intensity on the slope of the citation path might be caused by features of authors, instead of “Micro” intensity itself. Table 18 presents the coefficients of paper topic and presentation style on the citation path. It shows that a steeper slope of the citation path is associated with higher popular topic coverage and number of pages in all of these regressions, though the coefficients decrease after the variables measuring author information are controlled for. In contrast, the coefficients of the complexity, descriptiveness, and vocabulary richness of a paper’s presentation style are not significant in most of these regressions.

4.2.2	Author effect

Table 19 presents the coefficients of variables measuring author information on paper citation paths. The results show that a steeper slope of the citation path is associated with a higher number of authors and author cumulative citations, and with a lower number of the author’s top field publications, number of the author’s top 5 publications, number of the author’s co-authored publications, and number of publications written by co-authors of the author. These results indicate that papers written by a larger team of highly cited authors tend to have a steeper citation path. The explanations for the negative effect of the number of an author’s top field publications on 10-year citation counts could help explain the negative coefficients of the number of an author’s top publications.
Another explanation might be that for an author with more subsequent top publications, the theory or method proposed in the original paper is extended in the subsequent top publications, which leads researchers to cite the author’s subsequent top publications instead of the original paper.

4.2.3	Heterogeneity in effects on paper citation path

Table 20 compares the estimates of paper and author effects on the paper citation path across the top 5 economics journals. It shows that papers in RES are the most negatively affected by higher “Macro, Monetary Econ” intensity, and QJE has extremely large positive coefficients of “Econ Development, Growth” and “Econ Systems” intensity. Regarding paper presentation style, AER and QJE are negatively affected by higher paper complexity, while the coefficients of the other three journals are not significant. The coefficients of variables measuring author information also show heterogeneity among journals. Notably, papers in JPE benefit the most from a larger team of highly cited authors. The heterogeneity among journals also exists in the adjusted R-squared. The regression for JPE gives the largest adjusted R-squared of 0.54, while the regression for RES gives the smallest adjusted R-squared of 0.16, meaning that a much smaller portion of the variation in RES papers’ citations can be explained by the variables in the fixed effect model. One potential explanation for this heterogeneity could be that some important determinants of RES papers’ citations were not captured. Table 21 compares the estimates of paper and author effects on the paper citation path across different author groups. The authors were grouped according to the ranking of the author’s most recent institution, and authors without a matchable institution are excluded. Since more than half of the observations did not have matched institution information, the regression results in Table 21 should be interpreted with caution.
Table 21 shows that the coefficients have no clear trend across author groups, and most of them are insignificant. The estimation results help deepen our understanding of the drivers of paper citations, but the low R-squared shows that a simple regression model might be inadequate for modelling the variation in paper citation counts and predicting out-of-sample. In the next section, I present the results of using machine learning methods to predict papers’ 10-year citations with the information available as of the year of publication.

5	Prediction Results and Discussion

5.1	Paper citation out-of-sample prediction

In this subsection, I test the ability of machine learning methods to predict papers’ 10-year citation deciles with the information available as of the year of publication. I use the 10-year citation deciles instead of the 10-year citation counts as the variable to be predicted, in order to reduce the influence of extremely highly cited and extremely lowly cited papers. Even though predicting citation deciles reduces the difficulty of prediction, the low adjusted R-squared (less than 0.5) in Table 22 shows that it is still hard for OLS models to explain the variation in paper citations. In addition, as will be shown in this section, even the prediction model that gives the smallest Mean Squared Error (MSE) cannot predict the citation counts of the highest and lowest deciles well. The 3,472 papers that have 10-year citation data are used in this test. I sort papers into 10 parts based on the deciles of papers’ 10-year citations and create the variable Di,10, the 10-year citation decile of paper i, which takes levels 1 to 10. Then, I randomly select 70% of the papers as training samples and 30% of the papers as testing samples, and use prediction models to predict a paper’s Di,10 with paper and author information available as of the year of publication.
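The decile construction, 70/30 split, and the shrinkage-then-boosting pipeline of Section 3.2 can be sketched with scikit-learn on synthetic data. All data and parameter values below are illustrative placeholders, not the paper’s tuned configuration:

```python
# Sketch of the Shrinkage-Gradient Boosted Trees hybrid on synthetic data:
# Lasso selects the predictors with non-zero coefficients, then Gradient
# Boosted Trees is fitted on the selected columns only.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                      # stand-in for paper/author measures
y_latent = X[:, 0] * 2 + X[:, 1] + rng.normal(size=500)
deciles = np.ceil(10 * (np.argsort(np.argsort(y_latent)) + 1) / 500)  # D_{i,10} in 1..10

X_tr, X_te, y_tr, y_te = train_test_split(X, deciles, test_size=0.3, random_state=0)

# Step 2: variable selection by regression shrinkage
selected = np.flatnonzero(Lasso(alpha=0.1).fit(X_tr, y_tr).coef_)

# Step 3: fit Gradient Boosted Trees on the selected predictors only
gbt = GradientBoostingRegressor(random_state=0).fit(X_tr[:, selected], y_tr)
mse = np.mean((gbt.predict(X_te[:, selected]) - y_te) ** 2)
```

The second stage sees only the shrinkage-selected columns, which is what keeps the number of predictors in the final model small.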
To compare the models’ out-of-sample prediction performance, I use the Mean Squared Error (MSE), MSE = (1/n) Σi (D̂i,10 − Di,10)², where n is the number of papers in the testing set and D̂i,10 is the predicted 10-year citation decile of paper i. The vectors of variables measuring paper and author information are added sequentially to the prediction models. The vectors of variables in the prediction models are shown in Table 23. Model (1) only includes variables measuring journal information, Model (2) adds variables measuring field information, Model (3) adds variables measuring popular topic coverage, Model (4) adds variables measuring presentation style, Model (5) adds variables measuring author information in the publication year, Model (6) adds lagged author variables, and Models (7) to (9) sequentially add high dimensional measures of paper topics. The linear model fitted by Ordinary Least Squares is used as the baseline model, and regression shrinkage methods (Post-Lasso, Lasso, Ridge, Elastic Net), Neural Network, Random Forest, Gradient Boosted Trees, and the hybrid method developed in Section 3.2 are compared with each other. The implementation and the values of the key parameters of the machine learning methods are presented in Table 24. The parameters of these machine learning methods are determined after testing many combinations of parameters. Table 9 presents the out-of-sample prediction results. It shows that the MSE of the prediction methods, except for OLS, generally decreases as more predictors are added (from Model (1) to Model (9)). The results from regression shrinkage methods, Random Forest, and Gradient Boosted Trees show that the MSE decreases by more than 10% after adding variables measuring research fields (from Model (1) to Model (2)), and decreases by another roughly 10% after adding variables measuring author information (from Model (4) to Model (5)). However, after adding the high dimensional vector of topic word dummies (from Model (6) to Model (7)), some machine learning methods begin to overfit.
The Neural Network gives better prediction results than OLS, but it is not as good as the other machine learning methods in this prediction test. Instead, the results show that it suffers from overfitting when many predictors are added, with MSE increasing from 7.04 in Model (7) to 13.08 in Model (9). One possible reason might be that the Neural Network has a large number of parameters, and the sample size in this paper is not big enough for the Neural Network to show its merit. It is also possible that the prediction ability of the Neural Network could be improved by a more complex network structure and other combinations of parameters. The prediction performance of the proposed Shrinkage-Gradient Boosted Trees Hybrid method is much better than OLS and marginally better than the regression shrinkage methods, Random Forest, and Gradient Boosted Trees. In addition, the Shrinkage-Gradient Boosted Trees Hybrid Model (6), which gives the smallest MSE, only uses 221 predictors, whereas the Gradient Boosted Trees Model (7), which gives the second smallest MSE, uses 1,836 predictors. This property of the Shrinkage-Gradient Boosted Trees Hybrid method significantly reduces the cost of data collection and computation when using it to predict the 10-year citation deciles of new papers. The row “SGBT Hybrid (No JID)” in Table 9 reports the results of the Shrinkage-Gradient Boosted Trees Hybrid model without the journal ID variable. It shows that in prediction models with many predictors, adding the journal ID variable only marginally improves the prediction performance. The top 50 predictive variables selected by the regression shrinkage model with the smallest MSE (Cross-Validation Elastic Net Model (9)) are shown in Table 10. The result shows that being published in QJE predicts higher 10-year citations, while being published in RES predicts lower 10-year citations.
However, since more than 200 variables are selected by the regression shrinkage model, the importance of the journal dummies is much diminished. The top predictive variables measuring paper information are “Popular Topic Coverage in the Full Text” and “Number of Pages”, and their positive coefficients indicate that longer papers with higher coverage of popular topics are predicted to have higher 10-year citations. The top predictive variables measuring author information are “Total Number of the Top 5 Publications of Authors’ Co-authors”, “Author Cumulative Citations”, and “Number of Co-authors within the Top 5 Publications”. These variables measure author citations and the strength of the author’s collaboration network, and their positive coefficients show that papers written by highly cited authors with stronger collaboration networks are predicted to have higher 10-year citations. A large portion of the variables in Table 10 consists of topic words and word pairs. The coefficients of these variables confirm that the appearance of certain research topics is a powerful predictor of papers’ 10-year citations. In addition, some variables measuring paper research fields are powerful predictors of 10-year citations: “Micro in Full Text” negatively predicts 10-year citations, while “Math, Quant Methods in the First 200 Sentences” positively predicts 10-year citations. Figure 9 shows the predicted citation distribution given by the preferred hybrid method and the actual citation distribution of the papers in the testing set. As shown in Figure 9, the preferred hybrid method can partially capture the skewness of the citation distribution of the top 5 economics journals. However, it predicts a distribution concentrated around its mean value, and cannot predict the citation counts of the highest and lowest deciles well. One potential explanation could be that some important features of papers in the highest and lowest deciles are not well captured.
Figure 9 also shows that the citation distributions predicted by the models with and without the journal ID are virtually the same. Table 11 presents the result of a two-sample Kolmogorov–Smirnov test of the null hypothesis that the distributions predicted by the models with and without the journal ID are drawn from the same distribution. The null hypothesis is not rejected for all of these journals, which provides additional evidence that journal information is not very important for the hybrid method to predict the 10-year citation deciles.

5.2	Application to academic publishing process

In this subsection, I discuss the potential of the prediction models to be used as a first stage screening tool in the academic publishing process. The preferred Ordinary Least Squares model (OLS Model (1)) is used as a baseline to be compared with the preferred shrinkage model (Cross-Validation Elastic Net Model (9)), tree model (Gradient Boosted Trees Model (7)), and hybrid model (Shrinkage-Gradient Boosted Trees Hybrid Model (6)).

5.2.1	Two-category case

In the first test, I test the prediction models’ ability to identify the papers in the upper half of the citation distribution. Firstly, I separate papers into two citation categories: the “highly cited” category if a paper is in the upper half of the citation distribution (Di,10 > 5), and the “lowly cited” category if a paper is in the lower half of the citation distribution (Di,10 ≤ 5). Then, I use each preferred prediction model to predict the 10-year citation decile of papers in the testing set. Lastly, I code the predicted citation category of each paper in the testing set based on the paper’s predicted 10-year citation decile.
Define condition positive (P) as the number of real “highly cited” papers, condition negative (N) as the number of real “lowly cited” papers, true positive (TP) as the number of correctly predicted “highly cited” papers, true negative (TN) as the number of correctly predicted “lowly cited” papers, false positive (FP) as the number of incorrectly predicted “highly cited” papers, and false negative (FN) as the number of incorrectly predicted “lowly cited” papers in the testing set. Then, I use “Precision”, “Recall”, “Accuracy”, “Pr (High|Predicted high)”, and “Pr (Low|Predicted low)” to assess the prediction performance of these models. Their formulas are shown in Equations 23-27. Table 12 shows that the prediction result of the Ordinary Least Squares model is almost equivalent to a random guess, whereas the Elastic Net, Gradient Boosted Trees, and Shrinkage-Gradient Boosted Trees Hybrid models have much higher precision, recall, and accuracy rates. In addition, the Shrinkage-Gradient Boosted Trees Hybrid model has marginally better prediction performance than the Elastic Net and Gradient Boosted Trees models. Among the papers predicted by the Shrinkage-Gradient Boosted Trees Hybrid model to be highly cited, 72.7% turn out to be highly cited, and among the papers predicted to be lowly cited, 76.7% turn out to be lowly cited 10 years after publication. Suppose an editor used paper citations as one of the criteria in editorial decision making. Then, the Shrinkage-Gradient Boosted Trees Hybrid model may be a useful tool, giving an “Accept” suggestion if a paper is predicted to be highly cited and a “Reject” suggestion if a paper is predicted to be lowly cited.

5.2.2	Three-category case

In the second test, I test the prediction models’ ability to identify papers in three citation distribution categories: “highly cited” (Di,10 > 7), “middle” (4 ≤ Di,10 ≤ 7), and “lowly cited” (Di,10 < 4).
Then, I test each prediction model's ability to label the correct citation category of each paper in the testing set. Table 13 shows that the Ordinary Least Squares model cannot identify the "highly cited" papers, and can identify the "lowly cited" papers correctly less than 50% of the time. In contrast, the other three models can identify the "highly cited" papers and the "lowly cited" papers more than 64% of the time. Of the papers predicted by the Shrinkage-Gradient Boosted Trees Hybrid model to be "highly cited", 65.0% turn out to be "highly cited", and only 4.7% turn out to be "lowly cited". Of the papers predicted to be "lowly cited", 66.7% turn out to be "lowly cited", and only 2.6% turn out to be "highly cited" after 10 years of publication. Given that the chance of predicting "highly cited" papers as "lowly cited" and the chance of predicting "lowly cited" papers as "highly cited" are both low, the hybrid method proposed in this paper may help identify articles that are sufficiently below the acceptance threshold of a journal, enabling editors to reject a significant fraction of inappropriate or low-quality submissions while avoiding the rejection of submissions that will turn out to be highly cited. The prediction performance of the proposed Shrinkage-Gradient Boosted Trees Hybrid method is much better than that of OLS. However, it cannot predict the citation counts of the highest decile and the lowest decile well. In the citation prediction tests, the hybrid method shows its potential to be used as a first-stage screening tool in the academic publishing process.

6	Conclusion

The distribution of 10-year citations of papers in the top 5 economics journals is highly right-skewed, and the upper tail of the citation distribution is well approximated by a Pareto distribution.
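One standard way to check a Pareto-like upper tail is the Hill estimator of the tail exponent. The sketch below applies it to a synthetic Pareto sample generated by inverse-CDF on a deterministic grid; the paper's actual tail-fitting procedure is not reproduced here and may differ:

```python
import math

def hill_estimator(values, k):
    """Hill estimator of the Pareto tail exponent alpha, based on the
    top k order statistics of the sample."""
    xs = sorted(values, reverse=True)
    threshold = xs[k]
    return k / sum(math.log(x / threshold) for x in xs[:k])

# Synthetic Pareto(alpha = 2.5) data via inverse-CDF on a grid:
alpha = 2.5
sample = [(1 - (i + 0.5) / 5000) ** (-1 / alpha) for i in range(5000)]
alpha_hat = hill_estimator(sample, k=500)  # close to the true 2.5
```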
I use new measures of features of papers and authors to estimate paper and author effects on the 10-year citation distribution and citation paths. The estimation results show that papers that turn out to have higher 10-year citations are associated with higher popular topic coverage, more authors, and higher total citations of authors' co-authors, and with lower "Micro" intensity, lower paper complexity, and fewer top-field publications by the authors. I also use the measures of features of papers and authors as predictors in machine learning models to predict papers' 10-year citations. The hybrid method developed in this paper performs much better than Ordinary Least Squares in the 10-year citation out-of-sample prediction test, while using a relatively small number of variables compared to other machine learning methods. This property of the hybrid method significantly reduces the cost of data collection and computation when using it to predict the 10-year citation counts of a new paper.

The estimation and prediction strategy for analyzing large-scale, high-dimensional data used in this paper has the potential to be applied to investigate and predict decision making in other settings where large-scale, high-dimensional data are produced, such as media markets, financial markets, online shopping, and online social networks. The hybrid method has shown its potential to help find highly cited papers and to serve as a first-stage screening tool in the academic publishing process, directing the scarce time of editors and referees more efficiently. However, it cannot predict the citation counts of the highest decile and the lowest decile well. It would be interesting for future study to identify additional features of papers and authors that predict paper citations, as well as to explore other types of hybrid methods, such as a hybrid of regression shrinkage and a deep neural network.
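The two-stage structure of the hybrid method (select a small set of predictors first, then fit boosted trees on that set) can be illustrated with a minimal stand-in. The paper's stage 1 is a cross-validated elastic net; the sketch below replaces it with a simple correlation screen for brevity, and stage 2 is gradient boosting with depth-1 trees under squared-error loss:

```python
import statistics

def select_features(X, y, threshold=0.5):
    """Stage 1 stand-in: keep features whose absolute correlation with
    the outcome exceeds a threshold. (The paper instead keeps the
    predictors with nonzero cross-validated elastic net coefficients.)"""
    def corr(col):
        mx, my = statistics.mean(col), statistics.mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        sx = sum((a - mx) ** 2 for a in col) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0
    return [j for j, col in enumerate(zip(*X)) if abs(corr(col)) > threshold]

def boost(X, y, features, rounds=100, lr=0.1):
    """Stage 2: gradient boosting with stumps on the selected features,
    squared-error loss (each round fits the best single split to the
    current residuals)."""
    pred = [statistics.mean(y)] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for j in features:
            for t in sorted({row[j] for row in X}):
                left = [r for r, row in zip(resid, X) if row[j] <= t]
                right = [r for r, row in zip(resid, X) if row[j] > t]
                if not left or not right:
                    continue
                lm, rm = statistics.mean(left), statistics.mean(right)
                sse = (sum((r - lm) ** 2 for r in left)
                       + sum((r - rm) ** 2 for r in right))
                if best is None or sse < best[0]:
                    best = (sse, j, t, lm, rm)
        if best is None:
            break
        _, j, t, lm, rm = best
        pred = [p + lr * (lm if row[j] <= t else rm)
                for p, row in zip(pred, X)]
    return pred
```

On data where the outcome depends on only a few columns, the screen discards the irrelevant ones before the relatively expensive boosting stage, which is the property that keeps the hybrid's variable count, and hence its data collection cost, small.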
It would also be interesting to see whether prediction performance can be further improved by embedding unsupervised learning methods. Papers in QJE receive higher citations even after controlling for a variety of paper and author information, though the QJE effect becomes much less important in prediction models with many variables. The QJE effect could be caused either by differences in editors' preferences for papers that would be highly cited, or by differences in the pools of submitted manuscripts. The data used in this paper contain only the papers published by the top 5 economics journals; the rejected ones are not observed. In Guo (2017), I use confidential data on manuscript submissions (including rejected ones) and records of decision making for several academic journals, linked with yearly paper citation data and author data, to investigate the drivers of editor decisions and paper citations. The confidential data in journal databases might help deepen our understanding of the journal effect on paper citations.

A.1.1	Comparison of the Microsoft Academic database and the Google Scholar database

The objective of this subsection is not to compare the data quality of the Microsoft Academic (MA) database and the Google Scholar (GS) database. Instead, I discuss some differences between collecting academic data from these two databases. These differences led me to use the MA database as the main source of academic data.

Firstly, the MA database provides an Application Programming Interface (API) for sending automatic queries to its database, so the data collection process does not require human intervention. In contrast, GS has built-in barriers, required by publishers, to foil automatic queries, which impedes the efficiency of collecting data from its website. Secondly, paper IDs and author IDs are accessible in the MA database. This feature provides more flexibility in designing algorithms to reduce the measurement error of paper citation lists and author publication lists. In GS, by contrast, paper IDs and author IDs are not accessible, so a scraping algorithm may mix up the publication lists of authors who share the same name, which causes miscounting. Thirdly, the rate limit of the MA API is fairly high, which makes it possible to send a large number of queries to the database to collect collaboration network information for a large group of researchers. Fourthly, the MA database allows retrieving a paper's full citation list. By comparison, the GS website displays at most 100 pages of a paper's citation list, which contain only 1,000 of the citing papers. This limitation truncates the citation lists of papers with more than 1,000 citations.
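The paging difference can be made concrete with a small sketch. The parameter names below (`expr`, `count`, `offset`) are illustrative stand-ins for an evaluate-style API call, not the exact MA API schema:

```python
def paged_offsets(total_citations, page_size=1000):
    """Page offsets needed to walk a paper's full citation list."""
    return list(range(0, total_citations, page_size))

def build_queries(paper_id, total_citations, page_size=1000):
    """Hypothetical paged queries for retrieving all papers citing one
    paper through an API that supports offset-based paging."""
    return [{"expr": f"RId={paper_id}", "count": page_size, "offset": off}
            for off in paged_offsets(total_citations, page_size)]

# A paper with 2,500 citations needs 3 pages through the API; the GS
# website's 100-page cap would leave the last 1,500 citations unreachable.
queries = build_queries(paper_id=123456789, total_citations=2500)
```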