Co-citation Proximity Analysis

Co-citation Proximity Analysis (CPA) is a document similarity measure that uses citation analysis to assess the semantic similarity of documents, both at the level of whole documents and at the level of individual sections. The measure builds on co-citation analysis but differs in that it exploits the information implied by the placement of citations within the full text of documents.

Co-citation Proximity Analysis was conceived by B. Gipp in 2006, and a description of the measure was published by Gipp and Beel in 2009. The measure rests on the assumption that documents cited in close proximity to each other within a document’s full text tend to be more strongly related than documents cited farther apart. The figure to the right illustrates the concept. CPA assumes documents B and C to be more strongly related than documents B and A, because the citations to B and C occur within the same sentence, whereas the citations to B and A are separated by several paragraphs.

The advantage of the CPA approach over other citation and co-citation analysis approaches is an improvement in precision. Other widely used citation analysis approaches, such as bibliographic coupling, co-citation, or the Amsler measure, do not take into account the location or proximity of citations within documents. The CPA approach allows a more granular automatic classification of documents and can be used to identify not only related documents but also the specific sections within texts that are most related.

Method of calculation
The CPA similarity measure calculates a Citation Proximity Index (CPI) for each set of documents cited by an examined document. Cited documents are assigned a weight of $$\frac{1}{2^n}$$, where n stands for the number of levels between citations. Beginning at the lowest level, levels may be defined as citation groups, sentences, paragraphs, chapters, and finally the entire document or even journal.
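
The weighting can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the description above: citation positions are given as hypothetical `(chapter, paragraph, sentence)` tuples, n is taken to be the number of structural levels one must move up before two citations share a container (0 for the same sentence, 3 for the same document only), the CPI is computed for pairs of cited documents, and the weights of repeated co-citations are simply summed. The function names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations


def level_distance(pos_a, pos_b):
    """Number of structural levels separating two citation positions.

    Positions are (chapter, paragraph, sentence) tuples, with paragraph and
    sentence indices counted within their parent (an assumption of this sketch).
    Returns 0 for the same sentence, 1 for the same paragraph,
    2 for the same chapter, and 3 when only the document is shared.
    """
    shared = 0
    for x, y in zip(pos_a, pos_b):
        if x != y:
            break
        shared += 1
    return len(pos_a) - shared


def citation_proximity_index(citations):
    """Sum the weights 1 / 2**n over all pairs of in-text citations.

    `citations` is a list of (cited_doc, position) entries found in one
    citing document. Summing over occurrences is an assumed aggregation.
    """
    cpi = defaultdict(float)
    for (doc_a, pos_a), (doc_b, pos_b) in combinations(citations, 2):
        if doc_a == doc_b:
            continue  # a document is not compared with itself
        pair = tuple(sorted((doc_a, doc_b)))
        cpi[pair] += 1 / 2 ** level_distance(pos_a, pos_b)
    return dict(cpi)


# B and C are cited in the same sentence; A is cited in a different chapter.
citations = [
    ("A", (1, 1, 1)),
    ("B", (3, 2, 4)),
    ("C", (3, 2, 4)),
]
scores = citation_proximity_index(citations)
```

Under these assumptions the pair (B, C) receives a higher score than the pair (A, B), which mirrors the intuition that documents cited closer together are more strongly related.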

There are several variations of the CPA algorithm:

 * Basic-CPA – fundamental concept of CPA as described above
 * Extended-CPA – considers the tree structure and order of citations within citation groups
 * Multidimensional-CPA – uses additional information such as the impact factor
 * Hybrid-CPA – combines the CPI with other similarity measures, for example text-based measures. This boosts performance especially for documents with insufficient citation information.
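
As an illustration of the hybrid idea, the following Python sketch blends a CPI with a simple text-based score. How Hybrid-CPA actually combines the measures is not specified here, so everything in the sketch is an assumption: the linear blend, the mixing weight `alpha`, the scaling of the CPI by a collection-wide maximum `max_cpi`, and the use of cosine similarity over raw term counts as the text-based measure.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity over raw term counts (a stand-in text-based measure)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(cpi, max_cpi, text_sim, alpha=0.5):
    """Linear blend of a scaled CPI and a text similarity.

    `alpha` is a hypothetical mixing weight; `max_cpi` scales the CPI
    into [0, 1] so that both terms are comparable.
    """
    cpi_norm = cpi / max_cpi if max_cpi else 0.0
    return alpha * cpi_norm + (1 - alpha) * text_sim


# A pair of documents that is never co-cited (CPI of 0) still gets a score
# from the text-based term.
text_sim = cosine_similarity("citation proximity analysis", "citation analysis")
score = hybrid_score(cpi=0.0, max_cpi=4.0, text_sim=text_sim)
```

With a blend of this kind, document pairs with little or no citation information are still ranked by their textual similarity, which is the situation in which the hybrid variant is said to help most.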

Performance
The CPA similarity measure builds on the co-citation approach to document similarity and adds proximity analysis, which allows overall document similarity to be calculated at a finer resolution. CPA has been found to outperform co-citation analysis, especially when documents contain extensive bibliographies and when documents have not been frequently cited together (i.e. have a low co-citation score). Liu and Chen found that sentence-level co-citations are potentially more efficient markers for co-citation analysis than loosely coupled article-level co-citations, since sentence-level co-citations tend to preserve the essential structure of the traditional co-citation network while forming a much smaller subset of all co-citation instances.

An analysis by Schwarzer et al. showed that the citation-based measures CPA and co-citation analysis (CoCit) have complementary strengths compared to text-based similarity measures. In a test collection of Wikipedia articles, text-based similarity approaches reliably identified more narrowly similar articles, e.g. articles sharing identical terms, while the CPA approach outperformed CoCit at identifying more broadly related articles, as well as more popular articles, which the authors suggest are likely also of higher quality.