Motivating the Problem
The volume of research published each day makes it difficult for researchers and organizations to identify potentially impactful work in a timely manner. In the summer of 2018, approximately 5,000 new papers were published every day, which makes it nearly impossible for any individual or organization to keep up with the latest research without specialized tools. The PaperRank framework is designed to help tackle this problem by supporting bibliometrics and citation analysis of academic literature graphs.
By analyzing the citation network of a given paper, the framework is able to compute a probability-based trust score, representing the confidence that a given research community places in the claims made in the paper. This allows researchers to quickly and easily identify potentially impactful research, and to assess the fidelity of previously published research based on its position in the citation graph.
Introducing PaperRank
The PaperRank Framework is a tool designed to enable bibliometrics and citation analysis of academic literature graphs. It is built on the PageRank algorithm, which is used to rank node reputation in networks, and is designed to express the probability-based trust that communities place in academic articles by analyzing the article citation network. This allows users to understand the degree to which a scientific community trusts the claims made in a particular paper.
One of the key features of the PaperRank Framework is its extensibility. It is designed to be corpus-agnostic, allowing it to be used with a wide range of academic literature databases. During its development, it was configured for use with the NCBI PubMed database.
In addition to its extensibility, the PaperRank Framework is also highly generalized. It uses a probabilistic model known as the Gamma Mixture Model, which is well-suited to modeling subpopulations. This allows the framework to apply a fitting algorithm over a set of PageRank scores and infer a probability distribution representing the trust that communities place in their research publications.
The Gamma Mixture Model allows the PaperRank Framework to compute trust scalably for the evidence cited within a given knowledge graph. This is useful in many applications, because it lets users assign confidence scores to knowledge sources based on the trust that the broader community places in the articles from which those sources are curated. In particular, it has applications to corpus-enabled expert systems in the medical domain.
Architecture Overview
The core functionality of the framework is divided into three main engines: the update engine, compute engine, and trust engine. The update engine is responsible for crawling and capturing the structure of the citation graph, including the inbound and outbound citations for each publication in the graph. The compute engine then computes a variant of the traditional PageRank score for each publication, known as the PaperRank score. Finally, the trust engine computes a trust score for each publication in the graph, which can be used to assess the trustworthiness of the publication.
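As a rough illustration of what the compute engine works from, the sketch below runs the standard PageRank power iteration over a toy citation graph. It is not the PaperRank variant itself, whose details are not given here; the function name `pagerank`, the damping factor of 0.85, and the toy graph are all assumptions chosen for illustration.

```python
def pagerank(citations, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration (illustrative, not the PaperRank variant).

    `citations` maps each paper id to the list of paper ids it cites.
    """
    # Collect every paper that appears as a citer or as a cited work.
    nodes = set(citations)
    for cited in citations.values():
        nodes.update(cited)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}

    for _ in range(max_iter):
        # Papers with no outbound citations spread their rank uniformly.
        dangling = sum(rank[p] for p in nodes if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in nodes}
        # Every citing paper passes an equal share of its rank to each paper it cites.
        for p in nodes:
            out = citations.get(p, [])
            if out:
                share = damping * rank[p] / len(out)
                for q in out:
                    new_rank[q] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: A and B both cite C, and C cites D.
toy_graph = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_graph)
for paper, score in sorted(scores.items()):
    print(paper, round(score, 4))
```

Rank flows from the citing paper to the cited paper, so a paper accumulates score from the papers that reference it. The scores sum to one, which is what makes them usable as inputs to a later probabilistic model.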
The PageRank algorithm, explained
To improve efficiency, the PaperRank framework is multi-threaded and highly parallelized. It scrapes the citation graph with a greedy bi-directional crawl, following both inbound and outbound citations, which allows for fast scraping speeds. However, this approach can leave isolated publications, those not connected to the rest of the citation graph, unindexed. To address this issue, the framework uses a novel non-parametric mixture model to reduce bias in the final trust scores.
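The way a bi-directional crawl can miss isolated publications is easy to see on a toy graph. The sketch below is a single-threaded breadth-first version written only for illustration; the helper names `invert` and `crawl` and the toy data are assumptions, and the real update engine is multi-threaded and talks to a live corpus.

```python
from collections import deque


def invert(outbound):
    """Build the inbound-citation map from the outbound-citation map."""
    inbound = {paper: [] for paper in outbound}
    for paper, cited in outbound.items():
        for other in cited:
            inbound.setdefault(other, []).append(paper)
    return inbound


def crawl(seed, outbound, inbound):
    """Follow citations in both directions from a seed; return every paper reached."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        paper = queue.popleft()
        # Bi-directional step: papers this one cites, plus papers that cite it.
        neighbours = outbound.get(paper, []) + inbound.get(paper, [])
        for other in neighbours:
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Hypothetical corpus: A cites B, B cites C, and X has no citations at all.
outbound = {"A": ["B"], "B": ["C"], "C": [], "X": []}
inbound = invert(outbound)

reached = crawl("B", outbound, inbound)
isolated = set(outbound) - reached
print(sorted(reached), sorted(isolated))
```

Starting from B, the crawl reaches A through an inbound citation and C through an outbound one, but X is never discovered because no citation edge touches it.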
The mixture model is a one-dimensional gamma mixture fitted to the PageRank scores, which yields a probability distribution representing the trust communities place in their research publications. The model is split into three sub-distributions, corresponding to no trust, moderate trust, and high trust. The initial PageRank scores are then normalized to a desired range of probabilities using this mixture model.
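One plausible way to turn a fitted three-component gamma mixture into a trust score is to compute each component's posterior responsibility for a score and blend per-component trust levels accordingly. The sketch below assumes this approach; the source does not specify the exact normalization, and the fitting step (for example, by expectation-maximization) is omitted. The parameters in `COMPONENTS`, the values in `LEVELS`, and the assumption that scores have been rescaled to a convenient positive range are all hypothetical.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution at x (requires x > 0)."""
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


# Hypothetical fitted parameters: (mixing weight, shape, scale) per component.
COMPONENTS = {
    "no trust": (0.6, 2.0, 0.5),
    "moderate trust": (0.3, 6.0, 0.8),
    "high trust": (0.1, 12.0, 1.0),
}

# Hypothetical probability attached to each component.
LEVELS = {"no trust": 0.0, "moderate trust": 0.5, "high trust": 1.0}


def responsibilities(x):
    """Posterior probability that score x belongs to each component."""
    weighted = {
        name: weight * gamma_pdf(x, shape, scale)
        for name, (weight, shape, scale) in COMPONENTS.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


def trust_score(x):
    """Blend the per-component levels by responsibility; result lies in [0, 1]."""
    resp = responsibilities(x)
    return sum(resp[name] * LEVELS[name] for name in resp)


for rescaled_score in (0.5, 5.0, 15.0):
    print(rescaled_score, round(trust_score(rescaled_score), 3))
```

With these illustrative parameters, a low score falls almost entirely in the "no trust" component and a high score almost entirely in the "high trust" component, so the blended value moves from near 0 toward 1 as the score grows.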
Exploring the Model
The PaperRank model (including previous iterations)
The model can be run on Google Colaboratory. The notebook includes previous iterations of the scoring algorithm that inspired the Gamma Mixture Model, as well as a tool for searching the computed scores by paper or author.
Distribution of Trust Scores output by the GMM-based scoring algorithm
The final distribution of trust scores output by the model illustrates the expected behavior of the algorithm, with the three clusters of papers well separated.
Overall, the PaperRank Framework is a powerful tool for understanding the trust that communities place in academic literature. It has the potential to be a valuable resource for those working with probabilistic knowledge graphs, providing a way to compute confidence scores and assess the trustworthiness of assertions within those graphs.