Patents, Papers, and the invisible threads that connect them

In my PhD thesis, I extracted all the links contained in US patent documents from 2008 to 2018, resulting in more than three million links. Due to time constraints, I was only able to perform an initial analysis, which did not allow for a deeper understanding of where those links were pointing. However, in the endless quest to better understand patents and their role in bibliometrics, I came across this paper by Marx and Fuegi (2020), which compellingly connects academic research with industrial inventions by studying the citations included on the front page of patents.

Marx, M., & Fuegi, A. (2020). Reliance on science: Worldwide front‐page patent citations to scientific articles. Strategic Management Journal, 41(9), 1572-1594.
https://doi.org/10.1002/smj.3145

To what extent do firms rely on basic science in their R&D efforts?

While the importance of science to innovation is well established, empirical work has been limited by data constraints. Patent citations to scientific papers (what the authors call Patent Citations to Science, PCS) provide a potential measure, but until now, such linkages were scattered, inconsistent, or locked behind proprietary paywalls.

Previous studies have used narrow samples (e.g., specific firms or sectors), manual matching, or databases like Scopus or Web of Science that could not be shared publicly. As a result, research on firms’ reliance on science was fragmented, with significant barriers to replication or cross-study comparison.

This study addresses those limitations head-on by building the largest open-access dataset of global PCS, linking over 22 million patent-paper pairs from more than 3 million patents and 4 million scientific articles, spanning 1800-2018.

The Method:

The authors designed a multi-step algorithm to match “unstructured” non-patent references found on the front pages of patents with structured metadata from Microsoft Academic Graph (MAG), a publicly available bibliographic database.
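To make the matching problem more concrete, here is a toy sketch of linking a free-text reference to a structured record. It is not the authors' algorithm: the scoring rule (share of title words found in the reference, halved when the publication year is missing), the 0.8 threshold, the record layout, and the second paper in the list are all assumptions made purely for illustration.

```python
import re

# Hypothetical stand-in for structured bibliographic records.
# The first entry is the Altschul et al. (1990) paper; the second is made up.
PAPERS = [
    {"id": "p1", "title": "Basic local alignment search tool", "year": 1990},
    {"id": "p2", "title": "A made-up paper on sequence alignment", "year": 1985},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(reference, paper):
    """Share of title tokens found in the reference, halved if the year is absent."""
    ref_tokens = set(tokenize(reference))
    title_tokens = tokenize(paper["title"])
    overlap = sum(t in ref_tokens for t in title_tokens) / len(title_tokens)
    year_found = str(paper["year"]) in ref_tokens
    return overlap if year_found else overlap / 2


def best_match(reference, papers, threshold=0.8):
    """Return (paper id, score) for the best candidate, or (None, score) if too weak."""
    best = max(papers, key=lambda p: score(reference, p))
    best_score = score(reference, best)
    return (best["id"] if best_score >= threshold else None, best_score)


reference = (
    "Altschul et al., Basic Local Alignment Search Tool, "
    "J. Mol. Biol., vol. 215, pp. 403-410 (1990)"
)
print(best_match(reference, PAPERS))
```

Even this toy version hints at why real references are hard: a truncated title or a missing year quickly drags the score below any sensible threshold.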

Each patent-paper match is assigned a confidence score (1–10) based on the likelihood of correct identification. Importantly, the dataset also includes whether a citation was added by the patent applicant or the examiner, as well as journal-level metadata (e.g., Impact Factor) to facilitate filtering.
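In practice, these fields are what you would filter on before any analysis. The sketch below shows the idea on a handful of invented links. The field names (`confidence`, `added_by`), the identifiers, and the default cut-off of 8 are assumptions for illustration, not the dataset's actual column names or a recommended threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One patent-paper pair (hypothetical layout)."""
    patent_id: str
    paper_id: str
    confidence: int  # 1 (weakest) to 10 (strongest)
    added_by: str    # "applicant" or "examiner"


# Invented example links, not real data.
LINKS = [
    Link("US-A", "paper-1", 10, "applicant"),
    Link("US-A", "paper-2", 4, "applicant"),
    Link("US-B", "paper-1", 9, "examiner"),
    Link("US-C", "paper-3", 7, "examiner"),
]


def filter_links(links, min_confidence=8, added_by=None):
    """Keep links at or above the confidence cut-off, optionally from one source."""
    return [
        link
        for link in links
        if link.confidence >= min_confidence
        and (added_by is None or link.added_by == added_by)
    ]


high_confidence = filter_links(LINKS)
examiner_only = filter_links(LINKS, added_by="examiner")
print(len(high_confidence), len(examiner_only))
```

Raising the cut-off trades coverage for cleaner links, which is the kind of choice the confidence score is meant to make explicit.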

To evaluate the accuracy of their algorithm, they manually verified thousands of matches. The algorithm achieved over 99% precision for high-confidence linkages and 93% recall, outperforming prior approaches that relied on proprietary sources.
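For readers less familiar with these two metrics, here is a minimal sketch of how precision and recall are computed when algorithmic links are compared against a manually verified set. It uses the standard definitions, not necessarily the exact evaluation protocol of the paper, and the patent and paper identifiers are invented.

```python
def precision_recall(predicted, verified):
    """Compare predicted patent-paper pairs against manually verified ones."""
    predicted, verified = set(predicted), set(verified)
    correct = predicted & verified
    # Precision: share of predicted links that are actually correct.
    precision = len(correct) / len(predicted) if predicted else 0.0
    # Recall: share of true links that the algorithm managed to find.
    recall = len(correct) / len(verified) if verified else 0.0
    return precision, recall


# Invented example: four links proposed by an algorithm, five true links.
predicted = {
    ("US-A", "paper-1"),
    ("US-A", "paper-2"),
    ("US-B", "paper-1"),
    ("US-C", "paper-9"),  # wrong paper, counts against precision
}
verified = {
    ("US-A", "paper-1"),
    ("US-A", "paper-2"),
    ("US-B", "paper-1"),
    ("US-C", "paper-3"),  # missed by the algorithm, counts against recall
    ("US-D", "paper-4"),  # missed by the algorithm, counts against recall
}

precision, recall = precision_recall(predicted, verified)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few false links make it into the dataset, while high recall means few true links are left out.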

The dataset is freely available at relianceonscience.org, with code, documentation, and crosswalks to PubMed and other sources.

Key Findings:

  • Around 17.6% of all USPTO patents since 1947 contain at least one citation to a scientific article, a figure that rose from just 6.7% in 1976 to 25.6% in 2018.
  • Patents assigned to universities have an average of 14 citations to science, compared with 2 for corporate patents and 1.3 for government patents. The difference widened dramatically after the 1990s.
  • Fields like Chemistry and Biotechnology are the most science-intensive, while Mechanical Engineering shows relatively low scientific citation density.
  • Journals most frequently cited by patents include PNAS, Journal of Biological Chemistry, Science, and Nature. This highlights the dominance of the life sciences in science-technology linkages.
  • Only about 1.5% of all scientific papers are cited by patents, with highly cited ones (like Altschul et al., 1990) shaping vast areas of innovation.

Limitations & What’s left to explore:

While the dataset is a significant leap forward, Marx and Fuegi acknowledge several limitations:

  • It focuses only on front-page citations, not those embedded in the body of patents, which might capture different kinds of knowledge flows.
  • Algorithmic matching still struggles with incomplete or inconsistent references (e.g., missing author names or years).
  • Non-U.S. patents display different citation norms, especially since applicants outside the USPTO are not required to disclose prior art.

Nevertheless, the open and transparent nature of the dataset represents a major improvement over closed systems, giving researchers the ability to evaluate and refine future methods.

Conclusions:

Reliance on Science is both a dataset and a manifesto. By releasing a global, reproducible linkage between patents and papers, Marx and Fuegi provide scholars with the tools to revisit fundamental questions about how firms innovate and how science fuels technological change.

The dataset enables studies on:

  • How firms’ scientific dependencies affect R&D strategies.
  • The geographic localisation of academic spillovers.
  • The monetary or social value of science-intensive inventions.

In a way, this paper democratises access to data that were once proprietary, allowing researchers (especially early-career scholars) to engage in large-scale studies of science-industry linkages without licensing barriers.

It is also a reminder that science and innovation are not separate worlds but parts of the same ecosystem. In a time when open science and data transparency are more crucial than ever, this article embodies the spirit of collaborative, cumulative research.

See you in the next paper =)