Neuroscience Literature Database — SQL Analysis

Published:

SQL SQLite Python Pandas Biopython Plotly Data Modeling Bibliometrics PubMed API iCite API
Explore the source code or view the interactive analysis report with all six Plotly figures.

Project overview

PubMed’s E-utilities API provides structured XML for biomedical publications: authors, affiliations, MeSH terms, and abstracts. The NIH iCite API adds citation metrics — citation counts, Relative Citation Ratio (RCR), and NIH percentile rank — for each article. This project fuses both sources into a normalized 6-table SQLite database and answers five analytical questions about how the spinal cord neuroscience literature has evolved since 1970.

Why this project: It combines two skills I want to demonstrate explicitly — relational database design and SQL analytical query writing — in a domain I know well. The domain knowledge lets me ask genuinely interesting questions rather than running canned tutorials. The scale (200k–500k records) also requires real attention to schema indexing and query performance.


Pipeline

The project runs as a five-step Python pipeline:

  1. Fetchentrez_fetch.py pages through the Entrez E-utilities API using broad spinal neuroscience MeSH terms, saving raw XML batches to disk in a resumable fashion (~1–3 hours for 200k–500k records with an API key)
  2. Parseparse_pubmed_xml.py uses lxml to extract article metadata and MeSH terms from each XML batch
  3. Loadload_db.py normalizes parsed records across the six schema tables and bulk-inserts them into SQLite via SQLAlchemy
  4. Enrichicite_fetch.py retrieves citation metrics from the iCite REST API for all PMIDs in the database
  5. Analyzeanalysis.ipynb runs all five SQL queries and renders interactive Plotly figures

Schema

Six normalized tables with foreign-key constraints and covering indexes on frequently joined columns:

TableDescription
publicationsCore article metadata: PMID, title, journal, year, DOI, abstract
authorsDisambiguated author records with first-listed affiliation
authorshipsJoin table: publication × author × position
keywordsNormalized MeSH terms and author keywords
pub_keywordsJoin table: publication × keyword
citation_metricsiCite fields: citation count, RCR, NIH percentile

SQL analyses

Five query files, each targeting a specific bibliometric question:

Rolling publication volume (01_rolling_pub_counts.sql) SUM() OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) smooths year-to-year indexing noise to reveal genuine growth trends per MeSH term. Tracks terms such as “Spinal Cord,” “Interneurons,” “Central Pattern Generators,” and “Optogenetics” from 1970 to 2025.

Year-over-year growth (02_yoy_growth.sql) LAG() window function computes both absolute and percentage year-over-year change for targeted subfields — including CPG and optogenetics terms whose growth inflections are historically interpretable from domain knowledge.

Co-authorship patterns (03_coauthorship.sql) A self-join on authorships (a1.author_id < a2.author_id) produces all unique co-author pairs without duplication, restricted to papers tagged with core spinal MeSH terms. Aggregation adds pair_count and active_decades to characterize long-running vs. one-off collaborations.

Citation influence (04_citation_influence.sql) RANK() OVER (PARTITION BY subfield_term ORDER BY rcr DESC) identifies the most influential papers per subfield by Relative Citation Ratio — a field-normalized metric that adjusts for differences in citation culture across disciplines.

Keyword evolution (05_keyword_evolution.sql) Normalized MeSH term frequency over time reveals emerging topics (e.g. “Optogenetics,” post-2010) and declining ones, providing a data-driven view of how the field’s methodological focus has shifted.


Skills demonstrated

  • Relational schema design: normalization, foreign keys, composite indexes, performance considerations at 200k+ row scale
  • Advanced SQL: JOIN, GROUP BY, window functions (SUM OVER, LAG, RANK OVER), self-joins
  • Python ETL pipeline: paged API ingestion, resumable batch processing, lxml XML parsing, SQLAlchemy bulk loading
  • API integration: Entrez E-utilities (Biopython + requests), NIH iCite REST API
  • Interactive data visualization: Plotly figures embedded in a Jupyter analysis notebook