Neuroscience Literature Database — SQL Analysis
Project overview
PubMed’s E-utilities API provides structured XML for biomedical publications: authors, affiliations, MeSH terms, and abstracts. The NIH iCite API adds citation metrics — citation counts, Relative Citation Ratio (RCR), and NIH percentile rank — for each article. This project fuses both sources into a normalized 6-table SQLite database and answers five analytical questions about how the spinal cord neuroscience literature has evolved since 1970.
Why this project: It combines two skills I want to demonstrate explicitly — relational database design and SQL analytical query writing — in a domain I know well. The domain knowledge lets me ask genuinely interesting questions rather than running canned tutorials. The scale (200k–500k records) also demands real attention to indexing and query performance.
Pipeline
The project runs as a five-step Python pipeline:
- Fetch — `entrez_fetch.py` pages through the Entrez E-utilities API using broad spinal neuroscience MeSH terms, saving raw XML batches to disk in a resumable fashion (~1–3 hours for 200k–500k records with an API key)
- Parse — `parse_pubmed_xml.py` uses `lxml` to extract article metadata and MeSH terms from each XML batch
- Load — `load_db.py` normalizes parsed records across the six schema tables and bulk-inserts them into SQLite via SQLAlchemy
- Enrich — `icite_fetch.py` retrieves citation metrics from the iCite REST API for all PMIDs in the database
- Analyze — `analysis.ipynb` runs all five SQL queries and renders interactive Plotly figures
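The "resumable" part of the fetch step comes down to filename bookkeeping: on restart, skip any batch already on disk and resume at the next offset. A minimal sketch, assuming batches are saved as `batch_<retstart>.xml`; the filename pattern, `BATCH_SIZE`, and `next_retstart` are illustrative, not the actual internals of `entrez_fetch.py`:

```python
import tempfile
from pathlib import Path

BATCH_SIZE = 500  # illustrative; E-utilities allows larger retmax values

def next_retstart(out_dir: Path, batch_size: int = BATCH_SIZE) -> int:
    """Return the retstart offset of the first batch not yet saved to disk."""
    done = {int(p.stem.split("_")[1]) for p in out_dir.glob("batch_*.xml")}
    offset = 0
    while offset in done:
        offset += batch_size
    return offset

# Simulate a partially completed run: two batches already fetched.
out = Path(tempfile.mkdtemp())
(out / "batch_0.xml").touch()
(out / "batch_500.xml").touch()
resume_at = next_retstart(out)  # resumes at offset 1000
```

Each fetch loop iteration would then request `retstart=resume_at`, write the batch file, and recompute the offset, so an interrupted run never re-downloads completed batches.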
Schema
Six normalized tables with foreign-key constraints and covering indexes on frequently joined columns:
| Table | Description |
|---|---|
| `publications` | Core article metadata: PMID, title, journal, year, DOI, abstract |
| `authors` | Disambiguated author records with first-listed affiliation |
| `authorships` | Join table: publication × author × position |
| `keywords` | Normalized MeSH terms and author keywords |
| `pub_keywords` | Join table: publication × keyword |
| `citation_metrics` | iCite fields: citation count, RCR, NIH percentile |
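To make the foreign-key and index choices concrete, here is a sketch of two of the six tables as raw DDL run through the stdlib `sqlite3` module. The project itself defines the schema through SQLAlchemy, and the exact column set and index name here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = ON;

    CREATE TABLE publications (
        pmid     INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        journal  TEXT,
        year     INTEGER,
        doi      TEXT,
        abstract TEXT
    );

    -- One-to-one enrichment table keyed on the same PMID.
    CREATE TABLE citation_metrics (
        pmid           INTEGER PRIMARY KEY REFERENCES publications(pmid),
        citation_count INTEGER,
        rcr            REAL,
        nih_percentile REAL
    );

    -- Covering index for the year-based aggregations used in the analyses.
    CREATE INDEX idx_publications_year ON publications(year, pmid);
""")
```

Keeping `citation_metrics` separate (rather than denormalizing the iCite fields into `publications`) lets the enrichment step run independently of the load step and re-run as iCite updates its metrics.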
SQL analyses
Five query files, each targeting a specific bibliometric question:
Rolling publication volume (`01_rolling_pub_counts.sql`): `SUM() OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)` smooths year-to-year indexing noise to reveal genuine growth trends per MeSH term. Tracks terms such as “Spinal Cord,” “Interneurons,” “Central Pattern Generators,” and “Optogenetics” from 1970 to 2025.
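The rolling-window technique looks like this against a toy table (table and column names here are illustrative, not the project schema, and the query is a simplified stand-in for the real file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_counts (term TEXT, year INTEGER, n INTEGER)")
conn.executemany(
    "INSERT INTO yearly_counts VALUES (?, ?, ?)",
    [("Optogenetics", 2010, 10), ("Optogenetics", 2011, 20),
     ("Optogenetics", 2012, 30), ("Optogenetics", 2013, 40),
     ("Optogenetics", 2014, 50)],
)

# 3-year trailing sum per term: each row sums itself plus the two prior years.
rows = conn.execute("""
    SELECT term, year,
           SUM(n) OVER (
               PARTITION BY term ORDER BY year
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_3yr
    FROM yearly_counts
    ORDER BY term, year
""").fetchall()
# rolling_3yr: 10, 30, 60, 90, 120
```

Note that SQLite only gained window functions in 3.25, so this needs a reasonably recent Python build.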
Year-over-year growth (`02_yoy_growth.sql`): The `LAG()` window function computes both absolute and percentage year-over-year change for targeted subfields — including CPG and optogenetics terms whose growth inflections are historically interpretable from domain knowledge.
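A condensed sketch of the `LAG()` pattern on a toy table (names illustrative; the real query works per subfield term):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_counts (year INTEGER, n INTEGER)")
conn.executemany("INSERT INTO yearly_counts VALUES (?, ?)",
                 [(2019, 100), (2020, 150), (2021, 120)])

# LAG(n) pulls the previous year's count onto the current row,
# giving absolute and percentage change in one pass.
rows = conn.execute("""
    SELECT year,
           n - LAG(n) OVER (ORDER BY year) AS abs_change,
           ROUND(100.0 * (n - LAG(n) OVER (ORDER BY year))
                 / LAG(n) OVER (ORDER BY year), 1) AS pct_change
    FROM yearly_counts
    ORDER BY year
""").fetchall()
# (2019, None, None), (2020, 50, 50.0), (2021, -30, -20.0)
```

The first year has no predecessor, so `LAG()` yields NULL there rather than a spurious zero.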
Co-authorship patterns (`03_coauthorship.sql`): A self-join on `authorships` (`a1.author_id < a2.author_id`) produces all unique co-author pairs without duplication, restricted to papers tagged with core spinal MeSH terms. Aggregation adds `pair_count` and `active_decades` to characterize long-running vs. one-off collaborations.
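The self-join trick, isolated on a toy `authorships` table (the MeSH restriction and `active_decades` aggregation are omitted here for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authorships (pub_id INTEGER, author_id INTEGER)")
# Pub 1 has authors 1, 2, 3; pub 2 has authors 1, 2.
conn.executemany("INSERT INTO authorships VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)])

# a1.author_id < a2.author_id keeps exactly one row per unordered pair,
# avoiding both (x, y)/(y, x) duplicates and (x, x) self-pairs.
pairs = conn.execute("""
    SELECT a1.author_id, a2.author_id, COUNT(*) AS pair_count
    FROM authorships a1
    JOIN authorships a2
      ON a1.pub_id = a2.pub_id AND a1.author_id < a2.author_id
    GROUP BY a1.author_id, a2.author_id
    ORDER BY pair_count DESC
""").fetchall()
# Pair (1, 2) co-authored two papers; (1, 3) and (2, 3) one each.
```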
Citation influence (`04_citation_influence.sql`): `RANK() OVER (PARTITION BY subfield_term ORDER BY rcr DESC)` identifies the most influential papers per subfield by Relative Citation Ratio — a field-normalized metric that adjusts for differences in citation culture across disciplines.
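A minimal sketch of the partitioned ranking, using an illustrative `pub_rcr` table (the real query joins `citation_metrics` through `pub_keywords`; RCR values below are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub_rcr (subfield_term TEXT, pmid INTEGER, rcr REAL)")
conn.executemany("INSERT INTO pub_rcr VALUES (?, ?, ?)",
                 [("CPG", 101, 4.2), ("CPG", 102, 9.7),
                  ("Optogenetics", 201, 12.5), ("Optogenetics", 202, 3.1)])

# RANK() restarts at 1 within each subfield partition; filtering on rnk = 1
# keeps the top paper per subfield (ties would all rank 1).
top = conn.execute("""
    SELECT * FROM (
        SELECT subfield_term, pmid,
               RANK() OVER (PARTITION BY subfield_term
                            ORDER BY rcr DESC) AS rnk
        FROM pub_rcr
    )
    WHERE rnk = 1
    ORDER BY subfield_term
""").fetchall()
# One top-ranked PMID per subfield.
```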
Keyword evolution (`05_keyword_evolution.sql`): Normalized MeSH term frequency over time reveals emerging topics (e.g. “Optogenetics,” post-2010) and declining ones, providing a data-driven view of how the field’s methodological focus has shifted.
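"Normalized" here means dividing each term's yearly count by that year's total, so growth in overall indexing volume doesn't masquerade as topical growth. A toy version of that normalization, with an illustrative `pub_terms` table standing in for the `pub_keywords`/`keywords` join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub_terms (term TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO pub_terms VALUES (?, ?)",
    [("Spinal Cord", 2012)] * 3 + [("Optogenetics", 2012)]
    + [("Spinal Cord", 2020)] * 2 + [("Optogenetics", 2020)] * 2,
)

# Per-year totals via a CTE, then each term's share of that year's taggings.
rows = conn.execute("""
    WITH totals AS (
        SELECT year, COUNT(*) AS total FROM pub_terms GROUP BY year
    )
    SELECT p.term, p.year, ROUND(1.0 * COUNT(*) / t.total, 2) AS freq
    FROM pub_terms p
    JOIN totals t ON t.year = p.year
    GROUP BY p.term, p.year
    ORDER BY p.term, p.year
""").fetchall()
# "Optogenetics" rises from 25% to 50% of taggings between the two years.
```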
Skills demonstrated
- Relational schema design: normalization, foreign keys, composite indexes, performance considerations at 200k+ row scale
- Advanced SQL: `JOIN`, `GROUP BY`, window functions (`SUM OVER`, `LAG`, `RANK OVER`), self-joins
- Python ETL pipeline: paged API ingestion, resumable batch processing, `lxml` XML parsing, SQLAlchemy bulk loading
- API integration: Entrez E-utilities (Biopython + `requests`), NIH iCite REST API
- Interactive data visualization: Plotly figures embedded in a Jupyter analysis notebook
