Neuroscience Literature Database — SQL Analysis

Published: February 28, 2026

SQL SQLite Python Pandas Biopython Plotly Data Modeling Bibliometrics PubMed API iCite API

Explore the source code or view the interactive analysis report with all six Plotly figures.

Project overview

PubMed’s E-utilities API provides structured XML for biomedical publications: authors, affiliations, MeSH terms, and abstracts. The NIH iCite API adds citation metrics — citation counts, Relative Citation Ratio (RCR), and NIH percentile rank — for each article. This project fuses both sources into a normalized 6-table SQLite database and answers five analytical questions about how the spinal cord neuroscience literature has evolved since 1970.

Why this project: It combines two skills I want to demonstrate explicitly — relational database design and SQL analytical query writing — in a domain I know well. The domain knowledge lets me ask genuinely interesting questions rather than running canned tutorials. The scale (200k–500k records) also requires real attention to schema indexing and query performance.

Pipeline

The project runs as a five-step Python pipeline:

Fetch — entrez_fetch.py pages through the Entrez E-utilities API using broad spinal neuroscience MeSH terms, saving raw XML batches to disk in a resumable fashion (~1–3 hours for 200k–500k records with an API key)
Parse — parse_pubmed_xml.py uses lxml to extract article metadata and MeSH terms from each XML batch
Load — load_db.py normalizes parsed records across the six schema tables and bulk-inserts them into SQLite via SQLAlchemy
Enrich — icite_fetch.py retrieves citation metrics from the iCite REST API for all PMIDs in the database
Analyze — analysis.ipynb runs all five SQL queries and renders interactive Plotly figures

Schema

Six normalized tables with foreign-key constraints and covering indexes on frequently joined columns:

Table	Description
`publications`	Core article metadata: PMID, title, journal, year, DOI, abstract
`authors`	Disambiguated author records with first-listed affiliation
`authorships`	Join table: publication × author × position
`keywords`	Normalized MeSH terms and author keywords
`pub_keywords`	Join table: publication × keyword
`citation_metrics`	iCite fields: citation count, RCR, NIH percentile

SQL analyses

Five query files, each targeting a specific bibliometric question:

Rolling publication volume (01_rolling_pub_counts.sql) SUM() OVER (ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) smooths year-to-year indexing noise to reveal genuine growth trends per MeSH term. Tracks terms such as “Spinal Cord,” “Interneurons,” “Central Pattern Generators,” and “Optogenetics” from 1970 to 2025.

Year-over-year growth (02_yoy_growth.sql) LAG() window function computes both absolute and percentage year-over-year change for targeted subfields — including CPG and optogenetics terms whose growth inflections are historically interpretable from domain knowledge.

Co-authorship patterns (03_coauthorship.sql) A self-join on authorships (a1.author_id < a2.author_id) produces all unique co-author pairs without duplication, restricted to papers tagged with core spinal MeSH terms. Aggregation adds pair_count and active_decades to characterize long-running vs. one-off collaborations.

Citation influence (04_citation_influence.sql) RANK() OVER (PARTITION BY subfield_term ORDER BY rcr DESC) identifies the most influential papers per subfield by Relative Citation Ratio — a field-normalized metric that adjusts for differences in citation culture across disciplines.

Keyword evolution (05_keyword_evolution.sql) Normalized MeSH term frequency over time reveals emerging topics (e.g. “Optogenetics,” post-2010) and declining ones, providing a data-driven view of how the field’s methodological focus has shifted.

Skills demonstrated

Relational schema design: normalization, foreign keys, composite indexes, performance considerations at 200k+ row scale
Advanced SQL: JOIN, GROUP BY, window functions (SUM OVER, LAG, RANK OVER), self-joins
Python ETL pipeline: paged API ingestion, resumable batch processing, lxml XML parsing, SQLAlchemy bulk loading
API integration: Entrez E-utilities (Biopython + requests), NIH iCite REST API
Interactive data visualization: Plotly figures embedded in a Jupyter analysis notebook

Share on

Bluesky Facebook LinkedIn X (formerly Twitter)

Andrew Worthy