Lessons from Building a 100 GB Data Analysis Pipeline in Python
Very early in my PhD, I realized that an Excel spreadsheet wasn’t going to cut it.
The experiment in question produced 5–10 GB of binary electrophysiology recordings per session, spread across dozens of files, each containing millisecond-resolution time-series data from multiple recording channels. The existing analysis workflow — load a file in MATLAB, scroll through it manually, copy peak values to Excel, repeat — would have taken 2+ hours per session and would not have been reproducible from one person to the next.
So I built MonStim Analyzer: a full-featured PyQt6 desktop application for automated EMG data analysis. Here’s what I learned along the way.
1. You can’t load 100 GB into RAM — design your I/O layer first
The first constraint I hit was memory. A naive approach (load everything, then process) immediately fails when your data exceeds available RAM. The solution I ended up with was lazy loading with structured caching: on startup, the application indexes all session files and builds a metadata catalog; individual recordings are loaded from disk on demand and cached in a Session → Channel → Sweep hierarchy. Data that hasn’t been accessed recently is evicted from memory to keep the footprint manageable.
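The load-on-demand-with-eviction idea can be sketched in a few lines. This is a minimal illustration, not the actual MonStim Analyzer code; the `LazyRecordingCache` name and the `loader` callable are hypothetical stand-ins for the real catalog and file readers:

```python
from collections import OrderedDict

class LazyRecordingCache:
    """Load recordings from disk on demand; evict the least recently used."""

    def __init__(self, loader, max_entries=8):
        self._loader = loader        # callable: file path -> recording data
        self._cache = OrderedDict()  # path -> data, ordered oldest to newest
        self._max = max_entries

    def get(self, path):
        if path in self._cache:
            self._cache.move_to_end(path)    # mark as recently used
            return self._cache[path]
        data = self._loader(path)            # hit the disk only now
        self._cache[path] = data
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least recently used
        return data
```

The same pattern scales to the Session → Channel → Sweep hierarchy by nesting one cache per level, so a hot channel stays resident while cold sessions fall out of memory.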
The lesson generalizes directly to production data engineering: design your I/O layer first, and never let your processing logic assume all data fits in memory. In a world of big data, that’s just not a safe assumption.
2. GUIs aren’t optional when your users aren’t programmers
My users are electrophysiologists. They are excellent scientists. They do not want to edit a Python script to change an axis limit.
Building a real GUI — with a Qt interface, interactive waveform visualization, and point-and-click peak editing — was more work than a command-line tool, but it was the right call. A tool that sits in a GitHub repo and requires conda activate to run will not be used daily by most wet-lab researchers. It’s hard enough to get them to go to a GitHub repo to download an executable, let alone open a terminal and run a script for the first time.
If you’re building data tools for people who aren’t engineers, invest in a good interface.
3. Reproducibility requires deliberate choices, not just documentation
Every analysis decision made through the GUI — peak overrides, channel exclusions, baseline correction parameters — is serialized and stored in the session file alongside the raw data. Reloading a session in three years produces exactly the same results, regardless of who ran it originally.
This didn’t happen automatically. It required designing a data model where every user action has a representation in a serializable structure, and where loading that structure fully restores application state. If you’re building tools where reproducibility matters, design for it from the start because retrofitting is painful.
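The core of that data model is representing each user action as plain data rather than transient UI state. A minimal sketch of the idea, assuming a JSON sidecar format (the `PeakOverride` fields here are illustrative, not the application's real schema):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class PeakOverride:
    """One user decision, expressed as serializable data."""
    channel: str
    sweep: int
    latency_ms: float

def save_session(overrides, path):
    # Every GUI action lands in this list, so the file is the full history.
    with open(path, "w") as f:
        json.dump([asdict(o) for o in overrides], f, indent=2)

def load_session(path):
    # Replaying these records restores the exact analysis state.
    with open(path) as f:
        return [PeakOverride(**d) for d in json.load(f)]
```

Because loading is just deserialization plus replay, "same file in, same results out" holds no matter who opens the session or when.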
4. Measure the right thing — human time is often the bottleneck
The analysis workflow dropped from ~2 hours to under 25 minutes per session. Most of that gain came not from algorithmic speedups but from removing repetitive, attention-heavy steps: automated file loading, consistent report generation, and a clearer UI that reduced manual decision time. Instrument your workflow and time the human-in-the-loop steps because those are often the cheapest and highest-impact things to improve.
Even when computation is expensive, profiling where time is actually spent will guide better trade-offs than guessing where the bottleneck lies.
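One lightweight way to do that instrumentation is a timing context manager wrapped around each workflow step, human or computational. A sketch (the `StepTimer` class is illustrative, not part of MonStim Analyzer):

```python
import time
from contextlib import contextmanager

class StepTimer:
    """Accumulate wall-clock time per named workflow step."""

    def __init__(self):
        self.durations = {}  # step name -> total seconds

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed
```

Wrapping `with timer.step("manual peak review"):` around the interactive parts makes the human-time cost visible next to the compute cost, which is exactly the comparison that tells you where to invest.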
Conclusion
Building tools for large, messy datasets requires deliberate engineering choices: design an I/O layer that doesn’t assume infinite RAM, invest in a usable interface for non-programmers, serialize every user action for reproducibility, and measure the true bottlenecks — often the human ones. Do those things, and you’ll build tools people use, trust, and return to.
The full source code for MonStim Analyzer is on GitHub. The companion figure-generation application is MonStim Plotter.
