James Routley

"What I cannot create, I do not understand."

In my last post, I talked about what molecular dynamics (MD) is and why it’s such a powerful tool for studying biological systems. Now I want to share the actual research project I’ve been working on for the past few months!

Spoiler alert: It’s been a journey of false starts, frustrations, joys, careful literature and data reading, some not so careful reading, unexpected successes, and ultimately, some hard decisions.

The Question: What does the Unbound TFAM protein actually look like?

After a decade away from the lab bench, this little protein still fascinates me. TFAM (Mitochondrial Transcription Factor A) is essential for mitochondrial DNA transcription and packaging. We have gorgeous crystal structures showing how TFAM looks when it’s bound to DNA, dramatically bending it into U-turns like some molecular strongman. Buuuuut we know surprisingly little about what it looks like when it’s just floating around in solution, unbound, chillin and doing its own thing. I was convinced that understanding the unbound state was critical for figuring out how TFAM recognizes and binds to DNA in the first place. It’s like trying to understand how a lock and key work when you’ve only ever seen the key already in the lock.

TFAM has this interesting architecture: two DNA-binding domains (called HMG-boxes) connected by a 29-residue linker domain. In the DNA-bound crystal structures, this linker forms a stable alpha-helix(which is shaped like a corkscrew).

Representation of TFAM structure and regions

Figure 1. Overview of TFAM Protein Residues and Domain Structure. This shows the domain structure of TFAM. The first 42 residues of the protein make up the mitochondrial targeting sequence and are removed once the TFAM is imported into the mitochondrial matrix.

But what happens to that linker when there’s no DNA around to stabilize it? Does it stay helical? Does the whole protein compact down or spread out? All of these were questions I was looking to figure out.

From the scientific literature, I had a benchmark. Back in 2011, Rubio-Cosials and colleagues published SAXS (small-angle X-ray scattering) measurements of unbound TFAM in solution. Their data showed a radius of gyration (Rg) [A/N: think of this as how tightly packed or spread out a protein is] of 3.2 nm, which is pretty large for a 24 kDa protein. Even more interesting, their ensemble analysis revealed that unbound TFAM samples a heterogeneous range of conformations, with Rg values spanning from 2.5 to 5.0 nm. There were both compact and extended populations coexisting in solution.

Could I reproduce this with MD simulations? Could I get atomic-resolution insight into what these different conformational states actually look like? It was time to find out!

Step 1: Validate your assumptions and benchmark

Before tackling the unbound state, I needed to validate my computational setup. This is like calibrating your pipettes before starting a wet-lab experiment. If your most basic tools aren’t working right, nothing else will matter. I took a crystal structure of TFAM bound to DNA (specifically, the LSP promoter) and ran a 300 ns MD simulation. The test was simple: would the linker stay helical like it does in the crystal structure, or would it spontaneously unfold? Ideally, I would be able to reproduce the helically folded structures reported in other papers.

Ten days later, the results were ready. The linker stayed perfectly helical for the entire 300 ns! The structure was stable, settling into a nice equilibrated state around 2.0-2.2 nm RMSD from the crystal structure. No weird artifacts, no spontaneous unfolding. This was actually more reassuring than it might sound. I’d read a 2018 paper where Rubio-Cosials showed that certain TFAM mutants spontaneously unfold their linker helix in about 160 ns. When my wild-type simulation didn’t show this, I briefly panicked, was my force field broken? Then I realized: wild-type TFAM, what I was using in my simulation, should maintain its helix when DNA-bound, unlike those mutants. My simulation was doing exactly what it should. Heck yea!

Step 2: Into the Unknown: Simulating Unbound TFAM

Feeling confident, I moved on to the main event. I took two different DNA-bound TFAM crystal structures (at the LSP and HSP promoters respectively, which are sites that DNA transcription begins from), deleted the DNA, and let TFAM relax in solution using the AMBER99SB-ILDN force field with TIP3P water, gradually reducing the restraints on it (to prevent artifacts). This is a well-established, well-tested combination that’s been used successfully for tons of protein simulations. I ran 500 ns from each starting structure, giving me 1.0 μs of total aggregate sampling. Then I sat back and waited for the results to roll in, expecting to see that beautiful 3.2 nm Rg value.

What I got instead:

The LSP trajectory: Rg peaked at 2.55 nm (range 2.4-2.8 nm)
The HSP trajectory: Rg peaked at 2.35 nm (range 2.2-2.7 nm)

I stared at these numbers for a long time. The experimental value was 3.2 nm. I was getting 2.35-2.55 nm. I was underestimating by 20-27%. Had I messed up the simulation setup? Was the force field wrong for this system? Did I make some rookie mistake in my analysis? I went through everything with a fine-toothed comb. The simulations looked fine, the HMG boxes stayed nicely folded, the protein wasn’t doing anything obviously crazy. But those Rg values were stubbornly low.

The Turning Point: Closer reading of the literature

I went back to the Rubio-Cosials 2011 paper and really, really poured over the paper. When I covered the SAXS data, including supplementary figures showing the full ensemble analysis, that’s when I saw it: their results weren’t describing a single conformation, but an ensemble average value (with a bimodal distribution to be precise)! There was a compact population centered around 2.8-3.2 nm and an extended population spanning from 3.5-5.0 nm. The 3.2 nm value was the weighted average across both populations! This helped me really understand that TFAM was a flexible, dynamic protein when unbound given those ranges.

My LSP simulation that gave Rg ~2.55 nm wasn’t completely wrong. It was sampling the lower end of the compact population range. The HSP simulation at 2.35 nm was showing too much collapse (a real artifact),The problem wasn’t that the force field was trash, but that it was the wrong one for the job. It turns out that the AMBER99SB-ILDN force field has an intrinsic bias toward compact, helical states, meaning that it wasn’t allowing my protein to explore the conformational space. The linker region was the issue, staying about 40-50% helical when it should probably be more disordered (<30% helical). This was keeping the protein more compact than it should be. There was still hope! I just needed to find a better force field and/or water model combination. Ideally, one that was specifically designed for proteins with intrinsically disordered regions.

Step 3: Bringing Out the Big Guns, ff14SB + TIP4P-D

After diving into the literature, I settled on trying out ff14SB force field in combination with TIP4P-D water model. The ff14SB force field offered improved backbone dihedral parameters that better capture helix-coil equilibria. TIP4P-D (the “D” stands for dispersion) is a four-site water model that was specifically parameterized for intrinsically disordered proteins. It strengthens protein-water interactions relative to protein-protein interactions, which helps to prevent artificial hydrophobic collapse. Multiple papers showed this combination worked well for partially disordered systems, producing more extended conformations and better agreement with experimental Rg values for IDPs (intrinsically disordered proteins).

There was a catch though. There’s always a catch. TIP4P-D is computationally more expensive than TIP3P. That fourth site (a virtual site on the water molecule) means more calculations per time step. Every simulation would take longer and cost more. Still…if it could capture the full conformational ensemble, it would be worth it!

I set up two test simulations using snapshots from my original trajectories of 200ns each (holding my breath as I submitted the jobs):

Start from a snapshot of he unbound TFAM in it’s most extended state (the highest Rg from the LSP trajectory, ~2.7-2.8 nm) and see if the ff14SB+TIP4P-D combo could maintain or extend it further
Start from a collapsed frame (from the HSP trajectory, ~2.4 nm) and see if the new setup could escape the collapsed state

The results

Test 1 was beautiful! Starting from that already-extended conformation (~2.7 nm), the simulation not only maintained it but actually expanded further. Over 200 ns, the Rg increased to around 2.9-3.1 nm, right in the experimental range for the compact population! The linker showed reduced helical content too, dropping to about 30-35%, much more reasonable for a partially disordered region. Test 2 was less dramatic but still promising. The collapsed state (~2.4 nm) didn’t stay collapsed, it expanded to around 2.6-2.8 nm. Not fully recovering to the experimental range, but definitely an improvement over the original force field.

I pulled up the trajectories in PyMOL and watched them play back. The protein looked alive in a way it hadn’t before. The linker was flexing, the two HMG boxes were moving somewhat independently while staying tethered, and the overall conformational sampling looked much more dynamic and realistic. This was it! This was the force field combination I needed. With ff14SB + TIP4P-D, I could probably capture the full experimental ensemble if I:

Used multiple starting structures (some compact, some extended)
Generated additional extended conformations using elastic network modeling (ClustENM)
Ran 4-5 trajectories of 500 ns each for robust sampling and statistics
Pooled everything together to get the full heterogeneous ensemble

I had sketched out a complete plan. It could work. I was confident it would work. The only question was cost. Simulations aren’t free. Every nanosecond of simulation time costs real money in cloud compute. TIP4P-D which runs slower than TIP3P due to those extra calculations would expand the costs. I decided that this was the end of the road for me on this project. Unfortunately, I couldn’t afford to sink more money into this.

What I took away from all this

Even though I didn’t reach the “finish line”, I learned an enormous amount!

Force field selection really, really matters! The same protein simulated with two different force fields gave very different results. There’s no “one size fits all” solution, you need to match the force field to your system and your question.
Reading experimental data carefully is crucial. I almost gave up on this entire project because I didn’t initially understand that the experimental Rg was an ensemble average, not a single conformation. - Computational costs add up quickly. Running sufficient sampling on a realistic system can cost hundreds or thousands of dollars. Partial results are still valuable.

I had characterized the compact population reasonably well and reproduced the bound state from the literature. I validated that an updated and improved force field and water model (ff14SB + TIP4P-D) could indeed capture more extended conformations. I developed a complete strategy for moving forward. Future me (or someone else) can pick this up and run with it. Most importantly, I proved to myself that I can tackle real research questions in computational biology. The cellular life simulation was fun, but this was actual science, hypothesis-driven, data-validated, publishable-quality work (well, almost publishable). It was messy and complicated and expensive and deeply satisfying. I wrote up a complete research proposal that captures my work thus far and extends it beyond just the unbound ensemble. The full vision I have includes using weighted ensemble simulations (WESTPA) to study TFAM-DNA binding and unbinding pathways, processes that happen on millisecond to second timescales, well beyond what conventional MD can reach. Email me or reach out if you want to collaborate and geek out together on this sort of thing!

⁂

I’m not going to lie, it smarts to pause this project when I was making progress. Those 200 ns test runs with ff14SB + TIP4P-D were working. I could almost taste the inevitable breakthrough. I’m learning to be okay with incomplete projects, to be alright with unfinished. Not everything needs to be finished to be valuable or to teach us something. I learned a ton, I contributed some careful thinking to a hard problem, and I documented everything thoroughly so that future work (by me or others) can build on it. The questions about TFAM’s unbound ensemble will still be there when I’m ready to tackle them. That’ll happen with a graduate program backing me or maybe if I win the lottery. Maybe there’s a small independent research grant award I can shoot for.

For now though, I’m taking everything I learned and applying it to new endeavors. Following your volitional muscles doesn’t always mean succeeding at what you’re working on, sometimes it means knowing when to pause, consolidate what you’ve learned, and look for the next opportunity.

If you’re interested in the nitty-gritty technical details, all the simulation parameters, analysis scripts, decision frameworks, and contingency plans I developed feel free to reach out. I’ve documented everything thoroughly as I could. Feel free to reach out if you want to chat about molecular dynamics, TFAM, force field selection, or the realities of doing computational research on a shoestring budget. Thanks for following along with this journey.

Science is messy, expensive, and full of hard decisions, but it’s also incredibly rewarding. Even the incomplete projects teach us something valuable. Until next time, stay curious!