Case Study · 01

Turbofan Predictive Maintenance.

A dual-head Transformer that predicts jet engine anomalies and Remaining Useful Life. My first model collapsed completely — so I diagnosed why, rebuilt the data it learned from, and tuned the operating point to the real cost of a missed failure.

Stack: Python · PyTorch · Gemini · Flask
Period: 2025–2026
Status: v0.2.1 · live dashboard

Context

The original goal was a NASA-CMAPSS-inspired predictive maintenance pipeline: synthetic sensor data for 500 jet engines, a Transformer model predicting both anomaly state and Remaining Useful Life, and a live diagnostic dashboard. End-to-end, dual-head, multi-task — the architecture that everyone reaches for in time-series prognostics.

The early model didn't work. Both heads collapsed: the anomaly head predicted "normal" on every window, the RUL head predicted the training mean. Standard remedies — focal loss, class weights, joint-loss rebalancing — didn't move the metrics.

What started as a modeling project became something more useful: a lesson in telling when a dataset has nothing for a model to learn from — and then rebuilding it so it does. The diagnosis was the hard part. The fix, once I understood the real problem, was deliberate engineering.

The diagnostic chain

Going upstream when results don't match expectations.

The first instinct after a model collapse is to tune the loss function, so that's what I did: I implemented focal loss for the anomaly head, expecting class imbalance to be the culprit. The first training run with it produced the same all-zero precision and recall.

Rather than tune gamma, I went upstream — to the labels themselves.

Bug 1 — labels that never fired

The simulator labeled a window as anomalous only if sensor values crossed fixed thresholds:

# Original labeling logic
anomaly = 1 if vibration > 0.05 or T50 > 1420 else 0

But the degradation parameters in the same simulator produced vibration baselines around 0.010–0.015 and worst-case increases of ~0.003 — never reaching 0.05. T50 baseline was ~1400 with +1°C degradation at most — never reaching 1420. Of five failure modes, only Foreign Object Damage ever produced positive labels (a single abrupt spike). Every other mode produced zero positives across its entire degradation window.

Effective class imbalance: 500:1, not the 100:1 I'd designed around.
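A reachability check along these lines would have caught the problem before any training. This is a minimal sketch using the magnitudes quoted above; `can_fire` is a hypothetical helper, not part of the original simulator, and it covers only gradual degradation (not abrupt spikes like Foreign Object Damage) and ignores sensor noise.

```python
# Reachability check using the magnitudes quoted above.
# `can_fire` is a hypothetical helper, not part of the original simulator.
# It ignores sensor noise, which would need its own margin.

def can_fire(baseline_max: float, max_increase: float, threshold: float) -> bool:
    """Return True if the worst-case degraded value can exceed the threshold."""
    return baseline_max + max_increase > threshold

checks = {
    "vibration": can_fire(baseline_max=0.015, max_increase=0.003, threshold=0.05),
    "T50": can_fire(baseline_max=1400.0, max_increase=1.0, threshold=1420.0),
}

for sensor, reachable in checks.items():
    print(f"{sensor}: label threshold reachable = {reachable}")
```

Both checks come back False: with these degradation parameters, neither threshold can be crossed by gradual wear.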

Fixed it with the CMAPSS-convention labeling — anomaly defined by RUL proximity, not arbitrary sensor thresholds:

# Fixed
anomaly = 1 if rul <= 30 else 0

Class weights recomputed to a real 20% positive base rate. Retrained. The model still collapsed.
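As a rough sketch of that step, the snippet below labels one run-to-failure trajectory by RUL proximity and derives class weights from the resulting base rate. The inverse-frequency weighting formula, the helper names, and the 155-cycle engine are assumptions for illustration; the source only states the rul <= 30 rule and the 20% base rate.

```python
# RUL-proximity labels for one engine, plus inverse-frequency class weights.
# Assumes RUL counts down to 0 on the engine's final cycle.

RUL_THRESHOLD = 30  # cycles, from the labeling rule above

def label_engine(n_cycles: int, threshold: int = RUL_THRESHOLD) -> list[int]:
    """Label each cycle 1 if its RUL is at or below the threshold, else 0."""
    ruls = [n_cycles - 1 - t for t in range(n_cycles)]
    return [1 if rul <= threshold else 0 for rul in ruls]

def class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Raises ZeroDivisionError if a class is absent, which is itself a useful alarm.
    """
    n = len(labels)
    counts = {c: labels.count(c) for c in (0, 1)}
    return {c: n / (2 * counts[c]) for c in counts}

# Hypothetical 155-cycle engine: 31 of 155 cycles have RUL <= 30, a 20% base rate.
labels = label_engine(n_cycles=155)
weights = class_weights(labels)
print(sum(labels) / len(labels), weights)
```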

Bug 2 — signal below the noise floor

Accuracy of 0.7954 on a 20% positive base rate is the model predicting "normal" on every window. The confusion matrix showed zero true positives across 14,662 test windows.

The model classified every anomaly window as normal. The accuracy looks reasonable only because the class imbalance carries it: with a 20% positive base rate, always predicting "normal" is right about 80% of the time.
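The arithmetic is easy to reproduce. The sketch below scores a constant "normal" predictor on a synthetic label vector built only to match a 20% positive base rate; it is not the project's test set, and `binary_metrics` is a hypothetical helper.

```python
# Metrics for a classifier that always predicts "normal" (class 0).
# The 1,000-window label vector is synthetic, built to match a 20% base rate.

def binary_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Accuracy, precision, and recall for the positive (anomaly) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1] * 200 + [0] * 800
y_pred = [0] * len(y_true)  # constant "normal" prediction
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy lands at the negative-class share while precision and recall stay at zero, which is why accuracy alone could not flag the collapse.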

I built a diagnostic script — scripts/diagnose_sensor_signal.py — to plot raw and scaled sensor traces for a single engine with the RUL ≤ 30 region marked. The result:

Sensor traces for one engine across its full lifetime. The shaded region marks the anomaly window (RUL ≤ 30) where the model is supposed to detect degradation. Inside that window, neither sensor shows a signal distinguishable from baseline noise.

The degradation signal was below the noise floor: a vib_increase of 0.002 on a baseline of 0.010–0.015, with noise of comparable size. T50 and P30 carried zero signal for failure modes like "high-pressure turbine wear"; only vibration was affected, not the other physically related sensors. The model could not distinguish anomaly windows from normal windows in feature space.

No loss function can save a dataset without learnable signal. v0.1's anomaly head collapse was inevitable from day one — the simulator wasn't producing learnable structure inside the anomaly windows.
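One way to put a number on "below the noise floor" is to express the mean shift inside the anomaly window in units of baseline noise. The sketch below is an assumed formulation, not the contents of scripts/diagnose_sensor_signal.py; the 0.002 shift comes from the text, while the noise sigma of 0.002 and the 170/30 cycle split are illustrative guesses.

```python
import random
import statistics

def estimate_snr(baseline: list[float], window: list[float]) -> float:
    """Mean shift inside the window, in units of baseline noise (population std)."""
    shift = abs(statistics.mean(window) - statistics.mean(baseline))
    return shift / statistics.pstdev(baseline)

# Synthetic vibration trace. The 0.002 shift is from the text; the noise sigma
# and the healthy/near-failure cycle counts are assumptions for illustration.
rng = random.Random(0)
healthy = [0.012 + rng.gauss(0.0, 0.002) for _ in range(170)]
near_failure = [0.012 + 0.002 + rng.gauss(0.0, 0.002) for _ in range(30)]

print(f"estimated SNR: {estimate_snr(healthy, near_failure):.2f}")
```

This treats the shift as a full step across the whole window, which is generous. A gradual ramp would average out to an even smaller shift.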

What the diagnostic proved

Three things, in order:

1. Loss-function tuning was the wrong tool: the issue was upstream of the model.
2. Cheap baseline checks (logistic regression on raw features) should precede expensive Transformer training. If a linear model can't learn the signal, the Transformer won't either.
3. The simulator itself is a piece of engineering, not a fixed dataset. Sensor degradation magnitudes, multi-sensor signatures per failure mode, and SNR targets are design decisions, not parameters to crank.

The fix

Engineering learnable signal, on purpose.

The diagnosis pointed at the data, so I rebuilt the generator. The new design (generate_sensors_v2.py) doesn't just crank degradation magnitudes — it targets a specific signal-to-noise ratio across the degradation window (SNR_EOL = 4.0) and spreads each failure mode across multiple physically plausible sensors. High-pressure turbine wear now shows in T50 and P30 together, not vibration alone, so the model has cross-channel structure to learn from.
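To make the SNR-targeting idea concrete, here is a minimal sketch rather than the contents of generate_sensors_v2.py. It assumes a linear ramp from a degradation onset to end of life, scaled so the final mean shift equals SNR_EOL times each sensor's noise sigma. The function names, the `hpt_wear` key, the positive shift direction, and every baseline or sigma other than T50 ≈ 1400 are hypothetical.

```python
import random

SNR_EOL = 4.0  # target mean shift at end of life, in units of noise sigma

# Sensors touched by each failure mode. Only the HPT-wear pairing comes from
# the text; the key name is a hypothetical identifier.
MODE_SENSORS = {"hpt_wear": ("T50", "P30")}

def mean_shift(t, n_cycles, onset, noise_sigma, snr_eol=SNR_EOL):
    """Noise-free shift at cycle t: zero until onset, then a linear ramp
    that reaches snr_eol * noise_sigma on the final cycle."""
    if t <= onset:
        return 0.0
    progress = (t - onset) / (n_cycles - 1 - onset)
    return snr_eol * noise_sigma * progress

def simulate_mode(mode, n_cycles, onset, baselines, sigmas, seed=0):
    """Return {sensor: trace}, adding the mode's ramp to its affected sensors."""
    rng = random.Random(seed)
    traces = {}
    for sensor, base in baselines.items():
        affected = sensor in MODE_SENSORS[mode]
        traces[sensor] = [
            base
            + (mean_shift(t, n_cycles, onset, sigmas[sensor]) if affected else 0.0)
            + rng.gauss(0.0, sigmas[sensor])
            for t in range(n_cycles)
        ]
    return traces

# Placeholder baselines and sigmas, apart from T50 ~ 1400 quoted earlier.
traces = simulate_mode(
    "hpt_wear", n_cycles=200, onset=100,
    baselines={"T50": 1400.0, "P30": 550.0, "vibration": 0.012},
    sigmas={"T50": 0.5, "P30": 0.4, "vibration": 0.002},
)
```

Scaling by each sensor's own sigma is the point of the design: the shift is sized relative to the noise it has to beat, not picked as an absolute magnitude.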

I also replaced the threshold-based labels that never fired with the CMAPSS convention — anomaly defined by RUL proximity (rul ≤ 30) — swapped weighted cross-entropy for focal loss (gamma = 2), and raised the RUL joint-loss weight from 0.001 to 0.1 so the regression head actually contributed to the objective.
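For reference, the two loss changes can be written out per example. This is a plain-Python sketch for clarity, not the PyTorch training code; gamma = 2 and the 0.1 weight come from the text, while the additive form of the joint loss and the function names are assumptions.

```python
import math

GAMMA = 2.0       # focal-loss focusing parameter, from the text
RUL_WEIGHT = 0.1  # joint-loss weight on the regression term, from the text

def focal_loss(p_true: float, gamma: float = GAMMA) -> float:
    """Focal loss for one example, where p_true is the predicted probability
    of the correct class: -(1 - p_true)^gamma * log(p_true)."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def joint_loss(anomaly_loss: float, rul_mse: float,
               rul_weight: float = RUL_WEIGHT) -> float:
    """Weighted sum of the two heads' losses (additive form is an assumption)."""
    return anomaly_loss + rul_weight * rul_mse

easy, hard = focal_loss(0.9), focal_loss(0.1)
print(f"easy example: {easy:.4f}, hard example: {hard:.4f}")
print(f"joint loss: {joint_loss(hard, rul_mse=80.0):.4f}")
```

The (1 - p)^gamma factor shrinks the loss on confidently correct examples, so the abundant easy "normal" windows contribute less to the gradient than the rare hard ones.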

Then the discipline the diagnostic taught me: before spending Transformer compute, I run a learnability gate. A cheap baseline checks per-mode recall on the regenerated data; every failure mode has to clear the bar before training starts. Know whether the thing can work before betting compute on it.
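A gate like this can be very small. The sketch below assumes the baseline's predictions are already available and only shows the per-mode recall check; the 0.80 bar, the function names, and the mode labels are hypothetical, since the source does not state the actual pass mark.

```python
from collections import defaultdict

RECALL_BAR = 0.80  # hypothetical pass mark; the source does not state the bar

def per_mode_recall(y_true, y_pred, modes):
    """Recall on anomaly windows, grouped by the failure mode of each window."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred, mode in zip(y_true, y_pred, modes):
        if truth == 1:
            totals[mode] += 1
            hits[mode] += int(pred == 1)
    return {mode: hits[mode] / totals[mode] for mode in totals}

def learnability_gate(y_true, y_pred, modes, bar=RECALL_BAR):
    """Pass only if every failure mode clears the recall bar."""
    recalls = per_mode_recall(y_true, y_pred, modes)
    failing = sorted(mode for mode, r in recalls.items() if r < bar)
    return not failing, recalls, failing

# Toy baseline predictions: one mode is learnable, the other is not.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
modes = ["hpt_wear"] * 4 + ["fod"] * 4 + ["none"] * 2

passed, recalls, failing = learnability_gate(y_true, y_pred, modes)
print(passed, recalls, failing)
```

Checking recall per mode rather than overall matters here: an aggregate score can look fine while one failure mode is still invisible to the baseline.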

Results · test set

From total collapse to a working model.

v0.1 predicted "normal" on every window, with precision and recall both 0.00. Trained on the rebuilt data, v0.2.0 learns all five failure modes. v0.2.1 then moves the operating point to match the real cost of a miss.

0.9957

Anomaly recall at the cost-optimal threshold (v0.2.1) — up from 0.00 in v0.1.

13

Missed failures on the test set, down from 586 at the default threshold.

−78%

Total operating cost (50·FN + FP) versus the default threshold.

RUL regression RMSE: 8.92 cycles (v0.1 collapsed to a near-constant mean at 41.72). The precision-recall curve, confusion matrix, and RUL scatter are live on the dashboard.

v0.2.1 — the operating point

Optimizing for the cost of being wrong.

A working model raised a sharper question: where do you set the decision threshold? v0.2.0 defaulted to argmax (threshold 0.50), which implicitly treats a false alarm and a missed failure as equally costly. For jet engines that's the wrong objective: the two errors are not equally bad.

A missed failure can mean an in-flight engine event: aircraft on ground, secondary damage, lives. A false alarm means an unnecessary borescope inspection. Conservatively, a miss costs on the order of 50× a false alarm. So I made that ratio explicit and swept the validation set for the threshold that minimizes expected cost (50·FN + FP).

The cost-minimizing operating point was t* = 0.22, not 0.50. Moving there dropped missed failures from 586 to 13 — a 0.4% miss rate — at the deliberate cost of lower precision. Under this cost structure, that trade is correct.

The threshold was chosen on validation and evaluated once on the held-out test set — no leakage. The full derivation, including the neighbor check that confirmed t* = 0.22 sits in a stable cost basin, is in the pull request that shipped v0.2.1.
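The sweep itself is a few lines. The sketch below uses a tiny synthetic validation set, not the project's data, and assumes a window is flagged when its anomaly score is at or above the threshold; the function names and the 0.01 grid step are illustrative.

```python
COST_FN = 50  # cost of a missed failure, relative to a false alarm
COST_FP = 1   # cost of a false alarm

def total_cost(y_true, scores, threshold):
    """Operating cost 50*FN + FP when flagging windows with score >= threshold."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return COST_FN * fn + COST_FP * fp

def best_threshold(y_true, scores, grid):
    """Cost-minimizing threshold on the grid; ties go to the lowest value."""
    return min(grid, key=lambda t: total_cost(y_true, scores, t))

# Synthetic validation scores, for illustration only.
y_val = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.30, 0.35, 0.60, 0.25, 0.45, 0.70, 0.90]
grid = [i / 100 for i in range(1, 100)]

t_star = best_threshold(y_val, scores, grid)
print(t_star, total_cost(y_val, scores, t_star), total_cost(y_val, scores, 0.50))
```

On this toy set the cheapest threshold sits well below 0.50, because one avoided miss pays for many extra false alarms. The threshold would be picked on validation data like this and then applied unchanged to the test set.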

Stack

What's under the hood.

3-layer Transformer encoder, 4 attention heads, 128-dim hidden. Sliding window of 50 timesteps × 5 sensor channels. Sinusoidal positional encoding, global average pooling, dual-head output (anomaly classification + RUL regression). Focal loss on the classifier; MSE on the regressor. A retrieval-augmented Gemini layer turns predictions into human-readable failure-mode hypotheses.
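For orientation, the architecture described above could be sketched in PyTorch roughly as follows. The layer count, head count, hidden size, window shape, positional encoding, pooling, and dual heads come from the description; the class name, the linear input projection, the 4× feed-forward width, and the two-logit classifier are assumptions.

```python
import math

import torch
from torch import nn

class DualHeadTransformer(nn.Module):
    """Encoder over a sensor window with anomaly and RUL heads (a sketch)."""

    def __init__(self, n_sensors=5, window=50, d_model=128, n_heads=4, n_layers=3):
        super().__init__()
        self.input_proj = nn.Linear(n_sensors, d_model)

        # Fixed sinusoidal positional encoding, shape (1, window, d_model).
        position = torch.arange(window).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(1, window, d_model)
        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pos_enc", pe)

        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.anomaly_head = nn.Linear(d_model, 2)  # logits: normal vs. anomaly
        self.rul_head = nn.Linear(d_model, 1)      # RUL in cycles

    def forward(self, x):
        # x: (batch, window, n_sensors)
        h = self.input_proj(x) + self.pos_enc[:, : x.size(1)]
        h = self.encoder(h)
        pooled = h.mean(dim=1)  # global average pooling over time
        return self.anomaly_head(pooled), self.rul_head(pooled).squeeze(-1)

model = DualHeadTransformer()
logits, rul = model(torch.randn(8, 50, 5))
print(logits.shape, rul.shape)
```

Both heads read the same pooled representation, so the weighting between the classification and regression losses decides how much each task shapes the shared encoder.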

Model

PyTorch
Transformer encoder (3 layers, 4 heads)
Focal loss · dual-head · RUL clipping

Data & pipeline

SNR-scaled synthetic sim (v2)
50 engines · 1000 cycles · 5 channels
Engine-level split · learnability gate

Interpretation

RAG-augmented Google Gemini
Prompt-engineered structured JSON
Failure-mode hypotheses from logs

Serving

Flask inference API
Live diagnostics dashboard

Artifacts

The work, in public.

Code

GitHub repository

Source, PRs, tagged releases through v0.2.1


Demo

Live diagnostics dashboard

Real test metrics — PR curve, confusion matrix, RUL scatter


Writeup

Project log

The diagnostic chain and decisions, in narrative form
