Case Study · 01
Turbofan Predictive Maintenance.
A dual-head Transformer that predicts jet engine anomalies and Remaining Useful Life. My first model collapsed completely — so I diagnosed why, rebuilt the data it learned from, and tuned the operating point to the real cost of a missed failure.
Context
The original goal was a NASA-CMAPSS-inspired predictive maintenance pipeline: synthetic sensor data for 500 jet engines, a Transformer model predicting both anomaly state and Remaining Useful Life, and a live diagnostic dashboard. End-to-end, dual-head, multi-task — the architecture that everyone reaches for in time-series prognostics.
The early model didn't work. Both heads collapsed: the anomaly head predicted "normal" on every window, the RUL head predicted the training mean. Standard remedies — focal loss, class weights, joint-loss rebalancing — didn't move the metrics.
What started as a modeling project became something more useful: a lesson in telling when a dataset has nothing for a model to learn from — and then rebuilding it so it does. The diagnosis was the hard part. The fix, once I understood the real problem, was deliberate engineering.
The diagnostic chain
Going upstream when results don't match expectations.
The first instinct after a model collapse is to tune the loss function: implement focal loss for the anomaly head, on the assumption that class imbalance is the culprit. So that's what I did. The first training run with focal loss produced the same all-zero precision and recall.
Rather than tune gamma, I went upstream — to the labels themselves.
Bug 1 — labels that never fired
The simulator labeled a window as anomalous only if sensor values crossed fixed thresholds:
# Original labeling logic
anomaly = 1 if vibration > 0.05 or T50 > 1420 else 0
But the degradation parameters in the same simulator produced vibration baselines
around 0.010–0.015 and worst-case increases of ~0.003 — never
reaching 0.05. T50 baseline was ~1400 with +1°C
degradation at most — never reaching 1420. Of five failure modes, only
Foreign Object Damage ever produced positive labels (a single abrupt spike). Every
other mode produced zero positives across its entire degradation window.
Effective class imbalance: 500:1, not the 100:1 I'd designed around.
Fixed it with the CMAPSS-convention labeling — anomaly defined by RUL proximity, not arbitrary sensor thresholds:
# Fixed
anomaly = 1 if rul <= 30 else 0
Class weights recomputed to a real 20% positive base rate. Retrained. The model still collapsed.
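To make the difference between the two labeling rules concrete, here is a minimal sketch that applies both to a worst-case reading assembled from the magnitudes quoted above. The function names and the sample reading are illustrative assumptions, not the simulator's actual code.

```python
def label_by_threshold(vibration: float, t50: float) -> int:
    """Original rule: fires only when a sensor crosses a fixed threshold."""
    return 1 if vibration > 0.05 or t50 > 1420 else 0


def label_by_rul(rul: int, horizon: int = 30) -> int:
    """CMAPSS-style rule: anomalous when RUL is within the horizon."""
    return 1 if rul <= horizon else 0


# Hypothetical worst-case reading near end of life, built from the figures
# in the text: vibration baseline 0.015 plus a 0.003 increase, and a T50
# baseline of 1400 plus 1 degree of degradation.
worst_vibration = 0.015 + 0.003
worst_t50 = 1400 + 1

threshold_label = label_by_threshold(worst_vibration, worst_t50)
rul_label = label_by_rul(rul=5)
print(threshold_label, rul_label)
```

Under these assumed readings the threshold rule stays silent even five cycles before failure, while the RUL rule fires by construction.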
Bug 2 — signal below the noise floor
Accuracy of 0.7954 on a 20% positive base rate is the model predicting
"normal" on every window. The confusion matrix showed zero true positives
across 14,662 test windows.
The model classified every anomaly window as normal. Accuracy looks reasonable only because the class imbalance carries it — predict "normal" always and you'll be right 80% of the time on a 20% positive base rate.
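The arithmetic behind that accuracy figure can be checked with a toy confusion matrix. This sketch uses a made-up set of 1,000 windows at a 20% positive rate (not the real test set) and an "always normal" predictor.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Illustrative data: 200 anomalous windows, 800 normal ones.
y_true = [1] * 200 + [0] * 800
# A collapsed model: predicts "normal" for every window.
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0
print(accuracy, precision, recall)
```

Accuracy equals the negative-class share, while precision and recall are both zero. That is why per-class metrics, not accuracy, are the ones to read on imbalanced data.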
I built a diagnostic script — scripts/diagnose_sensor_signal.py — to plot
raw and scaled sensor traces for a single engine with the RUL ≤ 30 region marked.
The result:
Sensor traces for one engine across its full lifetime. The shaded region marks the anomaly window (RUL ≤ 30) where the model is supposed to detect degradation. Inside that window, neither sensor shows a signal distinguishable from baseline noise.
The degradation signal was below the noise floor: a vib_increase of
0.002 on a baseline of 0.010–0.015, with noise of comparable magnitude.
T50 and P30 carried zero signal for failure modes like "high-pressure turbine
wear" — only vibration was affected, not the other physically related sensors.
The model literally could not distinguish anomaly windows from normal windows in
feature space.
No loss function can save a dataset without learnable signal. v0.1's anomaly head collapse was inevitable from day one — the simulator wasn't producing learnable structure inside the anomaly windows.
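A quick numeric version of that check is to measure how far the mean shifts inside the anomaly window, in units of baseline noise. The sketch below is not scripts/diagnose_sensor_signal.py; it uses synthetic Gaussian samples, and the noise standard deviation of 0.0025 is an assumed value standing in for "comparable noise".

```python
import random
import statistics


def shift_to_noise_ratio(normal, anomalous):
    """Mean shift inside the anomaly window, in units of baseline noise std."""
    noise_std = statistics.stdev(normal)
    return (statistics.mean(anomalous) - statistics.mean(normal)) / noise_std


rng = random.Random(0)

baseline = 0.0125      # midpoint of the 0.010-0.015 range quoted in the text
vib_increase = 0.002   # degradation magnitude quoted in the text
noise_std = 0.0025     # ASSUMPTION: illustrative noise level, not from the source

normal = [rng.gauss(baseline, noise_std) for _ in range(5000)]
anomalous = [rng.gauss(baseline + vib_increase, noise_std) for _ in range(5000)]

ratio = shift_to_noise_ratio(normal, anomalous)
print(f"shift / noise std = {ratio:.2f}")
```

With these assumed values the shift is less than one noise standard deviation, so individual readings from the two regions overlap heavily.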
What the diagnostic proved
Three things, in order:
- Loss-function tuning was the wrong tool — the issue was upstream of the model.
- Cheap baseline checks (logistic regression on raw features) should precede expensive Transformer training. If a linear model can't learn the signal, the Transformer won't either.
- The simulator itself is a piece of engineering — not a fixed dataset. Sensor degradation magnitudes, multi-sensor signatures per failure mode, and SNR targets are design decisions, not parameters to crank.
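The second point can be illustrated with a deliberately tiny baseline: a one-feature logistic regression trained by plain gradient descent, run once on weak signal and once on strong signal. This is a hand-rolled toy under assumed data (unit-variance Gaussian noise, 20% positives, shifts of 0.5 and 4.0 noise standard deviations), not the baseline used in the project.

```python
import math
import random


def make_data(shift, n=1000, pos_rate=0.2, seed=0):
    """One feature: N(0, 1) for normal windows, N(shift, 1) for anomalies."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = 1 if rng.random() < pos_rate else 0
        xs.append(rng.gauss(shift * y, 1.0))
        ys.append(y)
    return xs, ys


def fit_logistic(xs, ys, lr=0.5, epochs=200):
    """Batch gradient descent on the mean log-loss; returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def recall_at_half(xs, ys, w, b):
    """Recall when predicting anomaly for P(anomaly) > 0.5, i.e. w*x + b > 0."""
    tp = sum(1 for x, y in zip(xs, ys) if y == 1 and w * x + b > 0)
    return tp / sum(ys)


results = {}
for name, shift in [("weak", 0.5), ("strong", 4.0)]:
    xs, ys = make_data(shift)
    w, b = fit_logistic(xs, ys)
    results[name] = recall_at_half(xs, ys, w, b)

print(results)
```

On the weak-signal data the linear model behaves much like the collapsed Transformer and flags almost nothing. On the strong-signal data it recovers most anomalies within seconds of CPU time.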
The fix
Engineering learnable signal, on purpose.
The diagnosis pointed at the data, so I rebuilt the generator. The new design
(generate_sensors_v2.py) doesn't just crank degradation magnitudes —
it targets a specific signal-to-noise ratio across the degradation window
(SNR_EOL = 4.0) and spreads each failure mode across
multiple physically plausible sensors. High-pressure turbine wear now shows
in T50 and P30 together, not vibration alone, so the model has cross-channel structure
to learn from.
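One way to target an end-of-life SNR is to scale each channel's degradation offset by that channel's own noise level. The sketch below assumes a linear ramp, hypothetical per-channel noise values, and a hypothetical signature table. Only SNR_EOL = 4.0 and the T50 and P30 pairing for turbine wear come from the text, and none of this is the code in generate_sensors_v2.py.

```python
SNR_EOL = 4.0  # target shift at end of life, in units of noise std (from the text)

# ASSUMPTION: illustrative per-channel noise standard deviations.
NOISE_STD = {"T50": 2.0, "P30": 0.5}

# ASSUMPTION: which channels each failure mode touches, with relative weights.
MODE_SIGNATURES = {
    "hpt_wear": {"T50": 1.0, "P30": 1.0},
}


def degradation_offset(cycle, onset, eol, noise_std, snr_eol=SNR_EOL):
    """Offset at `cycle`, ramping linearly from 0 at `onset`
    to snr_eol * noise_std at `eol`."""
    if cycle <= onset:
        return 0.0
    progress = min(1.0, (cycle - onset) / (eol - onset))
    return snr_eol * noise_std * progress


def mode_offsets(mode, cycle, onset, eol):
    """Per-channel offsets for one failure mode at one cycle."""
    return {
        channel: weight * degradation_offset(cycle, onset, eol, NOISE_STD[channel])
        for channel, weight in MODE_SIGNATURES[mode].items()
    }


offsets_mid = mode_offsets("hpt_wear", cycle=900, onset=800, eol=1000)
offsets_eol = mode_offsets("hpt_wear", cycle=1000, onset=800, eol=1000)
print(offsets_mid, offsets_eol)
```

Defining the offset relative to noise means the same SNR target holds on every channel regardless of its units.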
I also replaced the threshold-based labels that never fired with the CMAPSS convention —
anomaly defined by RUL proximity (rul ≤ 30) — swapped weighted
cross-entropy for focal loss (gamma = 2), and raised the RUL joint-loss
weight from 0.001 to 0.1 so the regression head actually
contributed to the objective.
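For reference, here is a per-sample sketch of the two loss pieces named above, written in plain Python rather than PyTorch so it runs anywhere. It omits the optional alpha class-balance term of focal loss, and the function names are illustrative. Only gamma = 2 and the 0.1 RUL weight come from the text.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Binary focal loss for one sample; p is the predicted P(anomaly).

    With gamma = 0 this reduces to ordinary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def joint_loss(anomaly_loss: float, rul_loss: float, rul_weight: float = 0.1) -> float:
    """Weighted sum of the classification and regression objectives."""
    return anomaly_loss + rul_weight * rul_loss


# A confidently correct positive versus a confidently wrong one.
easy = focal_loss(0.9, y=1)
hard = focal_loss(0.1, y=1)
print(f"easy={easy:.5f} hard={hard:.3f}")
```

The (1 - p_t)^gamma factor shrinks the loss on easy examples far more than on hard ones, which is the property that makes focal loss useful under class imbalance. The RUL weight sets how much the regression head contributes to the gradient relative to the anomaly head.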
Then the discipline the diagnostic taught me: before spending Transformer compute, I run a learnability gate. A cheap baseline checks per-mode recall on the regenerated data; every failure mode has to clear the bar before training starts. Know whether the thing can work before betting compute on it.
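A gate like this can be a few lines of code. The sketch below computes recall per failure mode from baseline predictions and reports which modes fall short. The 0.5 bar, the mode names, and the toy predictions are all assumptions, since the text does not state the actual threshold.

```python
from collections import defaultdict


def per_mode_recall(y_true, y_pred, modes):
    """Recall over positive windows, grouped by failure mode."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, m in zip(y_true, y_pred, modes):
        if t == 1:
            totals[m] += 1
            hits[m] += int(p == 1)
    return {m: hits[m] / totals[m] for m in totals}


def learnability_gate(recalls, min_recall=0.5):
    """Return (passed, failing_modes). min_recall is an assumed bar."""
    failing = sorted(m for m, r in recalls.items() if r < min_recall)
    return len(failing) == 0, failing


# Toy baseline output: one mode is detected, the other is missed entirely.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
modes = ["fod", "fod", "hpt_wear", "hpt_wear", "fod", "hpt_wear"]

recalls = per_mode_recall(y_true, y_pred, modes)
passed, failing = learnability_gate(recalls)
print(recalls, passed, failing)
```

Checking recall per mode rather than in aggregate matters here: one easy mode can hide several unlearnable ones behind a respectable overall number.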
Results · test set
From total collapse to a working model.
v0.1 predicted "normal" on every window, with precision and recall both at 0.00. The rebuilt data and model (v0.2.0) learn all five failure modes. v0.2.1 then moves the operating point to match the real cost of a miss.
0.9957
Anomaly recall at the cost-optimal threshold (v0.2.1) — up from 0.00 in v0.1.
13
Missed failures on the test set, down from 586 at the default threshold.
−78%
Total operating cost (50·FN + FP) versus the default threshold.
RUL regression RMSE: 8.92 cycles (v0.1 collapsed to a near-constant mean at 41.72). The precision-recall curve, confusion matrix, and RUL scatter are live on the dashboard.
v0.2.1 — the operating point
Optimizing for the cost of being wrong.
A working model raised a sharper question: where do you set the decision threshold?
v0.2.0 defaulted to argmax (threshold 0.50), which implicitly treats a missed
failure and a false alarm as equally costly. For jet engines that's the wrong
objective: the two errors are not equally bad.
A missed failure can mean an in-flight engine event: aircraft on ground, secondary
damage, lives. A false alarm means an unnecessary borescope inspection. Conservatively,
a miss costs on the order of 50× a false alarm. So I made that ratio
explicit and swept the validation set for the threshold that minimizes expected cost
(50·FN + FP).
The cost-minimizing operating point was t* = 0.22, not 0.50. Moving
there dropped missed failures from 586 to 13 — a 0.4% miss rate — at the deliberate
cost of lower precision. Under this cost structure, that trade is correct.
The threshold was chosen on validation and evaluated once on the held-out test set —
no leakage. The full derivation, including the neighbor check that confirmed
t* = 0.22 sits in a stable cost basin, is in the pull request that shipped
v0.2.1.
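The sweep itself is simple enough to sketch. The version below evaluates 50·FN + FP over a grid of thresholds and keeps the cheapest one. The validation labels and scores are made-up toy values and the function names are illustrative, so this shows the technique rather than the code from the pull request.

```python
def expected_cost(y_true, scores, threshold, fn_cost=50.0, fp_cost=1.0):
    """Total cost when predicting anomaly for score >= threshold."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn_cost * fn + fp_cost * fp


def sweep_threshold(y_true, scores, grid):
    """Return the lowest-cost threshold; ties go to the first grid point."""
    return min(grid, key=lambda t: expected_cost(y_true, scores, t))


# Toy validation set (ASSUMPTION: illustrative values only).
y_val = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
s_val = [0.05, 0.10, 0.15, 0.30, 0.40, 0.60, 0.25, 0.45, 0.70, 0.90]

grid = [i / 100 for i in range(1, 100)]
t_star = sweep_threshold(y_val, s_val, grid)

cost_default = expected_cost(y_val, s_val, 0.50)
cost_star = expected_cost(y_val, s_val, t_star)
print(t_star, cost_default, cost_star)
```

For a perfectly calibrated classifier the cost-optimal threshold would be FP cost / (FP cost + FN cost) = 1/51, roughly 0.02. Models trained with focal loss are generally not calibrated, which is one reason to find the threshold empirically on validation data instead.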
Model
- PyTorch
- Transformer encoder (3 layers, 4 heads)
- Focal loss · dual-head · RUL clipping
Data & pipeline
- SNR-scaled synthetic sim (v2)
- 50 engines · 1000 cycles · 5 channels
- Engine-level split · learnability gate
Interpretation
- RAG-augmented Google Gemini
- Prompt-engineered structured JSON
- Failure-mode hypotheses from logs
Serving
- Flask inference API
- Live diagnostics dashboard
Artifacts