This is an end-to-end distributed machine learning pipeline, built with PySpark, that predicts equipment failures 7 days in advance. It processes 2M+ records from 1,900 machines, engineers 1,150 rolling-statistic features, and uses stratified down-sampling to handle extreme class imbalance, so that maintenance can be scheduled proactively.
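
The three ideas named above (rolling-statistic features, a 7-day-ahead failure label, and stratified down-sampling) can be sketched in a few lines. This is a minimal plain-Python illustration on toy data, not the pipeline's actual PySpark code. The column names (`machine_id`, `day`, `value`), the 3-day window, the 20% sampling fraction, and all helper function names are assumptions made up for the example. In PySpark the same steps would typically map onto window functions and `DataFrame.sampleBy`, but that mapping is also an assumption about how the pipeline is built.

```python
import random
from collections import defaultdict, deque
from statistics import mean

HORIZON_DAYS = 7  # prediction horizon stated in the description
WINDOW_DAYS = 3   # hypothetical rolling-window length


def add_rolling_mean(rows, window=WINDOW_DAYS):
    """Attach a trailing rolling mean of `value`, computed per machine."""
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for row in sorted(rows, key=lambda r: (r["machine_id"], r["day"])):
        buf = history[row["machine_id"]]
        buf.append(row["value"])  # deque drops values older than the window
        out.append({**row, "value_roll_mean": mean(buf)})
    return out


def add_failure_label(rows, failure_days, horizon=HORIZON_DAYS):
    """Label a row 1 if its machine fails within the next `horizon` days."""
    out = []
    for row in rows:
        days = failure_days.get(row["machine_id"], ())
        label = int(any(0 < d - row["day"] <= horizon for d in days))
        out.append({**row, "label": label})
    return out


def stratified_downsample(rows, fractions, seed=42):
    """Keep each row with the probability assigned to its label stratum."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions[r["label"]]]


# Toy data: two machines, 30 daily readings each; machine 1 fails on day 20.
rows = [
    {"machine_id": m, "day": d, "value": float(d)}
    for m in (1, 2)
    for d in range(1, 31)
]
failure_days = {1: [20]}

features = add_rolling_mean(rows)
labeled = add_failure_label(features, failure_days)

# Keep every positive row and roughly 20% of the negative rows.
sample = stratified_downsample(labeled, fractions={0: 0.2, 1: 1.0})
```

Sampling each label stratum at its own rate keeps all of the rare failure rows while shrinking the dominant non-failure class. The fractions would need tuning against the real class ratio, which the description does not state.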