Photometric Classification of Tidal Disruption Events With AutoGluon

February 08, 2026
AutoML, Classification, Kaggle

Background

I entered the MALLORN Classifier Challenge in late November. It was a great way to spend some time on good old-fashioned machine learning. With a move and a new job starting, I didn't have much time to spare, but I still managed an F1 score of 0.6426 on the final private test data, good for 26th place out of 894. Not too shabby.

I want to give a brief write-up on my modelling decisions and findings. You can find the full code here.

The challenge asks participants to identify Tidal Disruption Events (TDEs) in simulated LSST light curves. TDEs occur when a star is ripped apart by the tidal forces of a supermassive black hole. They are rare (~100 confirmed), scientifically valuable, and difficult to distinguish from other transients photometrically.

The training set has 3043 objects in four classes: AGN (59%), SN Ia (27%), CC SN (9%), and TDE (4.9%, our target). Each object has multi-band light curves (flux vs time in u, g, r, i, z, y bands) plus metadata (photometric redshift, dust extinction). The metric is binary F1 for TDE detection only.

My approach was simple: some feature engineering and AutoGluon for AutoML. Key modelling decisions include multiclass over binary, careful metric selection, and post-training threshold optimisation.

AutoGluon Setup

I used AutoGluon-Tabular for classification. For a dataset this size (~3000 objects, ~400 hand-crafted features), tree-based ensembles dominate, and AutoGluon handles the model zoo (LightGBM, XGBoost, CatBoost, neural networks, linear models, k-NN) and stacking automatically.

The main reason I chose AutoGluon over hand-tuning individual GBDTs is the out-of-fold (OOF) predictions it produces from bagged training. Every training object gets a prediction from the fold where it wasn't used for training. This is critical for the threshold optimisation described later.

from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="target",
    sample_weight="weight",      # column holding the per-sample class weights
    eval_metric="f1_macro",      # training metric; see discussion below
    problem_type="multiclass",
)

predictor.fit(
    train_data,
    presets="high_quality",
    time_limit=3600,
    num_bag_folds=5,             # bagging yields the OOF predictions
)

Multiclass vs Binary

The competition asks for a binary TDE/not-TDE prediction, but I found that training a 4-class model and reducing to binary consistently outperformed direct binary classification. Under identical conditions:

| Mode | OOF TDE F1 | Threshold |
|---|---|---|
| Multiclass | 0.616 | 0.418 |
| Binary | 0.560 | 0.501 |

The gap widened further at longer training times. My interpretation is that the multiclass model learns sharper decision boundaries. In binary mode, AGN, SN Ia, and CC SN are all lumped into a single "not TDE" class, which is extremely heterogeneous. The multiclass model explicitly learns the structure of each confounding class, which helps it separate them from TDEs.

This matters most for false positives. The main FP sources are AGN (flare-like variability) and SN IIn (smooth post-peak declines that mimic TDEs). With explicit AGN and CC SN classes, the model has dedicated decision boundaries against these confounders.
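Reducing the four-class output to the binary competition format is then mechanical: take the TDE probability column and apply the tuned threshold. A minimal sketch with made-up probabilities (the class ordering and threshold value here are illustrative):

```python
import numpy as np

# Hypothetical multiclass probabilities for 4 objects,
# columns ordered [AGN, SN Ia, CC SN, TDE].
proba = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.15, 0.15, 0.50],
    [0.05, 0.50, 0.40, 0.05],
    [0.30, 0.15, 0.10, 0.45],
])

TDE_COL = 3
theta = 0.418  # tuned on OOF predictions, not the default argmax rule

# Binary prediction: is the TDE probability above the tuned threshold?
is_tde = (proba[:, TDE_COL] >= theta).astype(int)
print(is_tde)  # [0 1 0 1]
```

Note that the last object is predicted TDE even though AGN and TDE probabilities are close; the tuned threshold, not the most probable class, decides.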

Class Weighting

TDEs are 4.9% of the training set. We apply per-sample weights inversely proportional to class frequency:

$$w_c = \frac{N}{K \cdot N_c}$$

with an additional TDE boost factor $\beta$. I swept $\beta \in \{1, 2, 3, 5, 8, 10\}$ and found $\beta = 5$ optimal for multiclass. This gives TDE samples a weight of about 25, versus roughly 0.4 for AGN samples.
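As a concrete sketch of the weighting scheme (class counts reconstructed from the stated fractions, so approximate; the "weight" column name matches the sample_weight argument above):

```python
import pandas as pd

# Approximate class counts for the 3043-object training set.
counts = pd.Series({"AGN": 1795, "SN Ia": 822, "CC SN": 277, "TDE": 149})
N, K = counts.sum(), len(counts)

# Inverse-frequency weights w_c = N / (K * N_c), with a TDE boost beta.
beta = 5.0
weights = N / (K * counts)
weights["TDE"] *= beta

# Attach a per-sample "weight" column, as consumed by sample_weight="weight".
train = pd.DataFrame({"target": ["AGN", "TDE", "SN Ia"]})  # toy frame
train["weight"] = train["target"].map(weights)
```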

In binary mode, $\beta = 1$ worked best. I think this is because the two-class formulation already makes the imbalance explicit, and additional boosting causes overfitting.

Training Metric vs Competition Metric

An important subtlety: the competition metric (binary TDE F1) is not necessarily the best training metric. AutoGluon uses eval_metric for early stopping, hyperparameter tuning, and model ranking. I tested:

  • f1_macro: Multiclass F1 averaged across all four classes.
  • f1_tde: Custom scorer computing binary TDE F1 from multiclass predictions.
  • log_loss: Probabilistic calibration loss.

Despite being the competition metric, f1_tde underperformed. I believe it overfits during model selection because the TDE class is so small (148 samples). f1_macro produced better overall representations that generalised better after threshold tuning.

from autogluon.core.metrics import make_scorer
from sklearn.metrics import f1_score

TDE_LABEL = 3


def _f1_tde_multiclass(y_true, y_pred):
    """Binary F1 for TDE class in multiclass setting."""
    y_binary_true = (y_true == TDE_LABEL).astype(int)
    y_binary_pred = (y_pred == TDE_LABEL).astype(int)
    return f1_score(y_binary_true, y_binary_pred)


F1_TDE_SCORER = make_scorer(
    name="f1_tde",
    score_func=_f1_tde_multiclass,
    optimum=1.0,
    greater_is_better=True,
)

A useful middle ground: train base models with f1_macro, then refit the ensemble weights with f1_tde. This is a cheap post-training step (no retraining of base models) that sometimes picks up extra points. AutoGluon's fit_weighted_ensemble makes this straightforward.

Features

The feature set evolved over multiple iterations.

Iteration 1: Generic Statistics (~336 features)

Per-band statistics ($\times$ 6 bands): flux mean/std/range/skewness/kurtosis, rise and decay rates, SNR metrics, variability indices (Stetson K, Von Neumann $\eta$, MAD), peak properties. Plus cross-band colours, peak delays, and redshift-weighted variants.

OOF F1 with this set: ~0.65.
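To give a flavour of these statistics, here is a simplified single-band sketch (my own stripped-down versions; the real implementations also handle flux errors and irregular sampling):

```python
import numpy as np

def band_features(flux: np.ndarray) -> dict:
    """Simplified single-band variability statistics (error weighting omitted)."""
    n = len(flux)
    mean, std = flux.mean(), flux.std(ddof=1)
    resid = (flux - mean) / std
    return {
        "mean": mean,
        "std": std,
        "range": flux.max() - flux.min(),
        "mad": np.median(np.abs(flux - np.median(flux))),
        # Stetson K: ~0.798 for Gaussian noise, lower for flare-dominated curves.
        "stetson_k": np.mean(np.abs(resid)) / np.sqrt(np.mean(resid ** 2)),
        # Von Neumann eta: successive-difference variance over variance (~2 for noise).
        "eta": np.sum(np.diff(flux) ** 2) / ((n - 1) * flux.var(ddof=1)),
    }

rng = np.random.default_rng(0)
feats = band_features(rng.normal(0.0, 1.0, 200))
```

Running this per band and adding colours, peak delays, and redshift-weighted variants gets you to a few hundred columns quickly.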

Iteration 2: Physics-Motivated Features (+21 features)

Error analysis on a holdout set revealed three main confusion patterns:

| Confusion | Rank | Cause |
|---|---|---|
| AGN $\to$ TDE (FP) | Dominant | AGN flares mimic TDE rise |
| SN IIn $\to$ TDE (FP) | Second | Smooth post-peak similar to TDE |
| TDE $\to$ SN Ia (FN) | Third | Some TDEs have SN-like decline |

I designed 21 features targeting these confusions:

  • Late-time behaviour: late_time_activity measures variability in the final 30% of the light curve. TDEs fade to baseline; AGN stay active. This turned out to be the single most important feature.
  • Peak timing: blue_peaks_first_count encodes Wien's law prediction that TDE blue bands peak before red.
  • Post-peak smoothness: Residual scatter after polynomial fit to post-peak data. TDEs are smooth; AGN and IIn are bumpy.
  • Decay shape: Power-law fits to post-peak decline, checking if $\alpha \approx 5/3$ (TDE mass fallback prediction). Also DRW fits as AGN diagnostics.
  • Colour evolution: Early-to-late colour change captures TDE cooling (blue $\to$ red).
  • SN Ia discrimination: decline_uniformity measures how uniform the decay rate is across bands. SN Ia are very uniform; TDEs aren't.
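As an illustration, a plausible implementation of late_time_activity (the definition below is my reconstruction, not necessarily the exact one in the real code) is the scatter of the flux in the final 30% of the observing window, normalised by the overall flux scale:

```python
import numpy as np

def late_time_activity(time: np.ndarray, flux: np.ndarray) -> float:
    """Flux scatter in the last 30% of the time span, normalised by the
    overall flux scatter. Hypothetical definition for illustration."""
    t_cut = time.min() + 0.7 * (time.max() - time.min())
    late = flux[time >= t_cut]
    if len(late) < 2:
        return 0.0
    scale = np.std(flux) + 1e-9  # guard against constant light curves
    return float(np.std(late) / scale)

t = np.linspace(0, 100, 50)
tde_like = np.exp(-t / 15)          # smooth fade to baseline
agn_like = np.sin(t) * 0.5 + 0.5    # persistent variability

# A TDE-like curve should score much lower than an AGN-like one.
print(late_time_activity(t, tde_like) < late_time_activity(t, agn_like))  # True
```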

Top Features

Permutation importance on the holdout set (best available run):

| Rank | Feature | Importance | Category |
|---|---|---|---|
| 1 | late_time_activity | +0.0229 | TDE-specific |
| 2 | blue_peaks_first_count | +0.0123 | TDE-specific |
| 3 | z_x_r_mad | +0.0107 | Redshift-weighted |
| 4 | i_frac_positive | +0.0106 | Per-band |
| 5 | z_x_preflare_activity | +0.0096 | Redshift-weighted |
| 8 | post_peak_smoothness | +0.0081 | TDE-specific |

The physics-motivated features dominate the top ranks. Interestingly, resampled_skew_r was the top feature for the 2nd place GBDT solution. I didn't use resampled/interpolated features, which is probably a gap.

Threshold Optimisation

This was probably the most impactful modelling decision. AutoGluon predicts the most probable class by default. But for TDE detection, we separately tune the probability threshold $\theta$ at which we classify something as TDE.

After training, we have OOF probabilities $P(\text{TDE} \mid x)$ for every training object. We sweep $\theta \in [0.01, 0.99]$ and select $\theta^* = \arg\max_\theta \text{F1}(\theta)$.

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import f1_score

warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (12, 5)
sns.set_style("whitegrid")

oof = pd.read_csv("../output/analysis/oof_predictions.csv")
proba_col = next(c for c in oof.columns if "prob" in c.lower())
tde_val = 3

thresholds = np.arange(0.01, 1.0, 0.01)
f1_scores = [
    f1_score(
        (oof["true_label"] == tde_val).astype(int),
        (oof[proba_col] >= t).astype(int),
        zero_division=0,
    )
    for t in thresholds
]

best_idx = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_idx]:.2f}  (F1 = {f1_scores[best_idx]:.4f})")

fig, ax = plt.subplots()
ax.plot(thresholds, f1_scores, linewidth=2)
ax.axvline(
    thresholds[best_idx],
    color="black",
    linestyle="--",
    alpha=0.7,
    label=f"$\\theta^* = {thresholds[best_idx]:.2f}$",
)
ax.set_xlabel("TDE Probability Threshold $\\theta$")
ax.set_ylabel("F1 Score")
ax.set_title("Threshold Optimisation on OOF Predictions")
ax.legend()
plt.tight_layout()

An additional subtlety: AutoGluon trains many models internally, and different models produce different OOF probabilities with different optimal thresholds. Rather than using AutoGluon's default "best model" (ranked by f1_macro), I iterate through all models on the leaderboard, optimise the threshold for each, and select the model with the best TDE F1 after tuning. This often selects a different model than the default ranking.
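Schematically, the selection loop looks like this. A plain dict of synthetic per-model OOF TDE probabilities stands in for AutoGluon's leaderboard here; the model names are illustrative, and in the real pipeline the probabilities come from the predictor's per-model OOF predictions:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = rng.binomial(1, 0.05, size=1000)  # toy labels, ~5% positive (TDE)

# Stand-ins for per-model OOF TDE probabilities.
oof_probas = {
    "WeightedEnsemble_L2": y_true * 0.5 + rng.uniform(0, 0.5, 1000),
    "LightGBM_BAG_L1": np.clip(y_true * 0.3 + rng.uniform(0, 0.6, 1000), 0, 1),
}

thresholds = np.arange(0.01, 1.0, 0.01)
results = {}
for name, proba in oof_probas.items():
    # Optimise the TDE threshold per model on its own OOF probabilities.
    f1s = [f1_score(y_true, (proba >= t).astype(int), zero_division=0)
           for t in thresholds]
    best = int(np.argmax(f1s))
    results[name] = (float(thresholds[best]), float(f1s[best]))

# Pick the model whose *tuned* F1 is best, not the default leaderboard winner.
best_model = max(results, key=lambda m: results[m][1])
print(best_model, results[best_model])
```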

Cross-Validation

With only 148 TDEs, validation is noisy. I used two modes:

| Mode | Purpose | TDEs in Validation | Feature Importance | Submission |
|---|---|---|---|---|
| Full (5-fold OOF) | Final runs | ~148 (all) | No | Yes |
| Holdout (80/20) | Iteration | ~30 | Yes | No |

The OOF approach is more reliable because we evaluate on all 148 TDEs rather than the ~30 in a single holdout fold. But the holdout mode is faster (1 fit instead of 5) and supports permutation-based feature importance.

Stacking Configuration

AutoGluon supports multi-level stacking: models at level $L+1$ train on predictions from level $L$ plus original features. I tried:

  • No stacking (num_stack_levels=0): Bagged models + weighted ensemble. Fast.
  • 2-level stacking: Sweet spot for shorter runs.
  • Dynamic stacking: AutoGluon decides adaptively. Mixed results.

The extreme_quality preset with default stacking and a long time limit worked best. I also tested restricting the model pool (e.g. only GBM, XGB, CatBoost, TabM), but letting AutoGluon explore freely was better.

Final Pipeline

  1. Extract ~400 features from light curves
  2. Train 4-class multiclass model (AutoGluon, extreme_quality, f1_macro, $\beta_{\text{TDE}} = 5$)
  3. Optionally refit ensemble weights with f1_tde
  4. Iterate all leaderboard models, optimise TDE threshold per model on OOF
  5. Select best model by TDE F1, apply to test set

Best result: F1 = 0.70 (precision 0.71, recall 0.64, threshold 0.48).

What Worked

  • Multiclass over binary. Consistently better by ~5 F1 points under identical conditions.
  • Per-model threshold optimisation. Probably the single biggest improvement.
  • Physics-motivated features. late_time_activity and blue_peaks_first_count dominate feature importance.
  • Aggressive TDE class weighting ($\beta = 5$).
  • Using f1_macro for training, f1_tde for ensemble refit. Best of both worlds.

What Didn't Work

  • Training directly with f1_tde. Overfits during model selection on 148 TDE samples.
  • Several physics features. Decay quality and decline rate metrics had negative importance. The chi-squared fits are too noisy on sparse light curves.
  • Binary classification. Even with tuned weighting.
  • AutoGluon's built-in feature pruning. Didn't reliably improve over using all features.
  • Interpolation/resampling features. I didn't try these, but resampled skewness turned out to be a top feature for the leading solutions, which I'd suspected might be the case. Bayesian methods and Gaussian Processes would have been a natural fit here too. But alas, you only have so much time for fun projects.