Background
I entered the MALLORN Classifier Challenge in late November. It was a great way to spend some time on good old-fashioned machine learning. Ultimately, with me moving and starting a new job, I didn't have much time to spare. But I still managed an F1 score of 0.6426 on the final private test data, good for 26th place out of 894. So not too shabby.
I want to give a brief write-up on my modelling decisions and findings. You can find the full code here.
The challenge asks participants to identify Tidal Disruption Events (TDEs) in simulated LSST light curves. TDEs occur when a star is ripped apart by the tidal forces of a supermassive black hole. They are rare (~100 confirmed), scientifically valuable, and difficult to distinguish from other transients photometrically.
The training set has 3043 objects in four classes: AGN (59%), SN Ia (27%), CC SN (9%), and TDE (4.9%, our target). Each object has multi-band light curves (flux vs time in the u, g, r, i, z, y bands) plus metadata (photometric redshift, dust extinction). The metric is binary F1 for TDE detection only.
My approach was simple: some feature engineering and AutoGluon for AutoML. Key modelling decisions include multiclass over binary, careful metric selection, and post-training threshold optimisation.
AutoGluon Setup
I used AutoGluon-Tabular for classification. For a dataset this size (~3000 objects, ~400 hand-crafted features), tree-based ensembles dominate, and AutoGluon handles the model zoo (LightGBM, XGBoost, CatBoost, neural networks, linear models, k-NN) and stacking automatically.
The main reason I chose AutoGluon over hand-tuning individual GBDTs is the out-of-fold (OOF) predictions it produces from bagged training. Every training object gets a prediction from the fold where it wasn't used for training. This is critical for the threshold optimisation described later.
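To make the OOF idea concrete, here is a toy sketch (my own illustration, not AutoGluon internals): the "model" is a trivial class-frequency estimator, and the point is the data flow, where every object is scored by a model that never saw it during training. The function names are hypothetical.

```python
# Minimal sketch of out-of-fold (OOF) predictions from K-fold training.
# Each object's OOF score comes from the folds it was NOT trained on.

def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def oof_predictions(labels, k=5, positive=1):
    """OOF P(positive) per sample: the positive-class frequency
    in the k-1 folds the sample was not part of."""
    n = len(labels)
    oof = [None] * n
    for fold in kfold_indices(n, k):
        holdout = set(fold)
        train = [labels[i] for i in range(n) if i not in holdout]
        p = sum(1 for y in train if y == positive) / len(train)
        for i in fold:
            oof[i] = p
    return oof

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
probs = oof_predictions(labels, k=5)
# Every sample gets a prediction, and none comes from its own fold.
```

AutoGluon produces the real version of `probs` automatically when `num_bag_folds` is set, which is what the threshold tuning below consumes.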
```python
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="target",
    sample_weight="weight",    # per-sample class weights (see below)
    eval_metric="f1_macro",
    problem_type="multiclass",
)
predictor.fit(
    train_data,
    presets="high_quality",
    time_limit=3600,
    num_bag_folds=5,           # bagged folds -> OOF predictions
)
```
Multiclass vs Binary
The competition asks for a binary TDE/not-TDE prediction, but I found that training a 4-class model and reducing to binary consistently outperformed direct binary classification. Under identical conditions:
| Mode | OOF TDE F1 | Threshold |
|---|---|---|
| Multiclass | 0.616 | 0.418 |
| Binary | 0.560 | 0.501 |
The gap widened further at longer training times. My interpretation is that the multiclass model learns sharper decision boundaries. In binary mode, AGN, SN Ia, and CC SN are all lumped into a single "not TDE" class, which is extremely heterogeneous. The multiclass model explicitly learns the structure of each confounding class, which helps it separate them from TDEs.
This matters most for false positives. The main FP sources are AGN (flare-like variability) and SN IIn (smooth post-peak declines that mimic TDEs). With explicit AGN and CC SN classes, the model has dedicated decision boundaries against these confounders.
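Reducing the 4-class output to a binary TDE decision is then just a matter of thresholding the TDE probability column rather than taking the argmax. A minimal sketch (class ordering and function name are my own assumptions):

```python
# Reduce a 4-class probability vector to a binary TDE decision.
# Assumed class order: [AGN, SN Ia, CC SN, TDE].

CLASSES = ["AGN", "SN Ia", "CC SN", "TDE"]
TDE_IDX = CLASSES.index("TDE")

def tde_decision(proba, threshold=0.418):
    """proba: per-class probabilities summing to 1."""
    return proba[TDE_IDX] >= threshold

# A TDE score of 0.45 is not the argmax here (AGN wins at 0.50),
# but it still clears the tuned threshold:
print(tde_decision([0.50, 0.03, 0.02, 0.45]))  # True
print(tde_decision([0.60, 0.20, 0.10, 0.10]))  # False
```

This is also why the multiclass formulation loses nothing relative to binary: the binary decision is fully recoverable from the TDE column.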
Class Weighting
TDEs are 4.9% of the training set. We apply per-sample weights inversely proportional to class frequency:
$$w_c = \frac{N}{K \cdot N_c}$$
with an additional TDE boost factor $\beta$. I swept $\beta \in \{1, 2, 3, 5, 8, 10\}$ and found $\beta = 5$ optimal for multiclass. This gives TDE samples about 25 times the weight of AGN samples.
In binary mode, $\beta = 1$ worked best. I think this is because the two-class formulation already makes the imbalance explicit, and additional boosting causes overfitting.
Training Metric vs Competition Metric
An important subtlety: the competition metric (binary TDE F1) is not necessarily the best
training metric. AutoGluon uses eval_metric for early stopping, hyperparameter
tuning, and model ranking. I tested:
- `f1_macro`: multiclass F1 averaged across all four classes.
- `f1_tde`: custom scorer computing binary TDE F1 from multiclass predictions.
- `log_loss`: probabilistic calibration loss.
Despite being the competition metric, f1_tde underperformed. I believe it overfits
during model selection because the TDE class is so small (148 samples). f1_macro
produced better overall representations that generalised better after threshold tuning.
```python
from autogluon.core.metrics import make_scorer
from sklearn.metrics import f1_score

TDE_LABEL = 3

def _f1_tde_multiclass(y_true, y_pred):
    """Binary F1 for the TDE class in a multiclass setting."""
    y_binary_true = (y_true == TDE_LABEL).astype(int)
    y_binary_pred = (y_pred == TDE_LABEL).astype(int)
    return f1_score(y_binary_true, y_binary_pred)

F1_TDE_SCORER = make_scorer(
    name="f1_tde",
    score_func=_f1_tde_multiclass,
    optimum=1.0,
    greater_is_better=True,
)
```
A useful middle ground: train base models with f1_macro, then refit the ensemble weights
with f1_tde. This is a cheap post-training step (no retraining of base models) that
sometimes picks up extra points. AutoGluon's fit_weighted_ensemble makes this
straightforward.
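To illustrate what an ensemble-weight refit does (a toy stand-in, not AutoGluon's actual `fit_weighted_ensemble` implementation, which searches over all base models), here is a two-model version that grid-searches the mixing weight against binary TDE F1 on OOF probabilities:

```python
# Toy sketch: refit the weight of a two-model ensemble against TDE F1.
# Function names and the grid search are illustrative assumptions.

def f1_binary(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def refit_weight(y_true, p_a, p_b, threshold=0.5, grid=101):
    """Find w in [0, 1] maximising F1 of w*p_a + (1-w)*p_b."""
    best_w, best_f1 = 0.0, -1.0
    for i in range(grid):
        w = i / (grid - 1)
        preds = [w * a + (1 - w) * b >= threshold
                 for a, b in zip(p_a, p_b)]
        f1 = f1_binary(y_true, preds)
        if f1 > best_f1:
            best_w, best_f1 = w, f1
    return best_w, best_f1

# Model A separates the classes; model B has them inverted.
best_w, best_f1 = refit_weight(
    [1, 1, 0, 0], [0.9, 0.8, 0.1, 0.2], [0.4, 0.3, 0.6, 0.7]
)
```

Because this only re-optimises a handful of mixing weights on existing OOF predictions, it costs seconds rather than retraining hours, which is why it is worth trying even late in a competition.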
Features
The feature set evolved over multiple iterations.
Iteration 1: Generic Statistics (~336 features)
Per-band statistics ($\times$ 6 bands): flux mean/std/range/skewness/kurtosis, rise and decay rates, SNR metrics, variability indices (Stetson K, Von Neumann $\eta$, MAD), peak properties. Plus cross-band colours, peak delays, and redshift-weighted variants.
OOF F1 with this set: ~0.65.
Iteration 2: Physics-Motivated Features (+21 features)
Error analysis on a holdout set revealed three main confusion patterns:
| Confusion | Count | Cause |
|---|---|---|
| AGN $\to$ TDE (FP) | Dominant | AGN flares mimic TDE rise |
| SN IIn $\to$ TDE (FP) | Second | Smooth post-peak similar to TDE |
| TDE $\to$ SN Ia (FN) | Third | Some TDEs have SN-like decline |
I designed 21 features targeting these confusions:
- Late-time behaviour: `late_time_activity` measures variability in the final 30% of the light curve. TDEs fade to baseline; AGN stay active. This turned out to be the single most important feature.
- Peak timing: `blue_peaks_first_count` encodes Wien's law prediction that TDE blue bands peak before red.
- Post-peak smoothness: residual scatter after a polynomial fit to post-peak data. TDEs are smooth; AGN and SN IIn are bumpy.
- Decay shape: power-law fits to the post-peak decline, checking whether $\alpha \approx 5/3$ (the TDE mass-fallback prediction). Also DRW fits as AGN diagnostics.
- Colour evolution: early-to-late colour change captures TDE cooling (blue $\to$ red).
- SN Ia discrimination: `decline_uniformity` measures how uniform the decay rate is across bands. SN Ia are very uniform; TDEs aren't.
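Since `late_time_activity` ended up as the top feature, here is a plausible reconstruction of it (my own sketch; the actual implementation in the repository may differ in windowing and the scatter statistic):

```python
from statistics import pstdev

def late_time_activity(times, fluxes, frac=0.3):
    """Flux scatter in the final `frac` of the observed time span.
    TDEs should fade to a quiet baseline; AGN keep varying."""
    t0, t1 = min(times), max(times)
    cutoff = t1 - frac * (t1 - t0)
    late = [f for t, f in zip(times, fluxes) if t >= cutoff]
    return pstdev(late) if len(late) > 1 else 0.0

times = list(range(100))
tde_like = [100 * 0.95 ** t for t in times]       # smooth decay to baseline
agn_like = [50 + (-1) ** t * 20 for t in times]   # persistent variability
print(late_time_activity(times, tde_like) < late_time_activity(times, agn_like))  # True
```

The discriminative power comes entirely from the physics: a TDE's late-time window is nearly flat, while an AGN's retains its full variability amplitude.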
Top Features
Permutation importance on the holdout set (best available run):
| Rank | Feature | Importance | Category |
|---|---|---|---|
| 1 | `late_time_activity` | +0.0229 | TDE-specific |
| 2 | `blue_peaks_first_count` | +0.0123 | TDE-specific |
| 3 | `z_x_r_mad` | +0.0107 | Redshift-weighted |
| 4 | `i_frac_positive` | +0.0106 | Per-band |
| 5 | `z_x_preflare_activity` | +0.0096 | Redshift-weighted |
| 8 | `post_peak_smoothness` | +0.0081 | TDE-specific |
The physics-motivated features dominate the top ranks. Interestingly, resampled_skew_r
was the top feature for the 2nd place GBDT solution. I didn't use resampled/interpolated features,
which is probably a gap.
Threshold Optimisation
This was probably the most impactful modelling decision. AutoGluon predicts the most probable class by default. But for TDE detection, we separately tune the probability threshold $\theta$ at which we classify something as TDE.
After training, we have OOF probabilities $P(\text{TDE} \mid x)$ for every training object. We sweep $\theta \in [0.01, 0.99]$ and select $\theta^* = \arg\max_\theta \text{F1}(\theta)$.
```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import f1_score

warnings.filterwarnings("ignore")
plt.rcParams["figure.figsize"] = (12, 5)
sns.set_style("whitegrid")

oof = pd.read_csv("../output/analysis/oof_predictions.csv")
proba_col = next(c for c in oof.columns if "prob" in c.lower())
tde_val = 3

# Sweep thresholds and compute binary TDE F1 at each
thresholds = np.arange(0.01, 1.0, 0.01)
f1_scores = [
    f1_score(
        (oof["true_label"] == tde_val).astype(int),
        (oof[proba_col] >= t).astype(int),
        zero_division=0,
    )
    for t in thresholds
]

best_idx = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_idx]:.2f} (F1 = {f1_scores[best_idx]:.4f})")

fig, ax = plt.subplots()
ax.plot(thresholds, f1_scores, linewidth=2)
ax.axvline(
    thresholds[best_idx],
    color="black",
    linestyle="--",
    alpha=0.7,
    label=f"$\\theta^* = {thresholds[best_idx]:.2f}$",
)
ax.set_xlabel("TDE Probability Threshold $\\theta$")
ax.set_ylabel("F1 Score")
ax.set_title("Threshold Optimisation on OOF Predictions")
ax.legend()
plt.tight_layout()
```
An additional subtlety: AutoGluon trains many models internally, and different models produce
different OOF probabilities with different optimal thresholds. Rather than using AutoGluon's default
"best model" (ranked by f1_macro), I iterate through all models on the leaderboard,
optimise the threshold for each, and select the model with the best TDE F1 after tuning. This often
selects a different model than the default ranking.
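The per-model selection loop looks roughly like this (an illustrative stand-in: model names and OOF probabilities here are synthetic, and in the real pipeline they come from AutoGluon's leaderboard):

```python
# Sketch: tune the TDE threshold per model, then select the model with
# the best post-tuning OOF F1 -- not the one ranked best by eval_metric.

def f1_at(y_true, probs, thr):
    preds = [p >= thr for p in probs]
    tp = sum(t and p for t, p in zip(y_true, preds))
    fp = sum((not t) and p for t, p in zip(y_true, preds))
    fn = sum(t and (not p) for t, p in zip(y_true, preds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def select_model(y_true, oof_by_model, thresholds=None):
    thresholds = thresholds or [i / 100 for i in range(1, 100)]
    best = None
    for name, probs in oof_by_model.items():
        thr, f1 = max(
            ((t, f1_at(y_true, probs, t)) for t in thresholds),
            key=lambda x: x[1],
        )
        if best is None or f1 > best[2]:
            best = (name, thr, f1)
    return best  # (model_name, best_threshold, oof_f1)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
leaderboard_oof = {
    "model_a": [0.40, 0.35, 0.30, 0.10, 0.05, 0.10, 0.05, 0.10],
    "model_b": [0.90, 0.20, 0.80, 0.70, 0.10, 0.10, 0.10, 0.10],
}
best = select_model(y_true, leaderboard_oof)
```

Here `model_a` wins despite its low raw probabilities, because a well-placed threshold separates its classes perfectly. That is exactly the failure mode of ranking models before threshold tuning.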
Cross-Validation
With only 148 TDEs, validation is noisy. I used two modes:
| Mode | Purpose | TDEs in Validation | Feature Importance | Submission |
|---|---|---|---|---|
| Full (5-fold OOF) | Final runs | ~148 (all) | No | Yes |
| Holdout (80/20) | Iteration | ~30 | Yes | No |
The OOF approach is more reliable because we evaluate on all 148 TDEs rather than the ~30 in a single holdout fold. But the holdout mode is faster (1 fit instead of 5) and supports permutation-based feature importance.
Stacking Configuration
AutoGluon supports multi-level stacking: models at level $L+1$ train on predictions from level $L$ plus original features. I tried:
- No stacking (`num_stack_levels=0`): bagged models + weighted ensemble. Fast.
- 2-level stacking: sweet spot for shorter runs.
- Dynamic stacking: AutoGluon decides adaptively. Mixed results.
The extreme_quality preset with default stacking and a long time limit worked best. I
also tested restricting the model pool (e.g. only GBM, XGB, CatBoost, TabM), but letting AutoGluon
explore freely was better.
Final Pipeline
- Extract ~400 features from light curves
- Train a 4-class multiclass model (AutoGluon, `extreme_quality`, `f1_macro`, $\beta_{\text{TDE}} = 5$)
- Optionally refit ensemble weights with `f1_tde`
- Iterate over all leaderboard models, optimising the TDE threshold per model on OOF predictions
- Select the best model by TDE F1 and apply it to the test set
Best result: F1 = 0.70 (precision 0.71, recall 0.64, threshold 0.48).
What Worked
- Multiclass over binary. Consistently better by ~5 F1 points under identical conditions.
- Per-model threshold optimisation. Probably the single biggest improvement.
- Physics-motivated features. `late_time_activity` and `blue_peaks_first_count` dominate feature importance.
- Aggressive TDE class weighting ($\beta = 5$).
- Using `f1_macro` for training and `f1_tde` for the ensemble refit. Best of both worlds.
What Didn't Work
- Training directly with `f1_tde`. Overfits during model selection on 148 TDE samples.
- Several physics features. Decay-quality and decline-rate metrics had negative importance; the chi-squared fits are too noisy on sparse light curves.
- Binary classification. Even with tuned weighting.
- AutoGluon's built-in feature pruning. Didn't reliably improve over using all features.
- Interpolation/resampling features. I didn't try these, but the top solutions showed resampled skewness was their top feature. I had an inkling this would be the case. Bayesian methods and Gaussian Processes would have been perfect here, but alas, you only have so much time for fun projects.