Table 1. Prosody experiment under clean and masked audio. Values are mean [95% CI].

| Judge | Measure | Clean | Masked (WER = 1.64) |
|---|---|---|---|
| MERaLiON | Sensitivity | -0.055 [-0.065, -0.045] | -0.088 [-0.098, -0.077] |
| MERaLiON | Specificity | -0.004 [-0.006, -0.002] | -0.007 [-0.013, -0.002] |
| Qwen | Sensitivity | 0.005 [-0.004, -0.019] | -0.007 [-0.016, 0.004] |
| Qwen | Specificity | -0.087 [-0.096, -0.077] | -0.076 [-0.084, -0.067] |
| Flamingo | Sensitivity | -0.001 [-0.015, 0.012] | 0.001 [-0.005, 0.006] |
| Flamingo | Specificity | -0.015 [-0.022, -0.008] | -0.015 [-0.020, -0.012] |
Table 2A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings. A blank cell indicates the value was not reported.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | +0.080 | +0.015 | -0.009 | Degrades (-0.089) |
| Gemini-2.5-Flash | audio+text | +0.067 | +0.015 | | Degrades (-0.052) |
| Qwen2.5-Omni-7B | text only | +0.029 | +0.037 | +0.020 | Stable |
| Qwen2.5-Omni-7B | audio+text | +0.090 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text only | -0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio+text | -0.109 | -0.149 | -0.150 | Degrades (-0.041) |
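The "Direction" deltas in Table 2A appear to be the sensitivity shift from the ground-truth (GT) transcript to the noisiest available ASR transcript; this convention is inferred from the reported numbers, not stated in the source, so the sketch below is an assumption checked against three rows of the table.

```python
# Hedged sketch: recompute Table 2A "Direction" deltas as the shift from the
# GT transcript to the noisiest ASR transcript (Whisper-Base where available,
# else Whisper-Large). The delta convention is inferred, not documented.

def direction_delta(cells: dict) -> float:
    """Sensitivity shift when moving from GT to the noisiest ASR source."""
    asr = cells.get("WB", cells.get("WL"))  # prefer Whisper-Base
    return round(asr - cells["GT"], 3)

# Rows copied from Table 2A.
rows = {
    ("Gemini-2.5-Flash", "text only"): {"GT": 0.080, "WL": 0.015, "WB": -0.009},
    ("Qwen2.5-Omni-7B", "audio+text"): {"GT": 0.090, "WL": 0.127, "WB": 0.143},
    ("MiniCPM-o-4.5", "audio+text"): {"GT": -0.109, "WL": -0.149, "WB": -0.150},
}

for (judge, modality), cells in rows.items():
    print(judge, modality, direction_delta(cells))
# Reproduces the reported deltas: -0.089, +0.053, -0.041.
```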
Table 2B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 1.0 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio+text | 1.0 | 0.8 | | Degrades |
| Qwen2.5-Omni-7B | text only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio+text | 0.0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text only | 0.0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio+text | 0.0 | 0.0 | 0.0 | Unchanged |
Table 3. Gain decomposition and rescue values for judges evaluated under multiple transcript sources.

| Judge | Measure | gain_GT | gain_WL | gain_WB | rescue_WL | rescue_WB |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | Sensitivity | +0.061 | +0.090 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Specificity | -0.010 | +0.090 | +0.122 | +0.100 | +0.132 |
| Gemini-2.5-Flash | Sensitivity | -0.013 | 0.000 | +0.013 | | |
| MiniCPM-o-4.5 | Sensitivity | +0.017 | -0.375 | -0.363 | -0.392 | -0.380 |
| MiniCPM-o-4.5 | Specificity | +0.021 | -0.374 | -0.356 | -0.395 | -0.377 |
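The decomposition in Table 3 can be reconstructed from the sensitivities in Table 2A; the formulas below are an assumption (gain = multimodal minus matched text-only, rescue = gain under ASR minus gain under GT), checked here for Qwen2.5-Omni-7B, where the results agree with Table 3 up to last-digit rounding.

```python
# Hedged sketch of the Table 3 decomposition, assuming
#   gain_X   = sensitivity(audio+text, transcript X) - sensitivity(text only, transcript X)
#   rescue_X = gain_X - gain_GT
# Sensitivities are copied from Table 2A for Qwen2.5-Omni-7B.

text_only = {"GT": 0.029, "WL": 0.037, "WB": 0.020}
audio_text = {"GT": 0.090, "WL": 0.127, "WB": 0.143}

gain = {k: round(audio_text[k] - text_only[k], 3) for k in text_only}
rescue = {k: round(gain[k] - gain["GT"], 3) for k in ("WL", "WB")}

print(gain)    # multimodal gain over the matched text-only condition
print(rescue)  # extra gain beyond what the GT transcript already provides
```

Under this reading, the rescue values measure how much of the multimodal gain is specific to noisy ASR transcripts rather than available with ground-truth text.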
Table 4. AAPB of Sensitivity across transcript sources for judges with multi-source evaluations.

| Judge | Modality | GT | Whisper-Large | Whisper-Base |
|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio+text | 0.042 | | |
| Qwen2.5-Omni-7B | text only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio+text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio+text | 0.325 | 0.350 | 0.360 |
Table 5A. Category-based comparison between MERaLiON (MER) and Audio Flamingo (AF). Values are mean [95% CI].

| Setting | Category | MER Sensitivity | AF Sensitivity | MER Specificity AUC | AF Specificity AUC |
|---|---|---|---|---|---|
| audio only | harassment | 0.181 [0.160, 0.200] | -0.002 [-0.037, 0.033] | 0.765 [0.731, 0.801] | 0.765 [0.724, 0.801] |
| audio only | hate | 0.332 [0.305, 0.364] | 0.071 [0.023, 0.120] | 0.790 [0.756, 0.825] | 0.629 [0.580, 0.675] |
| audio only | violence | 0.162 [0.137, 0.190] | 0.035 [-0.001, 0.071] | 0.674 [0.628, 0.720] | 0.685 [0.639, 0.729] |
| audio+text / GT | harassment | 0.125 [0.106, 0.146] | 0.052 [-0.221, 0.344] | 0.843 [0.821, 0.865] | 0.727 [0.690, 0.763] |
| audio+text / GT | hate | 0.249 [0.224, 0.277] | 0.076 [-0.022, 0.170] | 0.818 [0.789, 0.845] | 0.707 [0.672, 0.742] |
| audio+text / GT | violence | 0.165 [0.138, 0.196] | -0.190 [-0.426, 0.057] | 0.823 [0.791, 0.852] | 0.733 [0.694, 0.770] |
| audio+text / WL | harassment | 0.142 [0.120, 0.162] | -0.004 [-0.046, 0.029] | 0.861 [0.841, 0.881] | 0.725 [0.685, 0.759] |
| audio+text / WL | hate | 0.256 [0.225, 0.283] | -0.037 [-0.089, 0.010] | 0.825 [0.795, 0.855] | 0.707 [0.669, 0.742] |
| audio+text / WL | violence | 0.157 [0.130, 0.186] | -0.020 [-0.062, 0.014] | 0.810 [0.774, 0.844] | 0.686 [0.637, 0.731] |
| audio+text / WB | harassment | 0.140 [0.119, 0.161] | 0.033 [-0.263, 0.286] | 0.850 [0.830, 0.872] | 0.723 [0.687, 0.759] |
| audio+text / WB | hate | 0.249 [0.220, 0.277] | -0.045 [-0.142, 0.046] | 0.809 [0.778, 0.844] | 0.694 [0.649, 0.735] |
| audio+text / WB | violence | 0.169 [0.139, 0.199] | -0.204 [-0.464, 0.075] | 0.818 [0.784, 0.849] | 0.694 [0.646, 0.738] |
Table 5B. Sensitivity comparison for dangerous, self_harm, and sexual categories under text-only and audio+text settings. Values are mean [95% CI].

| Category | MER, text only | AF, text only | MER, audio+text | AF, audio+text |
|---|---|---|---|---|
| dangerous | 0.169 [0.135, 0.201] | 0.360 [0.316, 0.404] | 0.152 [0.138, 0.170] | 0.481 [0.424, 0.537] |
| self_harm | 0.058 [0.018, 0.096] | 0.580 [0.523, 0.629] | 0.091 [0.077, 0.107] | 0.413 [0.071, 0.682] |
| sexual | 0.193 [0.152, 0.235] | 0.651 [0.615, 0.683] | 0.278 [0.251, 0.306] | 0.433 [0.357, 0.505] |
Table 6. Safety scores for severity-0 dialogues.

| Model | Modality | Safety score |
|---|---|---|
| Qwen | Audio + Transcriptions | 0.824 |
| Qwen | Audio | 0.800 |
| Flamingo | Audio + Transcriptions | 0.711 |
| Flamingo | Audio | 0.739 |
| MERaLiON | Audio + Transcriptions | 0.654 |
| MERaLiON | Audio | 0.672 |