Table 1. Prosody experiment under clean and masked audio. Values are mean [95% CI].

| Judge | Measure | Clean | Masked (WER = 1.64) |
|---|---|---|---|
| MERaLiON | Sensitivity | -0.055 [-0.065, -0.045] | -0.088 [-0.098, -0.077] |
| MERaLiON | Specificity | -0.004 [-0.006, -0.002] | -0.007 [-0.013, -0.002] |
| Qwen | Sensitivity | 0.005 [-0.004, -0.019] | -0.007 [-0.016, 0.004] |
| Qwen | Specificity | -0.087 [-0.096, -0.077] | -0.076 [-0.084, -0.067] |
| Flamingo | Sensitivity | -0.001 [-0.015, 0.012] | 0.001 [-0.005, 0.006] |
| Flamingo | Specificity | -0.015 [-0.022, -0.008] | -0.015 [-0.020, -0.012] |
Table 2A. Sensitivity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings. A blank cell indicates the value was not reported.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | +0.080 | +0.015 | -0.009 | Degrades (-0.089) |
| Gemini-2.5-Flash | audio+text | +0.067 | +0.015 | | Degrades (-0.052) |
| Qwen2.5-Omni-7B | text only | +0.029 | +0.037 | +0.020 | Stable |
| Qwen2.5-Omni-7B | audio+text | +0.090 | +0.127 | +0.143 | Improves (+0.053) |
| MiniCPM-o-4.5 | text only | -0.127 | +0.226 | +0.213 | Inverts |
| MiniCPM-o-4.5 | audio+text | -0.109 | -0.149 | -0.150 | Degrades (-0.041) |
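The "Direction" deltas in Table 2A appear to be the sensitivity shift from the ground-truth (GT) transcript to the noisiest available ASR transcript; this convention is inferred from the reported numbers, not stated in the source, so the sketch below is an assumption checked against three rows of the table.

```python
# Hedged sketch: recompute Table 2A "Direction" deltas as the shift from the
# GT transcript to the noisiest ASR transcript (Whisper-Base where available,
# else Whisper-Large). The delta convention is inferred, not documented.

def direction_delta(cells: dict) -> float:
    """Sensitivity shift when moving from GT to the noisiest ASR source."""
    asr = cells.get("WB", cells.get("WL"))  # prefer Whisper-Base
    return round(asr - cells["GT"], 3)

# Rows copied from Table 2A.
rows = {
    ("Gemini-2.5-Flash", "text only"): {"GT": 0.080, "WL": 0.015, "WB": -0.009},
    ("Qwen2.5-Omni-7B", "audio+text"): {"GT": 0.090, "WL": 0.127, "WB": 0.143},
    ("MiniCPM-o-4.5", "audio+text"): {"GT": -0.109, "WL": -0.149, "WB": -0.150},
}

for (judge, modality), cells in rows.items():
    print(judge, modality, direction_delta(cells))
# Reproduces the reported deltas: -0.089, +0.053, -0.041.
```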
Table 2B. Specificity across transcript sources (GT, Whisper-Large, Whisper-Base) for all judges under text-only and multimodal settings.

| Judge | Modality | GT | Whisper-Large | Whisper-Base | Direction |
|---|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 1.0 | 0.4 | 0.5 | Collapses |
| Gemini-2.5-Flash | audio+text | 1.0 | 0.8 | | Degrades |
| Qwen2.5-Omni-7B | text only | 0.4 | 0.9 | 0.8 | Improves |
| Qwen2.5-Omni-7B | audio+text | 0.0 | 0.2 | 0.2 | Improves |
| MiniCPM-o-4.5 | text only | 0.0 | 0.9 | 0.8 | Improves |
| MiniCPM-o-4.5 | audio+text | 0.0 | 0.0 | 0.0 | Unchanged |
Table 3. Gain decomposition and rescue values for judges evaluated under multiple transcript sources.

| Judge | Measure | gain_GT | gain_WL | gain_WB | rescue_WL | rescue_WB |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | Sensitivity | +0.061 | +0.090 | +0.122 | +0.029 | +0.061 |
| Qwen2.5-Omni-7B | Specificity | -0.010 | +0.090 | +0.122 | +0.100 | +0.132 |
| Gemini-2.5-Flash | Sensitivity | -0.013 | 0.000 | +0.013 | | |
| MiniCPM-o-4.5 | Sensitivity | +0.017 | -0.375 | -0.363 | -0.392 | -0.380 |
| MiniCPM-o-4.5 | Specificity | +0.021 | -0.374 | -0.356 | -0.395 | -0.377 |
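The decomposition in Table 3 can be reconstructed from the sensitivities in Table 2A; the formulas below are an assumption (gain = multimodal minus matched text-only, rescue = gain under ASR minus gain under GT), checked here for Qwen2.5-Omni-7B, where the results agree with Table 3 up to last-digit rounding.

```python
# Hedged sketch of the Table 3 decomposition, assuming
#   gain_X   = sensitivity(audio+text, transcript X) - sensitivity(text only, transcript X)
#   rescue_X = gain_X - gain_GT
# Sensitivities are copied from Table 2A for Qwen2.5-Omni-7B.

text_only = {"GT": 0.029, "WL": 0.037, "WB": 0.020}
audio_text = {"GT": 0.090, "WL": 0.127, "WB": 0.143}

gain = {k: round(audio_text[k] - text_only[k], 3) for k in text_only}
rescue = {k: round(gain[k] - gain["GT"], 3) for k in ("WL", "WB")}

print(gain)    # multimodal gain over the matched text-only condition
print(rescue)  # extra gain beyond what the GT transcript already provides
```

Under this reading, the rescue values measure how much of the multimodal gain is specific to noisy ASR transcripts rather than available with ground-truth text.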
Table 4. AAPB of Sensitivity across transcript sources for judges with multi-source evaluations.

| Judge | Modality | GT | Whisper-Large | Whisper-Base |
|---|---|---|---|---|
| Gemini-2.5-Flash | text only | 0.027 | 0.025 | 0.028 |
| Gemini-2.5-Flash | audio+text | 0.042 | | |
| Qwen2.5-Omni-7B | text only | 0.059 | 0.159 | 0.205 |
| Qwen2.5-Omni-7B | audio+text | 0.097 | 0.106 | 0.087 |
| MiniCPM-o-4.5 | text only | 0.414 | 0.087 | 0.075 |
| MiniCPM-o-4.5 | audio+text | 0.325 | 0.350 | 0.360 |
Table 5A. Category-based comparison between MERaLiON (MER) and Audio Flamingo (AF). Values are mean [95% CI].

| Setting | Category | MER Sensitivity | AF Sensitivity | MER Specificity AUC | AF Specificity AUC |
|---|---|---|---|---|---|
| audio only | harassment | 0.181 [0.160, 0.200] | -0.002 [-0.037, 0.033] | 0.765 [0.731, 0.801] | 0.765 [0.724, 0.801] |
| audio only | hate | 0.332 [0.305, 0.364] | 0.071 [0.023, 0.120] | 0.790 [0.756, 0.825] | 0.629 [0.580, 0.675] |
| audio only | violence | 0.162 [0.137, 0.190] | 0.035 [-0.001, 0.071] | 0.674 [0.628, 0.720] | 0.685 [0.639, 0.729] |
| audio+text / GT | harassment | 0.125 [0.106, 0.146] | 0.052 [-0.221, 0.344] | 0.843 [0.821, 0.865] | 0.727 [0.690, 0.763] |
| audio+text / GT | hate | 0.249 [0.224, 0.277] | 0.076 [-0.022, 0.170] | 0.818 [0.789, 0.845] | 0.707 [0.672, 0.742] |
| audio+text / GT | violence | 0.165 [0.138, 0.196] | -0.190 [-0.426, 0.057] | 0.823 [0.791, 0.852] | 0.733 [0.694, 0.770] |
| audio+text / WL | harassment | 0.142 [0.120, 0.162] | -0.004 [-0.046, 0.029] | 0.861 [0.841, 0.881] | 0.725 [0.685, 0.759] |
| audio+text / WL | hate | 0.256 [0.225, 0.283] | -0.037 [-0.089, 0.010] | 0.825 [0.795, 0.855] | 0.707 [0.669, 0.742] |
| audio+text / WL | violence | 0.157 [0.130, 0.186] | -0.020 [-0.062, 0.014] | 0.810 [0.774, 0.844] | 0.686 [0.637, 0.731] |
| audio+text / WB | harassment | 0.140 [0.119, 0.161] | 0.033 [-0.263, 0.286] | 0.850 [0.830, 0.872] | 0.723 [0.687, 0.759] |
| audio+text / WB | hate | 0.249 [0.220, 0.277] | -0.045 [-0.142, 0.046] | 0.809 [0.778, 0.844] | 0.694 [0.649, 0.735] |
| audio+text / WB | violence | 0.169 [0.139, 0.199] | -0.204 [-0.464, 0.075] | 0.818 [0.784, 0.849] | 0.694 [0.646, 0.738] |
Table 5B. Sensitivity comparison for dangerous, self_harm, and sexual categories under text-only and audio+text settings. Values are mean [95% CI].

| Category | MER, text only | AF, text only | MER, audio+text | AF, audio+text |
|---|---|---|---|---|
| dangerous | 0.169 [0.135, 0.201] | 0.360 [0.316, 0.404] | 0.152 [0.138, 0.170] | 0.481 [0.424, 0.537] |
| self_harm | 0.058 [0.018, 0.096] | 0.580 [0.523, 0.629] | 0.091 [0.077, 0.107] | 0.413 [0.071, 0.682] |
| sexual | 0.193 [0.152, 0.235] | 0.651 [0.615, 0.683] | 0.278 [0.251, 0.306] | 0.433 [0.357, 0.505] |
Table 6. Safety scores for severity-0 dialogues.

| Model | Modality | Safety score |
|---|---|---|
| Qwen | Audio + Transcriptions | 0.824 |
| Qwen | Audio | 0.800 |
| Flamingo | Audio + Transcriptions | 0.711 |
| Flamingo | Audio | 0.739 |
| MERaLiON | Audio + Transcriptions | 0.654 |
| MERaLiON | Audio | 0.672 |