Qwen2.5-Omni-7B

Qwen2.5-Omni is the flagship end-to-end multimodal model in the Qwen series. It accepts text, image, audio, and video input and streams its responses in real time as both text and natural speech, using a Thinker-Talker architecture.

DocVQA: 95.2%
VocalSound: 93.9%
GSM8k: 88.7%
GiantSteps Tempo: 88.0%
ChartQA: 85.3%
TextVQA: 84.4%
AI2D: 83.2%
MMBench-V1.1: 81.8%
HumanEval: 78.7%
CRPErelation: 76.5%
VoiceBench Avg: 74.1%
MBPP: 73.2%
VideoMME w sub.: 72.4%
MATH: 71.5%
MMLU-Redux: 71.0%
MVBench: 70.3%
RealWorldQA: 70.3%
MMAU Music: 69.2%
EgoSchema: 68.6%
MathVista: 67.9%
MMAU Sound: 67.9%
PointGrounding: 66.5%
MultiPL-E: 65.8%
MMAU: 65.6%
MMStar: 64.0%
MME-RealWorld: 61.6%
MMAU Speech: 59.8%
MMMU: 59.2%
MuirBench: 59.2%
OCRBench_V2: 57.8%
Meld: 57.0%
OmniBench: 56.1%
OmniBench Music: 52.8%
MMLU-Pro: 47.0%
ODinW: 42.4%
CoVoST2 en-zh: 41.4%
MMMU-Pro: 36.6%
MusicCaps: 32.8%
GPQA: 30.8%
LiveBench: 29.6%
MathVision: 25.0%
Common Voice 15: 7.6%
NMOS: 4.5%
FLEURS: 4.1%
MM-MT-Bench: 0.06