Abstract
The translation of speech in a source language to speech in a target language using generative artificial intelligence is an active area of research. It aims to overcome global language barriers and thereby enable seamless communication between speakers of different languages. Such systems are well developed for high-resourced languages like English, Spanish, French, and Chinese. Currently, objective metrics such as the Bilingual Evaluation Understudy (BLEU) score, and subjective metrics such as Mean Opinion Score for Naturalness (MOSN) and Mean Opinion Score for Similarity (MOSS), are used to evaluate the output of speech-to-speech models. However, low-resourced languages, especially African indigenous languages, remain underdeveloped in speech processing applications. The output speech in the target language needs to be evaluated to determine its closeness to the ground truth, as well as how natural and intelligible it is to the intended listeners. This paper presents a review of trends from the current metrics to emerging ones such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) and BLASER. The application of speech-model metrics on various leaderboards and modern AI platforms is also discussed. The outcome shows that while the BLEU score and MOSN remain the prevalent metrics for speech models, there is a need to explore metrics such as ROUGE-L and BERTScore, which originate from text-based machine translation, because of their benefits.
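As an illustration of one of the emerging metrics discussed above, the following is a minimal sketch (not taken from the paper, and not a production implementation) of how ROUGE-L is typically computed: the longest common subsequence (LCS) between a reference transcript and a candidate transcript is found, and precision, recall, and their F1 score are derived from its length. Function names and the whitespace tokenization are illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b,
    computed by standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the LCS on a match; otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two sentences, using naive whitespace tokenization
    (real toolkits apply more careful tokenization and may use a weighted F-beta)."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, comparing the reference "the cat sat on the mat" with the candidate "the cat is on the mat" gives an LCS of five tokens and an F1 of about 0.83, reflecting a close but imperfect match.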
Conflict of Interest
The authors declare no conflict of interest.
Ethical Approval
Not applicable.
Data Availability
The datasets used in this study are openly available at [repository link] and the source code is available on GitHub at [GitHub link].
Funding
This work did not receive any external funding.