s2t-medium-mustc-multilingual-st is a Speech to Text Transformer (S2T) model trained for end-to-end multilingual Speech Translation (ST).
The S2T model was proposed in "fairseq S2T: Fast Speech-to-Text Modeling with fairseq" and released in the fairseq repository.
Model description
S2T is a transformer-based seq2seq (encoder-decoder) model designed for end-to-end Automatic Speech Recognition (ASR) and Speech
Translation (ST). A convolutional downsampler shortens the speech input by three quarters (to one quarter of its original length)
before it is fed into the encoder. The model is trained with a standard autoregressive cross-entropy loss and generates
transcripts or translations autoregressively.
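To make the length reduction concrete, here is a minimal sketch of the arithmetic. It assumes the downsampler is two 1-D convolutions with kernel size 5, stride 2 and padding 2; the text above only states the overall reduction, so these layer settings and the function names are illustrative assumptions.

```python
def conv_output_length(length: int, kernel_size: int = 5, stride: int = 2, padding: int = 2) -> int:
    """Output length of one 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def downsampled_length(num_frames: int, num_layers: int = 2) -> int:
    """Apply the assumed stride-2 convolution `num_layers` times."""
    for _ in range(num_layers):
        num_frames = conv_output_length(num_frames)
    return num_frames


# 1000 input frames shrink to 250 encoder positions, i.e. a reduction by three quarters.
print(downsampled_length(1000))
```

Under these assumptions the encoder attends over roughly a quarter as many positions as there are input frames, which keeps self-attention affordable on long utterances.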
Intended uses & limitations
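This model can be used for end-to-end translation of English speech into text in any of the target languages listed under Evaluation results (German, Dutch, Spanish, French, Italian, Portuguese, Romanian and Russian).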
See the model hub to look for other S2T checkpoints.
How to use
As this is a standard sequence-to-sequence transformer model, you can use the generate method to produce the
transcripts or translations by passing the speech features to the model.
For multilingual speech translation models, eos_token_id is used as the decoder_start_token_id and
the target language id is forced as the first generated token. To force the target language id as the first
generated token, pass the forced_bos_token_id parameter to the generate() method. The following
example shows how to translate English speech to French and German text using the
facebook/s2t-medium-mustc-multilingual-st checkpoint.
Note: The Speech2TextProcessor object uses torchaudio to extract the filter bank features. Make sure to install the
torchaudio package before running this example.
You can either install these as extra speech dependencies with
pip install "transformers[speech,sentencepiece]"
or install the packages separately with
pip install torchaudio sentencepiece
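Before running the example, it can help to confirm that the optional packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of transformers.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two packages the note above asks for.
missing = missing_packages(["torchaudio", "sentencepiece"])
if missing:
    print("Missing packages:", ", ".join(missing))
```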
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset
import soundfile as sf

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-medium-mustc-multilingual-st")

def map_to_array(batch):
    # Read the raw waveform from the audio file path stored in the dataset
    speech, _ = sf.read(batch["file"])
    batch["speech"] = speech
    return batch

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
ds = ds.map(map_to_array)

# Extract filter bank features from the first utterance (16 kHz audio)
inputs = processor(ds["speech"][0], sampling_rate=16_000, return_tensors="pt")

# Translate English speech to French text
generated_ids = model.generate(
    inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    forced_bos_token_id=processor.tokenizer.lang_code_to_id["fr"],
)
translation_fr = processor.batch_decode(generated_ids, skip_special_tokens=True)

# Translate English speech to German text
generated_ids = model.generate(
    inputs["input_features"],
    attention_mask=inputs["attention_mask"],
    forced_bos_token_id=processor.tokenizer.lang_code_to_id["de"],
)
translation_de = processor.batch_decode(generated_ids, skip_special_tokens=True)
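The decoding convention described above (start from eos_token_id, then force the language id as the first generated token) can be illustrated with a toy greedy decoder. This is a sketch only: the token ids, the toy_model function and greedy_decode are hypothetical and do not reproduce the library's generate() implementation.

```python
from typing import Callable, List

EOS_ID = 2        # hypothetical id for the end-of-sequence token
LANG_FR_ID = 100  # hypothetical id for the French language token


def greedy_decode(next_token: Callable[[List[int]], int], eos_id: int,
                  forced_bos_id: int, max_len: int = 10) -> List[int]:
    """Toy greedy loop: start from EOS, force the language token, then decode freely."""
    tokens = [eos_id]              # eos_token_id acts as decoder_start_token_id
    tokens.append(forced_bos_id)   # first generated token is forced to the language id
    while len(tokens) < max_len:
        token = next_token(tokens)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def toy_model(prefix: List[int]) -> int:
    """Stand-in for a real decoder: emits a fixed script, then EOS."""
    script = [11, 12, EOS_ID]
    step = len(prefix) - 2  # number of tokens decoded after the forced language token
    return script[step] if step < len(script) else EOS_ID


print(greedy_decode(toy_model, EOS_ID, LANG_FR_ID))  # [2, 100, 11, 12, 2]
```

Swapping the forced id for another language token is the only change needed to steer the same decoder toward a different target language, which mirrors what the forced_bos_token_id argument does in the example above.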
Training data
The s2t-medium-mustc-multilingual-st model is trained on MuST-C.
MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems
for speech translation from English into several languages. For each target language, MuST-C comprises several hundred
hours of audio recordings from English TED Talks, which are automatically aligned at the sentence level with their manual
transcriptions and translations.
Training procedure
Preprocessing
The speech data is pre-processed by automatically extracting Kaldi-compliant 80-channel log mel-filter bank features from
WAV/FLAC audio files via PyKaldi or torchaudio. Utterance-level CMVN (cepstral mean and variance normalization)
is then applied to each example.
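As an illustration of the utterance-level CMVN step, the sketch below normalizes each feature channel to zero mean and unit variance over the frames of a single utterance. It uses plain Python lists for clarity; the function name and the eps constant are assumptions, and a real pipeline would operate on arrays or tensors.

```python
import math
from typing import List


def utterance_cmvn(features: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Normalize every channel over one utterance; `features` is frames x channels."""
    num_frames = len(features)
    num_channels = len(features[0])
    # Per-channel mean over all frames of this utterance
    means = [sum(frame[c] for frame in features) / num_frames for c in range(num_channels)]
    # Per-channel (population) standard deviation over the same frames
    stds = [
        math.sqrt(sum((frame[c] - means[c]) ** 2 for frame in features) / num_frames)
        for c in range(num_channels)
    ]
    # eps guards against division by zero for constant channels
    return [
        [(frame[c] - means[c]) / (stds[c] + eps) for c in range(num_channels)]
        for frame in features
    ]


# Two frames, two channels: each channel ends up at roughly -1 and +1.
print(utterance_cmvn([[1.0, 10.0], [3.0, 30.0]]))
```

Because the statistics are computed per utterance, the normalization needs no global statistics from the training set.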
The texts are lowercased and tokenized using SentencePiece and a vocabulary size of 10,000.
Training
The model is trained with a standard autoregressive cross-entropy loss and SpecAugment.
The encoder receives speech features, and the decoder generates the transcripts autoregressively. To accelerate
model training and improve performance, the encoder is pre-trained for multilingual ASR. For multilingual models, the target language ID token
is used as the target BOS.
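For readers unfamiliar with SpecAugment, the following simplified sketch applies one frequency mask and one time mask to a feature matrix. It omits time warping, and the mask widths and function name are hypothetical; they are not the settings used to train this model.

```python
import random
from typing import List, Optional


def spec_augment(features: List[List[float]], max_freq_width: int = 2,
                 max_time_width: int = 2,
                 rng: Optional[random.Random] = None) -> List[List[float]]:
    """Zero out one random band of channels and one random span of frames."""
    rng = rng or random.Random()
    num_frames, num_channels = len(features), len(features[0])
    masked = [frame[:] for frame in features]  # copy so the input stays untouched

    # Frequency mask: a block of consecutive channels, zeroed in every frame
    freq_width = rng.randint(0, min(max_freq_width, num_channels))
    freq_start = rng.randint(0, num_channels - freq_width)
    for frame in masked:
        for c in range(freq_start, freq_start + freq_width):
            frame[c] = 0.0

    # Time mask: a block of consecutive frames, zeroed across all channels
    time_width = rng.randint(0, min(max_time_width, num_frames))
    time_start = rng.randint(0, num_frames - time_width)
    for t in range(time_start, time_start + time_width):
        masked[t] = [0.0] * num_channels

    return masked


toy_features = [[1.0] * 4 for _ in range(6)]  # 6 frames x 4 channels
augmented = spec_augment(toy_features, rng=random.Random(0))
```

Masking is applied to the input features only during training, so the model learns to translate from partially corrupted spectrograms.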
Evaluation results
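MuST-C test results (BLEU score):

| En-De | En-Nl | En-Es | En-Fr | En-It | En-Pt | En-Ro | En-Ru |
|-------|-------|-------|-------|-------|-------|-------|-------|
| 24.5  | 28.6  | 28.2  | 34.9  | 24.6  | 31.1  | 23.8  | 16.0  |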
BibTeX entry and citation info
@inproceedings{wang2020fairseqs2t,
title = {fairseq S2T: Fast Speech-to-Text Modeling with fairseq},
author = {Changhan Wang and Yun Tang and Xutai Ma and Anne Wu and Dmytro Okhonko and Juan Pino},
booktitle = {Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations},
year = {2020},
}