SONAR provides a set of speech and text encoders for multilingual, multimodal semantic embedding.
SONAR
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations
The full list of supported languages (along with download links) can be found below.
SONAR Architecture:
Model inference is supported by fairseq2.
Text results
Speech results
Installing
You can install SONAR with pip install sonar-space. Note that there is another sonar package on pip that IS NOT this project; make sure to use sonar-space in your dependencies.
If you want to install SONAR manually, you can install it locally. SONAR depends mainly on fairseq2 and can be installed as follows (tested with python=3.8):
pip install --upgrade pip
pip install -e .
If fairseq2 does not provide a build for your machine, check that project's README for instructions on building it locally.
Usage
fairseq2 will automatically download models into your $TORCH_HOME/hub directory when you run the commands below.
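If you would rather keep the checkpoints somewhere else, you can set the TORCH_HOME environment variable before loading a model; this is a minimal sketch relying on the standard torch.hub cache behavior, and the path is only an example:
import os
os.environ["TORCH_HOME"] = "/path/to/model/cache"  # example location, adjust as needed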
Compute text sentence embeddings with SONAR:
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
t2vec_model = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder",
tokenizer="text_sonar_basic_encoder")
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2vec_model.predict(sentences, source_lang="eng_Latn").shape
# torch.Size([2, 1024])
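The embeddings returned by predict are plain torch tensors, so sentences can be compared directly. For example, a quick cosine-similarity check (a sketch using PyTorch utilities, not part of the SONAR API):
import torch.nn.functional as F

embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")
# Cosine similarity between the two sentence embeddings above.
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(similarity.item())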
Translate text with SONAR
from sonar.inference_pipelines.text import TextToTextModelPipeline
t2t_model = TextToTextModelPipeline(encoder="text_sonar_basic_encoder",
decoder="text_sonar_basic_decoder",
tokenizer="text_sonar_basic_encoder") # tokenizer is attached to both encoder and decoder cards
sentences = ['My name is SONAR.', 'I can embed the sentences into vectorial space.']
t2t_model.predict(sentences, source_lang="eng_Latn", target_lang="fra_Latn")
# ['Mon nom est SONAR.', "Je peux intégrer les phrases dans l'espace vectoriel."]
Compute speech sentence embeddings with SONAR
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
s2vec_model = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
s2vec_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
"./tests/integration_tests/data/audio_files/audio_2.wav"]).shape
# torch.Size([2, 1024])
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
s2vec_model.predict([inp]).shape
# torch.Size([1, 1024])
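The assert above reflects that the speech encoders expect 16 kHz audio. If your recordings have a different sample rate, you can resample them with torchaudio before calling predict (a sketch; the file path is illustrative):
import torchaudio
import torchaudio.functional as F_audio

wav, sr = torchaudio.load("/path/to/your_audio.wav")  # illustrative path
if sr != 16000:
    # Resample to the 16 kHz rate expected by the speech encoders.
    wav = F_audio.resample(wav, orig_freq=sr, new_freq=16000)
s2vec_model.predict([wav]).shape
# torch.Size([1, 1024])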
Speech-to-text translation with SONAR
from sonar.inference_pipelines.speech import SpeechToTextModelPipeline
s2t_model = SpeechToTextModelPipeline(encoder="sonar_speech_encoder_eng",
decoder="text_sonar_basic_decoder",
tokenizer="text_sonar_basic_decoder")
import torchaudio
inp, sr = torchaudio.load("./tests/integration_tests/data/audio_files/audio_1.wav")
assert sr == 16000, "Sample rate should be 16kHz"
# passing loaded audio files
s2t_model.predict([inp], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.']
# passing multiple wav files
s2t_model.predict(["./tests/integration_tests/data/audio_files/audio_1.wav",
"./tests/integration_tests/data/audio_files/audio_2.wav"], target_lang="eng_Latn")
# ['Television reports show white smoke coming from the plant.',
# 'These couples may choose to make an adoption plan for their baby.']
Predicting cross-lingual semantic similarity with BLASER 2 models
import torch
from sonar.models.blaser.loader import load_blaser_model
blaser_ref = load_blaser_model("blaser_st2st_ref_v2_0").eval()
blaser_qe = load_blaser_model("blaser_st2st_qe_v2_0").eval()
# BLASER-2 is supposed to work with SONAR speech and text embeddings,
# but we didn't include their extraction in this snippet, to keep it simple.
emb = torch.ones([1, 1024])
print(blaser_ref(src=emb, ref=emb, mt=emb).item()) # 5.2552
print(blaser_qe(src=emb, mt=emb).item()) # 4.9819
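To score real translations rather than the dummy tensor above, you can pass actual SONAR embeddings to the BLASER models, for instance by reusing the t2vec_model text pipeline defined earlier (the example sentences are illustrative):
# Embed source, reference and machine translation with the SONAR text encoder.
src_emb = t2vec_model.predict(["Le chat dort sur le canapé."], source_lang="fra_Latn")
ref_emb = t2vec_model.predict(["The cat is sleeping on the couch."], source_lang="eng_Latn")
mt_emb = t2vec_model.predict(["The cat sleeps on the sofa."], source_lang="eng_Latn")
# Reference-based and reference-free (QE) quality scores.
print(blaser_ref(src=src_emb, ref=ref_emb, mt=mt_emb).item())
print(blaser_qe(src=src_emb, mt=mt_emb).item())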
More complete demo notebooks are available in the project repository.
Supported languages and download links
The SONAR text encoder and decoder support 200 languages. SONAR speech encoders support 37 languages.
Available text encoders/decoders
model | link |
---|---|
encoder | download |
decoder | download |
finetuned decoder | download |
tokenizer | download |
All 200 languages from the No Language Left Behind project are supported.
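These are the same NLLB-200 language codes (e.g. eng_Latn, fra_Latn, deu_Latn) that the text pipelines expect as source_lang and target_lang. For instance, German input can be translated by reusing the t2t_model defined above (output not shown):
# "Wie heißt du?" means "What is your name?"
t2t_model.predict(["Wie heißt du?"], source_lang="deu_Latn", target_lang="eng_Latn")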
Available speech encoders
lang_code | language | link |
---|---|---|
arb | modern standard arabic | download |
ben | bengali | download |
cat | catalan | download |
ces | czech | download |
cmn | mandarin chinese | download |
cym | welsh | download |
dan | danish | download |
deu | german | download |
est | estonian | download |
fin | finnish | download |
fra | french | download |
hin | hindi | download |
ind | indonesian | download |
ita | italian | download |
jpn | japanese | download |
kan | kannada | download |
kor | korean | download |
mlt | maltese | download |
nld | dutch | download |
pes | western persian | download |
pol | polish | download |
por | portuguese | download |
ron | romanian | download |
rus | russian | download |
slk | slovak | download |
spa | spanish | download |
swe | swedish | download |
swh | swahili | download |
tam | tamil | download |
tel | telugu | download |
tgl | tagalog | download |
tha | thai | download |
tur | turkish | download |
ukr | ukrainian | download |
urd | urdu | download |
uzn | northern uzbek | download |
vie | vietnamese | download |
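Assuming the encoder card names follow the same pattern as the English encoder used earlier (sonar_speech_encoder_eng), an encoder for another language in this table would presumably be loaded by substituting its lang_code; this naming convention is an assumption rather than something documented above:
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline

# Assumed naming convention: "sonar_speech_encoder_" + lang_code (not verified here).
s2vec_model_fra = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_fra")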
Citation Information
Please cite the paper when referencing the SONAR embedding space, encoders and decoders as:
@misc{Duquenne:2023:sonar_arxiv,
author = {Paul-Ambroise Duquenne and Holger Schwenk and Benoit Sagot},
title = {{SONAR:} Sentence-Level Multimodal and Language-Agnostic Representations},
publisher = {arXiv},
year = {2023},
url = {https://arxiv.org/abs/unk},
}
Contributing
See the CONTRIBUTING file for how to help out.
License
SONAR code and models are CC-BY-NC 4.0 licensed. See LICENSE.md.