Open-Source Speech Recognition Frameworks

Table of Contents

- Whisper
- ASRT
- DeepSpeech
- DeepSpeech2 (PaddleSpeech)
- ESPnet
- Kaldi
- sherpa-ncnn
- WeNet
- SpeechBrain
- Vosk API
- fairseq (traditional end-to-end framework)
- Eesen
- Athena
- PIKA
- SpeechLM (currently not usable)
- Alibaba-MIT-Speech

Whisper

Features

A Transformer sequence-to-sequence model is trained on various speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, allowing a single model to replace many stages of a traditional speech-processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.
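As a usage illustration, here is a minimal transcription sketch, assuming the openai-whisper package is installed and a local audio.mp3 exists (file name and model size are placeholders):

```python
import whisper

# Load one of the released checkpoints (tiny / base / small / medium / large).
model = whisper.load_model("base")

# transcribe() handles language detection and the multitask special tokens
# internally and returns the decoded text plus segment-level details.
result = model.transcribe("audio.mp3")
print(result["text"])
```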

GitHub Repository

https://github.com/openai/whisper

Documentation

https://openai.com/research/whisper

Paper Reference

https://arxiv.org/abs/2212.04356

ASRT

Features

ASRT (Auto Speech Recognition Tool) is a speech recognition system based on deep learning, developed by the blogger AI Lemon (AI柠檬) and open-sourced on GitHub under the GPL 3.0 license. The acoustic model combines a convolutional neural network (CNN) with Connectionist Temporal Classification (CTC) and is trained on large Chinese speech datasets to transcribe audio into Chinese pinyin; a language model then converts the pinyin sequence into Chinese text. The model has reached 80% accuracy on the test set. On top of this model, a speech recognition application for Windows has been built with good practical results; it includes a Windows 10 UWP store app and a .NET desktop app, both of which are also open-sourced on GitHub.

Environment

Hardware

- CPU: 4 cores (x86_64 / amd64) or more
- RAM: 16 GB or more
- GPU: NVIDIA, with 11 GB+ of graphics memory (GTX 1080 Ti or better)
- Disk: 500 GB HDD (or SSD)

Software

- Linux: Ubuntu 18.04+ / CentOS 7+, or Windows 10/11
- Python: 3.7 - 3.10 and later
- TensorFlow: 2.5 - 2.11 and later

GitHub Repository

https://github.com/nl8590687/ASRT_SpeechRecognition

Documentation

https://wiki.ailemon.net/docs/asrt-doc/asrt-doc-1demhoid4inc6

DeepSpeech

Features

DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow to make the implementation easier.

Rather than being phoneme-based, it is an end-to-end deep learning speech system, with an optimized RNN training setup that runs across multiple GPUs.
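A minimal inference sketch with the deepspeech Python package (the model and scorer file names below are placeholders for whichever release you download):

```python
import wave
import numpy as np
from deepspeech import Model

# Load a released acoustic model and an optional external scorer.
ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio.
with wave.open("audio.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```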

Environment

Based on TensorFlow.

GitHub Repository

https://github.com/mozilla/DeepSpeech

Documentation

https://deepspeech.readthedocs.io/en/r0.9/?badge=latest

Paper Reference

https://arxiv.org/abs/1412.5567

DeepSpeech2

Environment

Based on PaddlePaddle.

Via the easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial application and academic research, covering training, inference & testing modules, and the deployment process. More specifically, the toolkit offers:

- Ease of use: low barriers to install; CLI, Server, and Streaming Server are available for a quick start.
- Alignment with the state of the art: high-speed and ultra-lightweight models as well as cutting-edge technology.
- Streaming ASR and TTS systems: production-ready streaming ASR and streaming TTS.
- Rule-based Chinese frontend: text normalization and grapheme-to-phoneme conversion (G2P, including polyphone and tone sandhi), with self-defined linguistic rules to adapt to the Chinese context.
- A variety of functions for both industry and academia:
  - Implementation of critical audio tasks: automatic speech recognition, text-to-speech synthesis, speaker verification, keyword spotting, audio classification, speech translation, etc.
  - Integration of mainstream models and datasets: modules covering the whole pipeline of speech tasks, using mainstream datasets such as LibriSpeech, LJSpeech, AIShell, CSMSC, etc. (see the model list for details).
  - Cascaded model applications: the above tasks combined with other fields such as natural language processing (NLP) and computer vision (CV).
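A quick-start sketch of the Python interface (assuming paddlespeech is installed and zh.wav is a 16 kHz Mandarin recording; the equivalent CLI call is `paddlespeech asr --lang zh --input zh.wav`):

```python
from paddlespeech.cli.asr.infer import ASRExecutor

# The executor downloads a default Chinese ASR model on first use.
asr = ASRExecutor()
text = asr(audio_file="zh.wav")
print(text)
```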

GitHub Repository

https://github.com/PaddlePaddle/PaddleSpeech

Documentation

https://github.com/PaddlePaddle/PaddleSpeech#documents

Paper Reference

https://arxiv.org/abs/2205.12007

ESPNET

CMU updates its tutorials every year.

Features

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and so on. ESPnet uses PyTorch as a deep learning engine and also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for various speech processing experiments.
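A minimal inference sketch using a pretrained model from the ESPnet model zoo (assuming the espnet and espnet_model_zoo packages are installed; the model tag is a placeholder for any ESPnet2 ASR model name from the zoo):

```python
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

# Download a pretrained ESPnet2 ASR model and build the inference wrapper.
downloader = ModelDownloader()
speech2text = Speech2Text(**downloader.download_and_unpack("<espnet2-asr-model-tag>"))

# Decode a 16 kHz waveform; the n-best list is sorted, so take the first hypothesis.
speech, rate = soundfile.read("speech.wav")
nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)
```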

GitHub Repository

https://github.com/espnet/espnet

Documentation

https://espnet.github.io/espnet/

Kaldi

Features

Written in C++.

Kaldi versus other toolkits

Kaldi is similar in aims and scope to HTK. The goal is to have modern and flexible code, written in C++, that is easy to modify and extend. Important features include:

- Code-level integration with Finite State Transducers (FSTs): we compile against the OpenFst toolkit (using it as a library).
- Extensive linear algebra support: we include a matrix library that wraps standard BLAS and LAPACK routines.
- Extensible design: as far as possible, we provide our algorithms in the most generic form possible. For instance, our decoders are templated on an object that provides a score indexed by a (frame, fst-input-symbol) tuple. This means the decoder could work from any suitable source of scores, such as a neural net (see the sketch below).
- Open license: the code is licensed under Apache 2.0, which is one of the least restrictive licenses available.
- Complete recipes: our goal is to make available complete recipes for building speech recognition systems, that work from widely available databases such as those provided by the Linguistic Data Consortium (LDC).

The goal of releasing complete recipes is an important aspect of Kaldi. Since the code is publicly available under a license that permits modifications and re-release, we would like to encourage people to release their code, along with their script directories, in a similar format to Kaldi’s own example script.
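To make the extensible-design point concrete, here is a small conceptual sketch (in Python, deliberately not Kaldi's real C++ API; all names are illustrative) of a decoder that only talks to the acoustic model through a score indexed by a (frame, fst-input-symbol) pair:

```python
from typing import Protocol, Sequence

class Decodable(Protocol):
    """Anything that can score a (frame, fst-input-symbol) pair."""
    def log_likelihood(self, frame: int, symbol: int) -> float: ...
    def num_frames(self) -> int: ...

class NeuralNetDecodable:
    """Wraps per-frame log-posteriors produced by any model (GMM, DNN, ...)."""
    def __init__(self, log_posteriors: Sequence[Sequence[float]]):
        self.log_posteriors = log_posteriors  # shape: [frames][symbols]

    def log_likelihood(self, frame, symbol):
        return self.log_posteriors[frame][symbol]

    def num_frames(self):
        return len(self.log_posteriors)

def greedy_decode(model, num_symbols):
    """Toy 'decoder': pick the best symbol per frame.

    A real decoder walks an FST, but it would still only query the model
    through log_likelihood(frame, symbol), which is what makes it generic."""
    return [
        max(range(num_symbols), key=lambda s: model.log_likelihood(t, s))
        for t in range(model.num_frames())
    ]

# Example: two frames, three symbols.
print(greedy_decode(NeuralNetDecodable([[-0.1, -2.3, -3.0], [-1.5, -0.2, -2.8]]), 3))
```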

We have tried to make Kaldi’s documentation as complete as possible given time constraints, but in the short term we cannot hope to generate documentation that is as thorough as HTK’s. In particular there is a lot of introductory material in the HTKBook, explaining statistical speech recognition for the uninitiated, that will probably never appear in Kaldi’s documentation. Much of Kaldi’s documentation is written in such a way that it will only be accessible to an expert. In the future we hope to make it somewhat more accessible, bearing in mind that our intended audience is speech recognition researchers or researchers-in-training. In general, Kaldi is not a speech recognition toolkit “for dummies.” It will allow you to do many kinds of operations that don’t make sense.

The flavor of Kaldi

In this section we attempt to summarize some of the more generic qualities of the Kaldi toolkit. To some extent this describes the goals of the current developers, as much as it describes the current status of the project. It is not meant to exclude contributions from researchers whose work has a different flavor.

- We emphasize generic algorithms and universal recipes: by "generic algorithms" we mean things like linear transforms, rather than those that are specific to speech in some way. But we don't intend to be too dogmatic about this, if more specific algorithms are useful. We would like recipes that can be run on any data set, rather than those that have to be customized.
- We prefer provably correct algorithms: the recipes have been designed in such a way that in principle they should never fail in a catastrophic way. There has been an effort to avoid recipes and algorithms that could possibly fail, even if they don't fail in the "normal case" (one example: FST weight-pushing, which normally helps but can crash or make things much worse in certain cases).
- Kaldi code is thoroughly tested: the goal is for all or nearly all the code to have corresponding test routines.
- We try to keep the simple cases simple: there is a danger when building a large speech toolkit that the code can become a forest of rarely used alternatives. We are trying to avoid this by structuring the toolkit in the following way. Each command-line program generally works for a limited set of cases (e.g. a decoder might just work for GMMs). Thus, when you add a new type of model, you create a new command-line decoder (that calls the same underlying templated code).
- Kaldi code is easy to understand: even though the Kaldi toolkit as a whole may get very large, we aim for each individual part of it to be understandable without too much effort. We will accept some code duplication if it improves the understandability of individual pieces.
- Kaldi code is easy to reuse and refactor: we aim for the toolkit to be as loosely coupled as possible. In general this means that any given header should need to #include as few other header files as possible. The matrix library, in particular, only depends on code in one other subdirectory so it can be used independently of almost all the rest of Kaldi.

GitHub Repository

https://github.com/kaldi-asr/kaldi

Documentation

http://kaldi-asr.org/doc/

https://github.com/mravanelli/pytorch-kaldi

sherpa-ncnn

Features

We support using ncnn to replace PyTorch for neural network computation. The code is put in a separate repository, sherpa-ncnn.

In the following, we describe how to build sherpa-ncnn for Linux, macOS, Windows, embedded systems, Android, and iOS.

GitHub Repository

https://github.com/k2-fsa/sherpa-ncnn

Documentation

https://k2-fsa.github.io/sherpa/ncnn/index.html

WeNet

Features

- Production first and production ready: the core design principle; WeNet provides full-stack production solutions for speech recognition.
- Accurate: WeNet achieves SOTA results on many public speech datasets.
- Lightweight: WeNet is easy to install, easy to use, well designed, and well documented.

GitHub Repository

https://github.com/wenet-e2e/wenet

Documentation

https://wenet.org.cn/wenet/index.html

Paper Reference

- WeNet: Production Oriented Streaming and Non-streaming End-to-End Speech Recognition Toolkit, accepted by InterSpeech 2021.
- WeNet 2.0: More Productive End-to-End Speech Recognition Toolkit, accepted by InterSpeech 2022.

SpeechBrain

Features & Environment

SpeechBrain provides various useful tools to speed up and facilitate research on speech and language technologies:

- Various pretrained models nicely integrated with Hugging Face in our official organization account. These models come with easy-inference interfaces that facilitate their use. To help everyone replicate our results, we also provide all the experimental results and folders (including logs, training curves, etc.) in a shared Google Drive folder.
- The Brain class, a fully-customizable tool for managing training and evaluation loops over data. The annoying details of training loops are handled for you while retaining complete flexibility to override any part of the process when needed.
- A YAML-based hyperparameter file that specifies all the hyperparameters, from individual numbers (e.g., learning rate) to complete objects (e.g., custom models). This elegant solution dramatically simplifies the training script.
- Multi-GPU training and inference with PyTorch Data-Parallel or Distributed Data-Parallel.
- Mixed-precision training for faster training.
- A transparent and entirely customizable data input and output pipeline. SpeechBrain follows the PyTorch data loading style and enables users to customize the I/O pipelines (e.g., adding on-the-fly downsampling, BPE tokenization, sorting, thresholding, ...).
- On-the-fly dynamic batching.
- Efficient reading of large datasets from a shared Network File System (NFS) via WebDataset.
- Interface with HuggingFace for popular models such as wav2vec2 and HuBERT.
- Interface with Orion for hyperparameter tuning.
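A minimal pretrained-inference sketch (assuming the speechbrain package is installed; the model source is one of the ASR pipelines published under the SpeechBrain HuggingFace organization, and the audio file name is a placeholder):

```python
from speechbrain.pretrained import EncoderDecoderASR

# Download a pretrained ASR pipeline from the HuggingFace hub and cache it locally.
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)

# Transcribe a local audio file with the easy-inference interface.
print(asr_model.transcribe_file("example.wav"))
```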

GitHub Repository

https://github.com/speechbrain/speechbrain

Documentation

https://speechbrain.readthedocs.io/en/latest/index.html

Vosk API

Features & Environment

Vosk is a speech recognition toolkit. The best things about Vosk are:

- Supports 20+ languages: Chinese, English, Indian English, German, French, Spanish, Portuguese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Persian, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech, Polish, Uzbek, Korean.
- Works offline, including on mobile devices: Raspberry Pi, Android, iOS.
- Installs with a simple pip3 install vosk.
- Portable per-language models are only about 50 MB each, but much larger server models are also available.
- Provides a streaming API for the best user experience (unlike popular speech recognition Python packages).
- Wrappers for other programming languages are also available: Java / C# / JavaScript, etc.
- The vocabulary can be quickly reconfigured for best accuracy.
- Supports speaker identification.
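A minimal streaming sketch with the Python wrapper (assuming pip3 install vosk, a model unpacked into a local directory named model, and a 16 kHz 16-bit mono WAV file):

```python
import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("model")            # path to an unpacked Vosk model directory
wf = wave.open("test.wav", "rb")  # 16 kHz, 16-bit mono PCM expected
rec = KaldiRecognizer(model, wf.getframerate())

# Feed audio in chunks, exactly as a microphone stream would be fed.
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])
```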

GitHub Repository

https://github.com/alphacep/vosk-api/

Documentation

https://alphacephei.com/vosk/

fairseq (traditional end-to-end framework)

Features

- Multi-GPU training on one machine or across multiple machines (data and model parallel).
- Fast generation on both CPU and GPU with multiple search algorithms implemented:
  - beam search
  - Diverse Beam Search (Vijayakumar et al., 2016)
  - sampling (unconstrained, top-k and top-p/nucleus)
  - lexically constrained decoding (Post & Vilar, 2018)
- Gradient accumulation enables training with large mini-batches even on a single GPU.
- Mixed-precision training (trains faster with less GPU memory on NVIDIA tensor cores).
- Extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers.
- Flexible configuration based on Hydra, allowing a combination of code, command-line and file-based configuration.
- Full parameter and optimizer state sharding.
- Offloading parameters to CPU.
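As a usage illustration, the torch.hub quick start from the fairseq README loads a pretrained translation model and runs beam-search generation (a sketch assuming torch plus the model's extra dependencies, such as fastBPE and sacremoses, are installed):

```python
import torch

# Load a pretrained WMT'19 English-German transformer from the fairseq hub.
en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de.single_model",
    tokenizer="moses",
    bpe="fastbpe",
)
en2de.eval()

# Beam search is one of the generation algorithms listed above.
print(en2de.translate("Machine learning is great!", beam=5))
```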

Environment

- PyTorch version >= 1.10.0
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL

GitHub Repository

https://github.com/facebookresearch/fairseq

Eesen

Features & Environment

- The WFST-based decoding approach can incorporate lexicons and language models into CTC decoding in an effective and efficient way.
- The RNN-LM decoding approach does not require a fixed lexicon.
- GPU implementation of LSTM model training and CTC learning, now also using TensorFlow.
- Multiple utterances are processed in parallel for training speed-up.
- Fully-fledged example setups to demonstrate end-to-end system building, with both phonemes and characters as labels, following Kaldi recipes and conventions.

GitHub Repository

https://github.com/srvk/eesen

Paper Reference

https://arxiv.org/abs/1507.08240

Athena

Features & Environment

- Hybrid attention/CTC based end-to-end and streaming methods (ASR)
- Text-to-speech (FastSpeech / FastSpeech2 / Transformer)
- Voice activity detection (VAD)
- Keyword spotting with end-to-end and streaming methods (KWS)
- ASR unsupervised pre-training (MPC)
- Multi-GPU training on one machine or across multiple machines with Horovod
- WFST creation and WFST-based decoding with C++
- Deployment with TensorFlow C++ (local server)

GitHub Repository

https://github.com/athena-team/athena

PIKA

Features

Uses PyTorch as the deep learning engine and Kaldi for data formatting and feature extraction.

- On-the-fly data augmentation and feature extraction loader
- TDNN Transformer encoder and convolution- and transformer-based decoder model structure
- RNNT training and batch decoding
- RNNT decoding with external n-gram FSTs (on-the-fly rescoring, aka shallow fusion)
- RNNT Minimum Bayes Risk (MBR) training
- LAS forward and backward rescorer for RNNT
- Efficient BMUF (block model update filtering) based distributed training

Environment

In general, we recommend Anaconda since it comes with most dependencies. Other major dependencies include:

Pytorch

Please go to https://pytorch.org/ for PyTorch installation. Code and scripts should be able to run against PyTorch 0.4.0 and above, but we recommend 1.0.0 or above for compatibility with the RNNT loss module (see below).

Pykaldi and Kaldi

We use Kaldi (https://github.com/kaldi-asr/kaldi) and PyKaldi (a Python wrapper for Kaldi) for data processing, feature extraction and FST manipulations. Please go to the PyKaldi website https://github.com/pykaldi/pykaldi for installation and make sure to build PyKaldi with ninja for efficiency. After following the PyKaldi installation process, you should have both the Kaldi and PyKaldi dependencies ready.

CUDA-Warp RNN-Transducer

For the RNNT loss module, we adopt the PyTorch binding at https://github.com/1ytic/warp-rnnt.
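A rough sketch of how that binding is typically called (an assumption based on the warp-rnnt Python binding's documented interface, not PIKA's own training code; the loss runs on CUDA tensors and expects log-softmax joint-network outputs of shape (batch, frames, labels+1, vocab)):

```python
import torch
from warp_rnnt import rnnt_loss  # CUDA build of the warp_rnnt package required

B, T, U, V = 2, 50, 10, 40  # batch, frames, label length, vocabulary size

# Joint-network outputs, normalized with log_softmax over the vocabulary.
log_probs = torch.randn(B, T, U + 1, V, device="cuda").log_softmax(dim=-1)
labels = torch.randint(1, V, (B, U), dtype=torch.int32, device="cuda")
frame_lengths = torch.full((B,), T, dtype=torch.int32, device="cuda")
label_lengths = torch.full((B,), U, dtype=torch.int32, device="cuda")

loss = rnnt_loss(log_probs, labels, frame_lengths, label_lengths)
print(loss)
```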

Others

Check requirements.txt for other dependencies.

GitHub Repository

https://github.com/tencent-ailab/pika

Documentation

https://github.com/tencent-ailab/pika

SpeechLM (currently not usable)

(We found some data issues in the pre-training experiments that affect all results of the SpeechLM-P Base models. We are re-running the related experiments and will update the paper with the new results.)

GitHub Repository

https://github.com/microsoft/SpeechT5/tree/main/SpeechLM

Documentation

https://github.com/microsoft/SpeechT5/tree/main/SpeechLM

Paper Reference

https://arxiv.org/abs/2209.15329

Alibaba-MIT-Speech

Features

This is a PATCH file with the DFSMN-related code and example scripts for the LibriSpeech task.

GitHub Repository

https://github.com/alibaba/Alibaba-MIT-Speech
