Amazon’s 36 ICASSP papers touch on everything audio

The International Conference on Acoustics, Speech, and Signal Processing (ICASSP) starts next week, and as Alexa principal research scientist Ariya Rastrow explained last year, it casts a wide net. The topics of the 36 Amazon research papers at this year’s ICASSP range from the classic signal-processing problems of noise and echo cancellation to such far-flung problems as separating song vocals from instrumental tracks and regulating translation length. 

A plurality of the papers, however, concentrate on the core technology of automatic speech recognition (ASR), or converting an acoustic speech signal into text:

  • ASR n-best fusion nets
    Xinyue Liu, Mingda Li, Luoxin Chen, Prashan Wanigasekara, Weitong Ruan, Haidar Khan, Wael Hamza, Chengwei Su
  • Bifocal neural ASR: Exploiting keyword spotting for inference optimization
    Jon Macoskey, Grant P. Strimel, Ariya Rastrow
  • Domain-aware neural language models for speech recognition
    Linda Liu, Yile Gu, Aditya Gourav, Ankur Gandhe, Shashank Kalmane, Denis Filimonov, Ariya Rastrow, Ivan Bulyko
  • End-to-end multi-channel transformer for speech recognition
    Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Brian King, Siegfried Kunzmann
  • Improved robustness to disfluencies in RNN-transducer-based speech recognition
    Valentin Mendelev, Tina Raissi, Guglielmo Camporese, Manuel Giollo 
  • Personalization strategies for end-to-end speech recognition systems
    Aditya Gourav, Linda Liu, Ankur Gandhe, Yile Gu, Guitang Lan, Xiangyang Huang, Shashank Kalmane, Gautam Tiwari, Denis Filimonov, Ariya Rastrow, Andreas Stolcke, Ivan Bulyko
  • reDAT: Accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling
    Hu Hu, Xuesong Yang, Zeynab Raeesy, Jinxi Guo, Gokce Keskin, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Roland Maas
  • Sparsification via compressed sensing for automatic speech recognition
    Kai Zhen, Hieu Duy Nguyen, Feng-Ju Chang, Athanasios Mouchtaris, Ariya Rastrow
  • Streaming multi-speaker ASR with RNN-T
    Ilya Sklyar, Anna Piunova, Yulan Liu
  • Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems
    Xianrui Zheng, Yulan Liu, Deniz Gunceler, Daniel Willett
To enable personalization of end-to-end automatic-speech-recognition systems, Linda Liu, Aditya Gourav, and their colleagues use a word-level biasing finite state transducer, or FST (left). A subword-level FST preserves the weights of the word-level FST. For instance, the weight between states 0 and 5 of the subword-level FST (representing the word “player”) is (-1.6) + (-1.6) + (-4.8) = -8.
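The arithmetic in the caption illustrates the property being preserved: the arc weights along a subword path sum to the weight of the corresponding word-level arc. A minimal sketch of that check, treating the subword path simply as a list of arc weights (an assumption for illustration, not the paper's actual FST construction), might look like this:

```python
# Illustrative sketch only: the way the subword path is represented here is
# an assumption for the example, not code from the paper.

def weight_preserved(word_weight: float, subword_arc_weights: list[float]) -> bool:
    """The subword-level FST preserves the word-level weight when the arc
    weights along the subword path sum to the word-level arc weight."""
    return abs(sum(subword_arc_weights) - word_weight) < 1e-9

# The figure's example: the path for "player" between states 0 and 5 carries
# arcs weighted -1.6, -1.6, and -4.8, which sum to the word-level weight of -8.
print(weight_preserved(-8.0, [-1.6, -1.6, -4.8]))  # True
```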

Two of the papers address language (or code) switching, a more complicated version of ASR in which the speech recognizer must also determine which of several possible languages is being spoken: 

  • Joint ASR and language identification using RNN-T: An efficient approach to dynamic language switching
    Surabhi Punjabi, Harish Arsikere, Zeynab Raeesy, Chander Chandak, Nikhil Bhave, Markus Mueller, Sergio Murillo, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Sri Garimella, Roland Maas, Mat Hans, Athanasios Mouchtaris, Siegfried Kunzmann
  • Transformer-transducers for code-switched speech recognition
    Siddharth Dalmia, Yuzong Liu, Srikanth Ronanki, Katrin Kirchhoff

The acoustic speech signal contains more information than just the speaker’s words; how the words are said can change their meaning. Such paralinguistic signals can be useful for a voice agent trying to determine how to interpret the raw text. Two of Amazon’s ICASSP papers focus on such signals:

  • Contrastive unsupervised learning for speech emotion recognition
    Mao Li, Bo Yang, Joshua Levy, Andreas Stolcke, Viktor Rozgic, Spyros Matsoukas, Constantinos Papayiannis, Daniel Bone, Chao Wang
  • Disentanglement for audiovisual emotion recognition using multitask setup
    Raghuveer Peri, Srinivas Parthasarathy, Charles Bradshaw, Shiva Sundaram

Several papers address other extensions of ASR, such as speaker diarization, or tracking which of several speakers issues each utterance; inverse text normalization, or converting the raw ASR output into a format useful to downstream applications; and acoustic event classification, or recognizing sounds other than human voices.
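As a rough illustration of what inverse text normalization involves, the toy rule set below rewrites a spoken-form ASR hypothesis into written form; the rules and example are assumptions for illustration, not the approach taken in Amazon's work:

```python
import re

# Toy rule-based inverse text normalization (an illustrative sketch only).
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
    "twenty": "20", "thirty": "30", "forty": "40", "fifty": "50",
}

def inverse_text_normalize(text: str) -> str:
    # Replace spelled-out numbers with digits.
    text = " ".join(NUMBER_WORDS.get(tok, tok) for tok in text.split())
    # Collapse "<hour> <minutes> p m" / "a m" into clock time, e.g. "5:30 PM".
    text = re.sub(r"\b(\d{1,2}) (\d{2}) p m\b", r"\1:\2 PM", text)
    text = re.sub(r"\b(\d{1,2}) (\d{2}) a m\b", r"\1:\2 AM", text)
    return text

print(inverse_text_normalize("set an alarm for five thirty p m"))
# -> "set an alarm for 5:30 PM"
```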

The structure of a joint echo control and noise suppression system from Amazon. A microphone (mic) captures the output of a loudspeaker, along with noise and echo. The echo is partially cancelled by an adaptive filter (f), which uses the signal sent to the loudspeaker. The microphone signal then passes to a residual-echo-suppression (RES) algorithm.

Speech enhancement, or removing noise and echo from the speech signal, has been a prominent topic at ICASSP since the conference began in 1976. But more recent work on the topic — including Amazon’s two papers this year — uses deep-learning methods.
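The adaptive-filter stage shown in the figure above can be sketched with a textbook normalized-LMS (NLMS) update. The example below is only an illustration of that classic approach, with made-up signals; the deep-learning systems in Amazon's papers are considerably more sophisticated:

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=128, mu=0.5, eps=1e-6):
    """Toy normalized-LMS echo canceller: estimate the echo path from the
    far-end (loudspeaker) signal and subtract the predicted echo from the
    microphone signal, leaving a residual for a later suppression stage."""
    w = np.zeros(filter_len)                    # adaptive filter taps
    residual = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = far_end[n - filter_len:n][::-1]     # most recent far-end samples, newest first
        echo_est = w @ x                        # predicted echo at the microphone
        e = mic[n] - echo_est                   # cancel the predicted echo
        w += mu * e * x / (x @ x + eps)         # NLMS weight update
        residual[n] = e
    return residual                             # input to residual-echo suppression

# Example: a far-end signal convolved with a short, made-up "room" response
# plus near-end noise stands in for the microphone signal.
rng = np.random.default_rng(0)
far = rng.standard_normal(16000)
mic = np.convolve(far, [0.0, 0.6, 0.3, 0.1])[:16000] + 0.01 * rng.standard_normal(16000)
cleaned = nlms_echo_canceller(far, mic)
```

The residual returned here is what a residual-echo-suppression stage, neural or otherwise, would then clean up further.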

Every interaction with Alexa begins with a wake word — usually “Alexa”, but sometimes “Computer” or “Echo”. So at ICASSP, Amazon usually presents work on wake word detection — or keyword spotting, as it’s more generally known.

In many spoken-language systems, the next step after ASR is natural-language understanding (NLU), or making sense of the text output from the ASR system.

In some contexts, however, it’s possible to perform both ASR and NLU with a single model, in a task known as spoken-language understanding:

  • Do as I mean, not as I say: Sequence loss training for spoken language understanding
    Milind Rao, Pranav Dheram, Gautam Tiwari, Anirudh Raju, Jasha Droppo, Ariya Rastrow, Andreas Stolcke
  • Graph enhanced query rewriting for spoken language understanding system
    Siyang Yuan, Saurabh Gupta, Xing Fan, Derek Liu, Yang Liu, Chenlei (Edward) Guo
  • Top-down attention in end-to-end spoken language understanding
    Yixin Chen, Weiyi Lu, Alejandro Mottini, Erran Li, Jasha Droppo, Zheng Du, Belinda Zeng
A spoken-language-understanding system combines automatic speech recognition (ASR) and natural-language understanding (NLU) in a single model.

An interaction with a voice service, which begins with keyword spotting, ASR, and NLU, often culminates with the agent’s use of synthesized speech to relay a response. The agent’s text-to-speech model converts the textual outputs of various NLU and dialogue systems into speech:

  • CAMP: A two-stage approach to modelling prosody in context
    Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman
  • Low-resource expressive text-to-speech using data augmentation
    Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba
  • Prosodic representation learning and contextual sampling for neural text-to-speech
    Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman
  • Universal neural vocoding with Parallel WaveNet
    Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

All of the preceding research topics have implications for voice services like Alexa, but Amazon has a range of other products and services that rely on audio-signal processing. Three of Amazon’s papers at this year’s ICASSP relate to audio-video synchronization: two deal with dubbing audio in one language onto video shot in another, and one describes how to detect synchronization errors in video — as when, for example, the sound of a tennis ball being struck and the shot of the racquet hitting the ball are misaligned:

  • Detection of audio-video synchronization errors via event detection
    Joshua P. Ebenezer, Yongjun Wu, Hai Wei, Sriram Sethuraman, Zongyi Liu
  • Improvements to prosodic alignment for automatic dubbing
    Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote
  • Machine translation verbosity control for automatic dubbing
    Surafel Melaku Lakew, Marcello Federico, Yue Wang, Cuong Hoang, Yogesh Virkar, Roberto Barra-Chicote, Robert Enyedi

Amazon’s Text-to-Speech team has an ICASSP paper on the unusual topic of computer-assisted pronunciation training, a feature of some language-learning applications. The researchers’ method would enable language-learning apps to accept a wider range of word pronunciations, to score pronunciations more accurately, and to provide more reliable feedback.

The architecture of a new Amazon model for separating a recording’s vocal tracks and instrumental tracks.

Another paper investigates the topic of singing voice separation, or separating vocal tracks from instrumental tracks in song recordings.

Finally, two of Amazon’s ICASSP papers, although they do evaluate applications in speech recognition and audio classification, present general machine learning methodologies that could apply to a range of problems. One paper investigates federated learning, a distributed-learning technique in which multiple servers, each with a different, local store of training data, collectively build a machine learning model without exchanging data. The other presents a new loss function for training classification models on synthetic data created by transforming real data — for instance, training a sound classification model with samples that have noise added to them artificially.
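For the federated-learning setting, the core idea can be sketched with a simple federated-averaging step; the function below is an illustrative assumption, not the method presented in the paper:

```python
import numpy as np

def federated_average(client_params, client_sizes):
    """Toy federated-averaging step (illustrative only): combine locally
    trained parameter vectors, weighted by each client's dataset size,
    without ever exchanging the raw training data."""
    total = sum(client_sizes)
    return sum(p * (n / total) for p, n in zip(client_params, client_sizes))

# Three "servers", each holding a different local dataset, share only their
# locally trained parameters with the coordinator.
client_params = [np.array([0.2, -1.0]), np.array([0.4, -0.8]), np.array([0.1, -1.2])]
client_sizes = [1000, 4000, 500]
print(federated_average(client_params, client_sizes))
```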

Also at ICASSP, on June 8, seven Amazon scientists will be participating in a half-hour live Q&A. Conference registrants may submit questions to the panelists online.


