
WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

Aäron van den Oord, Karen Simonyan, Nal Kalchbrenner,

Sander Dieleman, Oriol Vinyals, Andrew Senior,

Heiga Zen†, Alex Graves, Koray Kavukcuoglu

Google DeepMind, London, UK

†Google, London, UK

ABSTRACT

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.

1 INTRODUCTION

This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Józefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation. Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).

Figure 1: A second of generated speech.

This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:

• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.


• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.

• We show that a single model can be used to generate different voices, conditioned on a speaker identity.

• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2 WAVENET

In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x_1, ..., x_T} is factorised as a product of conditional probabilities as follows:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})    (1)

Each audio sample x_t is therefore conditioned on the samples at all previous timesteps.

Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.

2.1 DILATED CAUSAL CONVOLUTIONS

Figure 2: Visualization of a stack of causal convolutional layers.

The main ingredient of WaveNet are causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(x_{t+1} | x_1, ..., x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, ..., x_T. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a), which can be implemented by constructing a mask tensor and multiplying this elementwise with the convolution kernel before applying it. For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters, to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length - 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.
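The shift-based implementation described above can be sketched in NumPy. This is a minimal illustration of a single causal convolution layer (left-padding so a plain "valid" convolution never sees future samples), not the paper's actual implementation; the function name is chosen here for clarity.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[:t+1].

    Implemented by left-padding the input with zeros so that a plain
    'valid' convolution never looks at future samples.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # np.convolve flips the kernel; flip it back so this acts as a
    # correlation-style filter over past samples
    return np.convolve(padded, kernel[::-1], mode="valid")

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, 0.5])   # simple 2-tap averaging filter
y = causal_conv1d(x, kernel)
# y[0] uses only x[0] and the zero padding: 0.5*0 + 0.5*1 = 0.5
```

Note that the output has the same length as the input, matching the paper's requirement that the model output keep the time dimensionality of the input.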

A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation 1 yields the standard convolution. Fig. 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8. Dilated convolutions have previously been used in various contexts, e.g. signal processing (Holschneider et al., 1989; Dutilleux, 1989) and image segmentation (Chen et al., 2015; Yu & Koltun, 2016).
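A direct way to see the "skipping input values with a certain step" definition is the naive loop below. This is an illustrative sketch (not an efficient or official implementation): each output sample sums filter taps applied to inputs spaced `dilation` steps apart, with zero padding on the left to keep the operation causal.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """Dilated causal 1-D convolution:
    out[t] = sum_j kernel[j] * x[t - j*dilation],
    with zeros substituted for indices before the start of the signal.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            # taps are spaced `dilation` steps apart, reaching only backwards
            out[t] += kernel[j] * padded[pad + t - j * dilation]
    return out

x = np.arange(1.0, 9.0)          # [1, 2, ..., 8]
kernel = np.array([1.0, 1.0])    # 2-tap sum filter
# dilation 1 is the standard causal convolution (special case noted above)
assert np.allclose(dilated_causal_conv1d(x, kernel, 1)[1:], x[1:] + x[:-1])
# dilation 4 pairs each sample with the one 4 steps earlier
y = dilated_causal_conv1d(x, kernel, 4)
```

As in the text, the output has the same size as the input, unlike pooling or strided convolutions.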

Figure 3: Visualization of a stack of dilated causal convolutional layers (dilations 1, 2, 4, and 8 from input to output).

Stacked dilated convolutions efficiently enable very large receptive fields with just a few layers, while preserving the input resolution throughout the network. In this paper, the dilation is doubled for every layer up to a certain point and then repeated: e.g.

1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.

The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example each 1, 2, 4, ..., 512 block has receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.
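The receptive field sizes quoted above are easy to verify. For filters of length 2, each layer with dilation d extends the receptive field by d samples, giving 1 + Σ d over all layers; the short check below (an illustration, with hypothetical function names) reproduces the 1024 figure for one 1, 2, 4, ..., 512 block.

```python
def receptive_field(dilations, filter_length=2):
    """Receptive field of stacked dilated causal convolutions:
    1 + sum over layers of (filter_length - 1) * dilation."""
    return 1 + sum((filter_length - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]   # dilations 1, 2, 4, ..., 512
print(receptive_field(block))         # 1024, as stated in the text
print(receptive_field(block * 3))     # 3070 for three repeated blocks
```

Repeating the block three times therefore roughly triples the receptive field while multiplying the capacity, matching the second point of the intuition.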

2.2 SOFTMAX DISTRIBUTIONS

One approach to modeling the conditional distributions p(x_t | x_1, ..., x_{t-1}) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.

Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:

f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)


where −1 < x_t < 1 and μ = 255.
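The companding-then-quantization pipeline can be sketched as follows. This is a minimal NumPy illustration of the μ-law transformation above with a uniform 256-bin quantizer and its approximate inverse; the bin-mapping details (rounding, integer dtype) are implementation choices made here, not prescribed by the paper.

```python
import numpy as np

def mu_law_encode(x, mu=255, levels=256):
    """mu-law compand x in [-1, 1], then quantize to `levels` integer codes.

    f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), mapped to [0, levels-1].
    """
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # map [-1, 1] -> [0, levels - 1] and round to the nearest bin
    return ((compressed + 1) / 2 * (levels - 1) + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=255, levels=256):
    """Approximate inverse: undo the quantization, then expand."""
    y = 2 * codes.astype(np.float64) / (levels - 1) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
codes = mu_law_encode(x)    # integers in [0, 255]
recon = mu_law_decode(codes)
```

Because the companding is logarithmic, the 256 bins are dense near zero, where most speech samples lie, which is why this non-linear quantization loses much less audible detail than a uniform 8-bit scheme would.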