WAVENET: A GENERATIVE MODEL FOR RAW AUDIO

Aäron van den Oord, Karen Simonyan, Nal Kalchbrenner,
Sander Dieleman, Oriol Vinyals, Andrew Senior,
Heiga Zen†, Alex Graves, Koray Kavukcuoglu

Google DeepMind, London, UK
†Google, London, UK
ABSTRACT
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
1 INTRODUCTION
This work explores raw audio generation techniques, inspired by recent advances in neural autoregressive generative models that model complex distributions such as images (van den Oord et al., 2016a;b) and text (Józefowicz et al., 2016). Modeling joint probabilities over pixels or words using neural architectures as products of conditional distributions yields state-of-the-art generation. Remarkably, these architectures are able to model distributions over thousands of random variables (e.g. 64×64 pixels as in PixelRNN (van den Oord et al., 2016a)). The question this paper addresses is whether similar approaches can succeed in generating wideband raw audio waveforms, which are signals with very high temporal resolution, at least 16,000 samples per second (see Fig. 1).
Figure 1: A second of generated speech.
This paper introduces WaveNet, an audio generative model based on the PixelCNN (van den Oord et al., 2016a;b) architecture. The main contributions of this work are as follows:
• We show that WaveNets can generate raw speech signals with subjective naturalness never before reported in the field of text-to-speech (TTS), as assessed by human raters.
• In order to deal with long-range temporal dependencies needed for raw audio generation, we develop new architectures based on dilated causal convolutions, which exhibit very large receptive fields.
• We show that a single model can be used to generate different voices, conditioned on a speaker identity.

• The same architecture shows strong results when tested on a small speech recognition dataset, and is promising when used to generate other audio modalities such as music.

We believe that WaveNets provide a generic and flexible framework for tackling many applications that rely on audio generation (e.g. TTS, music, speech enhancement, voice conversion, source separation).

2 WAVENET

In this paper audio signals are modelled with a generative model operating directly on the raw audio waveform. The joint probability of a waveform x = {x_1, ..., x_T} is factorised as a product of conditional probabilities as follows:

p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (1)

Each audio sample x_t is therefore conditioned on the samples at all previous timesteps. Similarly to PixelCNNs (van den Oord et al., 2016a;b), the conditional probability distribution is modelled by a stack of convolutional layers. There are no pooling layers in the network, and the output of the model has the same time dimensionality as the input. The model outputs a categorical distribution over the next value x_t with a softmax layer. The model is optimized to maximize the log-likelihood of the data w.r.t. the parameters. Because log-likelihoods are tractable, we tune hyper-parameters on a validation set and can easily measure overfitting/underfitting.

2.1 DILATED CAUSAL CONVOLUTIONS

Figure 2: Visualization of a stack of causal convolutional layers.

The main ingredient of WaveNet are causal convolutions. By using causal convolutions, we make sure the model cannot violate the ordering in which we model the data: the prediction p(x_{t+1} | x_1, ..., x_t) emitted by the model at timestep t cannot depend on any of the future timesteps x_{t+1}, x_{t+2}, ..., x_T. This is visualized in Fig. 2. For images, the equivalent of a causal convolution is a masked convolution (van den Oord et al., 2016a) which can be implemented by constructing a mask tensor and multiplying this elementwise with the convolution kernel before applying it. For 1-D data such as audio one can more easily implement this by shifting the output of a normal convolution by a few timesteps.

At training time, the conditional predictions for all timesteps can be made in parallel because all timesteps of ground truth x are known. When generating with the model, the predictions are sequential: after each sample is predicted, it is fed back into the network to predict the next sample.

Because models with causal convolutions do not have recurrent connections, they are typically faster to train than RNNs, especially when applied to very long sequences. One of the problems of causal convolutions is that they require many layers, or large filters to increase the receptive field. For example, in Fig. 2 the receptive field is only 5 (= #layers + filter length - 1). In this paper we use dilated convolutions to increase the receptive field by orders of magnitude, without greatly increasing computational cost.
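The shift-based implementation mentioned above can be sketched in a few lines. This is a minimal numpy illustration of the padding trick only, not the learned WaveNet layer itself; the helper name `causal_conv1d` is our own.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution (cross-correlation, as in deep-learning
    libraries): y[t] depends only on x[t-k+1 .. t].

    Implemented by left-padding the input with zeros so that a 'normal'
    (valid) convolution never sees future samples.
    """
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad on the left only
    # One window per output timestep; output has the same length as x.
    return np.array([w @ padded[t:t + k] for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])        # averages the current and previous sample
y = causal_conv1d(x, w)         # y[t] = 0.5*x[t-1] + 0.5*x[t], x[-1] = 0
```

Because the padding is entirely on the left, every output at time t is computed from inputs up to t only, which is exactly the ordering constraint described above; during training all timesteps of y can be computed in one pass.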
A dilated convolution (also called à trous, or convolution with holes) is a convolution where the filter is applied over an area larger than its length by skipping input values with a certain step. It is equivalent to a convolution with a larger filter derived from the original filter by dilating it with zeros, but is significantly more efficient. A dilated convolution effectively allows the network to operate on a coarser scale than with a normal convolution. This is similar to pooling or strided convolutions, but here the output has the same size as the input. As a special case, dilated convolution with dilation 1 yields the standard convolution. Fig. 3 depicts dilated causal convolutions for dilations 1, 2, 4, and 8. Dilated convolutions have previously been used in various contexts, e.g. signal processing (Holschneider et al., 1989; Dutilleux, 1989), and image segmentation (Chen et al., 2015; Yu & Koltun, 2016).
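A minimal sketch of a dilated causal convolution, again in numpy and with hypothetical helper names: the filter taps are spaced `dilation` steps apart, and the receptive field of a stack is 1 plus the sum of (filter length − 1) × dilation over the layers, which for a filter length of 2 and dilations 1, 2, 4, ..., 512 gives the 1024 samples stated below.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Causal convolution whose taps are `dilation` steps apart:
    y[t] = w[0]*x[t-(k-1)*dilation] + ... + w[k-1]*x[t], zero-padded on the left."""
    k = len(w)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    return np.array([
        w @ padded[t : t + pad + 1 : dilation]   # k taps spaced by `dilation`
        for t in range(len(x))
    ])

def receptive_field(dilations, filter_length=2):
    """Receptive field (in samples) of a stack of dilated causal convolutions."""
    return 1 + sum((filter_length - 1) * d for d in dilations)

x = np.arange(1.0, 9.0)                      # [1, 2, ..., 8]
y = dilated_causal_conv1d(x, np.array([1.0, 1.0]), dilation=2)  # y[t] = x[t-2] + x[t]

block = [2 ** i for i in range(10)]          # dilations 1, 2, 4, ..., 512
rf = receptive_field(block)                  # 1024 for one such block
```

With dilation 1 this reduces to the ordinary causal convolution, matching the special case noted above.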
Figure 3: Visualization of a stack of dilated causal convolutional layers (dilations 1, 2, 4, and 8 from input to output).
Stacked dilated convolutions efficiently enable very large receptive fields with just a few layers, while preserving the input resolution throughout the network. In this paper, the dilation is doubled for every layer up to a certain point and then repeated: e.g.
1, 2, 4, ..., 512, 1, 2, 4, ..., 512, 1, 2, 4, ..., 512.
The intuition behind this configuration is two-fold. First, exponentially increasing the dilation factor results in exponential receptive field growth with depth (Yu & Koltun, 2016). For example each 1, 2, 4, ..., 512 block has a receptive field of size 1024, and can be seen as a more efficient and discriminative (non-linear) counterpart of a 1×1024 convolution. Second, stacking these blocks further increases the model capacity and the receptive field size.

2.2 SOFTMAX DISTRIBUTIONS
One approach to modeling the conditional distributions p(x_t | x_1, ..., x_{t-1}) over the individual audio samples would be to use a mixture model such as a mixture density network (Bishop, 1994) or mixture of conditional Gaussian scale mixtures (MCGSM) (Theis & Bethge, 2015). However, van den Oord et al. (2016a) showed that a softmax distribution tends to work better, even when the data is implicitly continuous (as is the case for image pixel intensities or audio sample values). One of the reasons is that a categorical distribution is more flexible and can more easily model arbitrary distributions because it makes no assumptions about their shape.
Because raw audio is typically stored as a sequence of 16-bit integer values (one per timestep), a softmax layer would need to output 65,536 probabilities per timestep to model all possible values. To make this more tractable, we first apply a μ-law companding transformation (ITU-T, 1988) to the data, and then quantize it to 256 possible values:
f(x_t) = \mathrm{sign}(x_t)\,\frac{\ln(1 + \mu|x_t|)}{\ln(1 + \mu)}
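A minimal numpy sketch of this companding-plus-quantization step, assuming μ = 255 (which yields 256 levels) and inputs scaled to [−1, 1]; the exact mapping from companded values to integer classes is our assumption, since the text only gives the companding formula.

```python
import numpy as np

MU = 255  # mu = 255 gives 256 quantization levels

def mu_law_compand(x, mu=MU):
    """mu-law companding: maps x in [-1, 1] to [-1, 1], allocating more
    resolution to small amplitudes than a uniform quantizer would."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def quantize(x, mu=MU):
    """Map companded values in [-1, 1] to integer classes 0..mu
    (one assumed uniform-binning scheme; details may differ)."""
    f = mu_law_compand(x, mu)
    return np.clip(((f + 1) / 2 * mu + 0.5).astype(np.int64), 0, mu)

samples = np.array([-1.0, -0.01, 0.0, 0.01, 1.0])
classes = quantize(samples)   # 256-way targets for the softmax layer
```

The softmax then only needs 256 outputs per timestep instead of 65,536, and because the companding is monotonic and invertible, generated class indices can be mapped back to waveform amplitudes.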