A critical examination of the current state and limitations of deep learning
A recent New York Times Sunday Magazine article, largely about deep learning, implied that the technique is "poised to reinvent computing itself."³ Yet deep learning may well be approaching a wall, much as I anticipated earlier, at the beginning of the resurgence (Marcus, 2012), and as leading figures like Hinton (Sabour, Frosst, & Hinton, 2017) and Chollet (2017) have begun to imply in recent months.

What exactly is deep learning, and what has it shown about the nature of intelligence? What can we expect it to do, and where might we expect it to break down? How close or far are we from "artificial general intelligence", a point at which machines show a human-like flexibility in solving unfamiliar problems? The purpose of this paper is both to temper some irrational exuberance and also to consider what we as a field might need to move forward.

This paper is written simultaneously for researchers in the field and for a growing set of AI consumers with less technical background who may wish to understand where the field is headed. As such I will begin with a very brief, nontechnical introduction aimed at elucidating what deep learning systems do well and why (Section 2)⁴, before turning to an assessment of deep learning's weaknesses (Section 3) and some fears that arise from misunderstandings about deep learning's capabilities (Section 4), and closing with perspective on going forward (Section 5).

Deep learning is not likely to disappear, nor should it. But five years into the field's resurgence seems like a good moment for a critical reflection on what deep learning has and has not been able to achieve.

2. What deep learning is, and what it does well

Deep learning, as it is primarily used, is essentially a statistical technique for classifying patterns, based on sample data, using neural networks with multiple layers.⁵

³ https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
⁴ For a more technical introduction, there are many excellent recent tutorials on deep learning, including (Chollet, 2017) and (Goodfellow, Bengio, & Courville, 2016), as well as insightful blogs and online resources from Zachary Lipton, Chris Olah, and many others.
⁵ Other applications of deep learning beyond classification are possible, too, though currently less popular, and outside of the scope of the current article. These include using deep learning as an alternative to regression, as a component in generative models that create (e.g.) synthetic images, as a tool for compressing images, as a tool for learning probability distributions, and (relatedly) as an important technique for approximation known as variational inference.

Neural networks in the deep learning literature typically consist of a set of input units that stand for things like pixels or words, multiple hidden layers (the more such layers, the deeper a network is said to be) containing hidden units (also known as nodes or neurons), and a set of output units, with connections running between those nodes.
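To make the architecture concrete, here is a minimal sketch of such a network in Python (the layer sizes and the random stand-in input are invented purely for illustration; only numpy is assumed):

    import numpy as np

    rng = np.random.default_rng(0)

    # Input units (e.g. 784 pixels of a 28x28 image), one hidden layer of
    # 128 units, and 10 output units (one per candidate digit category).
    W1 = rng.normal(0.0, 0.01, (784, 128))  # input -> hidden connections
    W2 = rng.normal(0.0, 0.01, (128, 10))   # hidden -> output connections

    def forward(x):
        hidden = np.maximum(0.0, x @ W1)    # hidden units (ReLU activation)
        return hidden @ W2                  # one score per output category

    x = rng.random(784)                     # a stand-in for a flattened image
    print(forward(x).shape)                 # (10,) -- one score per label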
In a typical application, such a network might be trained on a large set of handwritten digits (these are the inputs, represented as images) and labels (these are the outputs) that identify the categories to which those inputs belong (this image is a 2, that one is a 3, and so forth).

[Figure: a feedforward network, showing an input layer, hidden layers, and an output layer.]

Over time, an algorithm called back-propagation allows a process called gradient descent to adjust the connections between units, such that any given input tends to produce the corresponding output.

Collectively, one can think of the relation between inputs and outputs that a neural network learns as a mapping. Neural networks, particularly those with multiple hidden layers (hence the term deep), are remarkably good at learning input-output mappings.

Such systems are commonly described as neural networks because the input nodes, hidden nodes, and output nodes can be thought of as loosely analogous to biological neurons, albeit greatly simplified, and the connections between nodes can be thought of as in some way reflecting connections between neurons. A longstanding question, outside the scope of the current paper, concerns the degree to which artificial neural networks are biologically plausible.

Most deep learning networks make heavy use of a technique called convolution (LeCun, 1989), which constrains the neural connections in the network such that they innately capture a property known as translational invariance. This is essentially the idea that an object can slide around an image while maintaining its identity; a circle in the top left can be presumed (even absent direct experience) to be the same as a circle in the bottom right.

Deep learning is also known for its ability to self-generate intermediate representations, such as internal units that may respond to things like horizontal lines, or more complex elements of pictorial structure.

In principle, given infinite data, deep learning systems are powerful enough to represent any finite deterministic "mapping" between any given set of inputs and a set of corresponding outputs, though in practice whether they can learn such a mapping depends on many factors. One common concern is getting caught in local minima, in which a system gets stuck on a suboptimal solution, with no better solution nearby in the space of solutions being searched. (Experts use a variety of techniques to avoid such problems, to reasonably good effect.) In practice, results with large data sets are often quite good, on a wide range of potential mappings.

In speech recognition, for example, a neural network learns a mapping between a set of speech sounds and a set of labels (such as words or phonemes). In object recognition, a neural network learns a mapping between a set of images and a set of labels (such that, for example, pictures of cars are labeled as cars). In DeepMind's Atari game system (Mnih et al., 2015), neural networks learned mappings between pixels and joystick positions.

Deep learning systems are most often used as classification systems, in the sense that the mission of a typical network is to decide which of a set of categories (defined by the output units on the neural network) a given input belongs to. With enough imagination, the power of classification is immense; outputs can represent words, places on a Go board, or virtually anything else.

In a world with infinite data and infinite computational resources, there might be little need for any other technique.
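Before turning to the limits, it may help to see the full loop of back-propagation and gradient descent on a toy input-output mapping. The sketch below is illustrative only: the XOR task, layer sizes, learning rate, and iteration count are arbitrary choices, and with most random initializations the tiny network learns the mapping.

    import numpy as np

    rng = np.random.default_rng(0)

    # A toy mapping to learn: two binary inputs -> their XOR.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(0.0, 1.0, (2, 8))   # input -> hidden connections
    W2 = rng.normal(0.0, 1.0, (8, 1))   # hidden -> output connections

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for step in range(5000):
        # Forward pass: compute the network's current outputs.
        h = np.tanh(X @ W1)
        out = sigmoid(h @ W2)
        # Backward pass (back-propagation): gradients of the squared error.
        d_out = (out - y) * out * (1.0 - out)
        d_W2 = h.T @ d_out
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        d_W1 = X.T @ d_h
        # Gradient descent: adjust the connections a little at a time.
        W2 -= 0.5 * d_W2
        W1 -= 0.5 * d_W1

    print(out.round(2))   # approaches [[0], [1], [1], [0]]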
3. Limits on the scope of deep learning

Deep learning's limitations begin with the contrapositive: we live in a world in which data are never infinite. Instead, systems that rely on deep learning frequently have to generalize beyond the specific data that they have seen, whether to a new pronunciation of a word or to an image that differs from one that the system has seen before, and where data are less than infinite, the ability of formal proofs to guarantee high-quality performance is more limited.

As discussed later in this article, generalization can be thought of as coming in two flavors: interpolation between known examples, and extrapolation, which requires going beyond a space of known training examples (Marcus, 1998a).

For neural networks to generalize well, there generally must be a large amount of data, and the test data must be similar to the training data, allowing new answers to be interpolated in between old ones. In Krizhevsky et al.'s paper (Krizhevsky, Sutskever, & Hinton, 2012), a nine layer convolutional neural network with 60 million parameters and 650,000 nodes was trained on roughly a million distinct examples drawn from approximately one thousand categories.⁶

⁶ Using a common technique known as data augmentation, each example was actually presented along with its label in many different locations, both in its original form and in mirror reversed form. A second type of data augmentation varied the brightness of the images, yielding still more examples for training, in order to train the network to recognize images with different intensities. Part of the art of machine learning involves knowing what forms of data augmentation will and won't help within a given system.

This sort of brute force approach worked well in the very finite world of ImageNet, in which all stimuli can be classified into a comparatively small set of categories. It also works well in stable domains like speech recognition, in which exemplars are mapped in a constant way onto a limited set of speech sound categories, but for many reasons deep learning cannot be considered (as it sometimes is in the popular press) a general solution to artificial intelligence.

Here are ten challenges faced by current deep learning systems:

3.1. Deep learning thus far is data hungry

Human beings can learn abstract relationships in a few trials. If I told you that a schmister was a sister over the age of 10 but under the age of 21, perhaps giving you a single example, you could immediately infer whether you had any schmisters, whether your best friend had a schmister, whether your children or parents had any schmisters, and so forth. (Odds are, your parents no longer do, if they ever did, and you could rapidly draw that inference, too.)

In learning what a schmister is, in this case through explicit definition, you rely not on hundreds or thousands or millions of training examples, but on a capacity to represent abstract relationships between algebra-like variables. (A trivial sketch of such a definition appears below.)

Humans can learn such abstractions, both through explicit definition and more implicit means (Marcus, 2001). Indeed even 7-month old infants can do so, acquiring learned abstract language-like rules from a small number of unlabeled examples, in just two minutes (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Subsequent work by Gervain and colleagues (2012) suggests that newborns are capable of similar computations.
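To make the contrast concrete: an explicit definition binds variables rather than accumulating examples. A deliberately trivial sketch in Python (the predicate name and the encoding of the inputs are invented for illustration):

    def is_schmister(relation, age):
        # Explicit, verbal definition: a sister over the age of 10 but under 21.
        # One definition, zero training examples.
        return relation == "sister" and 10 < age < 21

    # Novel cases are classified immediately, with no training data at all.
    print(is_schmister("sister", 15))    # True
    print(is_schmister("sister", 30))    # False
    print(is_schmister("brother", 15))   # False

A deep learning system, by contrast, would have to induce the same boundary from many labeled examples.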
Deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions, or even billions of training examples, as in DeepMind's work on board games and Atari. As Brenden Lake and his colleagues have recently emphasized in a series of papers, humans are far more efficient in learning complex rules than deep learning systems are (Lake, Salakhutdinov, & Tenenbaum, 2015; Lake, Ullman, Tenenbaum, & Gershman, 2016). (See also related work by George et al. (2017), and my own work with Steven Pinker on children's overregularization errors in comparison to neural networks (Marcus et al., 1992).)

Geoff Hinton has also worried about deep learning's reliance on large numbers of labeled examples, and expressed this concern in his recent work on capsule networks with his coauthors (Sabour et al., 2017), noting that convolutional neural networks (the most common deep learning architecture) may face "exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints [i.e., perspectives on objects in visual recognition tasks]. The ability to deal with translation[al invariance] is built in, but for the other ... [common type of] transformation we have to choose between replicating feature detectors on a grid that grows exponentially ... or increasing the size of the labelled training set in a similarly exponential way."

In problems where data are limited, deep learning often is not an ideal solution.

3.2. Deep learning thus far is shallow and has limited capacity for transfer

Although deep learning is capable of some amazing things, it is important to realize that the word "deep" in deep learning refers to a technical, architectural property (the large number of hidden layers used in modern neural networks, where their predecessors used only one) rather than a conceptual one (the representations acquired by such networks don't, for example, naturally apply to abstract concepts like "justice", "democracy" or "meddling"). Even more down-to-earth concepts like "ball" or "opponent" can lie out of reach.

Consider for example DeepMind's Atari game work (Mnih et al., 2015) on deep reinforcement learning, which combines deep learning with reinforcement learning (in which a learner tries to maximize reward). Ostensibly, the results are fantastic: the system meets or beats human experts on a large sample of games, using a single set of "hyperparameters" that govern properties such as the rate at which a network alters its weights, and no advance knowledge about specific games, or even their rules. But it is easy to wildly overinterpret what the results show. To take one example, according to a widely-circulated video of the system learning to play the brick-breaking Atari game Breakout, "after 240 minutes of training, [the system] realizes that digging a tunnel through the wall is the most effective technique to beat the game".

But the system has learned no such thing; it doesn't really understand what a tunnel, or what a wall, is; it has just learned specific contingencies for particular scenarios. Transfer tests, in which the deep reinforcement learning system is confronted with scenarios that differ in minor ways from the ones on which the system was trained, show that deep reinforcement learning's solutions are often extremely superficial.
For example, a team of researchers at Vicarious showed that a more efficient successor to DeepMind's Atari system [the Asynchronous Advantage Actor-Critic, also known as A3C] failed on a variety of minor perturbations to Breakout relative to the training set (Kansky et al., 2017)⁷, such as moving the Y coordinate (height) of the paddle, or inserting a wall midscreen. These demonstrations make clear that it is misleading to credit deep reinforcement learning with inducing concepts like wall or paddle; rather, such remarks are what comparative (animal) psychology sometimes calls overattributions. It's not that the Atari system genuinely learned a concept of wall that was robust, but rather that the system superficially approximated breaking through walls within a narrow set of highly trained circumstances.

⁷ In the same paper, Vicarious proposed an alternative to deep learning called schema networks (Kansky et al., 2017) that can handle a number of variations in the Atari game Breakout, albeit apparently without the multi-game generality of DeepMind's Atari system.

My own team of researchers at a startup company called Geometric Intelligence (later acquired by Uber) found similar results as well, in the context of a slalom game. In 2017, a team of researchers at Berkeley and OpenAI showed that it was not difficult to construct comparable adversarial examples in a variety of games, undermining not only DQN (the original DeepMind algorithm) but also A3C and several other related techniques (Huang, Papernot, Goodfellow, Duan, & Abbeel, 2017).

Recent experiments by Robin Jia and Percy Liang (2017) make a similar point, in a different domain: language. Various neural networks were trained on a question answering task known as SQuAD (derived from the Stanford Question Answering Dataset), in which the goal is to highlight the words in a particular passage that correspond to a given question. In one sample, for instance, a trained system correctly, and impressively, identified the quarterback on the winning team of Super Bowl XXXIII as John Elway, based on a short paragraph. But Jia and Liang showed that the mere insertion of distractor sentences (such as a fictional one about the alleged victory of Google's Jeff Dean in another Bowl game)⁸ caused performance to drop precipitously. Across sixteen models, accuracy dropped from a mean of 75% to a mean of 36%.

⁸ Here's the full Super Bowl passage; Jia and Liang's distractor sentence that confused the model is at the end: "Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager. Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."

As is so often the case, the patterns extracted by deep learning are more superficial than they initially appear.

3.3. Deep learning thus far has no natural way to deal with hierarchical structure

To a linguist like Noam Chomsky, the troubles Jia and Liang documented would be unsurprising. Fundamentally, most current deep-learning based language models represent sentences as mere sequences of words, whereas Chomsky has long argued that language has a hierarchical structure, in which larger structures are recursively constructed out of smaller components. (For example, in the sentence the teenager who previously crossed the Atlantic set a record for flying around the world, the main clause is the teenager set a record for flying around the world, while who previously crossed the Atlantic is an embedded clause that specifies which teenager.) The sketch below illustrates the contrast between the two kinds of representation.
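A toy sketch of the contrast (the bracketing and labels are deliberately simplified, not a full syntactic analysis):

    # Flat: the sentence as a mere sequence of words, every word on equal footing.
    flat = ["the", "teenager", "who", "previously", "crossed", "the", "Atlantic",
            "set", "a", "record", "for", "flying", "around", "the", "world"]

    # Hierarchical: the relative clause is a single nested constituent, so
    # "which words form the main clause?" is a structural question, not a
    # positional one.
    tree = ("S",
            ("NP", ["the", "teenager"],
                   ("REL", ["who", "previously", "crossed", "the", "Atlantic"])),
            ("VP", ["set", "a", "record", "for", "flying", "around", "the", "world"]))

    def main_clause(node):
        # Collect the words of the main clause, skipping embedded (REL) clauses;
        # no comparable operation is defined on the flat list.
        if isinstance(node, list):
            return node
        label, *children = node
        if label == "REL":
            return []
        return [word for child in children for word in main_clause(child)]

    print(" ".join(main_clause(tree)))
    # -> the teenager set a record for flying around the world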
In the 80s, Fodor and Pylyshyn (1988) expressed similar concerns, with respect to an earlier breed of neural networks. Likewise, in (Marcus, 2001), I conjectured that simple recurrent networks (SRNs; a forerunner to today's more sophisticated deep learning based recurrent neural networks, known as RNNs; Elman, 1990) would have trouble systematically representing and extending recursive structure to various kinds of unfamiliar sentences (see the cited articles for more specific claims about which types).

Earlier this year, Brenden Lake and Marco Baroni (2017) tested whether such pessimistic conjectures continued to hold true. As they put it in their title, contemporary neural nets were "Still not systematic after all these years": RNNs could "generalize well when the differences between training and test ... are small [but] when generalization requires systematic compositional skills, RNNs fail spectacularly".

Similar issues are likely to emerge in other domains, such as planning and motor control, in which complex hierarchical structure is needed, particularly when a system is likely to encounter novel situations. One can see indirect evidence for this in the struggles with transfer in Atari games mentioned above, and more generally in the field of robotics, in which systems generally fail to generalize abstract plans well in novel environments.

The core problem, at least at present, is that deep learning learns correlations between sets of features that are themselves "flat" or nonhierarchical, as if in a simple unstructured list, with every feature on equal footing. Hierarchical structure (e.g., syntactic trees that distinguish between main clauses and embedded clauses in a sentence) is not inherently or directly represented in such systems, and as a result deep learning systems are forced to use a variety of proxies that are ultimately inadequate, such as the sequential position of a word presented in a sequence.

Systems like Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013) that represent individual words as vectors have been modestly successful; a number of systems have used clever tricks to try to represent complete sentences in deep-learning compatible vector spaces (Socher, Huval, Manning, & Ng, 2012). But, as Lake and Baroni's experiments make clear, recurrent networks remain limited in their capacity to represent and generalize rich structure in a faithful manner.

3.4. Deep learning thus far has struggled with open-ended inference

If you can't represent nuance like the difference between "John promised Mary to leave" and "John promised to leave Mary", you can't draw inferences about who is leaving whom, or what is likely to happen next. Current machine reading systems have achieved some degree of success in tasks like SQuAD, in which the answer to a given question is explicitly contained within a text, but far less success in tasks in which inference goes beyond what is explicit in a text, either by combining multiple sentences (so-called multi-hop inference) or by combining explicit sentences with background knowledge that is not stated in a specific text selection.
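The idea of multi-hop inference can be made concrete with a toy sketch (the facts, the relation name, and the helper function are all invented for illustration): the answer is stated in no single sentence and must be derived by chaining two of them.

    # Two explicit "sentences", neither of which alone answers the question
    # "Where is the trophy?"
    facts = [
        ("the trophy", "is in", "the suitcase"),
        ("the suitcase", "is in", "the attic"),
    ]

    def where_is(thing, facts):
        # Follow "is in" links transitively: two hops are needed here.
        for subject, relation, place in facts:
            if subject == thing and relation == "is in":
                deeper = where_is(place, facts)
                return deeper if deeper is not None else place
        return None

    print(where_is("the trophy", facts))   # -> the attic

A system that can only highlight spans within a given text has no mechanism corresponding to the chaining step.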
Humans, as they read texts, frequently derive wide-ranging inferences that are both novel and only implicitly licensed, as when they, for example, infer the intentions of a character based only on indirect dialog.

Although Bowman and colleagues (Bowman, Angeli, Potts, & Manning, 2015; Williams, Nangia, & Bowman, 2017) have taken some important steps in this direction, there is, at present, no deep learning system that can draw open-ended inferences based on real-world knowledge with anything like human-level accuracy.

3.5. Deep learning thus far is not sufficiently transparent

The relative opacity of "black box" neural networks has been a major focus of discussion in the last few years (Samek, Wiegand, & Müller, 2017; Ribeiro, Singh, & Guestrin, 2016). In their current incarnation, deep learning systems have millions or even billions of parameters, identifiable to their developers not in terms of the sort of human