A Brief Overview of Deep Learning (yyue.blogspot.com)
156 points by ozansener on Jan 14, 2015 | hide | past | favorite | 37 comments


I found this to be insightful:

> ... human neurons are slow yet humans can perform lots of complicated tasks in a fraction of a second. More specifically, it is well-known that a human neuron fires no more than 100 times per second. This means that, if a human can solve a problem in 0.1 seconds, then our neurons have enough time to fire only 10 times --- definitely not much more than that. It therefore follows that a large neural network with 10 layers can do anything a human can in 0.1 seconds.


This seems wrong. My understanding is that the ~200 spikes/second limit derives from the cell needing to "reload" before firing again, rather than some built-in latency. Relaying a spike can be very quick indeed. A better conclusion would be that we don't have time for too many recursions in that short a time. Also, I can't think of any particularly hard problem that humans can solve in 0.1 seconds (see e.g. http://en.wikipedia.org/wiki/Hick%27s_law)


> Also, I can't think of any particularly hard problem that humans can solve in 0.1 seconds

Recognizing someone is a hard problem that most humans can solve efficiently (except me, maybe)


I actually came in here to ask for clarification on this.

This sounds presumptuous. Couldn't there be thousands of neural networks that all receive an input signal, and whose combined outputs we know how to interpret as a single signal? Maybe the neural networks themselves are each only 10 layers deep, but if they're all running in parallel, doesn't that defeat the point?

I don't really know anything about it, though.


Combining a thousand networks into one signal is effectively one extra layer (though I don't think human neurons can actually handle a thousand incoming connections).

The observation was a comment on how deep a network needs to be to perform useful tasks.


Humans have >10^14 synapses and slightly short of 10^11 neurons. Neurons thus have more than a thousand incoming connections on average.

Purkinje cells can have on the order of hundreds of thousands of inputs.


Good article, but I still don't understand why they're suddenly popular again. So processing is faster, but is there some development in processing which has improved this domain especially? Developments in "big data" processing? Concepts like mapreduce? What's with the resurgence :S


Deep neural networks are quite powerful because they don't require someone to design the network specific to the task. Large deep nets will generally be able to learn feature representations by themselves, which makes them extremely powerful for tasks where we aren't quite sure of exactly what features we want.

However, deep nets require large amounts of training data and computational power. The fairly recent widespread adoption of general purpose GPUs has allowed much faster training. Combine this with the popularity of "big data", and you've got a perfect storm for deep neural nets.

Of course, the hype may be overvaluing deep nets as the future of AI. DNNs work well in practical applications, but they're poorly defined theoretically and the AI community suspects that we're still bad at training them -- a recent paper showed that a simpler shallow net can perform as well as a deep net if a deep net is trained first[1]. We're also fairly certain that deep nets are not how the brain actually works, and thus we'll need a different architecture in order to achieve human level performance on some tasks.

[1] http://arxiv.org/abs/1312.6184


There are myriad reasons why they're popular again.

1. Hardware has caught up, and is cheap. When backprop was invented back in the 80s, you couldn't train networks with more than a couple of thousand nodes, tops. Today, with GPUs, you can train networks with billions of parameters.

2. More data is available. Back in those days, you had a few dozen (maybe a few hundred) examples in your training set. Today, people play with sets larger than 1 TB.

3. Dramatic successes. For a while, the ImageNet competition was seeing slow and steady progress. Then DL comes along, and there's a 20% jump in performance (I'm too lazy to look up the exact numbers...). If you've ever competed in such competitions, progress is painfully slow (see, for example, the Netflix competition). So a jump of that magnitude in performance in one step is mind-blowing. On top of that, every year since then, the performance has increased significantly.

These are just 3 that come to mind.


I feel like cloud technology may be a contributing factor as well.

At the same time, once Geoffrey Hinton's group entered the 2012 ImageNet contest with a deep neural network (http://www.image-net.org/challenges/LSVRC/2012/results.html), their results beat the next best entry by a full 10%. The results were so astounding that many people immediately began revisiting neural networks. Shortly afterwards, people showed it could beat the existing technology for language processing and more. Nowadays, a major leap seems to be real-time translation, with Skype and now Google launching machine translation applications/functions.

Side note, in my opinion, start-ups that are looking to compete with large giants like Google will have a pretty hard time. In the end, implementing deep neural nets that work is still extremely hard. The companies that do it right usually get bought up by one of the giants. Google has some of the leading researchers in academia on its side as well.


But don't kid yourself: if you have ever been to CVPR or one of the machine learning conferences, you see all the papers trying to squeeze a few more percent of accuracy on a test set out of existing algorithms. Maybe one or two papers actually do something novel. Rarely will there be a new approach altogether.

The point I am trying to make is that one shouldn't get starry-eyed over leading researchers in the field and assume one can't contribute. Simple novel ideas have led to massive changes in the industry.


Fair point: Alex Krizhevsky did train the entire neural network on two GPUs in his bedroom. I stand corrected.


Hinton published a much more efficient training algorithm in the last couple of decades. They've also shown extremely impressive performance in vision and speech.


They are adept at storing large numbers of patterns and therefore excel at many supervised learning tasks. There indeed was a development in processing, namely the emergence of the GPU, which led to the resurgence. It enabled people to train far larger networks. Large networks = more patterns.


> therefore follows that a large neural network with 10 layers can do anything a human can in 0.1 seconds.

Very funny ... as if ANNs were sufficiently comparable to actual neural activity. I also think it is naive to assess the "powerfulness" of the brain by what is going on in a single neuron; it is certainly the parallel interaction which creates human intelligence.

> And if human neurons turn out to be noisy (for example), which m...

It is pretty naive to consider noise as something purely handicapping; a lot of algorithms are as powerful as they are precisely by utilizing noise and stochasticity.

> What is learning? Learning is the problem of finding a setting of the neural network’s weights that achieves the best possible results on our training data.

Wrong - this is memorizing. Learning is the process that leads to a low out-of-sample error.


This post is written by Ilya Sutskever, who has co-authored some of the biggest breakthroughs in machine learning in the last five years. Which do you think is most likely: a) that he has a naive understanding of machine learning and neuroscience, or b) that this was written informally, without guarding against every possible way it could be misinterpreted? Please be a little charitable when interpreting other people's writing.


Just curious - can you give an example of a big breakthrough he co-authored?

Nonetheless, some of his remarks are very specific, and I don't see how informal style applies here to excuse them.


He was second author on the AlexNet paper [1], wherein Alex Krizhevsky, Sutskever, and Hinton blew everyone else out of the water in the ImageNet competition [2]. Their error rate was about 10 percentage points lower than the others'. Relatively speaking, they had about 40% fewer errors than anyone else. This is possibly the biggest result in computer vision in the last five years. So it seems a little silly to educate him on the basics of machine learning :)

[1] http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf

[2] http://www.image-net.org/challenges/LSVRC/2012/results.html


Well, thanks for the info. But then I shift my critique: I find it unnecessary to distort ML and biological concepts just to simplify the subject, when an accurate depiction wouldn't be much more difficult. In particular, failing to differentiate properly between memorization and generalization/learning is odd, because this is one of the most prominent mistakes: it is specifically not the goal to minimize the in-sample error! That would lead to very bad results most of the time.


Actually, Ilya explains his statement regarding minimizing training errors in his comment exchange with Bengio:

"Although I didn't define it in the article, generalization (to me) means that the gap between the training and the test error is small. So for example, a very bad model that has similar training and test errors does not overfit, and hence generalizes, according to the way I use these concepts. It follows that generalization is easy to achieve whenever the capacity of the model (as measured by the number of parameters or its VC-dimension) is limited --- we merely need to use more training cases than the model has parameters / VC dimension. Thus, the difficult part is to get a low training error."
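Sutskever's definition can be illustrated with a toy sketch (hypothetical, not from the thread): a high-capacity 1-nearest-neighbour "memorizer" reaches zero training error but has a train/test gap, while a low-capacity constant model has similar training and test errors (so it "generalizes" in his sense) yet is bad overall. The hard part is indeed getting the training error down.

```python
import random

random.seed(0)

def make_data(n):
    # Toy 1-D regression task: y = x^2 plus noise.
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [x * x + random.gauss(0, 0.1) for x in xs]
    return xs, ys

train_x, train_y = make_data(20)
test_x, test_y = make_data(200)

def mse(pred, xs, ys):
    return sum((pred(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# High-capacity model: 1-nearest-neighbour lookup memorizes the training set,
# so its training error is exactly zero, but a train/test gap appears.
def nn_pred(x):
    return min(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[1]

# Low-capacity model: predict the training mean everywhere. Its training and
# test errors are similar (small gap), but both are comparatively large.
mean_y = sum(train_y) / len(train_y)
def mean_pred(x):
    return mean_y

print("1-NN train/test MSE:", mse(nn_pred, train_x, train_y),
      mse(nn_pred, test_x, test_y))
print("mean train/test MSE:", mse(mean_pred, train_x, train_y),
      mse(mean_pred, test_x, test_y))
```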


I've been playing with Caffe for recognising images. It's kind of mind blowing how well it works. Yet the networks I tested could "only" recognise photos, not drawings or anything abstract.

A human could easily attribute meaning to a drawing, even if the drawing was very abstract or she had never seen a similar drawing before. Whereas deep networks seem to rely on visual similarity to things they have seen in the past, on a pixel level. The networks I tried could tell something was a cartoon, but not what the cartoon depicted, even if it was something simple like a face.

The deep networks I tried also really struggled with recognising different textures, like close-ups of sand, water, and so on: things that a human would instantly recognise. They could classify an image as a texture, but not what kind of texture.


NN's have been able to represent fairly abstract art, e.g. http://i.imgur.com/HU66Vo7.png?1

They can also generate abstract images when the images are optimized to be recognized by the NN: http://i.imgur.com/Mixk96V.png?1

I think it's likely that cartoons contain a lot of meaning and symbols that are specific to human culture. Imagine a stick figure in the simplest case. It's not obvious that a circle and sticks should be a person. Same with a lot of other cartoon features that look nothing like reality.


"so I implemented a small neural network and trained it to sort 10 6-bit numbers, which was easy to do to my surprise"

Does anyone know what the inputs and outputs of a neural network that sorts numbers would look like?


Input: a 60-dimensional vector that is the concatenation of 10 6-dimensional binary vectors encoding the binary representation of the input numbers.

Output: the same, sorted.

At least that's one dead simple way to formulate the problem, multiple other solutions would work as well, and some would probably work better.
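That formulation can be sketched in a few lines of Python (hypothetical helper names, assuming the straightforward 6-bit concatenation described above):

```python
def encode(nums):
    # Concatenate the 6-bit binary representations into one 0/1 vector.
    vec = []
    for n in nums:
        vec.extend(int(b) for b in format(n, "06b"))
    return vec

def decode(vec):
    # Split the vector back into 6-bit chunks and read each as an integer.
    return [int("".join(str(b) for b in vec[i:i + 6]), 2)
            for i in range(0, len(vec), 6)]

nums = [42, 7, 63, 0, 15, 33, 12, 5, 58, 21]
x = encode(nums)          # 60-dimensional network input
y = encode(sorted(nums))  # 60-dimensional training target
```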


I started playing with this. I take each digit and normalize it by dividing by 9. Then use each normalized digit as an input:

Example sorting 987654 and 123456

  Input: 1, .9, .8, .7, .6, .5, .2, .3, .4, .5, .6, .7
  Expected output: .2, .3, .4, .5, .6, .7, 1, .9, .8, .7, .6, .5
You can then encode/decode the inputs and outputs accordingly:

  if (value <= 1)   digit = 9;
  if (value <= 0.9) digit = 8;
  ...
  if (value <= 0.2) digit = 1;
  if (value <= 0.1) digit = 0;

I'm able to get 100% accuracy on a limited training set with 2 hidden layers of 10 nodes. 33% accuracy on the test set (but likely need a lot more data to train with).
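Judging by the example above, the encoding actually maps digit d to (d + 1) / 10 (1 becomes .2, 9 becomes 1). Under that assumption, the decode chain of threshold ifs can be replaced with rounding (a hypothetical sketch, not the poster's code):

```python
def encode(digits):
    # Map digit d to (d + 1) / 10, matching the example above (1 -> .2, 9 -> 1).
    return [(d + 1) / 10 for d in digits]

def decode(values):
    # Invert by rounding the network output to the nearest digit,
    # clamping to the valid range 0..9.
    return [min(9, max(0, round(v * 10) - 1)) for v in values]
```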


Update: I was able to train the network to sort sets of two 3-digit numbers. I used a neural network with 2 hidden layers of 25 nodes. The training/test accuracy after 10 minutes is 78%/74%. Not bad. https://github.com/primaryobjects/nnsorting


Interesting. It sounds like he used binary. I wonder how different bases affect how well the network can learn things like sorting?


You'd be better off if you mapped those 10 numbers to a single number in the [0, 10!-1] interval, since there are only 10! different potential answers. That way you make the algorithm's job easier (figuring out the "structure" of the function it's trying to solve for is easier when the result space is compressed without any 'signal' loss).

Another improvement could be "enhancing" the inputs: once you figure out how you will permute the numbers to sort them, create specific and randomized variations of that list of numbers and feed the learning algorithm the correct results for those variations too. For instance, if you have 5 70 2 13 as a training input, the trainer could generate the following extra inputs based on it, so the algorithm gets a better chance of figuring out the sorting for a test input like 2 15 5 65:

2 5 23 70; 2 5 33 70; ...; 2 5 63 70; 2 5 73 70; also 2 5 14 70, 2 5 13 69, etc. Also modify more than one number at the same time (both systematically and randomly) to generate even more "gray" input.
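The [0, 10!-1] mapping suggested above can be implemented as the Lehmer-code rank of the sorting permutation (a hypothetical sketch; `perm_index` is an assumed name):

```python
from math import factorial

def perm_index(nums):
    # Rank, in [0, n!-1], of the permutation that sorts `nums` (Lehmer code).
    n = len(nums)
    # order[k] is the original index of the k-th smallest element.
    order = sorted(range(n), key=lambda i: nums[i])
    idx = 0
    remaining = list(range(n))
    for k, pos in enumerate(order):
        # Count how many still-unused positions precede this one.
        idx += remaining.index(pos) * factorial(n - 1 - k)
        remaining.remove(pos)
    return idx
```

The identity permutation maps to 0 and the reversal to n!-1, so for 10 inputs the target is a single number in [0, 10!-1].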


Could also have an 18-bit output where each 3-bit block tells you the index of that input in the sorted array.
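For the two-3-digit-number version above (6 digits, so 6 blocks of 3 bits), that target could be built like this (a hypothetical sketch; `target_bits` is an assumed name, and ties are broken by stable sort order):

```python
def target_bits(digits):
    # For each input position, emit the 3-bit index of where that digit
    # lands in the sorted sequence (stable sort handles ties).
    order = sorted(range(len(digits)), key=lambda i: digits[i])
    rank = [0] * len(digits)
    for sorted_pos, original_pos in enumerate(order):
        rank[original_pos] = sorted_pos
    bits = []
    for r in rank:
        bits.extend(int(b) for b in format(r, "03b"))
    return bits
```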


Has anyone had experience training deep nets for domains in which examples are large heterogeneous collections (as opposed to speech, or text, or images), like say transactional or click-stream data?


This post, while very interesting, attempts to draw a completely unwarranted parallel between deep nets and the human brain, as if layers of artificial neurons running on a GPU and the cortical layers of your brain were two interchangeable things.

So far, there has been no evidence that the brain works anything like an artificial neural network. Maybe it does, and there are several theories in that direction, but at the moment we have no solid reason to think so.


The point of drawing comparisons to the human brain is that we know how quickly humans can perform visual recognition tasks and the speed of signal propagation between neurons. Combining these two properties implies that the human brain is able to solve these tasks without feedback, i.e. no loops. Thus, a DNN should be able to perform similar tasks if it can be trained (which it can).

Recurrent neural nets add feedback and are a whole different kettle of fish.


The brain also appears to have dedicated network structures that are not trained but constrained, i.e. programmed to one transformation: say, the equivalent of feeding both an image and its edge-enhanced version to the same DNN.

The current approach of feeding raw bitmaps to a DNN falls short of that and is very sensitive to training data [1].

I remember an old paper, which I cannot find now, about how to normalize images for NN processing in face recognition. The software extracted the face, centered it on a square, and projected that square onto a circle around the center to make face orientation irrelevant (hard to explain without images).

Anyway, it is unfair to expect a DNN to perform vision recognition tasks from raw two-dimensional image points.

[1] http://www.i-programmer.info/news/105-artificial-intelligenc...


Whatever happened to shallow learning (or, you know, regular learning), which everyone did before deep learning?

Anyone still doing that?

Is this like BigData? As soon as someone mentioned BigData, anyone in the world who touched data all of a sudden did BigData.

So is this something coming out of Google and Facebook and such, while everyone else in academia is happily building SVMs and 2-layer neural networks, or did some new discovery happen and turn the whole ML and AI field on its head?

> Crucially, the number of units required to solve these problems is far from exponential --- on the contrary, the number of units required is often so “small” that it is even possible, using current hardware,

The number of units is not what's important. There are "only" what, 10B (100B?) neurons in the brain? But isn't the trick in the connections? There are orders of magnitude more connections (hundreds of trillions). Not exponential, but even quadratic at those numbers is still quite large.


Whatever happened to shallow learning (or you know the regular learning) everyone did before deep learning.

Deep learning happened, and it pretty much always beats other approaches. Saying that sounds unbelievable, so here's a quote from Pete Warden:

I know I’m a broken record on deep learning, but almost everywhere it’s being applied it’s doing better than techniques that people have been developing for decades[1]

There's a great paper from a group of researchers who set out to prove that their technique, which they had many years of experience in (SVMs?) was just as good as deep learning (I can't remember their field). They ended up proving the opposite, and switched their whole lab over to doing deep learning. I can't find the paper (!!) so I'll refer you to [2] instead.

[1] http://petewarden.com/2015/01/01/five-short-links-76/

[2] http://petewarden.com/2014/06/10/why-is-everyone-so-excited-...


Deep learning is only applicable to some problems, e.g. speech recognition and image classification, where there are mountains of data to train the many network parameters but also a certain level of complexity in the features. Most simple problems with a few thousand feature vectors for training are still solved by SVMs etc. Deep learning is getting excellent results on the problems it is good at, though, better than any other classifier.

Remember, SVMs were being thrown around as the ML wunderkind prior to deep learning. After a while, people figure out exactly what some things are good at and what they're not so good at.


A rather sizable part of the article talks about why large deep networks can solve problems that shallow networks can't.

I won't say that the ML field got turned on its head by deep networks. I think people from the very start have wanted to try and make networks deeper, if only they'd known how.



