It doesn't necessarily seem that surprising to me.

If I can be really hand-wavy about it: a lot of deep neural networks achieve their nonlinearity in a very constrained way. They stack linear models on top of each other, and the nonlinearity comes from passing each model's output through a relatively simple nonlinear function such as the logistic or tanh before feeding it into the next one. (Without that step, you'd just have a linear combination of linear functions, which would itself be linear.)
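A quick numpy sketch of that parenthetical (toy sizes, names are mine): two stacked weight matrices without an activation collapse to a single linear map, which you can check via additivity; squashing with tanh breaks that.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random "layers" as weight matrices (toy sizes).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

x = rng.normal(size=3)
y = rng.normal(size=3)

# Without an activation, the stack is just the single linear map W2 @ W1,
# so it is additive: f(x + y) == f(x) + f(y).
def linear_stack(v):
    return W2 @ (W1 @ v)

assert np.allclose(linear_stack(x + y), linear_stack(x) + linear_stack(y))

# Squashing each layer's output with tanh breaks that collapse.
def tanh_stack(v):
    return W2 @ np.tanh(W1 @ v)

assert not np.allclose(tanh_stack(x + y), tanh_stack(x) + tanh_stack(y))
```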

That's a pretty constrained form of nonlinearity compared to polynomial regression, which tries to directly fit a high-order polynomial. I don't have anything like the math chops to prove this, but I believe it means the neural network will tend to favor a relatively smoother decision boundary, whereas polynomial regression is a naturally high-variance sort of affair.
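You can see the high-variance behavior in a toy fit (my own made-up example, not a proof of anything): a degree-15 polynomial necessarily matches noisy samples at least as closely in-sample as a nested degree-3 fit, but the curve it draws between the samples is much rougher.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth function on [-1, 1].
x = np.linspace(-1, 1, 20)
y = np.tanh(3 * x) + rng.normal(scale=0.05, size=x.shape)

# Nested least-squares fits: the degree-15 model contains the degree-3
# model, so its in-sample residual can only be smaller or equal.
c3 = np.polyfit(x, y, 3)
c15 = np.polyfit(x, y, 15)
r3 = np.sum((np.polyval(c3, x) - y) ** 2)
r15 = np.sum((np.polyval(c15, x) - y) ** 2)
assert r15 <= r3 + 1e-9

# But on a dense grid, the degree-15 curve wiggles between the samples,
# chasing the noise; max second difference is a crude roughness measure.
grid = np.linspace(-1, 1, 1000)
rough3 = np.max(np.abs(np.diff(np.polyval(c3, grid), 2)))
rough15 = np.max(np.abs(np.diff(np.polyval(c15, grid), 2)))
print(rough3, rough15)  # the degree-15 fit comes out far rougher here
```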



I agree this is one of the reasons for the success of neural networks. But it was not obvious at first, and it is still quite hard to formalize and explain in mathematical terms. That's what I meant by "surprising".


No, it's not constrained at all. In fact, even single hidden layer networks with nearly arbitrary activation functions are universal approximators (Hornik et al. 1989). Polynomials are also universal approximators (Weierstrass).
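A cheap way to see the single-hidden-layer version in action (my own toy setup, with a random-feature shortcut standing in for actual training): one layer of tanh units with random weights plus a least-squares linear readout already approximates a smooth target well.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target function on [-1, 1].
x = np.linspace(-1, 1, 200)
y = np.sin(4 * x)

# Single hidden layer: 100 tanh units with random weights and biases
# ("random features"), then a linear least-squares readout.
W = rng.normal(scale=4.0, size=(1, 100))
b = rng.normal(scale=2.0, size=100)
H = np.tanh(x[:, None] * W + b)          # hidden activations, shape (200, 100)
coef, *_ = np.linalg.lstsq(H, y, rcond=None)

err = np.max(np.abs(H @ coef - y))
print(err)  # small max error over the grid
```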


I'm not trying to say that neural networks are inherently constrained. I'm saying that, in typical usage, they tend to be used in a certain way that I believe introduces some useful constraints. You can use a single hidden layer and an arbitrary activation function, but, in practice, it's a heck of a lot more common to use multiple hidden layers and tanh.

It's worth noting that neural networks didn't take off with Hornik et al. style simple-topology-complex-activation-function universal approximators. They took off a decade or so later, with LeCun-style complex-topology-simple-activation-function networks.

That arguably suggests that the paper is of more theoretical than practical interest. It's also worth noting that one of the practical challenges with a single hidden layer and a complex activation function is that it's susceptible to high variance. Just like polynomial regression.


This kind of stuff is called inductive bias and is a sexy topic nowadays.



