*Update*: In a dramatic turn of events, the SARM paper has been withdrawn. This appears to be the conclusion of a series of events which have unfolded over a peroid of 6 or so days, as the machine learning community struggled to understand and reproduce the results in this facinating manuscript. I too, was caught up in this wave of hype, which prompted me to write this blog post.

The hype revolves around Wang's paper's extraordinary claim - that a network could be trained effectively without back-propagation. The technique, a form of unsupervised pre-training, is described in the final section of this article, "A note on training". The authors claim that by doing pre-training *alone*, the network could achieve an accuracy rivaling VGGNet. This allowed the network to be trained in hours, not days, on just 0.5% of the data.

The withdrawal notice states that this is not the case.

To obtain the reported SARM performance, for each layer a number of candidate 0.5% subsets were drawn and tried, and the best performer was selected; the candidate search may become nearly exhaustive ... The current implementation for SARM-VGG will bring in dramatically higher complexity and can take multiple days.

The math in this blog post is still accurate. The ideas which underpin the SARM - sparse coding, dictionary learning, and proximal gradient are classical. No modification to this post is hence required. But this turn of events bring the efficacy of SARM into question. Future research may ultimately vindicate this architectural curiosity, but for now, sadly, the proof is not in the pudding.

There has been some buzz on reddit here and here about the paper Stacked Approximated Regression Machine: A Simple Deep Learning Approach. Approaching the paper with a measured dose of skepticism, I was pleasantly surprised to find the paper containing a beautiful kernel of an idea - one I can see becoming a fixture in our deep learning toolkit, the $k$-Approximated Regression Machine ($k$-ARM).

A deep neural network is a function which takes as input, $x$, data, into an output, $F(x)$. The word “deep” in deep learning refers to the network being a composition of layer functions: $$F(x)=\Psi^{m}(\,\Psi^{m-1}(\,\cdots\,\Psi^{1}(x)\,\cdots\,)\,)$$

A traditional choice of a layer function looks like $\Psi^{k}(x)=\sigma(W^{T}_k x)$. Here $W^{T}_k$ is a matrix, representing a linear transform, (such as a convolution or a fully connected layer) and $\sigma$ is a choice of a non-linear activation function. The goal in deep learning is to shape our function $F$, by any means possible, by tweaking the weights till they fit the data.

While traditional tools worked well, there has been recently a Cambrian explosion of nontraditional layers. Consider the following conceptual layer. This layer isn't defined explicitly, but is instead defined implicitly as the solution of a *sparse coding problem*:
$$\Psi(x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+\lambda\|y\|_{1}\big\}.$$
In the spirit of giving new names to old things, I dub this the Regression Machine (a better name would be the Sparse Coding Layer, but lets stick to this).

This layer behaves very differently from a regular deep learning layer and it might be worthwhile to consider what it does. $W$ is a set of weights, to be trained, and $\lambda$ is a parameter controlling how sparse the output is. Assuming a trained model, with sensible $W$, this layer is a sparse encoder. It takes $x$, its input and transforms it into its representation in structural primitives (the columns of $W$).

It is useful to think of a musical metaphor. The sound of a chord (our input $x$) is generated by a sparse superposition of sounds corresponding to a small number of notes (our sparse parameter $y$). Our layer takes this generative process and throws it in reverse, and attempts to, from the sound of a chord, to recover the individual notes.

If you have a background in engineering, these terms might be familiar to you in a different language. The generative process can be seen as the forward model,

and the problem of recovering the notes from its output, the inverse problem

This layer is therefore, paradoxically, the opposite of a regular deep learning layer. But it is this lopsided layer which makes the most sense, as it truly takes data from input space into explanation space - a higher level of abstraction.

Gregor and LeCun observe that sparse coding exhibits the property of mutual inhibition. Though $W$ may contain numerous similar columns, generally only one among these will be chosen to represent the input. It's no surprise then that this is the mechanism proposed by Olshausen et al to represent the early stages of visual processing.

Despite its implicit definition, we can still take its (sub) gradient, $$ \begin{align*} \nabla\Phi(x)_{ij} & =\begin{cases} [(Z(W^{T}W)Z^{T})^{-1}ZW^{T}]_{ij} & y_{i}^{\star}\neq0\\ 0 & y_{i}=0 \end{cases}\\ \frac{\partial}{\partial W_{ij}}\Phi(x) & =\frac{\partial}{\partial W_{ij}}(Z(W^{T}W)Z^{T})^{-1}Z[W^{T}x-\lambda\mbox{sign}(y^{\star})]\\ Z & =\mbox{operator which removes rows $j$s for which $\Psi(x)_j = 0$} \end{align*}$$

Hence there is no technical roadblock in integrating this into a deep learning framework.

This leads to the troubling question - why isn't this kind of layer de-facto? The trouble with this layer is that it is expensive. Computing this layer requires the solution of a small optimization problem (on the order of the number of features in a datum). Measured in milliseconds, this may be cheap - the cost of solving this problem is on the order of solving a linear system. But it is still an order of magnitude more expensive than a simple matrix multiplication passed through a nonlinearity. Adding a single layer like this would potentially increase the cost of every training step a hundredfold.

Let us instead try to approximate this layer. First, I define a (seemingly) cryptic operator $$ \Psi_{x}'(y) := \sigma(y-\alpha W^{T}(Wy-x)), \qquad \sigma(y)_{i}=\mbox{sign}(y_i)\max\{0,\left| y_{i} \right|-\lambda\} $$

I claim repeated applications of this operator give us a procedure for solving the sparse coding problem, i.e. finding $$\underset{y}{\mbox{argmin }}\big\{\overset{f(x)}{\overbrace{\tfrac{1}{2}\|Wy-x\|^{2}}}+\lambda\|y\|_{1}\big\}.$$

Lets pause to consider what this operator does. It proceeds in two steps

- (
**Gradient Step**) First we decrease $f$ with a gradient step $y - \alpha\nabla f(y)$. - (
**Sparsity Step**) Next, we apply $\sigma$, snapping all components of the vector from the step above between $[-\lambda, \lambda]$ to zero.

The first step decreases the objective, and the second promotes sparsity. Formally:

- $\Psi_x$ is the unique fixed point of the map $y\mapsto \Psi_{x}'(y),$ i.e. $\Psi(x)=\Psi_{x}'(\Psi(x))$
- From any inital point $y$, repeated application of the map will always converge to $\Psi(x).$

In informal but suggestive notation, $$\Psi(x)=\Psi_{x}'(\Psi_{x}'(\Psi_{x}'(\cdots))).$$

Which suggests a series of approximations $\Psi\approx\Psi_{k}$, (starting at $0$ for simplicity), $$\begin{align} \Psi_{0}(x) & =\Psi_{x}'(\mathbf{0})\\ \Psi_{1}(x) & =\Psi_{x}'(\Psi_{x}'(\mathbf{0}))\\ \Psi_{2}(x) & =\Psi'_{x}(\Psi_{x}'(\Psi_{x}'(\mathbf{0})))\\ & \vdots\\ \lim_{k\rightarrow\infty}\Psi_{k}(x) & =\Psi(x). \end{align}.$$

This approximation was observed by Gregor and LeCun has an architectural diagram looking like this

And our most aggressive approximation, $\Psi_0(x)$ has the form $$ \Psi_{0}(x) =\Psi_{x}'(\mathbf{0}) =\sigma(\mathbf{0}-\alpha W^{T}(W\mathbf{0}-x)) =\sigma(\alpha W^{T}x) $$

which should look familiar. Why, it's nothing more than our traditional layer discussed at the beginning! A tantalizing coincidence.

Where does the operator $\Psi_{x}'(y)$ come from? There is, in fact, a powerful framework in convex optimization which gives us the tools to craft such operators. The study of proximal operators gives us a means to solve problems of the form $$y^\star = \mbox{argmin}\{f(y)+g(y)\}$$

Where $f$ is smooth and convex and $g$ is just convex. For reasons which will be clear later, $f$ is usually chosen to be the vessel of our data, and $g$ to be structurally simple, like the 1-norm. Sparse coding, discussed earlier, is a special case of this where $$f(y)=\tfrac{1}{2}\|Wy-x\|^{2},\qquad g(y)=\lambda\|y\|_{1}$$

This framework gives us a recipie on how to replace the 1-norm with any convex function $g$, such as any p-norm, grouped 2-norms and more.

Given $f$ and $g$, the proximal gradient method defines the operator $$\Psi_{f}^{'}[g](y)=\sigma[g](y+\alpha\nabla f(y)),\qquad\sigma[g](y)=\mbox{argmin}_{\bar{y}}\{\tfrac{1}{2}\|\bar{y}-y\|^{2}+g(\bar{y})\}$$

so that, you guessed it

- $y^\star$ is the unique fixed point of the map $y\mapsto \Psi_{f}'[g](y)$, i.e. $y^\star = \Psi_{f}'[g](y^\star)$
- From any $y$, for small enough $\alpha>0$ repeated application of the map will always converge to $y^\star$

See Boyd's slides for more details. A few notes are in order. First, though you can in principle stick in any $g$ you want, you cannot escape having to compute $\sigma[g]$. This method hinges on $\sigma[g]$ having something resembling a closed form solution. Second, this framework encompasses constrained optimization with a bit of syntactic sugar: $$\underset{y\in S}{\mbox{argmin }}f(y)=\underset{y}{\mbox{argmin}}\{f(y)+\delta(y \, | \, S)\},\qquad\delta(y)=\begin{cases} 0 & y\in S\\ \infty & y\notin S \end{cases}.$$

So the constrained problem can be "simulated" by using $g(y) = \delta(y \, | \, S)$. Finally, the map described in the previous section is a special case of the proximal gradient iteration. I'll leave it as an exercise to show

$$\sigma[\lambda\|\cdot\|_1](y)_{i}=[\mbox{argmin}_{\bar{y}}\{\tfrac{1}{2}\|\bar{y}-y\|^{2}+\lambda\|\bar{y}\|_{1}\}]_{i}=\mbox{sign}(y_{i})\max\{0,\left|y_{i}\right|-\lambda\}$$Here are some $g$'s, and their $\sigma[g]$'s: $$\begin{alignat*}{2} & g(y)=0 & & \sigma[g](y)=y\\ & g(y)=\lambda\|y\|_{2} & & \sigma[g](y)=\max\{0,1-\lambda/\|y\|_{2}\}y\\ & g(y)=\delta(y,\{x\mid\|x\|_{2}\leq1\}) & & \sigma[g](y)=y/\|y\|_2\\ & g(y)=\delta(y,\mathbf{R}_{+}) & \qquad & \sigma[g](y)_{i}=\max\{0,y_{i}\}\\ & g(y)=\delta(y,\mathbf{R}_{+})+\lambda\|y\|_{1} & \qquad & \sigma[g](y)_{i}=\max\{0,y_{i}-\lambda\} \end{alignat*}$$

Now for each $g$, we have its corresponding $g$-Regression Machines, $$\Psi[g](x)=\underset{y}{\mbox{argmin }}\big\{\tfrac{1}{2}\|Wy-x\|^{2}+g(y)\big\},$$ with corresponding $g$-ARMs defined using the iterations of the map $$\Psi[g]_{x}'(y):=\sigma[g](y-\alpha W^{T}(Wy-x))$$

Take for example the Non-Negative Regression Machine, with $g(x)=\delta(x,{\mathbf R}_+)$. This has $$\Psi[\delta(\cdot|\mathbf{R}_{+})]=\underset{y\geq0}{\mbox{argmin }}\tfrac{1}{2}\|Wy-x\|^{2},\quad\Psi[\delta(\cdot|\mathbf{R}_{+})]_{x}'(y):=\max\{0,y-\alpha W^{T}(Wy-x)\}$$ With a 0-ARM of $$\Psi[\delta(\cdot|\mathbf{R}_{+})]_{0}(y)=\max\{0,W^{T}y\},$$

exactly a traditional layer with `ReLu`

Activation.

Finding the $W$'s for the network still remains a nagging problem. Fortunately, the usual set of tools apply. Given a large labeled dataset, training a stacked $k$-ARM with backpropagation is still viable. The interpretation of a neural net as a sparse code, however, lends itself to a natural heuristic for initializing $W$'s.

With a large unlabeled dataset {$x_1,\dots,x_n$}, the problem of finding the weights $W$ for a single Regression Machine is the Dictionary Learning Problem. Dictionary Learning learns both the $W$'s and $y_i$'s jointly, in the following (large) optimization problem: $$\underset{W,y_{i}}{\mbox{argmin}} \sum_{i=1}^{n}\left(\tfrac{1}{2}\|Wy_{i}-x_{i}\|^{2}+\lambda\|y_{i}\|_{1}\right).$$

Similarly, a single Non-Negative Regression Machine can trained via Non-Negative Matrix Factorization $$\underset{W\geq0,y_i\geq0}{\mbox{argmin}} \sum_{i=1}^{n}\tfrac{1}{2}\|Wy_{i}-x_{i}\|^{2}$$

assuming the data is positive entry-wise. These two methods can be seen as variations of the mother of unsupervised learning, Principle Component Analysis (PCA). Like PCA, we are trying to find an explanation of the data as a small linear combinations of vectors. Unlike PCA, we have extra restrictions, like sparsity or non-negativity.

When the ARMs are stacked, therefore, it seems natural to train the network greedily, since the output of the previous layer is input in the new layer. The SARM authors of the paper propose the following heuristic for initializing a sparse ARM:

- Take the raw data, $x_i$, and solve the Dictionary Learning problem to yield $W$ and $y_i$'s.
- Use $y_i$'s as the raw data for the next layer, and repeat.

The method of choice for Dictionary Learning problem described is $k$-PCA.

This technique eerily recalls our forgotten hero the Restricted Boltzman Machine (RBM). Like the RBM, it too is a form of unsupervised pre-training which learns its weights greedily. Unlike the RBM, we don't need to deal with the uncomfortable transplantation of the weights into a deep, directed network. The $k$-ARM unifies initialization and inference, making our weight surgery less jarring.

This rationalization may soothe those of us who crave explanations for things which work, but the most valuable proof is in the pudding. The stacked ARMs show exciting results on training data, and is a great first step in what I see as an exciting direction of research.

Oh, and here's a link to my website.

The Picture in the title, as well as the sparse encoding picture, is taken from Olshausen et al, Sparse Coding With An Overcomplete Basis Set: A Strategy Employed by V12?. Thanks for all the excellent commenters in here, especially jcannell for excellent insights (and comments), and Marcos Ginestra, for proofreading the whole thing. Also thanks to Kiuhnm for pointing an error in the gradient formula, and other corrections.

In [2]:

```
from IPython.core.display import HTML
HTML(open("style.css","r").read())
```

Out[2]:

In [ ]:

```
```