AANN 16/05/2024

Universal approximation theorem
- Theorem
- (Sketch of) proof

Universal approximation theorem

Theorem

Assume \(\sigma\in C(\mathbb{R},\mathbb{R})\) non-constant with finite (and distinct) limits as \(x\to\pm\infty\) and \((\sigma\circ~x)_{i}=\sigma(x_{i})\) (i.e. the element-wise application). If \(n,m\in\mathbb{N}\) and \(f\in C(K,\mathbb{R}^{m})\) where \(K\subset\mathbb{R}^{n}\) is compact, then, for all \(\epsilon>0\) there exists \(k\in\mathbb{N},A\in\mathbb{R}^{k\times~n},b\in\mathbb{R}^{k},D\in\mathbb{R}^{m\times k}\) such that

\[ \left\| f(x) - g(x) \right\|_{\infty} < \epsilon \]

where \(g(x) = D \cdot (\sigma \circ (A\cdot x + b))\).

Recall the supremum norm is given by \(\left\| f \right\|_{\infty} = \sup_{x}|f(x)|\).

This means that, for a continuous function \(f\), given a suitable activation function \(\sigma\), a wide enough neural network with a single hidden layer is capable of approximating \(f\) arbitrarily well (in the supremum norm). This doesn't tell you how to construct such a network, but it does guarantee that one exists.

The universal approximation theorem as usually stated is slightly stronger than this but this is the result I am most interested in. The "proof" here is cobbled together from a few sources (including Wikipedia).

(Sketch of) proof

Uniform convergence in \(\mathbb{R}^{m}\) follows from uniform convergence in each of the coordinates, so it is sufficient to consider the case \(m=1\)
Lemma we can approximate the piecewise linear functions arbitrarily well.
The Stone-Weierstrass theorem tells us that the piecewise linear functions are dense in \(C(K,\mathbb{R})\) hence contains our desired function \(g\).

Lemma

We get step functions by scaling the input sufficiently and using the assumption that the \(\sigma\) has finite (and distinct) limits. This means we can choose \(A\) and \(b\) such that \(\sigma(A\cdot~x+b)\) to get functions arbitrarily close to some step function. The value of \(b\) controls the location of the step. With \(D\) we can then scale this to arbitrary height.
Taking a linear combination of linear combinations can be achieved with \(D\), so we can take linear combinations of the step function to approximate a ramp function:

\[ f(x) = \begin{cases} 0 & x< 0 \\ x & 0\leq x\leq 1 \\ 1 & 1< x \end{cases} \]

Using these ramp functions we can construct arbitrary piecewise linear functions.

Stone-Weierstrass theorem

Let \(K\) be a compact Hausdorff space and \(\mathcal{F}\) a subalgebra of \(C(K,\mathbb{R})\) containing a non-zero constant function. Then \(\mathcal{F}\) is dense in \(C(K,\mathbb{R})\) if and only if \(\mathcal{F}\) separates points in \(K\).

Recall that a set of functions on \(K\) separates points if for any \(x,y\in~K\) if \(x\neq~y\) then there is an \(f\in\mathcal{F}\) such that \(f(x)\neq~f(y)\). A space is Hausdorff if different elements have disjoint neighbourhoods.