Frequency principle/spectral bias

The frequency principle/spectral bias is a phenomenon observed in the study of artificial neural networks (ANNs), specifically deep neural networks (DNNs). It describes the tendency of deep neural networks to fit target functions from low to high frequencies during the training process.

This phenomenon is referred to as the frequency principle (F-Principle) by Zhi-Qin John Xu et al. or spectral bias by Nasim Rahaman et al. The F-Principle can be robustly observed in DNNs, regardless of overparametrization. A key mechanism of the F-Principle is that the regularity of the activation function translates into the decay rate of the loss function in the frequency domain.

The discovery of the frequency principle has inspired the design of DNNs that can quickly learn high-frequency functions. This has applications in scientific computing, image classification, and point cloud fitting problems. Furthermore, it provides a means to comprehend phenomena in practical applications and has inspired numerous studies on deep learning from the frequency perspective.

Experimental results


In one-dimensional problems, the discrete Fourier transform (DFT) of the target function and of the DNN output can be computed directly. Fig. 1 shows that the DNN output (blue line) fits the low-frequency components faster than the high-frequency ones.
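The comparison in the frequency domain can be sketched as follows. This is a minimal illustration, not the procedure used to produce Fig. 1: the function name `relative_error_per_frequency` and the synthetic "network output" are hypothetical, chosen only to show how a per-frequency error can be read off the DFT.

```python
import numpy as np

def relative_error_per_frequency(target, output):
    """Relative DFT error |F[output] - F[target]| / |F[target]| per frequency index."""
    target_hat = np.fft.rfft(target)
    output_hat = np.fft.rfft(output)
    # Guard against division by zero where the target has no energy
    denom = np.maximum(np.abs(target_hat), 1e-12)
    return np.abs(output_hat - target_hat) / denom

# Hypothetical snapshot: the "network output" has captured only the
# low-frequency term of the target so far.
x = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
target = np.sin(x) + 0.5 * np.sin(5.0 * x)
output = np.sin(x)

err = relative_error_per_frequency(target, output)
# err[1] is close to 0 (frequency index 1 is fitted),
# err[5] is close to 1 (frequency index 5 is not fitted yet).
```

In an actual experiment, `output` would be the DNN prediction at a given training step, and the error would be tracked only at the frequency indices where the target has significant energy.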



In two-dimensional problems, Fig. 2 uses a DNN to fit the cameraman image. The DNN starts by learning a coarse image and produces a more detailed image as training progresses. This demonstrates learning from low to high frequencies, which is analogous to how the biological brain remembers an image. The example illustrates the 2D frequency principle, whose preference for low frequencies can be exploited for image restoration, for instance in inpainting tasks. However, the insufficient learning of high-frequency structures must be taken into account. Algorithms developed to address this limitation are introduced later in this article.

In high-dimensional problems, one can use a projection method to visualize the frequency convergence in one particular direction, or use a Gaussian filter to roughly observe the convergence of the low-frequency part and the high-frequency part.
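One way to realise the Gaussian-filter approach is sketched below. It assumes the low-frequency part is obtained by a normalised Gaussian-kernel average over the sample points and the high-frequency part is the remainder. The function name `split_low_high` and the width parameter `delta` are hypothetical names for illustration.

```python
import numpy as np

def split_low_high(x, y, delta):
    """Split values y (n, k) sampled at points x (n, d) into low/high-frequency parts.

    delta is the spatial width of the Gaussian kernel (an assumed parameter):
    a larger delta keeps fewer frequencies in the low-frequency part.
    """
    # Pairwise squared distances between sample points
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * delta ** 2))
    # Normalise each row so that constant functions are preserved
    kernel /= kernel.sum(axis=1, keepdims=True)
    y_low = kernel @ y
    y_high = y - y_low
    return y_low, y_high
```

Applying the same split to the target values and to the DNN outputs at each training step gives two error curves, one per part, whose convergence speeds can then be compared.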

Theoretical results
Based on the following assumptions: i) certain regularity of the target function, the sample distribution function and the activation function; ii) a bounded training trajectory with loss convergence, Luo et al. prove that the change of the high-frequency loss relative to the total loss decays as a power of the separating frequency, with the exponent determined by the regularity assumption. A key aspect of the proof is that composite functions maintain a certain regularity, causing decay in the frequency domain. This result can therefore be applied to general network structures with multiple layers. While this characterization of the F-Principle is very general, it is too coarse-grained to differentiate the effects of network structure or special properties of DNNs. It provides only a qualitative understanding rather than a quantitative characterization of the differences.

A continuous framework for studying machine learning has also been proposed. It suggests that the gradient flows of neural networks are nice flows that obey the F-Principle. This is because they are integral equations, which have higher regularity, and this increased regularity leads to faster decay in the Fourier domain.

Algorithms designed to overcome the challenge of high frequency
Phase shift DNN: PhaseDNN converts the high-frequency components of the data downward to a low-frequency spectrum for learning, and then converts the learned result back to the original high frequency.
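The frequency-shift idea alone can be illustrated with a single complex exponential. The sketch below is not the PhaseDNN architecture itself. It only shows that multiplying by a phase factor moves a component from a high frequency index to a low one and back, and the grid size and shift value are arbitrary choices for illustration.

```python
import numpy as np

n = 256
x = np.arange(n) / n                       # uniform grid on [0, 1)
signal = np.exp(2j * np.pi * 50 * x)       # one component at frequency index 50

shift = 48                                 # hypothetical shift amount
# Multiplying by exp(-i * 2*pi * shift * x) moves the component down to index 2
shifted = signal * np.exp(-2j * np.pi * shift * x)

peak_before = int(np.argmax(np.abs(np.fft.fft(signal))))
peak_after = int(np.argmax(np.abs(np.fft.fft(shifted))))

# After learning the low-frequency version, the inverse phase factor
# moves the result back to the original frequency.
restored = shifted * np.exp(2j * np.pi * shift * x)
```

Here a network would be trained on `shifted`, which is a low-frequency function, and its output would be multiplied by the inverse phase factor to recover the high-frequency component.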

Adaptive activation functions: Adaptive activation functions replace the activation function $$\sigma(x)$$ by $$\sigma(\mu ax)$$, where $$\mu$$ is a fixed scale factor with $$\mu\geq1$$ and $$a$$ is a trainable variable shared by all neurons.
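A minimal sketch of such an activation, assuming $$\sigma=\tanh$$ (the choice of $$\tanh$$, the function name `adaptive_tanh` and the default value of `mu` are illustrative assumptions):

```python
import numpy as np

def adaptive_tanh(x, a, mu=10.0):
    """Evaluate sigma(mu * a * x) with sigma = tanh.

    a  : trainable slope, shared by all neurons (a plain float in this sketch)
    mu : fixed scale factor, mu >= 1
    """
    return np.tanh(mu * a * x)

# The slope at the origin is mu * a, so increasing a during training
# steepens the activation and lets the network represent faster oscillations.
h = 1e-6
slope_at_zero = (adaptive_tanh(h, a=0.5) - adaptive_tanh(-h, a=0.5)) / (2 * h)
```

In a training framework, `a` would be registered as a trainable parameter and updated by the optimizer together with the weights.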

Multi-scale DNN: To alleviate the high-frequency difficulty for high-dimensional problems, a Multi-scale DNN (MscaleDNN) method considers the frequency conversion only in the radial direction. The conversion in the frequency space can be done by scaling, which is equivalent to an inverse scaling in the spatial space.

The first kind of MscaleDNN takes the following form $$ f(\boldsymbol{x};\boldsymbol{\theta}) = \boldsymbol{W}^{[L-1]} \sigma\circ(\cdots (\boldsymbol{W}^{[1]} \sigma\circ(\boldsymbol{K}\odot(\boldsymbol{W}^{[0]} \boldsymbol{x}) + \boldsymbol{b}^{[0]} ) + \boldsymbol{b}^{[1]} )\cdots)+\boldsymbol{b}^{[L-1]}, $$ where $$\boldsymbol{x}\in\mathbb{R}^d$$, $$\boldsymbol{W}^{[l]}\in\mathbb{R}^{m_{l+1}\times m_{l}}$$, $$m_l$$ is the number of neurons in the $$l$$-th hidden layer, $$m_0=d$$, $$\boldsymbol{b}^{[l]}\in\mathbb{R}^{m_{l+1}}$$, $$\sigma$$ is a scalar function, $$\circ$$ denotes entry-wise application, $$\odot$$ is the Hadamard product, and $$\boldsymbol{K}=(\underbrace{a_1,a_1,\cdots,a_1}_{\text{1st part}},a_2,\cdots,a_{i-1},\underbrace{a_i,a_i,\cdots,a_i}_{\text{ith part}},\cdots,\underbrace{a_{N},a_{N},\cdots,a_{N}}_{\text{Nth part}})^T$$ with $$\boldsymbol{K}\in\mathbb{R}^{m_{1}}$$ and $$a_i=i$$ or $$a_i=2^{i-1}$$. This structure is called Multi-scale DNN-1 (MscaleDNN-1).
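The forward pass defined by this formula can be sketched in NumPy as follows. The helper names `make_scale_vector` and `mscale_dnn1`, the equal-sized parts of $$\boldsymbol{K}$$, the choice $$\sigma=\tanh$$ and the random layer sizes are assumptions made for illustration only.

```python
import numpy as np

def make_scale_vector(m1, n_scales, power_of_two=True):
    """Build K in R^{m1}: n_scales equal parts, the i-th part filled with a_i."""
    assert m1 % n_scales == 0, "this sketch assumes equal-sized parts"
    idx = np.arange(1, n_scales + 1)
    a = 2.0 ** (idx - 1) if power_of_two else idx.astype(float)
    return np.repeat(a, m1 // n_scales)

def mscale_dnn1(x, weights, biases, K, sigma=np.tanh):
    """Evaluate W[L-1] sigma(... W[1] sigma(K * (W[0] x) + b[0]) + b[1] ...) + b[L-1]."""
    # First layer: the scale vector K multiplies W[0] x entry-wise
    h = sigma(K * (weights[0] @ x) + biases[0])
    # Intermediate hidden layers
    for W, b in zip(weights[1:-1], biases[1:-1]):
        h = sigma(W @ h + b)
    # Final linear layer, no activation
    return weights[-1] @ h + biases[-1]

# Hypothetical sizes: d = 2 inputs, hidden widths 6 and 4, scalar output
rng = np.random.default_rng(0)
sizes = [2, 6, 4, 1]
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(3)]
biases = [rng.normal(size=sizes[i + 1]) for i in range(3)]
K = make_scale_vector(m1=6, n_scales=3)    # parts filled with 1, 2, 4
y = mscale_dnn1(rng.normal(size=2), weights, biases, K)
```

Setting `n_scales=1` makes every entry of `K` equal to 1, in which case the function reduces to an ordinary fully connected network.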

The second kind of MscaleDNN, denoted MscaleDNN-2 in Fig. 3, is a sum of $$N$$ subnetworks, in which each scaled input goes through its own subnetwork. In MscaleDNN-2, the weight matrices from $$W^{[1]}$$ to $$W^{[L-1]}$$ are block diagonal. Again, the scale coefficient is $$a_i=i$$ or $$a_i=2^{i-1}$$.

Fourier feature network: A Fourier feature network maps the input $$\boldsymbol{x}$$ to $$\gamma(\boldsymbol{x})=[a_1 \cos(2\pi \boldsymbol{b}_{1}^{T}\boldsymbol{x}), a_1 \sin(2\pi \boldsymbol{b}_{1}^{T}\boldsymbol{x}),\cdots ,a_m \cos(2\pi \boldsymbol{b}_{m}^{T}\boldsymbol{x}),a_m \sin(2\pi \boldsymbol{b}_{m}^{T}\boldsymbol{x})]$$ for image reconstruction tasks. $$\gamma(\boldsymbol{x})$$ is then used as the input to the neural network. An extended Fourier feature network has been proposed for PDE problems, in which the $$\boldsymbol{b}_i$$ are selected from different ranges. Ben Mildenhall et al. successfully apply this multiscale Fourier feature input in neural radiance fields for view synthesis.
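The input mapping can be sketched as below, assuming the usual pairing of a cosine and a sine for each $$\boldsymbol{b}_i$$. The function name `fourier_features`, the Gaussian sampling of the $$\boldsymbol{b}_i$$ and the value of `scale` are illustrative assumptions, not prescribed by the text above.

```python
import numpy as np

def fourier_features(x, B, a=None):
    """Map x (d,) to [a_1 cos(2 pi b_1.x), a_1 sin(2 pi b_1.x), ..., a_m sin(2 pi b_m.x)].

    B : (m, d) array whose rows are the frequency vectors b_i
    a : (m,) amplitudes, defaulting to all ones
    """
    proj = 2.0 * np.pi * (B @ x)
    if a is None:
        a = np.ones(B.shape[0])
    # Stack to shape (m, 2), then flatten so cos/sin pairs are interleaved
    feats = np.stack([a * np.cos(proj), a * np.sin(proj)], axis=-1)
    return feats.reshape(-1)

# Hypothetical setup: m = 4 frequency vectors in d = 2 dimensions,
# drawn from a Gaussian whose standard deviation sets the frequency range.
rng = np.random.default_rng(0)
scale = 10.0
B = scale * rng.normal(size=(4, 2))
gamma = fourier_features(np.array([0.25, 0.5]), B)   # shape (8,)
```

Drawing different groups of rows of `B` with different values of `scale` is one way to obtain frequency vectors from several ranges at once.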

Frequency perspective for understanding experimental phenomena
Compression phase: The F-Principle explains the compression phase in the information plane. The entropy, or information, reflects how many output values are possible, i.e., more possible output values lead to a higher entropy. In learning a discretized function, the DNN first fits the continuous low-frequency components of the discretized function, which corresponds to a large-entropy state. Then, the DNN output tends to become discretized as the network gradually captures the high-frequency components, so the entropy decreases. Thus, the compression phase appears in the information plane.

Increasing complexity: The F-Principle also explains the increasing complexity of the DNN output during training.

Strength and limitation: The F-Principle points out that deep neural networks are good at learning low-frequency functions but have difficulty learning high-frequency functions.

Early-stopping trick: As noise is often dominated by high frequencies, a neural network with spectral bias can avoid learning high-frequency noise through early stopping.