Talk:Knowledge distillation

Knowledge capacity
This article starts with the claim: "While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized"

This is misleading. It suggests that deep networks have greater storage capacity, which they do not. Hornik et al. [1] showed as early as 1989 that a single hidden layer is sufficient to approximate any (Borel measurable) function to any degree of accuracy, provided it has enough hidden units. Deep learning may make it easier to find good representations, but that is not the same thing as having greater capacity.
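To make the single-hidden-layer point concrete, here is a rough numerical sketch, not Hornik et al.'s construction and not anything taken from the article. It assumes a tanh hidden layer with randomly drawn, fixed hidden weights and fits only the output weights by least squares; the target function sin(x), the interval, the weight scales and the helper names `hidden_features` and `max_error` are all illustrative choices.

```python
import numpy as np


def hidden_features(x, w, b):
    # Activations of one tanh hidden layer, plus a constant column
    # that plays the role of the output bias.
    return np.column_stack([np.tanh(np.outer(x, w) + b), np.ones_like(x)])


def max_error(width, seed=0):
    # Assumption: hidden weights are random and fixed; only the linear
    # output layer is fitted (least squares). This is a simplification
    # chosen to keep the sketch short, not a training recipe.
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=2.0, size=width)
    b = rng.uniform(-np.pi, np.pi, size=width)

    x_train = np.linspace(-np.pi, np.pi, 200)
    out, *_ = np.linalg.lstsq(
        hidden_features(x_train, w, b), np.sin(x_train), rcond=None
    )

    # Measure the worst-case error on a denser grid than the one used for fitting.
    x_test = np.linspace(-np.pi, np.pi, 1000)
    prediction = hidden_features(x_test, w, b) @ out
    return float(np.max(np.abs(prediction - np.sin(x_test))))


if __name__ == "__main__":
    for width in (2, 10, 50):
        print(width, max_error(width))
```

In this toy setting the worst-case error should shrink as the single hidden layer gets wider, with no extra depth involved. It says nothing about how efficiently a shallow network reaches a given accuracy compared with a deep one, which is a separate question.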

[1] Hornik, Stinchcombe & White, "Multilayer feedforward networks are universal approximators", Neural Networks, vol. 2, issue 5, pp. 359-366, 1989. https://www.sciencedirect.com/science/article/pii/0893608089900208 Olle Gällmo (talk) 11:59, 16 May 2024 (UTC)