Talk:Neural scaling law

More data
https://epochai.org/blog/extrapolating-performance-in-language-modelling-benchmarks

PaLM2 paper. Almost no details, but there's something.

Table-of-contents excerpt from the paper:

> 2 Scaling law experiments
> 2.1 Scaling laws (p. 7)
> 2.2 Downstream metric evaluations (p. 8)
> A Detailed results
> A.1 Scaling laws

pony in a strange land (talk) 15:25, 20 May 2023 (UTC)

[[2305.18565] PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/abs/2305.18565)

[[2306.13575] Scaling MLPs: A Tale of Inductive Bias](https://arxiv.org/abs/2306.13575)

> performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet)
> lack of inductive bias is compensated
> MLPs mimic the behaviour of their modern counterparts faithfully, with some components in the learning setting however surprisingly exhibiting stronger or unexpected behaviours

Trading training and inference costs
[[2104.03113] Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)

> Training compute and inference compute (MCTS) can be traded off against each other. 10x more MCTS steps is almost the same as training 10x more.
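A minimal sketch of the tradeoff, under the assumption (mine, not the paper's fitted model) that playing strength is roughly linear in the logs of train-time and test-time compute with similar slopes; the coefficient values are invented for illustration:

```python
import math

# Hypothetical Elo gained per decade (10x) of each kind of compute.
# With equal slopes, 10x more MCTS steps buys as much as 10x more training.
A_TRAIN = 500.0
A_TEST = 500.0

def elo(train_compute: float, test_compute: float) -> float:
    """Toy log-linear strength model (illustrative, not the paper's fit)."""
    return A_TRAIN * math.log10(train_compute) + A_TEST * math.log10(test_compute)

base = elo(1e15, 1e3)
more_train = elo(1e16, 1e3)  # 10x training compute
more_mcts = elo(1e15, 1e4)   # 10x MCTS steps at test time

print(more_train - base)  # same gain either way under equal slopes
print(more_mcts - base)
```

The interesting empirical question the paper answers is what the actual slopes are, and over what compute range the log-linear approximation holds.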

Figure 6, 7 of Alphacode https://arxiv.org/pdf/2203.07814.pdf

Scaling by data quality
[[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)

> the scaling of error with dataset size can be faster than power law scaling, even possibly exponential scaling, if we have a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size
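The pruning mechanics above can be sketched in a few lines; the metric and names here are stand-ins (the paper studies e.g. margin- and embedding-based scores), not the paper's code:

```python
import random

random.seed(0)

def quality_metric(example: float) -> float:
    # Stand-in score; in practice this would be a learned/computed metric.
    return example

# Toy "dataset": each example is just its own score.
dataset = [random.random() for _ in range(10_000)]

def prune(data, keep_fraction: float):
    """Rank all examples by the metric and keep only the top fraction."""
    ranked = sorted(data, key=quality_metric, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]

subset = prune(dataset, 0.2)
print(len(subset))  # 2000 highest-scoring examples survive
```

The paper's claim is about what happens next: training on such ranked subsets can make error fall faster in (pruned) dataset size than the usual power law, provided the metric is good enough.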

If it replicates, incorporate it too. https://arxiv.org/abs/2306.11644


 * 1) started with The Stack (a 3 TB collection of code) and text from StackOverflow
 * 2) used a LLM to select 6B "high-quality" tokens from (1)
 * 3) used GPT-3.5 to generate 1B tokens of text similar to textbooks
 * 4) trained a small (1.3B parameter) model ("phi-1") on (2) and (3)
 * 5) used GPT-3.5 to generate text similar to textbook exercises
 * 6) fine-tuned phi-1 on (5)
 * 7) tested phi-1 on HumanEval to evaluate its programming ability
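The steps above can be outlined as a driver loop; every function here is a placeholder stub with an invented name, not the paper's code or any real API:

```python
# Hypothetical outline of the phi-1 recipe (all names are illustrative stubs).

def llm_quality_score(doc: str) -> float:
    """Stand-in for step 2's LLM-based quality rating."""
    return 1.0 if "def " in doc else 0.0

def filter_high_quality(corpus: list[str]) -> list[str]:
    """Step 2: keep documents the LLM rates as high quality."""
    return [d for d in corpus if llm_quality_score(d) > 0.5]

def generate_synthetic_textbooks() -> list[str]:
    """Steps 3/5: stand-in for GPT-3.5 generating textbook-style text."""
    return ["# Chapter 1: loops\ndef count_up(n): ..."]

def train(model: dict, data: list[str]) -> dict:
    """Stand-in trainer: just records how many documents it saw."""
    model["docs_seen"] = model.get("docs_seen", 0) + len(data)
    return model

corpus = ["def add(a, b): return a + b", "lorem ipsum"]          # step 1
pretrain_data = filter_high_quality(corpus) + generate_synthetic_textbooks()
phi1 = train({}, pretrain_data)                                  # steps 2-4
phi1 = train(phi1, generate_synthetic_textbooks())               # steps 5-6
print(phi1["docs_seen"])  # prints 3
```

The point of the recipe is that aggressive quality filtering plus synthetic "textbook" data lets a small model (step 4) punch above its parameter count on HumanEval (step 7).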

RL scaling
Scaling laws for reward model overoptimization

Gato? RoboCat?

Theoretical explanations
Bottou, L. & Bousquet, O., "The Tradeoffs of Large-Scale Learning", Optimization for Machine Learning, 2011.
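For context, the core of the Bottou–Bousquet analysis is a three-way error decomposition; this is my sketch of the standard statement, with notation only loosely following the paper:

```latex
% Excess error of the learned model \tilde{f}_n over the best possible f^*:
\mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big]
  = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}}
% \mathcal{E}_{\mathrm{app}}: approximation error (restricting to a model family)
% \mathcal{E}_{\mathrm{est}}: estimation error (finite training data)
% \mathcal{E}_{\mathrm{opt}}: optimization error (stopping optimization early)
```

Their point is that at large scale it can be optimal to trade a worse optimization error for more data seen per unit of compute, which is one lens on why scaling-law-style tradeoffs arise.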