Length Generalisation Error in Chain of Thought

When an LLM is trained to reason in 10 steps, the output error grows the further the test reasoning sequence length is from 10.

Simulation with the following parameters: training length = 10, training error = 0.1, length generalisation width = 14.3.
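The chart does not state the functional form behind the simulation, so the following is only a minimal sketch of how such a curve could be produced. It assumes a Gaussian-shaped decay of accuracy around the training length, with the "length generalisation width" acting as the standard deviation. The function name and the Gaussian form are assumptions for illustration, not the source's method; only the three parameter values come from the caption.

```python
import math


def length_generalisation_error(test_length, train_length=10,
                                train_error=0.1, width=14.3):
    """Hypothetical error model (assumed, not taken from the source).

    Accuracy peaks at (1 - train_error) when test_length == train_length
    and decays with a Gaussian profile of standard deviation `width`
    as the test length moves away from the training length.
    """
    peak_accuracy = 1.0 - train_error
    distance = test_length - train_length
    decay = math.exp(-(distance ** 2) / (2.0 * width ** 2))
    return 1.0 - peak_accuracy * decay


if __name__ == "__main__":
    # Print the assumed error for a few test reasoning lengths.
    for steps in (1, 5, 10, 20, 40):
        print(steps, round(length_generalisation_error(steps), 3))
```

Under this assumed form the error equals the training error at length 10, rises symmetrically on either side, and approaches 1 for lengths far from the training length.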

Source: The AK Dispatch