
Ben Rhodeland

brhodeland

AI & ML interests

None yet

Recent Activity

updated a collection about 3 hours ago
Ollama Fodder
reacted to mike-ravkine's post with 🤯 2 days ago
measuring the information content of a reasoning trace seems like a straightforward reasoning-LLM KPI, but how can we achieve this? what if we keep it simple: gzip the resulting text and take the length of the compressed stream... "compressed bytes of information per output token" becomes the KPI.

if we split across correct vs incorrect vs truncated answers and group by difficulty, a whole new world of analysis becomes not just possible but visually intuitive and almost trivial:

1) what is the model's overall reasoning efficiency? this is the slope of the scatterplot curve segments (there may be more than one..)
2) is the model able to apply more test-time compute towards more difficult variations of the task? the two on the left are not, the two on the right are.
3) when applying more test-time compute, is that compute useful? this is the curvature of the scatterplot trends - the two in the middle are 'losing their mojo': as answers get longer, the information content falls.
4) is the model applying multiple approaches to the task? (right) do those approaches change with difficulty?
5) are truncations because we don't have enough context budget (left), or because the model has lost its mind and gone into a repeat loop (middle two)? and does this happen across the board (middle left) or only when the problem is more difficult (middle right)?

would love to hear your feedback on this kind of analysis - is anyone doing similar work? this approach generates 12 plots per model (one for each task), so it's quite a bit of data, and i've been hesitant to publish it so far; consider this post a toe tip.
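A minimal sketch of the KPI the post describes, using Python's standard-library gzip. The helper name compressed_bytes_per_token is hypothetical, and the whitespace token count is an illustration-only stand-in; in practice you would use the model's own tokenizer.

```python
import gzip

def compressed_bytes_per_token(trace: str, num_output_tokens: int) -> float:
    """Gzip the reasoning trace and report compressed bytes per
    output token: a rough proxy for information density."""
    compressed = gzip.compress(trace.encode("utf-8"))
    return len(compressed) / max(num_output_tokens, 1)

# A degenerate repeat-loop trace compresses to almost nothing,
# while a varied trace retains more bytes per token.
loopy = "i should double-check that answer again. " * 200
varied = " ".join(f"step {i}: test divisibility by {i + 1}" for i in range(200))

# Assumption: ~1 token per whitespace-separated word, for illustration only.
print(compressed_bytes_per_token(loopy, len(loopy.split())))    # low ratio
print(compressed_bytes_per_token(varied, len(varied.split())))  # higher ratio
```

Plotting this ratio against answer length, split by correct/incorrect/truncated and grouped by difficulty, would give the slope and curvature reads described in points 1-3 above.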
updated a collection 2 days ago
Ollama Fodder

Organizations

None yet