# Section under main title
LLMLAGBENCH_INTRO = """
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff, creating a **strict knowledge boundary** beyond which models cannot provide accurate information without querying external sources. More subtly, when this limitation is unknown or ignored, LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning or classification tasks, **compromising response accuracy**.

LLMLagBench provides a systematic approach for **identifying the earliest probable temporal boundaries** of an LLM's training data by evaluating its knowledge of recent events. The benchmark comprises **1,700+ curated questions** about events sampled from news reports published between January 2021 and October 2025; we plan to update the question set regularly. Each question could not be accurately answered before the event was reported in news media.

We evaluate model responses with a **0-2 scale faithfulness metric** (the accuracy of a model's responses to queries about time-sensitive knowledge, judged against gold answers) and apply the **PELT (Pruned Exact Linear Time)** changepoint detection algorithm to identify where model performance exhibits statistically significant drops, revealing models' actual knowledge cutoffs.

Our analysis of major LLMs reveals that knowledge infusion **operates differently across training phases**, often resulting in **multiple partial cutoff points** rather than a single sharp boundary. **Provider-declared cutoffs** and model self-reports **frequently diverge** from empirically detected boundaries by months or even years, underscoring the necessity of independent empirical validation.

We describe our methodology in https://arxiv.org/pdf/2511.12116.
"""

# Section above the leaderboard table
LEADERBOARD_INTRO = """
The leaderboard below ranks models by their **Overall Average** faithfulness score (0-2 scale) across all 1,700+ questions spanning 2021-2025. The table also displays **Provider Cutoff** dates as declared by model developers, the **1st and 2nd Detected Cutoffs** identified by LLMLagBench's PELT algorithm, and additional metadata including release dates and model parameters.

**Notable discrepancies** between provider-declared cutoffs and empirically detected cutoffs reveal cases **where models' actual knowledge boundaries differ significantly from official declarations**, sometimes by months or even years.
"""
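
# A minimal, illustrative sketch of the changepoint step described above, assuming the
# `ruptures` package and hypothetical inputs (`scores`: 0-2 faithfulness values ordered
# by event date; `event_dates`: the matching dates). This is not the benchmark's actual
# evaluation code; the cost model and penalty below are placeholder choices.
def detect_cutoffs(scores, event_dates, penalty=5.0):
    """Return approximate cutoff dates and per-segment mean faithfulness for one model."""
    # Imported lazily so this strings-only module keeps no hard runtime dependencies.
    import numpy as np
    import ruptures as rpt

    signal = np.asarray(scores, dtype=float).reshape(-1, 1)
    # PELT finds an optimal segmentation in near-linear time; `pen` controls sensitivity.
    breakpoints = rpt.Pelt(model="rbf").fit(signal).predict(pen=penalty)

    segment_means, start = [], 0
    for end in breakpoints:  # breakpoints are exclusive segment end indices
        segment_means.append(float(signal[start:end].mean()))
        start = end

    # The final breakpoint is just the series length; each earlier one marks the first
    # event of a new performance segment, i.e., a candidate knowledge cutoff.
    cutoff_dates = [event_dates[b] for b in breakpoints[:-1]]
    return cutoff_dates, segment_means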
# Section for Model Comparison
MODEL_COMPARISON_INTRO = """
The visualizations below present detailed per-model analysis using the PELT (Pruned Exact Linear Time) changepoint detection algorithm to **identify significant shifts in faithfulness** scores over time.

- **Blue scatter points** represent individual faithfulness scores (0-2 scale, left y-axis) for questions ordered by event date.
- **Red horizontal lines** indicate mean faithfulness within segments identified by PELT, with red dashed vertical lines marking detected changepoints: possible training boundaries where performance characteristics shift.
- **The green curve** shows cumulative average faithfulness over time.
- **The orange line** (right y-axis) tracks cumulative refusals, revealing how often models decline to answer questions beyond their knowledge boundaries.
- **Yellow percentage labels** indicate refusal rates within each segment. Models instruction-tuned to acknowledge their limitations exhibit sharp increases in refusals after their cutoff dates, while others continue attempting answers despite lacking relevant training data, potentially leading to hallucination.

Select models to compare how different architectures and training approaches handle temporal knowledge boundaries. Some models exhibit a single sharp cutoff, while others show multiple partial boundaries, possibly corresponding to distinct pretraining, continued pretraining, and post-training phases.
"""

AUTHORS = """