Chow–Liu Ordering for Long-Context Reasoning in Chain-of-Agents

Naman Gupta · Vaibhav Singh · Arun Iyer · Kirankumar Shiragur · Pratham Grover · Ramakrishna B Bairi · Ritabrata Maiti · Sankarshan Damle · Shachee Mishra Gupta · Rishikesh Maurya · Vageesh D C

Microsoft

Accepted at ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgent)

OpenReview PDF

TL;DR. Sequential multi-agent reasoning (Chain-of-Agents) over long contexts is highly sensitive to the order in which chunks are processed. We model inter-chunk dependencies with a Chow–Liu tree over embedding-based pairwise similarities, root it at the chunk most similar to the query, and process chunks in breadth-first order. This dependency-aware ordering consistently beats default document order and dense semantic ordering across LongQA, LongQA-MC, and NarrativeQA, with up to +10.68% relative EM gain.

Method Overview

Figure 1. Overview of Chain-of-Agents (CoA) with chunk order induced by a breadth-first traversal of the Chow–Liu dependency tree (CL–order). First, we build a complete graph on chunks where each edge weight is an embedding-based proxy for mutual information between two chunks. The maximum-spanning tree over this graph yields the Chow–Liu tree, a second-order approximation of the global dependency structure. A breadth-first traversal rooted at the chunk most similar to the query gives the CL–order used by the worker agents.

Why Order Matters in Chain-of-Agents

Large Language Models reason brilliantly within their effective context window, but degrade rapidly when the relevant evidence is spread across hundreds of thousands of tokens. Chain-of-Agents (CoA) is a popular sequential framework for this regime: a long document is split into chunks, and a chain of LLM-based worker agents reads them one at a time, each updating a bounded shared memory $M_k$. A final manager agent reads $M_N$ and produces the answer.

From a probabilistic standpoint, CoA is trying to approximate the ideal posterior

$$ P^\star(y \mid q, x_{1:N}) $$

that a model with infinite context could compute directly. CoA replaces this with a latent-state factorization in which only a compressed summary is passed between agents. Because each memory update operates under a fixed token budget, ingesting a new chunk forces the worker to discard or compress earlier evidence. The final state $M_N$, and therefore the answer, depends on the order $\pi$ in which chunks are processed:

$$ P_\pi(y \mid q, x_{1:N}) \neq P^\star(y \mid q, x_{1:N}). $$

In the idealized expressive limit, all orderings would be equivalent. In practice, real LLMs have finite capacity, so the discrepancy

$$ \mathcal{E}(\pi) = \mathrm{KL}\!\left(P^\star \,\|\, P_\pi\right) $$

is strictly positive and depends on how well $\pi$ respects the latent dependencies among chunks. Despite this, existing CoA pipelines typically use the default document order, or rank chunks by dense similarity to the query — neither of which models pairwise relationships between chunks.

Chow–Liu Trees as an Ideal Structural Surrogate

We treat the answer $Y$ and the chunks $X_1, \dots, X_N$ as random variables conditional on the query $q$. The joint distribution $P(Y, X_{1:N} \mid q)$ captures both (i) which chunks contain answer-relevant evidence, and (ii) which chunks are jointly required (multi-hop reasoning, corroborating details, related entities).

Directly modeling this joint is intractable. Following the classical Chow–Liu construction, we approximate it with a tree-structured graphical model rooted at the answer node:

$$ P_T(Y, X_{1:N} \mid q) \;=\; P(Y \mid q) \prod_{i=1}^{N} P\bigl(X_i \,\big|\, Z_{\mathrm{pa}_T(i)},\, q\bigr). $$

The Chow–Liu tree $T^\star$ is the maximum-likelihood tree over this family, which is equivalently the maximum-weight spanning tree on pairwise mutual information:

$$ T^\star \;=\; \arg\max_{T \text{ tree}} \sum_{(i,j)\in E(T)} \hat{I}(Z_i; Z_j). $$

A linearization of this tree (e.g., breadth-first from the answer node) yields an ordering that (i) places chunks most informative about $Y$ early, and (ii) keeps mutually dependent chunks adjacent, exposing jointly relevant evidence to CoA before compression irrevocably discards it. This is exactly the bias CoA needs.

An Unsupervised, Embedding-Based Surrogate

Estimating mutual information from data is unrealistic in the long-context regime — labeled $(q, x_{1:N}, y)$ triples are scarce. Motivated by contrastive MI bounds, we instead use a sentence encoder $\phi(\cdot)$ and treat cosine similarity

$$ s_{ij} \;=\; \frac{\phi(x_i)^\top \phi(x_j)}{\|\phi(x_i)\|\,\|\phi(x_j)\|} $$

as a surrogate indicator of dependency strength. The embedding-based Chow–Liu tree is the maximum-weight spanning tree over $\{s_{ij}\}$.

At inference time the answer $Y$ is unobserved, so we cannot root the tree at it. The observable signal most correlated with $Y$ is the query, so we choose as root the chunk whose embedding is most similar to the query embedding:

$$ r(q) \;=\; \arg\max_i\, \mathrm{cos}\!\left(\phi(q),\, \phi(x_i)\right). $$

A BFS from $r(q)$ produces the CL–order $\pi_{T^{\mathrm{sim}}, q}$ used by the workers.

The CL–CoA algorithm

Algorithm 1 Chow–Liu CoA dependency-aware chunk ordering for Chain-of-Agents

Inputquery $q$; chunks $\{x_i\}_{i=1}^{N}$; embedding encoder $\phi$; worker and manager LLMs.

Outputpredicted answer $\hat{y}$.

for $i = 1, \dots, N$ do $e_i \leftarrow \phi(x_i)$ # embed each chunk
for all $i < j$ do $s_{ij} \leftarrow \cos(e_i, e_j)$ # pairwise dependency proxy
$T^{\mathrm{sim}} \leftarrow \mathrm{MaxSpanningTree}\bigl(\{s_{ij}\}\bigr)$ # embedding-based Chow–Liu tree
$e_q \leftarrow \phi(q)$; $r \leftarrow \arg\max_i \cos(e_q, e_i)$ # query-conditioned root
$\pi \leftarrow \mathrm{BFS}(T^{\mathrm{sim}},\, r)$ # breadth-first processing order
$M_0 \leftarrow \varnothing$ # initialize shared memory
for $k = 1, \dots, N$ do
1. $j \leftarrow \pi(k)$
2. $M_k \leftarrow \mathrm{Worker}\bigl(q,\, x_j,\, M_{k-1}\bigr)$ # compress new evidence into memory
$\hat{y} \leftarrow \mathrm{Manager}(q,\, M_N)$
return $\hat{y}$

Lines 1–5 build the embedding-based Chow–Liu tree and turn it into a query-conditioned BFS ordering. Lines 6–9 are the standard Chain-of-Agents inner loop — we change only the order in which chunks are fed to the workers.

Results

We evaluate on three long-context benchmarks from HELMET: LongQA and LongQA-MC from $\infty$-Bench, and NarrativeQA restricted to queries with context length over 256K tokens. For free-form answers we report Ragas answer relevance; for multiple choice we report exact-match (EM) accuracy. Chunks are 8K tokens with an 8K-token memory budget per agent. Embeddings are Text-Embedding-3-Large.

Model	LongQA (Ragas ↑)			LongQA-MC (EM ↑)			NarrativeQA (Ragas ↑)
Model	Default	Dense	CL–order	Default	Dense	CL–order	Default	Dense	CL–order
Qwen-3-14B	41.43	42.25	44.12	24.89	26.20	30.26	35.72	38.26	41.23
GPT-4.1-mini	51.94	47.96	54.35	65.22	67.39	70.29	50.16	50.81	52.39
GPT-4.1	59.03	58.56	60.68	82.84	84.33	85.07	57.30	55.93	58.08

Table 1. Performance on long-context benchmarks with the GPT-4.1 series and Qwen-3. CL–order consistently outperforms both Default document order and the Dense semantic-score baseline across all three models and all three datasets.

Across the board, CL–order wins on every (model, benchmark) pair. The aggregate gains over the strongest baseline are +10.68% relative EM on LongQA-MC versus default order, and +6.89% relative EM versus dense semantic ranking. On Ragas-based answer relevance the corresponding relative gains are +5.86% and +6.01%. Notably, the dense semantic baseline is inconsistent — for GPT-4.1 and GPT-4.1-mini on LongQA, it actually hurts performance relative to the default document order, confirming that purely local query similarity is the wrong objective when chunks must compose under a memory bottleneck.

Ablation: BFS on Chow–Liu vs. greedy DFS

Figure 2. Greedy DFS over the complete chunk graph vs. BFS over the Chow–Liu tree (CL–order), on LongQA-MC. CL–order wins across all three models. DFS chains chunks by purely local similarity; one step toward a high-similarity but contextually irrelevant neighbor is enough to derail the traversal. The Chow–Liu MST instead commits to globally optimal pairwise dependencies before any traversal happens.

Ablation: scoring function and embeddings

We also swap Text-Embedding-3-Large for two alternatives. With the open-weight Qwen-3-Embedding-8B, CL–order again wins consistently, with the largest gains on the smaller models. With sparse BM25 the gains shrink and are less consistent: BM25 measures lexical overlap rather than semantic dependency, and is only a coarse proxy for mutual information.

Takeaways

Order is structure, not preprocessing. Under a memory bottleneck, chunk ordering is a structured-inference choice that materially changes what the final memory state can represent.
Pairwise dependencies matter more than per-chunk relevance. Ranking chunks by query similarity (the Dense baseline) doesn't preserve composability, and can actively hurt.
Chow–Liu trees are a great fit. They give the maximum-likelihood tree approximation to the latent dependency structure, are cheap to compute from embeddings, and require no training data.
Global > local. Building the MST first and traversing second beats greedy local chaining (DFS) on every model we tested.
Smaller models benefit most. Gains are most pronounced for GPT-4.1-mini and Qwen-3-14B — smaller workers are more sensitive to noisy context, and CL–order gives them cleaner trajectories.

BibTeX

@inproceedings{gupta2026chowliu, title = {Chow--Liu Ordering for Long-Context Reasoning in Chain-of-Agents}, author = {Gupta, Naman and Singh, Vaibhav and Iyer, Arun and Shiragur, Kirankumar and Grover, Pratham and Bairi, Ramakrishna B and Maiti, Ritabrata and Damle, Sankarshan and Gupta, Shachee Mishra and Maurya, Rishikesh and D C, Vageesh}, booktitle = {ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems}, year = {2026}, url = {https://openreview.net/forum?id=krfs16Y8SA} }