Chow–Liu Ordering for Long-Context Reasoning in Chain-of-Agents

Naman Gupta  ·  Vaibhav Singh  ·  Arun Iyer  ·  Kirankumar Shiragur  ·  Pratham Grover  ·  Ramakrishna B Bairi  ·  Ritabrata Maiti  ·  Sankarshan Damle  ·  Shachee Mishra Gupta  ·  Rishikesh Maurya  ·  Vageesh D C
Microsoft
Accepted at ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems (MemAgent)

TL;DR. Sequential multi-agent reasoning (Chain-of-Agents) over long contexts is highly sensitive to the order in which chunks are processed. We model inter-chunk dependencies with a Chow–Liu tree over embedding-based pairwise similarities, root it at the chunk most similar to the query, and process chunks in breadth-first order. This dependency-aware ordering consistently beats default document order and dense semantic ordering across LongQA, LongQA-MC, and NarrativeQA, with up to +10.68% relative EM gain.

Method Overview

Overview of Chow–Liu CoA chunk ordering

Figure 1. Overview of Chain-of-Agents (CoA) with chunk order induced by a breadth-first traversal of the Chow–Liu dependency tree (CL–order). First, we build a complete graph on chunks where each edge weight is an embedding-based proxy for mutual information between two chunks. The maximum-spanning tree over this graph yields the Chow–Liu tree, a second-order approximation of the global dependency structure. A breadth-first traversal rooted at the chunk most similar to the query gives the CL–order used by the worker agents.

Why Order Matters in Chain-of-Agents

Large Language Models reason brilliantly within their effective context window, but degrade rapidly when the relevant evidence is spread across hundreds of thousands of tokens. Chain-of-Agents (CoA) is a popular sequential framework for this regime: a long document is split into chunks, and a chain of LLM-based worker agents reads them one at a time, each updating a bounded shared memory $M_k$. A final manager agent reads $M_N$ and produces the answer.

From a probabilistic standpoint, CoA is trying to approximate the ideal posterior

$$ P^\star(y \mid q, x_{1:N}) $$

that a model with infinite context could compute directly. CoA replaces this with a latent-state factorization in which only a compressed summary is passed between agents. Because each memory update operates under a fixed token budget, ingesting a new chunk forces the worker to discard or compress earlier evidence. The final state $M_N$, and therefore the answer, depends on the order $\pi$ in which chunks are processed:

$$ P_\pi(y \mid q, x_{1:N}) \neq P^\star(y \mid q, x_{1:N}). $$

In the idealized expressive limit, all orderings would be equivalent. In practice, real LLMs have finite capacity, so the discrepancy

$$ \mathcal{E}(\pi) = \mathrm{KL}\!\left(P^\star \,\|\, P_\pi\right) $$

is strictly positive and depends on how well $\pi$ respects the latent dependencies among chunks. Despite this, existing CoA pipelines typically use the default document order, or rank chunks by dense similarity to the query — neither of which models pairwise relationships between chunks.

Chow–Liu Trees as an Ideal Structural Surrogate

We treat the answer $Y$ and the chunks $X_1, \dots, X_N$ as random variables conditional on the query $q$. The joint distribution $P(Y, X_{1:N} \mid q)$ captures both (i) which chunks contain answer-relevant evidence, and (ii) which chunks are jointly required (multi-hop reasoning, corroborating details, related entities).

Directly modeling this joint is intractable. Following the classical Chow–Liu construction, we approximate it with a tree-structured graphical model rooted at the answer node:

$$ P_T(Y, X_{1:N} \mid q) \;=\; P(Y \mid q) \prod_{i=1}^{N} P\bigl(X_i \,\big|\, Z_{\mathrm{pa}_T(i)},\, q\bigr). $$

The Chow–Liu tree $T^\star$ is the maximum-likelihood tree over this family, which is equivalently the maximum-weight spanning tree on pairwise mutual information:

$$ T^\star \;=\; \arg\max_{T \text{ tree}} \sum_{(i,j)\in E(T)} \hat{I}(Z_i; Z_j). $$

A linearization of this tree (e.g., breadth-first from the answer node) yields an ordering that (i) places chunks most informative about $Y$ early, and (ii) keeps mutually dependent chunks adjacent, exposing jointly relevant evidence to CoA before compression irrevocably discards it. This is exactly the bias CoA needs.

An Unsupervised, Embedding-Based Surrogate

Estimating mutual information from data is unrealistic in the long-context regime — labeled $(q, x_{1:N}, y)$ triples are scarce. Motivated by contrastive MI bounds, we instead use a sentence encoder $\phi(\cdot)$ and treat cosine similarity

$$ s_{ij} \;=\; \frac{\phi(x_i)^\top \phi(x_j)}{\|\phi(x_i)\|\,\|\phi(x_j)\|} $$

as a surrogate indicator of dependency strength. The embedding-based Chow–Liu tree is the maximum-weight spanning tree over $\{s_{ij}\}$.

At inference time the answer $Y$ is unobserved, so we cannot root the tree at it. The observable signal most correlated with $Y$ is the query, so we choose as root the chunk whose embedding is most similar to the query embedding:

$$ r(q) \;=\; \arg\max_i\, \mathrm{cos}\!\left(\phi(q),\, \phi(x_i)\right). $$

A BFS from $r(q)$ produces the CL–order $\pi_{T^{\mathrm{sim}}, q}$ used by the workers.

The CL–CoA algorithm

Algorithm 1   Chow–Liu CoA dependency-aware chunk ordering for Chain-of-Agents
Inputquery $q$; chunks $\{x_i\}_{i=1}^{N}$; embedding encoder $\phi$; worker and manager LLMs.
Outputpredicted answer $\hat{y}$.
  1. for $i = 1, \dots, N$ do   $e_i \leftarrow \phi(x_i)$  # embed each chunk
  2. for all $i < j$ do   $s_{ij} \leftarrow \cos(e_i, e_j)$  # pairwise dependency proxy
  3. $T^{\mathrm{sim}} \leftarrow \mathrm{MaxSpanningTree}\bigl(\{s_{ij}\}\bigr)$  # embedding-based Chow–Liu tree
  4. $e_q \leftarrow \phi(q)$;   $r \leftarrow \arg\max_i \cos(e_q, e_i)$  # query-conditioned root
  5. $\pi \leftarrow \mathrm{BFS}(T^{\mathrm{sim}},\, r)$  # breadth-first processing order
  6. $M_0 \leftarrow \varnothing$  # initialize shared memory
  7. for $k = 1, \dots, N$ do
    1. $j \leftarrow \pi(k)$
    2. $M_k \leftarrow \mathrm{Worker}\bigl(q,\, x_j,\, M_{k-1}\bigr)$  # compress new evidence into memory
  8. $\hat{y} \leftarrow \mathrm{Manager}(q,\, M_N)$
  9. return $\hat{y}$

Lines 1–5 build the embedding-based Chow–Liu tree and turn it into a query-conditioned BFS ordering. Lines 6–9 are the standard Chain-of-Agents inner loop — we change only the order in which chunks are fed to the workers.

Results

We evaluate on three long-context benchmarks from HELMET: LongQA and LongQA-MC from $\infty$-Bench, and NarrativeQA restricted to queries with context length over 256K tokens. For free-form answers we report Ragas answer relevance; for multiple choice we report exact-match (EM) accuracy. Chunks are 8K tokens with an 8K-token memory budget per agent. Embeddings are Text-Embedding-3-Large.

Model LongQA (Ragas ↑) LongQA-MC (EM ↑) NarrativeQA (Ragas ↑)
DefaultDenseCL–order DefaultDenseCL–order DefaultDenseCL–order
Qwen-3-14B 41.4342.2544.12 24.8926.2030.26 35.7238.2641.23
GPT-4.1-mini 51.9447.9654.35 65.2267.3970.29 50.1650.8152.39
GPT-4.1 59.0358.5660.68 82.8484.3385.07 57.3055.9358.08

Table 1. Performance on long-context benchmarks with the GPT-4.1 series and Qwen-3. CL–order consistently outperforms both Default document order and the Dense semantic-score baseline across all three models and all three datasets.

Across the board, CL–order wins on every (model, benchmark) pair. The aggregate gains over the strongest baseline are +10.68% relative EM on LongQA-MC versus default order, and +6.89% relative EM versus dense semantic ranking. On Ragas-based answer relevance the corresponding relative gains are +5.86% and +6.01%. Notably, the dense semantic baseline is inconsistent — for GPT-4.1 and GPT-4.1-mini on LongQA, it actually hurts performance relative to the default document order, confirming that purely local query similarity is the wrong objective when chunks must compose under a memory bottleneck.

Ablation: BFS on Chow–Liu vs. greedy DFS

DFS vs CL-order on LongQA-MC

Figure 2. Greedy DFS over the complete chunk graph vs. BFS over the Chow–Liu tree (CL–order), on LongQA-MC. CL–order wins across all three models. DFS chains chunks by purely local similarity; one step toward a high-similarity but contextually irrelevant neighbor is enough to derail the traversal. The Chow–Liu MST instead commits to globally optimal pairwise dependencies before any traversal happens.

Ablation: scoring function and embeddings

We also swap Text-Embedding-3-Large for two alternatives. With the open-weight Qwen-3-Embedding-8B, CL–order again wins consistently, with the largest gains on the smaller models. With sparse BM25 the gains shrink and are less consistent: BM25 measures lexical overlap rather than semantic dependency, and is only a coarse proxy for mutual information.

Takeaways

  • Order is structure, not preprocessing. Under a memory bottleneck, chunk ordering is a structured-inference choice that materially changes what the final memory state can represent.
  • Pairwise dependencies matter more than per-chunk relevance. Ranking chunks by query similarity (the Dense baseline) doesn't preserve composability, and can actively hurt.
  • Chow–Liu trees are a great fit. They give the maximum-likelihood tree approximation to the latent dependency structure, are cheap to compute from embeddings, and require no training data.
  • Global > local. Building the MST first and traversing second beats greedy local chaining (DFS) on every model we tested.
  • Smaller models benefit most. Gains are most pronounced for GPT-4.1-mini and Qwen-3-14B — smaller workers are more sensitive to noisy context, and CL–order gives them cleaner trajectories.

BibTeX

@inproceedings{gupta2026chowliu, title = {Chow--Liu Ordering for Long-Context Reasoning in Chain-of-Agents}, author = {Gupta, Naman and Singh, Vaibhav and Iyer, Arun and Shiragur, Kirankumar and Grover, Pratham and Bairi, Ramakrishna B and Maiti, Ritabrata and Damle, Sankarshan and Gupta, Shachee Mishra and Maurya, Rishikesh and D C, Vageesh}, booktitle = {ICLR 2026 Workshop on Memory for LLM-Based Agentic Systems}, year = {2026}, url = {https://openreview.net/forum?id=krfs16Y8SA} }
Back to Blog
Research LLM Long Context Chain-of-Agents ICLR 2026