Decentralized Instruction Tuning

Conflict-Aware Splitting and Weight Merging

Minsik Choi¹,²,* Geewook Kim¹,³,*,†

¹NAVER Cloud AI ²Korea University ³KAIST AI

*Equal contribution · †Corresponding author

To appear at ICML 2026


Dataset-level gradients form sharp directional clusters — heterogeneity is structured, not noise.

[Figure] t-SNE of dataset-level gradients across 136 Vision-FLAN tasks. Distinct clusters emerge along instruction type, a signal MERIT exploits via PCA.

[Figure] PCA-based groups extend along conflicting directions from θ0, while the merged model re-centers, aggregating complementary updates rather than averaging noise.

TL;DR. Split a heterogeneous mixture along top PCA axes of dataset-level gradient conflict, train each branch independently, merge once. Improves the 8-benchmark multimodal average from 54.3 → 57.0 (~5% relative) on Qwen2.5-VL-3B with 136 Vision-FLAN tasks, while eliminating step-level gradient synchronization.

Instruction tuning large multimodal models on heterogeneous mixtures is bottlenecked by gradient interference and bandwidth-heavy synchronization. We develop a local quadratic theory inside a shared flat basin showing that merging is never worse than the weighted average of individual losses, with improvement governed by curvature-weighted variance; that conflict-aware PCA splitting maximizes the merging gain along high-curvature directions; and that merging acts as curvature-weighted spectral filtering with implicit norm regularization. These implications motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, extracts dominant conflict axes via PCA, partitions datasets accordingly, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging.
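Concretely, the "never worse" claim is the variance identity for a shared quadratic. A minimal worked version in our own notation (H the basin Hessian at a shared minimizer θ*, θ_k the branch solutions, w_k ≥ 0 merge weights summing to one; a sketch of the argument, not the paper's exact statement):

  L(\theta) \approx L(\theta^*) + \tfrac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*),
  \qquad \bar\theta = \sum_k w_k \theta_k,

  \sum_k w_k L(\theta_k) - L(\bar\theta)
    = \tfrac{1}{2} \sum_k w_k (\theta_k - \bar\theta)^\top H (\theta_k - \bar\theta) \;\ge\; 0
  \quad \text{for } H \succeq 0.

The gap is exactly the curvature-weighted variance of the branch parameters, which is why a split that spreads branches along high-curvature conflict directions maximizes the merging gain.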

Method

A five-step pipeline that runs once around a merge-ready initialization θ0.

01
Merge-ready init
LLaVA Stage 1 → Stage 2 training produces θ0, whose neighborhood forms a shared flat basin.
02
Gradient conflicts
Per-dataset mean gradients on a small calibration set (n=200) with stride-s subsampling.
03
PCA split
Cosine-similarity matrix → top-r eigenvectors → token-weighted median split into K = 2^r groups (see the sketch after step 05).
04
Branch training
Fine-tune K branches independently from θ0 — zero cross-branch communication.
05
Token-weighted merge
θ̄ = Σ_k (n_k / Σ_j n_j) · θ_k. One-shot averaging, no retraining (sketched after the figure below).
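For concreteness, a minimal NumPy sketch of steps 02–03 under stated assumptions: each dataset's calibration-set mean gradient has already been flattened into one row of G, and every name below (conflict_split, token_counts, and so on) is ours, not the paper's API.

import numpy as np

def conflict_split(G, token_counts, r=3):
    """Partition datasets into K = 2**r groups along top conflict axes.

    G            : (D, P) array, one mean calibration gradient per dataset.
    token_counts : (D,) array of training-token counts per dataset.
    r            : number of PCA axes; each axis adds one binary split.
    """
    # Cosine-similarity matrix between dataset-level gradients.
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    S = Gn @ Gn.T                                    # (D, D)

    # Top-r eigenvectors of S score each dataset on the dominant
    # conflict axes (PCA on the similarity structure).
    _, eigvecs = np.linalg.eigh(S)                   # ascending eigenvalues
    scores = eigvecs[:, -r:][:, ::-1]                # (D, r), top axis first

    # Token-weighted median split per axis: each dataset gets an r-bit code.
    groups = np.zeros(len(G), dtype=int)
    for axis in range(r):
        s = scores[:, axis]
        order = np.argsort(s)
        cum = np.cumsum(token_counts[order])
        # Threshold where half of all tokens lie below (token-weighted median).
        thresh = s[order][np.searchsorted(cum, cum[-1] / 2)]
        groups = 2 * groups + (s > thresh).astype(int)
    return groups                                    # values in [0, 2**r)

# Toy usage: 6 datasets with 10-dimensional (already subsampled) gradients.
rng = np.random.default_rng(0)
G = rng.normal(size=(6, 10))
counts = rng.integers(1_000, 1_000_000, size=6)
print(conflict_split(G, counts, r=2))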
[Figure] Flat-basin geometry of the merge-ready initialization. Branches fine-tuned from θ0 remain in a single connected low-loss region (verified via linear mode connectivity: 10/10 barriers exactly zero). Merging realizes curvature-weighted variance reduction and implicit norm regularization.
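The one-shot merge (step 05) and the linear-mode-connectivity check from the caption above, again as a hedged NumPy sketch: models are reduced to plain {name: array} state dicts, a simplification of real checkpoints, and the function names are ours.

import numpy as np

def token_weighted_merge(branches, token_counts):
    """θ̄ = Σ_k (n_k / Σ_j n_j) · θ_k, applied parameter-wise."""
    w = np.asarray(token_counts, dtype=float)
    w /= w.sum()
    return {name: sum(wk * br[name] for wk, br in zip(w, branches))
            for name in branches[0]}

def lmc_barrier(loss_fn, theta_a, theta_b, n_points=11):
    """Max loss rise along the segment between two branch solutions,
    relative to the linear interpolation of the endpoint losses
    (exactly zero everywhere in a shared flat basin)."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = [loss_fn({k: (1 - a) * theta_a[k] + a * theta_b[k]
                       for k in theta_a})
              for a in alphas]
    line = [(1 - a) * losses[0] + a * losses[-1] for a in alphas]
    return max(l - e for l, e in zip(losses, line))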

Results

Controlled 3B study on 136 Vision-FLAN tasks, 7B scaling on a 176-source 1.6M mixture, and text-only generalization on 66 FLAN tasks.

Method       | SEED | LLaVA-W | MMVet | TextVQA | AI2D | MathVista | MMMU | Avg.
-------------|------|---------|-------|---------|------|-----------|------|-----
Joint (1 ep) | 69.2 | 41.9    | 36.4  | 68.0    | 62.6 | 34.2      | 41.9 | 54.3
Joint (2 ep) | 70.0 | 42.8    | 37.6  | 63.4    | 62.5 | 36.5      | 43.0 | 54.7
Random-8     | 69.5 | 42.2    | 35.0  | 73.7    | 61.7 | 33.5      | 40.5 | 54.5
MERIT-1D     | 71.0 | 43.1    | 35.0  | 72.4    | 62.1 | 36.5      | 41.4 | 55.2
MERIT-2D     | 70.8 | 47.4    | 36.6  | 74.1    | 61.5 | 36.0      | 40.7 | 55.7
MERIT-3D     | 70.5 | 52.0    | 37.7  | 75.2    | 62.5 | 35.4      | 42.7 | 57.0

8-benchmark average on Qwen2.5-VL-3B, 5-seed primary comparison.

Analysis

Three empirical checks that the theory’s assumptions hold in practice.

Hessian–PCA alignment. PCA conflict directions reach |cos| = 0.133 with the top-10 Hessian eigenvectors (z ≈ 4,000), vs. 5.6×10⁻⁵ for random unit vectors; the split targets high-curvature subspaces exactly as predicted (toy check sketched below).

Calibration stability. Mean cosine similarity to a full 1,000-sample reference gradient saturates around n = 200 across 100 Vision-FLAN tasks (0.847 ± 0.106); small calibration sets suffice.

Robustness across optimizers. Under an SGD learning-rate sweep, MERIT matches or exceeds joint training across the practical range (η ≥ 8×10⁻⁵), mirroring the condition-number reduction derived from the theory.
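For intuition about these alignment numbers, a toy NumPy version of the check: the maximum |cos| of a candidate direction against the top-k eigenvectors of a synthetic PSD "Hessian", compared with the random-unit-vector floor. Everything here (the synthetic H, the dimension P) is illustrative rather than the paper's measurement, which would use Hessian-vector products on the actual model.

import numpy as np

def top_eigenspace_alignment(H, v, k=10):
    """Max |cos| between direction v and the top-k eigenvectors of H."""
    _, eigvecs = np.linalg.eigh(H)        # eigenvalues in ascending order
    U = eigvecs[:, -k:]                   # top-k eigenspace (unit columns)
    v = v / np.linalg.norm(v)
    return np.abs(U.T @ v).max()

rng = np.random.default_rng(0)
P = 500
A = rng.normal(size=(P, P))
H = A @ A.T / P                           # synthetic PSD "Hessian"

# A direction built near the top eigenspace aligns strongly...
u_top = np.linalg.eigh(H)[1][:, -1]
print(top_eigenspace_alignment(H, u_top + 0.5 * rng.normal(size=P) / np.sqrt(P)))
# ...while a random unit vector sits at the ~1/sqrt(P) noise floor.
print(top_eigenspace_alignment(H, rng.normal(size=P)))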

BibTeX

@inproceedings{merit,
  title     = {Decentralized Instruction Tuning: Conflict-Aware Splitting
               and Weight Merging},
  author    = {Choi, Minsik and Kim, Geewook},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2026},
  note      = {To appear}
}