
Research

Four threads, one practice.

My work centers on the parts of reinforcement learning that determine whether an agent will survive deployment — not just whether it works on a benchmark.

4 active themes · 7 publications · 6 distinct venues · 42.8 highest IF

Plasticity

Networks that stay malleable when objectives, distributions, and rewards keep changing.

A common failure mode in long-running RL is that the network gradually loses its ability to adapt — gradients shrink, features ossify, performance plateaus even when more data arrives. This is plasticity loss.
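One cheap way to watch this happen is to track the fraction of ReLU units that never fire on a batch. The sketch below is a toy illustration (all names and numbers are mine, not from any specific paper): as pre-activations drift negative over training, more units switch off permanently and the measured dead fraction climbs.

```python
import numpy as np

rng = np.random.default_rng(0)

def dead_unit_fraction(activations, tol=1e-6):
    """Fraction of units that never fire across a batch --
    a common cheap proxy for plasticity loss."""
    fired = (np.abs(activations) > tol).any(axis=0)
    return 1.0 - fired.mean()

# Simulate pre-activations drifting negative as training "ossifies":
# under ReLU, more units end up permanently off.
for drift in (0.0, 2.0, 3.0):
    pre = rng.normal(loc=-drift, scale=1.0, size=(64, 128))  # (batch, units)
    acts = np.maximum(pre, 0.0)
    print(f"drift={drift:.1f}  dead fraction={dead_unit_fraction(acts):.2f}")
```

In a real training loop the same statistic would be computed on live layer activations every few thousand steps; a steadily rising curve is an early warning, well before returns flatten.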

My recent work tackles plasticity from two angles:

  1. Architectural — Plasticity-Aware Mixture of Experts (PA-MoE) routes traffic between experts so that the system as a whole stays adaptive even when individual experts saturate. We provide theoretical justification and validate on adaptive video streaming under non-stationary QoE.

  2. Diagnostic — companion preprints characterize the conditions under which plasticity collapses, and propose remedies that can be folded into existing pipelines.
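For readers unfamiliar with expert routing, here is a minimal generic top-1 mixture-of-experts forward pass. This is only a baseline sketch with made-up shapes and weights; the actual PA-MoE routing rule that preserves adaptivity is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, expert_ws):
    """Generic top-1 MoE: a gate scores experts per input and
    routes each input to its highest-scoring expert."""
    scores = softmax(x @ gate_w)               # (batch, n_experts)
    choice = scores.argmax(axis=-1)            # hard top-1 routing
    out = np.empty((x.shape[0], expert_ws[0].shape[1]))
    for k, w in enumerate(expert_ws):
        mask = choice == k
        out[mask] = x[mask] @ w                # each expert handles its slice
    return out, choice

x = rng.normal(size=(8, 4))                    # 8 inputs, 4 features
gate_w = rng.normal(size=(4, 3))               # gate over 3 experts
experts = [rng.normal(size=(4, 2)) for _ in range(3)]
y, choice = moe_forward(x, gate_w, experts)
print(y.shape, np.bincount(choice, minlength=3))
```

The failure mode PA-MoE targets is visible even here: if one expert's weights saturate, a plain gate keeps sending it traffic, so the routing rule itself must account for each expert's remaining adaptivity.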

The long-term goal is to produce continually learning agents that don’t need to be restarted from scratch every time the environment shifts.

World Models

Learned models that plan, prune, and search — connecting representation to control.

Model-based RL promises sample efficiency: if an agent can simulate its environment accurately, it can plan instead of just react. The catch is that imperfect models compound errors over rollout horizons, so the value of “imagination” decays quickly.
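The compounding is easy to see on a toy scalar system with an almost-correct learned model (the dynamics and the 2% one-step error are illustrative numbers, not from any experiment):

```python
# Toy scalar dynamics s' = a * s, with the learned model slightly wrong.
a_true, a_model = 0.99, 0.97      # ~2% one-step model error
s_true = s_model = 1.0

for t in range(1, 51):
    s_true *= a_true              # real environment
    s_model *= a_model            # imagined rollout
    if t in (1, 10, 50):
        rel_err = abs(s_model - s_true) / abs(s_true)
        print(f"horizon {t:2d}: relative error {rel_err:.1%}")
```

A 2% per-step error is negligible at horizon 1 but compounds multiplicatively, so by horizon 50 the imagined trajectory has drifted by well over half — which is why rollout length has to be treated as a budget, not a free parameter.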

My work in this thread asks how to make rollouts more useful per step:

  • Multi-Step Pruning Policy (MSPP, Information Sciences 2024) introduces parallel pruning policies that diversify rollouts and lift sample efficiency. We give a convergence analysis and a corresponding policy gradient theorem.

  • Erlang Planning Network (Pattern Recognition 2022) is an iterative model-based framework that views planning from multiple perspectives, treating planning depth as an adaptive resource.

What ties these together is the view of a world model not as a passive simulator, but as a structured space that policies can search inside.

Multi-Agent RL

Scalable, reliable cooperation across many agents — from card games to traffic networks.

Real coordination problems rarely fit neatly into the single-agent frame. My work in multi-agent RL focuses on action-space design and scalability — making sure that as the agent count grows, both training and inference remain feasible.

  • Scalable and Reliable MARL for Traffic Assignment (Communications in Transportation Research 2025, IF 14.5) targets city-scale traffic networks, with an action-space formulation that handles fleets of agents without combinatorial blow-up.

  • During my time at InspirAI (2022–2023), I built a general-purpose card-game AI framework deployed across Sanguosha, Hearthstone, Landlord (Dou Dizhu), and GuanDan. The Landlord agent reached super-human level, defeating top-ranked professional players in head-to-head matches; the GuanDan deployment delivered a +6% win rate against the previous baseline.

  • At Baidu (2021), I proposed and shipped EDA-MAPPO (Expert-Data-Assisted MAPPO), pushing a multi-agent policy directly into a client production environment.
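The scalability concern above is concrete: a joint action space over all agents grows exponentially with agent count, while a factored per-agent formulation grows linearly. A quick back-of-the-envelope comparison (counts are illustrative, not from the traffic-assignment paper):

```python
# Joint vs. factored action-space sizes as the fleet grows.
n_actions_per_agent = 5

for n_agents in (2, 10, 50):
    joint = n_actions_per_agent ** n_agents     # one flat action over all agents
    factored = n_actions_per_agent * n_agents   # one head per agent
    print(f"{n_agents:2d} agents: joint={joint:.2e}  factored={factored}")
```

At 50 agents the joint space is astronomically large while the factored one stays in the hundreds — the basic reason city-scale fleets demand a per-agent action formulation.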

The throughline: multi-agent RL only matters if it survives contact with deployment.

Real-System RL

Stable, sample-efficient RL for production — adaptive streaming, networking, control.

Many RL methods that “work” in papers fail to hold up under the constraints of a real system — non-stationarity, partial observability, hard latency budgets, safety boundaries. My applied work focuses on RL that is deployable, not merely demonstrable.

  • Adaptive video streaming (IEEE TMM 2026) — plasticity-aware policies for QoE shifts. The deployed system has to keep performing as networks and content evolve.

  • UAV communications and networking (IEEE Communications Surveys & Tutorials 2025, IF 42.8) — a survey of how DRL fundamentals apply to a domain with strict reliability and latency constraints.

  • Speed servo control (Algorithms 2018) — one of the early works applying deep RL to speed servo systems. Cited 58×.

What I’ve taken from each is the same lesson: stability beats peak performance.

Through-line

Stability beats peak performance. A system that is marginally better but never explodes is the one that ships — and the one that keeps shipping a year later.