I am currently a researcher at Shanghai AI Labotory. My research interests include Embodied AI, Computer Vision, Robotic Manipulation and Autonomous Driving.

I received my PhD degree from Robotics Institute, Shanghai Jiao Tong University, supervised by Prof. Honghai Liu, and obtained my bachelor’s degree from Central South University. I am currently working with Jiangmiao Pang on Embodied AI. Over the preceding period, I have worked with Prof. Hongyang Li and Prof. Yu Qiao at Shanghai AI Labotory.

I am an interdisciplinary lifelong learner, with an academic journey that has evolved from bio-mechatronics, computer vision, and autonomous driving to embodied AI. I am currently dedicated to advancing Embodied AGI, with an emphasis on generalizable robotic manipulation.

📝 Selected Publications

🤖 Embodied AI * indicates equal contribution

RSS 2024

Learning Manipulation by Predicting Interaction

Jia Zeng$^\ast$, Qingwen Bu$^\ast$, Bangjun Wang$^\ast$, Wenke Xia$^\ast$, Li Chen, Hao Dong, H.Song, D.Wang, D.Hu, P.Luo, H.Cui, B.Zhao, X.Li, Y.Qiao, Hongyang Li

We propose a representation learning framework towards robotic manipulation that learns Manipulation by Predicting Interaction (MPI).
RSS 2024 | Project Page | Code

NeurIPS 2024

Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation

Qingwen Bu$^\ast$, Jia Zeng$^\ast$, Chen Li$^\ast$, Yanchao Yang, Guyue Zhou, Junchi Yan, Ping Luo, Heming Cui, D.Hu, Yi Ma, Hongyang Li

We propose CLOVER, which employs a text-conditioned video diffusion model for generating visual plans as reference inputs, then leverages these sub-goals to guide the feedback-driven policy to generate actions with an error measurement strategy.
NeurIPS 2024 | Code

ICLR 2025

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Yang Tian$^\ast$, Sizhe Yang$^\ast$, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang

We propose an end-to-end model, Seer, which employs an inverse dynamics model based on the robot’s predicted visual states to forecast actions. We term such architectural framework the Predictive Inverse Dynamics Model (PIDM).
ICLR 2025 | Project Page | Code

RSS 2025

Novel Demonstration Generation with Gaussian Splatting Enables Robust One-Shot Manipulation

Sizhe Yang$^\ast$, Wenye Yu$^\ast$, Jia Zeng, Jun Lv, Kerui Ren, Cewu Lu, Dahua Lin, Jiangmiao Pang

RoboSplat is framework that leverages 3D Gaussian Splatting (3DGS) to generate novel demonstrations for RGB-based policy learning. RoboSplat can generate diverse and visually realistic data across six types of generalization (object poses, object types, camera views, scene appearance, lighting conditions, and embodiments).
RSS 2025 | Project Page | Code

RSS 2025

HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit

Qingwei Ben$^\ast$, Feiyu Jia$^\ast$, Jia Zeng, Junting Dong, Dahua Lin, Jiangmiao Pang

we propose HOMIE, a novel humanoid teleoperation cockpit integrates a humanoid loco-manipulation policy and a low-cost exoskeleton-based hardware system.
RSS 2025 | Project Page | Code

RSS 2025

Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation

Yuyin Yang$^\ast$, Zetao Cai$^\ast$, Yang Tian, Jia Zeng, Jiangmiao Pang

PPI integrates the prediction of target gripper poses and object pointflow with the continuous actions estimation, providing enhanced spatial localization and satisfying flexibility in handling movement restrictions.
RSS 2025 | Project Page | Code

🚗 Autonomous Driving

CVPR 2023

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

Jia Zeng, Li Chen, Hanming Deng, Lewei Lu, Junchi Yan, Yu Qiao, Hongyang Li

We apply knowledge distillation to camera-only 3D object detection, investigate how to distill focal knowledge when confronted with an imperfect 3D object detector teacher.
CVPR 2023

Preprint

Geometric-aware Pretraining for Vision-centric 3D Object Detection

Linyan Huang, Huijie Wang, Jia Zeng, et al.

We propose a geometric-aware pretraining method called GAPretrain, which distills geometric-rich information from LiDAR modality into camera-based 3D object detectors.
arXiv

IEEE T-PAMI 2023

Delving Into the Devils of Bird’s-Eye-View Perception: A Review, Evaluation and Recipe

Hongyang Li$^\ast$, Chonghao Sima$^\ast$, Jifeng Dai$^\ast$, Wenhai Wang$^\ast$, Lewei Lu$^\ast$, Huijie Wang$^\ast$, Jia Zeng$^\ast$, Zhiqi Li$^\ast$, et al.

we conduct a thorough review on Bird’s-Eye-View (BEV) perception in recent years and provide a practical recipe according to our analysis in BEV design pipeline.
IEEE T-PAMI | Github

SCIENTIA SINICA Informationis

Open-sourced data ecosystem in autonomous driving: the present and future

Hongyang Li$^\ast$, Yang Li$^\ast$, Huijie Wang$^\ast$, Jia Zeng$^\ast$, Huilin Xu, et al.

We undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess.
SCIENTIA SINICA Informationis | arXiv