Hi there 👋

I'm Xingjian Diao, a Ph.D. candidate in Computer Science at Dartmouth College 🌲, co-advised by Prof. Soroush Vosoughi and Prof. Jiang Gui. During my Ph.D. at Dartmouth, I interned twice at Amazon, working on computer vision and robotics (Summer 2025) and on vision-language model (VLM) systems (Summer 2026), and at Samsung Research America, working on agentic memories (Spring 2026).

Previously, I completed my M.S. in Computer Science at Northwestern University 💜, advised by Prof. Nabil Alshurafa. I received my B.S. in Computer Science from the University of Pittsburgh 💙, graduating cum laude.


🔍 Research

My research focuses on multimodal learning for video, audio, and language understanding. I develop methods for multimodal reasoning, efficient multimodal learning, and generative multimodal modeling, with the goal of building scalable, generalizable multimodal models that advance multimodal question answering, video understanding, and audio–visual reasoning in complex, dynamic real-world settings. Highlights of my work include the projects pinned below.


🧑‍💻 Internship Experience

  • Amazon Science (Jun 2026 – Sept 2026)
    Applied Scientist Intern, Sunnyvale, CA
    Research on vision-language models.

  • Samsung Research America (Mar 2026 – Jun 2026)
    NLP Research Intern, Mountain View, CA
    Research on agentic memories.

  • Amazon Science (Jun 2025 – Sept 2025)
    Applied Scientist Intern, Santa Cruz, CA
    Research on computer vision and robotics.


📌 Pinned Projects

  1. SoundMind

    We introduce the Audio Logical Reasoning (ALR) dataset, consisting of 6,446 text-audio annotated samples specifically designed for complex reasoning tasks. Building on this resource, we propose Sou…

    Python · 1.1k stars · 131 forks

  2. NAACL_2025_TWM

    We introduce temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs). This plug-and-play module can be easily integrated into …

    Python · 314 stars · 30 forks