Repositories

A minimal codebase for finetuning large multimodal models
Supports LLaVA, Llama-3.2-Vision, Qwen, and other models, covering single-image, multi-image, and video inputs.
Bilingual Singing Voice Synthesis
The first bilingual SVS system for English and Chinese, published at ASRU 2023.
Hippocampal-inspired Multimodal Memory
Architecture for long audiovisual understanding with pattern separation, memory consolidation, and cross-modal retrieval.
Generating Natural Adversarial Examples with Stable Diffusion
Uses Stable Diffusion to generate natural adversarial examples. Published at ICLR 2024 Tiny Papers Track.
Speech Information Retrieval And Lookup Dataset
Evaluates speech models' ability to process and comprehend long spoken inputs.
Last edited on May 30, 2025