AIDA.4201 Vision Language Models
Id: 042975 Credits: 3-3
Description
This course studies vision-language models (VLMs) that jointly reason over visual and textual information. Topics include vision and language representation learning, cross-modal alignment, contrastive objectives, multimodal transformers, large-scale pretraining, and instruction-tuned vision-language systems. Students will analyze and implement modern VLM architectures such as CLIP-style models, multimodal LLMs, and retrieval-augmented VLMs. The course emphasizes theoretical principles, system design tradeoffs, evaluation, and ethical considerations, and culminates in a semester-long project.
Prerequisites
AIDA.3221 Deep Learning.
Course prerequisites/corequisites are determined by the faculty and approved by the curriculum committees. Students are required to fulfill these requirements prior to enrollment. For courses offered through online or GPS delivery, students are responsible for confirming with the instructor or department that all enrollment requirements have been satisfied before registering.