Research
Our research teams investigate object-centric AI and computer vision applied to fashion — building systems that perceive, understand, and generate clothing the way humans do: as structured compositions of meaningful parts.
Object-Centric Representation
Building structured, compositional world models that understand fashion as a scene of discrete, meaningful objects.
Our core research direction is object-centric representation learning: teaching neural networks to decompose complex fashion imagery into structured, interpretable object slots. Rather than treating an outfit as a single monolithic feature vector, our models learn to bind garments, accessories, and attributes such as texture and color to separate representational slots. This compositional structure enables systematic generalization: models trained on seen combinations can reason about unseen ones, a critical requirement for the long-tail complexity of fashion.
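The mechanism at the heart of this decomposition is easiest to see in code. The sketch below shows a minimal Slot Attention update in the style of Locatello et al. (2020), in which a small set of slots competes via softmax attention to explain the features of an image; the dimensions, slot count, and hyperparameters are illustrative assumptions, not our production configuration.

```python
# Minimal Slot Attention sketch (after Locatello et al., 2020).
# All sizes here are illustrative, not our production settings.
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    def __init__(self, num_slots: int, dim: int, iters: int = 3, eps: float = 1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Slots are sampled from a learned Gaussian at the start of each forward pass.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, num_features, dim), e.g. a flattened CNN/ViT feature map.
        b, n, d = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, d, device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input feature.
            attn = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
            attn = attn.softmax(dim=1) + self.eps
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean over inputs
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d),
                             slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (batch, num_slots, dim): one slot per discovered part

# e.g. decompose a 16x16 feature map into 7 garment/part slots
slots = SlotAttention(num_slots=7, dim=64)(torch.randn(2, 256, 64))
```

The softmax over slots, rather than over inputs, is what forces competition: every image feature must be claimed by some slot, which drives the slots to specialize to distinct garments and parts.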
Focus Areas
- Slot Attention and iterative routing for garment decomposition
- Unsupervised object discovery in cluttered fashion scenes
- Compositional generation and disentangled latent spaces
- Cross-image object correspondence and part matching
Fine-Grained Visual Understanding
Pushing the limits of what vision models can recognize and describe in fashion imagery.
Fine-grained recognition in fashion is exceptionally challenging due to subtle inter-class differences, extreme intra-class variation, and long-tail class distributions. Our team develops specialized architectures and training strategies, including hierarchical attention, part-aware pooling, and contrastive objectives, to discriminate reliably between visually similar garments across diverse real-world imaging conditions.
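As one concrete illustration, the sketch below combines part-aware pooling with hierarchical classification heads (category → style → detail, mirroring the focus areas below). The module name, part count, and class counts are hypothetical, chosen only to make the idea runnable.

```python
# Illustrative sketch: part-aware pooling feeding hierarchical heads.
# Part and class counts are made up for the example.
import torch
import torch.nn as nn

class PartAwareClassifier(nn.Module):
    def __init__(self, dim=256, parts=4, n_cat=50, n_style=200, n_detail=1000):
        super().__init__()
        # One attention map per part: features are pooled into part descriptors
        # instead of a single global average, preserving local garment cues.
        self.part_attn = nn.Conv2d(dim, parts, kernel_size=1)
        self.head_cat = nn.Linear(dim * parts, n_cat)
        # Coarser predictions condition the finer heads.
        self.head_style = nn.Linear(dim * parts + n_cat, n_style)
        self.head_detail = nn.Linear(dim * parts + n_style, n_detail)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, dim, H, W) from any CNN/ViT backbone.
        b, d, h, w = feats.shape
        attn = self.part_attn(feats).flatten(2).softmax(dim=-1)  # (b, parts, HW)
        pooled = torch.einsum('bph,bdh->bpd', attn, feats.flatten(2))
        x = pooled.flatten(1)                                    # (b, parts*dim)
        cat = self.head_cat(x)
        style = self.head_style(torch.cat([x, cat.softmax(-1)], dim=-1))
        detail = self.head_detail(torch.cat([x, style.softmax(-1)], dim=-1))
        return cat, style, detail

cat, style, detail = PartAwareClassifier()(torch.randn(2, 256, 14, 14))
```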
Focus Areas
- Part-aware and region-guided attention mechanisms
- Hierarchical attribute classification (category → style → detail)
- Few-shot and zero-shot recognition across fashion domains
- Robust recognition under pose, lighting, and occlusion variance
Generative Fashion AI
Controllable image synthesis grounded in structured object representations.
We research generative models that leverage object-centric representations to enable fine-grained, controllable fashion synthesis. By conditioning diffusion and flow-matching models on structured object slots rather than global embeddings, we achieve precise attribute-level editing: changing the sleeve style without altering the collar, or swapping a texture while preserving the silhouette. These are capabilities that generative models conditioned on unstructured global embeddings fundamentally lack.
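A hedged sketch of the conditioning mechanism: in a slot-conditioned denoiser, cross-attention reads its keys and values from the object slots rather than from a single global embedding, so replacing one slot changes one attribute while the rest of the conditioning signal stays fixed. The layer name, shapes, and the index of the "sleeve" slot below are illustrative assumptions, not our actual architecture.

```python
# Illustrative slot-conditioned cross-attention for a diffusion denoiser.
import torch
import torch.nn as nn

class SlotCrossAttention(nn.Module):
    def __init__(self, dim: int, slot_dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)        # queries from noisy image tokens
        self.to_k = nn.Linear(slot_dim, dim)   # keys/values from object slots
        self.to_v = nn.Linear(slot_dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, slots: torch.Tensor) -> torch.Tensor:
        # x: (b, n_tokens, dim) denoiser activations; slots: (b, n_slots, slot_dim)
        q, k, v = self.to_q(x), self.to_k(slots), self.to_v(slots)
        attn = (torch.einsum('bnd,bkd->bnk', q, k) * self.scale).softmax(dim=-1)
        return x + self.proj(torch.einsum('bnk,bkd->bnd', attn, v))  # residual

# Editing: swap the slot bound to one attribute, keep the others fixed.
x, slots = torch.randn(1, 1024, 320), torch.randn(1, 7, 64)
layer = SlotCrossAttention(dim=320, slot_dim=64)
edited = slots.clone()
edited[:, 2] = torch.randn(64)  # e.g. replace the hypothetical "sleeve" slot only
out_orig, out_edit = layer(x, slots), layer(x, edited)
```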
Focus Areas
- Slot-conditioned diffusion models for garment editing
- Virtual try-on via object-level appearance transfer
- Compositional outfit generation and style mixing
- Controllable texture and pattern synthesis
Multimodal & Language-Grounded Vision
Connecting natural language to structured visual representations of fashion.
Fashion understanding requires bridging visual and linguistic modalities. Our multimodal research focuses on grounding natural language descriptions to specific object slots, enabling applications like language-guided search, detailed product captioning, and conversational outfit recommendation. We build on large vision-language models while introducing object-centric bottlenecks that enforce compositional alignment between text and image regions.
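One way to realize such an object-centric bottleneck is a late-interaction similarity in which every text token must ground to its best-matching slot, in the spirit of FILIP- or ColBERT-style token matching. The function below is an illustrative sketch under that assumption, not our actual training objective, and all shapes are made up.

```python
# Sketch of compositional text-slot alignment: each text token is matched to
# its best slot, and the image-text score is the mean over tokens.
import torch
import torch.nn.functional as F

def slot_grounded_similarity(text_tokens: torch.Tensor,
                             slots: torch.Tensor) -> torch.Tensor:
    # text_tokens: (b, n_tok, d), slots: (b, n_slots, d)
    t = F.normalize(text_tokens, dim=-1)
    s = F.normalize(slots, dim=-1)
    sim = torch.einsum('btd,bkd->btk', t, s)    # token-to-slot cosine similarities
    # Each token ("ruffled", "collar", ...) grounds to its best-matching slot.
    return sim.max(dim=-1).values.mean(dim=-1)  # (b,) image-text score

scores = slot_grounded_similarity(torch.randn(4, 12, 256), torch.randn(4, 7, 256))
```

Because the score is built from per-token maxima over slots, a caption can only score well if each of its words is explained by some slot, which is what enforces compositional alignment between text and image regions.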
Focus Areas
- Slot-grounded vision-language pre-training
- Dense captioning of garment attributes and styling details
- Language-guided part-level image editing
- Compositional text-to-outfit retrieval
Interested in our research?
We publish our findings and release open-source tools. Follow our latest updates in the News section.