Moon Ye-Bin — VLM & Agentic Systems Researcher

UNDER REVIEW | privacy / agent

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

TRAP is the first benchmark for active privacy extraction in agents, and all 22 tested models leak heavily. We prove that soft-constraint defenses such as prompting or alignment can never achieve both task success and zero leakage at once, because this trade-off is a fundamental property of softmax-based models. The result motivates a system-level defense: private field isolation hashes private values before they reach the model, so the model never sees them. This blocks leakage (90%+ privacy) while keeping task accuracy near baseline.
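
The idea of hashing private values before they reach the model can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the `isolate_private_fields` helper, the token format, and the choice of a keyed hash (HMAC-SHA256) are hypothetical and are not taken from the paper.

```python
import hashlib
import hmac

# Hypothetical field names; the benchmark's actual schema is not shown here.
PRIVATE_FIELDS = {"ssn", "phone"}

def isolate_private_fields(record: dict, secret: bytes) -> tuple[dict, dict]:
    """Replace private values with keyed-hash tokens before prompting a model.

    Returns the sanitized record and a token -> value vault that stays
    outside the model (for example, in the tool-execution layer).
    """
    sanitized, vault = {}, {}
    for key, value in record.items():
        if key in PRIVATE_FIELDS:
            digest = hmac.new(secret, str(value).encode(), hashlib.sha256).hexdigest()
            token = f"<{key}:{digest[:12]}>"  # assumed token format
            vault[token] = value
            sanitized[key] = token
        else:
            sanitized[key] = value
    return sanitized, vault

record = {"name": "Alex", "ssn": "123-45-6789", "phone": "555-0100", "task": "book a table"}
sanitized, vault = isolate_private_fields(record, secret=b"demo-key")

# Only the sanitized record is ever placed in the prompt.
prompt = f"Complete the task using this record: {sanitized}"
```

Because the raw values never appear in the prompt, no instruction given to the model can make it reveal them; a trusted layer outside the model would swap tokens back only where a tool call legitimately needs the value.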

arXiv

UNDER REVIEW | GUI agent

Entropy-Aware GUI Grounding: From Failure Analysis to Improved Localization

Chengxin Liu, Moon Ye-Bin, Tae-Hyun Oh
Work done with Samsung DS

UNDER REVIEW | video diffusion

Early Failure Detection and Intervention in Video Diffusion Models

Kwon Byung-Ki, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh

Text-to-video diffusion often fails silently at inference, making trial-and-error regeneration costly. A Real-time Inspection module turns latents into intermediate video previews so alignment scorers can spot failures in just 39.2 ms, and intervention fires only when failure is predicted. The pipeline first reuses a well-aligned single-frame preview as a semantic anchor, then, if needed, calls a VLM to diagnose the faulty preview and refine the prompt. This cuts time overhead by up to 2.64x compared with post-hoc regeneration.

arXiv

UNDER REVIEW | retouching / VLM / agent

RetouchLLM: Training-free Code-based Image Retouching with Vision Language Models

Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng
Work done with Huawei Noah's Ark Lab

RetouchLLM is an iterative retouching framework that uses a style-guided selection score to converge toward a target style without any training data. Its white-box, code-based design brings transparency and reproducibility while operating directly on high-resolution images. By leveraging an LLM and a VLM, it supports natural-language instructions for personalized retouching aligned with user intent.

arXiv

ICML 2026 | MLLM / rankability

Zero-shot Rankability: Revealing Latent Ordinal Structure in Multimodal Large Language Models via Language

Nam Hyeon-Woo, Moon Ye-Bin, Sohwi Lim, Kwon Byung-Ki, Tae-Hyun Oh

MLLM embeddings (e.g., Qwen-VL) recover 90% of the linear-probing upper bound for ordinal ranking, far above CLIP's 61%, thanks to attribute-conditioned embeddings and a reduced modality gap. The effect generalizes even to speaker-age ranking in audio.

Paper

WACV 2026 | video retrieval / VLM

Beyond the Highlights: Video Retrieval with Salient and Surrounding Contexts

Jaehun Bang, Moon Ye-Bin, Kyungdon Joo, Tae-Hyun Oh

Video retrieval models do well on salient content but drop sharply on surrounding-context and temporally complex queries, since video-level representations average out localized details. Our SS Datasets enable fine-grained spatio-temporal evaluation, as well as training with context-rich, temporally localized supervision.

Paper

WACV 2026 | image retrieval

Patch-wise Retrieval: A Bag of Practical Techniques for Instance-level Matching

Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, et al., Tae-Hyun Oh
Work done with Samsung Research

Decomposing images into patches lets pre-trained global-feature models do localized instance retrieval with surprisingly strong results and no task-specific fine-tuning. Product Quantization keeps patch-level indexing scalable, and LocScore measures how precisely the retrieved region aligns with the target object.
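
As a rough illustration of patch decomposition for localized matching, the sketch below splits an image into a grid, embeds each patch, and returns the best-matching patch box for a query vector. The grid split, the mean-color `toy_embed` stand-in for a pre-trained global-feature model, and the helper names are all assumptions for illustration; Product Quantization and LocScore are not implemented here.

```python
import numpy as np

def split_into_patches(image: np.ndarray, grid: int):
    """Cut an H x W x C image into grid x grid patches and return them with their boxes."""
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches, boxes = [], []
    for i in range(grid):
        for j in range(grid):
            y, x = i * ph, j * pw
            patches.append(image[y:y + ph, x:x + pw])
            boxes.append((x, y, x + pw, y + ph))  # (left, top, right, bottom)
    return patches, boxes

def toy_embed(patch: np.ndarray) -> np.ndarray:
    """Stand-in for a real global-feature model: L2-normalized mean color."""
    v = patch.reshape(-1, patch.shape[-1]).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def best_patch(query_vec: np.ndarray, image: np.ndarray, grid: int = 2):
    """Score every patch against the query and return the top box and its similarity."""
    patches, boxes = split_into_patches(image, grid)
    sims = [float(query_vec @ toy_embed(p)) for p in patches]
    k = int(np.argmax(sims))
    return boxes[k], sims[k]

# A black 4x4 image with a red object in the bottom-right quadrant.
image = np.zeros((4, 4, 3))
image[2:, 2:] = [1.0, 0.0, 0.0]
query = np.array([1.0, 0.0, 0.0])  # "red" query feature

box, score = best_patch(query, image)
```

In this toy case the best box is the bottom-right quadrant, which is the kind of localized evidence a single whole-image feature would average away.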

Project Paper

NeurIPS 2025 | model discovery / agent

Automated Model Discovery via Multi-modal & Multi-step Pipeline

Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh
Work done with Samsung SAIT

Interpreting time-series data as visual graphs with VLMs yields higher model discovery accuracy than interpreting it as text with LLMs. By leveraging VLM-based agents, the system visualizes and analyzes the given time-series data to propose model candidates, then iteratively evaluates and refines them to identify the most suitable model.

Project Paper

TMLR 2025 | evaluation / VLM

VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models

Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh

VLM's Eye Examination probes how a VLM perceives images, from primitive color and shape to semantic levels. The color exam shows VLMs are sensitive to the red spectrum, and the LLM component shapes both shape sensitivity and patch-wise semantic discrimination.

Code Paper

PR LETTERS 2025 | synthetic data / imbalance

SYNAuG: Exploiting Synthetic Data for Data Imbalance Problems

Moon Ye-Bin†, Nam Hyeon-Woo†, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh

SYNAuG augments imbalanced training data with synthetic images to balance the distribution. Despite the synthetic-to-real domain gap, this helps when at least 5 to 10 real samples are available. In long-tail recognition, fairness, and spurious correlation, it beats algorithmic approaches trained only on real data, showing the value of controlling imbalance from the data side.
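
The balancing step can be sketched as a per-class quota computation: count the real samples in each class and fill the gap to a target size with synthetic ones. The `synthetic_quota` helper and the choice of the largest class as the target are illustrative assumptions; the image generator itself is left out.

```python
from collections import Counter

def synthetic_quota(labels, target=None):
    """Return how many synthetic samples each class needs to reach the target size.

    If no target is given, the size of the largest class is used (an assumption).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {cls: max(target - n, 0) for cls, n in counts.items()}

# A toy long-tailed label list: 100 cats, 20 dogs, 5 owls.
labels = ["cat"] * 100 + ["dog"] * 20 + ["owl"] * 5
quota = synthetic_quota(labels)
print(quota)  # {'cat': 0, 'dog': 80, 'owl': 95}
```

Each quota would then be filled by a generative model prompted with the class name, after which training proceeds on the uniformized set.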

Paper arXiv

ECCV 2024 | hallucination / evaluation

BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

Moon Ye-Bin†, Nam Hyeon-Woo†, Wonseok Choi, Tae-Hyun Oh

BEAF is a benchmark with dataset and metrics for evaluating hallucination in large vision-language models. It manipulates an image by removing an object and tracks how the model's answer to the same question changes. With these change-aware metrics, we find that even high-accuracy models hallucinate heavily, driven mainly by the model's "Yes" bias.
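
A before/after comparison of this kind can be sketched as follows. The `Pair` record and the two summary rates are simplified, hypothetical stand-ins chosen for illustration; they are not the benchmark's official metrics. Each pair holds the model's answer to "Is there an X?" on the original image and on the image with X removed.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    before: str  # answer on the original image (object present)
    after: str   # answer after the queried object was removed

def change_aware_summary(pairs):
    """Summarize how answers change once the queried object is removed."""
    n = len(pairs)
    # Correct behavior: "yes" while the object is present, "no" once it is gone.
    flipped = sum(p.before == "yes" and p.after == "no" for p in pairs)
    # Hallucination signal: still "yes" even though the object was removed.
    stuck_yes = sum(p.before == "yes" and p.after == "yes" for p in pairs)
    return {"correct_flip": flipped / n, "stuck_yes": stuck_yes / n}

pairs = [Pair("yes", "no"), Pair("yes", "yes"), Pair("yes", "yes"), Pair("no", "no")]
summary = change_aware_summary(pairs)
print(summary)  # {'correct_flip': 0.25, 'stuck_yes': 0.5}
```

In this toy example the model answers "yes" correctly on three of the four original images, yet flips correctly on only one pair, which is how a high-accuracy model can still show heavy hallucination under change-aware scoring.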

Project Paper

PATTERN RECOGNITION 2024 | low-shot segmentation

ENInst: Enhancing Weakly-supervised Low-shot Instance Segmentation

Moon Ye-Bin, Dongmin Choi, Yongjin Kwon, Junsik Kim, Tae-Hyun Oh
Work done with ETRI

Guided by performance bottleneck analyses, ENInst improves two sub-tasks: pixel localization, through MRF-based instance-wise refinement, and classification accuracy, through a novel classifier composition that parameterizes classifiers with base classifiers and Gaussian random vectors. IPIU 2022 Best Paper.

Paper arXiv

ICCV 2023 | augmentation / text-driven

TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation

Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh

TextManiA augments visual features with attribute vectors from the text embedding space, making augmentation semantically meaningful rather than random noise. It works not only with aligned text-visual spaces like CLIP, but also with embeddings from independent LLMs such as GPT-2 and BERT.
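
One way to picture an attribute vector from the text embedding space is as the difference between the embeddings of an attribute-bearing prompt and a plain class prompt, added to a visual feature. The sketch below makes several assumptions for illustration: the prompt templates, the `scale` parameter, the bag-of-words `toy_text_encoder` standing in for a real text encoder, and visual and text features sharing one dimension.

```python
import numpy as np

def toy_text_encoder(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for a real text encoder: sums fixed pseudo-random word vectors."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        seed = sum(ord(c) for c in word)  # deterministic seed per word
        vec += np.random.default_rng(seed).standard_normal(dim)
    return vec

def attribute_vector(cls: str, attribute: str) -> np.ndarray:
    """Difference between an attribute prompt and a plain class prompt (assumed templates)."""
    return toy_text_encoder(f"a {attribute} {cls}") - toy_text_encoder(f"a {cls}")

def augment(visual_feature: np.ndarray, cls: str, attribute: str, scale: float = 1.0) -> np.ndarray:
    """Shift a visual feature along a text-derived attribute direction."""
    return visual_feature + scale * attribute_vector(cls, attribute)

visual_feature = np.ones(8)  # placeholder for a feature from a vision backbone
augmented = augment(visual_feature, cls="car", attribute="red", scale=0.5)
```

The shift is tied to a named attribute rather than sampled as noise, which is what makes the augmentation semantically meaningful; a real setup would replace the toy encoder with an actual text model.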

Project Paper

ECCV 2022 | scene reconstruction / HDR

HDR-Plenoxels: Self-Calibrating High Dynamic Range Radiance Fields

Kim Jun-Seong†, Kim Yu-Ji†, Moon Ye-Bin, Tae-Hyun Oh

HDR-Plenoxels learns the plenoptic function of a 3D scene from a joint understanding of 3D information, physical radiance fields, and the varying camera settings inherent in 2D LDR images.

Project Paper

ICLR 2022 | federated learning

FedPara: Low-rank Hadamard Product Parameterization for Efficient Federated Learning

Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh

FedPara re-parameterizes layer weights with a low-rank Hadamard product, reaching larger capacity at lower communication cost than original layers in federated learning. Its personalized variant, pFedPara, splits weights into global and local parameters for more robust results than competing methods.
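
A small numerical sketch of the low-rank Hadamard product shows why capacity grows while the communicated parameters shrink: the weight is composed as W = (X1 Y1^T) ⊙ (X2 Y2^T), whose rank can reach r^2 even though each factor has rank at most r. The matrix sizes, the inner rank, and the `fedpara_weight` helper name are illustrative choices, not values from the paper.

```python
import numpy as np

def fedpara_weight(x1, y1, x2, y2):
    """Compose W = (X1 @ Y1.T) * (X2 @ Y2.T); * is the element-wise (Hadamard) product."""
    return (x1 @ y1.T) * (x2 @ y2.T)

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4  # illustrative layer size and inner rank

x1, x2 = rng.standard_normal((m, r)), rng.standard_normal((m, r))
y1, y2 = rng.standard_normal((n, r)), rng.standard_normal((n, r))

w = fedpara_weight(x1, y1, x2, y2)

n_params = 2 * r * (m + n)          # values held in the four factors
n_full = m * n                      # values in the dense weight
rank_w = np.linalg.matrix_rank(w)   # bounded above by r * r

print(f"factor params: {n_params}, dense params: {n_full}, rank: {rank_w}")
```

With these sizes the factors hold 1,024 values against 4,096 for the dense layer, while the composed weight can exceed the rank-4 limit of a plain low-rank factorization with the same inner rank.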

Paper

Controlled data,
traceable model behavior.
