Ph.D. in vision-language models & agents, POSTECH, advised by Prof. Tae-Hyun Oh · prev. Research Intern at Huawei Noah's Ark Lab, London
I build controllable multimodal data so that model behavior becomes understandable and traceable — from model improvement to model evaluation. I am also interested in deploying models as autonomous agents.
SEEKING FULL-TIME POSITION FROM AUG 2026
Experience & Education
2025.01 — 2025.12
Research Intern · Huawei Noah's Ark Lab, London
Full-time, extended. Built RetouchLLM, a training-free agentic image-retouching framework using VLMs as iterative code-based editors. With Roy Miles, Ismail Elezi, and Jiankang Deng.
2022.03 — 2026.08 (exp.)
Ph.D. in Electrical Engineering · POSTECH
Thesis: Controllable Multi-modal Synthetic Data: Methods for Model Improvement and Evaluation. Advisor: Prof. Tae-Hyun Oh.
2020.03 — 2022.02
M.S. in Electrical Engineering · POSTECH (Sports AIX Program)
Thesis: Data and Annotation Efficient Image Recognition and Segmentation. Advisor: Prof. Tae-Hyun Oh.
2016.03 — 2020.02
B.S. in Electrical and Electronics Engineering · Chung-Ang University
Department Honors.
Publications
† equal contribution · full list on Google Scholar
Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh
The first benchmark for active privacy extraction in agents, where all 22 tested models leak heavily. We prove that soft-constraint defenses such as prompting or alignment can never achieve both task success and zero leakage at once, because this trade-off is a fundamental property of softmax-based models. This motivates a system-level defense: private field isolation hashes private values before they reach the model, so the model never sees them. This blocks leakage (90%+ privacy) while keeping task accuracy near baseline.
Kwon Byung-Ki, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, Tae-Hyun Oh
Text-to-video diffusion often fails silently at inference, making trial-and-error regeneration costly. A Real-time Inspection module turns latents into intermediate video previews so alignment scorers can spot failures in just 39.2 ms, and intervention fires only when failure is predicted. The pipeline first reuses a well-aligned single-frame preview as a semantic anchor, then, if needed, calls a VLM to diagnose the faulty preview and refine the prompt, yielding up to 2.64× less time overhead than post-hoc regeneration.
Moon Ye-Bin, Roy Miles, Tae-Hyun Oh, Ismail Elezi, Jiankang Deng · Work done with Huawei Noah's Ark Lab
RetouchLLM is a training-free, iterative retouching framework whose style-guided selection score steers each step toward a target style without any training data. Its white-box, code-based design brings transparency and reproducibility while operating directly on high-resolution images, and by leveraging an LLM and a VLM it supports natural-language instructions for personalized retouching aligned with user intent.
Nam Hyeon-Woo, Moon Ye-Bin, Sohwi Lim, Kwon Byung-Ki, Tae-Hyun Oh
MLLM embeddings (e.g., Qwen-VL) recover 90% of the linear-probing upper bound for ordinal ranking, far above CLIP's 61%, thanks to attribute-conditioned embeddings and a reduced modality gap. The effect generalizes even to speaker-age ranking in audio.
Video retrieval models do well on salient content but drop sharply on surrounding-context and temporally complex queries, since video-level representations average out localized details. Our SS Datasets enable fine-grained spatio-temporal evaluation and provide context-rich, temporally localized supervision for training.
Wonseok Choi, Sohwi Lim, Nam Hyeon-Woo, Moon Ye-Bin, et al., Tae-Hyun Oh · Work done with Samsung Research
Decomposing images into patches lets pre-trained global-feature models do localized instance retrieval with surprisingly strong results and no task-specific fine-tuning. Product Quantization keeps patch-level indexing scalable, and LocScore measures how precisely the retrieved region aligns with the target object.
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh · Work done with Samsung SAIT
Interpreting time-series data as visual graphs with VLMs yields higher model discovery accuracy than interpreting it as text with LLMs. By leveraging VLM-based agents, the system visualizes and analyzes the given time-series data to propose model candidates, then iteratively evaluates and refines them to identify the most suitable model.
Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh
VLM's Eye Examination probes how a VLM perceives images, from primitive color and shape to semantic levels. The color exam shows VLMs are sensitive to the red spectrum, and the LLM component shapes both shape sensitivity and patch-wise semantic discrimination.
Moon Ye-Bin†, Nam Hyeon-Woo†, Wonseok Choi, Nayeong Kim, Suha Kwak, Tae-Hyun Oh
SYNAuG augments imbalanced training data with synthetic images to balance the distribution. Despite the synthetic-to-real domain gap, this helps when at least 5 to 10 real samples are available. In long-tail recognition, fairness, and spurious correlation, it beats algorithmic approaches trained only on real data, showing the value of controlling imbalance from the data side.
Moon Ye-Bin†, Nam Hyeon-Woo†, Wonseok Choi, Tae-Hyun Oh
BEAF is a benchmark with a dataset and metrics for evaluating hallucination in large vision-language models. It manipulates an image by removing an object and tracks how the model's answer to the same question changes. With these change-aware metrics, we find that even high-accuracy models hallucinate heavily, driven mainly by the model's "Yes" bias.
Moon Ye-Bin, Dongmin Choi, Yongjin Kwon, Junsik Kim, Tae-Hyun Oh · Work done with ETRI
Guided by performance bottleneck analyses, ENInst targets two sub-tasks: MRF-based instance-wise refinement to improve pixel localization, and a novel classifier composition that parameterizes classifiers with base classifiers and Gaussian random vectors to improve classification accuracy. IPIU 2022 Best Paper.
Moon Ye-Bin, Jisoo Kim, Hongyeob Kim, Kilho Son, Tae-Hyun Oh
TextManiA augments visual features with attribute vectors from the text embedding space, making augmentation semantically meaningful rather than random noise. It works not only with aligned text-visual spaces like CLIP, but also with embeddings from independent LLMs such as GPT-2 and BERT.
Kim Jun-Seong†, Kim Yu-Ji†, Moon Ye-Bin, Tae-Hyun Oh
HDR-Plenoxels learns the plenoptic function of a 3D scene from a joint understanding of 3D information, physical radiance fields, and the varying camera settings inherent in 2D LDR images.
FedPara re-parameterizes layer weights with a low-rank Hadamard product, reaching larger capacity at lower communication cost than the original layers in federated learning. Its personalized variant, pFedPara, splits weights into global and local parameters for more robust results than competing methods.
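As a rough illustration of the low-rank Hadamard product idea behind FedPara, the sketch below composes a weight as W = (X1 Y1^T) ⊙ (X2 Y2^T). This is a minimal toy under stated assumptions, not the released implementation: the function names (`matmul_transposed`, `hadamard`, `compose_weight`, `factor_param_count`) and the hand-picked factors are hypothetical and chosen only to make the rank effect easy to check by eye.

```python
# Toy sketch of a low-rank Hadamard product parameterization (assumed form:
# W = (X1 @ Y1^T) * (X2 @ Y2^T), element-wise). All names are hypothetical.

def matmul_transposed(x, y):
    """Return x @ y^T for x of shape (m, r) and y of shape (n, r)."""
    return [[sum(a * b for a, b in zip(row_x, row_y)) for row_y in y]
            for row_x in x]


def hadamard(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[p * q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def compose_weight(x1, y1, x2, y2):
    """Compose the full weight from two low-rank factor pairs."""
    return hadamard(matmul_transposed(x1, y1), matmul_transposed(x2, y2))


def factor_param_count(m, n, r):
    """Number of scalars in the four factors X1, Y1, X2, Y2."""
    return 2 * r * (m + n)


# Hand-picked rank-2 factors for a 4x4 weight (illustrative values only).
X1 = Y1 = [[1, 0], [1, 0], [0, 1], [0, 1]]  # X1 @ Y1^T: two 2x2 blocks of ones
X2 = Y2 = [[1, 0], [0, 1], [1, 0], [0, 1]]  # X2 @ Y2^T: ones where row/col parity match

W = compose_weight(X1, Y1, X2, Y2)
for row in W:
    print(row)

# Only the factors would need to be communicated instead of the dense weight.
print(factor_param_count(256, 256, 16), "factor parameters vs", 256 * 256, "dense")
```

Each inner product here has rank 2, yet their element-wise product is the 4x4 identity, which has rank 4. That is the capacity argument in miniature: two rank-r factor pairs can express a matrix of rank up to r², while only the factors are transmitted. At this toy size the factors are not smaller than the dense matrix; the savings appear at realistic layer sizes, as the last line shows.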