Results
We compare our object removal results with the state-of-the-art models below and show the results of our model on the COCO 2017 validation dataset.
Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object.
To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.
Training
aims to learn a mapping from AlphaCLIP's embedding space to IP-Adapter's embedding space. This is achieved by training an MLP that maps the output of AlphaCLIP to the format expected by IP-Adapter. We feed AlphaCLIP the source image together with a full mask (so that it attends to the whole image), project the resulting embedding with the MLP, and apply an MSE loss against the CLIP image embedding of the same image computed by IP-Adapter's CLIP image encoder.
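A minimal PyTorch sketch of this training step is given below. The `alpha_clip` and `ip_adapter_image_encoder` callables, the MLP architecture, and the embedding dimensions are all assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical embedding sizes; the real values depend on the chosen
# AlphaCLIP and IP-Adapter CLIP image encoder checkpoints.
ALPHA_CLIP_DIM = 768
IP_ADAPTER_DIM = 1024

# Simple MLP that maps AlphaCLIP embeddings into IP-Adapter's embedding space.
mlp = nn.Sequential(
    nn.Linear(ALPHA_CLIP_DIM, 1024),
    nn.GELU(),
    nn.Linear(1024, IP_ADAPTER_DIM),
)

optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
mse = nn.MSELoss()

def training_step(image, alpha_clip, ip_adapter_image_encoder):
    """One step: align MLP(AlphaCLIP(image, full mask)) with
    IP-Adapter's CLIP image embedding of the same image."""
    # Full (all-ones) mask so AlphaCLIP attends to the whole image.
    full_mask = torch.ones(image.shape[0], 1, image.shape[2], image.shape[3])

    with torch.no_grad():
        src_emb = alpha_clip(image, full_mask)          # AlphaCLIP embedding
        target_emb = ip_adapter_image_encoder(image)    # IP-Adapter's CLIP embedding

    loss = mse(mlp(src_emb), target_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```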
Inference
is done by feeding AlphaCLIP two pairs, (source image, mask) and (source image, 1-mask), and projecting the resulting embeddings into IP-Adapter's embedding space with the trained MLP.
Then, we pass the projected embeddings through the projection block to remove the object from the embeddings. This is done by projecting the background-focused embedding onto the foreground-focused embedding and subtracting this projection from the background-focused embedding itself.
Finally, we condition IP-Adapter's generator on the modified embedding to obtain the inpainted image.
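The sketch below summarizes this inference procedure under the same assumptions as above: `alpha_clip`, `mlp`, and `ip_adapter_generate` are placeholders for the components described in this section, and the projection block is written as removing the component of the background-focused embedding that lies along the foreground-focused direction.

```python
import torch

def clipaway_inference(image, mask, alpha_clip, mlp, ip_adapter_generate):
    """Sketch of object removal at inference time.

    `mask` marks the foreground object to remove (1 = object, 0 = background).
    """
    with torch.no_grad():
        # Foreground- and background-focused AlphaCLIP embeddings, projected
        # into IP-Adapter's embedding space by the trained MLP.
        e_fg = mlp(alpha_clip(image, mask))
        e_bg = mlp(alpha_clip(image, 1 - mask))

        # Projection block: subtract from the background-focused embedding
        # its projection onto the foreground-focused embedding.
        fg_unit = e_fg / e_fg.norm(dim=-1, keepdim=True)
        proj = (e_bg * fg_unit).sum(dim=-1, keepdim=True) * fg_unit
        e_final = e_bg - proj

    # Condition IP-Adapter's diffusion-based inpainting generator on the
    # modified embedding to obtain the object-removed image.
    return ip_adapter_generate(image, mask, image_embeds=e_final)
```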
We compare our object removal results with the state-of-the-art inpainting method SD-Inpaint.
We illustrate the effect of our projection block by showing unconditional image generations for the foreground-focused, background-focused, and projected embeddings.
Our work has been inspired by many recent works in the field. Here are some of the most relevant ones:
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
@misc{ekin2024clipaway,
title={CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models},
author={Yigit Ekin and Ahmet Burak Yildirim and Erdem Eren Caglar
and Aykut Erdem and Erkut Erdem and Aysegul Dundar},
year={2024},
eprint={2406.09368},
archivePrefix={arXiv},
primaryClass={cs.CV}
}