CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

1Bilkent University 2Hacettepe University 3Koç University

Abstract

Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object.

To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.

Method

Training

Training aims to learn a mapping from AlphaCLIP's embedding space to IP-Adapter's embedding space. This is achieved by training an MLP that maps the output of AlphaCLIP to the format expected by IP-Adapter. We feed AlphaCLIP the source image together with a full (all-ones) mask so that it attends to the entire image, project its output with the MLP, and apply an MSE loss against the CLIP image embedding of the same image produced by the CLIP image encoder that IP-Adapter uses.
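A minimal sketch of this training step, assuming placeholder encoders and layer sizes (the names alpha_clip, ip_adapter_clip_encoder, and the MLP dimensions are our assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

# Placeholder encoders standing in for the real (frozen) AlphaCLIP and
# IP-Adapter CLIP image encoders; both are assumptions made so the sketch runs.
class DummyEncoder(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(3 * 224 * 224, dim)

    def forward(self, image, mask=None):
        return self.proj(image.flatten(1))

alpha_clip = DummyEncoder()               # stands in for AlphaCLIP, takes (image, alpha mask)
ip_adapter_clip_encoder = DummyEncoder()  # stands in for IP-Adapter's CLIP image encoder

# The only trainable part: an MLP mapping AlphaCLIP embeddings into
# IP-Adapter's CLIP embedding space (layer sizes are placeholders).
mlp = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
mse = nn.MSELoss()

def training_step(image):
    full_mask = torch.ones_like(image[:, :1])     # all-ones alpha: focus on the whole image
    with torch.no_grad():
        alpha_emb = alpha_clip(image, full_mask)  # AlphaCLIP embedding of the full image
        target = ip_adapter_clip_encoder(image)   # target CLIP embedding used by IP-Adapter
    pred = mlp(alpha_emb)                         # project into IP-Adapter's embedding space
    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = training_step(torch.randn(2, 3, 224, 224))
```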

Inference

At inference time, we feed AlphaCLIP two pairs: the source image with the object mask, and the source image with the inverted mask (1 - mask), and project both resulting embeddings into IP-Adapter's embedding space with the trained MLP. The projected embeddings are then passed to the projection block, which removes the object by subtracting from the background-focused embedding its projection onto the foreground-focused embedding. Finally, the resulting embedding is given to IP-Adapter's generator to obtain the inpainted image.
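The projection step amounts to removing, from the background-focused embedding, its component along the foreground-focused embedding. A minimal sketch under the same assumptions as the training snippet above (alpha_clip and mlp are the placeholder modules from that sketch; the IP-Adapter diffusion call itself is not shown):

```python
import torch

def projection_block(e_bg: torch.Tensor, e_fg: torch.Tensor) -> torch.Tensor:
    """Remove the foreground direction from the background-focused embedding:
    e_out = e_bg - (<e_bg, e_fg> / ||e_fg||^2) * e_fg
    """
    coeff = (e_bg * e_fg).sum(dim=-1, keepdim=True) / (e_fg * e_fg).sum(dim=-1, keepdim=True)
    return e_bg - coeff * e_fg

def clipaway_embedding(image, mask, alpha_clip, mlp):
    # Foreground-focused: AlphaCLIP attends to the object region (mask);
    # background-focused: AlphaCLIP attends to everything else (1 - mask).
    e_fg = mlp(alpha_clip(image, mask))
    e_bg = mlp(alpha_clip(image, 1 - mask))
    return projection_block(e_bg, e_fg)

# The resulting embedding is fed to IP-Adapter's generator as the image-prompt
# condition for inpainting the masked region.
```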

Comparison with SD-Inpaint

We compare our object removal results with the state-of-the-art inpainting method SD-Inpaint.

Comparison results on the COCO 2017 validation set

Results

Below, we compare our object removal results with state-of-the-art models and show the outputs of our model on the COCO 2017 validation dataset.

Comparison results on the COCO 2017 validation set

Focused Embeddings

We illustrate the effect of our projection block by showing unconditional image generations driven by the foreground-focused, background-focused, and projected embeddings.

Effect of the projection block

Related Links

Our work has been inspired by many recent works in the field. Here are some of the most relevant ones:

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

BibTeX

@misc{ekin2024clipaway,
      title={CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models}, 
      author={Yigit Ekin and Ahmet Burak Yildirim and Erdem Eren Caglar 
        and Aykut Erdem and Erkut Erdem and Aysegul Dundar},
      year={2024},
      eprint={2406.09368},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }