
Powering Visual Search with Image Embedding


Within the world of recommendation systems, a significant challenge arises when certain images lack contextual information, especially in cold-start situations. My strategy for addressing this issue involves leveraging visual features to search for related content or products, effectively overcoming the cold-start problem, and enhancing the overall recommendation capabilities of the system.

The code and notebooks that accompany this project are available here:

The workflow for this project is the same as the one I used for the fashion recommender system in this post, but here I will focus on the embedding model (feature extraction).

This project aims to extract latent features from images using an embedding model and then use these features to provide image recommendations for a given query image.

Data Preparation

The dataset utilized in this project is identical to the one detailed here:

The Data Preparation notebook is available here:

Model Architecture

I will be using an autoencoder network with convolutional layers to obtain the latent features of a given image. An autoencoder functions as a neural network designed to replicate its input in the output. Within its structure, there's a hidden layer, denoted as h (latent features), which forms a code to represent the input. The network can be conceptualized in two components: an encoder function, where h = f(x), and a decoder, responsible for generating a reconstruction r = g(h).

To extract valuable features from the autoencoder, a common approach is to limit the dimension of the hidden layer h to be smaller than the input x. This concept is similar to dimensionality reduction techniques like PCA. However, autoencoders, equipped with nonlinear encoder functions f and nonlinear decoder functions g, have the capability to learn a more robust nonlinear generalization than PCA.

The following is the model architecture I used to obtain latent features as 512 dimensional vectors, compressed from 49152 dimensional vectors (3 x 128 x 128):

The model was trained using the PyTorch framework with a T4 GPU for 40 epochs, with a batch size of 128 and learning rate of 3e-4.
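The training setup described above can be sketched as follows. This is a minimal sketch, not the exact training script: `model` and `train_loader` are hypothetical stand-ins for the autoencoder and an image dataloader that is assumed to yield plain image tensors.

```python
import torch
from torch import nn

def train(model, train_loader, epochs=40, lr=3e-4, device="cpu"):
    """Reconstruction training as described in the post: MSE loss and a
    learning rate of 3e-4 (Adam is an assumption; the optimizer isn't named)."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for images in train_loader:
            images = images.to(device)
            recon = model(images)            # decode(encode(x))
            loss = criterion(recon, images)  # reconstruction error vs. input
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item() * images.size(0)
        print(f"epoch {epoch + 1}: train MSE {running / len(train_loader.dataset):.4f}")
```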

Evaluation Methods

In the project, I will use the following metric and exploratory techniques to evaluate the model's effectiveness in capturing essential features:

  • MSE Loss: Quantifies the reconstruction accuracy by minimizing the Mean Squared Error.

  • Latent Space Visualization: Examines the proximity of the same product categories in the latent space.

  • mAP@5 for Product Categories: Calculates the mAP@5 for each product category by retrieving the top 5 matches using k-nearest neighbors, with Euclidean distance between embeddings as the distance measure.

Furthermore, I intend to explore the model's performance by reconstructing images through the decoder network. Simultaneously, I will delve into the encoder network, visualizing the input image as it passes through its layers.

A side note on how mAP is calculated for this project: a product recommended by the system counts as a true positive if its category matches the query category. With more time and resources, other attributes such as color, style, fabric, or a combination thereof could be explored to measure precision, and a dedicated retrieval evaluation dataset could be generated as well.
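The mAP@5 computation described above can be sketched as follows. This is a brute-force kNN version with hypothetical helper names; dividing AP by k is one of several normalization conventions.

```python
import numpy as np

def average_precision_at_k(query_cat, retrieved_cats, k=5):
    """AP@k where a retrieved item counts as a true positive if its
    category matches the query's. Normalized by k (conventions vary)."""
    hits, precision_sum = 0, 0.0
    for rank, cat in enumerate(retrieved_cats[:k], start=1):
        if cat == query_cat:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / k

def map_at_5(query_embs, query_cats, index_embs, index_cats):
    """mAP@5 via brute-force k-nearest neighbors, using Euclidean
    distance between embeddings as the distance measure."""
    aps = []
    for emb, cat in zip(query_embs, query_cats):
        dists = np.linalg.norm(index_embs - emb, axis=1)
        top5 = np.argsort(dists)[:5]
        aps.append(average_precision_at_k(cat, [index_cats[i] for i in top5]))
    return float(np.mean(aps))
```

At catalog scale, an approximate-nearest-neighbor library would replace the brute-force distance computation, but the metric itself is unchanged.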

Model Experiments

I explored the following model variants to extract visual embeddings:

  • Baseline (4096-d): Convolutional AutoEncoder with an encoder-decoder structure. The encoder consists of 7 convolutional layers and 3 max-pooling layers, gradually reducing spatial dimensions to (256 x 4 x 4), which is then flattened to obtain the 4096 dimensional embedding. The decoder uses 7 transposed convolutional layers to upsample the encoded features back to the original input size.

  • +Layer (512-d): The updated architecture retains the encoder-decoder structure but introduces additional convolutional layers in both the encoder and decoder. In particular, I added 1 additional convolutional layer and 1 max-pooling layer in the encoder network, for a total of 8 convolutional and 4 max-pooling layers, yielding a higher level of feature abstraction and lower spatial dimensions of (512 x 1 x 1). On the decoder side, 1 additional transposed convolutional layer is added to handle the increased complexity of the encoded features, for a total of 8 transposed convolutional layers in the decoder network.
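The +Layer variant can be sketched along these lines. The post gives the layer counts and the (512 x 1 x 1) bottleneck but not kernel sizes or strides, so those are assumptions chosen to make the shapes work out.

```python
import torch
from torch import nn

class ConvAutoencoder512(nn.Module):
    """One plausible +Layer layout: 8 conv + 4 max-pool layers down to
    (512 x 1 x 1), and 8 transposed convs back up to (3 x 128 x 128)."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),     # 128 -> 64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 64 -> 32
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                   # 8 -> 8
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),         # 8 -> 4
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.ReLU(),         # 4 -> 2
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.ReLU(),         # 2 -> 1
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 512, 2, stride=2), nn.ReLU(),           # 1 -> 2
            nn.ConvTranspose2d(512, 512, 2, stride=2), nn.ReLU(),           # 2 -> 4
            nn.ConvTranspose2d(512, 256, 2, stride=2), nn.ReLU(),           # 4 -> 8
            nn.ConvTranspose2d(256, 256, 3, padding=1), nn.ReLU(),          # 8 -> 8
            nn.ConvTranspose2d(256, 128, 2, stride=2), nn.ReLU(),           # 8 -> 16
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),            # 16 -> 32
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),             # 32 -> 64
            nn.ConvTranspose2d(32, 3, 2, stride=2), nn.Sigmoid(),           # 64 -> 128
        )

    def embed(self, x):
        return self.encoder(x).flatten(1)  # (N, 512) visual embedding

    def forward(self, x):
        return self.decoder(self.encoder(x))
```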

Evaluation and Analysis


Model               | Index Size
Baseline (4096-d)   | 163.8 MB
+Layer (512-d)      | 25 MB


During Baseline experimentation, I created an index of 10K products selected at random. For +Layer, I created a separate index of ~12K products using a script that collects up to 1K products per category. The resulting difference in category distribution between the two indexes may affect the mAP scores.

Loss Function

training and validation loss curves for the +Layer model; loss calculated with Mean Squared Error (MSE).

The gradual decrease in both training and validation loss indicates that the model is learning and improving its performance over time. However, the fact that the loss curves have not yet converged suggests that the model may still benefit from further training epochs.

mAP for Product Categories

side by side comparison of mAP results between the +Layer model (512-d) (left) and the baseline model (4096-d) (right)

side by side comparison of class distribution in the product index, category axis is sorted alphabetically for easy comparison by category.

class distribution in the training set

Overall, the +Layer model, which produces the 512 dimensional embeddings, has a higher score at 0.53 vs the baseline model at 0.44. mAP scores across categories seem to follow a similar trend, though the +Layer model tends to score higher across categories.

From previously training an object detection model, I've noticed that greater variability in a category's visual attributes tends to require more data for a model to generalize to that category.

For this report, I'm going to look at 3 product categories that I know tend to have higher visual variability, in particular jewelry (necklaces, bracelets, etc.), coats (jackets, long coats, etc.), and shirts (long sleeve, short sleeve, sleeveless, sweaters, etc.), and compare their results to categories with less variability and a similar or lower number of class instances in the training set, such as pants, hats, and sunglasses.

Looking at the numbers, it's apparent that a category such as sunglasses scores higher in mAP even though it has a lower number of instances in the training and index sets. The same can be seen with pants: despite a similar number of instances to shirts, pants still tend to score higher in mAP.

Latent Space Visualization

The UMAP visualizations below illustrate the embeddings of items within the index set in a three-dimensional embedding space. These embeddings are derived from the +Layer model, initially residing in a 512-dimensional space. The visualization is presented using TensorBoard with n_neighbors=15 and min_dist=0.1.

Higher values of n_neighbors tend to reflect more of the global structure of the data, while higher values of min_dist prevent points from packing tightly together, placing less emphasis on local cluster structure.
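One lightweight way to get the embeddings into TensorBoard's Embedding Projector is to export plain TSV files, which the projector loads directly (it can then run UMAP in-browser, with parameters such as the number of neighbors set in the UI). The function name and file names here are just conventions, not from the original code.

```python
import numpy as np

def export_for_projector(embeddings, labels, vec_path="vectors.tsv",
                         meta_path="metadata.tsv"):
    """Write one tab-separated embedding vector per line, plus a parallel
    metadata file of category labels, for TensorBoard's Embedding Projector."""
    embeddings = np.asarray(embeddings)
    with open(vec_path, "w") as f:
        for vec in embeddings:
            f.write("\t".join(f"{v:.6f}" for v in vec) + "\n")
    with open(meta_path, "w") as f:
        f.write("\n".join(str(label) for label in labels) + "\n")
```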

Distinct clustering patterns are evident among certain product categories, showcasing well-defined groupings for certain categories. For instance, the red category representing pants exhibits strong clustering, predominantly staying within its designated group. However, the orange category denoting shirts lacks a clear, dedicated cluster, appearing scattered across the visualization and lacking coherence in its distribution within the embedding space.

Image Reconstruction and Encoder Network Exploration

Feature visualization provides insights into how the model, trained on the dataset, constructs its understanding of images across intermediate layers. I took some of the images across different categories in the dataset and compared the original image and reconstructed image to see what the model is seeing. The following results are obtained from the +Layer model:

From the image of a shoe, the model seems to generalize well to its global structure, as can be seen in the crisp line of the silhouette, but it doesn't retain finer feature details such as the gold heel and topline. It also discards the noise in the top-left corner but still retains some of the noise in the top right of the image.

The model doesn't seem to "see" the sweater as well: the reconstructed sweater is quite blurred, and the model seems more focused on capturing the noise in the bottom right of the image. Notice how the sweater is pushed to the back and blurred out, while the noise is pulled to the front and has a crisper silhouette.

With the ring above, the model captures the global structure well but hasn't quite captured finer details such as the stone; it seems to be almost learning the diamond band around the larger stone in the middle.

The model looks like it's almost capturing the finer detail, as the colored areas in purple and blue align pretty well with the detail in the original coat. The silhouette isn't as crisp as the ring or shoes above. It's also capturing the noise on the right side of the image.

Let's take a look at a couple images inside the encoder network:

I used two images to explore the details retained by the model in intermediate layers of the encoder network. The reconstructed image is also presented, using the +Layer model for this experiment. Please note that the 'viridis' colormap is applied to the visualization of intermediate layers. As the model progresses through the layers, the number of channels increases, so the visualization primarily highlights differences in shape rather than precisely capturing colors (an RGB image has only 3 channels).
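The layer-by-layer views can be produced along these lines: forward hooks capture each encoder submodule's output, and each captured tensor can then be rendered with 'viridis' (for example, averaging over channels with plt.imshow(act[0].mean(0), cmap="viridis")). This is a sketch assuming the encoder is an nn.Sequential; the original code may differ.

```python
import torch
from torch import nn

def capture_activations(encoder, image):
    """Run one image through `encoder`, recording every submodule's output
    via forward hooks. Returns the activations in layer order."""
    activations = []
    hooks = [
        module.register_forward_hook(lambda m, inp, out: activations.append(out.detach()))
        for module in encoder.children()
    ]
    with torch.no_grad():
        encoder(image.unsqueeze(0))  # add a batch dimension
    for h in hooks:
        h.remove()                   # always clean up hooks afterwards
    return activations
```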

The model performs well in capturing the dress's general features, but it tends to incorporate unwanted details, like noise from the background. This results in the dress appearing to have an off-shoulder sleeve in the reconstructed image, a feature not present in the original sleeveless dress.

It's quite difficult to see which features the model considers important through the layers; it seems to try to incorporate all the objects present, resulting in the "off-shoulder sleeve" in the reconstructed image.

The reconstructed image of the handbag captures the essential elements well, including the shape with a clear outline. However, the model struggles with finer details, such as accurately placing the clasp. While it recognizes the clasp's presence, it can't pinpoint its exact location. The noise on the bottom right is effectively blurred, but some noise remains in the top right.

In examining the intermediate layers of the encoder network, it becomes apparent that the model assigns significance to the noise in the top right corner, marked by a consistent yellow pixel transitioning to green across the layers. Surprisingly, the model doesn't seem to capture substantial features related to the bag, except for a persistent yellow pixel emerging in the middle from layer 10 onward.

Similarity Search Results

Since the model was created to recommend similar items, let's now explore some sample recommendations from the baseline vs the +Layer model based on the visual attributes they capture:
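Retrieval itself is simple once the embeddings exist: embed the query image with the encoder, then return the k nearest index entries by Euclidean distance. This is a brute-force sketch with hypothetical names; an approximate-nearest-neighbor library would replace it at catalog scale.

```python
import numpy as np

def recommend(query_embedding, index_embeddings, product_ids, k=5):
    """Return the k products whose embeddings are closest to the query,
    as (product_id, distance) pairs sorted nearest-first."""
    dists = np.linalg.norm(index_embeddings - query_embedding, axis=1)
    order = np.argsort(dists)[:k]
    return [(product_ids[i], float(dists[i])) for i in order]
```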


Future Improvements

  • Train the model for more epochs, as it hasn't fully converged.

  • Augment the training dataset for categories with lower mAP scores and those with fewer instances.

  • Explore the possibility of increasing category representation by increasing instances in the index.

  • Note the variations in class distributions between the baseline and +Layer index sets, which make it challenging to accurately compare mAP scores across specific categories. For future evaluations, ensure consistency by using the same product index for both models.

  • Depending on the business use case and objective, consider creating category-restricted indexes to improve recommendations. For example, if the purpose is to recommend items that lift a business metric (e.g., CTR, UPT), whether a recommended item's category matches the query category may matter less than the online metrics themselves.


