Many studies have sought to integrate multi-modal data into a global feature space, in which heterogeneous data such as text, images, and videos can be accessed and processed in a uniform manner. However, integrating multi-modal data also entails a loss of information, so methods are needed that extract relevant information from the global dataset both effectively and efficiently. That is, search results should be of high quality and obtainable at a low time cost. In this project, we compare the search quality and efficiency of several search methods on a dataset that uniformly stores embedded caption-image pairs. Specifically, we used CLIP to pre-process the dataset into high-dimensional vectors. We then applied different search methods, such as Nearest Neighbors and various Faiss methods with different parameters, to text-to-image and image-to-image search. Finally, we used precision@k and NDCG as evaluation metrics. The query text or image may be selected from the dataset or arbitrarily generated. Our evaluation revealed a trade-off between search quality and efficiency. Among the methods we compared, the clustering-based Faiss index built on inner product achieved the best balance between the two.
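For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how precision@k and NDCG@k are commonly computed for a single query. It is an illustration under stated assumptions rather than the evaluation code used in this project: the function names (`precision_at_k`, `dcg_at_k`, `ndcg_at_k`) are hypothetical, relevance is assumed to be supplied as a set of relevant ids or a mapping from id to graded relevance, and the NDCG variant shown uses linear gain with a log2 rank discount, which may differ from the variant used in the experiments.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def dcg_at_k(gains, k):
    """Discounted cumulative gain: gain at 0-based rank r is divided by log2(r + 2)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(retrieved, relevance, k):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    `relevance` maps item id -> graded relevance (assumed format);
    ids missing from the mapping are treated as irrelevant (gain 0).
    """
    gains = [relevance.get(item, 0.0) for item in retrieved]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: ids returned by some search method for one query.
retrieved_ids = ["a", "b", "c", "d"]
relevance_by_id = {"a": 1.0, "c": 1.0}

p_at_2 = precision_at_k(retrieved_ids, set(relevance_by_id), 2)
ndcg_at_3 = ndcg_at_k(retrieved_ids, relevance_by_id, 3)
```

In a comparison such as the one described here, these per-query scores would typically be averaged over all queries for each search method and reported alongside the measured search time.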