Announcing Prototrain-ranker: Open Source Search and Ranking Framework
<p><a href="https://www.linkedin.com/in/hng888/">Huy Nguyen</a>, Research Engineer, Verizon Media & <a href="https://www.linkedin.com/in/ericmdodds/">Eric Dodds</a>, Research Scientist, Verizon Media</p>
<p>E-commerce fashion and furniture sites increasingly let shoppers search for content by visual similarity, a fundamentally different approach to search. We call this “Search 2.0” in homage to Andrej Karpathy’s Software 2.0 <a href="https://medium.com/@karpathy/software-2-0-a64152b37c35">essay</a>. Today we’re announcing the release of an open source ranking framework called <a href="https://github.com/yahoo/Prototrain">prototrain-ranker</a>, which you can use in your modern search projects. It is based on our research in search technology and ranking optimization.</p>
<p>We’ll describe the visual search problem, how it fits into a broader shift in search engine technology, and why we open sourced our model and machine learning framework. We invite you to use it and to work with us on improving it.</p>
<figure data-orig-width="1148" data-orig-height="806" class="tmblr-full"><img src="https://66.media.tumblr.com/7067c8f8686b1f01e55c28bf4e1de862/tumblr_inline_pr5f2zNcgk1wxhpzr_540.png" alt="image" data-orig-width="1148" data-orig-height="806"/></figure>
<p>The Search 1.0 stack is one that many engineers and search practitioners are familiar with: documents are indexed, and relevant content is surfaced at query time by matching query keywords against terms in the document collection. In contrast, Search 2.0 relies on “embeddings” rather than documents, and on k-nearest-neighbors retrieval rather than term matching, to surface relevant content. The programmer does not directly specify the map from content to embeddings; instead, the programmer specifies how this map is derived from data.</p>
<p>Think of embeddings as points in a high-dimensional space that represent pieces of content or metadata. In a Search 2.0 system, embeddings lying close to each other in this space are more “related” or “relevant” than points that are far apart.</p>
<figure data-orig-width="1674" data-orig-height="658" class="tmblr-full"><img src="https://66.media.tumblr.com/4148253191e504460b88e42646d81022/tumblr_inline_pr5f65By871wxhpzr_540.png" alt="image" data-orig-width="1674" data-orig-height="658"/></figure>
<p>Instead of parsing a query for specific terms and matching those terms against a document index, a Search 2.0 system encodes the query into the embedding space and retrieves the data associated with nearby embeddings.</p>
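<p>To make this concrete, here is a minimal sketch of embedding-based retrieval. It is illustrative only and is not prototrain-ranker’s actual API; the <code>embed</code> function here is a random stand-in for a trained encoder network. With unit-length embeddings, ranking a query against the whole corpus reduces to one matrix multiplication followed by a top-k selection.</p>
<pre><code>import numpy as np

def embed(items):
    """Stand-in encoder: maps items to unit-length embedding vectors.
    In a real Search 2.0 system this would be a trained deep network."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(items), 128))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

corpus = ["red bird", "blue sofa", "black dress"]
index = embed(corpus)                    # (num_items, dim), built offline

query_vec = embed(["small red bird"])    # (1, dim), computed at query time
scores = index @ query_vec.T             # cosine similarity via matrix multiply
top_k = np.argsort(-scores.ravel())[:2]  # indices of the 2 nearest neighbors
print([corpus[i] for i in top_k])
</code></pre>
<p>At scale, the exhaustive top-k scan is typically replaced with an approximate nearest-neighbor index, but the embed-then-retrieve shape of the system stays the same.</p>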
<p>Prototrain-ranker provides two things: (1) a “ranker” model for mapping content to embeddings and performing search, and (2) our “prototrain” framework for training prototype machine learning models such as the ranker at the heart of this Search 2.0 system.</p>
<p><b>Why Search 2.0</b></p>
<p>Whether we’re searching over videos, images, text, or other media, we can represent each type of data as an embedding using appropriate deep learning techniques. Representing data as embeddings in a high-dimensional space opens up the world of search to the powerful machinery of deep learning tools.</p>
<p>We can learn ranking functions and directly encode “relevance” into embeddings, avoiding the need for brittle, hand-engineered ranking functions. For example, it would be error-prone and tedious to program a Search 1.0 engine to respond to queries like “images with a red bird in the upper right-hand corner”. One could certainly build a specific classifier for each attribute (color, object, location) and index the results, but every classifier and parsing rule would take work to build and test, and each new attribute would bring more work and more opportunities for errors and brittleness. Instead, one could build a Search 2.0 system by collecting pairs of images and descriptions that directly capture one’s notion of “relevance” and training an end-to-end ranking model on them.</p>
<p>The flexibility of this approach – defining relevance as an abstract distance using examples rather than potentially brittle rules – enables several other capabilities in a straightforward manner, including multi-modal input (e.g. text with an image), interpolating between queries (“something between this sofa and that one”), and conditioning a query (“a dress like this, but in any color”).</p>
<p>Reframing search as nearest-neighbor retrieval has further benefits. It separates the process of ranking from the process of storing data, and it reduces the rules and logic of Search 1.0 matching and ranking to a portable matrix multiplication routine. This makes the search engine massively parallel and lets it take advantage of GPU hardware, which has been optimized for decades to execute matrix multiplication efficiently.</p>
<p><b>Why we open sourced prototrain-ranker</b></p>
<p>The code we open source today enables a key component of the Search 2.0 system: it allows one to “learn” embeddings by defining pairs of relevant and irrelevant data items. As an example, we provide the processing necessary to train the model on the <a href="http://cvgl.stanford.edu/projects/lifted_struct/">Stanford Online Products dataset</a>, which provides multiple images of each of thousands of products. The notion of relevance here is that two images contain the same item. (A minimal sketch of this style of training appears at the end of this post.)</p>
<p>We also use the prototrain framework to train other machine learning models, such as image classifiers, and you can too. Please check out the <a href="https://github.com/yahoo/Prototrain">framework</a> and the <a href="https://github.com/yahoo/Prototrain/tree/master/models/ranker">ranker model</a>. We hope you will have questions or comments, and that you will want to contribute to the project. Engage with us via GitHub, or <a href="mailto:eric.mcvoy.dodds@verizonmedia.com">email</a> us directly if you have questions.</p>
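<p>As promised above, here is a minimal sketch of what “learning embeddings from pairs of relevant and irrelevant items” can look like: a margin-based triplet loss, the general technique behind this style of training. It is illustrative only and is not prototrain-ranker’s actual loss or training loop; see the repository for the real implementation.</p>
<pre><code>import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes the anchor embedding at least `margin`
    closer to a relevant (positive) item than to an irrelevant
    (negative) one. Minimizing this loss over many such triplets,
    with respect to the encoder's parameters, is what shapes the
    embedding space."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # distance to relevant item
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # distance to irrelevant item
    return np.maximum(0.0, d_pos - d_neg + margin)

# Toy triplet: two embeddings of the same product (anchor, positive)
# and one embedding of a different product (negative).
rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 128))
print(triplet_loss(anchor, positive, negative))
</code></pre>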