Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine
<p><a href="https://www.linkedin.com/in/jon-bratseth-6a6585/">Jon Bratseth</a>, Distinguished Architect, Verizon Media<br/></p>
<p><a href="https://vespa.ai/">Vespa</a>, the open source big data serving engine, includes a mode which provides personal search at scale for a fraction of the cost of alternatives. In this article, we explain streaming search and discuss how to use it.</p>
<p>Imagine you are tasked with building the next email service, a massive personal data store centered around search. How would you do it? An obvious answer is to just use a regular search engine, write all documents to a big index and simply restrict queries to match documents belonging to a single user. Although this works, it’s incredibly costly.</p>
<p>Successful personal data stores have a tendency to become massive; the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to the raw data means paying for extra disk space and for the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems need to handle billions of writes per day, so this quickly becomes the dominating cost of the entire system.</p>
<p>However, when you think about it, there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we instead just maintain a separate small index per user? This makes both index updates and queries cheaper, but it leads to a new problem: writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks or storing all data on SSDs. Both options are expensive.</p>
<p>Large-scale data stores already solve this problem for appending writes by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store? That helps, but it means we need to do our own development to put these systems together in a way that performs at scale, every time, for both queries and writes. And we still need to pay the cost of storing the indexes in addition to the raw user data.</p>
<p>Do we need indexes at all, though? It turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time. This is what we call streaming search. In most cases, subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.</p>
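<p>To make the access pattern concrete, here is a minimal conceptual sketch in Python. It is not Vespa code; the document structure and matching logic are made up purely to illustrate the difference from index-based retrieval: at query time we simply read the one user’s stored documents and match against them.</p>
<pre>
# Conceptual sketch of streaming search: scan the raw documents of a single user
# at query time instead of maintaining and consulting a global inverted index.
from typing import Dict, List

# Documents stored grouped per user, so one user's data can be read sequentially.
documents_by_user: Dict[str, List[dict]] = {
    "user123": [
        {"id": "doc1", "text": "meeting notes from tuesday"},
        {"id": "doc2", "text": "flight booking confirmation"},
    ],
}

def streaming_search(user: str, query: str) -> List[dict]:
    """Return the user's documents containing all query terms; no index involved."""
    terms = query.lower().split()
    hits = []
    for doc in documents_by_user.get(user, []):  # only this user's data is touched
        if all(term in doc["text"] for term in terms):
            hits.append(doc)
    return hits

print(streaming_search("user123", "flight booking"))  # [{'id': 'doc2', ...}]
</pre>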
<p><b>Numbers</b></p>
<p>How many documents can be searched per node per second with this solution? Assuming a node with a read speed of 500 MB/sec (either from an SSD or from multiple spinning disks) and an average compressed document size of 1 kB, the disk can scan at most 500 MB/sec / 1 kB/doc = 500,000 docs/sec. If each user stores 1,000 documents on average, this gives a max throughput of 500 queries/second per node. This is not an exact computation, since we disregard the time used to seek and write and the inefficiency of reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective.</p>
<p>What about latency? From the calculation above we see that the latency of finding the matching documents will be 2 ms on average. However, we usually care more about the 99th percentile latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a trade-off between the latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node, such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence the latency for the very largest users. For example, with 20 nodes in total per cluster, we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.</p>
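<p>The arithmetic above can be written out as a short calculation. The numbers are the same assumptions as in the text (roughly 500 MB/sec read speed, 1 kB documents, 1,000 documents per average user, at most 50,000 documents per user per node, 20 nodes per cluster); they are estimates, not measurements:</p>
<pre>
# Back-of-envelope numbers for streaming search, using the assumptions above.
read_speed = 500_000_000   # bytes/sec sequential read per node (~500 MB/sec)
doc_size = 1_000           # bytes per compressed document (~1 kB)

docs_scanned_per_sec = read_speed / doc_size             # 500,000 docs/sec per node
qps_per_node = docs_scanned_per_sec / 1_000              # 500 queries/sec with 1,000 docs/user
avg_latency_ms = 1_000 / docs_scanned_per_sec * 1000     # 2 ms to scan an average user

# Tail latency is bounded by how much of one user's data may sit on a single node:
max_docs_per_user_per_node = 50_000
max_latency_ms = max_docs_per_user_per_node / docs_scanned_per_sec * 1000  # 100 ms

# With 20 nodes streaming in parallel, the largest user served within that latency:
max_docs_per_user = 20 * max_docs_per_user_per_node      # 1,000,000 documents

print(docs_scanned_per_sec, qps_per_node, avg_latency_ms, max_latency_ms, max_docs_per_user)
</pre>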
<p><b>Streaming search</b></p>
<p>Alright, we now have a cost-effective solution for implementing the next email provider: store just the raw data of users in a multilevel log store. Locate the data of each user on a single node in the system for locality (or on 2–3 nodes for redundancy), but split it over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be inexpensive and efficient, but it sounds like a lot of work! It would be great if somebody had already done all of this, run it at scale for years, and then released it as open source.</p>
<p>Well, as luck would have it, we already did this in <a href="https://vespa.ai/">Vespa</a>. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides exactly this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa, or to more complicated sharding solutions in Elasticsearch, for personal search applications, we typically see about an order of magnitude reduction in the cost of a system which can sustain the query and update rates the application needs with stable latencies over long periods of time. It has been used to implement applications such as storing and searching massive amounts of emails, personal typeahead suggestions, personal image collections, and private forum group content.</p>
<p><b>Streaming search on Vespa</b></p>
<p>The steps to using streaming search on <a href="https://vespa.ai/">Vespa</a> are:</p>
<ul><li>Set <a href="https://docs.vespa.ai/documentation/reference/services-content.html#document">streaming mode</a> for the document type(s) in question in services.xml.</li><li>Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the <a href="https://docs.vespa.ai/documentation/reference/services-content.html#document">document id</a>, for example id:mynamespace:mydocumenttype:g=user123:doc123.</li><li>Pass the group id in queries by setting the query property <a href="https://docs.vespa.ai/documentation/reference/search-api-reference.html#streaming.groupname">streaming.groupname</a>.</li></ul>
<p>That’s it! A minimal sketch of what these steps can look like in practice is included at the end of this post.</p>
<p>By following the above steps, you’ll have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any available alternative, with full support for <a href="https://docs.vespa.ai/documentation/query-language.html">structured and text search</a>, <a href="https://docs.vespa.ai/documentation/ranking.html">advanced relevance including natural language and machine-learned models</a>, and powerful <a href="https://docs.vespa.ai/documentation/grouping.html">grouping and aggregation</a> for features like faceting. For more details, see the <a href="https://docs.vespa.ai/documentation/streaming-search.html">documentation on streaming search</a>.</p>
<p>Have fun using Vespa, and let us know (<a href="https://twitter.com/vespaengine">tweet</a> or <a href="mailto:info@vespa.ai">email</a>) what you’re building and any features you’d like to see.</p>
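<p><b>The three steps in practice</b></p>
<p>As a minimal sketch of the three steps above: the mode="streaming" setting, the document id format, and the streaming.groupname query property come from the documentation linked above, while the namespace, document type, field names, endpoint, and the use of Python with the requests library against Vespa’s HTTP document and search APIs are illustrative assumptions. Check the streaming search documentation for the details that apply to your application.</p>
<pre>
# Step 1 is configuration, not code: in services.xml, the document element for
# mydocumenttype in the content cluster is given mode="streaming" instead of mode="index".
#
# Steps 2 and 3, sketched with Python and the requests library against Vespa's
# HTTP APIs. Namespace, document type, field names and endpoint are placeholders.
import requests

endpoint = "http://localhost:8080"  # your Vespa container endpoint

# Step 2: write a document whose id carries the group (here a user id), so all of
# user123's documents are stored together: id:mynamespace:mydocumenttype:g=user123:doc123
requests.post(
    f"{endpoint}/document/v1/mynamespace/mydocumenttype/group/user123/doc123",
    json={"fields": {"subject": "Hello", "body": "my first message"}},
)

# Step 3: query, passing the group id so only user123's data is streamed through.
response = requests.get(
    f"{endpoint}/search/",
    params={"query": "hello", "streaming.groupname": "user123"},
)
print(response.json())
</pre>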