Latest Blogposts

Stories and updates you can see

March 20, 2019

Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server

By Ashley Wolf, Open Source Program Manager, Verizon Media

In this episode, Gil Yehuda (Sr. Director of Open Source at Verizon Media) interviews Alan Carroll, PhD, Senior Software Engineer for Global Networking / Edge at Verizon Media. Alan discusses networking at Verizon Media and how user traffic is proxied through Apache Traffic Server. He also shares his love of model rockets.

Audio and transcript available here. You can also listen to this episode of Dash Open on iTunes or SoundCloud.

March 8, 2019

Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More

By Akshay Sarma, Principal Engineer, Verizon Media & Brian Xiao, Software Engineer, Verizon Media

This is the first of an ongoing series of blog posts sharing releases and announcements for Bullet, an open-sourced, lightweight, scalable, pluggable, multi-tenant query system. Bullet allows you to query, through its UI or API, any data flowing through a streaming system without having to store it first. The queries are injected into the running system and have minimal overhead: running hundreds of queries generally fits into the overhead of just reading the streaming data. Bullet requires running an instance of its backend on your data, and this backend runs on common stream processing frameworks (Storm and Spark Streaming are currently supported).

The data on which Bullet sits determines what it is used for. For example, our team runs an instance of Bullet on user engagement data (~1M events/sec) to let developers find their own events to validate the code that produces this data. We also use this instance to interactively explore data, throw up quick dashboards to monitor live releases, count unique users, debug issues, and more.

Since open sourcing Bullet in 2017, we've been hard at work adding many new features! We'll highlight some of them here and continue sharing update posts for future releases.

Windowing

Bullet used to operate in a request-response fashion: you would submit a query and wait for it to meet its termination conditions (usually duration) before receiving results. For short-lived queries, say a few seconds, this was fine. But as we started fielding more interactive and iterative queries, waiting even a minute for results became too cumbersome. Enter windowing! Bullet now supports time- and record-based windowing.

With time windowing, you can break up your query into chunks of time over its duration and retrieve results for each chunk. For example, you can calculate the average of a field and stream back results every second. In that example the aggregation operates on all the data since the beginning of the query, but you can also aggregate over just the windows themselves; this is often called a Tumbling window.

With record windowing, you can get the intermediate aggregation for each record that matches your query (a Sliding window), or you can do a Tumbling window on records rather than time, for example getting results back every three records. (A small, Bullet-independent sketch of these two window types appears at the end of this post.) Overlapping windows in other ways (Hopping windows) or windows that reset based on different criteria (Session windows, Cascading windows) are currently being worked on. Stay tuned!

Apache Pulsar support as a native PubSub

Bullet uses a PubSub (publish-subscribe) message queue to send queries and results between the Web Service and Backend. As with everything else in Bullet, the PubSub is pluggable: you can use your favorite PubSub by implementing a few interfaces if you don't want to use the ones we provide. Until now, we've maintained and supported a REST-based PubSub and an Apache Kafka PubSub. Now we are excited to announce support for Apache Pulsar as well! Bullet Pulsar will be useful to those who want to use Pulsar as their underlying messaging service. If you aren't familiar with Pulsar, setting up a local standalone is very simple, and by default any Pulsar topics written to will be created automatically. Setting up an instance of Bullet with Pulsar instead of REST or Kafka is just as easy. You can refer to our documentation for more details.
Plug your data into Bullet without code

While Bullet worked on any data source located in any persistence layer, you still had to implement an interface to connect your data source to the Backend and convert it into a record container format that Bullet understands. For instance, your data might be located in Kafka and be in the Avro format. If you were using Bullet on Storm, you would perhaps write a Storm Spout to read from Kafka, deserialize, and convert the Avro data into the Bullet record format. This was the only interface in Bullet that required our customers to write their own code. Not anymore! Bullet DSL is a text/configuration-based format for users to plug their data into the Bullet Backend without having to write a single line of code.

Bullet DSL abstracts away the two major components for plugging data into the Bullet Backend: a Connector piece to read from arbitrary data sources and a Converter piece to convert that data into the Bullet record container. We currently support and maintain a few of these - Kafka and Pulsar for Connectors, and Avro, Maps, and arbitrary Java POJOs for Converters. The Converters understand typed data and can even do a bit of minor ETL (Extract, Transform and Load) if you need to change your data around before feeding it into Bullet. As always, the DSL components are pluggable and you can write your own (and contribute it back!) if you need one that we don't support.

We appreciate your feedback and contributions! Explore Bullet on GitHub, use and help contribute to the project, and chat with us on Google Groups. To get started, try our Quickstarts on Spark or Storm to set up an instance of Bullet on some fake data and play around with it.
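To make the windowing terminology above concrete, here is a small, Bullet-independent Java sketch of record-based windows. It is purely illustrative: the class and its semantics are made up for this post and do not reflect Bullet's actual APIs, query format, or exact window behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: averages a stream of values over record-based windows. */
class RecordWindowAverager {
    private final int windowSize;      // e.g. 3 -> a result every three records
    private final boolean tumbling;    // true: reset after emitting; false: slide by one record
    private final List<Double> buffer = new ArrayList<>();

    RecordWindowAverager(int windowSize, boolean tumbling) {
        this.windowSize = windowSize;
        this.tumbling = tumbling;
    }

    /** Adds one record's value; returns the window average when a window completes, else null. */
    Double add(double value) {
        buffer.add(value);
        if (buffer.size() < windowSize) return null;
        double average = buffer.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        if (tumbling) {
            buffer.clear();     // Tumbling: the next window starts empty
        } else {
            buffer.remove(0);   // Sliding: drop only the oldest record
        }
        return average;
    }
}
```

The difference in one sentence: a Tumbling window emits one result per non-overlapping batch of records, while a Sliding window emits an updated result as new records arrive; Bullet's time-based windows apply the same idea to wall-clock intervals instead of record counts.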

February 28, 2019

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning

In last month's Vespa update, we mentioned Parent/Child, Large File Config Download, and a Simplified Feeding Interface. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to helpful feedback and contributions from the community, Vespa continues to grow. This month, we're excited to share the following updates:

Boolean field type

Vespa has released a boolean field type in #6644. This feature was requested by the open source community and is targeted at applications that have many boolean fields. It reduces the memory footprint of such fields to 1/8 (compared to the byte type) and hence increases query throughput and cuts latency. Learn more about choosing the field type here.

Environment variables

The Vespa Container now supports setting environment variables in services.xml. This is useful if the application uses libraries that read environment variables.

Advanced search core tuning

You can now configure index warmup, which reduces high-latency requests at startup. You can also reduce spiky memory usage when attributes grow by using resizing-amortize-count; the default has been changed to provide smoother memory usage, which uses less transient memory in growing applications. More details on search core configuration can be explored here.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to see.

February 27, 2019

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications

By Arun Gupta

Effective monitoring of applications depends on high-quality instrumentation. By measuring key metrics for your applications, you can identify performance characteristics and bottlenecks, detect failures, and plan for growth. Here are some examples of metrics that you might want about your applications:

- How much processing is being done, which could be in terms of requests, queries, transactions, records, backend calls, etc.
- How long a particular part of the code takes (i.e., latency), which could be in the form of total time spent as well as statistics like weighted average (based on sum and count), min, max, percentiles, and histograms.
- How many resources are being utilized, like memory, entries in a hashmap, length of an array, etc.

Further, you might want to know details about your service, such as:

- How many users are querying the service?
- Latency experienced by users, sliced by users' device types, countries of origin, operating system versions, etc.
- Number of errors encountered by users, sliced by types of errors.
- Sizes of responses returned to users.

At Verizon Media, we have applications and services that run at a very large scale, and metrics are critical for driving business and operational insights. We set out to find a good metrics library for our Java services that provides lots of features but performs well at scale. After evaluating available options, we realized that existing libraries did not meet our requirements:

- Support for dynamic dimensions (i.e., tags)
- Metrics need to support associative operations
- Works well in very high traffic applications
- Minimal garbage collection pressure
- Report metrics to multiple monitoring systems

As a result, we built and open sourced UltraBrew Metrics, a Java library for instrumenting very large-scale applications.

Performance

UltraBrew Metrics can operate at millions of requests per second per JVM without measurably slowing the application down. We currently use the library to instrument multiple applications at Verizon Media, including one that uses this library 20+ million times per second on a single JVM. Here are some of the techniques that allowed us to achieve our performance target:

- Minimize the need for synchronization by:
  - Using Java's Unsafe API for atomic operations.
  - Aligning data fields to L1/L2-cache line size.
  - Tracking state over 2 time intervals to prevent contention between writes and reads (a minimal sketch of this idea follows at the end of this post).
- Reduce the creation of objects, including avoiding the use of Java HashMaps.
- Writes happen on the caller thread rather than dedicated threads. This avoids the need for a buffer between threads.

Questions or Contributions

To learn more about this library, please visit our GitHub. Feel free to also tweet or email us with any questions or suggestions.

Acknowledgments

Special thanks to my colleagues who made this possible:

- Matti Oikarinen
- Mika Mannermaa
- Smruti Ranjan Sahoo
- Ilpo Ruotsalainen
- Chris Larsen
- Rosalie Bartlett
- The Monitoring Team at Verizon Media
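To make the "two time-intervals" idea above concrete, here is a minimal, hypothetical sketch. It is not UltraBrew Metrics code and glosses over details a real library must handle (for example writers that race past the interval flip); it only shows why writers and the reporter mostly touch different state.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Illustrative only: a counter with two interval slots. Application threads add to the
 * "current" slot while a periodic reporter drains and resets the finished one.
 */
class TwoIntervalCounter {
    private final AtomicLongArray slots = new AtomicLongArray(2);
    private volatile int current = 0;   // slot index application threads write to

    /** Called on the caller thread: a single lock-free atomic add, no extra objects. */
    void increment(long delta) {
        slots.addAndGet(current, delta);
    }

    /** Called by the reporting thread once per interval: flip slots, read and reset the old one. */
    long flipAndRead() {
        int finished = current;
        current = 1 - finished;              // writers now target the other slot
        return slots.getAndSet(finished, 0); // a production library must also handle late writers
    }
}
```

Because reads and writes mostly operate on different slots, application threads rarely contend with the reporter, which is one ingredient in keeping per-update cost low at tens of millions of updates per second.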

February 25, 2019

Freeze Windows and Collapsed Builds

Min Zhang, Software Dev Engineer, Verizon Media
Pranav Ravichandran, Software Dev Engineer, Verizon Media

Freeze Windows

Want to prevent your deployment jobs from running on weekends? You can now freeze your Screwdriver jobs and prevent them from running during specific time windows using the freezeWindows feature. Screwdriver will collapse all the frozen jobs inside the window into a single job and run it as soon as the window expires. The job will be run from the last commit within the window.

Screwdriver Users

The freezeWindows setting takes a cron expression, or a list of them, as its value.

Caveats:
- Unlike buildPeriodically, freezeWindows should not use hashed time, therefore the symbol H for hash is disabled.
- Combining day of week and day of month is invalid: only one of the two can be specified, and the other field should be set to ?.
- All times are in UTC.

In the following example, job1 will be frozen during the month of March, job2 will be frozen on weekends, and job3 will be frozen from 10 PM to 10 AM.

```yaml
shared:
  image: node:6

jobs:
  job1:
    freezeWindows: ['* * ? 3 *']
    requires: [~commit]
    steps:
      - build: echo "build"
  job2:
    freezeWindows: ['* * ? * 0,6,7']
    requires: [~job1]
    steps:
      - build: echo "build"
  job3:
    freezeWindows: ['* 0-10,22-23 ? * *']
    requires: [~job2]
    steps:
      - build: echo "build"
```

In the UI, jobs within the freeze window (deploy and auxiliary in this example) are shown as frozen.

Collapsed Builds

Screwdriver now supports collapsing all BLOCKED builds of the same job into a single build (the latest one). With this feature, users with concurrent builds no longer need to wait for all of them to finish in series to get the latest release out.

Screwdriver Users

To opt in to collapseBuilds, Screwdriver users can configure their screwdriver.yaml using annotations as shown below:

```yaml
jobs:
  main:
    annotations:
      screwdriver.cd/collapseBuilds: true
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
```

Cluster Admins

Cluster admins can configure whether the default behavior is collapsed or not in the queue-worker configuration.

Compatibility List

In order to use freeze windows and collapsed builds, you will need these minimum versions:
- API - v0.5.578
- Queue-worker - v2.5.2
- Buildcluster-queue-worker - v1.1.8

Contributors

Thank you to the following contributors for making this feature possible:
- minz1027
- pranavrc

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

February 21, 2019

Restrict PRs from forked repository

Dao Lam, Software Engineer, Verizon Media

Previously, any Screwdriver V4 user could start PR jobs (jobs configured to run on ~pr) by forking the repository and creating a PR against it. For many pipelines, this is not desirable behavior for security reasons, since secrets and other sensitive data might get exposed in the PR builds. Screwdriver V4 now allows users to specify whether they want to restrict forked PRs, or all PRs, using the pipeline-level annotation screwdriver.cd/restrictPR.

Example:

```yaml
annotations:
  screwdriver.cd/restrictPR: fork

shared:
  image: node:8

jobs:
  main:
    requires:
      - ~pr
      - ~commit
    steps:
      - echo: echo test
```

Cluster admins can set the default behavior for the cluster by setting the environment variable RESTRICT_PR. Explore the guide here.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.581

Contributors

Thanks to the following contributors for making this feature possible:
- d2lam
- stjohnjohnson

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

February 20, 2019

Shared “Verizon Media Case Study: Zero Trust Security With Athenz” at the OpenStack Summit in Berlin

By James Penick, Architect Director, Verizon Media

At Verizon Media, we've developed and open sourced Athenz, a platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures. Athenz addresses zero trust principles, including situations where authenticated clients require explicit authorization to be allowed to perform actions, and where authorization always needs to be limited to the least privilege required.

During the OpenStack Summit in Berlin, I discussed Athenz and its integration with OpenStack for fully automated role-based authorization and identity provisioning. We are using Athenz to bootstrap our instances deployed in both private and public clouds with service identities in the form of short-lived X.509 certificates that allow one service to securely communicate with another. Our OpenStack instances are powered by Athenz identities at scale.

To learn more about Athenz, give feedback, or contribute, please visit our Github and chat with us on Slack.

February 13, 2019

Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine

Jon Bratseth, Distinguished Architect, Verizon Media

Vespa, the open source big data serving engine, includes a mode which provides personal search at scale for a fraction of the cost of alternatives. In this article, we explain streaming search and discuss how to use it.

Imagine you are tasked with building the next email service, a massive personal data store centered around search. How would you do it? An obvious answer is to just use a regular search engine, write all documents to a big index, and simply restrict queries to match documents belonging to a single user. Although this works, it's incredibly costly. Successful personal data stores have a tendency to become massive - the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space and the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems need to handle billions of writes per day, so this quickly becomes the dominating cost of the entire system.

However, when you think about it, there's really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we instead just maintain a separate small index per user? This makes both index updates and queries cheaper, but leads to a new problem: writes will arrive randomly over all users, which means we'll need to read and write a user's index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks or storing all data on SSDs. Both options are expensive.

Large-scale data stores already solve this problem for appending writes by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store? That helps, but it means we need to do our own development to put these systems together in a way that performs at scale, every time, for both queries and writes. And we still need to pay the cost of storing the indexes in addition to the raw user data.

Do we need indexes at all, though? It turns out that we don't. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time - what we call streaming search. In most cases, subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.

Numbers

How many documents can be searched per node per second with this solution? Assuming a node with 500 MB/sec read speed (either from an SSD or multiple spinning disks) and an average compressed document size of 1 KB, the disk can search at most 500 MB/sec / 1 KB per doc = 500,000 docs/sec. If each user stores 1000 documents on average, this gives a max throughput per node of 500 queries/second.
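The arithmetic behind these figures, including the 2 ms average latency referenced just below, is simply:

```latex
\frac{500\ \text{MB/s}}{1\ \text{KB/doc}} = 500{,}000\ \text{docs/s},\qquad
\frac{500{,}000\ \text{docs/s}}{1{,}000\ \text{docs/user}} = 500\ \text{queries/s per node},\qquad
\frac{1{,}000\ \text{docs}}{500{,}000\ \text{docs/s}} = 2\ \text{ms per query}
```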
This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective.

What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99th percentile latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a trade-off between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total per cluster, we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.

Streaming search

Alright, we now have a cost-effective solution to implement the next email provider: store just the raw data of users in a log-level store. Locate the data of each user on a single node in the system for locality (or 2-3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be inexpensive and efficient, but it sounds like a lot of work! It would be great if somebody had already done all of this, run it at scale for years, and then released it as open source.

Well, as luck would have it, we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa, or to more complicated sharding solutions in Elasticsearch, for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of emails, personal typeahead suggestions, personal image collections, and private forum group content.

Streaming search on Vespa

The steps to using streaming search on Vespa are:

- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in e.g. id:mynamespace:mydocumenttype:g=user123:doc123
- Pass the group id in queries by setting the query property streaming.groupname.

That's it!
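As an illustration of the last step, here is a minimal, hypothetical Searcher that sets streaming.groupname from a custom request parameter. The class name and the "user" parameter are invented for this sketch; in practice you can also simply pass streaming.groupname directly as a query parameter.

```java
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

/** Hypothetical example: restrict a streaming search to a single user's group of documents. */
public class StreamingGroupSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        // Assume the frontend passes the authenticated user id as &user=user123
        String user = query.properties().getString("user");
        if (user != null) {
            // Only documents written with g=<user> in their id will be streamed and matched
            query.properties().set("streaming.groupname", user);
        }
        return execution.search(query);
    }
}
```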
By following the above steps, you'll have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any available alternative, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting.

For more details, see the documentation on streaming search. Have fun using Vespa, and let us know (tweet or email) what you're building and any features you'd like to see.

February 12, 2019

Serving article comments using reinforcement learning of a neural net

Don't look at the comments. When you allow users to make comments on your content pages, you face the problem that not all of them are worth showing - a difficult problem to solve, hence the saying. In this article I'll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution.

Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this, the team needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. Here I'll explain how this problem was solved for Yahoo properties by using Vespa - the open source big data serving engine. I'll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

Real-time comment serving

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open source big data serving engine, for this, as it supports both such basic serving and incorporating machine learning at serving time (which we'll get to below). By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article.

In addition, this document structure allowed less-used operations such as showing all the articles commented on by a given user. The Vespa instance used at Yahoo for this stores about a billion comments at any time, serves about 12,000 queries per second, and handles about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6-12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles on Yahoo pages have a very large number of comments - up to hundreds of thousands are not uncommon - and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution.
This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features, which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments, it would be wasteful to have to update each comment every time there is new information about the author. Instead, the author information is stored in a separate document type - one document per author - and a document reference in Vespa is used to import that author feature into each comment. This allows updating the author information once and having it automatically take effect for all comments by that author. With these features, it's possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments.

Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started, but obviously not optimal. To improve it, they needed to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc).

The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it's hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it. The model chosen here was a neural net with multiple hidden layers. The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationships exist and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid, where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, producing a new model for the next iteration of the reinforcement learning cycle. To implement this on Vespa, the team configured the neural net as the ranking function for comments.
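In equation form, the network evaluated at serving time, as defined by the rank profile shown next (where xw_plus_b(x, W, b) computes xW + b and elu is the activation), is:

```latex
\text{score} \;=\; \sum\Big(\mathrm{elu}\big(\mathrm{elu}(x\,W_0 + b_0)\,W_1 + b_1\big)\,W_{\text{out}} + b_{\text{out}}\Big)
```

Here x holds the input features, and the final sum reduces the one-dimensional output tensor to the double used as the second-phase rank score.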
This was done as a manually written ranking function over tensors in a rank profile. Here is the production configuration used:

```
rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    }
    function layer_0() {    # returns tensor(hidden[9])
        expression: elu(xw_plus_b(nn_input,
                                  get_model_weights(W_0),
                                  get_model_weights(b_0),
                                  x))
    }
    function layer_1() {    # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0,
                                  get_model_weights(W_1),
                                  get_model_weights(b_1),
                                  hidden))
    }
    # xw_plus_b returns tensor(out[1]), so sum converts to double
    function layer_out() {
        expression: sum(xw_plus_b(layer_1,
                                  get_model_weights(W_out),
                                  get_model_weights(b_out),
                                  out))
    }
    first-phase {
        expression: freshnessRank
    }
    second-phase {
        expression: layer_out
        rerank-count: 2000
    }
}
```

More recently, Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for tools saving in the ONNX format), which would also be a good option here since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavyweight process, so another approach was chosen: store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query and adds the parameter tensors to the query before it's passed to the content nodes for evaluation.
Here is the full production code needed to accomplish this serving-time operation:

```java
import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;

import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {

    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
    private static final String FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from the Vespa index */
    private final SyncSession fetchDocumentSession;

    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession =
                DocumentAccess.createDefault()
                              .createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Fetch the model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));

        // Add its tensor fields to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                    addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query));
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field,
                                                     FieldValue value,
                                                     Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field),
                                                 tensor);
        }
    }
}
```

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with "indexing: attribute | summary"):

```
document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … }
    field W_1 type tensor(hidden[9],out[9]) { … }
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … }
    field b_out type tensor(out[1]) { … }
}
```

Since updating documents is a lightweight operation, it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.

Results

Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms, since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment over the network for evaluation, and allows increasing parallelization indefinitely by adding more content nodes.
However, evaluating the neural net for all comments on outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you'll have noticed the solution to this: two-phase ranking is used, where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest-scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with standard open source technology. The team working on this plans to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

February 8, 2019

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

Online evaluation of machine-learned models (model serving) is difficult to scale to large datasets. Vespa.ai is an open source big data serving solution used to solve this problem, and it is in use today on some of the largest such systems in the world. These systems evaluate models over millions of data points per request, for hundreds of thousands of requests per second.

If you're in Warsaw on February 27th, please join Jon Bratseth (Distinguished Architect, Verizon Media) at the Big Data Technology Warsaw Summit, where he'll share "Scalable machine-learned model serving" and answer any questions. Big Data Technology Warsaw Summit is a one-day conference with technical content focused on big data analysis, scalability, storage, and search. There will be 27 presentations, and more than 500 attendees are expected. Jon's talk will explore the problem and its architectural solution, show how Vespa can be used to achieve scalable serving of TensorFlow and ONNX models, and present benchmarks comparing performance and scalability to TensorFlow Serving.

Hope to see you there!

February 8, 2019

Meta Pull Request Checks

Screwdriver now supports adding extra status checks on pull requests through Screwdriver build meta. This feature allows users to add custom checks, such as coverage results, to the Git pull request.

Note: This feature is only available for the Github plugin at the moment.

Screwdriver Users

To add a check to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:

```yaml
jobs:
  main:
    steps:
      - status: |
          meta set meta.status.findbugs '{"status":"FAILURE","message":"923 issues found. Previous count: 914 issues.","url":"http://findbugs.com"}'
          meta set meta.status.coverage '{"status":"SUCCESS","message":"Coverage is above 80%."}'
```

These commands will result in corresponding status checks on the Git pull request. For more details, see our documentation.

Compatibility List

In order to use the new meta PR checks feature, you will need these minimum versions:
- API - v0.5.559

Contributors

Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

February 4, 2019

Serving article comments using neural nets and reinforcement learning

Yahoo properties such as Yahoo Finance, Yahoo News, and Yahoo Sports allow users to comment on the articles, similar to many other apps and websites. To support this we needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. In this blog post, we'll explain how we're solving this problem for Yahoo properties by using Vespa - the open source big data serving engine. We'll start with the basics and then show how comment selection using a neural net and reinforcement learning has been implemented.

Real-time comment serving

As mentioned, we need a system that can add, find, count, and serve comments at scale in real time. Vespa allows us to do this easily by storing each comment as a separate document, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself. Vespa then allows us to issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article.

In addition, we can show all the articles of a given user and support similar less-used operations. We store about a billion comments at any time, serve about 12,000 queries per second, and handle about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, we use 32 stateless and 96 stateful nodes spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6-12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles have a very large number of comments - up to hundreds of thousands are not uncommon - and no user is going to read all of them. Therefore we need to pick the best comments to show each time someone views an article. To do this, we let Vespa find all the comments for the article, compute a score for each, and pick the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution. This allows us to execute these queries with low latency and ensures that we can handle more comments by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features, which are typically stored in the comment or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, we keep track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed.
The information about authors is also continuously changing, but since each author can write many comments, it would be wasteful to have to update each comment every time we have new information about the author. Instead, we store the author information in a separate document type - one document per author - and use a document reference in Vespa to import that author feature into each comment. This allows us to update author information once and have it automatically take effect for all comments by that author. With these features, we can configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments.

Using a neural net and reinforcement learning

We used to rank comments using a handwritten ranking expression with hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it, we need to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc).

We know that we want user interest to go up on average, but we don't know what the correct value of this measure of interest might be for any given list of comments. Therefore it's hard to create a training set of interest signals for articles (supervised learning), so we chose to use reinforcement learning instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal we use as a proxy for user interest, and use this to converge on a model that increases it. The model chosen is a neural net with multiple hidden layers. The advantage of using a neural net compared to a simple function such as linear regression is that we can capture non-linear relationships in the feature data without having to guess which relationships exist and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, we implement a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. We log the ranking information and our user interest signals such as dwell time to our Hadoop grid, where they are joined. This generates a training set each hour which we use to retrain the model using TensorFlow-on-Spark, which generates a new model for the next iteration of the reinforcement learning. To implement this on Vespa, we configure the neural net as the ranking function for comments.
This was done as a manually written ranking function over tensors in a rank profile:

```
rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    }
    function layer_0() {    # returns tensor(hidden[9])
        expression: elu(xw_plus_b(nn_input,
                                  get_model_weights(W_0),
                                  get_model_weights(b_0),
                                  x))
    }
    function layer_1() {    # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0,
                                  get_model_weights(W_1),
                                  get_model_weights(b_1),
                                  hidden))
    }
    function layer_out() {  # xw_plus_b returns tensor(out[1]), so sum converts to double
        expression: sum(xw_plus_b(layer_1,
                                  get_model_weights(W_out),
                                  get_model_weights(b_out),
                                  out))
    }
    first-phase {
        expression: freshnessRank
    }
    second-phase {
        expression: layer_out
        rerank-count: 2000
    }
}
```

More recently, Vespa added support for deploying TensorFlow SavedModels directly, which would also be a good option since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what we want our training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, to do reinforcement learning we need to be able to update them frequently. We could achieve this by redeploying the application package frequently, as Vespa allows this to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavyweight process, so we chose another approach: store the neural net parameters as tensors in a separate document type, and create a Searcher component which looks up this document on each incoming query and adds the parameter tensors to the query before it's passed to the content nodes for evaluation.
Here is the full code needed to accomplish this:

```java
import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;

import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {

    private static final String VESPA_DOCUMENTID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
    private static final String QUERY_FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from the Vespa index */
    private final SyncSession fetchDocumentSession;

    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession =
                DocumentAccess.createDefault().createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Fetch the model document from Vespa
        String documentId = String.format(VESPA_DOCUMENTID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(documentId));

        // Add its tensor fields to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                    addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query));
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(QUERY_FEATURE_FORMAT, field), tensor);
        }
    }
}
```

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net:

```
document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … }
    field W_1 type tensor(hidden[9],out[9]) { … }
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … }
    field b_out type tensor(out[1]) { … }
}
```

Since updating documents is a lightweight operation, we can now make frequent changes to the neural net to implement the reinforcement learning.

Results

Switching to the neural net model with reinforcement learning led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms, since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. We avoid the bottleneck of sending the data for each comment over the network for evaluation, and can increase parallelization indefinitely by adding more content nodes. However, evaluating the neural net for all comments on outlier articles which have hundreds of thousands of comments would still be very costly.
If you read the rank profile configuration shown above, you'll have noticed the solution to this: we use two-phase ranking, where the comments are first selected by a cheap rank function (which we term freshnessRank) and the highest-scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

We have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa and TensorFlow-on-Spark, this can be accomplished relatively easily with standard open source technology. We plan to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Acknowledgments

Thanks to Aaron Nagao, Sreekanth Ramakrishnan, Zhi Qu, Xue Wu, Kapil Thadani, Akshay Soni, Parikshit Shah, Troy Chevalier, Jon Bratseth, Lester Solbakken and Håvard Pettersen for their contributions to this work.

February 1, 2019

Vespa 7 is released!

This week we rolled the major version of Vespa over from 6 to 7. The releases we make public already run a large number of high-traffic production applications on our Vespa cloud, and the 7 versions are no exception. There are no new features in version 7, since we release all new features incrementally on minor versions. Instead, the major version change is used to mark the point where we remove legacy features marked as deprecated and change some default settings. We only do this on major version changes, as Vespa uses semantic versioning.

Before upgrading, go through the list of changes in the release notes to make sure your application and usage are ready. Upgrading can be done by following the regular live upgrade procedure.

January 31, 2019

Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine)

Nate Speidel, Software Engineer, Verizon Media

In December, I joined Michael Natkovich (Director, Software Dev Engineering, Verizon Media) at a Bay Area Hadoop meetup to share about Bullet. Created by Yahoo, Bullet is an open-source multi-tenant query system. It's lightweight, scalable, and pluggable, and allows you to query any data flowing through a streaming system without having to store it. Bullet queries look forward in time, and we use Sketch-based algorithms to efficiently support otherwise intractable Big Data aggregations like Top K, Counting Distincts, and Windowing without a storage layer.

Jon Bratseth, Distinguished Architect at Verizon Media, joined us at the meetup and presented "Big Data Serving with Vespa". Largely developed by engineers from Yahoo, Vespa is a big data processing and serving engine, available as open source on GitHub. Vespa allows you to search, organize, and evaluate machine-learned models from TensorFlow over large, evolving data sets, with latencies in the tens of milliseconds. Many of our products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms, currently employ Vespa.

To learn about future product updates from Bullet or Vespa, follow YDN on Twitter or LinkedIn.

January 29, 2019

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes

By Jithin Emmanuel, Sr. Software Dev Manager, Verizon Media On Tuesday, December 4th, I joined speakers from Spotinst, Nirmata, CloudYuga, and MayaData, at the Microservices and Cloud Native Apps Meetup in Sunnyvale. We shared how Screwdriver is used for CI/CD at Verizon Media. Created by Yahoo and open-sourced in 2016, Screwdriver is a build platform designed for continuous delivery at scale. Screwdriver supports an expanding list of source code services, execution engines, and databases since it is not tied to any specific compute platform. Moreover, it has a fully documented API and growing open source community base. The meetup also featured very interesting CI/CD presentations including these: - A Quick Overview of Intro to Kubernetes Course, by Neependra Khare, Founder, CloudYuga Neependra discussed his online course which includes some of Kubernetes’ basic concepts, architecture, the problems it solves, and the model that it uses to handle containerized deployments and scaling. Additionally, CloudYuga provides training in Docker, Kubernetes, Mesos Marathon, Container Security, GO Language, Advanced Linux Administration, and more. - Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, by Amiram Shachar, CEO & Founder, Spotinst Amiram discussed two important concepts of Kubernetes: Headroom and 2 Levels Scaling. Amiram also reviewed the different Kubernetes deployment tools, including Kubernetes Operations (Kops). Ritesh Patel, Founder and VP Products at Nirmata, demoed Spotinst and Nirmata. Nirmata provides a complete solution for Kubernetes deployment and management for cloud-based app containerization. Spotinst is workload automation software that’s focused on helping enterprises save time and costs on their cloud compute infrastructure.  - Data Agility for Stateful Workloads in Kubernetes, by Murat Karslioglu, VP Products, MayaData MayaData is focused on freeing DevOps and Kubernetes from storage constraints with OpenEBS. Murat discussed accelerating CI/CD Pipelines and DevOps, using chaos engineering and containerized storage. Murat also explored some of the open source tools available from MayaData and introduced the MayaData Agility Platform (MDAP). Murat’s presentation ended with a live demo of OpenEBS and Litmus. To learn about future meetups, follow us on Twitter at @YDN or on LinkedIn.

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes

January 29, 2019
January 28, 2019
January 28, 2019
Share

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

In last month’s Vespa update, we mentioned ONNX integration, precise transaction log pruning, grouping on maps, and improvements to streaming search performance. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to evolve. This month, we’re excited to share the following updates with you:
Parent/Child
We’ve added support for multiple levels of parent-child document references. Documents with references to parent documents can now import fields, with minimal impact on performance. This simplifies updates to parent data, as no denormalization is needed, and supports use cases with many-to-many relationships, like product search. Read more in parent-child.
File URL references in application packages
Serving nodes sometimes require data files which are so large that it doesn’t make sense for them to be stored and deployed in the application package. Such files can now be included in application packages by using a URL reference. When the application is redeployed, the files are automatically downloaded and injected into the components that depend on them.
Batch feed in Java client
The new SyncFeedClient provides a simplified API for feeding batches of data with high performance using the Java HTTP client. This is convenient when feeding from systems without full streaming support, such as Kafka and DynamoDB. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to see.

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

January 28, 2019
January 25, 2019
January 25, 2019
Share

Pipeline page redesign

Check out Screwdriver’s redesigned UI for the pipeline page! In addition to a smoother interface and easier navigation, here are some utility fixes:
Disabled jobs
We’ve changed disabled job icons to stand out more in the pipeline graph. Also, you can now:
- Hover over a disabled job in the pipeline graph to view its details (who disabled it).
- Add a reason when you disable a job from the Pipeline Options tab. This information will be displayed on the same page.
Disabled job confirmation: Disabled job reason display:
Pipeline events
The event list has now been conveniently shifted to the right sidebar! The sidebar now shows minimal data, including only a minified version of the parts of your workflow that ran, to make for quicker scanning. This change gives more space for large workflow graphs and makes for less scrolling on the page. Pull requests can be accessed by switching from the Events tab to the Pull Requests tab on the top right. Old and new pipeline page comparison: Pull requests sidebar:
Compatibility List
In order to see the new pipeline redesign, you will need these minimum versions:
- API: v0.5.551
- UI: v1.0.365
Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- tkyi
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Pipeline page redesign

January 25, 2019
January 22, 2019
January 22, 2019
Share

Moloch 1.7.0 - Notifications, Field History, and More

By Andy Wick, Chief Architect, Verizon Media & Elyse Rinne, Software Engineer, Verizon Media Since wrapping up the 2nd annual MolochON, we’ve been working on Moloch 1.7.0 - available here. Moloch is a large scale, open source, full packet capturing, indexing, and database system. We’ve been improving it with the help of our open source community. This release includes two bug fixes in capture and several new features. Here’s a list of all the changes.
Fixed corrupt file sequence numbers
When Elasticsearch was responding slowly or capture was busy, it was possible for corrupt sequence numbers to be created. This would lead to packet capture (pcap) that couldn’t be viewed and random items appearing in the sequence table. This is now fixed.
Removed 256 offline files limit
When running against offline files, capture would stop properly recording sequence numbers after the 256th file per capture run. This led to pcap that couldn’t be viewed for those files, forcing the user to restart the capture session for the next 256 files. With the new fix in place, you can now capture and store more than 256 files.
Field Intersections
We’ve added a new API endpoint and Actions menu item that allows you to export unique values and counts across multiple fields. It’s now easy to find all the HTTP hosts that a destination IP is serving. Calling this feature from the Actions menu on the UI exports the fields currently displayed (excluding the time and info columns). You can use previously saved column configs to switch between the data you want exported. See the demo video for more ideas. If packet capture is part of your job in network security, join the Moloch community, use and help contribute to the project, and chat with us on Slack. To get started, check out our README and FAQ pages on GitHub. P.S. We’re hiring security professionals, whom we lovingly call paranoids!

Moloch 1.7.0 - Notifications, Field History, and More

January 22, 2019
January 21, 2019
January 21, 2019
Share

Efficient personal search at large scale

Vespa includes a relatively unknown mode which provides personal search at massive scale for a fraction of the cost of alternatives: streaming search. In this article we explain streaming search and how to use it. Imagine you are tasked with building the next Gmail, a massive personal data store centered around search. How do you do it? An obvious answer is to just use a regular search engine, write all documents to a big index and simply restrict queries to match documents belonging to a single user. This works, but the problem is cost. Successful personal data stores have a tendency to become massive — the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space for all this data and paying for the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems like Gmail handle billions of writes per day, so this quickly becomes the dominating cost of the entire system. However, when you think about it, there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we just maintain a separate small index per user? This makes both index updates and queries cheaper, but leads to a new problem: writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks or storing all data on SSDs. Both options are expensive. Large scale data stores already solve this problem for appending writes, by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store like that? That helps, but means we need to do our own development to put these systems together in a way that performs at scale every time, for both queries and writes. And we still pay the cost of storing the indexes in addition to the raw user data. Do we need indexes at all, though? With some reflection, it turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, of course at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time — what we call streaming search. In most cases, most subsets of data (i.e., most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.
Numbers
How many documents can be searched per node per second with this solution? Assuming a node with 500 MB/sec read speed (either from an SSD or multiple spinning disks), and 1 KB average compressed document size, the disk can search at most 500 MB/sec / 1 KB/doc = 500,000 docs/sec. If each user stores 1,000 documents on average, this gives a max throughput per node of 500 queries/second.
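A quick back-of-the-envelope check of these numbers, as a minimal sketch in Python (assuming the 500 MB/sec read speed, 1 KB average document size, and 1,000 documents per user stated above):

# Back-of-the-envelope check of the streaming search numbers above.
read_speed = 500 * 10**6          # 500 MB/sec sequential read, in bytes/sec
doc_size = 1000                   # 1 KB average compressed document
docs_per_user = 1000              # average personal corpus size

docs_per_sec = read_speed / doc_size              # 500,000 docs/sec
queries_per_sec = docs_per_sec / docs_per_user    # 500 queries/sec per node
latency_ms = docs_per_user * doc_size / read_speed * 1000   # 2 ms per user

print(docs_per_sec, queries_per_sec, latency_ms)
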
This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective. What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99% latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a tradeoff between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node, such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total in a cluster, we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.
Streaming search
All right — with this we have our cost-effective solution to implement the next Gmail: Store just the raw data of users in a log-level store. Locate the data of each user on a single node in the system for locality (or, really, 2–3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be cheap and efficient, but it sounds like a lot of work! It sure would be nice if somebody already did all of it, ran it at large scale for years and then released it as open source. Well, as luck would have it, we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa or more complicated sharding solutions in Elasticsearch for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of mail, personal typeahead suggestions, personal image collections, and private forum group content.
Using streaming search on Vespa
The steps to using streaming search on Vespa are:
- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in id:mynamespace:mydocumenttype:g=user123:doc123.
- Pass the group id in queries by setting the query property streaming.groupname.
That’s it! With those steps you have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any alternative out there, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details see the documentation on streaming search, and the small sketch below. Have fun with it, and as usual let us know what you are building!
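Below is a minimal sketch of feeding and querying grouped documents over HTTP with Python and the requests library. The endpoint paths, namespace, document type, and field names are illustrative assumptions; the g=[groupid] document id form and the streaming.groupname query property come from the steps above.

import requests

VESPA = "http://localhost:8080"

# Feed a document whose id carries the group (user) in its third part:
# id:mynamespace:mydocumenttype:g=user123:doc123
doc = {"fields": {"title": "My first note", "body": "Hello, streaming search"}}
requests.post(
    f"{VESPA}/document/v1/mynamespace/mydocumenttype/group/user123/doc123",
    json=doc,
)

# Query only that user's data by passing the group id as streaming.groupname.
resp = requests.get(
    f"{VESPA}/search/",
    params={"query": "hello", "streaming.groupname": "user123"},
)
print(resp.json())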

Efficient personal search at large scale

January 21, 2019
January 17, 2019
January 17, 2019
Share

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

By Ashley Wolf, Open Source Program Manager, Verizon Media The second installment of Dash Open is ready for you to tune in! In this episode, Gil Yehuda, Sr. Director of Open Source at Verizon Media, interviews Dav Glass, Distinguished Architect of IaaS and Node.js at Verizon Media. Dav discusses how open source inspired him to start HackSI, a Hack Day for all ages, as well as robotics mentorship programs for the Southern Illinois engineering community. Listen now on iTunes or SoundCloud. Dash Open is your place for interesting conversations about open source and other technologies, from the open source program office at Verizon Media. Verizon Media is the home of many leading brands including Yahoo, Aol, Tumblr, TechCrunch, and many more. Follow us on Twitter @YDN and on LinkedIn.

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

January 17, 2019
January 10, 2019
January 10, 2019
Share

Meta PR Comments

Screwdriver now supports commenting on pull requests through Screwdriver build meta. This feature allows users to add custom data such as coverage results to the Git pull request.
Screwdriver Users
To add a comment to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:
jobs:
  main:
    steps:
      - postdeploy: |
          meta set meta.summary.coverage "Coverage increased by 15%"
          meta set meta.summary.markdown "this markdown comment is **bold** and *italic*"
These commands will result in a comment in Git that will look something like:
Cluster Admins
In order to enable meta PR comments, you’ll need to create a bot user in Git with a personal access token with the public_repo scope. In GitHub, create a new user. Follow instructions to create a personal access token, and set the scope to public_repo. Copy this token and set it as commentUserToken in your scms settings in your API config yaml. You need this headless user for commenting since GitHub requires the public_repo scope in order to comment on pull requests (https://github.community/t5/How-to-use-Git-and-GitHub/Why-does-GitHub-API-require-admin-rights-to-leave-a-comment-on-a/td-p/357). For more information about GitHub scopes, see https://developer.github.com/apps/building-oauth-apps/understanding-scopes-for-oauth-apps.
Compatibility List
In order to use the new meta PR comments feature, you will need these minimum versions:
- API: v0.5.545
Contributors
Thanks to the following people for making this feature possible:
- tkyi
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Meta PR Comments

January 10, 2019
January 3, 2019
January 3, 2019
Share

Multiple Build Cluster

Screwdriver now supports running builds across multiple build clusters. This feature allows Screwdriver to provide a native hot/hot HA solution with multiple clusters on standby. It also opens up the possibility for teams to run their builds in their own infrastructure.
Screwdriver Users
To specify a build cluster, Screwdriver users can configure their screwdriver.yamls using annotations as shown below:
jobs:
  main:
    annotations:
      screwdriver.cd/buildClusters: us-west-1
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Users can view a list of available build clusters at /v4/buildclusters. Without the annotation, Screwdriver assigns builds to a default cluster that is managed by the Screwdriver team. Users can assign their builds to run in any cluster they have access to (the default cluster or any external cluster their repo is allowed to use, as indicated by the scmOrganizations field). Contact your cluster admin if you want to onboard your own build cluster.
Cluster Admins
Screwdriver cluster admins can refer to the following issues and design doc to set up multiple build clusters properly.
- Design: https://github.com/screwdriver-cd/screwdriver/blob/master/design/build-clusters.md
- Feature issue: https://github.com/screwdriver-cd/screwdriver/issues/1319
Compatibility List
In order to use the new build clusters feature, you will need these minimum versions:
- API: v0.5.537
- Scheduler: v2.4.2
- Buildcluster-queue-worker: v1.1.3
Contributors
Thanks to the following people for making this feature possible:
- minz1027
- parthasl
- tkyi
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Multiple Build Cluster

January 3, 2019
December 27, 2018
December 27, 2018
amberwilsonla
Share

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

By Chris Larsen, Architect OpenTSDB is one of the first dedicated open source time series databases built on top of Apache HBase and the Hadoop Distributed File System. Today, we are proud to share that version 2.4.0 is now available and has many new features developed in-house and with contributions from the open source community. This release would not have been possible without support from our monitoring team, the Hadoop and HBase developers, as well as contributors from other companies like Salesforce, Alibaba, JD.com, Arista and more. Thank you to everyone who contributed to this release! A few of the exciting new features include:
Rollup and Pre-Aggregation Storage
As time series data grows, storing the original measurements becomes expensive. Particularly in the case of monitoring workflows, users rarely care about last year’s high-fidelity data. It’s more efficient to store lower-resolution “rollups” for longer periods, discarding the original high-resolution data. OpenTSDB now supports storing and querying such data so that the raw data can expire from HBase or Bigtable, and the rollups can stick around longer. Querying for long time ranges will read from the lower-resolution data, fetching fewer data points and speeding up queries. Likewise, when a user wants to query tens of thousands of time series grouped by, for example, data centers, the TSD will have to fetch and process a significant amount of data, making queries painfully slow. To improve query speed, pre-aggregated data can be stored and queried to fetch much less data at query time, while still retaining the raw data. We have an Apache Storm pipeline that computes these rollups and pre-aggregates, and we intend to open source that code in 2019. For more details, please visit http://opentsdb.net/docs/build/html/user_guide/rollups.html.
Histograms and Sketches
When monitoring or performing data analysis, users often like to explore percentiles of their measurements, such as the 99.9th percentile of website request latency, to detect issues and determine what consumers are experiencing. Popular metrics collection libraries will happily report percentiles for the data they collect. Yet while querying the original percentile data for a single time series is useful, trying to query and combine percentiles from multiple series is mathematically incorrect, leading to errant observations and problems. For example, if you want the 99.9th percentile of latency in a particular region, you can’t compute it by summing or averaging the 99.9th percentiles of the individual series. To solve this issue, we needed a complex data structure that can be combined to calculate an accurate percentile. One such structure that has existed for a long time is the bucketed histogram, where measurements are sliced into value ranges and each range maintains a count of measurements that fall into that bucket. These buckets can be sized based on the required accuracy, and the counts from multiple sources (sharing the same bucket ranges) can be combined to compute an accurate percentile. Bucketed histograms can be expensive to store for highly accurate data, as many buckets and counts are required. Additionally, many measurements don’t have to be perfectly accurate, but they should be precise. Thus another class of algorithms can be used to approximate the data via sampling and provide highly precise data within a fixed error interval.
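As a toy illustration of this point (not OpenTSDB code, with made-up bucket boundaries and counts), histograms that share bucket ranges can be merged by simply summing counts, and a percentile can then be read off the merged counts:

BUCKETS = [0, 10, 25, 50, 100, 250, 500, 1000]   # latency bucket upper bounds (ms)

host_a = [120, 300, 200, 90, 40, 15, 4, 1]        # request counts per bucket
host_b = [80, 250, 260, 150, 60, 20, 9, 2]

merged = [a + b for a, b in zip(host_a, host_b)]  # valid: counts just add up

def percentile(counts, p):
    # Return the upper bound of the bucket containing the p-th percentile.
    total = sum(counts)
    threshold = total * p
    running = 0
    for upper, count in zip(BUCKETS, counts):
        running += count
        if running >= threshold:
            return upper
    return BUCKETS[-1]

print(percentile(merged, 0.999))   # accurate (to bucket resolution) 99.9th percentile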
Data scientists at Yahoo (now part of Oath) implemented a great Java library called Data Sketches that implements stochastic streaming algorithms to reduce the amount of data stored for high-throughput services. Sketches have been a huge help for the OLAP storage system Druid (also sponsored by Oath) and Bullet, Oath’s open source real-time data query engine. The latest TSDB version supports bucketed histograms, Data Sketches, and T-Digests. Some additional features include:
- HBase Date Tiered Compaction support to improve storage efficiency.
- A new authentication plugin interface to support enterprise use cases.
- An interface to support fetching data directly from Bigtable or HBase rows using a search index such as Elasticsearch. This improves queries for small subsets of high-cardinality data, and we’re working on open sourcing our code for the ES schema.
- Greater UID cache controls and an optional LRU implementation to reduce the amount of JVM heap allocated to UID-to-string mappings.
- Configurable query size and time limits to avoid OOMing a JVM with large queries.
Try the releases on GitHub and let us know of any issues you run into by posting on GitHub issues or the OpenTSDB Forum. Your feedback is appreciated!
OpenTSDB 3.0
Additionally, we’ve started on 3.0, which is a rewrite that will support a slew of new features, including:
- Querying and analyzing data from the plethora of new time series stores.
- A fully configurable query graph that allows for complex queries OpenTSDB 1.x and 2.x couldn’t support.
- Streaming results to improve the user experience and avoid overwhelming a single query node.
- Advanced analytics, including support for time series forecasting with Yahoo’s EGADS library.
Please join us in testing out the current 3.0 code, reporting bugs, and adding features.

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

December 27, 2018
December 21, 2018
December 21, 2018
amberwilsonla
Share

Musings from the 2nd Annual MolochON

By Andy Wick, Chief Architect, Oath & Elyse Rinne, Software Engineer, Oath Last month, our Moloch team hosted the second all-day Moloch conference at our Dulles, Virginia campus. Moloch, the large-scale, full packet capturing, indexing, and database system, was developed by Andy Wick at AOL (now part of Oath) in 2011 and open-sourced in 2012. Elyse Rinne joined the Moloch team in 2016 to enhance the tool’s front-end features. The project enjoys an active community of users and contributors. Most recently, on November 1, more than 80 Moloch users and developers joined the Moloch core team to discuss the latest features, administrative capabilities, and clever uses of Moloch. Speakers from Elastic, SANS, Cox, SecureOps, and Oath presented their experiences setting up and using Moloch in a variety of security-focused scenarios. Afterwards, the participants brainstormed new project features and enhancements. We ended with a happy hour, giving everyone a chance to relax and network. Although most of the talks were not recorded due to the sensitive blue team security tactics covered in some of the presentations, we do have these presentation recordings and slides that are cleared for the public:
- Recent Changes to Moloch - Video & Slides
- Moloch Deployments at Oath - Video & Slides
- Using WISE - Video & Slides
- All Presentations (including external and 2017 MolochON presentations)
If you are a blue team security professional, consider joining the Moloch community, use and help contribute to the project, and chat with us on Slack. To get started, check out our README and FAQ pages on GitHub. P.S. We’re hiring security professionals, whom we lovingly call paranoids!

Musings from the 2nd Annual MolochON

December 21, 2018
December 14, 2018
December 14, 2018
Share

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping

Hi Vespa Community! Today we’re kicking off a blog post series of need-to-know updates on Vespa, summarizing the features and fixes detailed in GitHub issues. We welcome your contributions and feedback about any new features or improvements you’d like to see. For December, we’re excited to share the following product news:
Streaming Search Performance Improvement
Streaming Search is a solution for applications where each query only searches a small, statically determined subset of the corpus. In this case, Vespa searches without building reverse indexes, reducing storage cost and making writes more efficient. With the latest changes, the document type is used to further limit data scanning, resulting in lower latencies and higher throughput. Read more here.
ONNX Integration
ONNX is an open ecosystem for interchangeable AI models. Vespa now supports importing models in the ONNX format and transforming the models into Tensors for use in ranking. This adds to the TensorFlow import included earlier this year and allows Vespa to support many training tools. While Vespa’s strength is real-time model evaluation over large datasets, to get started using single data points, try the stateless model evaluation API. Explore this integration more in Ranking with ONNX models.
Precise Transaction Log Pruning
Vespa is built for large applications running continuous integration and deployment. This means nodes restart often for software upgrades, and node restart time matters. A common pattern is serving while restarting hosts one by one. Vespa has optimized transaction log pruning with prepareRestart, which flushes as much as possible before stopping; this is quicker than replaying the same data after restarting. This feature is on by default. Learn more in live upgrade and prepareRestart.
Grouping on Maps
Grouping is used to implement faceting. Vespa has added support for grouping using map attribute fields, creating a group for values whose keys match the specified key, or for field values referenced by the key. This support is useful for creating indirections and relations in data and is great for use cases with structured data like e-commerce. Leverage key values instead of field names to simplify the search definition. Read more in Grouping on Map Attributes.
Questions or suggestions? Send us a tweet or an email.

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping

December 14, 2018
December 6, 2018
December 6, 2018
amberwilsonla
Share

A New Chapter for Omid

By Ohad Shacham, Yonatan Gottesman, Edward Bortnikov Scalable Systems Research, Verizon/Oath Omid, an open source transaction processing platform for Big Data, was born as a research project at Yahoo (now part of Verizon), and became an Apache Incubator project in 2015. Omid complements Apache HBase, a distributed key-value store in the Apache Hadoop suite, with the capability to clip multiple operations into logically indivisible (atomic) units named transactions. This programming model has been extremely popular since the dawn of SQL databases, and has more recently become indispensable in the NoSQL world. For example, it is the centerpiece for dynamic content indexing of search and media products at Verizon, powering a web-scale content management platform since 2015. Today, we are excited to share a new chapter in Omid’s history. Thanks to its scalability, reliability, and speed, Omid has been selected as the transaction management provider for Apache Phoenix, a real-time converged OLTP and analytics platform for Hadoop. Phoenix provides a standard SQL interface to HBase key-value storage, which is much simpler and in many cases more performant than the native HBase API. With Phoenix, big data and machine learning developers get the best of all worlds: increased productivity coupled with high scalability. Phoenix is designed to scale to 10,000 query processing nodes in one instance and is expected to process hundreds of thousands or even millions of transactions per second (tps). It is widely used in the industry, including by Alibaba, Bloomberg, PubMatic, Salesforce, Sogou, and many others. We have just released a new and significantly improved version of Omid (1.0.0), the first major release since its original launch. We have extended the system with multiple functional and performance features to power a modern SQL database technology, ready for deployment on both private and public cloud platforms. A few of the significant innovations include:
Protocol re-design for low latency
The early version of Omid was designed for use in web-scale data pipeline systems, which are throughput-oriented by nature. We re-engineered Omid’s internals to now support new ultra-low-latency OLTP (online transaction processing) applications, like messaging and algo-trading. The new protocol, Omid Low Latency (Omid LL), dissipates Omid’s major architectural bottleneck. It reduces the latency of short transactions by 5 times under light load, and by 10 to 100 times under heavy load. It also scales the overall system throughput to 550,000 tps while remaining within real-time latency SLAs. Figure 1 (throughput vs. latency, for transaction sizes of 1 op and 10 ops) illustrates Omid LL scaling versus legacy Omid: the throughput scales beyond 550,000 tps while the latency remains flat (low milliseconds).
ANSI SQL support
Phoenix provides secondary indexes for SQL tables — a centerpiece tool for efficient access to data by multiple keys. The CREATE INDEX command is on-demand; it is not allowed to block already deployed applications. We added Omid support for accomplishing this without impeding concurrent database operations or sacrificing consistency. We further introduced a mechanism to avoid recursive read-your-own-writes scenarios in complex queries, like “INSERT INTO T … SELECT FROM T …” statements.
This was achieved by extending Omid’s traditional Snapshot Isolation consistency model, which provides single-read-point, single-write-point semantics, with multiple read and write points.
Performance improvements
Phoenix extensively employs stored procedures implemented as HBase filters in order to eliminate the overhead of multiple round-trips to the data store. We integrated Omid’s code within such HBase-resident procedures, allowing for a smooth integration with Phoenix and also reducing the overhead of transactional reads (for example, filtering out redundant data versions). We collaborated closely with the Phoenix developer community while working on this project, and contributed code to Phoenix that made Omid’s integration possible. We look forward to seeing Omid’s adoption through a wide range of Phoenix applications. We always welcome new developers to join the community and help push Omid forward!
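To make the Snapshot Isolation model mentioned above concrete, here is a toy, in-memory sketch in Python (this is not Omid code; the class and method names are invented for illustration): each transaction reads from a fixed snapshot and only commits if no other transaction has committed to the same keys since that snapshot was taken.

class ToySnapshotStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0

    def begin(self):
        # The read point (snapshot) is fixed when the transaction starts.
        return {"read_ts": self.clock, "writes": {}}

    def read(self, tx, key):
        # Return the newest value committed at or before the snapshot.
        for ts, value in reversed(self.versions.get(key, [])):
            if ts <= tx["read_ts"]:
                return value
        return None

    def write(self, tx, key, value):
        # Writes are buffered until commit.
        tx["writes"][key] = value

    def commit(self, tx):
        # Abort on write-write conflict: someone committed to our keys
        # after our snapshot was taken.
        for key in tx["writes"]:
            for ts, _ in self.versions.get(key, []):
                if ts > tx["read_ts"]:
                    raise RuntimeError("conflict, transaction aborted")
        self.clock += 1   # the write point is assigned at commit time
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock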

A New Chapter for Omid

December 6, 2018
November 27, 2018
November 27, 2018
Share

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

Hi Vespa Community, If you are in Seattle on November 29th, please join Jon Bratseth (Distinguished Architect, Oath) at a machine learning meetup hosted by Zillow. Jon will share a Vespa overview and answer any questions about Oath’s open source big data serving engine. Eric Ringger (Director of Machine Learning for Personalization, Zillow) will discuss some of the models used to help users find homes, including collaborative filtering, a content-based model, and deep learning. Learn more and RSVP here. Hope you can join! The Vespa Team

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

November 27, 2018
November 26, 2018
November 26, 2018
Share

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

By Ganesh Harinath, VP Engineering, AI Platform & Applications, Oath If you’re attending the upcoming Telco Data Analytics and AI Conference in San Francisco, make sure to join my keynote talk. I’ll be presenting “Building a Terabyte Scale Machine Learning Application” on November 28th at 10:10 am PST. You’ll learn about how Oath builds AI platforms at scale. My presentation will focus on our approach and experience at Oath in architecting and using frameworks to build machine learning models at terabyte scale, near real-time. I’ll also highlight Trapezium, an open source framework based on Spark, developed by Oath’s Big Data and Artificial Intelligence (BDAI) team. I hope to catch you at the conference. If you would like to connect, reach out to me. If you’re unable to attend the conference and are curious about the topics shared in my presentation, follow @YDN on Twitter and we’ll share highlights during and after the event.

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

November 26, 2018
November 19, 2018
November 19, 2018
Share

Introducing the Dash Open Podcast, sponsored by Yahoo Developer...

Introducing the Dash Open Podcast, sponsored by Yahoo Developer Network By Ashley Wolf, Principal Technical Program Manager, Oath Is open source the wave of the future, or has it seen its best days already? Which Big Data and AI trends should you be aware of and why? What is 5G and how will it impact the apps you enjoy using? You’ve got questions and we know smart people; together we’ll get answers. Introducing the Dash Open podcast, sponsored by the Yahoo Developer Network and produced by the Open Source team at Oath. Dash Open will share interesting conversations about tech and the people who spend their day working in tech. We’ll look at the state of technology through the lens of open source; keeping you up-to-date on the trends we’re seeing across the internet. Why Dash Open? Because it’s like a command line argument reminding the command to be open. What can you expect from Dash Open? Interviews with interesting people, occasional witty banter, and a catchy theme song. In the first episode, Rosalie Bartlett, Open Source community manager at Oath, interviews Gil Yehuda, Senior Director of Open Source at Oath. Tune in to hear one skeptic’s journey from resisting the open source movement to heading one of the more prolific Open Source Program Offices (OSPO). Gil highlights the benefits of open source to companies and provides actionable advice on how technology companies can start or improve their OSPO. Give Dash Open a listen and tell us what topics you’d like to hear next. – Ashley Wolf manages the Open Source Program at Oath/Verizon Media Group.

Introducing the Dash Open Podcast, sponsored by Yahoo Developer...

November 19, 2018
November 12, 2018
November 12, 2018
Share

Git Shallow Clone

Previously, Screwdriver would clone the entire commit tree of a Git repository. In most cases, this was unnecessary since most builds only require the latest single commit. For repositories containing immense commit trees, this behavior led to unnecessarily long build times. To address this issue, Screwdriver now defaults to shallow cloning Git repositories with a depth of 50. Screwdriver will also enable the --no-single-branch flag by default in order to enable access to other branches in the repository. To disable shallow cloning, simply set the GIT_SHALLOW_CLONE environment variable to false.
Example
jobs:
  main:
    environment:
      GIT_SHALLOW_CLONE: false
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Here is a comparison of the build speed improvement for a repository containing roughly 160k commits. Before: After: For more information, please consult the Screwdriver V4 FAQ.
Compatibility List
In order to use the new Git shallow clone feature, you will need these minimum versions:
- screwdrivercd/screwdriver: v0.5.501
Contributors
Thanks to the following people for making this feature possible:
- Filbird
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Git Shallow Clone

November 12, 2018
November 8, 2018
November 8, 2018
amberwilsonla
Share

Hadoop Contributors Meetup at Oath

By Scott Bush, Director, Hadoop Software Engineering, Oath On Tuesday, September 25, we hosted a special day-long Hadoop Contributors Meetup at our Sunnyvale, California campus. Much of the early Hadoop development work started at Yahoo, now part of Oath, and has continued over the past decade. Our campus was the perfect setting for this meetup, as we continue to make Hadoop a priority. More than 80 Hadoop users, contributors, committers, and PMC members gathered to hear talks on key issues facing the Hadoop user community. Speakers from Ampool, Cloudera, Hortonworks, Microsoft, Oath, and Twitter detailed some of the challenges and solutions pertinent to their parts of the Hadoop ecosystem. The talks were followed by a number of parallel birds-of-a-feather breakout sessions to discuss HDFS, Tez, containers, and low-latency processing. The day ended with a reception and a consensus that the event went well and should be repeated in the near future. Presentation recordings (YouTube playlist) and slides (links included in the video description) are available here:
- Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda Tan, Hortonworks
- Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara, Botong Huang
- HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
- The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
- Moving the Oath Grid to Docker, Eric Badger, Software Developer Engineer, Oath
- Vespa: Open Source Big Data Serving Engine, Jon Bratseth, Distinguished Architect, Oath
- Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shane Kumpf, Hortonworks
- How Twitter Hadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
Thank you to all the presenters and the attendees, both in person and remote! P.S. We’re hiring! Learn more about career opportunities at Oath.

Hadoop Contributors Meetup at Oath

November 8, 2018
November 7, 2018
November 7, 2018
Share

Build Cache

Screwdriver now has the ability to cache and restore files and directories from your builds for use in other builds! This feature gives you the option to cache artifacts in builds using Gradle, NPM, Maven, etc., so subsequent builds can save time on commonly-run steps such as dependency installation and package builds. You can now specify a top-level setting in your screwdriver.yaml called cache that contains file paths from your build that you would like to cache. You can limit access to the cache at a pipeline, event, or job-level scope.
Scope guide
- pipeline-level: all builds in the same pipeline (across different jobs and events)
- event-level: all builds in the same event (across different jobs)
- job-level: all builds for the same job (across different events in the same pipeline)
Example
cache:
  event:
    - $SD_SOURCE_DIR/node_modules
  pipeline:
    - ~/.gradle
  job:
    test-job: [/tmp/test]
In the above example, we cache the .gradle folder so that subsequent builds in the pipeline can save time on Gradle dependency installation. Without cache: With cache:
Compatibility List
In order to use the new build cache feature, you will need these minimum versions:
- screwdrivercd/queue-worker: v2.2.2
- screwdrivercd/screwdriver: v0.5.492
- screwdrivercd/launcher: v5.0.37
- screwdrivercd/store: v3.3.11
Note: Please ensure the store service has sufficient available memory to handle the payload. For cache cleanup, we use AWS S3 Lifecycle Management. If your store service is not configured to use S3, you might need to add a cleanup mechanism.
Contributors
Thanks to the following people for making this feature possible:
- d2lam
- pranavrc
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Build Cache

November 7, 2018
November 5, 2018
November 5, 2018
Share

Announcing Bandar-Log: easily monitor throughput of data sources and processing components for ETL pipelines

By Alexey Lipodat, Project Engagement, Oath One of the biggest problems with a typical Extract, Transform, Load (ETL) workflow is that stringing together multiple systems adds processing time that is often unmeasured. At Oath, we turned to open source for a solution. When researching possible solutions for enabling process metrics within our ETL pipeline, we searched for tools focused on Apache Kafka, Athena, Hive, Presto, and Vertica. For Kafka, we discovered Burrow, a monitoring service that provides a consumer lag metric, but no existing options were available for Hive and Vertica. Creating or integrating a different monitoring application for each component of the ETL pipeline would significantly increase the complexity of running and maintaining the system. In order to avoid unnecessary complexity and effort, we built Bandar-Log, a simple, standalone monitoring application based on typical ETL workflows. We published it as an open source project so that it can be a resource for other developers as well.
Meet Bandar-Log
Bandar-Log is a monitoring service that tracks three fundamental metrics in real time: lag, incoming rate, and outgoing rate. It runs as a standalone application that tracks performance outside the monitored application and sends metrics to an associated monitoring system (e.g. Datadog). A typical ETL assumes there will be some processing logic between data sources. This assumption adds some delay, or “resistance,” which Bandar-Log can measure. For example:
- How many events is the Spark app processing per minute, compared to how many events are arriving on the Kafka topics?
- What is the size of unprocessed events in Kafka topics at this exact moment?
- How much time has passed since the last aggregation was processed?
Bandar-Log makes it easy to answer these questions and is a great tool:
- Simple to use — create your own Bandar-Log in 10 minutes by following the Start Bandar-Log in 3 steps doc, and easily extend or add custom data sources
- Stable — tested extensively on real-time big data pipelines
- Fully supported — new features are frequently added
- Flexible — no modifications needed to existing apps for this external, standalone application to monitor metrics
How Bandar-Log works
To better understand how Bandar-Log works, let’s review a single part of the ETL process and integrate it with Bandar-Log. As an example, we’ll use the replication stage, the process of copying data from one data source to another. Let’s assume we already have aggregated data in Presto/Hive storage and now we’ll replicate the data to Vertica and query business reports. We’ll use a replicator app to duplicate data from one storage to another. This process copies a dedicated batch of data from Presto/Hive to Vertica. The replication process is now configured, but how should we track its progress? We want to make sure that the replication works as expected and that there isn’t any delay. In order to confirm that the replication is working correctly, we will need to know:
- How much unreplicated data do we have in Presto/Hive?
- What’s the incoming rate of Presto/Hive data?
- How much data has already been replicated to Vertica?
- What’s the lag between unreplicated and already replicated data?
Inserting tracking logic directly into the replicator app, or adding tracking logic to every ETL pipeline component whose status we need, poses concerns: these approaches are not scalable, and if the replicator app fails, no information is received.
A standalone application like Bandar-Log solves these issues by tracking status from outside the monitored application. Let’s integrate Bandar-Log with the replication pipeline. As mentioned above, Bandar-Log operates with three fundamental metrics:
- IN - incoming rate
- OUT - outgoing rate
- LAG - lag rate (the difference between IN and OUT)
Each component interprets lag and incoming/outgoing rates in its own terms. For instance, for a Spark-driven app that depends on Kafka, the lag signifies the number of unread messages, and the rates are the producing and consuming rates. For our replicator app, the incoming and outgoing rates mean how many batches of data arrived and were processed per some interval. ETL components use a dedicated column, called batch_id, to mark and isolate a specific piece of processed data. The batch_id is a Unix timestamp measured in milliseconds, which records when the piece of data was processed. Keeping this in mind, we’ll use the Presto/Hive source to measure the input rate and our Vertica source to measure the output rate. With this approach, the input rate will be the last processed timestamp from Presto/Hive, and the output rate will be the last replicated timestamp from Vertica. Bandar-Log will fetch these metrics according to the rate interval, which can be configured to retrieve metrics based on preference. Fetched metrics will be pushed to Datadog, the default monitoring system. Now we have the input and output rate metrics. What about lag? How can we track the delay of our replication process? This is quite simple: the LAG metric depends on the IN and OUT metrics and is calculated as the difference between them. If we have both of these metrics, then we can easily calculate the LAG metric, which shows us the delay in the replication process between Presto/Hive and Vertica. However, if we are running a real-time system, then we need to deliver data faster and track the status of this process. This metric can be called “Real-time Lag” and determines the delay between the current timestamp and the timestamp associated with the last processed or replicated batch of data. To calculate the REALTIME_LAG metric, the IN metric from Presto/Hive isn’t needed; instead, we use the current timestamp and the OUT metric from the final source, which is Vertica. After this step we’ll have four metrics (IN, OUT, LAG, REALTIME_LAG) in the monitoring system (a small sketch of these calculations follows at the end of this post). We can now create appropriate dashboards and alert monitors. This is just one of many examples showing how you can use Bandar-Log. Explore more examples in the Bandar-Log Readme doc. Download the code and if you’d like to contribute any code fixes or additional functionality, visit Bandar-Log on GitHub and become part of our open source community. P.S. We’re hiring! Explore opportunities here.
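As referenced above, a minimal sketch in Python of how the four metrics can be derived from batch_id timestamps (the function and variable names are illustrative, not Bandar-Log’s API; Bandar-Log itself fetches these values from Presto/Hive and Vertica and pushes them to the monitoring system):

import time

def compute_metrics(last_processed_batch_ms, last_replicated_batch_ms):
    in_metric = last_processed_batch_ms          # newest batch_id in Presto/Hive
    out_metric = last_replicated_batch_ms        # newest batch_id in Vertica
    lag = in_metric - out_metric                 # unreplicated span
    realtime_lag = int(time.time() * 1000) - out_metric  # delay versus "now"
    return {"IN": in_metric, "OUT": out_metric,
            "LAG": lag, "REALTIME_LAG": realtime_lag}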

Announcing Bandar-Log: easily monitor throughput of data sources and processing components for ETL pipelines

November 5, 2018
October 29, 2018
October 29, 2018
amberwilsonla
Share

Announcing the 2nd Annual Moloch Conference: Learn how to augment your current security infrastructure

We’re excited to share that the 2nd Annual MolochON will be Thursday, Nov. 1, 2018 in Dulles, Virginia, at the Oath campus. Moloch is a large-scale, open source, full packet capturing, indexing and database system. There’s no cost to attend the event and we’d love to see you there! Feel free to register here. We’ll be joined by many fantastic speakers from the Moloch community to present on the following topics:
Moloch: Recent Changes & Upcoming Features by Andy Wick, Sr Princ Architect, Oath & Elyse Rinne, Software Dev Engineer, Oath
Since the last MolochON, many new features have been added to Moloch. We will review some of these features and demo how to use them. We will also discuss a few desired upcoming features. Speaker Bios: Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back. Elyse is the UI and full stack engineer for Moloch. She revamped the UI to be more user-friendly and maintainable. Now that the revamp has been completed, Elyse is working on implementing awesome new Moloch features!
Small Scale at Large Scale: Putting Moloch on the Analyst’s Desk by Phil Hagen, SANS Senior Instructor, DFIR Strategist, Red Canary
I’ve been excited to add Moloch to the FOR572 class, Advanced Network Forensics, at the SANS Institute. In FOR572, we cover Moloch with nearly 1,000 students per year, via classroom discussions and hands-on labs. This presents an interesting engineering problem, in that we provide a self-contained VMware image for the classroom lab, but it is also suitable for use in forensic casework. In this talk, I’ll cover some of what we did to make a single VM into a stable and predictable environment, distributed to hundreds of students across the world. Speaker Bio: Phil is a Senior Instructor with the SANS Institute and the DFIR Strategist at Red Canary. He is the course lead for SANS FOR572, Advanced Network Forensics, and has been in the information security industry for over 20 years. Phil is also the lead for the SOF-ELK project, which provides a free, open source, ready-to-use Elastic Stack appliance to aid and optimize security operations and forensic processing. Networking is in his blood, dating back to a 2400 baud modem in an Apple //e, which he still has.
Oath Deployments by Andy Wick, Sr Princ Architect, Oath
The formation of Oath gave us an opportunity to rethink and create a new visibility stack. In this talk, we will be sharing our process for designing our stack for both office and data center deployments and discussing the technologies we decided to use. Speaker Bio: Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back.
Centralized Management and Deployment with Docker and Ansible by Taylor Ashworth, Cybersecurity Analyst
I will focus on how to use Docker and Ansible to deploy, update, and manage Moloch along with other tools like Suricata, WISE, and ES. I will explain the time-saving benefits of Ansible and the workload reduction benefits of Docker, and I will also cover the topic “Pros and cons of using Ansible Tower/AWX over Ansible in CLI.” If time permits, I’ll discuss “Using WISE for data enrichment.” Speaker Bio: Taylor is a cybersecurity analyst who was tired of the terrible tools he was presented with and decided to teach himself how to set up tools to successfully do his job.
Automated Threat Intel Investigation Pipeline by Matt Carothers, Principal Security Architect, Cox Communications I will discuss integrating Moloch into an automated threat intel investigation pipeline with MISP. Speaker Bio Matt enjoys sunsets, long hikes in the mountains and intrusion detection. After studying Computer Science at the University of Oklahoma, he accepted a position with Cox Communications in 2001 under the leadership of renowned thought leader and virtuoso bass player William “Wild Bill” Beesley, who asked to be credited in this bio. There, Matt formed Cox’s abuse department, which he led for several years, and today he serves as Cox’s Principal Security Architect. Using WISE by Andy Wick, Sr Princ Architect, Oath We will review how to use WISE and provide real-life examples of features added since the last MolochON. Speaker Bio Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back. Moloch Deployments by Srinath Mantripragada, Linux Integrator, SecureOps I will present a Moloch deployment with 20+ different Moloch nodes. A range will be presented, including small, medium, and large deployments that go from full hardware with dedicated capture cards to virtualized point-of-presence and AWS with transit network. All nodes run Moloch, Suricata and Bro. Speaker Bio Srinath has worked as a SysAdmin and related positions for most of his career. He currently works as an Integrator/SysAdmin/DevOps for SecureOps, a Security Services company in Montreal, Canada. Elasticsearch for Time-series Data at Scale by Andrew Selden, Solution Architect, Elastic Elasticsearch has evolved beyond search and logging to be a first-class, time-series metric store. This talk will explore how to achieve 1 million metrics/second on a relatively modest cluster. We will take a look at issues such as data modeling, debugging, tuning, sharding, rollups and more. Speaker Bio Andrew Selden has been running Elasticsearch at scale since 2011 where he previously led the search, NLP, and data engineering teams at Meltwater News and later developed streaming analytics solutions for BlueKai’s advertising platform (acquired by Oracle). He started his tenure at Elastic as a core engineer and for the last two years has been helping customers architect and scale. After the conference, enjoy a complimentary happy hour, sponsored by Arista. Hope to see you there!

October 17, 2018

Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup

yahoodevelopers: By Jon Bratseth, Distinguished Architect, Oath I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here): Largely developed by Yahoo engineers, Vespa is our big data processing and serving engine, available as open source on GitHub. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance and Oath Ads Platforms.  Vespa use is growing even more rapidly; since it is open source under a permissive Apache license, Vespa can power other external third-party apps as well.  A great example is Zedge, which uses Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS, and Web). Zedge uses Vespa in production to serve millions of monthly active users. Visit https://vespa.ai/ to learn more and download the code. We encourage code contributions and welcome opportunities to collaborate.

October 4, 2018

Open-Sourcing Panoptes, Oath’s distributed network telemetry collector

yahoodevelopers: By Ian Flint, Network Automation Architect and Varun Varma, Senior Principal Engineer The Oath network automation team is proud to announce that we are open-sourcing Panoptes, a distributed system for collecting, enriching and distributing network telemetry.   We developed Panoptes to address several issues inherent in legacy polling systems, including overpolling due to multiple point solutions for metrics, a lack of data normalization, consistent data enrichment and integration with infrastructure discovery systems.   Panoptes is a pluggable, distributed, high-performance data collection system which supports multiple polling formats, including SNMP and vendor-specific APIs. It is also extensible to support emerging streaming telemetry standards including gNMI. Architecture The following block diagram shows the major components of Panoptes: Panoptes is written primarily in Python, and leverages multiple open-source technologies to provide the most value for the least development effort. At the center of Panoptes is a metrics bus implemented on Kafka. All data plane transactions flow across this bus; discovery publishes devices to the bus, polling publishes metrics to the bus, and numerous clients read the data off of the bus for additional processing and forwarding. This architecture enables easy data distribution and integration with other systems. For example, in preparing for open-source, we identified a need for a generally available time series datastore. We developed, tested and released a plugin to push metrics into InfluxDB in under a week. This flexibility allows Panoptes to evolve with industry standards. Check scheduling is accomplished using Celery, a horizontally scalable, open-source scheduler utilizing a Redis data store. Celery’s scalable nature combined with Panoptes’ distributed nature yields excellent scalability. Across Oath, Panoptes currently runs hundreds of thousands of checks per second, and the infrastructure has been tested to more than one million checks per second. Panoptes ships with a simple, CSV-based discovery system. Integrating Panoptes with a CMDB is as simple as writing an adapter to emit a CSV, and importing that CSV into Panoptes. From there, Panoptes will manage the task of scheduling polling for the desired devices. Users can also develop custom discovery plugins to integrate with their CMDB and other device inventory data sources. Finally, any metrics gathering system needs a place to send the metrics. Panoptes’ initial release includes an integration with InfluxDB, an industry-standard time series store. Combined with Grafana and the InfluxData ecosystem, this gives teams the ability to quickly set up a fully-featured monitoring environment. Deployment at Oath At Oath, we anticipate significant benefits from building Panoptes. We will consolidate four siloed polling solutions into one, reducing overpolling and the associated risk of service interruption. As vendors move toward streaming telemetry, Panoptes’ flexible architecture will minimize the effort required to adopt these new protocols. There is another, less obvious benefit to a system like Panoptes. As is the case with most large enterprises, a massive ecosystem of downstream applications has evolved around our existing polling solutions. Panoptes allows us to continue to populate legacy datastores without continuing to run the polling layers of those systems. 
This is because Panoptes’ data bus enables multiple metrics consumers, so we can send metrics to both current and legacy datastores. At Oath, we have deployed Panoptes in a tiered, federated model. We install the software in each of our major data centers and proxy checks out to smaller installations such as edge sites.  All metrics are polled from an instance close to the devices, and metrics are forwarded to a centralized time series datastore. We have also developed numerous custom applications on the platform, including a load balancer monitor, a BGP session monitor, and a topology discovery application. The availability of a flexible, extensible platform has greatly reduced the cost of producing robust network data systems. Easy Setup Panoptes’ open-source release is packaged for easy deployment into any Linux-based environment. Deployment is straightforward, so you can have a working system up in hours, not days. We are excited to share our internal polling solution and welcome engineers to contribute to the codebase, including contributing device adapters, metrics forwarders, discovery plugins, and any other relevant data consumers.   Panoptes is available at https://github.com/yahoo/panoptes, and you can connect with our team at network-automation@oath.com.
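For a rough idea of what reading from the metrics bus can look like, here is a minimal, hypothetical sketch using the kafka-python client. The topic name, broker address, and JSON payload layout are illustrative assumptions rather than Panoptes’ actual configuration or schema; the project ships its own consumers and forwarders.

import json
from kafka import KafkaConsumer

# Assumed topic name and broker address; substitute your own deployment's values.
consumer = KafkaConsumer(
    'panoptes-metrics',
    bootstrap_servers='kafka.example.com:9092',
    value_deserializer=lambda raw: json.loads(raw.decode('utf-8')),
)

for message in consumer:
    metric = message.value
    # A downstream client could filter, enrich, or forward the metric here;
    # this sketch just prints it.
    print(metric)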

October 2, 2018

Configurable Build Resources

We’ve expanded build resource configuration options for Screwdriver! Screwdriver allows users to specify varying tiers of build resources via annotations. Previously, users were able to configure cpu and ram between the three tiers: micro, low (default), and high. In our recent change, we are introducing a new configurable resource, disk, which can be set to either low (default) or high. Furthermore, we are adding an extra tier, turbo, to both the cpu and ram resources! Please note that although Screwdriver provides default values for each tier, their actual values are determined by the cluster admin.

Screwdriver Users
In order to use these new settings, Screwdriver users can configure their screwdriver.yamls using annotations as shown below:

Example:
jobs:
  main:
    annotations:
      screwdriver.cd/cpu: TURBO
      screwdriver.cd/disk: HIGH
      screwdriver.cd/ram: MICRO
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]

Cluster Admins
Screwdriver cluster admins can refer to the following issues to set up turbo and disk resources properly.
- Turbo resources: https://github.com/screwdriver-cd/screwdriver/issues/1318#issue-364993739
- Disk resources: https://github.com/screwdriver-cd/screwdriver/issues/757#issuecomment-425589405

Compatibility List
In order to use these new features, you will need these minimum versions:
- screwdrivercd/queue-worker: v2.2.2

Contributors
Thanks to the following people for making this feature possible:
- Filbird
- minz1027

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

September 25, 2018

Apache Pulsar graduates to Top-Level Project

yahoodevelopers: By Joe Francis, Director, Storage & Messaging We’re excited to share that The Apache Software Foundation announced today that Apache Pulsar has graduated from the incubator to a Top-Level Project. Apache Pulsar is an open-source distributed pub-sub messaging system, created by Yahoo in June 2015 and submitted to the Apache Incubator in June 2017. Apache Pulsar is integral to the streaming data pipelines supporting Oath’s core products including Yahoo Mail, Yahoo Finance, Yahoo Sports and Oath Ad Platforms. It handles hundreds of billions of data events each day and is an integral part of our hybrid cloud strategy. It enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds.   Oath continues to support Apache Pulsar, with contributions including best-effort messaging, load balancer and end-to-end encryption. With growing data needs handled by Apache Pulsar at Oath, we’re focused on reducing memory pressure in brokers and bookkeepers, and creating additional connectors to other large-scale systems. Apache Pulsar’s future is bright and we’re thrilled to be part of this great project and community. P.S. We’re hiring! Learn more here.
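If you want to try Pulsar yourself, a minimal produce/consume round trip with the Apache Pulsar Python client looks roughly like the sketch below. It assumes a local standalone broker on the default port; the topic and subscription names are placeholders.

import pulsar

# Assumes a local standalone broker; point the service URL at your own cluster.
client = pulsar.Client('pulsar://localhost:6650')

# Create the subscription first so the message sent below is delivered to it.
consumer = client.subscribe('example-topic', subscription_name='example-subscription')

producer = client.create_producer('example-topic')
producer.send('hello pulsar'.encode('utf-8'))

msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()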

September 20, 2018

Pipeline pagination on the Search page

We’ve recently added pagination to the pipelines on the Search page! Before pipeline pagination, when a user visited the Search page (e.g. /search), all pipelines were fetched from the API and sorted alphabetically in the UI. In order to improve the total page load time, we moved the burden of pagination from the UI to the API. Now, when a user visits the Search page, only the first page of pipelines is fetched by default. Clicking the Show More button triggers the fetching of the next page of pipelines. All the pagination and search logic has moved to the datastore, so the overall load time for fetching a page of search results is now under 2 seconds, compared to before, when some search queries could take more than 10 seconds.

Screwdriver Cluster Admins
In order to use these latest changes fully, Screwdriver cluster admins will need to run some SQL queries to migrate data from scmRepo to the new name field. This name field will be used for sorting and searching in the Search UI.

Without migrating
If no migration is done, pipelines will show up sorted by id on the Search page. Pipelines will not be returned in search results until a sync or update is done on them (either directly from the UI or by interacting with the pipeline in some way in the UI).

Steps to migrate
1. Pull in the new API (v0.5.466). This is necessary for the name column to be created in the DB.
2. Take a snapshot or back up your DB.
3. Set the pipeline name. This requires two calls in postgres: one to extract the pipeline name data, the second to remove the curly braces ({ and }) injected by the regexp call. In postgresql, run:
UPDATE public.pipelines SET name = regexp_matches("scmRepo", '.*name":"(.*)",.*')
UPDATE public.pipelines SET name = btrim(name, '{}')
4. Pull in the new UI (v1.0.331).
5. Optionally, you can post a banner to let users know they might need to sync their pipelines if they are not showing up in search results. Make an API call to POST /banners with proper auth and a body like:
{
  "message": "If your pipeline is not showing up in Search results, go to the pipeline Options tab and Sync the pipeline.",
  "isActive": true,
  "type": "info"
}

Compatibility List
The Search page pipeline pagination requires the following minimum versions of Screwdriver:
- API: v0.5.466
- UI: v1.0.331

Contributors
Thanks to the following people who made this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
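As a hypothetical illustration of what the paged API makes possible from a script, the sketch below walks through pipelines one page at a time with the requests library. The page, count, and search parameter names and the auth header are assumptions made for illustration; check your cluster’s API documentation for the exact contract.

import requests

API_URL = 'https://api.screwdriver.cd/v4'
HEADERS = {'Authorization': 'Bearer YOUR_JWT_HERE'}  # placeholder; obtain a JWT via the auth endpoint

page = 1
while True:
    # Parameter names here are assumed for illustration.
    resp = requests.get('%s/pipelines' % API_URL, headers=HEADERS,
                        params={'page': page, 'count': 50, 'search': 'myrepo'})
    pipelines = resp.json()
    if not pipelines:
        break
    for pipeline in pipelines:
        print(pipeline['id'], pipeline.get('name'))
    page += 1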

September 19, 2018

Introducing HaloDB, a fast, embedded key-value storage engine written in Java

yahoodevelopers: By Arjun Mannaly, Senior Software Engineer  At Oath, multiple ad platforms use a high throughput, low latency distributed key-value database that runs in data centers all over the world. The database stores billions of records and handles millions of read and write requests per second at millisecond latencies. The data we have in this database must be persistent, and the working set is larger than what we can fit in memory. Therefore, a key component of the database performance is a fast storage engine. Our current solution had served us well, but it was primarily designed for a read-heavy workload and its write throughput started to be a bottleneck as write traffic increased. There were other additional concerns as well; it took hours to repair a corrupted DB, or iterate over and delete records. The storage engine also didn’t expose enough operational metrics. The primary concern though was the write performance, which based on our projections, would have been a major obstacle for scaling the database. With these concerns in mind, we began searching for an alternative solution. We searched for a key-value storage engine capable of dealing with IO-bound workloads, with submillisecond read latencies under high read and write throughput. After concluding our research and benchmarking alternatives, we didn’t find a solution that worked for our workload, thus we were inspired to build HaloDB. Now, we’re glad to announce that it’s also open source and available to use under the terms of the Apache license. HaloDB has given our production boxes a 50% improvement in write capacity while consistently maintaining a submillisecond read latency at the 99th percentile. Architecture HaloDB primarily consists of append-only log files on disk and an index of keys in memory. All writes are sequential writes which go to an append-only log file and the file is rolled-over once it reaches a configurable size. Older versions of records are removed to make space by a background compaction job. The in-memory index in HaloDB is a hash table which stores all keys and their associated metadata. The size of the in-memory index, depending on the number of keys, can be quite large, hence for performance reasons, is stored outside the Java heap, in native memory. When looking up the value for a key, corresponding metadata is first read from the in-memory index and then the value is read from disk. Each lookup request requires at most a single read from disk. Performance   The chart below shows the results of performance tests with real production data. The read requests were kept at 50,000 QPS while the write QPS was increased. HaloDB scaled very well as we increased the write QPS while consistently maintaining submillisecond read latencies at the 99th percentile. The chart below shows the 99th percentile latency from a production server before and after migration to HaloDB.  If HaloDB sounds like a helpful solution to you, please feel free to use it, open issues, and contribute!
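To make the design concrete, here is a tiny conceptual sketch in Python of the same idea - an append-only log on disk plus an in-memory index of key to (offset, length) - so each lookup needs at most one disk read. This is only an illustration of the architecture described above, not HaloDB’s actual Java API, and it omits crash recovery, file roll-over, and the background compaction job.

import os

class TinyLogStore:
    """Toy append-only log with an in-memory index (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset, length), kept in memory
        self.offset = os.path.getsize(path) if os.path.exists(path) else 0
        self.log = open(path, 'ab')

    def put(self, key, value):
        # Sequential write to the append-only log.
        self.log.write(value)
        self.log.flush()
        self.index[key] = (self.offset, len(value))
        self.offset += len(value)

    def get(self, key):
        if key not in self.index:
            return None
        offset, length = self.index[key]
        # Metadata comes from memory; at most one positioned read hits the disk.
        with open(self.path, 'rb') as reader:
            reader.seek(offset)
            return reader.read(length)

store = TinyLogStore('/tmp/tiny_log_store.dat')
store.put(b'user:1', b'{"name": "alice"}')
print(store.get(b'user:1'))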

September 18, 2018

Join us in San Francisco on September 26th for a Meetup

Hi Vespa Community, Several members from our team will be traveling to San Francisco on September 26th for a meetup and we’d love to chat with you there. Jon Bratseth (Distinguished Architect) will present a Vespa overview and answer any questions. To learn more and RSVP, please visit: https://www.meetup.com/SF-Big-Analytics/events/254461052/. Hope to see you! The Vespa Team

September 18, 2018

Build step logs download

Downloading Step Logs We have added a Download button in the top right corner of the build log console. Upon clicking the button, the browser will query all or the rest of the log content from our API and compose a client-side downloadable text blob by leveraging the URL.createObjectURL() Web API. Minor Improvement On Workflow Graph Thanks to s-yoshika, the link edge is no longer covering the name text of the build node. Also, for build jobs with names that exceed 20 characters, it will be automatically ellipsized to avoid being clipped off by the containing DOM element. Compatibility List These UI improvements require the following minimum versions of Screwdriver: - screwdrivercd/ui: v1.0.329Contributors Thanks to the following people for making this feature possible: - DekusDenial - s-yoshika Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

September 13, 2018

Introducing Oak: an Open Source Scalable Key-Value Map for Big Data Analytics

yahoodevelopers: By Dmitry Basin, Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, Hagar Meir, Gali Sheffi Real-time analytics applications are on the rise. Modern decision support and machine intelligence engines strive to continuously ingest large volumes of data while providing up-to-date insights with minimum delay. For example, in Flurry Analytics, an Oath service which provides mobile developers with rich tools to explore user behavior in real time, it only takes seconds to reflect the events that happened on mobile devices in its numerous dashboards. The scalability demand is immense – as of late 2017, the Flurry SDK was installed on 2.6B devices and monitored 1M+ mobile apps. Mobile data hits the Flurry backend at a huge rate, updates statistics across hundreds of dimensions, and becomes queryable immediately. Flurry harnesses the open-source distributed interactive analytics engine named Druid to ingest data and serve queries at this massive rate. In order to minimize delays before data becomes available for analysis, technologies like Druid should avoid maintaining separate systems for data ingestion and query serving, and instead strive to do both within the same system. Doing so is nontrivial since one cannot compromise on overall correctness when multiple conflicting operations execute in parallel on modern multi-core CPUs. A promising approach is using concurrent data structure (CDS) algorithms which adapt traditional data structures to multiprocessor hardware. CDS implementations are thread-safe – that is, developers can use them exactly as sequential code while maintaining strong theoretical correctness guarantees. In recent years, CDS algorithms enabled dramatic application performance scaling and became popular programming tools. For example, Java programmers can use the ConcurrentNavigableMap JDK implementations for the concurrent ordered key-value map abstraction that is instrumental in systems like Druid. Today, we are excited to share Oak, a new open source project from Oath, available under the Apache License 2.0. The project was created by the Scalable Systems team at Yahoo Research. It extends upon our earlier research work, named KiWi. Oak is a Java package that implements OakMap – a concurrent ordered key-value map. OakMap’s API is similar to Java’s ConcurrentNavigableMap. Java developers will find it easy to switch most of their applications to it. OakMap provides the safety guarantees specified by ConcurrentNavigableMap’s programming model. However, it scales with the RAM and CPU resources well beyond the best-in-class ConcurrentNavigableMap implementations. For example, it compares favorably to Doug Lea’s seminal ConcurrentSkipListMap, which is used by multiple big data platforms, including Apache HBase, Druid, EVCache, etc. Our benchmarks show that OakMap harnesses 3x more memory, and runs 3x-5x faster on analytics workloads. OakMap’s implementation is very different from traditional implementations such as  ConcurrentSkipListMap. While the latter maintains all keys and values as individual Java objects, OakMap stores them in very large memory buffers allocated beyond the JVM-managed memory heap (hence the name Oak - abbr. Off-heap Allocated Keys). The access to the key-value pairs is provided by a lightweight two-level on-heap index. At its lower level, the references to keys are stored in contiguous chunks, each responsible for a distinct key range. 
The chunks themselves, which dominate the index footprint, are accessed through a lightweight top-level ConcurrentSkipListMap. The figure below illustrates OakMap’s data organization. OakMap structure. The maintenance of OakMap’s chunked index in a concurrent setting is the crux of its complexity as well as the key for its efficiency. Experiments have shown that our algorithm is advantageous in multiple ways: 1. Memory scaling. OakMap’s custom off-heap memory allocation alleviates the garbage collection (GC) overhead that plagues Java applications. Despite the permanent progress, modern Java GC algorithms do not practically scale beyond a few tens of GBs of memory, whereas OakMap scales beyond 128GB of off-heap RAM. 2. Query speed. The chunk-based layout increases data locality, which speeds up both single-key lookups and range scans. All queries enjoy efficient, cache-friendly access, in contrast with permanent dereferencing in object-based maps. On top of these basic merits, OakMap provides safe direct access to its chunks, which avoids an extra copy for rebuilding the original key and value objects. Our benchmarks demonstrate OakMap’s performance benefits versus ConcurrentSkipListMap: A) Up to 2x throughput for ascending scans. B) Up to 5x throughput for descending scans. C) Up to 3x throughput for lookups. 3. Update speed. Beyond avoiding the GC overhead typical for write-intensive workloads, OakMap optimizes the incremental maintenance of big complex values – for example, aggregate data sketches, which are indispensable in systems like Druid. It adopts in situ computation on objects embedded in its internal chunks to avoid unnecessary data copy, yet again. In our benchmarks, OakMap achieves up to 1.8x data ingestion rate versus ConcurrentSkipListMap. With key-value maps being an extremely generic abstraction, it is easy to envision a variety of use cases for OakMap in large-scale analytics and machine learning applications – such as unstructured key-value storage, structured databases, in-memory caches, parameter servers, etc. For example, we are already working with the Druid community on rebuilding Druid’s core Incremental Index component around OakMap, in order to boost its scalability and performance. We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. If you have any questions, please feel free to send us a note on the Oak developers list: oakproject@googlegroups.com. It would be great to hear from you!

September 12, 2018

Improvement on perceived performance

In an effort to improve the Screwdriver user experience, the Screwdriver team identified two major components in the UI that needed improvement with respect to load time: the event pipeline and the build step logs. To improve user-perceived performance on those components, we adopted two corresponding UX approaches: pagination and lazy loading.

Event Pipeline
Before our pagination change, when a user visited the pipeline events page (e.g. /pipelines/{id}/events), all events and their builds were fetched from the API and then artificially paginated in the UI. In order to improve the total page load time, it was important to move the burden of pagination from the UI to the API. Now, when a user visits the pipeline events page, only the latest page of events and builds is fetched by default. Clicking the Show More button triggers the fetching of the next page of events and builds. Since there is no further processing of the API data by the UI, the overall load time for fetching a page of events and their corresponding build info is now well under a second, compared to before, when some pipelines could take more than ten seconds.

Build Step Log
As for the build step log, instead of chronologically fetching pages of completed step logs one page at a time until the entire log is fetched, the log is now fetched in reverse chronological order, and only a reasonable amount of it is loaded lazily as the user scrolls up the log console. This change is meant to compensate for builds that generate tens of thousands of lines of logs. Since users had to wait for the entire log to load before they could interact with it, the previous implementation became extremely time consuming as the size of step logs increased. Now, the first page of a step log takes roughly two seconds or less to load. To put the significance of the change into perspective, consider a step that generates a total of 98,743 lines of logs: it would have taken 90 seconds to load and almost 10 seconds to fully render in the UI; now it takes less than 2 seconds to load and less than 1 second to render.

Compatibility List
These UI improvements require the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.460
- screwdrivercd/ui: v1.0.327

Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- jithin1987
- minz1027
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

September 3, 2018

Vespa at Zedge - providing personalization content to millions of iOS, Android & web users

This blog post describes Zedge’s use of Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS and Web). Zedge is now using Vespa in production to serve millions of monthly active users. See the architecture below.

What is Zedge?
Zedge’s main product is an app - Zedge Ringtones & Wallpapers - that provides wallpapers, ringtones, game recommendations and notification sounds customized for your mobile device. Zedge apps have been downloaded more than 300 million times combined for iOS and Android and are used by millions of people worldwide each month. Zedge is traded on NYSE under the ticker ZDGE. People use Zedge apps for self-expression. Setting a wallpaper or ringtone on your mobile device is in many ways similar to selecting clothes, hairstyle or other fashion statements. In fact, people try a wallpaper or ringtone in a similar manner as they would try clothes in a dressing room before making a purchase decision: they try different wallpapers or ringtones before deciding on one they want to keep for a while. The decision to select a wallpaper is not taken lightly, since people interact with and view their mobile device screen (and background wallpaper) a lot (hundreds of times per day).

Why Zedge considered Vespa
Zedge apps - for iOS, Android and Web - depend heavily on search and recommender services to support content discovery. These services had been developed over several years and consisted of multiple subsystems - both internally developed and open source - and technologies for both search and recommender serving. In addition there were numerous big data processing jobs to build and maintain data for content discovery serving. The time and complexity of improving search and recommender services and corresponding processing jobs started to become high, so simplification was due. Vespa seemed like a promising open source technology to consider for Zedge, in particular since it was proven in several ways within Oath (Yahoo):
1. Scales to handle very large systems, e.g.:
   - Flickr with billions of images
   - Yahoo Gemini Ads Platform with more than one hundred thousand requests per second to serve ads to 1 billion monthly active users for services such as Techcrunch, Aol, Yahoo!, Tumblr and Huffpost
2. Runs stably and requires very little operations support - Oath has a few hundred (many of them large) Vespa-based applications requiring less than a handful of operations people to run smoothly.
3. Rich set of features that Zedge could gain from using:
   - Built-in tensor processing support could simplify calculation and serving of related wallpapers (images) & ringtones/notifications (audio)
   - Built-in support for TensorFlow models to simplify development and deployment of machine learning based search and recommender ranking (at that time in development, according to Oath)
   - Search Chains
4. Help from core developers of Vespa

The Vespa pilot project
Given the content discovery technology need and promising characteristics of Vespa, we started out with a pilot project with a team of software engineers, SRE and data scientists with the goals of:
1. Learning about Vespa from hands-on development
2. Creating a realistic proof of concept using Vespa in a Zedge app
3. Getting initial answers to key questions about Vespa, i.e. enough to decide to go for it fully:
   - Which of today’s API services can it simplify and replace?
   - What are the (cloud) production costs with Vespa at Zedge’s scale? (OPEX)
   - How will maintenance and development look with Vespa? (future CAPEX)
   - Which new (innovation) opportunities does Vespa give?
The result of the pilot project was successful - we developed a good proof of concept use of Vespa with one of our Android apps internally and decided to start a project transferring all recommender and search serving to Vespa. Our impression after the pilot was that the main benefit was making it easier to maintain and develop search/recommender systems, in particular by reducing the amount of code and the complexity of processing jobs.

Autosuggest for search with Vespa
Since autosuggest (for search) required both low latency and high throughput, we decided that it was a good candidate to try in production with Vespa first. Configuration-wise it was similar to regular search (from the pilot), but snippet generation (document summary), which requires access to the document store, was superfluous for autosuggest. A good approach for autosuggest was to:
1. Make all document fields searchable with autosuggest of type (in-memory) attribute:
   - https://docs.vespa.ai/documentation/attributes.html
   - https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#attribute
   - https://docs.vespa.ai/documentation/search-definitions.html (basics)
2. Avoid snippet generation and using the document store by overriding the document-summary setting in search definitions to only access attributes:
   - https://docs.vespa.ai/documentation/document-summaries.html
   - https://docs.vespa.ai/documentation/nativerank.html
The figure above illustrates the autosuggest architecture. When the user starts typing in the search field, we fire a query with the search prefix to the Cloudflare worker, which in the case of a cache hit returns the result (possible queries) to the client. In the case of a cache miss, the Cloudflare worker forwards the query to our Vespa instance handling autosuggest. As the external API for autosuggest, we use Cloudflare Workers (supporting Javascript on V8 and later perhaps multiple languages with Webassembly) to handle API queries from Zedge apps in front of Vespa running in Google Cloud. This setup allows for simple close-to-user caching of autosuggest results.

Search, Recommenders and Related Content with Vespa
Without going into details, we had several recommender and search services to adapt to Vespa. These services were adapted by writing custom Vespa searchers and in some cases search chains:
- https://docs.vespa.ai/documentation/searcher-development.html
- https://docs.vespa.ai/documentation/chained-components.html
The main change compared to our old recommender and related content services was the degree of dynamicity and freshness of serving, i.e. with Vespa more ranking signals are calculated on the fly using Vespa’s tensor support instead of being precalculated and fed into services periodically. Another benefit of this was that the amount of computational (big data) resources and code for recommender & related content processing was heavily reduced.

Continuous Integration and Testing with Vespa
A main focus was testing and deployment of Vespa services with continuous integration (see figure below). We found that a combination of Jenkins (or a similar CI product or service) with Docker Compose worked nicely in order to test new Vespa applications, corresponding configurations and data (samples) before deploying to the staging cluster with Vespa on Google Cloud. This way we can have a realistic test setup - with Docker Compose - that is close to identical to the production environment (even at hostname level).

Monitoring of Vespa with Prometheus and Grafana
For monitoring we created a tool that continuously reads Vespa metrics, stores them in Prometheus (a time series database) and visualizes them with Grafana. This tool can be found on https://github.com/vespa-engine/vespa_exporter. More information about Vespa metrics and monitoring:
- https://docs.vespa.ai/documentation/reference/metrics-health-format.html
- https://docs.vespa.ai/documentation/jdisc/metrics.html
- https://docs.vespa.ai/documentation/operations/admin-monitoring.html

Conclusion
The team quickly got up to speed with Vespa thanks to its good documentation and examples, and it has been running like a clock since we started using it for real loads in production. But this was only our first step with Vespa - i.e. consolidating existing search and recommender technologies into a more homogeneous and easier-to-maintain form. With Vespa as part of our architecture we see many possible paths for evolving our search and recommendation capabilities (e.g. machine learning based ranking such as integration with TensorFlow and ONNX). Best regards, Zedge Content Discovery Team
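As a rough sketch of the autosuggest flow described above - check a cache first, then fall back to the Vespa HTTP search API - the Python snippet below mimics the behaviour with a local dictionary standing in for the Cloudflare Worker cache. The endpoint, the use of the plain query parameter, and the result handling are simplifying assumptions rather than Zedge’s actual configuration.

import requests

VESPA_SEARCH_ENDPOINT = 'http://localhost:8080/search/'  # assumed local instance
_cache = {}  # stand-in for the close-to-user Cloudflare cache

def suggest(prefix, hits=5):
    if prefix in _cache:  # cache hit: return possible queries immediately
        return _cache[prefix]
    # Cache miss: forward the prefix to the Vespa instance handling autosuggest.
    response = requests.get(VESPA_SEARCH_ENDPOINT, params={'query': prefix, 'hits': hits})
    suggestions = response.json().get('root', {}).get('children', [])
    _cache[prefix] = suggestions
    return suggestions

print(suggest('cat wall'))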

August 27, 2018

Private channel support for Slack notifications

In January, we introduced Slack notifications for build statuses in public channels. This week, we are happy to announce that we now support Slack notifications for private channels as well!

Usage for a Screwdriver.cd User
Slack notifications can be configured the exact same way as before, but private channels are now supported. First, you must invite the Screwdriver Slack bot (most likely screwdriver-bot), created by your admin, to your Slack channel(s). Then, you must configure your screwdriver.yaml file, which stores all your build settings:

settings:
  slack:
    channels:
      - channel_A  # public
      - channel_B  # private
    statuses:  # statuses to notify on
      - SUCCESS
      - FAILURE
      - ABORTED

statuses denote the build statuses that trigger a notification. The full list of possible statuses to listen on can be found in our data-schema. If omitted, it defaults to only notifying you when a build returns a FAILURE status. See our previous Slack blog post, Slack user documentation, and cluster admin documentation for more information.

Compatibility List
Private channel support for Slack notifications requires the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.451

Contributors
Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

August 22, 2018

User configurable shell

Previously, Screwdriver ran builds in sh. This caused problems for some users who have bash syntax in their steps. With launcher v5.0.13 and above, users can run builds in the shell of their choice by setting the environment variable USER_SHELL_BIN. This value can also be a full path such as /bin/bash.

Example screwdriver.yaml (can be found under the screwdriver-cd-test/user-shell-example repo):

shared:
  image: node:6
jobs:
  # This job will fail because `source` is not available in sh
  test-sh:
    steps:
      - fail: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]
  # This job will pass because `source` is available in bash
  test-bash:
    # Set USER_SHELL_BIN to bash to run the steps in bash
    environment:
      USER_SHELL_BIN: bash
    steps:
      - pass: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]

Compatibility List
User-configurable shell support requires the following minimum versions of Screwdriver:
- screwdrivercd/launcher: v5.0.13

Contributors
Thanks to the following people for making this feature possible:
- d2lam

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

August 8, 2018

Introducing JSON queries

We recently introduced a new addition to the Search API - JSON queries. The search request can now be executed with a POST request, which includes the query parameters within its payload. Along with this new query format we also introduce a new parameter, SELECT, with the sub-parameters WHERE and GROUPING, which is equivalent to YQL.

The new query
With the Search API’s newest addition, it is now possible to send queries with HTTP POST. The query parameters have been moved out of the URL and into a POST request body - therefore, no more URL-encoding. You also avoid getting all the queries in the log, which can be an advantage. This is what a GET query looks like:

GET /search/?param1=value1&param2=value2&...

The general form of the new POST query is:

POST /search/ { param1 : value1, param2 : value2, ... }

The dot-notation is gone, and the query parameters are now nested under the same key instead. Let’s take this query:

GET /search/?yql=select+%2A+from+sources+%2A+where+default+contains+%22bad%22%3B&ranking.queryCache=false&ranking.profile=vespaProfile&ranking.matchPhase.ascending=true&ranking.matchPhase.maxHits=15&ranking.matchPhase.diversity.minGroups=10&presentation.bolding=false&presentation.format=json&nocache=true

and write it in the new POST request format, which will look like this:

POST /search/
{
  "yql": "select * from sources * where default contains \"bad\";",
  "ranking": {
    "queryCache": "false",
    "profile": "vespaProfile",
    "matchPhase": {
      "ascending": "true",
      "maxHits": 15,
      "diversity": {
        "minGroups": 10
      }
    }
  },
  "presentation": {
    "bolding": "false",
    "format": "json"
  },
  "nocache": true
}

With Vespa running (see Quick Start or Blog Search Tutorial), you can try building POST queries with the new querybuilder GUI at http://localhost:8080/querybuilder/, which can help you build queries with e.g. autocompletion of YQL.

The Select-parameter
The SELECT parameter is used with POST queries and is the JSON equivalent of YQL queries, so they cannot be used together. The query parameter will overwrite SELECT and decide the query’s query tree.

Where
The SQL-like syntax is gone and the tree syntax has been enhanced. If you’re used to the query-parameter syntax you’ll feel right at home with this new language. YQL is a regular language and is parsed into a query tree in Vespa. You can now build that tree in the WHERE parameter with JSON. Let’s take a look at the YQL select * from sources * where default contains foo and rank(a contains "A", b contains "B");, which is parsed into a query tree. You can build that tree with the WHERE parameter like this:

{
  "and": [
    { "contains": ["default", "foo"] },
    { "rank": [
      { "contains": ["a", "A"] },
      { "contains": ["b", "B"] }
    ]}
  ]
}

which is equivalent to the YQL.

Grouping
Grouping can now be written in JSON, with structure, instead of on a single line. Instead of parentheses, we now use curly brackets to symbolise the tree structure between the different grouping/aggregation functions, and colons to assign function arguments. A grouping that groups first by year and then by month can be written as such:

| all(group(time.year(a)) each(output(count()) all(group(time.monthofyear(a)) each(output(count()))))))

and equivalently with the new GROUPING parameter:

"grouping": [
  {
    "all": {
      "group": "time.year(a)",
      "each": { "output": "count()" },
      "all": {
        "group": "time.monthofyear(a)",
        "each": { "output": "count()" }
      }
    }
  }
]

Wrapping it up
In this post we have provided a gentle introduction to the new Vespa POST query feature and the SELECT parameter. You can read more about writing POST queries in the Vespa documentation. More examples of the POST query can be found in the Vespa tutorials. Please share your experiences. Happy searching!
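For example, sending the POST query above from a script is only a few lines with the requests library. The sketch below assumes a local Vespa instance (per the Quick Start) and that the vespaProfile ranking profile exists in your application.

import requests

query = {
    'yql': 'select * from sources * where default contains "bad";',
    'ranking': {
        'queryCache': 'false',
        'profile': 'vespaProfile',  # assumes this ranking profile is defined
    },
    'presentation': {'format': 'json'},
    'hits': 10,
}

response = requests.post('http://localhost:8080/search/', json=query)
print(response.json())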

July 30, 2018

Introducing Screwdriver Commands for sharing binaries

Oftentimes, there are small scripts or commands that people will use in multiple jobs that are not complex enough to warrant creating a Screwdriver template. Options such as Git repositories, yum packages, or node modules exist, but there was no clear way to share binaries or scripts across multiple jobs. Recently, we released Screwdriver Commands (also known as sd-cmd), which solves this problem, allowing users to easily share binary commands or scripts across multiple containers and jobs.

Using a command
The following is an example of using an sd-cmd. You can configure any commands or scripts in screwdriver.yaml like this:

Example:
jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - exec: sd-cmd exec foo/bar@1 -baz sample

Format for using sd-cmd: sd-cmd exec <namespace>/<name>@<version> <arguments>
- namespace/name - the fully-qualified command name
- version - a semver-compatible format or tag
- arguments - passed directly to the underlying command

In this example, Screwdriver will download the command “foobar.sh” from the Store, which is defined by namespace, name, and version, and will execute it with the args “-baz sample”. The actual command will be run as:
$ /opt/sd/commands/foo/bar/1.0.1/foobar.sh -baz sample

Creating a command
Next, this section covers how to publish your own binary commands or scripts. Commands or scripts must be published using a Screwdriver pipeline. The command will then be available in the same Screwdriver cluster.

Writing a command yaml
To create a command, create a repo with an sd-command.yaml file. The file should contain a namespace, name, version, description, maintainer email, format, and a config that depends on the format. Optionally, you can set the usage field, which will replace the default usage set in the documentation in the UI.

Example sd-command.yaml (binary example):
namespace: foo # Namespace for the command
name: bar # Command name
version: '1.0' # Major and Minor version number (patch is automatic), must be a string
description: |
  Lorem ipsum dolor sit amet.
usage: | # Optional usage field for documentation purposes
  sd-cmd exec foo/bar@

July 12, 2018

User teardown steps

Users can now specify their own teardown steps in Screwdriver, which will always run regardless of build status. These steps need to be defined at the end of the job and start with teardown-. Note: These steps run in separate shells. As a result, environment variables set by previous steps will not be available. Update 8/22/2018: Environment variables set by user steps are now available in teardown steps.

Example screwdriver.yaml:
jobs:
  main:
    image: node:8
    steps:
      - fail: command-does-not-exist
      - teardown-step1: echo hello
      - teardown-step2: echo goodbye
    requires:
      - ~commit
      - ~pr

In this example, the steps teardown-step1 and teardown-step2 will run even though the build fails.

Compatibility List
User teardown support requires the following minimum versions of Screwdriver:
- screwdrivercd/launcher: v4.0.116
- screwdrivercd/screwdriver: v0.5.405

Contributors
Thanks to the following people for making this feature possible:
- d2lam
- tk3fftk (from Yahoo! JAPAN)

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

July 9, 2018

Pipeline API Tokens in Screwdriver

We released pipeline-scoped API Tokens, which enable your scripts to interact with a specific Screwdriver pipeline. You can use these tokens with fine-grained access control for each pipeline instead of User Access Tokens.

Creating Tokens
If you go to Screwdriver’s updated pipeline Secrets page, you can find a list of all your pipeline access tokens along with the option to modify, refresh, or revoke them. At the bottom of the list is a form to generate a new token. Enter a name and optional description, then click Add. Your new pipeline token value will be displayed at the top of the Access Tokens section, but it will only be displayed once, so make sure you save it somewhere safe! This token provides admin-level access to your specific pipeline, so treat it as you would a password.

Using Tokens to Authenticate
To authenticate with your pipeline’s newly-created token, make a GET request to https://${API_URL}/v4/auth/token?api_token=${YOUR_PIPELINE_TOKEN_VALUE}. This returns a JSON object with a token field. The value of this field is a JSON Web Token, which you can use in an Authorization header to make further requests to the Screwdriver API. This JWT will be valid for 2 hours, after which you must re-authenticate.

Example: Starting a Specific Pipeline
You can use a pipeline token similarly to how you would use a user token. Here’s a short example written in Python showing how you can use a Pipeline API token to start a pipeline. This script will directly call the Screwdriver API.

from os import environ
from requests import get, post

pipeline_id = 1  # set this to the ID of the pipeline you want to start

# Authenticate with token
auth_request = get('https://api.screwdriver.cd/v4/auth/token?api_token=%s' % environ['SD_KEY'])
jwt = auth_request.json()['token']

# Set headers
headers = { 'Authorization': 'Bearer %s' % jwt }

# Get the jobs in the pipeline
jobs_request = get('https://api.screwdriver.cd/v4/pipelines/%s/jobs' % pipeline_id, headers=headers)
jobId = jobs_request.json()[0]['id']

# Start the first job
start_request = post('https://api.screwdriver.cd/v4/builds', headers=headers, data=dict(jobId=jobId))

Compatibility List
For pipeline tokens to work, you will need these minimum versions:
- screwdrivercd/screwdriver: v0.5.389
- screwdrivercd/ui: v1.0.290

Contributors
Thanks to the following people for making this feature possible:
- kumada626 (from Yahoo! JAPAN)
- petey
- s-yoshika (from Yahoo! JAPAN)

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

July 6, 2018

Multibyte Artifact Name Support

A multibyte character is a character composed of a sequence of more than one byte. Multibyte characters are common in Asian languages (e.g. Japanese, Chinese, Thai). Screwdriver now supports reading artifacts whose names contain multibyte characters.

Example screwdriver.yaml:
jobs:
  main:
    image: node:8
    requires: [ ~pr, ~commit ]
    steps:
      - touch_multibyte_artifact: echo 'foo' > $SD_ARTIFACTS_DIR/日本語ファイル名さんぷる.txt

In this example, we are writing an artifact, 日本語ファイル名さんぷる, which means “Japanese file name sample”. The artifact name includes Kanji, Katakana, and Hiragana, which are multibyte characters.

Compatibility List
Multibyte artifact name support requires the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.309

Contributors
Thanks to the following people for making this feature possible:
- minz1027
- sakka2 (from Yahoo! JAPAN)
- Zhongtang

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

June 29, 2018

Introducing Template Namespaces

We’ve reworked templates to filter by namespace! Namespaces are meant for easier grouping in the UI. From a template creator perspective, template creation still works the same; however, you now have the ability to explicitly define a template namespace. For Screwdriver cluster admins, you will need to migrate existing templates in your database to the new schema in order for them to be displayed correctly in the new UI. These steps are covered below.

Screwdriver Users
For Screwdriver template users, you can still use templates the same way.

jobs:
  main:
    template: templateNamespace/templateName@1.2.3
    requires: [~pr, ~commit]

In the UI, you can navigate to the template namespace page by clicking on the namespace or going to /templates/namespaces/<namespace>. Any templates with no defined template namespace will be available at /templates/namespaces/default.

Template owners
To create a template with a designated namespace, you can:
- Implicitly define a namespace (same as before)
- Explicitly define a namespace
- Use the default namespace (same as before)
Templates will still be used by users the same way.

Implicit namespace
Screwdriver will interpret anything before a template name’s slash (/) as the namespace. If you define a sd-template.yaml with name: nodejs/lib, Screwdriver will store namespace: nodejs and name: lib.

User’s screwdriver.yaml:
jobs:
  main:
    template: nodejs/lib@1.2.3
    requires: [~pr, ~commit]

Explicit namespace
You can explicitly define a template namespace. If you do, you cannot have any slashes (/) in your template name.

Template yaml snippet:
namespace: nodejs
name: lib
...

User’s screwdriver.yaml:
jobs:
  main:
    template: nodejs/lib@1.2.3
    requires: [~pr, ~commit]

Default namespace
If you don’t explicitly or implicitly define a namespace, Screwdriver will assign namespace: default to your template. Users will still use your template as you defined it, but it will be grouped with other templates with default namespaces in the UI.

Template yaml snippet:
name: lib

User’s screwdriver.yaml:
jobs:
  main:
    template: lib@1.2.3
    requires: [~pr, ~commit]

Screwdriver Cluster Admins
Database Migration
This feature has breaking changes that will affect your DB if you already have existing templates. In order to migrate your templates properly, you will need to do the following steps:
1. Make sure you’ve updated your unique constraints on your templates and templateTags tables to include namespace.
2. Set a default namespace when no namespace exists. In postgresql, run:
UPDATE public."templates" SET namespace = 'default' WHERE name !~ '.*/.*'
3. Set implicit namespaces if users defined them. This requires two calls in postgres: one to split namespace and name, the second to remove the curly braces ({ and }) injected by the regexp call.
UPDATE public."templates" SET namespace = regexp_matches(name, '(.*)/.*'), name = regexp_matches(name, '.*/(.*)') WHERE name ~ '.*/.*'
UPDATE public."templates" SET namespace = btrim(namespace, '{}'), name = btrim(name, '{}')

Compatibility List
Template namespaces require the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.396
- screwdrivercd/ui: v1.0.297
- screwdrivercd/launcher: v4.0.117
- screwdrivercd/store: v3.1.2
- screwdrivercd/queue-worker: v1.12.18

Contributors
Thanks to the following people for making this feature possible:
- jithin1987
- lusol
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

June 25, 2018

Introducing ONNX support

ONNX (Open Neural Network eXchange) is an open format for the sharing of neural network and other machine learned models between various machine learning and deep learning frameworks. As the open big data serving engine, Vespa aims to make it simple to evaluate machine learned models at serving time at scale. By adding ONNX support in Vespa in addition to our existing TensorFlow support, we’ve made it possible to evaluate models from all the commonly used ML frameworks with low latency over large amounts of data.

With the rise of deep learning in the last few years, we’ve naturally enough seen an increase in the number of deep learning frameworks as well: TensorFlow, PyTorch/Caffe2, MxNet etc. One reason for these different frameworks to exist is that they have been developed and optimized around some characteristic, such as fast training on distributed systems or GPUs, or efficient evaluation on mobile devices. Previously, complex projects with non-trivial data pipelines have been unable to pick the best framework for any given subtask due to lacking interoperability between these frameworks. ONNX is a solution to this problem.

ONNX is an open format for AI models, and represents an effort to push open standards in AI forward. The goal is to help increase the speed of innovation in the AI community by enabling interoperability between different frameworks and thus streamlining the process of getting models from research to production. There is one commonality between the frameworks mentioned above that enables an open format such as ONNX, and that is that they all make use of dataflow graphs in one way or another. While there are differences between each framework, they all provide APIs enabling developers to construct computational graphs and runtimes to process these graphs. Even though these graphs are conceptually similar, each framework has been a siloed stack of API, graph and runtime. The goal of ONNX is to empower developers to select the framework that works best for their project, by providing an extensible computational graph model that works as a common intermediate representation at any stage of development or deployment.

Vespa is an open source project which fits well within such an ecosystem, and we aim to make the process of deploying and serving models to production that have been trained on any framework as smooth as possible. Vespa is optimized toward serving and evaluating over potentially very large datasets while still responding in real time. In contrast to other ML model serving options, Vespa can more efficiently evaluate models over many data points. As such, Vespa is an excellent choice when combining model evaluation with serving of various types of content.

Our ONNX support is quite similar to our TensorFlow support. Importing ONNX models is as simple as adding the model to the Vespa application package (under “models/”) and referencing the model using the new ONNX ranking feature:

    expression: sum(onnx("my_model.onnx"))

The above expression runs the model and sums it to a single scalar value to use in ranking. You will have to provide the inputs to the graph: Vespa expects you to provide a macro with the same name as the input tensor. In the macro you can specify where the input should come from, be it a document field, a constant or a parameter sent along with the query. More information can be found in the documentation about ONNX import.

Internally, Vespa converts the ONNX operations to Vespa’s tensor API. We do the same for TensorFlow import, so the cost of evaluating ONNX and TensorFlow models is the same. We have put a lot of effort into optimizing the evaluation of tensors, so evaluating neural network models can be quite efficient.

ONNX support is quite new to Vespa, so we do not yet support all current ONNX operations. Part of the reason is that some operations are potentially too expensive to evaluate per document, such as convolutional neural networks and recurrent networks (LSTMs etc). ONNX also contains an extension, ONNX-ML, which adds operations for non-neural-network cases; support for this extension will come later. We are continually working to add functionality, so please reach out to us if there is something you would like to have added. Going forward we will keep improving performance as well as supporting more of the ONNX (and ONNX-ML) standard.

You can read more about ranking with ONNX models in the Vespa documentation. We are excited to announce ONNX support. Let us know what you are building with it!
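The post does not prescribe a particular training framework, so as one hedged illustration, here is a minimal sketch of exporting a small PyTorch model to the my_model.onnx file referenced above. The network shape and the input/output names are assumptions made only for this example; any framework with an ONNX exporter works the same way.

    import torch
    import torch.nn as nn

    # Hypothetical ranking model; the real model and its dimensions are up to you.
    class TinyRanker(nn.Module):
        def __init__(self, input_size: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(input_size, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, x):
            return self.net(x)

    model = TinyRanker()
    model.eval()

    # Export with a dummy input; the exported input name is the name your Vespa
    # macro must have, so Vespa knows where the input tensor comes from.
    dummy_input = torch.randn(1, 256)
    torch.onnx.export(model, dummy_input, "my_model.onnx",
                      input_names=["model_input"], output_names=["model_output"])

The resulting my_model.onnx is the file referenced by onnx("my_model.onnx") in the ranking expression above, placed under “models/” in the application package.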

Introducing ONNX support

June 25, 2018
June 25, 2018
June 25, 2018
mikesefanov
Share

Innovating on Authentication Standards

yahoodevelopers: By George Fletcher and Lovlesh Chhabra

When Yahoo and AOL came together a year ago as a part of the new Verizon subsidiary Oath, we took on the challenge of unifying their identity platforms based on current identity standards. Identity standards have been a critical part of the Internet ecosystem over the last 20+ years. From single-sign-on and identity federation with SAML; to the newer identity protocols including OpenID Connect, OAuth2, JOSE, and SCIM (to name a few); to the explorations of “self-sovereign identity” based on distributed ledger technologies; standards have played a key role in providing a secure identity layer for the Internet.

As we navigated this journey, we ran across a number of different use cases where there was either no standard or no best practice available for our varied and complicated needs. Instead of creating entirely new standards to solve our problems, we found it more productive to use existing standards in new ways.

One such use case arose when we realized that we needed to migrate the identity stored in mobile apps from the legacy identity provider to the new Oath identity platform. For most browser (mobile or desktop) use cases, this doesn’t present a huge problem; some DNS magic and HTTP redirects and the user will sign in at the correct endpoint. It’s also expected that users accessing services via their browser have to sign in now and then. However, for mobile applications it’s a completely different story. The normal user pattern for mobile apps is for the user to sign in (via OpenID Connect or OAuth2) and for the app to then be issued long-lived tokens (well, the refresh token is long lived), so the user never has to sign in again on the device (entering a password on the device is NOT a good experience for the user). So the issue is: how do we allow the mobile app to move from one identity provider to another without the user having to re-enter their credentials?

The solution came from researching what standards currently exist that might address this use case (see figure “Standards Landscape” below) and finding the OAuth 2.0 Token Exchange draft specification (https://tools.ietf.org/html/draft-ietf-oauth-token-exchange-13). The Token Exchange draft allows for a given token to be exchanged for new tokens in a different domain. For example, this could be used to manage the “audience” of a token that needs to be passed among a set of microservices to accomplish a task on behalf of the user. For the use case at hand, we created a specific implementation of the Token Exchange specification (a profile) to allow the refresh token from the originating Identity Provider (IDP) to be exchanged for new tokens from the consolidated IDP. By profiling this draft standard we were able to create a much better user experience for our consumers and do so without inventing proprietary mechanisms.

During this identity technical consolidation we also had to address how to support sharing signed-in users across mobile applications written by the same company (technically, signed with the same vendor signing key). Specifically, how can a user signed in to Yahoo Mail avoid having to sign in again when they start using the Yahoo Sports app? The current best practice for this is captured in OAuth 2.0 for Native Apps (RFC 8252). However, the flow described by this specification requires that the mobile device system browser hold the user’s authenticated sessions. This has some drawbacks, such as users clearing their cookies, or using private browsing mode, or even worse, requiring the IDPs to support multiple users signed in at the same time (not something most IDPs support). While RFC 8252 provides a mechanism for single-sign-on (SSO) across mobile apps provided by any vendor, we wanted a better solution for apps provided by Oath. So we looked at how we could enable mobile apps signed by the same vendor to share signed-in state in a more “back channel” way.

One important fact is that mobile apps cryptographically signed by the same vendor can securely share data via the device keychain on iOS and Account Manager on Android. Using this as a starting point we defined a new OAuth2 scope, device_sso, whose purpose is to require the Authorization Server (AS) to return a unique “secret” assigned to that specific device. The precedent for using a scope to define specification behaviour is OpenID Connect itself, which defines the “openid” scope as the trigger for the OpenID Provider (an OAuth2 AS) to implement the OpenID Connect specification. The device_secret is returned to a mobile app when the OAuth2 code is exchanged for tokens and is then stored by the mobile app in the device keychain along with the id_token identifying the user who signed in. At this point, a second mobile app signed by the same vendor can look in the keychain and find the id_token, ask the user if they want to use that identity with the new app, and then use a profile of the token exchange spec to obtain tokens for the second mobile app based on the id_token and the device_secret. The full sequence of steps looks like this:

As a result of our identity consolidation work over the past year, we derived a set of principles identity architects should find useful for addressing use cases that don’t have a known specification or best practice. Moreover, these are applicable in many contexts outside of identity standards:

1. Spend time researching the existing set of standards and draft standards. As the diagram shows, there are a lot of standards out there already, so understanding them is critical.
2. Don’t invent something new if you can just profile or combine already existing specifications.
3. Make sure you understand the spirit and intent of the existing specifications.
4. For those cases where an extension is required, make sure to extend the specification based on its spirit and intent.
5. Ask the community for clarity regarding any existing specification or draft.
6. Contribute back to the community via blog posts, best practice documents, or a new specification.

As we learned during the consolidation of our Yahoo and AOL identity platforms, and as demonstrated in our examples, there is no need to resort to proprietary solutions for use cases that at first glance do not appear to have a standards-based solution. Instead, it’s much better to follow these principles, avoid the NIH (not-invented-here) syndrome, and invest the time to build solutions on standards.
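To make the device_sso flow a little more concrete, here is a rough, hedged sketch of what the second app’s token request could look like, using the Token Exchange grant type from the draft. The endpoint URL, the client_id, and the exact way the profile maps device_secret onto the request (modeled here as the actor token, with a made-up token-type URI) are assumptions for illustration only; they are not published details of the Oath profile.

    import requests

    TOKEN_ENDPOINT = "https://id.example.com/oauth2/token"  # hypothetical endpoint

    def exchange_for_second_app_tokens(id_token, device_secret, client_id):
        """Ask the consolidated IDP for tokens for a second app on the same device."""
        data = {
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            # The signed-in user, as captured in the keychain by the first app.
            "subject_token": id_token,
            "subject_token_type": "urn:ietf:params:oauth:token-type:id_token",
            # The per-device secret returned when the device_sso scope was granted.
            # Treating it as the actor token (and this token-type URI) is an assumption.
            "actor_token": device_secret,
            "actor_token_type": "urn:example:params:oauth:token-type:device-secret",
            "requested_token_type": "urn:ietf:params:oauth:token-type:refresh_token",
            "client_id": client_id,
        }
        response = requests.post(TOKEN_ENDPOINT, data=data, timeout=10)
        response.raise_for_status()
        return response.json()  # e.g. access_token, refresh_token, id_token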

Innovating on Authentication Standards

June 25, 2018
June 19, 2018
June 19, 2018
Share

Manage multiple pipelines from a single external `screwdriver.yaml`!

While we, the Screwdriver team, believe that every pipeline should have its own unique screwdriver.yaml, we acknowledge that in some cases that’s just not convenient enough. For instance, if you have many, many pipelines that share an identical configuration, onboarding them one by one can get tedious really fast. Or perhaps you’re simply not allowed to add a screwdriver.yaml to a repository because the YAML contains juicy proprietary code. To address this issue, we are introducing a new feature known as external configuration. Just like it sounds, you can now manage many pipelines using a single screwdriver.yaml from a different repository.

To use this feature, configure the screwdriver.yaml with a list of checkout URLs. Example:

    childPipelines:
      scmUrls:
        # List checkout URLs here
        - git@github.com:org/repo.git#branch
        - https://github.com:org/repo2.git#branch

    # Pipeline 'org/repo.git' and 'org/repo2.git' will inherit these jobs
    jobs:
      main:
        image: node:8
        steps:
          - hello: echo hello
        requires: [~pr, ~commit]

Parent and Child Pipelines

With the addition of external configuration, it became necessary for us to distinguish between different types of pipelines. A pipeline whose screwdriver.yaml contains a childPipelines object is known as a parent pipeline. Screwdriver automatically creates/deletes child pipelines according to the scmUrls specified in the screwdriver.yaml. Child pipelines inherit their parent pipeline’s screwdriver.yaml, including all jobs, settings, and secrets.

Caveats

- A child pipeline can only be created by adding its checkout URL to its parent pipeline’s screwdriver.yaml.
- A child pipeline can only be deleted by removing its checkout URL from its parent pipeline’s screwdriver.yaml.
- A child pipeline’s checkout URL cannot be changed. If the URL is altered in the screwdriver.yaml, the original child pipeline will be deleted and a new one will be created.
- A child pipeline cannot create its own secrets, but can instead override its parent pipeline’s secrets.
- Users must have admin access on all child pipeline repositories.

From the parent pipeline’s UI, you can see a list of all child pipelines. Additionally, you can manually start builds on all of the child pipelines. We hope you will find this feature helpful!

Compatibility List

External configuration requires the following minimum versions of Screwdriver:
- API v0.5.345
- UI v1.0.283
- Store v3.1.2
- Launcher v4.0.101

Contributors

Thanks to the following people for making this feature possible:
- Filbird
- minz1027

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
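If you are onboarding a very large number of child pipelines, you may prefer to generate the parent screwdriver.yaml rather than hand-edit it. Below is a hedged sketch using PyYAML; the repository URLs are placeholders, and generating the file this way is just one option, not something Screwdriver requires.

    import yaml

    child_repos = [
        "git@github.com:org/repo.git#branch",
        "https://github.com:org/repo2.git#branch",
        # add as many checkout URLs as you have child repositories
    ]

    config = {
        "childPipelines": {"scmUrls": child_repos},
        "jobs": {
            "main": {
                "image": "node:8",
                "requires": ["~pr", "~commit"],
                "steps": [{"hello": "echo hello"}],
            }
        },
    }

    # Write the parent configuration that all child pipelines will inherit.
    with open("screwdriver.yaml", "w") as f:
        yaml.safe_dump(config, f, default_flow_style=False, sort_keys=False)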

Manage multiple pipelines from a single external `screwdriver.yaml`!

June 19, 2018
June 5, 2018
June 5, 2018
Share

Parent-child in Vespa

Parent-child relationships let you model hierarchical relations in your data. This blog post talks about why and how we added this feature to Vespa, and how you can use it in your own applications. We’ll show some performance numbers and discuss practical considerations.

Introduction

The shortest possible background

Traditional relational databases let you perform joins between tables. Joins enable efficient normalization of data through foreign keys, which means any distinct piece of information can be stored in one place and then referred to (often transitively), rather than duplicated everywhere it might be needed. This makes relational databases an excellent fit for a great number of applications. However, if we require scalable, real-time data processing with millisecond latency, our options become more limited. To see why, and to investigate how parent-child can help us, we’ll consider a hypothetical use case.

A grand business idea

Let’s assume we’re building a disruptive startup for serving the cutest possible cat picture advertisements imaginable. Advertisers will run multiple campaigns, each with their own set of ads. Since they will (of course) pay us for this privilege, campaigns will have an associated budget which we have to manage at serving time. In particular, we don’t want to serve ads for a campaign that has spent all its money, as that would be free advertising. We must also ensure that campaign budgets are frequently updated when their ads have been served.

Our initial, relational data model might look like this:

    Advertiser:
        id: (primary key)
        company_name: string
        contact_person_email: string

    Campaign:
        id: (primary key)
        advertiser_id: (foreign key to advertiser.id)
        name: string
        budget: int

    Ad:
        id: (primary key)
        campaign_id: (foreign key to campaign.id)
        cuteness: float
        cat_picture_url: string

This data normalization lets us easily update the budgets for all ads in a single operation, which is important since we don’t want to serve ads for which there is no budget. We can also get the advertiser name for all individual ads transitively via their campaign.

Scaling our expectations

Since we’re expecting our startup to rapidly grow to a massive size, we want to make sure we can scale from day one. As the number of ad queries grows, we ideally want scaling up to be as simple as adding more server capacity. Unfortunately, scaling joins beyond a single server is a significant design and engineering challenge. As a consequence, most of the new data stores released in the past decade have been of the “NoSQL” variant (which might also be called “non-relational databases”). NoSQL’s horizontal scalability is usually achieved by requiring an application developer to explicitly de-normalize all data. This removes the need for joins altogether. For our use case, we have to store budget and advertiser name across multiple document types and instances (the duplicated fields are advertiser_company_name, campaign_budget and campaign_advertiser_company_name):

    Advertiser:
        id: (primary key)
        company_name: string
        contact_person_email: string

    Campaign:
        id: (primary key)
        advertiser_company_name: string
        name: string
        budget: int

    Ad:
        id: (primary key)
        campaign_budget: int
        campaign_advertiser_company_name: string
        cuteness: float
        cat_picture_url: string

Now we can scale horizontally for queries, but updating the budget of a campaign requires updating all its ads. This turns an otherwise O(1) operation into O(n), and we likely have to implement this update logic ourselves as part of our application. We’ll be expecting thousands of budget updates to our cat ad campaigns per second. Multiplying this by an unknown number is likely to overload our servers or lose us money. Or both at the same time.

A pragmatic middle ground

Between the two extremes of “arbitrary joins” and “no joins at all” we have parent-child relationships. These enable a subset of join functionality, but with enough restrictions that they can be implemented efficiently at scale. One core restriction is that it must be possible to represent your data relationships as a directed, acyclic graph (DAG). As it happens, this is the case with our cat picture advertisement use case; Advertiser is a parent to 0-n Campaigns, each of which in turn is a parent to 0-n Ads. Being able to represent this natively in our application would get us functionally very close to the original, relational schema. We’ll see very shortly how this can be directly mapped to Vespa’s parent-child feature support.

Parent-child support in Vespa

Creating the data model

Vespa’s fundamental data model is that of documents. Each document belongs to a particular schema and has a user-provided unique identifier. Such a schema is known as a document type and is specified in a search definition file. A document may have an arbitrary number of fields of different types. Some of these may be indexed, some may be kept in memory, all depending on the schema. A Vespa application may contain many document types. Here’s how the Vespa equivalent of the above denormalized schema could look (the duplicated fields are again advertiser_company_name, campaign_budget and campaign_advertiser_company_name):

advertiser.sd:

    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:

    search campaign {
        document campaign {
            field advertiser_company_name type string {
                indexing: attribute | summary
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
    }

ad.sd:

    search ad {
        document ad {
            field campaign_budget type int {
                indexing: attribute | summary
                attribute: fast-search
            }
            field campaign_advertiser_company_name type string {
                indexing: attribute | summary
            }
            field cuteness type float {
                indexing: attribute | summary
                attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
    }

Note that since all documents in Vespa must already have a unique ID, we do not need to model the primary key IDs explicitly. We’ll now see how little it takes to change this to its normalized equivalent by using parent-child.

Parent-child support adds two new types of declared fields to Vespa: references and imported fields. A reference field contains the unique identifier of a parent document of a given document type. It is analogous to a foreign key in a relational database, or a pointer in Java/C++. A document may contain many reference fields, with each potentially referencing entirely different documents. We want each ad to reference its parent campaign, so we add the following to ad.sd:

    field campaign_ref type reference<campaign> {
        indexing: attribute
    }

We also add a reference from a campaign to its advertiser in campaign.sd:

    field advertiser_ref type reference<advertiser> {
        indexing: attribute
    }

Since a reference just points to a particular document, it cannot be directly used in queries. Instead, imported fields are used to access a particular field within a referenced document. Imported fields are virtual; they do not take up any space in the document itself and they cannot be directly written to by put or update operations.

Add to search campaign in campaign.sd:

    import field advertiser_ref.company_name as campaign_company_name {}

Add to search ad in ad.sd:

    import field campaign_ref.budget as ad_campaign_budget {}

You can import a parent field which itself is an imported field. This enables transitive field lookups. Add to search ad in ad.sd:

    import field campaign_ref.campaign_company_name as ad_campaign_company_name {}

After removing the now redundant fields, our normalized schema looks like this:

advertiser.sd:

    search advertiser {
        document advertiser {
            field company_name type string {
                indexing: attribute | summary
            }
            field contact_person_email type string {
                indexing: summary
            }
        }
    }

campaign.sd:

    search campaign {
        document campaign {
            field advertiser_ref type reference<advertiser> {
                indexing: attribute
            }
            field name type string {
                indexing: attribute | summary
            }
            field budget type int {
                indexing: attribute | summary
            }
        }
        import field advertiser_ref.company_name as campaign_company_name {}
    }

ad.sd:

    search ad {
        document ad {
            field campaign_ref type reference<campaign> {
                indexing: attribute
            }
            field cuteness type float {
                indexing: attribute | summary
                attribute: fast-search
            }
            field cat_picture_url type string {
                indexing: attribute | summary
            }
        }
        import field campaign_ref.budget as ad_campaign_budget {}
        import field campaign_ref.campaign_company_name as ad_campaign_company_name {}
    }

Feeding with references

When feeding documents to Vespa, references are assigned like any other string field:

    [
        {
            "put": "id:test:advertiser::acme",
            "fields": {
                "company_name": "ACME Inc. cats and rocket equipment",
                "contact_person_email": "wile-e@example.com"
            }
        },
        {
            "put": "id:acme:campaign::catnip",
            "fields": {
                "advertiser_ref": "id:test:advertiser::acme",
                "name": "Most excellent catnip deals",
                "budget": 500
            }
        },
        {
            "put": "id:acme:ad::1",
            "fields": {
                "campaign_ref": "id:acme:campaign::catnip",
                "cuteness": 100.0,
                "cat_picture_url": "/acme/super_cute.jpg"
            }
        }
    ]

We can efficiently update the budget of a single campaign, immediately affecting all its child ads:

    [
        {
            "update": "id:acme:campaign::catnip",
            "fields": {
                "budget": {
                    "assign": 450
                }
            }
        }
    ]

Querying using imported fields

You can use imported fields in queries as if they were a regular field. Here are some examples using YQL.

Find all ads that still have a budget left in their campaign:

    select * from ad where ad_campaign_budget > 0;

Find all ads that have less than $500 left in their budget and belong to an advertiser whose company name contains the word “ACME”:

    select * from ad where ad_campaign_budget < 500 and ad_campaign_company_name contains "ACME";

Note that imported fields are not part of the default document summary, so you must add them explicitly to a separate summary if you want their values returned as part of a query result:

    document-summary my_ad_summary {
        summary ad_campaign_budget type int {}
        summary ad_campaign_company_name type string {}
        summary cuteness type float {}
        summary cat_picture_url type string {}
    }

Add summary=my_ad_summary as a query HTTP request parameter to use it.

Global documents

One of the primary reasons why distributed, generalized joins are so hard to do well and efficiently is that performing a join on node A might require looking at data that is only found on node B (or node C, or D…). Vespa gets around this problem by requiring that all documents that may be joined against are always present on every single node. This is achieved by marking parent documents as global in the services.xml declaration. Global documents are automatically distributed to all nodes in the cluster. In our use case, both advertisers and campaigns are used as parents and are therefore marked as global. You cannot deploy an application containing reference fields pointing to non-global document types; Vespa verifies this at deployment time.

Performance

Feeding of campaign budget updates

Scenario: feed 2 million ad documents 4 times to a content cluster with one node, each time with a different ratio between ads and parent campaigns. Treat 1:1 as baseline (i.e. 2 million ads, 2 million campaigns). Measure relative speedup as the ratio of how many fewer campaigns must be fed to update the budget in all ads.

Results:
- 1 ad per campaign: 35000 campaign puts/second
- 10 ads per campaign: 29000 campaign puts/second, 8.2x relative speedup
- 100 ads per campaign: 19000 campaign puts/second, 54x relative speedup
- 1000 ads per campaign: 6200 campaign puts/second, 177x relative speedup

Note that there is some cost associated with higher fan-outs due to internal management of parent-child mappings, so the speedup is not linear with the fan-out.

Searching on ads based on campaign budgets

Scenario: we want to search for all ads having a specific budget value. First measure with all ad budgets denormalized, then using an imported budget field from the ads’ referenced campaign documents. As with the feeding benchmark, we’ll use 1, 10, 100 and 1000 ads per campaign with a total of 2 million ads combined across all campaigns. Measure average latency over 5 runs. In each case, the budget attribute is declared as fast-search, which means it has a B-tree index. This allows for efficient value and range searches.

Results:
- 1 ad per campaign: denormalized 0.742 ms, imported 0.818 ms, 10.2% slowdown
- 10 ads per campaign: denormalized 0.986 ms, imported 1.186 ms, 20.2% slowdown
- 100 ads per campaign: denormalized 0.830 ms, imported 0.958 ms, 15.4% slowdown
- 1000 ads per campaign: denormalized 0.936 ms, imported 0.922 ms, 1.5% speedup

The observed speedup for the biggest fan-out is likely an artifact of measurement noise. We can see that although there is generally some cost associated with the extra indirection, it is dwarfed by the speedup we get at feeding time.

Practical concerns

Although a powerful feature, parent-child does not make sense for every use case. Prefer to use parent-child if the relationships between your data items can be naturally represented with such a hierarchy. The 3-level ad → campaign → advertiser example we’ve covered is such a use case. Parent-child is limited to DAG relations and therefore can’t be used to model an arbitrary graph.

Parent-child in Vespa is currently only useful when searching in child documents. Queries can follow references from children to parents, but can’t go from parents to children. This is due to how Vespa maintains its internal reference mappings.

You CAN search for:
- “All campaigns with advertiser name X” (campaign → advertiser)
- “All ads with a campaign whose budget is greater than X” (ad → campaign)
- “All ads with advertiser name X” (ad → campaign → advertiser, via transitive import)

You CAN’T search for:
- “All advertisers with campaigns that have a budget greater than X” (campaign ← advertiser)
- “All campaigns that have more than N ads” (ad ← campaign)

Parent-child references do not enforce referential integrity constraints. You can feed a child document containing a reference to a parent document that does not exist. Note that you can feed the missing parent document later; Vespa will automatically resolve references from existing child documents.

A lot of work has gone into minimizing the performance impact of using imported fields, but there is still some performance penalty introduced by the parent-child indirection. This means that using a denormalized data model may still be faster at search time, while a normalized parent-child model will generally be faster to feed. You must determine what you expect to be the bottleneck in your application and perform benchmarks for your particular use case.

There is a fixed per-document memory cost associated with maintaining the internal parent-child mappings. Fields that are imported from a parent must be declared as attribute in the parent document type. As mentioned in the Global documents section, all parent documents must be present on all nodes. This is one of the biggest caveats with the parent-child feature: all nodes must have sufficient capacity for all parents. A core assumption we have made for this feature is that the number of parent documents is much lower than the number of child documents. At least an order of magnitude fewer documents per parent level is a reasonable heuristic.

Comparisons with other systems

ElasticSearch

ElasticSearch also offers native support for parent-child in its data and query model. There are some distinct differences:
- In ElasticSearch it’s the user’s responsibility to ensure child documents are explicitly placed on the same shard as their parents (source). This trades off ease of use against not requiring all parents on all nodes.
- Changing a parent reference in ElasticSearch requires a manual delete of the child before it can be reinserted in the new parent shard (source). Parent references in Vespa can be changed with ordinary field updates at any point in time.
- In ElasticSearch, referencing fields in parents is done explicitly with “has_parent” query operators (source), while Vespa abstracts this away as regular field accesses.
- ElasticSearch has a “has_child” query operator which allows for querying parents based on properties of their children (source). Vespa does not currently support this.
- ElasticSearch reports query slowdowns of 500-1000% when using parent-child (source), while the expected overhead when using parent-child attribute fields in Vespa is on the order of 10-20%.
- ElasticSearch uses a notion of a “global ordinals” index which must be rebuilt upon changes to the parent set. This may take several seconds and introduce latency spikes (source). All parent-child reference management in Vespa is fully real-time with no additional rebuild costs at feeding or serving time.

Distributed SQL stores

In recent years there has been a lot of promising development on the distributed SQL database (“NewSQL”) front. In particular, both the open-source CockroachDB and Google’s proprietary Spanner architectures offer distributed transactions and joins at scale. As these are both aimed primarily at solving OLTP use cases rather than realtime serving, we will not cover them any further here.

Summary

In this blog post we’ve looked at Vespa’s new parent-child feature and how it can be used to normalize common data models. We’ve demonstrated how introducing parents both greatly speeds up and simplifies updating information shared between many documents. We’ve also seen that doing so introduces only a minor performance impact on search queries.
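To try the feeding and querying examples above end to end, the same operations can be issued over Vespa’s HTTP document and search APIs. A minimal, hedged sketch, assuming a default single-node setup on localhost:8080 (adjust host, port and namespaces to your deployment):

    import requests

    VESPA = "http://localhost:8080"

    def put_document(namespace, doctype, docid, fields):
        url = f"{VESPA}/document/v1/{namespace}/{doctype}/docid/{docid}"
        response = requests.post(url, json={"fields": fields}, timeout=10)
        response.raise_for_status()

    # Feed the advertiser, campaign and ad documents from the example above.
    put_document("test", "advertiser", "acme",
                 {"company_name": "ACME Inc. cats and rocket equipment",
                  "contact_person_email": "wile-e@example.com"})
    put_document("acme", "campaign", "catnip",
                 {"advertiser_ref": "id:test:advertiser::acme",
                  "name": "Most excellent catnip deals",
                  "budget": 500})
    put_document("acme", "ad", "1",
                 {"campaign_ref": "id:acme:campaign::catnip",
                  "cuteness": 100.0,
                  "cat_picture_url": "/acme/super_cute.jpg"})

    # Query ads that still have campaign budget left, via the imported field,
    # requesting the my_ad_summary document summary defined above.
    result = requests.get(f"{VESPA}/search/",
                          params={"yql": "select * from ad where ad_campaign_budget > 0;",
                                  "summary": "my_ad_summary"},
                          timeout=10).json()
    print(result["root"].get("children", []))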
Have an exciting use case for parent-child that you’re working on? Got any questions? Let us know! vespa.ai Vespa Engine on GitHub Vespa Engine on Gitter

Parent-child in Vespa

June 5, 2018
May 18, 2018
May 18, 2018
marcelatoath
Share

A Peek Behind the Mail Curtain

USE IMAP TO ACCESS SOME UNIQUE FEATURES By Libby Lin, Principal Product Manager Well, we actually won’t show you how we create the magic in our big OATH consumer mail factory. But nevertheless we wanted to share how interested developers could leverage some of our unique features we offer for our Yahoo and AOL Mail customers. To drive experiences like our travel and shopping smart views or message threading, we tag qualified mails with something we call DECOS and THREADID. While we will not indulge in explaining how exactly we use them internally, we wanted to share how they can be used and accessed through IMAP. So let’s just look at a sample IMAP command chain. We’ll just assume that you are familiar with the IMAP protocol at this point and you know how to properly talk to an IMAP server. So here’s how you would retrieve DECO and THREADIDs for specific messages: 1. CONNECT    openssl s_client -crlf -connect imap.mail.yahoo.com:993 2. LOGIN    a login username password    a OK LOGIN completed 3. LIST FOLDERS    a list “” “*”    * LIST (\Junk \HasNoChildren) “/” “Bulk Mail”    * LIST (\Archive \HasNoChildren) “/” “Archive”    * LIST (\Drafts \HasNoChildren) “/” “Draft”    * LIST (\HasNoChildren) “/” “Inbox”    * LIST (\HasNoChildren) “/” “Notes”    * LIST (\Sent \HasNoChildren) “/” “Sent”    * LIST (\Trash \HasChildren) “/” “Trash”    * LIST (\HasNoChildren) “/” “Trash/l2”    * LIST (\HasChildren) “/” “test level 1”    * LIST (\HasNoChildren) “/” “test level 1/nestedfolder”    * LIST (\HasNoChildren) “/” “test level 1/test level 2”    * LIST (\HasNoChildren) “/” “&T2BZfXso-”    * LIST (\HasNoChildren) “/” “&gQKAqk7WWr12hA-”    a OK LIST completed 4.SELECT FOLDER    a select inbox    * 94 EXISTS    * 0 RECENT    * OK [UIDVALIDITY 1453335194] UIDs valid    * OK [UIDNEXT 40213] Predicted next UID    * FLAGS (\Answered \Deleted \Draft \Flagged \Seen $Forwarded $Junk $NotJunk)    * OK [PERMANENTFLAGS (\Answered \Deleted \Draft \Flagged \Seen $Forwarded $Junk $NotJunk)] Permanent flags    * OK [HIGHESTMODSEQ 205]    a OK [READ-WRITE] SELECT completed; now in selected state 5. SEARCH FOR UID    a uid search 1:*    * SEARCH 1 2 3 4 11 12 14 23 24 75 76 77 78 114 120 121 124 128 129 130 132 133 134 135 136 137 138 40139 40140 40141 40142 40143 40144 40145 40146 40147 40148     40149 40150 40151 40152 40153 40154 40155 40156 40157 40158 40159 40160 40161 40162 40163 40164 40165 40166 40167 40168 40172 40173 40174 40175 40176     40177 40178 40179 40182 40183 40184 40185 40186 40187 40188 40190 40191 40192 40193 40194 40195 40196 40197 40198 40199 40200 40201 40202 40203 40204     40205 40206 40207 40208 40209 40211 40212    a OK UID SEARCH completed 6. FETCH DECOS BASED ON UID    a uid fetch 40212 (X-MSG-DECOS X-MSG-ID X-MSG-THREADID)    * 94 FETCH (UID 40212 X-MSG-THREADID “108” X-MSG-ID “ACfIowseFt7xWtj0og0L2G0T1wM” X-MSG-DECOS (“FTI” “F1” “EML”))    a OK UID FETCH completed
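If you would rather script the same steps, Python’s standard imaplib module can issue identical commands. A hedged sketch follows; the credentials are placeholders, and the X-MSG-* items are the Yahoo/AOL-specific attributes shown above:

    import imaplib

    imap = imaplib.IMAP4_SSL("imap.mail.yahoo.com", 993)  # step 1: CONNECT
    imap.login("username", "password")                    # step 2: LOGIN
    imap.select("Inbox")                                  # step 4: SELECT FOLDER

    # Step 5: SEARCH FOR UID - list all UIDs in the folder.
    status, data = imap.uid("SEARCH", None, "ALL")
    uids = data[0].split()

    # Step 6: FETCH DECOS BASED ON UID - ask for the custom attributes
    # of the newest message.
    status, data = imap.uid("FETCH", uids[-1], "(X-MSG-DECOS X-MSG-ID X-MSG-THREADID)")
    print(data)

    imap.logout()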

A Peek Behind the Mail Curtain

May 18, 2018
May 7, 2018
May 7, 2018
Share

Scaling TensorFlow model evaluation with Vespa

In this blog post we’ll explain how to use Vespa to evaluate TensorFlow models over arbitrarily many data points while keeping total latency constant. We provide benchmark data from our performance lab where we compare evaluation using TensorFlow serving with evaluating TensorFlow models in Vespa.

We recently introduced a new feature that enables direct import of TensorFlow models into Vespa for use at serving time. As mentioned in a previous blog post, our approach to support TensorFlow is to extract the computational graph and parameters of the TensorFlow model and convert it to Vespa’s tensor primitives. We chose this approach over attempting to integrate our backend with the TensorFlow runtime. There were a few reasons for this. One was that we would like to support other frameworks than TensorFlow; for instance, our next target is to support ONNX. Another was that we would like to avoid the inevitable overhead of such an integration, both on performance and code maintenance. Of course, this means a lot of optimization work on our side to make this as efficient as possible, but we do believe it is a better long term solution.

Naturally, we thought it would be interesting to set up some sort of performance comparison between Vespa and TensorFlow for cases that use a machine learning ranking model. Before we get to that however, it is worth noting that Vespa and TensorFlow serving have an important conceptual difference. With TensorFlow you are typically interested in evaluating a model for a single data point, be that an image for an image classifier, or a sentence for a semantic representation etc. The use case for Vespa is when you need to evaluate the model over many data points. Examples are finding the best document given a text, or images similar to a given image, or computing a stream of recommendations for a user.

So, let’s explore this by setting up a typical search application in Vespa. We’ve based the application in this post on the Vespa blog recommendation tutorial part 3. In this application we’ve trained a collaborative filtering model which computes an interest vector for each existing user (which we refer to as the user profile) and a content vector for each blog post. In collaborative filtering these vectors are commonly referred to as latent factors. The application takes a user id as the query, retrieves the corresponding user profile, and searches for the blog posts that best match the user profile. The match is computed by a simple dot-product between the latent factor vectors. This is used as the first phase ranking. We’ve chosen vectors of length 128.

In addition, we’ve trained a neural network in TensorFlow to serve as the second-phase ranking. The user vector and blog post vector are concatenated and represent the input (of size 256) to the neural network. The network is fully connected with 2 hidden layers of size 512 and 128 respectively, and the network has a single output value representing the probability that the user would like the blog post.

In the following we set up two cases we would like to compare. The first is where the imported neural network is evaluated on the content node using Vespa’s native tensors. In the other we run TensorFlow directly on the stateless container node in the Vespa 2-tier architecture. In this case, the additional data required to evaluate the TensorFlow model must be passed back from the content node(s) to the container node. We use Vespa’s fbench utility to stress the system under fairly heavy load.

In this first test, we set up the system on a single host. This means the container and content nodes are running on the same host. We set up fbench so it uses 64 clients in parallel to query this system as fast as possible. 1000 documents per query are evaluated in the first phase and the top 200 documents are evaluated in the second phase. In the following, latency is measured in ms at the 95th percentile and QPS is the actual query rate in queries per second:

- Baseline: 19.68 ms / 3251.80 QPS
- Baseline with additional data: 24.20 ms / 2644.74 QPS
- Vespa ranking: 42.8 ms / 1495.02 QPS
- TensorFlow batch ranking: 42.67 ms / 1499.80 QPS
- TensorFlow single ranking: 103.23 ms / 619.97 QPS

Some explanation is in order. The baseline here is the first phase ranking only, without returning the additional data required for full ranking. The baseline with additional data is the same but returns the data required for ranking. Vespa ranking evaluates the model on the content backend. Both TensorFlow tests evaluate the model after content has been sent to the container. The difference is that batch ranking evaluates the model in one pass by batching the 200 documents together in a larger matrix, while single ranking evaluates the model once per document, i.e. 200 evaluations. The reason we test this is that Vespa evaluates the model once per document to be able to evaluate during matching, so in terms of efficiency this is a fairer comparison.

We see in the numbers above for this application that Vespa ranking and TensorFlow batch ranking achieve similar performance. This means that the gains from ranking batch-wise are offset by the cost of transferring data to TensorFlow. This isn’t an entirely fair comparison however, as the model evaluation architectures of Vespa and TensorFlow differ significantly. For instance, we measure that TensorFlow has a much lower rate of cache misses. One reason is that batch ranking necessitates a more contiguous data layout. In contrast, relevant document data can be spread out over the entire available memory on the Vespa content nodes. Another significant reason is that Vespa currently uses double floating point precision in ranking and in tensors, while in the above TensorFlow model we have used floats, resulting in half the required memory bandwidth. We are considering making the floating point precision in Vespa configurable to improve evaluation speed for cases where full precision is not necessary, such as in most machine learned models. So we still have some work to do in optimizing our tensor evaluation pipeline, but we are pleased with our results so far.

Now, the performance of the model evaluation itself is only a part of the system-wide performance. In order to rank with TensorFlow, we need to move data to the host running TensorFlow. This is not free, so let’s delve a bit deeper into this cost.

The locality of data in relation to where the ranking computation takes place is an important aspect, and indeed a core design point of Vespa. If your data is too large to fit on a single machine, or you want to evaluate your model on more data points faster than is possible on a single machine, you need to split your data over multiple nodes. Let’s assume that documents are distributed randomly across all content nodes, which is a very reasonable thing to do. Now, when you need to find the globally top-N documents for a given query, you first need to find the set of candidate documents that match the query.

In general, if ranking is done on some other node than where the content is, all the data required for the computation obviously needs to be transferred there. Usually, the candidate set can be large, so this incurs a significant cost in network activity, particularly as the number of content nodes increases. This approach can become infeasible quite quickly. This is why a core design aspect of Vespa is to evaluate models where the content is stored. This is illustrated in the figure above.

The problem of transferring data for ranking is compounded as the number of content nodes increases, because to find the global top-N documents, the top-K documents of each content node need to be passed to the external ranker. This means that, if we have C content nodes, we need to transfer C*K documents over the network. This runs into hard network limits as the number of documents and the data size for each document increase.

Let’s see the effect of this when we change the setup of the same application to run on three content nodes and a single stateless container which runs TensorFlow. In the following graph we plot the 95th percentile latency as we increase the number of parallel requests (clients) from 1 to 30. Here we see that with low traffic, TensorFlow and Vespa are comparable in terms of latency. When we increase the load however, the cost of transmitting the data is the driver for the increase in latency for TensorFlow, as seen in the red line in the graph. The differences between batch and single mode TensorFlow evaluation become smaller as the system as a whole becomes largely network-bound. In contrast, the Vespa application scales much better.

Now, as we increase traffic even further, will the Vespa solution likewise become network-bound? In the following graph we plot the sustained requests per second as we increase clients to 200. Vespa ranking is unable to sustain the same amount of QPS as just transmitting the data (the blue line), which is a hint that the system has become CPU-bound on the evaluation of the model on Vespa. While Vespa can sustain around 3500 QPS, the TensorFlow solution maxes out at 350 QPS, which is reached quite early as we increase traffic. As the system is unable to transmit data fast enough, the latency naturally has to increase, which is the cause of the linearity in the latency graph above. At 200 clients the average latency of the TensorFlow solution is around 600 ms, while Vespa is around 60 ms.

So, the obvious key takeaway here is that from a scalability point of view it is beneficial to avoid sending data around for evaluation. That is a key design point of Vespa, and also why we implemented TensorFlow support in the first place. Running the models where the content is stored allows for better utilization of resources, but perhaps the more interesting aspect is the ability to run more complex or deeper models while still being able to scale the system.
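For reference, the second-phase network described above (a 256-dimensional concatenated input, hidden layers of 512 and 128 units, and a single probability output) could be sketched in TensorFlow roughly like this. The activation functions, loss and optimizer are assumptions, not details given in the post:

    import tensorflow as tf

    def build_second_phase_model() -> tf.keras.Model:
        # 128-dim user vector concatenated with 128-dim blog post vector.
        user_and_post = tf.keras.Input(shape=(256,), name="user_and_post")
        x = tf.keras.layers.Dense(512, activation="relu")(user_and_post)
        x = tf.keras.layers.Dense(128, activation="relu")(x)
        # Single output: probability that the user would like the blog post.
        prob = tf.keras.layers.Dense(1, activation="sigmoid", name="prob")(x)
        model = tf.keras.Model(inputs=user_and_post, outputs=prob)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model

    model = build_second_phase_model()
    model.summary()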

Scaling TensorFlow model evaluation with Vespa

May 7, 2018
April 18, 2018
April 18, 2018
mikesefanov
Share

Achieving Major Stability and Performance Improvements in Yahoo Mail with a Novel Redux Architecture

yahoodevelopers: By Mohit Goenka, Gnanavel Shanmugam, and Lance Welsh

At Yahoo Mail, we’re constantly striving to upgrade our product experience. We do this not only by adding new features based on our members’ feedback, but also by providing the best technical solutions to power the most engaging experiences. As such, we’ve recently introduced a number of novel and unique revisions to the way in which we use Redux that have resulted in significant stability and performance improvements. Developers may find our methods useful in achieving similar results in their apps.

Improvements to product metrics

Last year Yahoo Mail implemented a brand new architecture using Redux. Since then, we have transformed the overall architecture to reduce latencies in various operations, reduce JavaScript exceptions, and better synchronize state. As a result, the product is much faster and more stable.

Stability improvements:
- when checking for new emails – 20%
- when reading emails – 30%
- when sending emails – 20%

Performance improvements:
- 10% improvement in page load performance
- 40% improvement in frame rendering time

We have also reduced API calls by approximately 20%.

How we use Redux in Yahoo Mail

Redux architecture is reliant on one large store that represents the application state. In a Redux cycle, action creators dispatch actions to change the state of the store. React components then respond to those state changes. We’ve made some modifications on top of this architecture that are atypical in the React-Redux community.

For instance, when fetching data over the network, the traditional methodology is to use Thunk middleware. Yahoo Mail fetches data over the network from our API. Thunks would create an unnecessary and undesirable dependency between the action creators and our API: if and when the API changes, the action creators must then also change. To keep these concerns separate we dispatch the action payload from the action creator and store it in the Redux state for later processing by “action syncers”. Action syncers use the payload information from the store to make requests to the API and process responses. In other words, the action syncers form an API layer by interacting with the store. An additional benefit of keeping the concerns separate is that the API layer can change as the backend changes, thereby preventing such changes from bubbling back up into the action creators and components. This also allowed us to optimize the API calls by batching, deduping, and processing the requests only when the network is available. We applied similar strategies for handling other side effects like route handling and instrumentation. Overall, action syncers helped us to reduce our API calls by ~20% and bring down API errors by 20-30%.

Another change to the normal Redux architecture was made to avoid unnecessary props. The React-Redux community has learned to avoid passing unnecessary props from high-level components through multiple layers down to lower-level components (prop drilling) for rendering. We have introduced action enhancers middleware to avoid passing additional unnecessary props that are purely used when dispatching actions. Action enhancers add data to the action payload so that data does not have to come from the component when dispatching the action. This avoids the component having to receive that data through props and has improved frame rendering by ~40%.
The use of action enhancers also avoids writing utility functions to add commonly-used data to each action from action creators. In our new architecture, the store reducers accept the dispatched action via action enhancers to update the state. The store then updates the UI, completing the action cycle. Action syncers then initiate the call to the backend APIs to synchronize local changes. Conclusion Our novel use of Redux in Yahoo Mail has led to significant user-facing benefits through a more performant application. It has also reduced development cycles for new features due to its simplified architecture. We’re excited to share our work with the community and would love to hear from anyone interested in learning more.
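The Yahoo Mail code itself is JavaScript and the post does not include source, so purely as a language-agnostic sketch of the action-enhancer idea (here in Python, with invented names and fields): a wrapper around dispatch merges commonly needed data into every action’s payload, so components no longer have to pass that data down as props.

    import time

    def make_action_enhancer(common_data_provider):
        """Middleware-style wrapper: enrich every action before it reaches the reducers."""
        def enhance(next_dispatch):
            def dispatch(action):
                enriched = dict(action)
                enriched["payload"] = {**action.get("payload", {}), **common_data_provider()}
                return next_dispatch(enriched)
            return dispatch
        return enhance

    def base_dispatch(action):
        print("dispatching", action)

    # Data the components no longer need to supply via props (illustrative only).
    enhancer = make_action_enhancer(lambda: {"timestamp": time.time(), "session_id": "abc123"})
    dispatch = enhancer(base_dispatch)

    dispatch({"type": "MESSAGE_OPEN", "payload": {"message_id": "42"}})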

Achieving Major Stability and Performance Improvements in Yahoo Mail with a Novel Redux Architecture

April 18, 2018
March 20, 2018
March 20, 2018
marcelatoath
Share

Secure Images

oath-postmaster: By Marcel Becker The mail team at OATH is busy  integrating  Yahoo and AOL technology to deliver an even better experience across all our consumer mail products. While privacy and security are top priority for us, we also want to improve the experience and remove unnecessary clutter across all of our products. Starting this week we will be serving images in mails via our own secure proxy servers. This will not only increase speed and security in our own mail products and reduce the risk of phishing and other scams,  but it will also mean that our users don’t have to fiddle around with those “enable images” settings. Messages and inline images will now just show up as originally intended. We are aware that commercial mail senders are relying on images (so-called pixels) to track delivery and open rates. Our proxy solution will continue to support most of these cases and ensure that true mail opens are recorded. For senders serving dynamic content based on the recipient’s location (leveraging standard IP-based browser and app capabilities) we recommend falling back on other tools and technologies which do not rely on IP-based targeting. All of our consumer mail applications (Yahoo and AOL) will benefit from this change. This includes our desktop products as well as our mobile applications across iOS and Android. If you have any feedback or want to discuss those changes with us personally, just send us a note to mail-questions@oath.com.

Secure Images

March 20, 2018
March 14, 2018
March 14, 2018
Share

Introducing TensorFlow support

In previous blog posts we have talked about Vespa’s tensor API which enables some advanced ranking capabilities. The primary use case is machine learned ranking, where you train your models using some machine learning framework, convert the models to Vespa’s tensor format, and deploy them to Vespa. This works well, but converting trained models to Vespa form is cumbersome.

We are now happy to announce a new feature that makes this process a lot easier: TensorFlow import. With this feature you can directly deploy models you’ve trained in TensorFlow to Vespa, and use these models during ranking. This means that the models are executed in parallel over multiple threads and machines for a single query, which makes it possible to evaluate the model over any number of data items and still bound the total response time. In addition, the data items to evaluate with the TensorFlow model can be selected dynamically with a query, and with a cheaper first-phase rank function if needed. Since the TensorFlow models are evaluated on the nodes storing the data, we avoid sending any data over the wire for evaluation.

In this post we’d like to introduce this new feature by discussing how it works, some assumptions behind working with TensorFlow and Vespa, and how to use the feature.

Vespa is optimized to evaluate models repeatedly over many data items (documents). To do this efficiently, we do not evaluate the model using the TensorFlow inference engine. TensorFlow adds a non-trivial amount of overhead and instrumentation which it uses to manage potentially large scale computations. This is significant in our case, since we need to evaluate models on a micro-second scale. Hence our approach is to extract the parameters (weights) into Vespa tensors, and use the model specification in the TensorFlow graph to generate efficient Vespa tensor expressions.

Importing TensorFlow models is as simple as saving the TensorFlow model using the SavedModel API, adding those files to the Vespa application package, and referencing the model using the new TensorFlow ranking feature. For instance, if your files are in models/my_model in the application package:

    first-phase {
        expression: sum(tensorflow("my_model/saved"))
    }

The above expression runs the model and sums it to a single scalar value to use in ranking. One thing you will have to provide is the input(s), or feed, to the graph. Vespa expects you to provide a macro with the same name as the input placeholder. In the macro you can specify where the input should come from, be it a parameter sent along with the query, a document field (possibly in a parent document) or a constant.

As mentioned, Vespa evaluates the imported models once per document. Depending on the requirements of the application, this can impose some natural limitations on the size and complexity of the models that can be evaluated. However, Vespa has a number of other search and rank features that can be used to reduce the search space before running the machine learned models. Typically, one would use the search and first ranking phases to select a relatively small number of candidate documents, which are then given their final rank score in the more computationally expensive second-phase model evaluation.

Also note that TensorFlow import is new to Vespa, and we currently only support a subset of the TensorFlow operations. While the supported operations should suffice for many relevant use cases, there are some that are not supported yet due to potentially being too expensive to evaluate per document. For instance, convolutional networks and recurrent networks (LSTMs etc) are not supported. We are continually working to add functionality; if you find that we have some glaring omissions, please let us know.

Going forward we are focusing on further improving the performance of our tensor framework for important use cases. We’ll follow up this post with one showing how the performance of evaluation in Vespa compares with TensorFlow serving. We will also add more supported frameworks; our next target is ONNX.

You can read more about this feature in the ranking with TensorFlow models section of the Vespa documentation. We are excited to announce the TensorFlow support, and we’re eager to hear what you are building with it.
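As a concrete, hedged illustration of the first step, here is a minimal sketch of saving a stand-in Keras model with TensorFlow’s SavedModel API so that the resulting directory can be copied into the application package as models/my_model/saved. The model itself is a placeholder, and the exact export call differs between TensorFlow 1.x (SavedModelBuilder) and 2.x (shown here):

    import tensorflow as tf

    # Placeholder model; in practice this is whatever you trained for ranking.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])

    # Writes a SavedModel directory (saved_model.pb plus variables/) that can be
    # placed under models/my_model/saved in the Vespa application package.
    tf.saved_model.save(model, "models/my_model/saved")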

Introducing TensorFlow support

March 14, 2018
February 5, 2018
February 5, 2018
mikesefanov
Share

Success at Apache: A Newbie’s Narrative

yahoodevelopers: Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit By Kuhu Shukla This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series. As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1. Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine. Oath grid infrastructure is through and through driven by Apache technologies be it storage through HDFS, resource management through YARN, job execution frameworks with Tez and user interface engines such as Hive, Hue, Pig, Sqoop, Spark, Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed and maintained by the Apache community. On the third day of my job at Yahoo in 2015, I received a YouTube link on An Introduction to Apache Tez. I watched it carefully trying to keep up with all the questions I had and recognized a few names from my academic readings of Yarn ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up on a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespaces in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community nearly 80-90% of the Yahoo jobs were now running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect and thankfully contributing to Tez became a goal in my next quarter. The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,”  I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many. I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. 
I started attending the bi-weekly Tez community sync up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open part of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, to the extent of which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates and extensive knowledge transfers were made. Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me. We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousands of jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were tracked now through an umbrella JIRA TEZ-3334 and the to-do list was long. I picked a few JIRAs and as I started reading through I realized, this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration with high fives as we continued to address review comments and fixing big and small issues with this evolving feature. As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes. 
In 2018, as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. For an astronomy aficionado like me, going from a newbie Apache contributor to a newbie Apache committer has been very much like looking through my telescope: it holds endless possibilities and challenges you to be your best. About the Author: Kuhu Shukla is a software engineer at Oath and earned her Master's in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself, she continues to contribute to YARN and HDFS and spoke at the 2017 DataWorks Summit on "Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop". Prior to that she worked on Juniper Networks' router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.

Success at Apache: A Newbie’s Narrative

February 5, 2018
January 5, 2018
January 5, 2018
Share

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework. Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner. We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x. To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2: reduce(join(v1, v2, f(x, y)(x * y)), sum) The dot product is calculated by multiplying the vectors together by using the join operation, then summing the elements in the vector together using the reduce operation. The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through. In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and they share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to a implementation for weighted sets which build hash tables for efficient lookups.  This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise. We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:    macro hidden_layer() {        expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))    }    macro final_layer() {        expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)    }    first-phase {        expression: final_layer    } This represents a simple two-layer neural network.  
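To make the temporary-tensor point above concrete, here is a small numpy sketch (an illustration of the semantics only, not Vespa's C++ implementation) of the two ways reduce(join(v1, v2, f(x, y)(x * y)), sum) can be evaluated; the vector length is an arbitrary assumption:

import numpy as np

v1 = np.random.rand(1000)   # assumed dense vectors of arbitrary size
v2 = np.random.rand(1000)

# Naive evaluation: join materializes a temporary tensor of element-wise
# products, which reduce then sums into a single scalar.
temporary = v1 * v2
naive = temporary.sum()

# Optimized evaluation: recognize the pattern and compute the dot product
# directly, with no temporary tensor.
optimized = np.dot(v1, v2)

assert np.isclose(naive, optimized)

The result is identical; the difference is the allocations and memory traffic the naive path pays for on every evaluation.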
Whenever a new version of Vespa is built, a large suite of integration and performance tests are run. When we want to optimize a specific use case, we first create a performance test to set a baseline.  With the performance tests we get both historical graphs as well as detailed profiling information and performance statistics sampled from the system under load.  This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process. The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network: Here, the x-axis represent builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement. When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer. The following steps were taken to optimize this expression, with each step visible as a step in the graph above: 1. Recognize join with multiplication as part of an inner product. 2. Optimize for bias addition. 3. Optimize vector concatenation (which was part of the input to the neural network) 4. Replace appropriate sub-expressions with the dense vector-matrix product. It was particularly the final step which gave the biggest percent wise performance boost. The solution in total was to recognize the vector-matrix multiplication done in the neural network layer and replace that with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b is replaced with a reduce of the multiplicative join and additive join. This is what is recognized and performed in one step instead of three. This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.
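For reference, the arithmetic this two-layer ranking expression asks for, and which the final optimization step maps onto dense vector-matrix products, can be sketched in numpy as follows. The layer sizes are made-up placeholders, and elu/xw_plus_b simply mirror the ranking-expression operations named above:

import numpy as np

def elu(x, alpha=1.0):
    # Exponential linear unit used in the hidden layer.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def xw_plus_b(x, w, b):
    # Vector-matrix product plus bias: the sub-expression the optimizer
    # replaces with hardware-accelerated dot-product code.
    return x @ w + b

nn_input = np.random.rand(32)                      # assumed input size
W_fc1, b_fc1 = np.random.rand(32, 64), np.random.rand(64)
W_fc2, b_fc2 = np.random.rand(64, 1), np.random.rand(1)

hidden = elu(xw_plus_b(nn_input, W_fc1, b_fc1))
score = xw_plus_b(hidden, W_fc2, b_fc2)            # evaluated once per matched document

In the test above this computation runs 10 000 times per query, which is why trimming allocations and using vectorized kernels per evaluation adds up to the 20x end-to-end gain.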

Optimizing realtime evaluation of neural net models on Vespa

January 5, 2018
December 15, 2017
December 15, 2017
Share

Blog recommendation with neural network models

Introduction The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa. Neural networks is just a popular example. In addition, we will introduce the multi-phase ranking model available in Vespa that can be used to run more expensive models in a phase based on a reduced number of documents returned by previous phases. This feature allow us to run models that would be prohibitively expensive to use if we had to run them at query-time across all the documents indexed in Vespa.Model Training In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow. The neural network model In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document. More technically, our previous rank function r was given by r(u,d)=u∗d while in this tutorial it will be given by r(u,d,θ)=f(u,d,θ) where f represents the neural network model described below and θ is the neural network parameter values that we need to learn from training data. The specific form of the neural network model used here is p = sigmoid(h1×W2+b2) h1 = ReLU(x×W1+b1) where x=[u,d] is the concatenation of the user and document latent factor, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, p is the output of the model and in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2). Training data For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there is no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id. In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we will increase the dataset with N (user_id, post_id_fake) rows with label = 0. Find code to generate the dataset in the utility scripts. Training with TensorFlow With the training data in hand, we have split it into 80% training set and 20% validation set and used TensorFlow to train the model. 
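The negative-sampling step described above is straightforward to sketch in plain Python; this is an illustration of the idea rather than the tutorial's utility script, and skipping accidental positives is an extra safeguard the post does not require:

import random

def add_negative_samples(positive_rows, n_negatives, seed=42):
    # positive_rows: list of (user_id, post_id) pairs, each an implicit "like" (label 1).
    # Assumes the post pool is larger than any single user's set of likes.
    random.seed(seed)
    post_pool = sorted({post_id for _, post_id in positive_rows})
    positives = set(positive_rows)
    dataset = [(user_id, post_id, 1) for user_id, post_id in positive_rows]
    for user_id, _ in positive_rows:
        added = 0
        while added < n_negatives:
            post_id_fake = random.choice(post_pool)
            if (user_id, post_id_fake) not in positives:
                dataset.append((user_id, post_id_fake, 0))   # label 0: sampled negative
                added += 1
    return dataset

# e.g. add_negative_samples([(270, 20), (270, 35), (431, 20)], n_negatives=5)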
The script used can be found in the utility scripts and executed by $ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt The progress of your training can be visualized using Tensorboard $ tensorboard --logdir runs/*/summaries/ Model deployment in Vespa Two Phase Ranking When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number. The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions and global parameters not directly linked to query or document parameters. One example of rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor dd associated with document post_id (document attribute) and learned model parameters (global parameters not related to a specific query nor document) and returns the probability of user u to like document d. However, even though Vespa is designed to carry out such calculations optimally, complex expressions becomes expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions - a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the cpu budget by dedicating more of the total cpu towards the best candidate hits. The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node. search myapp {    …    rank-profile default inherits default {        first-phase {            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)        }        second-phase {            expression {                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +                0.3 * attributeMatch(keywords)            }            rerank-count: 200        }    } } Constant Tensor files Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format. 
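As a rough illustration of that export step, a sketch that writes one weight matrix in Vespa's cell-based tensor JSON could look like the following; the helper name is a placeholder, and the tutorial's own serializer in the utility scripts, shown next, is the reference:

import json

def write_constant_tensor(weights, dimension_names, path):
    # weights: a 2-D list/array; dimension_names: e.g. ["input", "hidden"]
    d0, d1 = dimension_names
    cells = []
    for i, row in enumerate(weights):
        for j, value in enumerate(row):
            cells.append({"address": {d0: str(i), d1: str(j)}, "value": float(value)})
    with open(path, "w") as f:
        json.dump({"cells": cells}, f)

# e.g. write_constant_tensor(W1, ["input", "hidden"], "constants/W_hidden.json")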
The complete code to serialize the model parameters using Vespa Tensor format can be found in the utility scripts but the following code snipped shows how to serialize the hidden layer weights W1: serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden']) Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, therefore dimension is 2). In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.    constant W_hidden {        file: constants/W_hidden.json        type: tensor(input[20],hidden[40])    } A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (matrix), where the first dimension is named input and has size 20 and second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder. Vespa ranking expressions In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression to be defined in the blog_post search definition. To honor a low-latency response, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 document from each search node using the second phase ranking given by the neural network model presented above. Note that we define two ranking profiles in the search definition below. This allow us to decide which ranking profile to use at query time. We defined a ranking profile named tensor which only applies the dot-product between user and document latent factors for all matching documents and a ranking profile named nn_tensor, which rerank the top 200 documents using the neural network model discussed in the previous section. We will walk through each part of the blog_post search definition, see blog_post.sd. As always, we start the a search definition with the following line search blog_post { We define the document type blog_post the same way we have done in the previous tutorial.    document blog_post {      # Field definitions      # Examples:      field date_gmt type string {          indexing: summary      }      field language type string {          indexing: summary      }      # Remaining fields as found in previous tutorial    } We define a ranking profile named tensor which rank all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which include code to retrieve the user latent factor based on the user_id sent by the query to Vespa.    # Simpler ranking profile without    # second-phase ranking    rank-profile tensor {      first-phase {          expression {              sum(query(user_item_cf) * attribute(user_item_cf))          }      }    } Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1,W2,b1,b2). See the previous section for how to write the TensorFlow model parameters to Vespa Tensor format.    
# We need to specify the type and the location    # of the files storing tensor values for each    # Variable in our TensorFlow model. In this case,    # W_hidden, b_hidden, W_final, b_final    constant W_hidden {        file: constants/W_hidden.json        type: tensor(input[20],hidden[40])    }    constant b_hidden {        file: constants/b_hidden.json        type: tensor(hidden[40])    }    constant W_final {        file: constants/W_final.json        type: tensor(hidden[40], final[1])    }    constant b_final {        file: constants/b_final.json        type: tensor(final[1])    } Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.    # rank profile with neural network model as    # second phase    rank-profile nn_tensor {        # The input to the neural network is the        # concatenation of the document and query vectors.        macro nn_input() {            expression: concat(attribute(user_item_cf), query(user_item_cf), input)        }        # Computes the hidden layer        macro hidden_layer() {            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))        }        # Computes the output layer        macro final_layer() {            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))        }        # First-phase ranking:        # Dot-product between user and document latent factors        first-phase {            expression: sum(query(user_item_cf) * attribute(user_item_cf))        }        # Second-phase ranking:        # Neural network model based on the user and latent factors        second-phase {            rerank-count: 200            expression: sum(final_layer)        }    } } Offline evaluation We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking function which contain the simpler ranking expression involving the dot-product between user and document latent factors. pig -x local -f tutorial_compute_metric.pig \  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \  -param ENDPOINT=$(hostname):8080  -param NUMBER_RECOMMENDATIONS=100  -param RANKING_NAME=tensor  -param OUTPUT=blog-job/cf-metric We perform the same query routine below, but now using the ranking-profile nn_tensor which reranks the top 200 documents using the neural network model. pig -x local -f tutorial_compute_metric.pig \  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \  -param ENDPOINT=$(hostname):8080  -param NUMBER_RECOMMENDATIONS=100  -param RANKING_NAME=nn_tensor  -param OUTPUT=blog-job/cf-metric The tutorial_compute_metric.pig script can be found in our repo. Comparing the recommendations obtained by those two ranking profiles and our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than 7% increase) and those documents retrieved appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt. 
2 which decreased from 37.1% to 34.5%.
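For readers who want to reproduce this comparison, one common way to compute the expected percentile ranking (following Hu et al. 2008, with the binary relevance used here) is sketched below; the data layout is an assumption, not the format produced by tutorial_compute_metric.pig:

def expected_percentile_ranking(recommendations, relevant):
    # recommendations: {user_id: [post_id, ...]} ranked lists returned by Vespa
    # relevant: {user_id: {post_id, ...}} posts the user actually read
    percentiles = []
    for user_id, ranked_posts in recommendations.items():
        n = len(ranked_posts)
        for position, post_id in enumerate(ranked_posts):
            if post_id in relevant.get(user_id, set()):
                # 0% = recommended first, 100% = recommended last
                percentiles.append(100.0 * position / max(n - 1, 1))
    return sum(percentiles) / len(percentiles) if percentiles else None

Lower is better; a random ordering would score around 50%.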

Blog recommendation with neural network models

December 15, 2017
December 13, 2017
December 13, 2017
mikesefanov
Share

How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study

yahoodevelopers: By Murali Krishna Bachhu, Anurag Damle, and Utkarsh Shrivastava As engineers on the Yahoo Mail team at Oath, we pride ourselves on the things that matter most to developers: faster development cycles, more reliability, and better performance. Users don’t necessarily see these elements, but they certainly feel the difference they make when significant improvements are made. Recently, we were able to upgrade all three of these areas at scale by adopting webpack® as Yahoo Mail’s underlying module bundler, and you can do the same for your web application. What is webpack? webpack is an open source module bundler for modern JavaScript applications. When webpack processes your application, it recursively builds a dependency graph that includes every module your application needs. Then it packages all of those modules into a small number of bundles, often only one, to be loaded by the browser. webpack became our choice module bundler not only because it supports on-demand loading, multiple bundle generation, and has a relatively low runtime overhead, but also because it is better suited for web platforms and NodeJS apps and has great community support. Comparison of webpack to other open source bundlers How did we integrate webpack? Like any developer does when integrating a new module bundler, we started integrating webpack into Yahoo Mail by looking at its basic config file. We explored available default webpack plugins as well as third-party webpack plugins and then picked the plugins most suitable for our application. If we didn’t find a plugin that suited a specific need, we wrote the webpack plugin ourselves (e.g., We wrote a plugin to execute Atomic CSS scripts in the latest Yahoo Mail experience in order to decrease our overall CSS payload**). During the development process for Yahoo Mail, we needed a way to make sure webpack would continuously run in the background. To make this happen, we decided to use the task runner Grunt. Not only does Grunt keep the connection to webpack alive, but it also gives us the ability to pass different parameters to the webpack config file based on the given environment. Some examples of these parameters are source map options, enabling HMR, and uglification. Before deployment to production, we wanted to optimize the javascript bundles for size to make the Yahoo Mail experience faster. webpack provides good default support for this with the UglifyJS plugin. Although the default options are conservative, they give us the ability to configure the options. Once we modified the options to our specifications, we saved approximately 10KB. Code snippet showing the configuration options for the UglifyJS plugin Faster development cycles for developers While developing a new feature, engineers ideally want to see their code changes reflected on their web app instantaneously. This allows them to maintain their train of thought and eventually results in more productivity. Before we implemented webpack, it took us around 30 seconds to 1 minute for changes to reflect on our Yahoo Mail development environment. webpack helped us reduce the wait time to 5 seconds. More reliability Consumers love a reliable product, where all the features work seamlessly every time. Before we began using webpack, we were generating javascript bundles on demand or during run-time, which meant the product was more prone to exceptions or failures while fetching the javascript bundles. 
With webpack, we now generate all the bundles during build time, which means that all the bundles are available whenever consumers access Yahoo Mail. This results in significantly fewer exceptions and failures and a better experience overall. Better Performance We were able to attain a significant reduction of payload after adopting webpack. 1. Reduction of about 75 KB gzipped Javascript payload 2. 50% reduction on server-side render time 3. 10% improvement in Yahoo Mail’s launch performance metrics, as measured by render time above the fold (e.g., Time to load contents of an email). Below are some charts that demonstrate the payload size of Yahoo Mail before and after implementing webpack. Payload before using webpack (JavaScript Size = 741.41KB) Payload after switching to webpack (JavaScript size = 669.08KB) Conclusion Shifting to webpack has resulted in significant improvements. We saw a common build process go from 30 seconds to 5 seconds, large JavaScript bundle size reductions, and a halving in server-side rendering time. In addition to these benefits, our engineers have found the community support for webpack to have been impressive as well. webpack has made the development of Yahoo Mail more efficient and enhanced the product for users. We believe you can use it to achieve similar results for your web application as well. **Optimized CSS generation with Atomizer Before we implemented webpack into the development of Yahoo Mail, we looked into how we could decrease our CSS payload. To achieve this, we developed an in-house solution for writing modular and scoped CSS in React. Our solution is similar to the Atomizer library, and our CSS is written in JavaScript like the example below: Sample snippet of CSS written with Atomizer Every React component creates its own styles.js file with required style definitions. React-Atomic-CSS converts these files into unique class definitions. Our total CSS payload after implementing our solution equaled all the unique style definitions in our code, or only 83KB (21KB gzipped). During our migration to webpack, we created a custom plugin and loader to parse these files and extract the unique style definitions from all of our CSS files. Since this process is tied to bundling, only CSS files that are part of the dependency chain are included in the final CSS.

How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study

December 13, 2017
December 1, 2017
December 1, 2017
Share

Vespa Meetup in Sunnyvale

vespaengine: WHAT: Vespa meetup with various presentations from the Vespa team. Several Vespa developers from Norway are in Sunnyvale; use this opportunity to learn more about the open big data serving engine Vespa and meet the team behind it.
WHEN: Monday, December 4th, 6:00pm - 8:00pm PDT
WHERE: Oath/Yahoo Sunnyvale Campus, Building E, Classroom 9 & 10, 700 First Avenue, Sunnyvale, CA 94089
MANDATORY REGISTRATION: https://goo.gl/forms/7kK2vlaipgsSSSH42
Agenda:
6.00 pm: Welcome & Intro
6.15 pm: Vespa tips and tricks
7.00 pm: Tensors in Vespa, intro and use cases
7.45 pm: Vespa future and roadmap
7.50 pm: Q&A
This meetup is a good arena for sharing experience, getting good tips, getting inside details on Vespa, and discussing and influencing the roadmap, and it is a great opportunity for the Vespa team to meet our users. Hope to see many of you!

Vespa Meetup in Sunnyvale

December 1, 2017
December 1, 2017
December 1, 2017
Share

Blog recommendation in Vespa

Introduction This post builds upon the previous blog search application and extends the basic search engine to include machine learned models to help us recommend blog posts to users that arrive at our application. Assume that once a user arrives, we obtain his user identification number, denoted in here by user_id, and that we will send this information down to Vespa and expect to obtain a blog post recommendation list containing 100 blog posts tailored for that specific user. Prerequisites: - Install and build files - code source and build instructions for sbt and Spark is found at Vespa Tutorial pt. 2 - Install Pig and Hadoop - Put trainPosts.json in $VESPA_SAMPLE_APPS, the directory with the clone of vespa sample apps - Put vespa-hadoop.jar in $VESPA_SAMPLE_APPS - docker as in the blog search tutorialCollaborative Filtering We will start our recommendation system by implementing the collaborative filtering algorithm for implicit feedback described in (Hu et. al. 2008). The data is said to be implicit because the users did not explicitly rate each blog post they have read. Instead, the have “liked” blog posts they have likely enjoyed (positive feedback) but did not have the chance to “dislike” blog posts they did not enjoy (absence of negative feedback). Because of that, implicit feedback is said to be inherently noisy and the fact that a user did not “like” a blog post might have many different reasons not related with his negative feelings about that blog post. In terms of modeling, a big difference between explicit and implicit feedback datasets is that the ratings for the explicit feedback are typically unknown for the majority of user-item pairs and are treated as missing values and ignored by the training algorithm. For an implicit dataset, we would assume a rating of zero in case the user has not liked a blog post. To encode the fact that a value of zero could come from different reasons we will use the concept of confidence as introduced by (Hu et. al. 2008), which causes the positive feedback to have a higher weight than a negative feedback. Once we train the collaborative filtering model, we will have one vector representing a latent factor for each user and item contained in the training set. Those vectors will later be used in the Vespa ranking framework to make recommendations to a user based on the dot product between the user and documents latent factors. An obvious problem with this approach is that new users and new documents will not have those latent factors available to them. This is what is called a cold start problem and will be addressed with content-based techniques described in future posts.Evaluation metrics The evaluation metric used by Kaggle for this challenge was the Mean Average Precision at 100 (MAP@100). However, since we do not have information about which blog posts the users did not like (that is, we have only positive feedback) and our inability to obtain user behavior to the recommendations we make (this is an offline evaluation, different from the usual A/B testing performed by companies that use recommendation systems), we offer a similar remark as the one included in (Hu et. al. 2008) and prefer recall-oriented measures. Following (Hu et. al. 2008) we will use the expected percentile ranking.Evaluation Framework Generate training and test sets In order to evaluate the gains obtained by the recommendation system when we start to improve it with more accurate algorithms, we will split the dataset we have available into training and test sets. 
The training set will contain document (blog post) and user action (likes) pairs as well as any information available about the documents contained in the training set. There is no additional information about the users besides the blog posts they have liked. The test set will be formed by a series of documents available to be recommended and a set of users to whom we need to make recommendations. This list of test set documents constitutes the Vespa content pool, which is the set of documents stored in Vespa that are available to be served to users. The user actions will be hidden from the test set and used later to evaluate the recommendations made by Vespa. To create an application that more closely resembles the challenges faced by companies when building their recommendation systems, we decided to construct the training and test sets in such a way that: - There will be blog posts that had been liked in the training set by a set of users and that had also been liked in the test set by another set of users, even though this information will be hidden in the test set. Those cases are interesting to evaluate if the exploitation (as opposed to exploration) component of the system is working well. That is, if we are able to identify high quality blog posts based on the available information during training and exploit this knowledge by recommending those high quality blog posts to another set of users that might like them as well. - There will be blog posts in the test set that had never been seen in the training set. Those cases are interesting in order to evaluate how the system deals with the cold-start problem. Systems that are too biased towards exploitation will fail to recommend new and unexplored blog posts, leading to a feedback loop that will cause the system to focus into a small share of the available content. A key challenge faced by recommender system designers is how to balance the exploitation/exploration components of their system, and our training/test set split outlined above will try to replicate this challenge in our application. Notice that this split is different from the approach taken by the Kaggle competition where the blog posts available in the test set had never been seen in the training set, which removes the exploitation component of the equation. The Spark job uses trainPosts.json and creates the folders blog-job/training_set_ids and blog-job/test_set_ids containing files with post_id and user_idpairs: $ cd blog-recommendation; export SPARK_LOCAL_IP="127.0.0.1" $ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \  --task split_set --input_file ../trainPosts.json \  --test_perc_stage1 0.05 --test_perc_stage2 0.20 --seed 123 \  --output_path blog-job/training_and_test_indices - test_perc_stage1: The percentage of the blog posts that will be located only on the test set (exploration component). - test_perc_stage2: The percentage of the remaining (post_id, user_id) pairs that should be moved to the test set (exploitation component). - seed: seed value used in order to replicate results if required. Compute user and item latent factors Use the complete training set to compute user and item latent factors. We will leave the discussion about tuning and performance improvement of the model used to the section about model tuning and offline evaluation. 
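The Spark job invoked next is the tutorial's Scala BlogRecommendationApp; purely as an illustration of the implicit-feedback factorization it performs, a minimal PySpark sketch with MLlib's ALS could look like this (the input format and output handling are assumptions, while rank, iterations and lambda mirror the command below):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="blog-cf-sketch")

def parse(line):
    # Assumes one "post_id<TAB>user_id" pair per line in the training set ids.
    post_id, user_id = line.split("\t")
    return Rating(int(user_id), int(post_id), 1.0)   # every pair is an implicit "like"

ratings = sc.textFile("blog-job/training_and_test_indices/training_set_ids").map(parse)

# Implicit-feedback ALS in the spirit of Hu et al. 2008
model = ALS.trainImplicit(ratings, rank=10, iterations=10, lambda_=0.01)

model.userFeatures().saveAsTextFile("blog-job/user_item_cf/user_features")
model.productFeatures().saveAsTextFile("blog-job/user_item_cf/product_features")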
Submit the Spark job to compute the user and item latent factors: $ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \  --task collaborative_filtering \  --input_file blog-job/training_and_test_indices/training_set_ids \  --rank 10 --numIterations 10 --lambda 0.01 \  --output_path blog-job/user_item_cf Verify the vectors for the latent factors for users and posts: $ head -1 blog-job/user_item_cf/user_features/part-00000 | python -m json.tool {    "user_id": 270,    "user_item_cf": {        "user_item_cf:0": -1.750116e-05,        "user_item_cf:1": 9.730623e-05,        "user_item_cf:2": 8.515047e-05,        "user_item_cf:3": 6.9297894e-05,        "user_item_cf:4": 7.343942e-05,        "user_item_cf:5": -0.00017635927,        "user_item_cf:6": 5.7642872e-05,        "user_item_cf:7": -6.6685796e-05,        "user_item_cf:8": 8.5506894e-05,        "user_item_cf:9": -1.7209566e-05    } } $ head -1 blog-job/user_item_cf/product_features/part-00000 | python -m json.tool {    "post_id": 20,    "user_item_cf": {        "user_item_cf:0": 0.0019320602,        "user_item_cf:1": -0.004728486,        "user_item_cf:2": 0.0032499845,        "user_item_cf:3": -0.006453364,        "user_item_cf:4": 0.0015929453,        "user_item_cf:5": -0.00420313,        "user_item_cf:6": 0.009350027,        "user_item_cf:7": -0.0015649397,        "user_item_cf:8": 0.009262732,        "user_item_cf:9": -0.0030964287    } } At this point, the vectors with latent factors can be added to posts and users.Add vectors to search definitions using tensors Modern machine learning applications often make use of large, multidimensional feature spaces and perform complex operations on those features, such as in large logistic regression and deep learning models. It is therefore necessary to have an expressive framework to define and evaluate ranking expressions of such complexity at scale. Vespa comes with a Tensor framework, which unify and generalizes scalar, vector and matrix operations, handles the sparseness inherent to most machine learning application (most cases evaluated by the model is lacking values for most of the features) and allow for models to be continuously updated. Additional information about the Tensor framework can be found in the tensor user guide. We want to have those latent factors available in a Tensor representation to be used during ranking by the Tensor framework. 
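Before wiring the factors into Vespa, it can be useful to sanity-check the dot-product ranking directly on the Spark output shown above. A small sketch, assuming the JSON-per-line part-file layout from the head commands and reusing user_id 270 from the sample:

import json
import numpy as np

def load_factors(path, id_field):
    factors = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            cf = record["user_item_cf"]
            factors[record[id_field]] = np.array(
                [cf["user_item_cf:%d" % i] for i in range(len(cf))])
    return factors

users = load_factors("blog-job/user_item_cf/user_features/part-00000", "user_id")
posts = load_factors("blog-job/user_item_cf/product_features/part-00000", "post_id")

post_ids = list(posts)
post_matrix = np.vstack([posts[p] for p in post_ids])

# Same score the first-phase expression sum(query(user_item_cf) * attribute(user_item_cf)) produces.
scores = post_matrix @ users[270]
top_100 = [post_ids[i] for i in np.argsort(-scores)[:100]]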
A tensor field user_item_cf is added to blog_post.sd to hold the blog post latent factor: field user_item_cf type tensor(user_item_cf[10]) { indexing: summary | attribute attribute: tensor(user_item_cf[10]) } field has_user_item_cf type byte { indexing: summary | attribute attribute: fast-search } A new search definition user.sd defines a document type named user to hold information for users: search user {    document user {        field user_id type string {            indexing: summary | attribute            attribute: fast-search        }        field has_read_items type array {            indexing: summary | attribute        }        field user_item_cf type tensor(user_item_cf[10]) {            indexing: summary | attribute            attribute: tensor(user_item_cf[10])        }        field has_user_item_cf type byte {            indexing: summary | attribute            attribute: fast-search        }    } } Where: - user_id: unique identifier for the user - user_item_cf: tensor that will hold the user latent factor - has_user_item_cf: flag to indicate the user has a latent factorJoin and feed data Build and deploy the application: $ mvn install Deploy the application (in the Docker container): $ vespa-deploy prepare /vespa-sample-apps/blog-recommendation/target/application && \  vespa-deploy activate Wait for app to activate (200 OK): $ curl -s --head http://localhost:8080/ApplicationStatus The code to join the latent factors in blog-job/user_item_cf into blog_post and user documents is implemented in tutorial_feed_content_and_tensor_vespa.pig. After joining in the new fields, a Vespa feed is generated and fed to Vespa directly from Pig : $ pig -Dvespa.feed.defaultport=8080 -Dvespa.feed.random.startup.sleep.ms=0 \  -x local \  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \  -param DATA_PATH=../trainPosts.json \  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \  -param BLOG_POST_FACTORS=blog-job/user_item_cf/product_features \  -param USER_FACTORS=blog-job/user_item_cf/user_features \  -param ENDPOINT=localhost A successful data join and feed will output: Input(s): Successfully read 1196111 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/trainPosts.json" Successfully read 341416 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids" Successfully read 323727 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/product_features" Successfully read 6290 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/user_features" Output(s): Successfully stored 286237 records in: "localhost" Sample blog post and user: - localhost:8080/document/v1/blog-recommendation/user/docid/22702951 - localhost:8080/document/v1/blog-recommendation/blog_post/docid/1838008Ranking Set up a rank function to return the best matching blog posts given some user latent factor. Rank the documents using a dot product between the user and blog post latent factors, i.e. 
the query tensor and blog post tensor dot product (sum of the product of the two tensors) - from blog_post.sd: rank-profile tensor {    first-phase {        expression {            sum(query(user_item_cf) * attribute(user_item_cf))        }    } } Configure the ranking framework to expect that query(user_item_cf) is a tensor, and that it is compatible with the attribute in a query profile type - see search/query-profiles/types/root.xml and search/query-profiles/default.xml:     This configures a ranking feature named query(user_item_cf) with type tensor(user_item_cf[10]), which defines it as an indexed tensor with 10 elements. This is the same as the attribute, hence the dot product can be computed.Query Vespa with a tensor Test recommendations by sending a tensor with latenct factors: localhost:8080/search/?yql=select%20*%20from%20sources%20blog_post%20where%20has_user_item_cf%20=%201;&ranking=tensor&ranking.features.query(user_item_cf)=%7B%7Buser_item_cf%3A0%7D%3A0.1%2C%7Buser_item_cf%3A1%7D%3A0.1%2C%7Buser_item_cf%3A2%7D%3A0.1%2C%7Buser_item_cf%3A3%7D%3A0.1%2C%7Buser_item_cf%3A4%7D%3A0.1%2C%7Buser_item_cf%3A5%7D%3A0.1%2C%7Buser_item_cf%3A6%7D%3A0.1%2C%7Buser_item_cf%3A7%7D%3A0.1%2C%7Buser_item_cf%3A8%7D%3A0.1%2C%7Buser_item_cf%3A9%7D%3A0.1%7D The query string, decomposed: - yql=select * from sources blog_post where has_user_item_cf = 1 - this selects all documents of type blog_post which has a latent factor tensor - restrict=blog_post - search only in blog_post documents - ranking=tensor - use the rank-profile tensor in blog_post.sd. - ranking.features.query(user_item_cf) - send the tensor as user_item_cf. As this tensor is defined in the query-profile-type, the ranking framework knows its type (i.e. dimensions) and is able to do a dot product with the attribute of same type. The tensor before URL-encoding: {  {user_item_cf:0}:0.1,  {user_item_cf:1}:0.1,  {user_item_cf:2}:0.1,  {user_item_cf:3}:0.1,  {user_item_cf:4}:0.1,  {user_item_cf:5}:0.1,  {user_item_cf:6}:0.1,  {user_item_cf:7}:0.1,  {user_item_cf:8}:0.1,  {user_item_cf:9}:0.1 } Query Vespa with user id Next step is to query Vespa by user id, look up the user profile for the user, get the tensor from it and recommend documents based on this tensor (like the query in previous section). The user profiles is fed to Vespa in the user_item_cf field of the user document type. In short, set up a searcher to retrieve the user profile by user id - then run the query. When the Vespa Container receives a request, it will create a Query representing it and execute a configured list of such Searcher components, called a search chain. The query object contains all the information needed to create a result to the request while the Result encapsulates all the data generated from a Query. 
The Execution object keeps track of the call state for an execution of the searchers of a search chain:

package com.yahoo.example;

import com.yahoo.data.access.Inspectable;
import com.yahoo.data.access.Inspector;
import com.yahoo.prelude.query.IntItem;
import com.yahoo.prelude.query.NotItem;
import com.yahoo.prelude.query.WordItem;
import com.yahoo.processing.request.CompoundName;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.querytransform.QueryTreeUtil;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.search.searchchain.SearchChain;
import com.yahoo.tensor.Tensor;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class UserProfileSearcher extends Searcher {

    public Result search(Query query, Execution execution) {
        // Get tensor and read items from user profile
        Object userIdProperty = query.properties().get("user_id");
        if (userIdProperty != null) {
            Hit userProfile = retrieveUserProfile(userIdProperty.toString(), execution);
            if (userProfile != null) {
                addUserProfileTensorToQuery(query, userProfile);
                NotItem notItem = new NotItem();
                notItem.addItem(new IntItem(1, "has_user_item_cf"));
                for (String item : getReadItems(userProfile.getField("has_read_items"))) {
                    notItem.addItem(new WordItem(item, "post_id"));
                }
                QueryTreeUtil.andQueryItemWithRoot(query, notItem);
            }
        }

        // Restrict to search in blog_posts
        query.getModel().setRestrict("blog_post");

        // Rank blog posts using tensor rank profile
        if (query.properties().get("ranking") == null) {
            query.properties().set(new CompoundName("ranking"), "tensor");
        }

        return execution.search(query);
    }

    private Hit retrieveUserProfile(String userId, Execution execution) {
        Query query = new Query();
        query.getModel().setRestrict("user");
        query.getModel().getQueryTree().setRoot(new WordItem(userId, "user_id"));
        query.setHits(1);
        SearchChain vespaChain = execution.searchChainRegistry().getComponent("vespa");
        Result result = new Execution(vespaChain, execution.context()).search(query);
        execution.fill(result); // This is needed to get the actual summary data
        Iterator<Hit> hiterator = result.hits().deepIterator();
        return hiterator.hasNext() ? hiterator.next() : null;
    }

    private void addUserProfileTensorToQuery(Query query, Hit userProfile) {
        Object userItemCf = userProfile.getField("user_item_cf");
        if (userItemCf != null) {
            if (userItemCf instanceof Tensor) {
                query.getRanking().getFeatures().put("query(user_item_cf)", (Tensor) userItemCf);
            } else {
                query.getRanking().getFeatures().put("query(user_item_cf)", Tensor.from(userItemCf.toString()));
            }
        }
    }

    private List<String> getReadItems(Object readItems) {
        List<String> items = new ArrayList<>();
        if (readItems instanceof Inspectable) {
            for (Inspector entry : ((Inspectable) readItems).inspect().entries()) {
                items.add(entry.asString());
            }
        }
        return items;
    }
}

The searcher is configured in services.xml. Deploy, then query a user to get blog recommendations: localhost:8080/search/?user_id=34030991&searchChain=user. 
To refine recommendations, add query terms: localhost:8080/search/?user_id=34030991&searchChain=user&yql=select%20*%20from%20sources%20blog_post%20where%20content%20contains%20%22pegasus%22;Model tuning and offline evaluation We will now optimize the latent factors using the training set instead of manually picking hyperparameter values as was done in Compute user and item latent factors: $ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \  --task collaborative_filtering_cv \  --input_file blog-job/training_and_test_indices/training_set_ids \  --numIterations 10 --output_path blog-job/user_item_cf_cv Feed the newly computed latent factors to Vespa as before. Note that we need to update the tensor specification in the search definition in case the size of the latent vectors change. We have used size 10 (rank = 10) in the Compute user and item latent factors section but our cross-validation algorithm above tries different values for rank (10, 50, 100). $ pig -Dvespa.feed.defaultport=8080 -Dvespa.feed.random.startup.sleep.ms=0 \  -x local \  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \  -param DATA_PATH=../trainPosts.json \  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \  -param BLOG_POST_FACTORS=blog-job/user_item_cf_cv/product_features \  -param USER_FACTORS=blog-job/user_item_cf_cv/user_features \  -param ENDPOINT=localhost Run the following script that will use Java UDF VespaQuery from the vespa-hadoop to query Vespa for a specific number of blog post recommendations for each user_id in our test set. With the list of recommendation for each user, we can then compute the expected percentile ranking as described in section Evaluation metrics: $ pig \  -x local \  -f ../blog-tutorial-shared/src/main/pig/tutorial_compute_metric.pig \  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \  -param BLOG_POST_FACTORS=blog-job/user_item_cf_cv/product_features \  -param USER_FACTORS=blog-job/user_item_cf_cv/user_features \  -param NUMBER_RECOMMENDATIONS=100 \  -param RANKING_NAME=tensor \  -param OUTPUT=blog-job/metric \  -param ENDPOINT=localhost:8080 At completion, observe: Input(s): Successfully read 341416 records from: "file:/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids" Output(s): Successfully stored 5174 records in: "file:/sample-apps/blog-recommendation/blog-job/metric" In the next post we will improve accuracy using a simple neural network.Vespa and Hadoop Vespa was designed to keep low-latency performance even at Yahoo-like web scale. This means supporting a large number of concurrent requests as well as a very large number of documents. In the previous tutorial we used a data set that was approximately 5Gb. Data sets of this size do not require a distributed file system for data manipulation. However, we assume that most Vespa users would like at some point to scale their applications up. Therefore, this tutorial uses tools such as Apache Hadoop, Apache Pig and Apache Spark. These can be run locally on a laptop, like in this tutorial. 
In case you would like to use HDFS (Hadoop Distributed File System) for storing the data, it is just a matter of uploading it to HDFS with the following command:
$ hadoop fs -put trainPosts.json blog-app/trainPosts.json
If you go with this approach, you need to replace the local file paths with the equivalent HDFS file paths in this tutorial. Vespa has a set of tools to facilitate the interaction between Vespa and the Hadoop ecosystem. These can also be used locally. A Pig script example of feeding to Vespa is as simple as:
REGISTER vespa-hadoop.jar
DEFINE VespaStorage com.yahoo.vespa.hadoop.pig.VespaStorage();
A = LOAD '' [USING ] [AS ];
-- apply any transformations
STORE A INTO '$ENDPOINT' USING VespaStorage();
Use Pig to feed a file into Vespa:
$ pig -x local -f feed.pig -p ENDPOINT=endpoint-1,endpoint-2
Here, the -x local option is added to specify that this script is run locally and will not attempt to retrieve scripts and data from HDFS. You need both Pig and Hadoop libraries installed on your machine to run this locally, but you don’t need to install and start a running instance of Hadoop. More examples of feeding to Vespa from Pig are found in the sample apps.

Blog recommendation in Vespa

December 1, 2017
June 27, 2017
June 27, 2017
mikesefanov
Share

Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline

By Mohit Goenka, Senior Engineering Manager Building the technology powering the best consumer email inbox in the world is no easy task. When you start on such a journey, it is important to consider how to deliver such an experience to the users. After all, any consumer feature we build can only make a difference after it is delivered to everyone via the tech pipeline.  As we began building out the new version of Yahoo Mail, we wanted to ensure that our internal developer productivity would not be hindered by how our pipelines work. Keeping this in mind, we identified the following principles as most important while designing the delivery pipeline for the new Yahoo Mail experience:  - Product updates are pushed at regular intervals - Releases are stable - Builds are not blocked by irrational test failures - Developers are notified of code pushes - Hotfixes - Rollbacks - Heartbeat pushes  Product updates are pushed at regular intervals  We ensure that our engineers can push any code changes to all Mail users everyday, with the ability to push multiple times a day, if necessary or desired. This is possible because of the time we spent building a solid testing infrastructure, which continues to evolve as we scale to new users and add new features to the product. Every one of our builds runs 10,000+ unit tests and 5,000+ integration tests on various combinations of operating systems and browsers. It is important to push product updates regularly as it allows all our users to get the best Mail experience possible.  Releases are stable  Every code release starts with the company’s internal audience first, where all our employees get to try out the latest changes before they go out to production. This begins with our alpha and beta environments that our Mail engineers use by default. Our build then goes out to the canary environment, which is a small subset of production users, before making it to all users. This gives us the ability to analyze quality metrics on internal and canary servers before rolling the build out to 100% of users in production. Once we go through this process, the code pushed to all our users is thoroughly baked and tested.  Builds are not blocked by irrational test failures  Running tests using web drivers on multiple browsers, as is standard when testing frontend code, comes with the problem of tests irrationally failing. As part the Yahoo Mail continuous delivery pipeline, we employ various novel strategies to recover from such failures. One such strategy is recording the data related to failed tests in the first pass of a build, and then rerunning only the failed tests in the subsequent passes. This is achieved by creating a metadata file that stores all our build-related information. As part of this process, a new bundle is created with a new set of code changes. Once a bundle is created with build metadata information, the same build job can be rerun multiple times such that subsequent reruns would only run the failing tests. This significantly improves rerun times and eliminates the chances of build detentions introduced by irrational test failures. The recorded test information is analyzed independently to understand the pattern of failing tests. This helps us in improving the stability of those intermittently failing tests.  Developers are notified of code pushes  Our build and deployment pipelines collect data related to all the authors contributing to any release through code commits or by merging various pull requests. 
This enables the build pipeline to send out email notifications to all our Mail developers as their code flows through each environment in our build pipeline (alpha, beta, canary, and production). With this ability, developers are well aware of where their code is in the pipeline and can test their changes as needed.  Hotfixes  We have also created a pipeline to deploy major code fixes directly to production. This is needed even after the existence of tens of thousands of tests and multitudes of checks. Every now and then, a bug may make its way into production. For such instances, we have hotfixes that are very useful. These are code patches that we quickly deploy on top of production code to address critical issues impacting large sets of users.  Rollbacks  If we find any issues in production, we do our best to minimize the impact on users by swiftly utilizing rollbacks, ensuring there is zero to minimal impact time. In order to do rollbacks, we maintain lists of all the versions pushed to production along with their release bundles and change logs. If needed, we pick the stable version that was previously pushed to production and deploy it directly on all the machines running our production instance.  Heartbeat pushes As part of our continuous delivery efforts, we have also developed a concept we call heartbeat pushes. Heartbeat pushes are notifications we send users to refresh their browsers when we issue important builds that they should immediately adopt. These can include bug fixes, product updates, or new features. Heartbeat allows us to dynamically update the latest version of Yahoo Mail when we see that a user’s current version needs to be updated. Yahoo Mail Continuous Delivery Flow In building the new Yahoo Mail experience, we knew that we needed to revamp from the ground up, starting with our continuous integration and delivery pipeline. The guiding principles of our new, forward-thinking infrastructure allow us to deliver new features and code fixes at a very high launch velocity and ensure that our users are always getting the latest and greatest Yahoo Mail experience.
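The record-and-rerun strategy for intermittently failing tests described in this post is easy to picture with a toy sketch. This is an illustration of the idea only, not Yahoo Mail's pipeline code, and run_one_test stands in for whatever test runner the build actually uses:

import json
import subprocess

METADATA_FILE = "build-metadata.json"

def run_tests(test_ids):
    # Returns the subset of tests that failed in this pass.
    return [t for t in test_ids if subprocess.call(["run_one_test", t]) != 0]

def first_pass(all_tests):
    failed = run_tests(all_tests)
    with open(METADATA_FILE, "w") as f:
        json.dump({"failed_tests": failed}, f)
    return failed

def rerun_pass():
    # Subsequent passes of the same build only rerun the previously failing tests.
    with open(METADATA_FILE) as f:
        failed = json.load(f)["failed_tests"]
    still_failing = run_tests(failed)
    with open(METADATA_FILE, "w") as f:
        json.dump({"failed_tests": still_failing}, f)
    return still_failing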

Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline

June 27, 2017
mikesefanov

Yahoo Mail’s New Tech Stack, Built for Performance and Reliability

By Suhas Sadanandan, Director of Engineering

When it comes to performance and reliability, there is perhaps no application where this matters more than with email. Today, we announced a new Yahoo Mail experience for desktop based on a completely rewritten tech stack that embodies these fundamental considerations and more.

We built the new Yahoo Mail experience using a best-in-class front-end tech stack with open source technologies including React, Redux, Node.js, react-intl (open-sourced by Yahoo), and others. A high-level architectural diagram of our stack is below.

New Yahoo Mail Tech Stack

In building our new tech stack, we made use of the most modern tools available in the industry to come up with the best experience for our users by optimizing the following fundamentals:

Performance

A key feature of the new Yahoo Mail architecture is blazing-fast initial loading (aka launch). We introduced new network routing which sends users to their nearest geo-located email servers (proximity-based routing). This has resulted in a significant reduction in time to first byte and should be immediately noticeable to our international users in particular.

We now do server-side rendering to allow our users to see their mail sooner. This change will be immediately noticeable to our low-bandwidth users. Our application is isomorphic, meaning that the same code runs on the server (using Node.js) and the client. Prior versions of Yahoo Mail had programming logic duplicated on the server and the client because we used PHP on the server and JavaScript on the client.

Using efficient bundling strategies (JavaScript code is separated into application, vendor, and lazy-loaded bundles) and pushing only the changed bundles during production pushes, we keep the cache hit ratio high. By using react-atomic-css, our homegrown solution for writing modular and scoped CSS in React, we get much better CSS reuse.

In prior versions of Yahoo Mail, the need to run various experiments in parallel resulted in additional branching and bloating of our JavaScript and CSS code. While rewriting all of our code, we solved this issue using Mendel, our homegrown solution for bucket testing isomorphic web apps, which we have open sourced.

Rather than using custom libraries, we use native HTML5 APIs and ES6 heavily and use PolyesterJS, our homegrown polyfill solution, to fill the gaps. These factors have further helped us keep payload size minimal.

With all the above optimizations, we have been able to reduce our JavaScript and CSS footprint by approximately 50% compared to the previous desktop version of Yahoo Mail, helping us achieve a blazing-fast launch. In addition to initial launch improvements, key features like search and message read (when a user opens an email to read it) have also benefited from the above optimizations and are considerably faster in the latest version of Yahoo Mail. We also significantly reduced the memory consumed by Yahoo Mail in the browser, which is especially noticeable during long-running sessions.

Reliability

With this new version of Yahoo Mail, we have a 99.99% success rate on core flows: launch, message read, compose, search, and actions that affect messages. Accomplishing this over several billion user actions a day is a significant feat. Client-side errors (JavaScript exceptions) are reduced significantly when compared to prior Yahoo Mail versions.

Product agility and launch velocity

We focused on independently deployable components.
As part of the re-architecture of Yahoo Mail, we invested in a robust continuous integration and delivery flow. Our new pipeline allows for daily (or more frequent) pushes to all Mail users, and we push only the bundles that are modified, which keeps the cache hit ratio high.

Developer effectiveness and satisfaction

In developing our tech stack for the new Yahoo Mail experience, we heavily leveraged open source technologies, which ensured a shorter learning curve for new engineers. We were able to implement a consistent and intuitive onboarding program for 30+ developers and are now using our program for all new hires. Throughout the development process, we emphasize predictable flows and easy debugging.

Accessibility

The accessibility of this new version of Yahoo Mail is state of the art and delivers outstanding usability (efficiency) in addition to accessibility. It features six enhanced visual themes that provide accommodation for people with low vision, and it has been optimized for use with assistive technology including alternate input devices, magnifiers, and popular screen readers such as NVDA and VoiceOver. These features have been rigorously evaluated and incorporate feedback from users with disabilities. It sets a new standard for the accessibility of web-based mail and is our most accessible Mail experience yet.

Open source

We have open sourced some key components of our new Mail stack, like Mendel, our solution for bucket testing isomorphic web applications. We invite the community to use and build upon our code. Going forward, we also plan to open source additional components like react-atomic-css, our solution for writing modular and scoped CSS in React, and lazy-component, our solution for on-demand loading of resources.

Many of our company's best technical minds came together to write a brand new tech stack and enable a delightful new Yahoo Mail experience for our users. We encourage our users and engineering peers in the industry to test the limits of our application and to provide feedback by clicking on the Give Feedback call-out in the lower left corner of the new version of Yahoo Mail.
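As a small illustration of the lazy-loaded bundles and on-demand resource loading mentioned above, the sketch below shows the general shape of a lazy loader around a dynamic module load. It is a self-contained toy, not the actual lazy-component or Mendel code; the View interface and the faked module contents are made up for the example.

```typescript
// A tiny lazy-loading helper: a module is fetched and evaluated only the first
// time it is needed, and the result is cached for subsequent calls.
type Loader<T> = () => Promise<{ default: T }>;

function lazy<T>(load: Loader<T>): () => Promise<T> {
  let cached: T | undefined;
  return async () => {
    if (cached === undefined) {
      cached = (await load()).default; // first call triggers the (network) load
    }
    return cached;
  };
}

// In a real app the loader would be `() => import("./compose-view")`, which a
// bundler turns into a separate, lazily fetched chunk. Here the module is faked
// so the sketch stays self-contained.
interface View { render(): string; }
const loadComposeView = lazy<View>(async () => ({
  default: { render: () => "<compose view markup>" },
}));

// Later, e.g. when the user clicks "Compose" for the first time:
async function onComposeClicked(): Promise<void> {
  const composeView = await loadComposeView();
  console.log(composeView.render()); // the heavy view code was loaded on demand
}

onComposeClicked();
```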

June 12, 2017

HBase Goes Fast and Lean with the Accordion Algorithm

yahooresearch: By Edward Bortnikov, Anastasia Braginsky, and Eshcar Hillel

Modern products powered by NoSQL key-value (KV-)storage technologies exhibit ever-increasing performance expectations. Ideally, NoSQL applications would like to enjoy the speed of in-memory databases without giving up on reliable persistent storage guarantees. Our Scalable Systems research team has implemented a new algorithm named Accordion, which takes a significant step toward this goal, into the forthcoming release of Apache HBase 2.0.

HBase, a distributed KV-store for Hadoop, is used by many companies every day to scale products seamlessly with huge volumes of data and deliver real-time performance. At Yahoo, HBase powers a variety of products, including Yahoo Mail, Yahoo Search, Flurry Analytics, and more.

Accordion is a complete rewrite of core parts of the HBase server technology, named RegionServer. It improves the server scalability via a better use of RAM. Namely, it accommodates more data in memory and writes to disk less frequently. This manifests in a number of desirable phenomena. First, HBase's disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and fewer are stalled by disk I/O. Traditionally, these different metrics were considered at odds and tuned at each other's expense. With Accordion, they all get improved simultaneously.

We stress-tested Accordion-enabled HBase under a variety of workloads. Our experiments exercised different blends of reads and writes, as well as different key distributions (heavy-tailed versus uniform). We witnessed performance improvements across the board. Namely, we saw write throughput increases of 20% to 40% (depending on the workload), tail read latency reductions of up to 10%, disk write reductions of up to 30%, and also some modest Java garbage collection overhead reduction. The figures below further zoom into Accordion's performance gains, compared to the legacy algorithm.

Figure 1. Accordion's write throughput compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf (heavy-tailed) and Uniform primary key distributions.

Figure 2. Accordion's read latency quantiles compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.

Figure 3. Accordion's disk I/O compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.

Accordion is inspired by the Log-Structured-Merge (LSM) tree design pattern that governs HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts multi-versioned concurrency control; that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data in BlockCache, seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called compaction, removes redundant cells and creates larger files.

LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O.
However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed in an age when RAM was in very short supply, and therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore size managed by a RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization.

Accordion reapplies the LSM principle to the MemStore in order to eliminate redundancies and other overhead while the data is still in RAM. The MemStore memory image is therefore "breathing" (periodically expanding and contracting), much like the bellows of an accordion. This work pattern decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint.

With fewer flushes, write operations are stalled less frequently as the MemStore overflows, and as a result, write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having fewer disk writes also means having less compaction happening in the background, i.e., fewer cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be thought of as a catalyst that enables the system to move faster as a whole.

Accordion currently provides two levels of in-memory compaction: basic and eager. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. Future implementations may tune the optimal compaction policy automatically.

Accordion replaces the default MemStore implementation in the production HBase code. Contributing its code to production HBase could not have happened without intensive work with the open source Hadoop community, with contributors stretched across companies, countries, and continents. The project took almost two years to complete, from inception to delivery.

Accordion will become generally available in the upcoming HBase 2.0 release. We can't wait to see it power existing and future products at Yahoo and elsewhere.
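As a rough, framework-free illustration of why in-memory compaction helps (a toy model, not the Accordion or HBase code), collapsing redundant versions of the same key while the data is still in RAM shrinks what eventually gets flushed to disk:

```typescript
// Toy model of a memstore that keeps every version of every key,
// plus an "in-memory compaction" step that keeps only the newest version.
interface Cell {
  key: string;
  value: string;
  timestamp: number; // higher = newer
}

class ToyMemStore {
  private cells: Cell[] = [];

  put(key: string, value: string, timestamp: number): void {
    this.cells.push({ key, value, timestamp }); // every update adds a new version
  }

  // In-memory compaction: drop older versions of a key while the data is still
  // in RAM, so the eventual flush writes far less data to disk.
  compactInMemory(): void {
    const newest = new Map<string, Cell>();
    for (const cell of this.cells) {
      const current = newest.get(cell.key);
      if (current === undefined || cell.timestamp > current.timestamp) {
        newest.set(cell.key, cell);
      }
    }
    this.cells = [...newest.values()];
  }

  size(): number {
    return this.cells.length;
  }
}

// High-churn workload: many updates to the same few keys (e.g., shared counters).
const store = new ToyMemStore();
for (let i = 0; i < 1000; i++) {
  store.put(`counter-${i % 10}`, String(i), i);
}
console.log(store.size()); // 1000 versions buffered in memory
store.compactInMemory();
console.log(store.size()); // 10 versions survive, shrinking the eventual flush
```

High-churn workloads such as counters and shopping carts generate many versions per key, which is why the eager compaction level pays off most for them.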

May 23, 2017

Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

We're excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place June 13-15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop, such as data science, cloud and operations, IoT and applications, and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:

- Familiarize yourself with the cutting edge in Apache project developments from the committers
- Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories, and best practices to leverage all your data, on-premises and in the cloud, to drive predictive analytics, distributed deep learning, and artificial intelligence initiatives
- Attend one of our more than 170 technical deep-dive breakout sessions from nearly 200 speakers across eight tracks
- Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data, and more
- Attend the community showcase where you can network with sponsors and industry experts, including a host of startups and large companies like Microsoft, IBM, Oracle, HP, Dell EMC, and Teradata

Similar to previous years, we look forward to continuing Yahoo's decade-long tradition of thought leadership at this year's summit. Join us for an in-depth look at Yahoo's Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse, and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase. Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code YAHOO20. Register here for Hadoop Summit, San Jose, California!

DAY 1. TUESDAY, June 13, 2017

12:20 - 1:00 P.M. TensorFlowOnSpark - Scalable TensorFlow Learning on Spark Clusters
Andy Feng - VP Architecture, Big Data and Machine Learning
Lee Yang - Sr. Principal Engineer

In this talk, we will introduce TensorFlowOnSpark, a new framework for scalable TensorFlow learning that was open sourced in Q1 2017. This framework enables easy experimentation with algorithm designs and supports scalable training and inferencing on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion into TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.

2:10 - 2:50 P.M. Handling Kernel Upgrades at Scale - The Dirty COW Story
Samy Gawande - Sr. Operations Engineer
Savitha Ravikrishnan - Site Reliability Engineer

Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases.
In this talk, we will share methods, tips, and tricks for dealing with large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege escalation vulnerability found in the Linux kernel in late 2016).

5:00 - 5:40 P.M. Data Highway Rainbow - Petabyte Scale Event Collection, Transport, and Delivery at Yahoo
Nilam Sharma - Sr. Software Engineer
Huibing Yin - Sr. Software Engineer

This talk presents the architecture and features of Data Highway Rainbow, Yahoo's hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored towards latency-sensitive consumers.

DAY 2. WEDNESDAY, June 14, 2017

9:05 - 9:15 A.M. Yahoo General Session - Shaping Data Platform for Lasting Value
Sumeet Singh - Sr. Director, Products

With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.

12:20 - 1:00 P.M. CaffeOnSpark Update - Recent Enhancements and Use Cases
Mridul Jain - Sr. Principal Engineer
Jun Shi - Principal Engineer

By combining salient features from the deep learning framework Caffe and the big data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016 and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences on the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer which supports multi-label datasets, distributed LSTM training, interleaving testing with training, a monitoring/profiling framework, and Docker deployment.

12:20 - 1:00 P.M. Tez Shuffle Handler - Shuffling at Scale with Apache Hadoop
Jon Eagles - Principal Engineer
Kuhu Shukla - Software Engineer

In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch to mitigate performance slowdowns, and provides deletion APIs to reduce disk usage for long-running Tez sessions. As an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale.

2:10 - 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes
Thiruvel Thirumoolan - Principal Engineer
Francis Liu - Sr. Principal Engineer

At Yahoo!, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.).
We will walk through the multi-tenancy features, explaining our motivation and how they work, as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.

2:10 - 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Nick Huang - Director, Data Engineering, Yahoo Mail
Saurabh Dixit - Sr. Principal Engineer, Yahoo Mail

Since 2014, the Yahoo Mail Data Engineering team has taken on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this three-year journey, covering the system architecture, the analytics systems built, and the learnings from development and the drive for adoption.

DAY 3. THURSDAY, June 15, 2017

2:10 - 2:50 P.M. OracleStore - A Highly Performant RawStore Implementation for Hive Metastore
Chris Drome - Sr. Principal Engineer
Jin Sun - Principal Engineer

Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.

3:00 - 3:40 P.M. Bullet - A Real Time Data Query Engine
Akshay Sarma - Sr. Software Engineer
Michael Natkovich - Director, Engineering

Bullet is an open-sourced, lightweight, pluggable querying system for streaming data, implemented on top of Storm, that does not require a persistence layer. It allows you to filter, project, and aggregate on data in transit. It includes a UI and web service. Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, Bullet queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average.
3:00 - 3:40 P.M. Yahoo - Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez
Rohini Palaniswamy - Sr. Principal Engineer

Last year at Yahoo, we spent great effort in scaling and stabilizing Pig on Tez and making it production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas that we came up with to make it go even faster. We will go over the new features and the work done in Tez to make that happen, like the custom YARN ShuffleHandler, reworking the DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance, such as bloom join and bytecode generation.

4:10 - 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning
Evans Ye - Software Engineer

Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platforms as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well. In this presentation, we'll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. At the core of this work are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.

Register here for Hadoop Summit, San Jose, California with the 20% discount code YAHOO20. Questions? Feel free to reach out to us at bigdata@yahoo-inc.com. Hope to see you there!

May 9, 2017

Understanding Athenz Architecture

By Mujib Wahab, Henry Avetisyan, and Lee Boynton

Data Model

Having a firm grasp on some fundamental concepts in the Athenz data model will help you understand the Athenz architecture, the request flow for both centralized and decentralized authorization in the system view, and how to set up role-based authorization.

Domain: Domains are namespaces, strictly partitioned, providing a context for authoritative statements to be made about the entities they contain. Administrative tasks can be delegated to created sub-domains to avoid reliance on central "super user" administrative roles.

Resource/Action: Resources and actions aren't explicitly modeled in Athenz; they are referred to by name. A resource is something that is "owned" and controlled in a specific domain, while the operations one can perform against that resource are defined as actions. A resource could be a concrete object like a machine or an abstract object like a security policy. For example, if a product in the domain media.finance wants to authorize access to a database called "storage" that it owns, the resource name for the database may look like media.finance:db.storage, and the supported actions on this resource would be insert, update, and delete.

Policy: To implement access control, we have policies in our domain that govern the use of our resources. A policy is a set of assertions (rules) about granting or denying an operation/action on a resource to all the members of the configured role.

Role: A role can be thought of as a group; anyone in the group can assume the role to take a particular action. Every policy assertion describes what can be done by a role. A role can also delegate the determination of membership to another trusted domain; for example, a netops role managed outside a property domain. This is how we can model tenant relations between a provider domain and tenant domains. Because roles are defined in domains, they can be partitioned by domain, unlike users, which are global. This allows the distributed operation to be scaled more easily.

Principal: The actors in Athenz that can assume a role are called principals. These principals are authenticated and can be users (for example, authenticated by their Unix or Kerberos credentials). Principals can also be services that are authenticated by a service management system. Athenz currently provides service identity and authentication support.

User: Users are actually defined in some external authority, e.g., a Unix or Kerberos system. A special domain is reserved for the purpose of namespacing users; the name of that domain is "user," so some example users are user.john or user.joe. The credentials that the external system requires are exchanged for an NToken before operating on any data.

Service: The concept of a Service Identity is introduced as the identity of independent agents of execution. Services have a simple naming scheme, e.g., media.finance.storage identifies a service called "storage" in the domain media.finance. A service may be used as a principal when specifying roles, just like a user. Athenz provides support for registering such a service in a domain, along with its public key, which can be used later to verify an N-Token presented by the service.
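To tie these concepts together, here is a minimal, illustrative TypeScript sketch of how roles and policy assertions in a domain combine to answer an access question. It is not the Athenz implementation or client API; the role-naming scheme, the deny-overrides rule, and the data shapes are assumptions made for the example.

```typescript
// Toy model of the data model above: a domain owns roles and policy assertions
// that grant or deny an action on a resource to the members of a role.
type Effect = "ALLOW" | "DENY";

interface Assertion {
  role: string;     // e.g. "media.finance:role.db-writers" (illustrative naming)
  action: string;   // e.g. "insert"
  resource: string; // e.g. "media.finance:db.storage"
  effect: Effect;
}

interface Domain {
  name: string;                 // e.g. "media.finance"
  roles: Map<string, string[]>; // role name -> member principals
  assertions: Assertion[];      // flattened from the domain's policies
}

// Answer "can this principal perform this action on this resource?"
// In this toy, DENY assertions win over ALLOW, and access defaults to denied.
function checkAccess(domain: Domain, principal: string, action: string, resource: string): boolean {
  let allowed = false;
  for (const a of domain.assertions) {
    const members = domain.roles.get(a.role) ?? [];
    if (a.action !== action || a.resource !== resource || !members.includes(principal)) {
      continue;
    }
    if (a.effect === "DENY") return false;
    allowed = true;
  }
  return allowed;
}

// Example: user.joe and the service media.finance.storage are members of a
// writers role that is allowed to insert into media.finance:db.storage.
const mediaFinance: Domain = {
  name: "media.finance",
  roles: new Map([["media.finance:role.db-writers", ["user.joe", "media.finance.storage"]]]),
  assertions: [{
    role: "media.finance:role.db-writers",
    action: "insert",
    resource: "media.finance:db.storage",
    effect: "ALLOW",
  }],
};

console.log(checkAccess(mediaFinance, "user.joe", "insert", "media.finance:db.storage"));  // true
console.log(checkAccess(mediaFinance, "user.jane", "insert", "media.finance:db.storage")); // false
```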
System View

Let's look at all the services and libraries that work together to provide support for the Athenz authorization system.

ZMS (authZ Management System): ZMS is the source of truth for domains, roles, and policies for centralized authorization. In addition to allowing CRUD operations on the basic entities, ZMS provides an API to replicate the entities, per domain, to ZTS. ZMS supports a centralized call to check if a principal has access to a resource, both for internal management system checks and for a simple centralized deployment. Because ZMS supports service identities, ZMS can authenticate services. For centralized authorization, ZMS may be the only Athenz subsystem that you need to interact with.

ZTS (authZ Token System): ZTS, the authentication token service, is only needed to support decentralized functionality. In many ways, ZTS is like a local replica of ZMS's data, used to check a principal's authentication and confirm membership in roles within a domain. The authentication is in the form of a signed ZToken that can be presented to any decentralized service that wants to authorize access efficiently. Multiple ZTS instances can be distributed to different locations as needed to scale token issuance.

SIA (Service Identity Agent) Provider: The SIA Provider is part of the container, although likely built with Athenz libraries. As services are authenticated by their private keys, the job of the SIA Provider is to generate an NToken and sign it with the given private key so that the service can present that NToken to ZMS/ZTS as its identity credentials. The corresponding public key must be registered in ZMS so Athenz services can validate the signature.

ZPE (AuthZ Policy Engine): Like ZTS, ZPE, the authorization policy engine, is only needed to support decentralized authorization. ZPE is the subsystem of Athenz that evaluates policies for a set of roles to yield an allowed or denied response. ZPE is a library that your service calls; it refers only to a local policy cache for your service's domain (a small amount of data).

May 9, 2017

Open Sourcing Athenz: Fine-Grained, Role-Based Access Control

By Lee Boynton, Henry Avetisyan, Ken Fox, Itsik Figenblat, Mujib Wahab, Gurpreet Kaur, Usha Parsa, and Preeti Somal

Today, we are pleased to offer Athenz, an open-source platform for fine-grained access control, to the community. Athenz is a role-based access control (RBAC) solution, providing trusted relationships between applications and services deployed within an organization requiring authorized access. If you need to grant access to a set of resources that your applications or services manage, Athenz provides both a centralized and a decentralized authorization model to do so.

Whether you are using container or VM technology independently or on bare metal, you may need a dynamic and scalable authorization solution. Athenz supports moving workloads from one node to another and gives new compute resources authorization to connect to other services within minutes, as opposed to relying on IP and network ACL solutions that take time to propagate within a large system. Moreover, in very high-scale situations, you may run out of the limited number of network ACL rules that your hardware can support.

Prior to creating Athenz, we had multiple ways of managing permissions and access control across all services within Yahoo. To simplify, we built a fine-grained, role-based authorization solution that would satisfy the feature and performance requirements our products demand. Athenz was built with open source in mind so as to share it with the community and further its development.

At Yahoo, Athenz authorizes the dynamic creation of compute instances and containerized workloads, secures builds and deployment of their artifacts to our Docker registry, and, among other uses, manages data access from our centralized key management system to an authorized application or service.

Athenz provides a REST-based set of APIs modeled in Resource Description Language (RDL) to manage all aspects of the authorization system, and includes Java and Go client libraries to quickly and easily integrate your application with Athenz. It allows product administrators to manage which roles are allowed or denied to their applications or services in a centralized management system through a self-serve UI.

Access Control Models

Athenz provides two authorization access control models based on your applications' or services' performance needs. The more commonly used centralized access control model is ideal for provisioning and configuration needs. In instances where performance is absolutely critical for your applications or services, we provide a unique decentralized access control model that provides on-box enforcement of authorization.

Athenz's authorization system utilizes two types of tokens: principal tokens (N-Tokens) and role tokens (Z-Tokens). The principal token is an identity token that identifies either a user or a service. A service generates its principal token using that service's private key. Role tokens authorize a given principal to assume some number of roles in a domain for a limited period of time. Like principal tokens, they are signed to prevent tampering. The name "Athenz" is derived from "Auth" and the 'N' and 'Z' tokens.

Centralized Access Control: The centralized access control model requires any Athenz-enabled application to contact the Athenz Management Service directly to determine if a specific authenticated principal (user and/or service) has been authorized to carry out the given action on the requested resource. At Yahoo, our internal continuous delivery solution uses this model.
A service receives a simple Boolean answer indicating whether the request should be processed or rejected. In this model, the Athenz Management Service is the only component that needs to be deployed and managed within your environment. Therefore, it is suitable for provisioning and configuration use cases where the number of requests processed by the server is small and the latency of authorization checks is not important. The diagram below shows a typical control-plane provisioning request handled by an Athenz-protected service.

Athenz Centralized Access Control Model

Decentralized Access Control: This approach is ideal where the application is required to handle a large number of requests per second and latency is a concern. It's far more efficient to check authorization on the host itself and avoid the synchronous network call to a centralized Athenz Management Service. Athenz provides a way to do this with its decentralized service, using a local policy engine library on the local box. At Yahoo, this is the approach we use for our centralized key management system. The authorization policies, which define the roles that have been authorized to carry out specific actions on resources, are asynchronously updated on application hosts and used by the Athenz local policy engine to evaluate the authorization check. In this model, a principal needs to contact the Athenz Token Service first to retrieve an authorization role token for the request and submit that token as part of its request to the Athenz-protected service. The same role token can then be reused for its lifetime. The diagram below shows a typical decentralized authorization request handled by an Athenz-protected service.

Athenz Decentralized Access Control Model

With the power of an RBAC system in which you can choose a model to deploy according to your performance and latency needs, and the flexibility to choose either or both of the models in a complex environment of hosting platforms or products, you have the ability to run your business with agility and scale.

Looking to the Future

We are actively engaged in pushing the scale and reliability boundaries of Athenz. As we enhance Athenz, we look forward to working with the community on the following features:

- Using local CA-signed TLS certificates
- Extending Athenz with a generalized model for service providers to launch instances with bootstrapped Athenz service identity TLS certificates
- Integration with public cloud services like AWS. For example, launching an EC2 instance with a configured Athenz service identity or obtaining AWS temporary credentials based on authorization policies defined in ZMS.

Our goal is to integrate Athenz with other open source projects that require authorization support, and we welcome contributions from the community to make that happen. Athenz is available under Apache License Version 2.0. To evaluate Athenz, we provide both AWS AMI and Docker images so that you can quickly have a test development environment up and running with ZMS (Athenz Management Service), ZTS (Athenz Token Service), and UI services. Please join us on the path to making application authorization easy. Visit http://www.athenz.io to get started!
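The two access control models described above can be sketched roughly as follows. The endpoint path, header name, response field, and helper types are placeholders invented for this illustration; they are not the documented Athenz REST API or client libraries.

```typescript
// Centralized model: every request asks the management service (ZMS) and gets a
// Boolean back. The URL and header below are placeholders, not the real Athenz endpoint.
async function centralizedCheck(zmsUrl: string, principalToken: string,
                                action: string, resource: string): Promise<boolean> {
  const response = await fetch(
    `${zmsUrl}/access/${encodeURIComponent(action)}/${encodeURIComponent(resource)}`,
    { headers: { "Athenz-Principal-Auth": principalToken } } // hypothetical header name
  );
  const body = await response.json();
  return body.granted === true; // simple allow/deny answer
}

// Decentralized model: fetch a signed role token from the token service (ZTS) once,
// then authorize locally against a cached copy of the domain's policies (the job of
// the local policy engine), avoiding a network round trip per request.
interface RoleToken { domain: string; roles: string[]; expiry: number; } // expiry in epoch seconds

type LocalPolicyCheck = (roles: string[], action: string, resource: string) => boolean;

function decentralizedCheck(token: RoleToken, localPolicy: LocalPolicyCheck,
                            action: string, resource: string): boolean {
  if (Date.now() / 1000 > token.expiry) return false; // expired tokens are rejected
  return localPolicy(token.roles, action, resource);  // evaluated entirely on-box
}
```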

February 13, 2017