
Latest Blogposts

Stories and updates you can see

December 3, 2019

Dash Open 14: How Verizon Media’s Data Platforms and Systems Engineering Team Uses and Contributes to Open Source

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Tom Miller, Director of Software Development Engineering on the Data Platforms and Systems Engineering Team at Verizon Media. Tom shares how his team uses and contributes to open source. Tom also chats about empowering his team to do great work and what it’s like to live and work in Champaign, IL. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify. P.S. If you enjoyed this podcast then you might be interested in this Software Development Engineer position in Champaign!

November 29, 2019

E-commerce search and recommendation with Vespa.ai

Introduction

Holiday shopping season is upon us and it's time for a blog post on e-commerce search and recommendation using Vespa.ai. Vespa.ai is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia, like tw.buy.yahoo.com. This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.ai.

Photo by Jonas Leupe on Unsplash

Text matching and ranking in e-commerce search

E-commerce search has text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25, see the influence of TF-IDF algorithms in e-commerce search. One example from that blog post is a search for "ipad 2", which with traditional TF-IDF ranking will rank 'black mini ipad cover, compatible with ipad 2' higher than 'Ipad 2', as the former product description has several occurrences of the query terms ipad and 2.

Vespa allows developers and relevancy engineers to fine-tune the text ranking features to meet domain-specific ranking challenges. For example, developers can control whether multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. The Vespa text ranking features also take text proximity into account in the relevancy calculation, i.e. how close the query terms appear in the matched text. BM25/TF-IDF, on the other hand, does not take query term proximity into account at all. Vespa also implements BM25, but it's up to the relevancy engineer to choose which of the rich set of built-in text ranking features in Vespa to use.
Vespa uses OpenNLP for linguistic processing like tokenization and stemming, with support for multiple languages (as supported by OpenNLP).

Custom ranking business logic in e-commerce search

Your manager might tell you that these items of the product catalog should be prominent in the search results. How do you tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search, or even with a key-value store which is rarely in sync with the product catalog search index? With Vespa it's easy to promote content, as Vespa's ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles, which enable full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time, so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See the later section on query classification for more details on how query classification can be done with Vespa.

A sample ranking profile which implements a tiered relevance scoring function, where sponsored or promoted items are always ranked above non-sponsored documents, is shown below. The ranking profile is applied to all documents which match the query formulation, and the relevance score of the hit is assigned the value of the first-phase expression. Vespa also supports multi-phase ranking.

Sample hand-crafted ranking profile defined in the Vespa application package.

The above example is hand-crafted, but for optimal relevance we recommend looking at learning to rank (LTR) methods. See Learning to Rank using TensorFlow Ranking and Learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic.
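Only the caption of the original rank profile listing survives here, so the following is a minimal sketch of what such a tiered profile could look like. The profile name, the is_sponsored attribute field, and the boost constant are illustrative assumptions, not the post's actual code:

```
# Hypothetical sketch of a tiered rank profile; the field name
# is_sponsored and the constant boost are invented for illustration.
rank-profile commerce inherits default {
    first-phase {
        # Sponsored items get a large constant added to their score so
        # they always outrank non-sponsored items; within each tier,
        # documents are ordered by the nativeRank text ranking feature.
        expression: if (attribute(is_sponsored) == 1, 1000000 + nativeRank, nativeRank)
    }
}
```

Because the scoring function is plain math, the business rule (sponsored first) and the text relevance signal live in one explicit expression rather than in query rewriting tricks.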
In the example above we could replace the default ranking function with the trained MLR model, hence combining business logic with MLR models.

Facets and grouping in e-commerce search

Guiding the user through the product catalog by guided navigation or faceted search is a feature which users expect from an e-commerce search solution today, and with Vespa, facets and guided navigation are easily implemented with the powerful Vespa Grouping Language.

Sample screenshot from the Vespa e-commerce sample application UI demonstrating search facets using the Vespa Grouping Language.

The Vespa grouping language supports deeply nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example, if grouping hits by category and displaying the top 3 ranking hits per category, the language allows paginating to render more hits from a specified category group.

The vocabulary mismatch problem in e-commerce search

Studies (e.g. this study from FlipKart) find that there is a significant fraction of queries in e-commerce search which suffer from vocabulary mismatch between the user query formulation and the relevant product descriptions in the product catalog. For example, the query "ladies pregnancy dress" would not match a product with description "women maternity gown" due to vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail to retrieve the relevant product right off the bat. Most techniques currently used to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models.
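As a rough sketch of the category-facet example above (the field name category and the query are invented, and the exact clause should be checked against the Vespa Grouping Language documentation), a grouping request returning the top three hits per category might look roughly like:

```
select * from sources * where default contains "shoes" |
  all(group(category) max(10) each(max(3) each(output(summary()))))
```

Here max(10) caps the number of category groups returned, while the inner each(max(3) ...) renders up to three hit summaries inside each group; continuation tokens in the grouping result allow paginating deeper into a chosen group.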
Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem, as the retrieval is not based on query terms alone, but instead on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval.

Using our query and product title example from the section above, which suffers from vocabulary mismatch, and moving away from the textual representation to the respective dense tensor embedding representations, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields which one can perform tensor operations (e.g. cosine similarity) over for ranking; this functionality is demonstrated in the mentioned sample application. Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description). The tensor embeddings of the textual representations are obtained with the Universal Sentence Encoder from Google.

Semantic similarity matrix of different queries and product descriptions.

The Universal Sentence Encoder model from Google is multilingual, as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval, so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages.
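The cosine similarity measure used above is straightforward to compute. Below is a minimal Python sketch, with small toy vectors standing in for the 512-dimensional Universal Sentence Encoder embeddings (the vectors are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real sentence embeddings of a query and a
# product description; vectors pointing in a similar direction yield a
# similarity close to 1, orthogonal vectors yield 0.
query_embedding = [0.1, 0.3, 0.5, 0.2]
product_embedding = [0.1, 0.28, 0.52, 0.22]

print(cosine_similarity(query_embedding, product_embedding))
```

In production the embeddings would be stored in Vespa tensor fields, and the same computation would be expressed as a tensor operation in a ranking expression.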
This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.

Query classification and query rewriting in e-commerce search

Vespa supports deploying stateless machine learned (ML) models, which comes in handy when doing query classification. Machine learned models which classify the query are commonly used in e-commerce search solutions, and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g. text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with the mentioned tools can successfully be used for various query classification tasks with high accuracy.

In e-commerce search, classifying the intent of the query or query session can help rank the results by using an intent-specific ranking profile which is tailored to the specific query intent. The intent classification can also determine how the result page is displayed and organised. Consider a category browse intent query like 'shoes for men'. Such a query might benefit from a query rewrite which limits the result set to contain only items matching the unambiguous category id, instead of just searching the product description or category fields for 'shoes for men'. Ranking could also change based on the query classification, by using a ranking profile which gives higher weight to signals like popularity or price than to text ranking features. Vespa also features a powerful query rewriting language which supports rule-based query rewrites, synonym expansion and query phrasing.

Product recommendation in e-commerce search

Vespa is commonly used for recommendation use cases, and e-commerce is no exception.
Vespa is able to evaluate complex machine learned (ML) models over many data points (documents, products) in user time, which allows the ML model to use real-time signals derived from the current user's online shopping session (e.g. products browsed, queries performed, time of day) as model features. An offline, batch-oriented inference architecture would not be able to use these important real-time signals. By batch-oriented inference architecture we mean pre-computing the inference offline for a set of users or products, where the model inference results are stored in a key-value store for online retrieval. In our blog recommendation tutorial we demonstrate how to apply a collaborative filtering model for content recommendation, and in part 2 of the blog recommendation tutorial we show how to use a neural network trained with TensorFlow to serve recommendations in user time. Similar recommendation approaches are used with success in e-commerce.

Keeping your e-commerce index up to date with real-time updates

Vespa is designed for horizontal scaling with high sustainable write and read throughput and low, predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications, as the real-time information is used in retrieval filters and also as ranking signals. The product description or product title rarely changes, but meta information like inventory status, price and popularity are real-time signals which will improve relevance when used in ranking. Having the inventory status reflected in the search index also avoids retrieving content which is out of stock. Vespa has true native support for partial updates, where there is no need to re-index the entire document but only a subset of the document (i.e. fields in the document). Real-time partial updates can be done at scale against attribute fields, which are stored and updated in memory.
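As a sketch of what such a partial update looks like in Vespa's JSON document format (the document type, namespace, and field names here are invented for illustration), only the changed fields are sent, not the whole document:

```json
{
    "update": "id:shopping:product::sku-12345",
    "fields": {
        "price": { "assign": 279 },
        "in_stock": { "assign": true }
    }
}
```

Since price and in_stock would be attribute fields held in memory, applying this update does not trigger re-indexing of the product description or title.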
Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.

Campaigns in e-commerce search

Using Vespa's support for predicate fields, it's easy to control when content is surfaced in search results and when it is not. The predicate field type allows the content (e.g. a document) to express whether it should match the query, instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:

- Only match and retrieve the document if the time of day is in the range 8-16 or the range 19-20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status are passed with the query.
- Represent recurring campaigns with multiple time ranges.

The above examples are by no means exhaustive, as predicates can be used for multiple campaign-related use cases where the filtering logic is expressed in the content.

Scaling & performance for high availability in e-commerce search

Are you worried that your current search installation will break under the traffic surge associated with the holiday shopping season? Are your cloud VMs already running high on disk busy metrics? What about those long GC pauses in the JVM old generation causing your 95th percentile latency to go through the roof? Needless to say, any downtime due to a slow search backend causing a denial of service situation in the middle of the holiday shopping season will have a catastrophic impact on revenue and customer experience.

Photo by Jon Tyson on Unsplash

The heart of the Vespa serving stack is written in C++ and doesn't suffer from issues related to long JVM GC pauses.
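As a rough sketch of the first bullet above, the document-side predicate expression could look something like the following. The attribute names hour and membership are invented, and the exact syntax should be verified against the Vespa predicate fields documentation:

```
(hour in [8..16] or hour in [19..20]) and membership in [member]
```

At query time the application would pass the current hour and the user's membership status as predicate attributes, and only documents whose stored expression evaluates to true are matched and retrieved.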
The indexing and search components in Vespa are significantly different from the Lucene-based engines like Solr/Elasticsearch, which are IO intensive due to the many Lucene segments within an index shard. A query in a Lucene-based engine will need to perform lookups in dictionaries and posting lists across all segments across all shards. Optimising the search access pattern by merging the Lucene segments will further increase the IO load during the merge operations.

With Vespa you don't need to define the number of shards for your index prior to indexing a single document, as Vespa allows adaptive scaling of the content cluster(s) and there is no shard concept in Vespa. Content nodes can be added and removed as you wish, and Vespa will re-balance the data in the background without having to re-feed the content from the source of truth. In Elasticsearch, changing the number of shards to scale with changes in data volume requires an operator to perform a multi-step procedure that sets the index into read-only mode and splits it into an entirely new index. Vespa is designed to allow cluster resizing while being fully available for reads and writes. Vespa splits, joins and moves parts of the data space to ensure an even distribution with no intervention needed.

At the scale we operate Vespa at Verizon Media, requiring more than 2x footprint during content volume expansion or reduction would be prohibitively expensive. Vespa was designed to allow content cluster resizing while serving traffic without noticeable serving impact. Adding or removing content nodes is handled by adjusting the node count in the application package and re-deploying the application package.

The shard concept in Elasticsearch and Solr also impacts the search latency incurred by CPU processing in the matching and ranking loops, as the concurrency model in Elasticsearch/Solr is one thread per search per shard.
Vespa, on the other hand, allows a single search to use multiple threads per node, and the number of threads can be controlled at query time by a rank-profile setting: num-threads-per-search. Partitioning the matching and ranking by dividing the document volume between searcher threads reduces the overall latency at the cost of more CPU threads, but makes better use of multi-core CPU architectures. If your search servers' CPU usage is low and search latency is still high, you now know the reason. In a recently published benchmark which compared the performance of Vespa versus Elasticsearch for dense vector ranking, Vespa was 5x faster than Elasticsearch. The benchmark used 2 shards for Elasticsearch and 2 threads per search in Vespa.

The holiday season online query traffic can be very spiky, a query traffic pattern which can be difficult to predict and plan for. For instance, price comparison sites might direct more user traffic to your site unexpectedly, at times you did not plan for. Vespa supports graceful quality-of-search degradation, which comes in handy for those cases where traffic spikes reach levels not anticipated in the capacity planning. These soft degradation features allow the search service to operate within acceptable latency levels, but with less accuracy and coverage. These soft degradation mechanisms help avoid a denial of service situation where all searches become slow due to overload caused by unexpected traffic spikes. See details in the Vespa graceful degradation documentation.

Summary

In this post we have explored some of the challenges in e-commerce search and recommendation, and highlighted some of the features of Vespa which can be used to tackle e-commerce search and recommendation use cases. If you want to try Vespa for your e-commerce application, check out our e-commerce sample application. The sample application can be scaled to full production size using our hosted Vespa Cloud service at https://cloud.vespa.ai/.
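As a sketch (the profile name is invented; check the Vespa rank profile documentation for the exact placement of the setting named above), a rank profile enabling intra-query parallelism might look like:

```
rank-profile fast-response inherits default {
    # Use up to four matching/ranking threads per search on each
    # content node, trading CPU for lower per-query latency.
    num-threads-per-search: 4
}
```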
Happy Holiday Shopping Season!

November 26, 2019

YAML tip: Using anchors for shared steps & jobs

Sheridan Rawlins, Architect, Verizon Media

Overview

Occasionally, a pipeline needs several similar but different jobs. When these jobs are specific to a single pipeline, it would not make much sense to create a Screwdriver template. In order to reduce copy/paste issues and facilitate sharing jobs and steps in a single YAML, the tips shared in this post will hopefully be as helpful to you as they were to us. Below is a condensed example showcasing some techniques or patterns that can be used for sharing steps.

Example of desired use

```yaml
jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
```

Complete working example at the end of this post.

Defining shared steps

What is a step? First, let us define a step. Steps of a job look something like the following, and each step is an array element containing an object with only one key and corresponding value. The key is the step name and the value is the cmd to be run. More details can be found in the SD Guide.

```yaml
jobs:
  job1:
    steps:
      - step1: echo "do step 1"
      - step2: echo "do step 2"
```

What are anchors and aliases? Second, let me describe YAML anchors and aliases. An anchor may only be placed between an object key and its value. An alias may be used to copy or merge the anchor value.

Recommendation for defining shared steps and jobs

While an anchor can be defined anywhere in a YAML file, defining shared things in the shared section makes intuitive sense. As annotations can contain freeform objects in addition to documented ones, we recommend defining annotations in the "shared" section.
Now, I'll show an example and explain the details of how it works:

```yaml
shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"
```

Explanation of how the step anchor declaration patterns work: in order to reduce redundancy, annotations allow users to define one shared configuration with an "alias" that can be referenced multiple times, such as *some-step in the following example, used by job1 and job2.

```yaml
jobs:
  job1:
    steps:
      - *some-step
  job2:
    steps:
      - *some-step
```

To use the alias, the anchor &some-step must result in an object with a single key (also some-step) and a value which is the shell code to execute. Because an anchor can only be declared between a key and a value, we use an array with a single object with the single key . (short to type). The array allows us to use . again without conflict - if it were in an object, we might need to repeat some-step three times, such as:

```yaml
# Anti-pattern: do not use as it is too redundant.
some-step: &some-step
  some-step: |
    # shell instructions
```

The following is an example of a reasonably short pattern that can be used to define the steps, with the only redundancy being the anchor name and the step name:

```yaml
shared:
  annotations:
    steps:
      - .: &some-step
          some-step: |
            echo "do some step"
```

When using *some-step, you alias to the anchor, which is an object with the single key some-step and the value echo "do some step" - exactly what you want/need.

FAQ

Why the | character after some-step:? While you could just write some-step: echo "do some step", I prefer to use the | notation for describing shell code because it allows you to do multiline shell scripting.
Even for one-liners, you don't have to reason about the escape rules - as long as the commands are indented properly, they will be passed to the $USER_SHELL_BIN correctly, allowing your shell to deal with escaping naturally.

```yaml
set-dryrun: |
  DRYRUN=false
  if [[ -n $SD_PULL_REQUEST ]]; then
    DRYRUN=true
  fi
```

Why that syntax for environment variables?

1. By using environment variables for shared steps, the variables can be altered by the specific jobs that invoke them.
2. The syntax "${VARIABLE:?}" is useful for a step that needs a value - it will cause an error if the variable is undefined or empty.

Why split CMD into array assignment and invocation? The style of defining an array and then invoking it helps readability by putting each logical flag on its own line. It can be digested by a human very easily and also copy/pasted to other commands or deleted with ease as a single line. Assigning to an array allows multiple lines because bash will not complete the statement until the closing parenthesis.

Why does one flag have --flag=value and another have --flag value? Most CLI parsers treat boolean flags as flags without an expected value - omission of the flag is false, existence is true. However, many CLI parsers also accept the --flag=value syntax for boolean flags and, in my opinion, it is far easier to debug and reason about a variable (such as false) than to know that the flag exists and is false when not provided.

Defining shared jobs

What is a job? A job in Screwdriver is an object with many fields, described in the SD Guide.

Job anchor declaration patterns

To use a shared job effectively, it is helpful to use a feature of YAML that is documented outside of the YAML 1.2 Spec called Merge Key. The syntax <<: *some-object-anchor lets you merge the keys of an anchor that has an object as its value into another object, and then add or override keys as necessary.
Recommendation for defining shared jobs

```yaml
shared:
  annotations:
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy
```

If you browse back to the previous example of desired use (also copied here), you can see the use of <<: *deploy-job to start with the deploy-job keys/values, and then add requires and environment overrides to customize the concrete instances of the deploy job.

```yaml
jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
```

FAQ

Why is environment put in the shared section and not included with the shared job? The answer to that is quite subtle. The Merge Key merges top-level keys; if you were to put defaults in a shared job, overriding environment: would end up clobbering all of the provided values. However, Screwdriver follows up the YAML parsing phase with its own logic to merge things from the shared section at the appropriate depth.

Why not just use shared.steps? As noted above, Screwdriver does additional work to merge annotations, environment, and steps into each job after the YAML parsing phase. The logic for steps goes like this:

1. If a job has NO steps key, then it inherits ALL shared steps.
2. If a job has at least one step, then only matching wrapping steps (steps starting with pre or post) are copied in at the right place (before or after steps that the job provides matching the remainder of the step name after pre or post).

While the above pattern might be useful for some pipelines, complex pipelines typically have a few job types and may want to share some but not all steps.
Complete Example

Copy and paste the following into the validator:

```yaml
shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy

jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
```

November 24, 2019

Dash Open 13: Using and Contributing to Hadoop at Verizon Media

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Eric Badger, Software Development Engineer, about using and contributing to Hadoop at Verizon Media.  Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

November 13, 2019

Build Parameters

Alan Dong, Software Engineer, Verizon Media

The Screwdriver team is constantly evolving and building new features for its users. Today, we are announcing a nuanced feature: Build Parameters, aka Parameterized Builds, which enables users to have more control over build pipelines.

Purpose

The Build Parameters feature allows users to define a set of parameters at the pipeline level; users can customize runtime parameters either through the UI or the API to kick off builds. This means users can now implement reactive behaviors based on the parameters passed in.

Definition

There are two ways of defining parameters:

```yaml
parameters:
  nameA: "value1"
  nameB:
    value: "value2"
    description: "description of nameB"
```

Parameters is a dictionary which expects key:value pairs. The shorthand

```yaml
nameA: "value1"
```

is identical to the longer form

```yaml
nameA:
  value: "value1"
  description: ""
```

with the description being an empty string.

Example

See this Screwdriver pipeline:

```yaml
shared:
  image: node:8

parameters:
  region: "us-west-1"
  az:
    value: "1"
    description: "default availability zone"

jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - step1: 'echo "Region: $(meta get parameters.region.value)"'
      - step2: 'echo "AZ: $(meta get parameters.az.value)"'
```

You can also preview the parameters being used during a build in the Setup -> sd-setup-init step.

[Pipeline preview screenshot]

Compatibility List

In order to use this feature, you will need these minimum versions:
- API: v0.5.780
- UI: v1.0.466

Contributors

Thanks to the following contributors for making this feature possible:
- adong

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.

November 5, 2019

Vespa Product Updates, October/November 2019: Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance, and Datadog Monitoring Support

Kristian Aune, Tech Product Manager, Verizon Media

In the September Vespa product update, we mentioned Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container. This month, we're excited to share the following updates:

Nearest Neighbor and Tensor Ranking

Tensors are native to Vespa. We compared elastic.co to vespa.ai, testing nearest neighbor ranking using dense tensor dot product. The result of an out-of-the-box configuration demonstrated that Vespa performed 5 times faster than Elastic. View the test results.

Optimized JSON Tensor Feed Format

A tensor is a data type used for advanced ranking and recommendation use cases in Vespa. This month, we released an optimized tensor format, enabling a more than 10x improvement in feed rate. Read more.

Matched Elements in Complex Multi-value Fields

Vespa is used in many use cases with structured data - documents can have arrays of structs or maps. Such arrays and maps can grow large, and often only the entries matching the query are relevant. You can now use the recently released matched-elements-only setting to return matches only. This increases performance and simplifies front-end code.

Large Weighted Set Update Performance

Weighted sets in documents are used to store a large number of elements used in ranking. Such sets are often updated at high volume, in real time, enabling online big data serving. Vespa 7.129 includes a performance optimization for updating large sets. For example, a set with 10K elements, without fast-search, is 86.5% faster to update.

Datadog Monitoring Support

Vespa is often used in large-scale, mission-critical applications. For easy integration into dashboards, Vespa is now in Datadog's integrations-extras GitHub repository. Existing Datadog users will now find it easy to monitor Vespa. Read more.

About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine.
It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Collection Page Redesign November 4, 2019
November 4, 2019
Share

Collection Page Redesign

Yufeng Gao, Software Engineer Intern, Verizon Media We would like to introduce our new collections dashboard page. Users can now see more about the status of their pipelines and have more flexibility when managing pipelines within a collection. Main Features View Modes The new collection dashboard provides two view options - card mode and list mode. Both modes display pipeline repo names, branches, histories, and latest event info (such as commit sha, status, start date, and duration). However, card mode also shows the latest events, while list mode doesn’t. Users can switch between the two modes using the toggle in the top right corner of the dashboard’s main panel. Collection Operations To create or delete a collection, users can use the left sidebar of the new collections page. For a specific existing collection, the dashboard offers three operations, found to the right of the current collection’s title: 1. Search all pipelines that the current collection doesn’t contain, then select and add some of them to the current collection; 2. Change the name and description of the current collection; 3. Copy and share the link of the current collection. Additionally, the dashboard also provides useful pipeline management operations: 1. Easily remove a single pipeline from the collection; 2. Remove multiple pipelines from the collection; 3. Copy and add multiple pipelines of the current collection to another collection. Default Collection Another new feature is the default collection, a collection where users can find all pipelines created by themselves. Note: Users have limited permissions when it comes to the default collection; that is, they cannot perform most operations available on normal collections. 
Users can only copy and share default collection links. Compatibility List In order to see the collection page redesign, you will need these minimum versions: - API: v0.5.781 - UI: v1.0.466 Contributors - code-beast - adong - jithine Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.

Collection Page Redesign

November 4, 2019
Recent Updates October 21, 2019
October 21, 2019
Share

Recent Updates

Jithin Emmanuel, Engineering Manager, Verizon Media Recent bug fixes in Screwdriver: Meta - skip-store option to prevent caching external meta. - meta cli is now concurrency safe. - When caching external metadata, meta-cli will not store existing cached data present in external metadata. API - Users can use the SD_COVERAGE_PLUGIN_ENABLED environment variable to skip the Sonarqube coverage bookend. - Screwdriver admins can now update build status to FAILURE through the API. - A new API endpoint for fetching the latest build for a job is now available. - Fix for branch filtering not working for PR builds. - Fix for branch filtering not working for triggered jobs. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v0.5.773 - Launcher - v6.0.23 Contributors Thanks to the following contributors for making this feature possible: - adong - klu909 - scr-oath - kumada626 - tk3fftk Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Recent Updates

October 21, 2019
Database schema migrations October 18, 2019
October 18, 2019
Share

Database schema migrations

Lakshminarasimhan Parthasarathy, Verizon Media Database schema migrations Screwdriver now supports database schema migrations using sequelize-cli migrations. When adding any fields to models in the data-schema, you will need to add a migration file. Sequelize-cli migrations keep track of changes to the database, helping with adding and/or reverting changes to the DB. They also ensure models and migration files stay in sync. Why schema migrations? Database schema migrations help manage the state of schemas. Screwdriver originally applied schema changes during API deployments. While this was helpful for low-scale deployments, it led to unexpected issues for high-scale deployments. For such high-scale deployments, migrations are more effective as they ensure quicker and more consistent schema deployment outside of API deployments. Moreover, API traffic is not served until database schema changes are applied and ready. Cluster Admins In order to run schema migrations, DDL sync via the API should be disabled using the DATASTORE_DDL_SYNC_ENABLED environment variable, since this option is enabled by default. - Schema migrations and DDL sync via the API should not be run together. Either option should suffice based on the scale of the Screwdriver deployment. - Always create new migration files for any new DDL changes. - Do not edit or remove migration files, even after they have been migrated and are available in the database. 
Screwdriver cluster admins can refer to the following documentation for more details on database schema migrations: - README: https://github.com/screwdriver-cd/data-schema/blob/master/CONTRIBUTING.md#migrations - Issue: https://github.com/screwdriver-cd/screwdriver/issues/1664 - Disable DDL sync via API: https://github.com/screwdriver-cd/screwdriver/pull/1756 Compatibility List In order to use this feature, you will need these minimum versions: - [API](https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.752 Contributors Thanks to the following people for making this feature possible: - parthasl - catto - dekus Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Database schema migrations

October 18, 2019
Improving Screwdriver’s meta tool October 2, 2019
October 2, 2019
Share

Improving Screwdriver’s meta tool

Sheridan Rawlins, Architect, Verizon Media Improving Screwdriver’s meta tool Over the past month there have been a few changes to the meta tool, mostly focused on using external metadata, but also on helping to identify and self-diagnose a few silent gotchas we found. Metadata is a structured key/value data store that gives you access to details about your build. Metadata is carried over to all builds that are part of the same event, and at the end of each build it is merged back into the event the build belongs to. This allows builds to share their metadata with other builds in the same event or in external pipelines. External metadata External metadata can be populated for a job in your pipeline using a requires clause that refers to it in the form sd@${pipelineID}:${jobName} (such as sd@12345:main). If sd@12345:main runs to completion and “triggers” a job or jobs in your pipeline, a file will be created with meta from that build in /sd/meta/sd@12345:main.json, and you can refer to any of its data with commands such as meta get someKey --external sd@12345:main. The above feature has existed for some time, but there were several corner cases that made it challenging to use external metadata: 1. External metadata was not provided when the build was not triggered from the external pipeline, such as by clicking the “Start” button, or via a scheduled trigger (i.e., through using the buildPeriodically annotation). 2. Restarting a previously externally-triggered build would not provide the external metadata. The notion of rollback can be facilitated by retriggering a deployment job, but if that deployment relies on metadata from an external trigger, it wouldn’t be there before these improvements. Fetching from lastSuccessfulMeta Screwdriver has an API endpoint called lastSuccessfulMeta. While it is possible to use this endpoint directly, providing this ability directly in the meta tool makes it a little easier to just “make it so”. 
By default now, if external metadata does not exist in the file /sd/meta/sd@12345:main.json, it is fetched from that job’s lastSuccessfulMeta via the API call. Should this behavior not be desired, the flag --skip-fetch can be used to skip fetching. For rollback behavior, however, this feature by itself wasn’t enough - consider a good deployment followed by a bad deployment. The “bad” deployment would most likely have deployed what, from a build standpoint, was “successful”. When retriggering the previous job, because it is a manual trigger, there will be no external metadata and the lastSuccessfulMeta will most likely be fetched and the newer, “bad” code would just get re-deployed again. For this reason the next feature was also added to meta - “caching external data in the local meta”. Caching external data in the local meta External metadata for sd@12345:main (whether from trigger or fetched from lastSuccessfulMeta) will now be stored into and searched first from the local meta under the key sd.12345.main. Note: no caching will be done when --skip-fetch is passed. This caching of external meta helps with a few use cases: 1. Rollback is facilitated because the external metadata at the time a job was first run is now saved and used when “Restart” is pressed for a job. 2. External metadata is now available throughout all jobs of an event - previously, only the triggered job or jobs would receive the external metadata, but because the local meta is propagated to every job in an event, the sd.12345.main key will be available to all jobs. Since meta will look there first, any job in a workflow can use that same --external sd@12345:main with confidence that it will get the same metadata which was received by the triggered job.Self-diagnosing gotchas 1. Meta uses a CLI parser called urfave/cli. 
Previously, it configured its CLI flags in both the “global” and “subcommand-specific” locations; this led to being able to pass flags in either location - before a subcommand like get or set, or after it - but they would only be honored in the latter position. Now, only --meta-space is global, and all other flags are per-subcommand. It is no longer possible to pass --external to the set subcommand. 2. Number of arguments - previously, if extra arguments were passed to flags that didn’t take them, or if arguments to flags that expected them were forgotten, it was possible to become confused about the key and/or value vs flags. Now, the positional arguments are strictly counted: 1 (“key”) for get, and 2 (“key” and “value”) for set.
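The caching key naming described above - external trigger name sd@12345:main stored under the local meta key sd.12345.main - can be sketched as a small helper. This is illustrative only, not the meta tool’s actual implementation, and the job-name pattern is simplified:

```javascript
// Sketch of the sd@{pipelineID}:{jobName} -> sd.{pipelineID}.{jobName}
// mapping described above. Not the meta tool's real code; real job names
// may allow more characters than \w covers.
function toLocalMetaKey(external) {
  const match = /^sd@(\d+):(\w+)$/.exec(external);
  if (!match) {
    throw new Error(`not an external trigger name: ${external}`);
  }
  return `sd.${match[1]}.${match[2]}`;
}

console.log(toLocalMetaKey("sd@12345:main")); // "sd.12345.main"
```

Because the cached copy lives under a plain dotted key in the local meta, it propagates to every job in the event like any other metadata, which is what makes `--external sd@12345:main` work consistently across the whole workflow.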

Improving Screwdriver’s meta tool

October 2, 2019
Vespa Product Updates, September 2019: Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container September 28, 2019
September 28, 2019
Share

Vespa Product Updates, September 2019: Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container

Kristian Aune, Tech Product Manager, Verizon Media In the August Vespa product update, we mentioned BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we’re excited to share the following updates with you: Tensor Float Support Tensors now support float cell values, for example tensor<float>(key{}, x[100]). Using the 32-bit float type cuts the memory footprint in half compared to the 64-bit double, and can increase ranking performance by up to 30%. Vespa’s TensorFlow and ONNX integration now converts to float tensors for higher performance. Read more. Reduced Memory Use for Text Attributes Attributes in Vespa are fields stored in columnar form in memory for access during ranking and grouping. From Vespa 7.102, the enum store used to hold attribute data uses a set of smaller buffers instead of one large one. This typically cuts static memory usage by 5%, but more importantly reduces peak memory usage (during background compaction) by 30%. Prometheus Monitoring Support Integrating with the Prometheus open source monitoring solution is now easy to do using the new interface to Vespa metrics. Read more. Query Dispatch Integrated in Container The Vespa query flow is optimized for multi-phase evaluation over a large set of search nodes. Since Vespa 7.109.10, the dispatch function is integrated into the Vespa Container process, which simplifies the architecture with one less service to manage. Read more. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, September 2019: Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container

September 28, 2019
Bug fixes and improvements September 26, 2019
September 26, 2019
Share

Bug fixes and improvements

Tiffany Kyi, Software Engineer, Verizon Media Over the last month, we’ve made changes to improve Screwdriver performance, enhance the meta-cli, and fix some feature bugs. Performance - Removed aggregate view - this feature was making calls to get all builds for all jobs in a pipeline; removing this view greatly decreases the load on our API - Added indexes for querying the builds table - this change should speed up API calls to get jobs Meta - Allow json values in meta get/set - can set json values in meta using --json-value or -j - --external works even if the external job did not trigger the current one; it will fetch meta from the external job’s last successful build Bugs - Fix for ignoreCommitsBy - will skip ci for any authors that match the ignoreCommitsBy field (set by the cluster admin) - Fix cmdPath when running with sourceDir - the working directory is fixed now - Version or tag-specific template and command URLs - now, when switching to a different version or tag in the UI, the URL will update accordingly (e.g., clicking on the latest tag for the python/validate template -> https://cd.screwdriver.cd/templates/python/validate_type/0.2.120 (corresponding version)) - Use new circuit-fuses - the latest package has enhanced logging options (https://github.com/screwdriver-cd/circuit-fuses/pull/23) - PR authors should not be able to restart builds if restrictPR is on Compatibility List Note: You will need to pull in the buildcluster-queue-worker/queue-worker first, then the API, otherwise you will get data-schema failures. In order to have these improvements, you will need these minimum versions (please read the note above): - API - v0.5.751 - UI - v1.0.447 - Store - v3.10.0 - Launcher - v6.0.18 - Build cluster queue worker - v1.3.7 - Queue worker - v2.7.11 Contributors Thanks to the following contributors for making this feature possible: - adong - d2lam - ibu1224 - jithin1987 - klu909 - parthasl - scr-oath Questions and Suggestions We’d love to hear from you. 
If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Bug fixes and improvements

September 26, 2019
Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System September 25, 2019
September 25, 2019
Share

Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Kishor Patil, Sr. Principal Software Systems Engineer on the Verizon Media team. Kishor shares what’s new in Storm 2.0, an open source distributed real-time computation system, as well as, how Verizon Media uses and contributes to Storm. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System

September 25, 2019
Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export August 19, 2019
August 19, 2019
Share

Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export

Kristian Aune, Tech Product Manager, Verizon Media In the recent Vespa product update, we mentioned Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we’re excited to share the following feature updates with you: BM25 Rank Feature The BM25 rank feature implements the Okapi BM25 ranking function and is a great candidate to use in a first phase ranking function when you’re ranking text documents. Read more. Searchable Reference Attribute A reference attribute field can be searched using the document id of the parent document type instance as query term, making it easy to find all children for a parent document. Learn more. Tensor in Summary Features A tensor can now be returned in summary features. This makes rank tuning easier and can be used in custom Searchers when generating result sets. Read more. Metrics Export To export metrics out of Vespa, you can now use the new node metric interface. Aliasing metric names is possible and metrics are assigned to a namespace. This simplifies integration with monitoring products like CloudWatch and Prometheus. Learn more about this update. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export

August 19, 2019
Dash Open 11: Elide - Open Source Java Library - Easily Stand Up a JSON API or GraphQL Web Service August 14, 2019
August 14, 2019
Share

Dash Open 11: Elide - Open Source Java Library - Easily Stand Up a JSON API or GraphQL Web Service

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Aaron Klish, a Distinguished Architect on the Verizon Media team. Aaron shares why Elide, an open source Java library that enables you to stand up a JSON API or GraphQL web service with minimal effort, was built and how others can use and contribute to Elide. Learn more at http://elide.io/. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 11: Elide - Open Source Java Library - Easily Stand Up a JSON API or GraphQL Web Service

August 14, 2019
Artifacts Preview August 7, 2019
August 7, 2019
Share

Artifacts Preview

Alan Dong, Software Engineer, Verizon Media We have recently rolled out a highly requested feature: Artifacts Preview. With this feature, users can view unit test results and click through to other files without needing to download the files locally from $SD_ARTIFACTS_DIR. An example of unit tests in the Screwdriver UI: We also made artifacts a separate route so users can share artifact links with teammates. You can see the live demo at: https://cd.screwdriver.cd/pipelines/1/builds/141890/artifacts Implementation We went through multiple iterations of design prior to implementation to reach the above result, and redesigned the look and feel based on user feedback. First, the main concern in our design process was security. We wanted to make sure the artifact viewer was safe in the unlikely case that a generated artifact contained malicious code. In order to protect our users, we decided to embed HTML inside an iframe with the sandbox attribute turned on. An iframe, or inline frame, already serves as a layer of separation between our application and the artifacts we’re trying to load. Content in an iframe is able to access content from the parent frame only through a specific attribute. Using the sandbox attribute of an iframe allows for greater granularity and control over the framed content. Second, the next consideration in our design was authentication. We architected Screwdriver to be cloud-ready, with horizontal scalability in mind; thus, the main workhorses of the application - the UI, API, and Store - were split into microservices (see Overall Architecture). Due to this setup, all artifacts are stored in the Store, all data is shown in the UI, and the API acts as both the gateway and mediator between the UI and Store. The diagram below reflects this relationship: the UI communicates with the API, and the API sends back a 302 redirect to a Store link with a short-lived JWT token. 
After the link is returned, the UI makes a request with the link to get the appropriate artifacts from the Store. Third, the last main concern was user experience. We wanted to preserve the user’s content type when possible so users could view their artifacts natively in their proper format. The Store generally returns HTML with images, anchor links, CSS, or JavaScript as relative paths, as shown in the following examples: <img src="example.png" alt="image"> <a href="./example.html"> <link href="example.css"> <script src="example.js"> Our solution was to inject a customized script when the query parameter is ?type=preview to replace the relative paths so their URLs are prefixed by the API. This change allowed us to only inject code if the user is previewing artifacts through the Screwdriver UI. Otherwise, we return the user’s original content. One caveat of this design is that since we don’t override CSS content, some background URLs will not load correctly. Compatibility List In order to use this feature, you will need these minimum versions: - API - v0.5.722 - UI - v1.0.440 - Store - v3.10.0 - Launcher - v6.0.12 Contributors Thanks to the following contributors for making this feature possible: - adong Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
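The path rewriting that the injected script performs can be sketched roughly as follows. This is a simplified illustration, not Screwdriver’s actual script, and the API prefix URL is made up:

```javascript
// Sketch of rewriting relative src/href attributes so the browser requests
// artifact assets through the API. The prefix is hypothetical; the real
// injected script is part of the Screwdriver UI.
const API_PREFIX = "https://api.example.com/v4/builds/141890/artifacts/";

function rewriteRelativePaths(html) {
  // Only rewrite src/href values that are not absolute URLs, protocol-relative
  // URLs, fragments, or data: URIs.
  return html.replace(
    /\b(src|href)="(?!https?:|\/\/|#|data:)\.?\/?([^"]*)"/g,
    (_, attr, path) => `${attr}="${API_PREFIX}${path}"`
  );
}

const input = '<img src="example.png"><a href="./example.html">';
console.log(rewriteRelativePaths(input)); // both paths now start with API_PREFIX
```

A regex rewrite like this only touches HTML attributes, which matches the caveat above: url() references inside CSS are not rewritten, so some background images will not load.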

Artifacts Preview

August 7, 2019
Meet Yahoo Research at KDD 2019 August 2, 2019
August 2, 2019
Share

Meet Yahoo Research at KDD 2019

By Kim Capps-Tanaka, Chief of Staff, Yahoo Research If you’re attending KDD in Anchorage, Alaska, the Yahoo Research team would love to meet you! Send us an email or tweet to discuss research or job opportunities on the team. In addition to hosting a booth, we’re excited to present papers, posters, and talks. Sunday, August 4th - “Modeling and Applications for Temporal Point Processes”, Junchi Yan, Hongteng Xu, Liangda Li - 8am - 12pm, Summit 8 - Ground Level, Egan Monday, August 5th - “Time-Aware Prospective Modeling of Users for Online Display Advertising”, Djordje Gligorijevic, Jelena Gligorijevic, Aaron Flores - 8:40am - 9am, Kahtnu 2 - Level 2, Dena’ina - “The Future of Ads”, Brendan Kitts - 3pm-3:30pm, Kahtnu 2 - Level 2, Dena’ina - “Learning from Multi-User Activity Trails for B2B Ad Targeting”, Shaunak Mishra, Jelena Gligorijevic, Narayan Bhamidipati - 4:35pm-4:55pm, Kahtnu 2 - Level 2, Dena’ina - “Automatic Feature Engineering From Very High Dimensional Event Logs Using Deep Neural Networks”, Kai Hu, Joey Wang, Yong Liu, Datong Chen - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall Tuesday, August 6th - “Predicting Different Type of Conversions using Multi-Task Learning”, Junwei Pan, Yizhi Mao, Alfonso Ruiz, Yu Sun, Aaron Flores - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Carousel Ads Optimization in Yahoo Gemini Native”, Oren Somekh, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, Narayan Bhamidipati - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, Nicolas Grislain, Nicolas Perrin, Antoine Thabault - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall * Bold author names denote Yahoo Researchers. Hope to see you at KDD!

Meet Yahoo Research at KDD 2019

August 2, 2019
Dash Open 10: Moloch - Open Source Large Scale Indexed Packet Capture and Search System August 1, 2019
August 1, 2019
Share

Dash Open 10: Moloch - Open Source Large Scale Indexed Packet Capture and Search System

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Andy Wick and Elyse Rinne from the Verizon Media team. Andy and Elyse share why they built Moloch, an open source large scale indexed packet capture and search system, and how others can use and contribute to Moloch. Learn more at https://molo.ch and register for MolochON 2019 (Oct 1st, Sunnyvale, CA). Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 10: Moloch - Open Source Large Scale Indexed Packet Capture and Search System

August 1, 2019
Introducing Denali: An Open Source Themeable Design System July 31, 2019
July 31, 2019
Share

Introducing Denali: An Open Source Themeable Design System

By Jazmin Orozco, Product Designer, Verizon Media As designers on the Platforms and Technology team at Yahoo (now Verizon Media), we understand firsthand that creating polished and intuitive user interfaces (UI) is a difficult task - even more so when projects do not have dedicated resources for UI design. In order to provide a solution to this, we created an easy plug and play approach called Denali. Denali is a comprehensive and customizable design system and CSS framework that provides a scalable approach to efficient UI design and development. Denali has worked so well for us internally that we’ve decided to share it with the open source community in the hope that your projects may also benefit from its use. Denali is rooted in our experience designing for a wide variety of platform interfaces including monitoring dashboards, CI/CD tools, security authentication, and localization products. Some of these platforms, such as Screwdriver and Athenz, are even open source projects themselves. When creating Denali we audited these platforms to create a library of visually consistent and reusable UI components. These became the core of our design system. We then translated the components into a CSS framework and applied the design system across our products. In doing so we were able to quickly create consistent experiences across our product family. As a whole, Denali allows us to unify the visual appearance of our platform product family, enhance our user’s experience, and facilitate efficient cross-functional front-end development. We encourage you to use Denali as the UI framework for your own open source projects. We look forward to your feedback, ideas, and contributions as we continue to improve and expand Denali. 
The Denali Design System simplifies the UI design and development process by providing: - A component library with corresponding front-end frameworks - Customization for themes - An icon library with a focus on data and engineering topics such as data visualization, CI/CD, and security - Design principles Component Library and Frameworks Denali’s component library contains 20+ individual component types with a corresponding CSS framework. Components are framework-independent, allowing you to use only what you need. Additionally, we’ve started building out support for other industry-leading frameworks such as Angular, Ember, and React. Theme Customization Denali’s components support theming through custom variables. This means their visual appearance can be adapted easily to fit the visual style of any brand or catered towards specific use cases while maintaining the same structure. Data and Engineering Focused Icon Library Denali’s custom icon library offers over 800 solid and outline icons geared towards engineering and data. Icons are available for use as svg, png, and as a font. We also welcome icon requests through GitHub. Design Principles Denali’s comprehensive Design Principles provide guidelines and examples on the proper implementation of components within a product’s UI to create the best user experience. Additionally, our design principles have a strong focus on accessibility best practices. We are excited to share Denali with the open source community. We look forward to seeing what you build with Denali, as well as your contributions and feedback! Stay tuned for exciting updates and reach out to us on twitter @denali_design or via email. Acknowledgments Jay Torres, Chas Turansky, Marco Sandoval, Chris Esler, Dennis Chen, Jon Kilroy, Gil Yehuda, Ashley Wolf, Rosalie Bartlett

Introducing Denali: An Open Source Themeable Design System

July 31, 2019
Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem July 30, 2019
July 30, 2019
Share

Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Ian Holmes, James Diss, and Varun Varma, from the Verizon Media team. Learn why they built and open sourced Panoptes, a global scale network telemetry ecosystem, and how others can use and contribute to Panoptes. Learn more at https://getpanoptes.io/. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem

July 30, 2019
Introducing Ariel: Open Source AWS Reserved Instances Management Tooling July 1, 2019
July 1, 2019
Share

Introducing Ariel: Open Source AWS Reserved Instances Management Tooling

Sean Bastille, Chief Architect, Verizon Media Effectively using Reserved Instances (RIs) is a cornerstone of managing costs in AWS. Properly evaluating RI needs can be challenging, and there are many tools, each with their own nuances, that help do so. At Verizon Media, we built a tool to help manage our RIs, called Ariel, and today we are pleased to announce that we have open-sourced Ariel so that you can use and customize it for your own needs. Why We Built Ariel The main reason we chose to build Ariel was the limitations of currently available solutions. Amazon provides RI recommendations, both as an executive service and through Cost Explorer; however, these tools: - Target break-even RI utilization, without the flexibility to tune - Evaluate per-account RI need, or company-wide RI need, but are not capable of combining the views Additionally, Ariel has a sophisticated configuration allowing for multiple passes of RI evaluations targeting usage-slope-based thresholds and allowing for simultaneous classic and convertible RI recommendations. While there are third-party vendor tools that help optimize RI utilization, we did not find an open source solution that was free to use and could be expanded upon by a community. How Ariel Reduces EC2 Costs RIs are a core component of cost management at Verizon Media. By using RIs, we reduce EC2 costs in some workloads by as much as 70%, which in turn reduces our AWS bill by about 25%. Ariel helps us evaluate RI purchases by determining: - What our current RI demand is, taking into consideration existing RIs - Floating RIs, which are not used in the purchasing account but are available to the company - Which specific accounts to make purchases in so the costs can be more closely aligned with P&Ls Explore and Contribute We invite you to use Ariel and join our community by developing more features and contributing to the code. If you have any questions, feel free to email my team.

July 1, 2019
Trusted templates and commands June 26, 2019

By Dekus Lam, Developer Advocate, Verizon Media, and Tiffany Kyi, Software Engineer, Verizon Media

Screwdriver offers templates and commands, which simplify configuration across pipelines by encapsulating sets of predefined steps for jobs, and save time by letting jobs reuse prebuilt sets of instructions. Since templates and commands are mostly sourced and powered by the developer community, a standard process is needed to promote some of them as certified, or as the Screwdriver team calls it, "Trusted". Although the certification process may vary among teams and companies, Screwdriver provides the abstraction for system administrators to easily promote or demote participating templates and commands. Certified, or "Trusted", templates and commands receive a special badge next to their name on both the search listing and the detail page.

"Trusted" toggle button for Screwdriver Admins:

Compatibility List

In order to use this feature, you will need these minimum versions:

- API (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.705
- UI (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.432

Contributors

Thanks to the following contributors for making this feature possible:

- DekusDenial
- tkyi

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
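Trusted or not, a template is consumed the same way in a job's configuration. A minimal sketch of a screwdriver.yaml that uses one (the template namespace, name, and version here are hypothetical):

```yaml
jobs:
  main:
    requires: [~pr, ~commit]
    # Pulls in the image and steps defined by the template's author;
    # a "Trusted" badge on the template page signals admin certification.
    template: example/nodejs-main@1.0.0
```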

Dash Open 08: Bullet - Open Source Real-Time Query Engine for Large Data Streams June 12, 2019

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Nate Speidel, a Software Engineer at Verizon Media. Nate shares why his team built and open sourced Bullet, a real-time query engine for very large data streams, and how others can use and contribute to Bullet. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Shared Verizon Media’s AI Solutions at the AI Accelerator Summit - Automobile Traffic Flow Monitoring, Cellular Network Performance Prediction, IoT Analytics, and Threat Detection June 8, 2019

By Chetan Trivedi, Head of Technical Program Management (Verizon Solutions Team), Verizon Media

I recently spoke at the AI Accelerator Summit in San Jose. During my presentation, I shared a few of Verizon Media's AI solutions via four machine learning use cases:

- Cellular Network Performance Prediction - We implemented a time-series prediction model for base station parameters such as bearer drop, SIP, and handover failure.
- Threat Detection System - A DDoS (Distributed Denial of Service) use case where we implemented real-time threat detection using time-series data.
- Automobile Traffic Flow Monitoring - A collaboration with a city to identify traffic patterns at certain junctions and streets, providing insights the city can use to improve traffic flow and address safety concerns.
- IoT Analytics - Detecting vending machine anomalies and addressing them before dispatching a service vehicle with personnel, which is very costly for businesses.

During the conference, I heard many talks that reinforced common machine learning and AI industry themes:

- Key factors in selecting the right use cases for your AI/ML efforts include understanding your error tolerance and ensuring you have sufficient training data.
- Implementing AI/ML at scale (with a high volume of data) and moving toward deep learning for supported use cases, where data is highly dimensional and/or higher prediction accuracy is required and there is enough data to train deep learning models.
- Using ensemble learning techniques such as bagging, boosting, or other variants of these methods.

At Verizon Media, we've built and open sourced several helpful tools focused on big data, machine learning, and AI, including:

- DataSketches - a high-performance library of stochastic streaming algorithms, commonly called "sketches" in the data sciences.
- TensorFlowOnSpark - brings TensorFlow programs to Apache Spark clusters.
- Trapezium - a framework to build batch, streaming, and API services and deploy machine learning models using Spark and Akka compute.
- Vespa - a big data serving engine.

If you'd like to discuss any of the above use cases or open source projects, feel free to email me.
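As a toy illustration of the bagging idea mentioned above (standard library only, with an invented one-dimensional "model"; none of this comes from the talks): train many weak models on bootstrap resamples of the data, then combine their predictions by majority vote.

```python
import random

def bootstrap(data, rng):
    """Sample len(data) points from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_threshold(sample):
    """A deliberately weak 'model': classify above/below the sample mean."""
    mean = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > mean else 0

def bagged_predict(models, x):
    """Majority vote across the ensemble."""
    votes = sum(m(x) for m in models)
    return 1 if votes > len(models) / 2 else 0

rng = random.Random(0)
# Labeled 1-d points: positive class for x >= 0.5.
data = [(x / 10, 1 if x >= 5 else 0) for x in range(10)]
models = [train_threshold(bootstrap(data, rng)) for _ in range(25)]
```

Each resample shifts the threshold slightly; averaging over many such models reduces the variance of any single one, which is the core appeal of bagging.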

Apache Storm 2.0 Improvements May 30, 2019

By Kishor Patil, Principal Software Systems Engineer at Verizon Media and PMC member of Apache Storm, and Bobby Evans, Apache Member and PMC member of Apache Hadoop, Spark, Storm, and Tez

We are excited to be part of the new release of Apache Storm 2.0.0. The open source community has been working on this major release for quite some time. At Yahoo we had a long-standing commitment to using and contributing to Storm, a commitment we continue as part of Verizon Media. Together with the Apache community, we've added more than 1,000 fixes and improvements to this release. These improvements include sending real-time infrastructure alerts to the DevOps folks running Storm and the ability to augment ingested content with related content, giving users a deeper understanding of any one piece of content.

Performance

Performance and utilization are very important to us, so we developed a benchmark to evaluate various stream processing platforms, and the initial results showed Storm to be among the best. We expect to release new numbers by the end of June 2019, but in the interim we ran some smaller Storm-specific tests that we'd like to share. Storm 2.0 has a built-in load generation tool under examples/storm-loadgen. It comes with the requisite word count test, which we used here, but it can also capture a statistical representation of the bolts and spouts in a running production topology and replay that load on another topology, or on another version of Storm. For this test, we backported that code to Storm 1.2.2 and then ran the ThroughputVsLatency test on both code bases at various throughputs and with different numbers of workers to see what impact Storm 2.0 would have.

These tests were run out of the box, with no tuning of the default parameters except setting max.spout.pending in the topologies to 1000 sentences; in the past that has proven to be a good balance between throughput and latency while providing flow control in the 1.2.2 version, which lacks backpressure. In general, for a WordCount topology, we saw 50% - 80% improvements in latency for processing a full sentence. Moreover, the 99th percentile latency is, in most cases, lower than the mean latency in 1.2.2. We also saw the maximum throughput on the same hardware more than double. Why did this happen? STORM-2306 redesigned the threading model in the workers, replaced Disruptor queues with JCTools queues, added a new true backpressure mechanism, and optimized many code paths to reduce the overhead of the system. The impact on system resources is very promising. Memory usage was untouched, but CPU usage was more nuanced. At low throughput (< 8,000 sentences per second) the new system uses more CPU than before; this can be tuned, as the system does not yet auto-tune itself. At higher rates, the slope of the line is much lower, which means Storm has less overhead than before and can process more data on the same hardware. We were able to max out each of these configurations at over 100,000 sentences per second on 2.0.0, more than 2x the maximum of 45,000 sentences per second that 1.2.2 could reach with the same setup, and we did nothing to tune these topologies on either setup. With true backpressure, and with the event tracking feature disabled entirely, a WordCount topology could consistently process over 230,000 sentences per second in a stable way, which equates to over 2 million messages per second being processed on a single node.

Scalability

In 2.0, we have laid the groundwork to make Storm even more scalable.
Workers and supervisors can now heartbeat directly into Nimbus instead of going through ZooKeeper, making it possible to run much larger clusters out of the box.

Developer Friendly

Prior to 2.0, Storm was primarily written in Clojure. Clojure is a wonderful language with many advantages over pure Java, but its prevalence in Storm became a hindrance for many developers who weren't familiar with it and didn't have the time to learn it. Because of this, the community decided to port all of the daemon processes over to pure Java. We still maintain a backward-compatible storm-clojure package for those who want to continue using Clojure for topologies.

Split Classpath

In older versions, Storm shipped as a single jar that included code for the daemons as well as the user code. We have now split this up: storm-client provides everything needed for your topology to run. storm-core can still be used as a dependency for tests that want to run a local-mode cluster, but it will pull in more dependencies than you might expect. To upgrade your topology to 2.0, you'll just need to switch your dependency from storm-core 1.2.2 to storm-client 2.0.0 and recompile.

Backward Compatible

Even though Storm 2.0 is API-compatible with older versions, upgrading can be difficult when running a hosted multi-tenant cluster: coordinating a cluster upgrade with recompiling all of the topologies can be a massive task. Starting in 2.0.0, Storm has the option to run workers for topologies submitted with an older version using a classpath for a compatible older version of Storm. This important feature, developed by our team, allows you to upgrade your cluster to 2.0 while still upgrading your topologies whenever they are recompiled to use newer dependencies.
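For a Maven-built topology, the storm-core to storm-client switch described above is a small pom.xml change (a sketch; your scope and surrounding configuration may differ):

```xml
<!-- Before 2.0 this was artifactId storm-core, version 1.2.2. -->
<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-client</artifactId>
  <version>2.0.0</version>
  <!-- "provided" because the cluster supplies Storm at runtime -->
  <scope>provided</scope>
</dependency>
```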
Generic Resource Aware Scheduling

With the new generic resource aware scheduling strategy, it is now possible to specify generic resources along with CPU and memory, such as network, GPU, or any other cluster-level resource. Topologies can specify such generic resource requirements per component, resulting in better scheduling and stability.

More To Come

Storm is a secure, enterprise-ready stream processing platform, but there is always room for improvement, which is why we're adding support to run workers in isolated, locked-down containers so there is less chance of malicious code using a zero-day exploit in the OS to steal data. We are redesigning metrics and heartbeats to scale even better and, more importantly, to automatically adjust your topology so it can run optimally on the available hardware. We are also exploring running Storm on other systems, to provide a clean base to run not just on Mesos but also on YARN and Kubernetes. If you have any questions or suggestions, please feel free to reach out via email. P.S. We're hiring! Explore the Big Data Open Source Distributed System Developer opportunity here.

Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements May 29, 2019

By Kristian Aune, Tech Product Manager, Verizon Media

In a recent post, we mentioned tensor updates, query tracing, and coverage. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to evolve. For May, we're excited to share the following feature updates with you:

Multithreaded Disk Index Fusion

Content nodes can now sustain a higher feed rate by using multiple threads for disk index fusion. Read more.

Feeding Improvements

Cluster-internal communications are now multithreaded out of the box, for high-throughput feeding operations. This fully utilizes a 10 Gbps network and improves utilization of high-CPU content nodes.

Ideal State Optimizations

Whenever the content cluster state changes, the ideal state is calculated. This is now optimized (faster and runs less often), so state transitions like a node going up or down have less impact on read and write operations. Learn more in the dynamic data distribution documentation.

Download Machine Learning Models During Deploy

One procedure for importing ML models into Vespa is to put them in the application package, in the models directory. Applications whose models are trained frequently in an external system can instead refer to the model by URL rather than including it in the application package. This use case is now documented in deploying remote models, and solves the challenge of deploying huge models.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.

Custom Source Directory May 24, 2019

By Min Zhang, Software Engineer, Verizon Media, and Tiffany Kyi, Software Engineer, Verizon Media

Previously, you were limited to one screwdriver.yaml at the root of each SCM repository, which prevented users from running workflows based on subdirectories in a monorepo. Now you can specify a custom source directory for your pipeline, which means you can create multiple pipelines on a single repository!

Usage

The directory path is relative to the root of the repository, and you must have a screwdriver.yaml under your source directory.

Example

Given a repository with the file structure depicted below:

┌── README.md
├── screwdriver.yaml
├── myapp1/
│   └── test.js
│   ...
├── myapp2/
│   ├── app/
│   │   ├── main.js
│   │   ├── ...
│   │   └── package.json
│   └── screwdriver.yaml
│   ...

Create pipeline with source directory

Update pipeline with source directory

In this example, jobs with requires: [~commit, ~pr] will be triggered if there are any changes to files under myapp2.

Caveats

- This feature is only available for the Github SCM right now.
- If you use sourcePaths together with a custom source directory, the scope of the sourcePaths is limited to your source directory; you cannot listen for changes outside your source directory. Note that the path for your sourcePaths is relative to the root of the repository, not your source directory. For example, if you want sourcePaths to listen for changes to main.js and screwdriver.yaml, you should set: sourcePaths: [myapp2/app/main.js, myapp2/screwdriver.yaml]. Setting sourcePaths: [app/main.js] will not work, as it is missing the source directory myapp2, and you cannot set a relative source path. Setting sourcePaths: [myapp1/test.js] will not work, as it is outside the scope of your source directory, myapp2.
- The screwdriver.yaml must be located at the root of your custom source directory.

Compatibility List

In order to use this feature, you will need these minimum versions:

- API (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.692
- UI (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.425
- Launcher (https://hub.docker.com/r/screwdrivercd/launcher) - v6.0.4

Contributors

Thanks to the following contributors for making this feature possible:

- minz1027
- tkyi

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
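Putting the sourcePaths caveat together with the example repository layout, a job in myapp2/screwdriver.yaml that listens only on those two paths might look like this (the image and step shown are illustrative, not from the example repository):

```yaml
jobs:
  main:
    image: node:10
    requires: [~pr, ~commit]
    # Paths are relative to the repository root, not the source directory.
    sourcePaths: [myapp2/app/main.js, myapp2/screwdriver.yaml]
    steps:
      - test: npm test
```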

Announcing Prototrain-ranker: Open Source Search and Ranking Framework May 16, 2019

By Huy Nguyen, Research Engineer, Verizon Media, and Eric Dodds, Research Scientist, Verizon Media

E-commerce fashion and furniture sites use a fundamentally different way of searching for content, based on visual similarity. We call this "Search 2.0" in homage to Andrej Karpathy's Software 2.0 essay. Today we're announcing the release of an open source ranking framework called prototrain-ranker, which you can use in your modern search projects. It is based on our extensive research in search technology and ranking optimization. We'll describe the visual search problem, how it fits into a developing trend in search engines and the evolving technologies surrounding the industry, and why we open sourced our model and machine learning framework; we invite you to use it and work with us to improve it.

The Search 1.0 stack is one that many engineers and search practitioners are familiar with. It involves indexing documents and relies on matching keywords to terms in a collection of documents to surface relevant content at query time. In contrast, Search 2.0 relies on "embeddings" rather than documents, and on k-nearest-neighbors retrieval rather than term matching. The programmer does not directly specify the map from content to embeddings; instead, the programmer specifies how this map is derived from data. Think of embeddings as points in a high-dimensional space that represent some piece of content or metadata. In a Search 2.0 system, embeddings lying close to each other in this space are more highly "related" or "relevant" than points that are far apart. Instead of parsing a query for specific terms and then matching those terms against a document index, a Search 2.0 system encodes the query into the embedding space and retrieves the data associated with nearby embeddings.
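To make the retrieval step concrete, here is a minimal, self-contained sketch of nearest-neighbor lookup over embeddings. The toy 3-dimensional vectors and product names are invented for illustration; a real system like prototrain-ranker learns high-dimensional embeddings and retrieves with optimized matrix routines rather than a Python loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest(query, embeddings, k=2):
    """Indices of the k embeddings most similar to the query."""
    order = sorted(range(len(embeddings)),
                   key=lambda i: cosine(query, embeddings[i]),
                   reverse=True)
    return order[:k]

# Toy product "embeddings"; a trained model would produce these.
catalog = [
    [1.0, 0.1, 0.0],  # e.g. "red sofa"
    [0.9, 0.2, 0.1],  # e.g. "crimson couch"
    [0.0, 0.1, 1.0],  # e.g. "oak table"
]
query = [1.0, 0.0, 0.0]  # the encoded query lands near the sofas
ids = nearest(query, catalog)
```

The two sofa-like items come back first, even though no keywords were matched; proximity in the embedding space is doing the work that term matching does in Search 1.0.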
Prototrain-ranker provides two things: (1) a "ranker" model for mapping content to embeddings and performing search, and (2) our "prototrain" framework for training prototype machine learning models such as the ranker.

Why Search 2.0

Whether we're searching over videos, images, text, or other media, we can represent each type of data as an embedding using appropriate deep learning techniques. Representing data as embeddings in a high-dimensional space opens up the world of search to the powerful machinery of deep learning tools. We can learn ranking functions and directly encode "relevance" into embeddings, avoiding the need for brittle, hand-engineered ranking functions. For example, it would be error-prone and tedious to program a Search 1.0 engine to respond to queries like "images with a red bird in the upper right-hand corner". One could certainly build specific classifiers for each of these attributes (color, object, location) and index them, but each individual classifier and each rule to parse the results would take work to build and test, with any new attribute entailing additional work and opportunities for errors and brittleness. Instead, one could build a Search 2.0 system by obtaining pairs of images and descriptions that directly capture one's notion of "relevance" and training an end-to-end ranking model. The flexibility of this approach - defining relevance as an abstract distance using examples rather than potentially brittle rules - enables several other capabilities in a straightforward manner. These capabilities include multi-modal input (e.g. text with an image), interpolating between queries ("something between this sofa and that one"), and conditioning a query ("a dress like this, but in any color"). Reframing search as nearest-neighbor retrieval also has other benefits: it separates the process of ranking from the process of storing data. In doing so, we reduce the rules and logic of Search 1.0 matching and ranking to a portable matrix multiplication routine. This makes the search engine massively parallel and allows it to take advantage of GPU hardware, which has been optimized over decades to execute matrix multiplication efficiently.

Why we open sourced prototrain-ranker

The code we are open sourcing today enables a key component of a Search 2.0 system: it lets you "learn" embeddings by defining pairs of relevant and irrelevant data items. As an example, we provide the processing necessary to train the model on the Stanford Online Products dataset, which contains multiple images of each of thousands of products; the notion of relevance here is that two images contain the same item. We also use the prototrain framework to train other machine learning models, such as image classifiers. You can too. Please check out the framework and/or the ranker model. We hope you will have questions or comments, and will want to contribute to the project. Engage with us via GitHub, or email us directly if you have questions.
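Training on pairs of relevant and irrelevant items can be sketched as a standard contrastive loss on the distance between two embeddings. This is a generic textbook formulation for illustration, not prototrain-ranker's actual loss or hyperparameters:

```python
def pair_loss(dist, relevant, margin=1.0):
    """Contrastive loss on the embedding distance of one pair.

    Relevant pairs are pulled together (any distance is penalized);
    irrelevant pairs are pushed apart until they clear the margin,
    after which they contribute no gradient.
    """
    if relevant:
        return dist ** 2
    return max(0.0, margin - dist) ** 2
```

Minimizing this over many labeled pairs is what shapes the embedding space so that nearest-neighbor retrieval returns relevant items first.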

Announcing the new Bay Area CI/CD and DevOps Meetup - Join us on May 21st at the Yahoo Campus in Sunnyvale May 13, 2019

By Ashley Wolf, Open Source Program Manager, Verizon Media

Continuous Delivery (CD) enables software development teams to move faster and adapt to users' needs more quickly by reducing the friction inherent in releasing software changes. Releasing new software was once considered risky; by implementing CD, we confront that fragility and improve engineering resilience by delivering new software constantly and automatically. At Yahoo, we built a tool called Screwdriver that enabled us to implement CD at incredible scale. In 2016, Yahoo open sourced Screwdriver.cd, a streamlined build system designed to enable Continuous Delivery to production at scale for dynamic infrastructure. In the spirit of open source and community, Verizon Media's Open Source and Screwdriver teams (formerly Yahoo) started the Bay Area CI/CD and DevOps Meetup to build and grow a community around continuous delivery. We are planning frequent meetups hosted at Yahoo, where industry experts from the Bay Area will be invited to share stories about their CI/CD and DevOps experiences. We invite you to join us. Our first meetup is on May 21st, 5pm to 8:30pm, at Yahoo in Sunnyvale. Learn from speakers at Walmart Labs, SMDV, and Aeris. RSVP here.

Agenda

- 5-5:45: Pizza & networking
- 5:45-6: Welcome & introductions
- 6-6:30: "Supercharging the CI/CD pipelines at Walmart with Concord", Vilas Veeraraghavan, Director of Engineering, Walmart Labs. This talk will focus on how Concord (https://concord.walmartlabs.com/), an open source workflow orchestration tool built at Walmart, helped supercharge the continuous delivery pipelines used by application teams. We will start with an overview of the state of CI/CD in the industry, then showcase the progress made at Walmart and the upcoming innovations we are working on.
- 6:30-7: "Successful Continuous Build, Integration & Deployment + Continuous or Controlled Delivery?", Karthi Sadasivan, Director of Engineering (DevOps), Aeris. This talk will explore ways to improve speed, quality, and security, as well as how to align tools, processes, and people.
- 7-7:30: "Practical CI/CD for React Native Apps", Ariya Hidayat, EIR, SMDV. React Native has emerged as a popular solution for building Android and iOS applications from a single code base written in JavaScript/TypeScript. For teams just starting to embrace React Native, best practices for rock-solid development and deployment are not yet widely covered. In this talk, we will discuss practical CI/CD techniques that help your team accelerate toward world-class, high-quality React Native apps: automated build and verification for every single revision, continuous checks of code quality metrics, and easy deployment to the QA/QE/verification team.
- 7:30-8:30: Open discussion. Using one of the available microphones, share a question or thought with attendees. Collectively, let's discuss CI/CD and DevOps struggles and opportunities.

Speakers

- Vilas Veeraraghavan, Director of Engineering, Walmart Labs. Vilas joined Walmart Labs in 2017 and leads the teams responsible for the continuous integration, testing, and deployment pipelines for eCommerce and Stores. Prior to joining Walmart Labs, he had long stints at Comcast and Netflix, where he wore many hats as automation, performance, and failure testing lead.
- Karthi Sadasivan, Director of Engineering (DevOps), Aeris. Karthi heads the DevOps practice at Aeris Communications. She has 18+ years of global IT industry experience, with expertise in product engineering services, DevOps, agile engineering, and Continuous Delivery. A DevOps evangelist, practitioner, and enabler, she enjoys architecting, implementing, and delivering end-to-end DevOps solutions across multiple industry domains. Karthi is a thought leader and solution finder with a strong passion for solving business problems by bringing people, processes, and technologies together.
- Ariya Hidayat, EIR, SMDV. Ariya's official day job is running start-up engineering teams, and he has done that a couple of times already. Yet he is equally excited about building open source tools such as PhantomJS (the world's first headless browser) and Esprima (one of the most popular npm modules). Through his active involvement in developer communities, Ariya is on a mission to spread the gospel of engineering excellence, and so far he has delivered over a hundred tech talks on various subjects.

Meetups are a great way to learn about the latest technology trends, open source projects you can join, and networking opportunities that can turn into your next great job. So come for the talks, stay for the conversation, pizza, refreshments, and cookies. Invest in your tech career and meet people who care about the things you care about.

Want to get involved?

- Join the Meetup group to find out about upcoming events.
- RSVP for the May 21st Meetup.
- Volunteer at an upcoming meetup.
- Apply to be a speaker.
- Ask us about anything; we're open to working with you.

Expanding Environment Variables May 13, 2019

By Dao Lam, Software Engineer, Verizon Media

Previously, Screwdriver users had to rely on string substitution or create a step to export variables in order to get environment variables evaluated. For example:

    RESTORE_STATEFILE=`echo ${RESTORE_STATEFILE} | envsubst '${SD_ARTIFACTS_DIR}'`

Or:

    steps:
        - export-env: export RESTORE_STATEFILE=${SD_ARTIFACTS_DIR}/statefile

With this change, users can now expand environment variables within the environment field in screwdriver.yaml, like below:

    jobs:
      main:
        image: node:10
        environment:
          FOO: hello
          BAR: "${FOO} world"
        steps:
          - echo: echo $BAR # prints "hello world"
        requires: [~pr, ~commit]

Setting default cluster environment variables (for cluster admins)

Cluster admins can now set default environment variables to be injected into user builds. They can be configured in the API config under the field builds: { environment: {} } or via CLUSTER_ENVIRONMENT_VARIABLES. Read more about how to configure the API here.

Order of evaluation

Environment variables are now evaluated in this order:

- User secrets
- Base ENV set by the launcher, such as SD_PIPELINE_ID, SD_JOB_ID, etc.
- Cluster ENV set by the cluster admin
- Build ENV set by the user in the screwdriver.yaml environment field

Important note when pulling in this feature

The new API version v0.5.677 needs to be pulled in together with the new launcher version v6.0.1, because it includes a breaking change (GET /v4/builds now returns environment as an array to ensure in-order evaluation inside the launcher). Please schedule a short downtime when pulling this feature into your cluster to ensure the API and launcher are on compatible versions. The versions working before this change were API v0.5.667 and launcher v5.0.75.

Compatibility List

In order to use this feature, you will need these minimum versions (please read the note above):

- API (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.677
- Launcher (https://hub.docker.com/r/screwdrivercd/launcher) - v6.0.1

Contributors

Thanks to the following contributors for making this feature possible:

- d2lam

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Vespa use case: shopping May 3, 2019

Imagine you are tasked with creating a shopping website. How would you proceed? What tools and technologies would you choose? You need a technology that lets you create data-driven navigational views as well as search and recommend products. It should be really fast and able to scale easily as your site grows, both in number of visitors and in number of products. And because good search relevance and product recommendation drive sales, it should be possible to use advanced features such as machine-learned ranking to implement them. Vespa - the open source big data serving engine - allows you to implement all of these use cases in a single backend. As it is a general engine for low-latency computation, it can be hard to know where to start. To help with that, we have provided a detailed shopping use case with a sample application. This sample application contains a fully functional shopping-like front end with reasonably advanced functionality right out of the box, including sample data. While this is an example of a searchable product catalog, with customization it could be used for other application types as well, such as video and social sites. The features highlighted in this use case are:

- Grouping - used, for instance, in search to aggregate the results of a query into categories, brands, item ratings, and price ranges.
- Partial update - used when liking product reviews.
- Custom document processors - used to intercept the feeding of product reviews to update the product itself.
- Custom handlers and configuration - used to power the front end of the site.

Our goal is to start a new series of example applications, each showcasing different features of Vespa in the context of practical applications. The use cases can serve as starting points for new applications, as they contain fully functional Vespa application packages, including sample data for getting started quickly.
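To give a flavor of the grouping feature, a Vespa query can aggregate hits by a field directly in YQL, with the grouping expression appended after a pipe. This is a generic sketch; the field names are hypothetical and not necessarily those used by the sample application:

```
select * from sources * where default contains "sofa" |
    all(group(brand) each(output(count())))
```

A single request like this returns both the matching products and, alongside them, a count of hits per brand, which is exactly the kind of data-driven navigational view (facets by brand, category, or price range) a shopping site needs.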
The use cases complement the quick start guide, which gives a very basic introduction to getting up and running with Vespa, and the tutorial, which is much more in-depth. With the use case series, we want to fill the gap between these two with something closer to the practical problems users want to solve with Vespa. Take a look for yourself. More information can be found at https://docs.vespa.ai/documentation/use-case-shopping.html

Vespa use case: shopping

May 3, 2019
Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics May 1, 2019

Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Paul Donnelly, a Principal Engineer at Verizon Media, interviews Eddie Bortnikov, Senior Director of Research, and Eshcar Hillel, Senior Research Scientist. Eddie and Eshcar share how Druid (open source data store designed for sub-second queries on real-time and historical data) inspired their team to build Oak, an open source scalable concurrent key-value map for big data analytics, and how companies can use and contribute to Oak. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Announcing the 3rd Annual Moloch Conference - Add visibility to your security infrastructure April 16, 2019

Announcing the 3rd Annual Moloch Conference - Add visibility to your security infrastructure

Elyse Rinne, Software Engineer, Verizon Media

We're excited to share that the 3rd Annual MolochON will be held on October 1st, 2019 in Sunnyvale, California, at the Yahoo (now Verizon Media) Campus. Moloch is a large scale, open source, indexed packet capture and search system, used by Verizon Media to help store and index network traffic for analysis. If your company has a network security team as part of your "Blue Team", you'll want to attend this event. We will also be hosting workshops on October 2nd for users interested in getting training and advice from the Moloch creators and experts about using packet scanning tools to detect questionable network traffic. Register here for the conference and workshops.

Presentations will be given by the Moloch creators and developers: "Moloch - Recent Changes & Upcoming Features" by Andy Wick, Sr Princ Architect, Verizon Media & Elyse Rinne, Software Dev Engineer, Verizon Media. Since the last MolochON, many new features have been added to Moloch. We will review some of these features and demo how to use them. We will also discuss a few desired upcoming features and talk about the 2.0 release.

Speaker Bios:
- Andy is the creator of Moloch and former Architect of AIM. He joined the AOL security team in 2011 and his motto is "the truth is in the packet".
- Elyse is the UI and full stack engineer for Moloch. She revamped the UI to be more user-friendly and maintainable. Now that the revamp has been completed, Elyse is working on implementing awesome new features to make Moloch the go-to open source tool for network security professionals!

Do you want to present at MolochON? Submit a talk here and share with the community how you use Moloch, or what other tools and tips you have about packet sniffing. We're always looking for new ideas, and speaking at the event is a great way to join our open source community.
After the conference, enjoy our complimentary happy hour where we’ll party like network security professionals and we’ll swap stories about how we catch the bad actors. Hope to see you there!

Dash Open 06: Apache Omid - Open Source Transaction Processing Platform for Big Data April 15, 2019

Dash Open 06: Apache Omid - Open Source Transaction Processing Platform for Big Data

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Paul Donnelly, a Principal Engineer at Verizon Media, interviews Eddie Bortnikov, Senior Director of Research and Ohad Shacham, Senior Research Scientist. Eddie and Ohad share the inspiration behind Omid, an open source transaction processing platform for Big Data, and how companies can use and contribute to Omid. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Pull Request chain April 15, 2019

Pull Request chain

The pull request chaining feature expands the workflow capabilities available when running pull request builds. By default, during a pull request Screwdriver will run only those jobs which have ~pr in their requires field. With the pull request chain feature turned on, there is no such restriction.

```yaml
shared:
  image: node:8
annotations:
  screwdriver.cd/chainPR: true
jobs:
  first-job:
    requires: [ ~pr, ~commit ]
    steps:
      - echo: echo "this is first job."
  second-job:
    requires: [ first-job ]
    steps:
      - echo: echo "this is second job."
```

With the chainPR annotation set to true, the pipeline workflow config above will run second-job after first-job on a pull request.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.641
- UI - v1.0.396

Contributors

Thank you to the following contributors for making this feature possible:
- Hiroki tktk, Software Engineer, Yahoo Japan
- Yomei K, Software Engineer, Yahoo Japan
- Teppei Minegishi, Software Engineer, Yahoo Japan
- Yuichi Sawada, Software Engineer, Yahoo Japan
- Yoshika Shota, Software Engineer, Yahoo Japan

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
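The effect of chainPR can be pictured as a filter over the workflow graph: without it, only jobs requiring ~pr run on a pull request; with it, their downstream jobs run too. A simplified Python sketch of that selection logic (not the actual Screwdriver engine, which handles many more cases):

```python
def pr_jobs(jobs, chain_pr=False):
    """Return the set of jobs that run for a pull request.

    Without chainPR, only jobs that require ~pr run.  With chainPR,
    jobs downstream of a PR-triggered job run as well.
    """
    selected = {name for name, req in jobs.items() if "~pr" in req}
    if chain_pr:
        changed = True
        while changed:  # follow the requires edges transitively
            changed = False
            for name, req in jobs.items():
                if name not in selected and selected & set(req):
                    selected.add(name)
                    changed = True
    return selected

workflow = {"first-job": ["~pr", "~commit"], "second-job": ["first-job"]}
print(sorted(pr_jobs(workflow)))                 # ['first-job']
print(sorted(pr_jobs(workflow, chain_pr=True)))  # ['first-job', 'second-job']
```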

Build Metrics April 11, 2019

Build Metrics

Dao Lam, Software Engineer, Verizon Media
Dekus Lam, Software Engineer, Verizon Media

Screwdriver just released a new feature called Build Metrics, which gives users more insight into their pipeline, build, and step trends.

Viewing the metrics

You can now navigate to the Metrics tab in your pipeline to view these metrics graphs, or navigate to https://${SD_UI_URL}/pipelines/${PIPELINE_ID}/metrics. The first graph shows metrics across different events for the pipeline. An event is a series of builds triggered by a single action, which could be a commit, an external pipeline trigger, or a manual start. (Read more about workflow). The graph illustrates the following data about your pipeline:
- Total duration of each event
- Total time it takes to pull images across builds in each event
- Total time the builds spend in the queue in each event

The second graph shows a build duration breakdown for corresponding events from the first graph. The third graph shows the step breakdown across multiple builds for a specific job.

Chart Interactions:
- Legend to filter visibility of data
- Bar graph tooltip on hover for more details about the selected metric data
- Copy-to-clipboard button inside tooltip
- Preset time ranges & custom date ranges
- Toggle between UTC and local date time
- Toggle for trendline view
- Toggle for viewing only successful build data
- Drag-and-zoom & button to reset zoom level
- Deep links to step or build logs

Compatibility List

In order to use this feature, you will need these minimum versions:
- [UI](https://hub.docker.com/r/screwdrivercd/ui) - v1.0.408
- [API](https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.641

Contributors

Thanks to the following contributors for making this feature possible:
- chasturansky
- d2lam
- dekuslam
- parthasl

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
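The per-event totals in the first graph are straightforward aggregations over the builds that make up each event. A minimal Python sketch of that idea (the record fields are invented for illustration; Screwdriver computes these server-side):

```python
# Each build record carries timing data; an event groups several builds.
builds = [
    {"event": 1, "duration": 120, "image_pull": 15, "queued": 5},
    {"event": 1, "duration": 90,  "image_pull": 10, "queued": 2},
    {"event": 2, "duration": 60,  "image_pull": 12, "queued": 8},
]

def event_metrics(builds):
    """Sum build duration, image-pull time, and queue time per event."""
    out = {}
    for b in builds:
        m = out.setdefault(b["event"], {"duration": 0, "image_pull": 0, "queued": 0})
        for key in ("duration", "image_pull", "queued"):
            m[key] += b[key]
    return out

print(event_metrics(builds)[1])  # {'duration': 210, 'image_pull': 25, 'queued': 7}
```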

Dash Open 05: Makeskill Design Kit, the Open Source Multimodal Rapid Prototyping Suite for Alexa April 10, 2019

Dash Open 05: Makeskill Design Kit, the Open Source Multimodal Rapid Prototyping Suite for Alexa

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, I interview Lauren Tsung, who was previously a Sr. Designer for Yahoo Mail and Anna Shainskaya, a Sr. Designer for Yahoo Mail at Verizon Media. Lauren and Anna share their journey from designing chatbots to publishing Makeskill, an open source project for rapid prototyping Alexa Skills. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Meta Event Label and Stop a Build April 9, 2019

Meta Event Label and Stop a Build

Tiffany Kyi, Software Engineer, Verizon Media

We've introduced new UI features in Screwdriver on the pipeline events page! You can now:
- use meta to label events
- stop a running build from the pipeline event graph

Meta Event Label

You can label your events using the label key in metadata. This label can be useful when trying to identify which event to roll back to. To label an event, set the meta label key in your screwdriver.yaml. It will appear in the UI after the build is complete. Example screwdriver.yaml:

```yaml
jobs:
  main:
    steps:
      - set-label: |
          meta set label VERSION_3.0 # this will show up in your pipeline events page
```

Stop a Build

When a build is running or queued, you can now stop it using the dropdown from the pipeline events graph.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.639
- UI - v1.0.402

Contributors

Thanks to the following contributors for making this feature possible:
- tkyi

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Dash Open 04: Frode Lundgren - Building and Open Sourcing Vespa, the Big Data Serving Engine April 4, 2019

Dash Open 04: Frode Lundgren - Building and Open Sourcing Vespa, the Big Data Serving Engine

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Amber Wilson interviews Frode Lundgren, Director of Engineering for Vespa at Verizon Media. Frode discusses the inspiration behind building Vespa and shares thoughts on personalized search. Audio and transcript available here. You can also listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Panoptes, an open source distributed network telemetry ecosystem is now available on Docker April 1, 2019

Panoptes, an open source distributed network telemetry ecosystem is now available on Docker

James Diss, Software Systems Engineer, Verizon Media

Panoptes is an open source network telemetry system that we built to replace a myriad of tools and scripts built up over time to monitor the fleet of hosts worldwide. The core framework is designed to be extended through the use of plugins, which allows many different devices and device types to be monitored. Within Verizon Media, Panoptes provides the data collection layer that we feed to other projects to allow for visualization of device health according to need, alerting, and collation of information.

Normally the components of Panoptes run as a distributed system that allows for horizontal scaling and sharding to different geographical and virtual locations, handling thousands of endpoints, but this can be a difficult environment to simulate. Therefore, we have created a Docker image which holds the entire structure of Panoptes in a single container. I hasten to add that this is not a production instance of Panoptes, and we would not recommend trying to use the Docker container "as-is". Rather, it is more of a workbench installation to examine in motion. The container is entirely self-contained and builds Panoptes using the freely available package with pip install yahoo-panoptes; it is open source and built on the ubiquitous Ubuntu 18.04 (Bionic Beaver). A set of scripts is supplied that allows for examination of the running container, and it also runs a Grafana instance on the container to see the data being collected. A dashboard already exists and is connected to the internal data store (InfluxDB) collecting the metrics.
If you would like to get started right away, you can have Panoptes running very easily by using the prebuilt Docker Hub image:

```shell
docker pull panoptes/panoptes_docker
docker run -d \
    --sysctl net.core.somaxconn=511 \
    --name="panoptes_docker" \
    --shm-size=2G \
    -p 127.0.0.1:8080:3000/tcp \
    panoptes/panoptes_docker
```

This pulls the Docker image from Docker Hub, then runs the image with a couple of parameters. In order: the "sysctl" setting allows redis to run, "name" is the name the running container will be given (docker ps shows the currently running containers), and "shm-size" reserves memory for the container. The "p" parameter exposes port 3000 inside the container as port 8080 on the outside; this allows the Grafana instance to communicate outside of the container.

If you're more interested in building the image yourself (which would allow you to play with the configuration files that are dropped in place during the build), clone the repo and build from source:

```shell
git clone https://github.com/yahoo/panoptes_docker.git && cd panoptes_docker
docker build . -t panoptes_docker
```

Once the image is built, run with:

```shell
docker run -d \
    --sysctl net.core.somaxconn=511 \
    --name="panoptes_docker" \
    --shm-size=2G \
    -p 127.0.0.1:8080:3000/tcp \
    panoptes_docker
```

Here are a few useful links and references:

Docker Resources
- Docker Desktop for Mac: https://docs.docker.com/docker-for-mac/install/
- Docker Desktop for Windows: https://docs.docker.com/docker-for-windows/install/
- Docker Hub (prebuilt images): https://hub.docker.com

Panoptes Resources
- Panoptes in Docker prebuilt image: https://hub.docker.com/r/panoptes/panoptes_docker
- Panoptes in Docker GitHub repo: https://github.com/yahoo/panoptes_docker
- Panoptes GitHub repo: https://github.com/yahoo/panoptes/

Questions, Suggestions & Contributions

Your feedback and contributions are appreciated! Explore Panoptes, use and help contribute to the project, and chat with us on Slack.

Vespa Product Updates, March 2019: Tensor updates, Query tracing and coverage March 29, 2019

Vespa Product Updates, March 2019: Tensor updates, Query tracing and coverage

In last month's Vespa update, we mentioned Boolean Field Type, Environment Variables, and Advanced Search Core Tuning. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we're excited to share the following updates with you:

Tensor update

Individual tensor cells can now be updated easily: add, remove, and modify cell operations are supported. This enables high throughput and continuous updates, as tensor values can be updated without writing the full tensor.

Advanced Query Trace

Query tracing now includes matching and ranking execution information from content nodes (Query Explain), which is useful for performance optimization.

Search coverage in access log

Search coverage is now available in the access log. This enables operators to track the fraction of queries that are degraded with lower coverage. Vespa has features to gracefully reduce query coverage in overload situations, and now it's easier to track this. Search coverage is a useful signal for reconfiguring or increasing the capacity of the application. Explore the access log documentation to learn more.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.
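A partial tensor update touches only the named cells instead of rewriting the whole tensor, which is what makes high-throughput continuous updates possible. A toy Python sketch of the idea, with a sparse tensor modeled as a dict (this is an illustration of the concept, not Vespa's actual API):

```python
# Sparse tensor as {cell_address: value}; only listed cells are touched.
tensor = {("user", "a"): 1.0, ("user", "b"): 2.0}

def update_cells(tensor, add=None, modify=None, remove=None):
    """Apply a partial update without rewriting the full tensor."""
    updated = dict(tensor)
    for addr, value in (add or {}).items():
        updated[addr] = value                      # add new cells
    for addr, value in (modify or {}).items():
        if addr in updated:
            updated[addr] = value                  # overwrite existing cells
    for addr in (remove or []):
        updated.pop(addr, None)                    # drop cells
    return updated

result = update_cells(tensor, add={("user", "c"): 3.0},
                      modify={("user", "a"): 9.0}, remove=[("user", "b")])
print(result)  # {('user', 'a'): 9.0, ('user', 'c'): 3.0}
```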

User Teardown Steps in Templates and Manual Start for [skip ci] and restrictPR March 22, 2019

User Teardown Steps in Templates and Manual Start for [skip ci] and restrictPR

Dao Lam, Software Engineer, Verizon Media
Dekus Lam, Software Engineer, Verizon Media

Screwdriver V4 user teardown steps now work for templates. In the example below, teardown-write will be injected at the end of the build (before Screwdriver's own teardown steps) and will run regardless of build status. If the template has a teardown step of the same name, it will be overwritten by the user's teardown step.

```yaml
jobs:
  main:
    image: node:8
    template: template_namespace/nodejs_main@1.2.0
    steps:
      - teardown-write: echo hello
    requires: [~pr, ~commit]
```

Additionally, we added the ability to manually start skip ci or restrictPR events. For skip ci, users can now hover over the empty build and select "Start pipeline from here" to trigger the build manually. For restrictPR, users can now click on the Start button to manually start the build. Note: if a skip ci commit is made by a bot listed under the cluster's ignoreCommitsBy configuration, skip ci takes precedence and users can still manually start the build.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.624
- [UI](https://hub.docker.com/r/screwdrivercd/ui) - v1.0.389

Contributors

Thanks to the following contributors for making this feature possible:
- d2lam
- dekus

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
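The teardown override behaviour can be pictured as a name-based merge of step lists, with user steps winning on collisions. A simplified Python sketch of that merge (not the actual Screwdriver implementation):

```python
def merge_teardown(template_steps, user_steps):
    """User teardown steps replace same-named template steps and
    run at the end of the build, before Screwdriver's own teardown."""
    user_names = {name for name, _ in user_steps}
    merged = [(n, cmd) for n, cmd in template_steps if n not in user_names]
    return merged + user_steps

template = [("install", "npm install"), ("teardown-write", "echo template")]
user = [("teardown-write", "echo hello")]
print(merge_teardown(template, user))
# [('install', 'npm install'), ('teardown-write', 'echo hello')]
```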

Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server March 20, 2019

Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda (Sr. Director of Open Source at Verizon Media) interviews Alan Carroll, PhD, Senior Software Engineer for Global Networking / Edge at Verizon Media. Alan discusses networking at Verizon Media and how user traffic and proxy happens through Apache Traffic Server. He also shares his love of model rockets. Audio and transcript available here. You can also listen to this episode of Dash Open on iTunes or SoundCloud.

Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More March 6, 2019

Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More

By Akshay Sarma, Principal Engineer, Verizon Media & Brian Xiao, Software Engineer, Verizon Media

This is the first of an ongoing series of blog posts sharing releases and announcements for Bullet, an open-sourced lightweight, scalable, pluggable, multi-tenant query system. Bullet allows you to query any data flowing through a streaming system without having to store it first, through its UI or API. The queries are injected into the running system and have minimal overhead. Running hundreds of queries generally fits into the overhead of just reading the streaming data. Bullet requires running an instance of its backend on your data. This backend runs on common stream processing frameworks (Storm and Spark Streaming are currently supported). The data on which Bullet sits determines what it is used for. For example, our team runs an instance of Bullet on user engagement data (~1M events/sec) to let developers find their own events to validate the code that produces this data. We also use this instance to interactively explore data, throw up quick dashboards to monitor live releases, count unique users, debug issues, and more. Since open sourcing Bullet in 2017, we've been hard at work adding many new features! We'll highlight some of these here and continue sharing update posts for future releases.

Windowing

Bullet used to operate in a request-response fashion: you would submit a query and wait for the query to meet its termination conditions (usually duration) before receiving results. For short-lived queries, say, a few seconds, this was fine. But as we started fielding more interactive and iterative queries, waiting even a minute for results became too cumbersome. Enter windowing! Bullet now supports time- and record-based windowing. With time windowing, you can break up your query into chunks of time over its duration and retrieve results for each chunk. For example, you can calculate the average of a field and stream back results every second. In that case the aggregation is operating on all the data since the beginning of the query, but you can also do aggregations on just the windows themselves. This is often called a Tumbling window. With record windowing, you can get the intermediate aggregation for each record that matches your query (a Sliding window). Or you can do a Tumbling window on records rather than time; for example, you could get results back every three records. Overlapping windows in other ways (Hopping windows) or windows that reset based on different criteria (Session windows, Cascading windows) are currently being worked on. Stay tuned!

Apache Pulsar support as a native PubSub

Bullet uses a PubSub (publish-subscribe) message queue to send queries and results between the Web Service and Backend. As with everything else in Bullet, the PubSub is pluggable. You can use your favorite pubsub by implementing a few interfaces if you don't want to use the ones we provide. Until now, we've maintained and supported a REST-based PubSub and an Apache Kafka PubSub. Now we are excited to announce support for Apache Pulsar as well! Bullet Pulsar will be useful to those users who want to use Pulsar as their underlying messaging service. If you aren't familiar with Pulsar, setting up a local standalone is very simple, and by default, any Pulsar topics written to will automatically be created. Setting up an instance of Bullet with Pulsar instead of REST or Kafka is just as easy. You can refer to our documentation for more details.

Plug your data into Bullet without code

While Bullet worked on any data source located in any persistence layer, you still had to implement an interface to connect your data source to the Backend and convert it into a record container format that Bullet understands. For instance, your data might be located in Kafka and be in the Avro format. If you were using Bullet on Storm, you would perhaps write a Storm Spout to read from Kafka, deserialize, and convert the Avro data into the Bullet record format. This was the only interface in Bullet that required our customers to write their own code. Not anymore! Bullet DSL is a text/configuration-based format for users to plug their data into the Bullet Backend without having to write a single line of code. Bullet DSL abstracts away the two major components for plugging data into the Bullet Backend: a Connector piece to read from arbitrary data sources, and a Converter piece to convert that read data into the Bullet record container. We currently support and maintain a few of these - Kafka and Pulsar for Connectors, and Avro, Maps, and arbitrary Java POJOs for Converters. The Converters understand typed data and can even do a bit of minor ETL (Extract, Transform and Load) if you need to change your data around before feeding it into Bullet. As always, the DSL components are pluggable, and you can write your own (and contribute it back!) if you need one that we don't support.

We appreciate your feedback and contributions! Explore Bullet on GitHub, use and help contribute to the project, and chat with us on Google Groups. To get started, try our Quickstarts on Spark or Storm to set up an instance of Bullet on some fake data and play around with it.
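The window types described above differ only in which slice of the stream each emitted result aggregates over. A toy Python sketch contrasting a tumbling window with a cumulative aggregation over a list of records (illustrative only; Bullet computes these incrementally inside the stream processor):

```python
def tumbling_avg(stream, size):
    """Average over each non-overlapping chunk of `size` records."""
    out = []
    for i in range(0, len(stream), size):
        chunk = stream[i:i + size]
        out.append(sum(chunk) / len(chunk))
    return out

def cumulative_avg(stream, size):
    """Average over everything seen so far, emitted every `size` records."""
    out = []
    for i in range(size, len(stream) + 1, size):
        out.append(sum(stream[:i]) / i)
    return out

data = [1, 2, 3, 4, 5, 6]
print(tumbling_avg(data, 3))    # [2.0, 5.0]
print(cumulative_avg(data, 3))  # [2.0, 3.5]
```

A sliding (per-record) window is the `size=1` case of the tumbling variant: one result per matching record.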

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning February 28, 2019

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning

In last month's Vespa update, we mentioned Parent/Child, Large File Config Download, and a Simplified Feeding Interface. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to helpful feedback and contributions from the community, Vespa continues to grow. This month, we're excited to share the following updates:

Boolean field type

Vespa has released a boolean field type in #6644. This feature was requested by the open source community and is targeted at applications that have many boolean fields. It reduces the memory footprint of such fields to 1/8 (compared to byte) and hence increases query throughput and cuts latency. Learn more about choosing the field type here.

Environment variables

The Vespa Container now supports setting environment variables in services.xml. This is useful if the application uses libraries that read environment variables.

Advanced search core tuning

You can now configure index warmup, which reduces high-latency requests at startup. You can also reduce spiky memory usage when attributes grow using resizing-amortize-count; the default has been changed to provide smoother memory usage, so growing applications use less transient memory. More details on search core configuration can be explored here.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to see.
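The 1/8 figure follows from storing each boolean as a single bit rather than a whole byte. A rough Python illustration of bit-packing (just the arithmetic behind the saving, not Vespa's implementation):

```python
def pack_bits(flags):
    """Pack a list of booleans into a bytearray, 8 flags per byte."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return packed

def get_bit(packed, i):
    """Read flag i back out of the packed representation."""
    return bool(packed[i // 8] & (1 << (i % 8)))

flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bits(flags)
print(len(packed))         # 2 bytes instead of 9
print(get_bit(packed, 2))  # True
```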

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications February 27, 2019

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications

By Arun Gupta

Effective monitoring of applications depends on high-quality instrumentation. By measuring key metrics for your applications, you can identify performance characteristics and bottlenecks, detect failures, and plan for growth. Here are some examples of metrics that you might want about your applications:
- How much processing is being done, which could be in terms of requests, queries, transactions, records, backend calls, etc.
- How long a particular part of the code is taking (i.e., latency), which could be in the form of total time spent as well as statistics like weighted average (based on sum and count), min, max, percentiles, and histograms.
- How many resources are being utilized, like memory, entries in a hashmap, length of an array, etc.

Further, you might want to know details about your service, such as:
- How many users are querying the service?
- Latency experienced by users, sliced by users' device types, countries of origin, operating system versions, etc.
- Number of errors encountered by users, sliced by types of errors.
- Sizes of responses returned to users.

At Verizon Media, we have applications and services that run at a very large scale, and metrics are critical for driving business and operational insights. We set out to find a good metrics library for our Java services that provides lots of features but performs well at scale. After evaluating the available options, we realized that existing libraries did not meet our requirements:
- Support for dynamic dimensions (i.e., tags)
- Metrics need to support associative operations
- Works well in very high traffic applications
- Minimal garbage collection pressure
- Report metrics to multiple monitoring systems

As a result, we built and open sourced UltraBrew Metrics, a Java library for instrumenting very large-scale applications.

Performance

UltraBrew Metrics can operate at millions of requests per second per JVM without measurably slowing the application down.
We currently use the library to instrument multiple applications at Verizon Media, including one that uses this library 20+ million times per second on a single JVM. Here are some of the techniques that allowed us to achieve our performance target:

- Minimize the need for synchronization by:
  - Using Java’s Unsafe API for atomic operations.
  - Aligning data fields to L1/L2-cache line size.
  - Tracking state over 2 time-intervals to prevent contention between writes and reads.
- Reduce the creation of objects, including avoiding the use of Java HashMaps.
- Writes happen on the caller thread rather than dedicated threads. This avoids the need for a buffer between threads.

Questions or Contributions

To learn more about this library, please visit our GitHub. Feel free to also tweet or email us with any questions or suggestions.

Acknowledgments

Special thanks to my colleagues who made this possible:

- Matti Oikarinen
- Mika Mannermaa
- Smruti Ranjan Sahoo
- Ilpo Ruotsalainen
- Chris Larsen
- Rosalie Bartlett
- The Monitoring Team at Verizon Media

February 27, 2019

Freeze Windows and Collapsed Builds

Min Zhang, Software Dev Engineer, Verizon Media
Pranav Ravichandran, Software Dev Engineer, Verizon Media

Freeze Windows

Want to prevent your deployment jobs from running on weekends? You can now freeze your Screwdriver jobs and prevent them from running during specific time windows using the freezeWindows feature. Screwdriver will collapse all the frozen jobs inside the window into a single job and run it as soon as the window expires. The job will be run from the last commit within the window.

Screwdriver Users

The freezeWindows setting takes a cron expression or a list of them as the value. Caveats:

- Unlike buildPeriodically, freezeWindows should not use hashed time, therefore the symbol H for hash is disabled.
- Combinations of day of week and day of month are invalid, so only one of day of week and day of month can be specified. The other field should be set to ?.
- All times are in UTC.

In the following example, job1 will be frozen during the month of March, job2 will be frozen on weekends, and job3 will be frozen from 10 PM to 10 AM:

shared:
  image: node:6
jobs:
  job1:
    freezeWindows: ['* * ? 3 *']
    requires: [~commit]
    steps:
      - build: echo "build"
  job2:
    freezeWindows: ['* * ? * 0,6,7']
    requires: [~job1]
    steps:
      - build: echo "build"
  job3:
    freezeWindows: ['* 0-10,22-23 ? * *']
    requires: [~job2]
    steps:
      - build: echo "build"

In the UI, jobs within the freeze window appear as below (deploy and auxiliary):

Collapsed Builds

Screwdriver now supports collapsing all BLOCKED builds of the same job into a single build (the latest one). With this feature, users with concurrent builds no longer need to wait until all of them finish in series to get the latest release out.
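The collapse semantics can be pictured with a toy model. This is a conceptual sketch, not Screwdriver's actual queue code: while a job already has a BLOCKED build waiting, a newer build for the same job replaces it, so only the latest commit is built once the running build finishes.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual model of collapsed builds: at most one BLOCKED build per job,
// and a newer build for the same job collapses (replaces) the older one.
public class CollapsingQueue {
    private final Map<String, String> blocked = new HashMap<>(); // job -> commit

    public void enqueue(String job, String commit) {
        blocked.put(job, commit); // newer build collapses the older BLOCKED one
    }

    public String takeNext(String job) {
        return blocked.remove(job); // the single surviving (latest) build
    }

    public static void main(String[] args) {
        CollapsingQueue q = new CollapsingQueue();
        q.enqueue("main", "abc123");
        q.enqueue("main", "def456");
        q.enqueue("main", "789fff");
        System.out.println(q.takeNext("main")); // 789fff: only the latest runs
    }
}
```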
Screwdriver Users

To opt in to collapseBuilds, Screwdriver users can configure their screwdriver.yaml using annotations as shown below:

jobs:
  main:
    annotations:
      screwdriver.cd/collapseBuilds: true
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]

In the UI, a collapsed build appears as below:

Cluster Admin

Cluster admins can configure the default behavior (collapsed or not) in the queue-worker configuration.

Compatibility List

In order to use freeze windows and collapsed builds, you will need these minimum versions:

- API: v0.5.578
- Queue-worker: v2.5.2
- Buildcluster-queue-worker: v1.1.8

Contributors

Thank you to the following contributors for making this feature possible:

- minz1027
- pranavrc

Questions & Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

February 25, 2019

Restrict PRs from forked repository

Dao Lam, Software Engineer, Verizon Media

Previously, any Screwdriver V4 user could start PR jobs (jobs configured to run on ~pr) by forking the repository and creating a PR against it. For many pipelines, this is not desirable behavior for security reasons, since secrets and other sensitive data might get exposed in the PR builds. Screwdriver V4 now allows users to specify whether they want to restrict forked PRs or all PRs using the pipeline-level annotation screwdriver.cd/restrictPR. Example:

annotations:
  screwdriver.cd/restrictPR: fork
shared:
  image: node:8
jobs:
  main:
    requires:
      - ~pr
      - ~commit
    steps:
      - echo: echo test

Cluster admins can set the default behavior for the cluster by setting the environment variable RESTRICT_PR. Explore the guide here.

Compatibility List

In order to use this feature, you will need these minimum versions:

- API: v0.5.581

Contributors

Thanks to the following contributors for making this feature possible:

- d2lam
- stjohnjohnson

Questions & Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

February 21, 2019

Shared “Verizon Media Case Study: Zero Trust Security With Athenz” at the OpenStack Summit in Berlin

By James Penick, Architect Director, Verizon Media At Verizon Media, we’ve developed and open sourced a platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures called Athenz. Athenz addresses zero trust principles, including situations where authenticated clients require explicit authorization to be allowed to perform actions, and authorization needs to always be limited to the least privilege required. During the OpenStack Summit in Berlin, I discussed Athenz and its integration with OpenStack for fully automated role-based authorization and identity provisioning. We are using Athenz to bootstrap our instances deployed in both private and public clouds with service identities in the form of short-lived X.509 certificates that allow one service to securely communicate with another. Our OpenStack instances are powered by Athenz identities at scale. To learn more about Athenz, give feedback, or contribute, please visit our Github and chat with us on Slack.

February 20, 2019

Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine

Jon Bratseth, Distinguished Architect, Verizon Media

Vespa, the open source big data serving engine, includes a mode which provides personal search at scale for a fraction of the cost of alternatives. In this article, we explain streaming search and discuss how to use it.

Imagine you are tasked with building the next email service, a massive personal data store centered around search. How would you do it? An obvious answer is to just use a regular search engine, write all documents to a big index, and simply restrict queries to match documents belonging to a single user. Although this works, it’s incredibly costly. Successful personal data stores have a tendency to become massive — the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space and the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems need to handle billions of writes per day, so this quickly becomes the dominating cost of the entire system.

However, when you think about it, there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we instead just maintain a separate small index per user? This makes both index updates and queries cheaper but leads to a new problem: writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks or storing all data on SSDs. Both options are expensive.

Large scale data stores already solve this problem for appending writes, by using some variant of multilevel log storage.
Could we leverage this to layer the index on top of a data store? That helps, but it means we need to do our own development to put these systems together in a way that performs at scale every time, for both queries and writes. And we still need to pay the cost of storing the indexes in addition to the raw user data.

Do we need indexes at all, though? It turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time — what we call streaming search. In most cases, subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.

Numbers

How many documents can be searched per node per second with this solution? Assuming a node with 500 MB/sec read speed (either from an SSD or multiple spinning disks) and a 1 KB average compressed document size, the disk can search at most 500 MB/sec / 1 KB/doc = 500,000 docs/sec. If each user stores 1000 documents on average, this gives a max throughput per node of 500 queries/second. This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective.

What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average.
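The back-of-the-envelope numbers above can be checked as executable arithmetic (all figures taken from the article; the helper names are mine):

```java
public class StreamingSearchNumbers {

    // Documents a node can scan per second, given read bandwidth and doc size.
    static long docsPerSecond(long readBytesPerSec, long docBytes) {
        return readBytesPerSec / docBytes;
    }

    // Milliseconds to stream one user's data serially on one node.
    static double streamLatencyMs(long docs, long docBytes, long readBytesPerSec) {
        return docs * docBytes * 1000.0 / readBytesPerSec;
    }

    public static void main(String[] args) {
        // A billion writes per day; peak assumed roughly 2x the average rate:
        long peakOpsPerSec = 2 * (1_000_000_000L / (24 * 3600));       // ~23k, "about 25k"

        // 500 MB/sec read speed, 1 KB average compressed document:
        long docsPerSec = docsPerSecond(500_000_000L, 1_000);          // 500,000 docs/sec

        // 1000 documents per user on average:
        long queriesPerSec = docsPerSec / 1_000;                       // 500 queries/sec
        double latencyMs = streamLatencyMs(1_000, 1_000, 500_000_000L); // 2.0 ms

        System.out.println(peakOpsPerSec + " " + docsPerSec + " "
                + queriesPerSec + " " + latencyMs);
    }
}
```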
However, we usually care more about the 99th percentile latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a trade-off between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node, such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism, and hence the latency for the very largest users. For example, with 20 nodes in total per cluster, we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.

Streaming search

Alright, we now have a cost-effective solution to implement the next email provider: store just the raw data of users in a log-level store. Locate the data of each user on a single node in the system for locality (or 2–3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be inexpensive and efficient, but it sounds like a lot of work! It would be great if somebody already did all of this, ran it at scale for years, and then released it as open source.

Well, as luck would have it, we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa.
When this solution is compared to indexed search in Vespa, or to more complicated sharding solutions in Elasticsearch for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications, such as storing and searching massive amounts of emails, personal typeahead suggestions, personal image collections, and private forum group content.

Streaming search on Vespa

The steps to using streaming search on Vespa are:

- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in e.g. id:mynamespace:mydocumenttype:g=user123:doc123
- Pass the group id in queries by setting the query property streaming.groupname.

That’s it! By following the above steps, you’ll have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any available alternative, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details, see the documentation on streaming search. Have fun using Vespa and let us know (tweet or email) what you’re building and any features you’d like to see.
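The grouped document id and the query property from the steps above reduce to plain string conventions; a minimal sketch (the helper names here are illustrative, not part of Vespa's API):

```java
public class StreamingIds {

    // Builds a document id whose third part carries the group, e.g.
    // id:mynamespace:mydocumenttype:g=user123:doc123
    static String groupedDocId(String namespace, String doctype,
                               String groupId, String localId) {
        return String.format("id:%s:%s:g=%s:%s", namespace, doctype, groupId, localId);
    }

    // The query property that restricts a streaming query to one group's data.
    static String streamingGroupProperty(String groupId) {
        return "streaming.groupname=" + groupId;
    }

    public static void main(String[] args) {
        System.out.println(groupedDocId("mynamespace", "mydocumenttype", "user123", "doc123"));
        System.out.println(streamingGroupProperty("user123"));
    }
}
```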

February 13, 2019

Serving article comments using reinforcement learning of a neural net

Don’t look at the comments. When you allow users to make comments on your content pages, you face the problem that not all of them are worth showing — a difficult problem to solve, hence the saying. In this article I’ll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution.

Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this, the team needed a system that can add, find, count, and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. Here I’ll explain how this problem was solved for Yahoo properties by using Vespa — the open source big data serving engine. I’ll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

Real-time comment serving

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open source big data serving engine, for this, as it supports both such basic serving and incorporating machine learning at serving time (which we’ll get to below).
By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article. In addition, this document structure allowed less-used operations such as showing all the articles of a given user.

The Vespa instance used at Yahoo for this stores about a billion comments at any time, serves about 12,000 queries per second, and about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes, and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6–12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles on Yahoo pages have a very large number of comments — up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution.
This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency.

The input to the ranking function is features which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments, it would be wasteful to have to update each comment every time there is new information about the author. Instead, the author information is stored in a separate document type — one document per author, and a document reference in Vespa is used to import that author feature into each comment. This allows updating the author information once and having it automatically take effect for all comments by that author. With these features, it’s possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments, like the following:

Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started, but obviously not optimal. To improve it, they needed to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting.
This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc.). The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it.

The model chosen here was a neural net with multiple hidden layers, roughly illustrated as follows:

The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationships exist and hand-write functions to capture them (feature engineering).

To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid, where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, producing a new model for the next iteration of the reinforcement learning cycle.

To implement this on Vespa, the team configured the neural net as the ranking function for comments. This was done as a manually written ranking function over tensors in a rank profile.
Here is the production configuration used:

rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    }
    function layer_0() { # returns tensor(hidden[9])
        expression: elu(xw_plus_b(nn_input,
                                  get_model_weights(W_0),
                                  get_model_weights(b_0),
                                  x))
    }
    function layer_1() { # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0,
                                  get_model_weights(W_1),
                                  get_model_weights(b_1),
                                  hidden))
    }
    # xw_plus_b returns tensor(out[1]), so sum converts to double
    function layer_out() {
        expression: sum(xw_plus_b(layer_1,
                                  get_model_weights(W_out),
                                  get_model_weights(b_out),
                                  out))
    }
    first-phase {
        expression: freshnessRank
    }
    second-phase {
        expression: layer_out
        rerank-count: 2000
    }
}

More recently Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for models saved in the ONNX format), which would also be a good option here since the training happens in TensorFlow.

Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries.
However, it is still a somewhat heavy-weight process, so another approach was chosen: store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query and adds the parameter tensors to the query before it is passed to the content nodes for evaluation. Here is the full production code needed to accomplish this serving-time operation:

import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {

    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
    private static final String FEATURE_FORMAT = "query(%s)";

    /** Used to fetch model documents from the Vespa index */
    private final SyncSession fetchDocumentSession;

    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession =
                DocumentAccess.createDefault()
                              .createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Fetch the model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));

        // Add its tensors to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                    addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query));
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field,
                                                     FieldValue value,
                                                     Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field),
                                                 tensor);
        }
    }
}

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with “indexing: attribute | summary”):

document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … }
    field W_1 type tensor(hidden[9],out[9]) { … }
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … }
    field b_out type tensor(out[1]) { … }
}

Since updating documents is a lightweight operation, it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.

Results

Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms, since the neural net model is more expensive. The response time stays low because Vespa evaluates the neural net on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment over the network for evaluation, and allows increasing parallelization indefinitely by adding more content nodes.
However, evaluating the neural net for all comments on outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you’ll have noticed the solution to this: two-phase ranking was used, where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with standard open source technology. The team working on this plans to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

February 12, 2019

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

Online evaluation of machine-learned models (model serving) is difficult to scale to large datasets. Vespa.ai is an open source big data serving solution used to solve this problem and in use today on some of the largest such systems in the world. These systems evaluate models over millions of data points per request for hundreds of thousands of requests per second. If you’re in Warsaw on February 27th, please join Jon Bratseth (Distinguished Architect, Verizon Media) at the Big Data Technology Warsaw Summit, where he’ll share “Scalable machine-learned model serving” and answer any questions. Big Data Technology Warsaw Summit is a one-day conference with technical content focused on big data analysis, scalability, storage, and search. There will be 27 presentations and more than 500 attendees are expected. Jon’s talk will explore the problem and architectural solution, show how Vespa can be used to achieve scalable serving of TensorFlow and ONNX models, and present benchmarks comparing performance and scalability to TensorFlow Serving. Hope to see you there!

February 8, 2019

Meta Pull Request Checks

Screwdriver now supports adding extra status checks on pull requests through Screwdriver build meta. This feature allows users to add custom checks, such as coverage results, to the Git pull request. Note: This feature is only available for the Github plugin at the moment.

Screwdriver Users

To add a check to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:

jobs:
  main:
    steps:
      - status: |
          meta set meta.status.findbugs '{"status":"FAILURE","message":"923 issues found. Previous count: 914 issues.","url":"http://findbugs.com"}'
          meta set meta.status.coverage '{"status":"SUCCESS","message":"Coverage is above 80%."}'

These commands will result in a status check in Git that will look something like:

For more details, see our documentation.

Compatibility List

In order to use the new meta PR comments feature, you will need these minimum versions:

- API: v0.5.559

Contributors

Thanks to the following people for making this feature possible:

- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

February 8, 2019
February 4, 2019

Serving article comments using neural nets and reinforcement learning

Yahoo properties such as Yahoo Finance, Yahoo News, and Yahoo Sports allow users to comment on articles, as many other apps and websites do. To support this we needed a system that can add, find, count, and serve comments at scale in real time. Not all comments are equally interesting or relevant, though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. In this blog post, we'll explain how we're solving this problem for Yahoo properties using Vespa - the open source big data serving engine. We'll start with the basics and then show how comment selection using a neural net and reinforcement learning has been implemented.

Real-time comment serving

As mentioned, we need a system that can add, find, count, and serve comments at scale in real time. Vespa allows us to do this easily by storing each comment as a separate document, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself. Vespa then allows us to issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article. In addition, we can show all the articles of a given user and support similar less-used operations. We store about a billion comments at any time, serve about 12,000 queries per second, and handle about twice as many writes (new comments plus comment metadata updates). Average latency for queries is about 4 ms, and write latency is roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes, and executing the distributed part of queries in parallel. In total, we use 32 stateless and 96 stateful nodes spread over 5 regional data centers. Data is automatically sharded by Vespa in each data center, in 6-12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles have a very large number of comments - up to hundreds of thousands are not uncommon - and no user is going to read all of them. Therefore we need to pick the best comments to show each time someone views an article. To do this, we let Vespa find all the comments for the article, compute a score for each, and pick the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine computes it locally on each data partition in parallel during query execution. This allows us to execute these queries with low latency and ensures that we can handle more comments by adding more content nodes, without an increase in latency. The input to the ranking function is features, which are typically stored in the comment or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, we keep track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to update each comment every time we have new information about the author. Instead, we store the author information in a separate document type - one document per author - and use a document reference in Vespa to import that author feature into each comment. This allows us to update author information once and have it automatically take effect for all comments by that author.
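As an illustration of how such features combine into a score, a hand-weighted scoring function could look as follows in plain Java. The feature names and weights here are invented for the sketch; they are not the production expression:

```java
// Hypothetical hand-weighted ranking of a comment from its features.
// Weights and feature names are invented; Vespa evaluates the real
// expression on each content node, but the shape is the same.
public class CommentScore {
    static double score(double upvotes, double downvotes,
                        double authorReputation, double freshness) {
        return 2.0 * Math.log1p(upvotes)     // diminishing returns on votes
             - 1.0 * Math.log1p(downvotes)
             + 0.5 * authorReputation        // imported from the author document
             + 1.5 * freshness;              // newer comments rank higher
    }
}
```

Tuning such hardcoded weights by hand is exactly the limitation the machine-learned approach below removes.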
With these features, we can configure a mathematical function as a ranking expression which computes the rank score of each comment, producing a ranked list of the top comments.

Using a neural net and reinforcement learning

We used to rank comments using a handwritten ranking expression with hardcoded weighting of the features. This is a good way to get started, but obviously not optimal. To improve it we need to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc.). We know that we want user interest to go up on average, but we don't know what the correct value of this measure of interest might be for any given list of comments. Therefore it's hard to create a training set of interest signals for articles (supervised learning), so we chose to use reinforcement learning instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal we use as a proxy for user interest, and use this to converge on a model that increases it. The model chosen is a neural net with multiple hidden layers. The advantage of using a neural net compared to a simple function such as linear regression is that we can capture non-linear relationships in the feature data without having to guess which relationships exist and hand-write functions to capture them (feature engineering). To explore the space of possible rankings, we implement a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query.
We log the ranking information and our user interest signals such as dwell time to our Hadoop grid, where they are joined. This generates a training set each hour which we use to retrain the model using TensorFlow-on-Spark, producing a new model for the next iteration of the reinforcement learning. To implement this on Vespa, we configure the neural net as the ranking function for comments. This was done as a manually written ranking function over tensors in a rank profile:

    rank-profile neuralNet {
        function get_model_weights(field) {
            expression: if(query(field) == 0, constant(field), query(field))
        }
        function layer_0() {  # returns tensor(hidden[9])
            expression: elu(xw_plus_b(nn_input,
                                      get_model_weights(W_0),
                                      get_model_weights(b_0),
                                      x))
        }
        function layer_1() {  # returns tensor(out[9])
            expression: elu(xw_plus_b(layer_0,
                                      get_model_weights(W_1),
                                      get_model_weights(b_1),
                                      hidden))
        }
        function layer_out() {  # xw_plus_b returns tensor(out[1]), so sum converts to double
            expression: sum(xw_plus_b(layer_1,
                                      get_model_weights(W_out),
                                      get_model_weights(b_out),
                                      out))
        }
        first-phase {
            expression: freshnessRank
        }
        second-phase {
            expression: layer_out
            rerank-count: 2000
        }
    }

More recently Vespa added support for deploying TensorFlow SavedModels directly, which would also be a good option since the training happens in TensorFlow. Neural nets have a pair of weight and bias tensors for each layer, which is what we want our training process to optimize.
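The layer functions in the rank profile are ordinary matrix arithmetic. As a sketch, here is the same computation in plain Java: elu over x·W + b per layer, with the output layer summed to a double. Dimensions here are arbitrary rather than the 9-wide layers of the profile:

```java
// Sketch of the computation the rank profile expresses: each layer is
// elu(x*W + b); the final layer's tensor is summed to a single double.
public class NeuralNetEval {
    // xw_plus_b: vector-matrix product plus bias
    static double[] xwPlusB(double[] x, double[][] w, double[] b) {
        double[] out = new double[b.length];
        for (int j = 0; j < b.length; j++) {
            double sum = b[j];
            for (int i = 0; i < x.length; i++) sum += x[i] * w[i][j];
            out[j] = sum;
        }
        return out;
    }

    // elu activation: identity for x >= 0, e^x - 1 for x < 0
    static double[] elu(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++)
            out[i] = v[i] >= 0 ? v[i] : Math.expm1(v[i]);
        return out;
    }

    static double evaluate(double[] input,
                           double[][] w0, double[] b0,
                           double[][] w1, double[] b1,
                           double[][] wOut, double[] bOut) {
        double[] h0 = elu(xwPlusB(input, w0, b0));   // layer_0
        double[] h1 = elu(xwPlusB(h0, w1, b1));      // layer_1
        double[] out = xwPlusB(h1, wOut, bOut);      // layer_out, pre-sum
        double sum = 0;
        for (double v : out) sum += v;
        return sum;
    }
}
```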
The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, to do reinforcement learning we need to be able to update them frequently. We could achieve this by redeploying the application package frequently, as Vespa allows this to be done without restarts or disruption to ongoing queries. However, that is still a somewhat heavyweight process, so we chose another approach: store the neural net parameters as tensors in a separate document type, and create a Searcher component which looks up this document on each incoming query and adds the parameter tensors to the query before it is passed to the content nodes for evaluation. Here is the full code needed to accomplish this:

    import com.yahoo.document.Document;
    import com.yahoo.document.DocumentId;
    import com.yahoo.document.Field;
    import com.yahoo.document.datatypes.FieldValue;
    import com.yahoo.document.datatypes.TensorFieldValue;
    import com.yahoo.documentapi.DocumentAccess;
    import com.yahoo.documentapi.SyncParameters;
    import com.yahoo.documentapi.SyncSession;
    import com.yahoo.search.Query;
    import com.yahoo.search.Result;
    import com.yahoo.search.Searcher;
    import com.yahoo.search.searchchain.Execution;
    import com.yahoo.tensor.Tensor;

    import java.util.Map;

    public class LoadRankingmodelSearcher extends Searcher {
        private static final String VESPA_DOCUMENTID_FORMAT = "id:canvass_search:rankingmodel::%s";
        // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
        private static final String QUERY_FEATURE_FORMAT = "query(%s)";

        /** To fetch model documents from the Vespa index */
        private final SyncSession fetchDocumentSession;

        public LoadRankingmodelSearcher() {
            this.fetchDocumentSession = DocumentAccess.createDefault()
                    .createSyncSession(new SyncParameters.Builder().build());
        }

        @Override
        public Result search(Query query, Execution execution) {
            // Fetch the model document from the Vespa index
            String documentId = String.format(VESPA_DOCUMENTID_FORMAT, query.getRanking().getProfile());
            Document modelDoc = fetchDocumentSession.get(new DocumentId(documentId));
            // Add its tensors to the query
            if (modelDoc != null) {
                modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                        addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query));
            }
            return execution.search(query);
        }

        private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
            if (value instanceof TensorFieldValue) {
                Tensor tensor = ((TensorFieldValue) value).getTensor().get();
                query.getRanking().getFeatures().put(String.format(QUERY_FEATURE_FORMAT, field), tensor);
            }
        }
    }

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net:

    document rankingmodel {
        field modelTimestamp type long { … }
        field W_0 type tensor(x[9],hidden[9]) { … }
        field b_0 type tensor(hidden[9]) { … }
        field W_1 type tensor(hidden[9],out[9]) { … }
        field b_1 type tensor(out[9]) { … }
        field W_out type tensor(out[9]) { … }
        field b_out type tensor(out[1]) { … }
    }

Since updating documents is a lightweight operation, we can now make frequent changes to the neural net to implement the reinforcement learning.

Results

Switching to the neural net model with reinforcement learning led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because Vespa evaluates the neural net on all the content nodes (partitions) in parallel.
We avoid the bottleneck of sending the data for each comment over the network for evaluation, and we can increase parallelization indefinitely by adding more content nodes. However, evaluating the neural net for all comments on outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you'll have noticed the solution to this: we use two-phase ranking, where the comments are first selected by a cheap rank function (which we term freshnessRank) and the highest-scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the maximum CPU spent on evaluating the neural net per query.

Conclusion and future work

We have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa and TensorFlow-on-Spark this can be accomplished relatively easily with standard open source technology. We plan to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Acknowledgments

Thanks to Aaron Nagao, Sreekanth Ramakrishnan, Zhi Qu, Xue Wu, Kapil Thadani, Akshay Soni, Parikshit Shah, Troy Chevalier, Jon Bratseth, Lester Solbakken and Håvard Pettersen for their contributions to this work.
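Generically, the two-phase trick can be sketched as follows. This is a simplified single-node sketch with stand-in scorers, not Vespa's actual implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class TwoPhaseRank {
    // Score all candidates with a cheap function, keep the best
    // rerankCount, then re-rank only those with the expensive model.
    // The expensive model thus runs at most rerankCount times per query.
    static <T> List<T> rank(List<T> docs, int rerankCount,
                            ToDoubleFunction<T> cheap,
                            ToDoubleFunction<T> expensive) {
        List<T> all = new ArrayList<>(docs);
        all.sort(Comparator.comparingDouble(cheap).reversed());      // first phase
        List<T> top = new ArrayList<>(all.subList(0, Math.min(rerankCount, all.size())));
        top.sort(Comparator.comparingDouble(expensive).reversed());  // second phase
        return top;
    }
}
```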

February 4, 2019
Vespa 7 is released! February 1, 2019

Vespa 7 is released!

This week we rolled the major version of Vespa over from 6 to 7. The releases we make public already run a large number of high-traffic production applications on our Vespa cloud, and the version 7 releases are no exception. There are no new features in version 7, since we release all new features incrementally on minor versions. Instead, the major version change marks the point where we remove legacy features marked as deprecated and change some default settings. We only do this on major version changes, as Vespa uses semantic versioning. Before upgrading, go through the list of changes in the release notes to make sure your application and usage are ready. Upgrading can be done by following the regular live upgrade procedure.

Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine) January 31, 2019

Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine)

By Nate Speidel, Software Engineer, Verizon Media In December, I joined Michael Natkovich (Director, Software Dev Engineering, Verizon Media) at a Bay Area Hadoop meetup to talk about Bullet. Created by Yahoo, Bullet is an open-source multi-tenant query system. It's lightweight, scalable, and pluggable, and allows you to query any data flowing through a streaming system without having to store it. Bullet queries look forward in time, and we use it to support otherwise intractable Big Data aggregations like Top K, counting distincts, and windowing efficiently, without a storage layer, by using sketch-based algorithms. Jon Bratseth, Distinguished Architect at Verizon Media, joined us at the meetup and presented "Big Data Serving with Vespa". Largely developed by engineers from Yahoo, Vespa is a big data processing and serving engine, available as open source on GitHub. Vespa allows you to search, organize, and evaluate machine-learned models from TensorFlow over large, evolving data sets, with latencies in the tens of milliseconds. Many of our products - such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms - currently employ Vespa. To learn about future product updates from Bullet or Vespa, follow YDN on Twitter or LinkedIn.

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes January 29, 2019

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes

By Jithin Emmanuel, Sr. Software Dev Manager, Verizon Media On Tuesday, December 4th, I joined speakers from Spotinst, Nirmata, CloudYuga, and MayaData at the Microservices and Cloud Native Apps Meetup in Sunnyvale. We shared how Screwdriver is used for CI/CD at Verizon Media. Created by Yahoo and open-sourced in 2016, Screwdriver is a build platform designed for continuous delivery at scale. Screwdriver supports an expanding list of source code services, execution engines, and databases since it is not tied to any specific compute platform. Moreover, it has a fully documented API and a growing open source community. The meetup also featured these interesting CI/CD presentations:
- A Quick Overview of Intro to Kubernetes Course, by Neependra Khare, Founder, CloudYuga. Neependra discussed his online course, which covers some of Kubernetes' basic concepts, its architecture, the problems it solves, and the model it uses to handle containerized deployments and scaling. Additionally, CloudYuga provides training in Docker, Kubernetes, Mesos Marathon, Container Security, GO Language, Advanced Linux Administration, and more.
- Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, by Amiram Shachar, CEO & Founder, Spotinst. Amiram discussed two important Kubernetes concepts: headroom and two-level scaling. Amiram also reviewed the different Kubernetes deployment tools, including Kubernetes Operations (Kops). Ritesh Patel, Founder and VP Products at Nirmata, demoed Spotinst and Nirmata. Nirmata provides a complete solution for Kubernetes deployment and management for cloud-based app containerization. Spotinst is workload automation software focused on helping enterprises save time and costs on their cloud compute infrastructure.
- Data Agility for Stateful Workloads in Kubernetes, by Murat Karslioglu, VP Products, MayaData. MayaData is focused on freeing DevOps and Kubernetes from storage constraints with OpenEBS. Murat discussed accelerating CI/CD pipelines and DevOps using chaos engineering and containerized storage. Murat also explored some of the open source tools available from MayaData and introduced the MayaData Agility Platform (MDAP). Murat's presentation ended with a live demo of OpenEBS and Litmus.
To learn about future meetups, follow us on Twitter at @YDN or on LinkedIn.

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface January 28, 2019

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

In last month's Vespa update, we mentioned ONNX integration, precise transaction log pruning, grouping on maps, and improvements to streaming search performance. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to evolve. This month, we're excited to share the following updates with you:

Parent/Child

We've added support for multiple levels of parent-child document references. Documents with references to parent documents can now import fields, with minimal impact on performance. This simplifies updates to parent data, as no denormalization is needed, and supports use cases with many-to-many relationships, like product search. Read more in parent-child.

File URL references in application packages

Serving nodes sometimes require data files which are so large that it doesn't make sense to store and deploy them in the application package. Such files can now be included in application packages by using a URL reference. When the application is redeployed, the files are automatically downloaded and injected into the components that depend on them.

Batch feed in Java client

The new SyncFeedClient provides a simplified API for feeding batches of data with high performance using the Java HTTP client. This is convenient when feeding from systems without full streaming support, such as Kafka and DynamoDB. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to see.
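To illustrate the parent-child feature, a reference plus an imported field in a schema might look roughly like this. The type and field names are hypothetical; see the parent-child documentation for the exact syntax:

```
schema product {
    document product {
        # Reference to a parent document; must be an attribute
        field brand_ref type reference<brand> {
            indexing: attribute
        }
    }
    # Import a field from the referenced parent into this document type
    import field brand_ref.name as brand_name {}
}
```

Updating the single brand document then takes effect for every product that references it, with no denormalization.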

Pipeline page redesign January 25, 2019

Pipeline page redesign

Check out Screwdriver's redesigned UI for the pipeline page! In addition to a smoother interface and easier navigation, here are some utility fixes:

Disabled jobs

We've changed disabled job icons to stand out more in the pipeline graph. Also, you can now:
- Hover over a disabled job in the pipeline graph to view its details (who disabled it).
- Add a reason when you disable a job from the Pipeline Options tab. This information will be displayed on the same page.

Pipeline events

The event list has been conveniently shifted to the right sidebar! The sidebar now shows minimal data, including only a minified version of the parts of your workflow that ran, to make for quicker information processing. This change gives more space for large workflow graphs and makes for less scrolling on the page. Pull requests can be accessed by switching from the Events tab to the Pull Requests tab on the top right.

Compatibility List

In order to see the new pipeline redesign, you will need these minimum versions:
- API: v0.5.551
- UI: v1.0.365

Contributors

Thanks to the following people for making this feature possible:
- DekusDenial
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Moloch 1.7.0 - Notifications, Field History, and More January 22, 2019

Moloch 1.7.0 - Notifications, Field History, and More

By Andy Wick, Chief Architect, Verizon Media & Elyse Rinne, Software Engineer, Verizon Media Since wrapping up the 2nd annual MolochON, we've been working on Moloch 1.7.0 - available here. Moloch is a large scale, open source, full packet capturing, indexing, and database system. We've been improving it with the help of our open source community. This release includes two bug fixes in capture and several new features. Here's a list of the changes:

Fixed corrupt file sequence numbers

When Elasticsearch was responding slowly or capture was busy, it was possible for corrupt sequence numbers to be created. This would lead to packet capture (pcap) that couldn't be viewed and random items appearing in the sequence table. This is now fixed.

Removed 256 offline files limit

When running against offline files, capture would stop properly recording sequence numbers after the 256th file per capture run. This led to pcap that couldn't be viewed for those files, forcing the user to restart the capture session for the next 256 files. With the new fix in place, you can now capture and store to more than 256 files.

Field intersections

We've added a new API endpoint and Actions menu item that allows you to export unique values and counts across multiple fields. It's now easy to find all the HTTP hosts that a destination IP is serving. Calling this feature from the Actions menu in the UI exports the fields currently displayed (excluding the time and info columns). You can use previously saved column configs to switch between the data you want exported. See the demo video for more ideas.

If you are in the business of packet capture as part of your job in network security, join the Moloch community, use and help contribute to the project, and chat with us on Slack. To get started, check out our README and FAQ pages on GitHub. P.S. We're hiring security professionals, whom we lovingly call paranoids!

Efficient personal search at large scale January 21, 2019

Efficient personal search at large scale

Vespa includes a relatively unknown mode which provides personal search at massive scale for a fraction of the cost of alternatives: streaming search. In this article we explain streaming search and how to use it. Imagine you are tasked with building the next Gmail, a massive personal data store centered around search. How do you do it? An obvious answer is to just use a regular search engine, write all documents to a big index, and simply restrict queries to match documents belonging to a single user. This works, but the problem is cost. Successful personal data stores have a tendency to become massive - the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space for all this data, and paying for the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems like Gmail handle billions of writes per day, so this quickly becomes the dominating cost of the entire system. However, when you think about it, there's really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we just maintain a separate small index per user? This makes both index updates and queries cheaper, but leads to a new problem: writes will arrive randomly over all users, which means we'll need to read and write a user's index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale means either using a few thousand spinning disks or storing all data on SSDs. Both options are expensive. Large scale data stores already solve this problem for appending writes by using some variant of multilevel log storage.
Could we leverage this to layer the index on top of a data store like that? That helps, but it means we need to do our own development to put these systems together in a way that performs at scale, every time, for both queries and writes. And we still pay the cost of storing the indexes in addition to the raw user data. Do we need indexes at all, though? With some reflection, it turns out that we don't. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, of course at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time - what we call streaming search. In most cases, most subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.

Numbers

How many documents can be searched per node per second with this solution? Assuming a node with 500 MB/sec read speed (either from an SSD or multiple spinning disks) and 1k average compressed document size, the disk can search at most 500 MB/sec / 1k per doc = 500,000 docs/sec. If each user stores 1000 documents on average, this gives a max throughput per node of 500 queries/second. This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective. What about latency?
From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99th percentile latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a tradeoff between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node, such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total in a cluster we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.

Streaming search

All right - with this we have our cost-effective solution to implement the next Gmail: store just the raw data of users in a multilevel log store. Locate the data of each user on a single node in the system for locality (or, really, 2-3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be cheap and efficient, but it sounds like a lot of work! It sure would be nice if somebody already did all of it, ran it at large scale for years, and then released it as open source. Well, as luck would have it, we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa.
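The back-of-envelope throughput and latency numbers above are simple arithmetic, which can be written out as a quick sanity check (using the stated assumptions: 500 MB/sec read speed, 1 KB average compressed document size):

```java
// Back-of-envelope numbers for streaming search, as stated in the text.
public class StreamingSearchMath {
    static final double READ_BYTES_PER_SEC = 500e6; // 500 MB/sec sequential read
    static final double DOC_BYTES = 1e3;            // 1 KB average compressed document

    // Max documents searchable per node per second
    static double docsPerSec() {
        return READ_BYTES_PER_SEC / DOC_BYTES;      // 500,000 docs/sec
    }

    // Max queries/sec per node, given the average user's document count
    static double queriesPerSec(double docsPerUser) {
        return docsPerSec() / docsPerUser;          // 1,000 docs/user -> 500 qps
    }

    // Time to stream through one user's documents on one node, in ms
    static double latencyMs(double docsPerNode) {
        return docsPerNode * 1000.0 / docsPerSec(); // 50,000 docs -> 100 ms cap
    }
}
```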
When this solution is compared to indexed search in Vespa, or to more complicated sharding solutions in Elasticsearch, for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application, with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of mail, personal typeahead suggestions, personal image collections, and private forum group content.

Using streaming search on Vespa

The steps to using streaming search on Vespa are:
- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in id:mynamespace:mydocumenttype:g=user123:doc123.
- Pass the group id in queries by setting the query property streaming.groupname.

That's it! With those steps you have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any alternative out there, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details see the documentation on streaming search. Have fun with it, and as usual let us know what you are building!
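The steps above can be sketched minimally as follows. The document type "mail" and the ids are placeholders; consult the streaming search documentation for the full configuration:

```
<!-- services.xml (content cluster): enable streaming mode for the type -->
<documents>
    <document type="mail" mode="streaming"/>
</documents>
```

Documents are then fed with group ids embedded in the document id, e.g. id:mynamespace:mail:g=user123:doc123, and each query passes streaming.groupname=user123 as a query property so only that user's data is streamed.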

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays January 17, 2019

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

By Ashley Wolf, Open Source Program Manager, Verizon Media The second installment of Dash Open is ready for you to tune in! In this episode, Gil Yehuda, Sr. Director of Open Source at Verizon Media, interviews Dav Glass, Distinguished Architect of IaaS and Node.js at Verizon Media. Dav discusses how open source inspired him to start HackSI, a Hack Day for all ages, as well as robotics mentorship programs for the Southern Illinois engineering community. Listen now on iTunes or SoundCloud. Dash Open is your place for interesting conversations about open source and other technologies, from the open source program office at Verizon Media. Verizon Media is the home of many leading brands including Yahoo, Aol, Tumblr, TechCrunch, and many more. Follow us on Twitter @YDN and on LinkedIn.

Meta PR Comments January 10, 2019

Meta PR Comments

Screwdriver now supports commenting on pull requests through Screwdriver build meta. This feature allows users to add custom data, such as coverage results, to the Git pull request.

Screwdriver Users

To add a comment to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:

jobs:
  main:
    steps:
      - postdeploy: |
          meta set meta.summary.coverage "Coverage increased by 15%"
          meta set meta.summary.markdown "this markdown comment is **bold** and *italic*"

These commands result in a comment on the Git pull request.

Cluster Admins

In order to enable meta PR comments, you'll need to create a bot user in Git with a personal access token with the public_repo scope. In GitHub, create a new user, follow the instructions to create a personal access token, and set the scope as public_repo. Copy this token and set it as commentUserToken in the scms settings of your API config yaml. You need this headless user for commenting because GitHub requires the public_repo scope in order to comment on pull requests (https://github.community/t5/How-to-use-Git-and-GitHub/Why-does-GitHub-API-require-admin-rights-to-leave-a-comment-on-a/td-p/357). For more information about GitHub scopes, see https://developer.github.com/apps/building-oauth-apps/understanding-scopes-for-oauth-apps.

Compatibility List

In order to use the new meta PR comments feature, you will need this minimum version:
- API: v0.5.545

Contributors

Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Multiple Build Cluster January 3, 2019
January 3, 2019

Multiple Build Cluster

Screwdriver now supports running builds across multiple build clusters. This feature allows Screwdriver to provide a native hot/hot HA solution with multiple clusters on standby, and it also opens up the possibility for teams to run their builds in their own infrastructure.

Screwdriver Users

To specify a build cluster, configure your screwdriver.yaml using annotations as shown below:

jobs:
  main:
    annotations:
      screwdriver.cd/buildClusters: us-west-1
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]

Users can view a list of available build clusters at /v4/buildclusters. Without the annotation, Screwdriver assigns builds to a default cluster managed by the Screwdriver team. Users can assign their build to run in any cluster they have access to: the default cluster, or any external cluster their repo is allowed to use, as indicated by the scmOrganizations field. Contact your cluster admin if you want to onboard your own build cluster.

Cluster Admins

Screwdriver cluster admins can refer to the following design doc and feature issue to set up multiple build clusters properly.
- Design: https://github.com/screwdriver-cd/screwdriver/blob/master/design/build-clusters.md
- Feature issue: https://github.com/screwdriver-cd/screwdriver/issues/1319

Compatibility List

In order to use the new build clusters feature, you will need these minimum versions:
- API: v0.5.537
- Scheduler: v2.4.2
- Buildcluster-queue-worker: v1.1.3

Contributors

Thanks to the following people for making this feature possible:
- minz1027
- parthasl
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
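The assignment rules described above can be sketched roughly as follows. This is an illustrative model only, not Screwdriver's source: the cluster map and the resolve_build_cluster helper are hypothetical, but they capture the stated behavior (no annotation falls back to the default cluster; external clusters are gated by scmOrganizations):

```python
# Illustrative sketch of build-cluster resolution for a single build.
DEFAULT_CLUSTER = "default"

def resolve_build_cluster(annotations: dict, org: str, clusters: dict) -> str:
    """clusters maps cluster name -> list of allowed scmOrganizations
    (an empty list meaning the cluster is open to all repos)."""
    requested = annotations.get("screwdriver.cd/buildClusters")
    if requested is None:
        return DEFAULT_CLUSTER  # no annotation: managed default cluster
    allowed_orgs = clusters.get(requested)
    if allowed_orgs is None:
        raise ValueError(f"unknown build cluster: {requested}")
    if allowed_orgs and org not in allowed_orgs:
        raise ValueError(f"{org} may not use cluster {requested}")
    return requested

clusters = {"default": [], "us-west-1": ["myorg"]}
print(resolve_build_cluster({"screwdriver.cd/buildClusters": "us-west-1"}, "myorg", clusters))
print(resolve_build_cluster({}, "myorg", clusters))
```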

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More December 27, 2018
December 27, 2018

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

By Chris Larsen, Architect

OpenTSDB is one of the first dedicated open source time series databases built on top of Apache HBase and the Hadoop Distributed File System. Today, we are proud to share that version 2.4.0 is now available and has many new features developed in-house and with contributions from the open source community. This release would not have been possible without support from our monitoring team, the Hadoop and HBase developers, as well as contributors from other companies like Salesforce, Alibaba, JD.com, Arista, and more. Thank you to everyone who contributed to this release! A few of the exciting new features include:

Rollup and Pre-Aggregation Storage

As time series data grows, storing the original measurements becomes expensive. Particularly in the case of monitoring workflows, users rarely care about last year's high-fidelity data. It's more efficient to store lower resolution "rollups" for longer periods, discarding the original high-resolution data. OpenTSDB now supports storing and querying such data so that the raw data can expire from HBase or Bigtable while the rollups stick around longer. Querying over long time ranges will read from the lower resolution data, fetching fewer data points and speeding up queries. Likewise, when a user wants to query tens of thousands of time series grouped by, for example, data centers, the TSD will have to fetch and process a significant amount of data, making queries painfully slow. To improve query speed, pre-aggregated data can be stored and queried to fetch much less data at query time, while still retaining the raw data. We have an Apache Storm pipeline that computes these rollups and pre-aggregates, and we intend to open source that code in 2019. For more details, please visit http://opentsdb.net/docs/build/html/user_guide/rollups.html.
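The rollup idea can be illustrated with a toy downsampler. This is a sketch of the concept only, not OpenTSDB code; hourly_rollup is a hypothetical helper that collapses raw per-second samples into one aggregated value per hour, so long-range queries touch far fewer points:

```python
# Illustrative sketch: bucket raw (timestamp, value) samples into
# hourly rollups, the way a rollup pipeline might pre-compute them.
from collections import defaultdict

def hourly_rollup(points, agg=sum):
    """points: iterable of (unix_seconds, value).
    Returns {hour_start_seconds: aggregated_value}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)  # align to hour boundary
    return {hour: agg(values) for hour, values in buckets.items()}

raw = [(0, 1.0), (59, 2.0), (3600, 5.0), (7199, 7.0)]
print(hourly_rollup(raw))  # {0: 3.0, 3600: 12.0}
```

Swapping `agg` for `max`, `min`, or an average covers the usual rollup aggregators.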
Histograms and Sketches

When monitoring or performing data analysis, users often like to explore percentiles of their measurements, such as the 99.9th percentile of website request latency to detect issues and determine what consumers are experiencing. Popular metrics collection libraries will happily report percentiles for the data they collect. Yet while querying for the original percentile data for a single time series is useful, trying to query and combine the data from multiple series is mathematically incorrect, leading to errant observations and problems. For example, if you want the 99.9th percentile of latency in a particular region, you can't just sum or recompute the 99.9th of the 99.9th percentile. To solve this issue, we needed a complex data structure that can be combined to calculate an accurate percentile. One such structure that has existed for a long time is the bucketed histogram, where measurements are sliced into value ranges and each range maintains a count of measurements that fall into that bucket. These buckets can be sized based on the required accuracy, and the counts from multiple sources (sharing the same bucket ranges) can be combined to compute an accurate percentile. Bucketed histograms can be expensive to store for highly accurate data, as many buckets and counts are required. Additionally, many measurements don't have to be perfectly accurate but they should be precise. Thus another class of algorithms can be used to approximate the data via sampling and provide highly precise data with a fixed interval. Data scientists at Yahoo (now part of Oath) implemented a great Java library called Data Sketches that implements Stochastic Streaming Algorithms to reduce the amount of data stored for high-throughput services. Sketches have been a huge help for the OLAP storage system Druid (also sponsored by Oath) and Bullet, Oath's open source real-time data query engine.
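Why bucketed histograms merge correctly while raw percentiles do not can be shown with a small sketch (illustrative only; the bucket bounds and counts below are made up): counts in identical bucket ranges simply add, and an accurate bucket-level percentile can then be read off the combined distribution.

```python
# Illustrative sketch: merge bucketed latency histograms from two hosts
# and read a percentile from the combined counts.
BOUNDS = [10, 50, 100, 500, 1000]  # upper bound (ms) of each bucket

def merge(*histograms):
    """Element-wise sum of per-bucket counts (same bucket ranges assumed)."""
    return [sum(counts) for counts in zip(*histograms)]

def percentile(counts, p):
    """Return the upper bound of the bucket containing the p-th percentile."""
    rank = p / 100.0 * sum(counts)
    running = 0
    for bound, count in zip(BOUNDS, counts):
        running += count
        if running >= rank:
            return bound
    return BOUNDS[-1]

host_a = [50, 30, 15, 4, 1]  # per-bucket request counts from two hosts
host_b = [40, 40, 10, 9, 1]
print(percentile(merge(host_a, host_b), 99))
```

Averaging each host's own 99th percentile would not give this answer; only the merged counts do.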
The latest TSDB version supports bucketed histograms, Data Sketches, and T-Digests. Some additional features include:
- HBase Date Tiered Compaction support to improve storage efficiency.
- A new authentication plugin interface to support enterprise use cases.
- An interface to support fetching data directly from Bigtable or HBase rows using a search index such as ElasticSearch. This improves queries for small subsets of high cardinality data, and we're working on open sourcing our code for the ES schema.
- Greater UID cache controls and an optional LRU implementation to reduce the amount of JVM heap allocated to UID-to-string mappings.
- Configurable query size and time limits to avoid OOMing a JVM with large queries.

Try the releases on GitHub and let us know of any issues you run into by posting on GitHub issues or the OpenTSDB Forum. Your feedback is appreciated!

OpenTSDB 3.0

Additionally, we've started on 3.0, which is a rewrite that will support a slew of new features including:
- Querying and analyzing data from the plethora of new time series stores.
- A fully configurable query graph that allows for complex queries OpenTSDB 1.x and 2.x couldn't support.
- Streaming results to improve the user experience and avoid overwhelming a single query node.
- Advanced analytics including support for time series forecasting with Yahoo's EGADS library.

Please join us in testing out the current 3.0 code, reporting bugs, and adding features.

Musings from the 2nd Annual MolochON December 21, 2018
December 21, 2018

Musings from the 2nd Annual MolochON

By Andy Wick, Chief Architect, Oath & Elyse Rinne, Software Engineer, Oath

Last month, our Moloch team hosted the second all-day Moloch conference at our Dulles, Virginia campus. Moloch, the large-scale, full packet capturing, indexing, and database system, was developed by Andy Wick at AOL (now part of Oath) in 2011 and open-sourced in 2012. Elyse Rinne joined the Moloch team in 2016 to enhance the tool's front-end features. The project enjoys an active community of users and contributors. Most recently, on November 1, more than 80 Moloch users and developers joined the Moloch core team to discuss the latest features, administrative capabilities, and clever uses of Moloch. Speakers from Elastic, SANS, Cox, SecureOps, and Oath presented their experiences setting up and using Moloch in a variety of security-focused scenarios. Afterwards, the participants brainstormed new project features and enhancements. We ended with a happy hour, giving everyone a chance to relax and network. Although most of the talks were not recorded due to the sensitive topics related to blue team security tactics in some of the presentations, we do have these presentation recordings and slides that are cleared for the public:
- Recent Changes to Moloch - Video & Slides
- Moloch Deployments at Oath - Video & Slides
- Using WISE - Video & Slides
- All Presentations (including external and 2017 MolochON presentations)

If you are a blue team security professional, consider joining the Moloch community, use and help contribute to the project, and chat with us on Slack. To get started, check out our README and FAQ pages on GitHub. P.S. We're hiring security professionals, whom we lovingly call paranoids!

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping December 14, 2018
December 14, 2018

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping

Hi Vespa Community! Today we're kicking off a blog post series of need-to-know updates on Vespa, summarizing the features and fixes detailed in Github issues. We welcome your contributions and feedback about any new features or improvements you'd like to see. For December, we're excited to share the following product news:

Streaming Search Performance Improvement

Streaming Search is a solution for applications where each query searches only a small, statically determined subset of the corpus. In this case, Vespa searches without building reverse indexes, reducing storage cost and making writes more efficient. With the latest changes, the document type is used to further limit data scanning, resulting in lower latencies and higher throughput. Read more here.

ONNX Integration

ONNX is an open ecosystem for interchangeable AI models. Vespa now supports importing models in the ONNX format and transforming the models into Tensors for use in ranking. This adds to the TensorFlow import included earlier this year and allows Vespa to support many training tools. While Vespa's strength is real-time model evaluation over large datasets, to get started using single data points, try the stateless model evaluation API. Explore this integration more in Ranking with ONNX models.

Precise Transaction Log Pruning

Vespa is built for large applications running continuous integration and deployment. This means nodes restart often for software upgrades, and node restart time matters. A common pattern is serving while restarting hosts one by one. Vespa has optimized transaction log pruning with prepareRestart, which flushes as much as possible before stopping; this is quicker than replaying the same data after restarting. This feature is on by default. Learn more in live upgrade and prepareRestart.

Grouping on Maps

Grouping is used to implement faceting. Vespa has added support for grouping on map attribute fields, creating a group for values whose keys match the specified key, or for field values referenced by the key. This support is useful for creating indirections and relations in data and is great for use cases with structured data like e-commerce. Leverage key values instead of field names to simplify the search definition. Read more in Grouping on Map Attributes.

Questions or suggestions? Send us a tweet or an email.

A New Chapter for Omid December 6, 2018
December 6, 2018

A New Chapter for Omid

By Ohad Shacham, Yonatan Gottesman, and Edward Bortnikov, Scalable Systems Research, Verizon/Oath

Omid, an open source transaction processing platform for Big Data, was born as a research project at Yahoo (now part of Verizon), and became an Apache Incubator project in 2015. Omid complements Apache HBase, a distributed key-value store in the Apache Hadoop suite, with the capability to clip multiple operations into logically indivisible (atomic) units named transactions. This programming model has been extremely popular since the dawn of SQL databases, and has more recently become indispensable in the NoSQL world. For example, it is the centerpiece for dynamic content indexing of search and media products at Verizon, powering a web-scale content management platform since 2015. Today, we are excited to share a new chapter in Omid's history. Thanks to its scalability, reliability, and speed, Omid has been selected as the transaction management provider for Apache Phoenix, a real-time converged OLTP and analytics platform for Hadoop. Phoenix provides a standard SQL interface to HBase key-value storage, which is much simpler and in many cases more performant than the native HBase API. With Phoenix, big data and machine learning developers get the best of all worlds: increased productivity coupled with high scalability. Phoenix is designed to scale to 10,000 query processing nodes in one instance and is expected to process hundreds of thousands or even millions of transactions per second (tps). It is widely used in the industry, including by Alibaba, Bloomberg, PubMatic, Salesforce, Sogou, and many others. We have just released a new and significantly improved version of Omid (1.0.0), the first major release since its original launch. We have extended the system with multiple functional and performance features to power a modern SQL database technology, ready for deployment on both private and public cloud platforms.
A few of the significant innovations include:

Protocol re-design for low latency

The early version of Omid was designed for use in web-scale data pipeline systems, which are throughput-oriented by nature. We re-engineered Omid's internals to also support new ultra-low-latency OLTP (online transaction processing) applications, like messaging and algo-trading. The new protocol, Omid Low Latency (Omid LL), removes Omid's major architectural bottleneck. It reduces the latency of short transactions by 5x under light load, and by 10x to 100x under heavy load. It also scales the overall system throughput to 550,000 tps while remaining within real-time latency SLAs. Figure 1 (throughput vs. latency, for transaction sizes of 1 op and 10 ops) illustrates Omid LL scaling versus legacy Omid: the throughput scales beyond 550,000 tps while the latency remains flat, in the low milliseconds.

ANSI SQL support

Phoenix provides secondary indexes for SQL tables, a centerpiece tool for efficient access to data by multiple keys. The CREATE INDEX command is on-demand; it is not allowed to block already deployed applications. We added Omid support for accomplishing this without impeding concurrent database operations or sacrificing consistency. We further introduced a mechanism to avoid recursive read-your-own-writes scenarios in complex queries, like "INSERT INTO T … SELECT FROM T …" statements. This was achieved by extending Omid's traditional Snapshot Isolation consistency model, which provides single-read-point-single-write-point semantics, with multiple read and write points.

Performance improvements

Phoenix extensively employs stored procedures implemented as HBase filters in order to eliminate the overhead of multiple round-trips to the data store.
We integrated Omid's code within such HBase-resident procedures, allowing for a smooth integration with Phoenix and reducing the overhead of transactional reads (for example, by filtering out redundant data versions). We collaborated closely with the Phoenix developer community while working on this project, and contributed code to Phoenix that made Omid's integration possible. We look forward to seeing Omid's adoption through a wide range of Phoenix applications. We always welcome new developers to join the community and help push Omid forward!
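The single-read-point semantics that the new release extends can be pictured with a toy multi-version read. This is an illustration of the general Snapshot Isolation read rule, not Omid's implementation; the versions map and snapshot_read helper are hypothetical: a transaction that began at start_ts sees, for each key, the newest committed version written at or before start_ts.

```python
# Illustrative sketch of a Snapshot Isolation read over multi-version data.

def snapshot_read(versions, key, start_ts):
    """versions: {key: [(commit_ts, value), ...]} sorted by commit_ts.
    Return the newest value committed at or before start_ts, else None."""
    visible = [value for ts, value in versions.get(key, []) if ts <= start_ts]
    return visible[-1] if visible else None

versions = {"row1": [(5, "a"), (12, "b"), (20, "c")]}
print(snapshot_read(versions, "row1", 15))  # b  (the ts=20 version is invisible)
print(snapshot_read(versions, "row1", 3))   # None (no version existed yet)
```

The multiple-read-and-write-point extension described above relaxes exactly this rule, so a statement like "INSERT INTO T … SELECT FROM T …" can avoid recursively reading its own writes.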

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th November 27, 2018
November 27, 2018

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

Hi Vespa Community, If you are in Seattle on November 29th, please join Jon Bratseth (Distinguished Architect, Oath) at a machine learning meetup hosted by Zillow. Jon will share a Vespa overview and answer any questions about Oath’s open source big data serving engine. Eric Ringger (Director of Machine Learning for Personalization, Zillow) will discuss some of the models used to help users find homes, including collaborative filtering, a content-based model, and deep learning. Learn more and RSVP here. Hope you can join! The Vespa Team

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference November 26, 2018
November 26, 2018

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

By Ganesh Harinath, VP Engineering, AI Platform & Applications, Oath

If you're attending the upcoming Telco Data Analytics and AI Conference in San Francisco, make sure to join my keynote talk. I'll be presenting "Building a Terabyte Scale Machine Learning Application" on November 28th at 10:10 am PST. You'll learn how Oath builds AI platforms at scale. My presentation will focus on our approach and experience at Oath in architecting and using frameworks to build machine learning models at terabyte scale, in near real-time. I'll also highlight Trapezium, an open source framework based on Spark, developed by Oath's Big Data and Artificial Intelligence (BDAI) team. I hope to catch you at the conference. If you would like to connect, reach out to me. If you're unable to attend the conference and are curious about the topics shared in my presentation, follow @YDN on Twitter and we'll share highlights during and after the event.

Introducing the Dash Open Podcast, sponsored by Yahoo Developer... November 19, 2018
November 19, 2018

Introducing the Dash Open Podcast, sponsored by Yahoo Developer...

Introducing the Dash Open Podcast, sponsored by Yahoo Developer Network By Ashley Wolf, Principal Technical Program Manager, Oath Is open source the wave of the future, or has it seen its best days already? Which Big Data and AI trends should you be aware of and why? What is 5G and how will it impact the apps you enjoy using? You’ve got questions and we know smart people; together we’ll get answers. Introducing the Dash Open podcast, sponsored by the Yahoo Developer Network and produced by the Open Source team at Oath. Dash Open will share interesting conversations about tech and the people who spend their day working in tech. We’ll look at the state of technology through the lens of open source; keeping you up-to-date on the trends we’re seeing across the internet. Why Dash Open? Because it’s like a command line argument reminding the command to be open. What can you expect from Dash Open? Interviews with interesting people, occasional witty banter, and a catchy theme song. In the first episode, Rosalie Bartlett, Open Source community manager at Oath, interviews Gil Yehuda, Senior Director of Open Source at Oath. Tune in to hear one skeptic’s journey from resisting the open source movement to heading one of the more prolific Open Source Program Offices (OSPO). Gil highlights the benefits of open source to companies and provides actionable advice on how technology companies can start or improve their OSPO. Give Dash Open a listen and tell us what topics you’d like to hear next. – Ashley Wolf manages the Open Source Program at Oath/Verizon Media Group.

Git Shallow Clone November 12, 2018
November 12, 2018

Git Shallow Clone

Previously, Screwdriver would clone the entire commit tree of a Git repository. In most cases, this was unnecessary since most builds only require the latest single commit. For repositories containing immense commit trees, this behavior led to unnecessarily long build times. To address this issue, Screwdriver now defaults to shallow cloning Git repositories with a depth of 50. Screwdriver also enables the --no-single-branch flag by default in order to enable access to other branches in the repository. To disable shallow cloning, simply set the GIT_SHALLOW_CLONE environment variable to false.

Example

jobs:
  main:
    environment:
      GIT_SHALLOW_CLONE: false
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]

Here is a comparison of the build speed improvement for a repository containing over 160k commits.

Before:
After:

For more information, please consult the Screwdriver V4 FAQ.

Compatibility List

In order to use the git shallow clone feature, you will need these minimum versions:
- screwdrivercd/screwdriver: v0.5.501

Contributors

Thanks to the following people for making this feature possible:
- Filbird

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Hadoop Contributors Meetup at Oath November 8, 2018
November 8, 2018

Hadoop Contributors Meetup at Oath

By Scott Bush, Director, Hadoop Software Engineering, Oath

On Tuesday, September 25, we hosted a special day-long Hadoop Contributors Meetup at our Sunnyvale, California campus. Much of the early Hadoop development work started at Yahoo, now part of Oath, and has continued over the past decade. Our campus was the perfect setting for this meetup, as we continue to make Hadoop a priority. More than 80 Hadoop users, contributors, committers, and PMC members gathered to hear talks on key issues facing the Hadoop user community. Speakers from Ampool, Cloudera, Hortonworks, Microsoft, Oath, and Twitter detailed some of the challenges and solutions pertinent to their parts of the Hadoop ecosystem. The talks were followed by a number of parallel birds-of-a-feather breakout sessions to discuss HDFS, Tez, containers, and low latency processing. The day ended with a reception and consensus that the event went well and should be repeated in the near future. Presentation recordings (YouTube playlist) and slides (links included in the video descriptions) are available here:
- Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda Tan, Hortonworks
- Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara, Botong Huang
- HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
- The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
- Moving the Oath Grid to Docker, Eric Badger, Software Developer Engineer, Oath
- Vespa: Open Source Big Data Serving Engine, Jon Bratseth, Distinguished Architect, Oath
- Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shane Kumpf, Hortonworks
- How Twitter Hadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu

Thank you to all the presenters and the attendees, both in person and remote! P.S. We're hiring! Learn more about career opportunities at Oath.

Build Cache November 7, 2018
November 7, 2018

Build Cache

Screwdriver now has the ability to cache and restore files and directories from your builds for use in other builds! This feature gives you the option to cache artifacts in builds using Gradle, NPM, Maven, etc., so subsequent builds can save time on commonly-run steps such as dependency installation and package build. You can now specify a top-level setting in your screwdriver.yaml called cache that contains file paths from your build that you would like to cache. You can limit access to the cache at a pipeline, event, or job-level scope.

Scope guide
- pipeline-level: all builds in the same pipeline (across different jobs and events)
- event-level: all builds in the same event (across different jobs)
- job-level: all builds for the same job (across different events in the same pipeline)

Example

cache:
  event:
    - $SD_SOURCE_DIR/node_modules
  pipeline:
    - ~/.gradle
  job:
    test-job: [/tmp/test]

In the above example, we cache the .gradle folder so that subsequent builds in the pipeline can save time on gradle install.

Without cache:
With cache:

Compatibility List

In order to use the new build cache feature, you will need these minimum versions:
- screwdrivercd/queue-worker: v2.2.2
- screwdrivercd/screwdriver: v0.5.492
- screwdrivercd/launcher: v5.0.37
- screwdrivercd/store: v3.3.11

Note: Please ensure the store service has sufficient available memory to handle the payload. For cache cleanup, we use AWS S3 Lifecycle Management. If your store service is not configured to use S3, you might need to add a cleanup mechanism.

Contributors

Thanks to the following people for making this feature possible:
- d2lam
- pranavrc

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support
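One way to picture the three scopes is as storage-key prefixes that determine which builds share a cache entry. This is an assumption for illustration, not Screwdriver's actual storage layout; the cache_key helper and key format are hypothetical:

```python
# Illustrative sketch: derive a cache storage key from the declared scope,
# so builds that share the scope share the cached path.

def cache_key(scope: str, pipeline_id: int, event_id: int, job_id: int, path: str) -> str:
    if scope == "pipeline":   # shared across all jobs and events in the pipeline
        prefix = f"pipelines/{pipeline_id}"
    elif scope == "event":    # shared across jobs within one event
        prefix = f"pipelines/{pipeline_id}/events/{event_id}"
    elif scope == "job":      # shared across events for one job
        prefix = f"pipelines/{pipeline_id}/jobs/{job_id}"
    else:
        raise ValueError(f"unknown cache scope: {scope}")
    return f"{prefix}/{path.strip('/')}"

# Two builds of different jobs in the same event resolve to the same
# event-scoped key, so the second build restores what the first cached:
print(cache_key("event", 42, 7, 1, "node_modules"))
print(cache_key("event", 42, 7, 2, "node_modules"))  # same key
```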

Announcing the 2nd Annual Moloch Conference: Learn how to augment your current security infrastructure October 29, 2018
October 29, 2018

Announcing the 2nd Annual Moloch Conference: Learn how to augment your current security infrastructure

We're excited to share that the 2nd Annual MolochON will be Thursday, Nov. 1, 2018 in Dulles, Virginia, at the Oath campus. Moloch is a large-scale, open source, full packet capturing, indexing, and database system. There's no cost to attend the event and we'd love to see you there! Feel free to register here. We'll be joined by many fantastic speakers from the Moloch community to present on the following topics:

Moloch: Recent Changes & Upcoming Features by Andy Wick, Sr Princ Architect, Oath & Elyse Rinne, Software Dev Engineer, Oath

Since the last MolochON, many new features have been added to Moloch. We will review some of these features and demo how to use them. We will also discuss a few desired upcoming features.

Speaker Bios
Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn't looked back.
Elyse is the UI and full stack engineer for Moloch. She revamped the UI to be more user-friendly and maintainable. Now that the revamp has been completed, Elyse is working on implementing awesome new Moloch features!

Small Scale at Large Scale: Putting Moloch on the Analyst's Desk by Phil Hagen, SANS Senior Instructor, DFIR Strategist, Red Canary

I've been excited to add Moloch to the FOR572 class, Advanced Network Forensics, at the SANS Institute. In FOR572, we cover Moloch with nearly 1,000 students per year, via classroom discussions and hands-on labs. This presents an interesting engineering problem, in that we provide a self-contained VMware image for the classroom lab, but it is also suitable for use in forensic casework. In this talk, I'll cover some of what we did to make a single VM into a stable and predictable environment, distributed to hundreds of students across the world.

Speaker Bio
Phil is a Senior Instructor with the SANS Institute and the DFIR Strategist at Red Canary.
He is the course lead for SANS FOR572, Advanced Network Forensics, and has been in the information security industry for over 20 years. Phil is also the lead for the SOF-ELK project, which provides a free, open source, ready-to-use Elastic Stack appliance to aid and optimize security operations and forensic processing. Networking is in his blood, dating back to a 2400 baud modem in an Apple //e, which he still has. Oath Deployments by Andy Wick, Sr Princ Architect, Oath The formation of Oath gave us an opportunity to rethink and create a new visibility stack. In this talk, we will be sharing our process for designing our stack for both office and data center deployments and discussing the technologies we decided to use. Speaker Bio Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back. Centralized Management and Deployment with Docker and Ansible by Taylor Ashworth, Cybersecurity Analyst I will focus on how to use Docker and Ansible to deploy, update, and manage Moloch along with other tools like Suricata, WISE, and ES. I will explain the time-saving benefits of Ansible and the workload reduction benefits of Docker,and I will also cover the topic “Pros and cons of using Ansible tower/AWX over Ansible in CLI.” If time permits, I’ll discuss “Using WISE for data enrichment.” Speaker Bio Taylor is a cybersecurity analyst who was tired of the terrible tools he was presented with and decided to teach himself how to set up tools to successfully do his job. Automated Threat Intel Investigation Pipeline by Matt Carothers, Principal Security Architect, Cox Communications I will discuss integrating Moloch into an automated threat intel investigation pipeline with MISP. Speaker Bio Matt enjoys sunsets, long hikes in the mountains and intrusion detection. 
After studying Computer Science at the University of Oklahoma, he accepted a position with Cox Communications in 2001 under the leadership of renowned thought leader and virtuoso bass player William “Wild Bill” Beesley, who asked to be credited in this bio. There, Matt formed Cox’s abuse department, which he led for several years, and today he serves as Cox’s Principal Security Architect. Using WISE by Andy Wick, Sr Princ Architect, Oath We will review how to use WISE and provide real-life examples of features added since the last MolochON. Speaker Bio Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back. Moloch Deployments by Srinath Mantripragada, Linux Integrator, SecureOps I will present a Moloch deployment with 20+ different Moloch nodes. A range will be presented, including small, medium, and large deployments that go from full hardware with dedicated capture cards to virtualized point-of-presence and AWS with transit network. All nodes run Moloch, Suricata and Bro. Speaker Bio Srinath has worked as a SysAdmin and related positions for most of his career. He currently works as an Integrator/SysAdmin/DevOps for SecureOps, a Security Services company in Montreal, Canada. Elasticsearch for Time-series Data at Scale by Andrew Selden, Solution Architect, Elastic Elasticsearch has evolved beyond search and logging to be a first-class, time-series metric store. This talk will explore how to achieve 1 million metrics/second on a relatively modest cluster. We will take a look at issues such as data modeling, debugging, tuning, sharding, rollups and more. Speaker Bio Andrew Selden has been running Elasticsearch at scale since 2011 where he previously led the search, NLP, and data engineering teams at Meltwater News and later developed streaming analytics solutions for BlueKai’s advertising platform (acquired by Oracle). 
He started his tenure at Elastic as a core engineer and for the last two years has been helping customers architect and scale. After the conference, enjoy a complimentary happy hour, sponsored by Arista. Hope to see you there!


October 29, 2018
Sharing Vespa at the SF Big Analytics Meetup October 19, 2018

Sharing Vespa at the SF Big Analytics Meetup

By Jon Bratseth, Distinguished Architect, Oath I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here):

Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup October 17, 2018

Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup

By Jon Bratseth, Distinguished Architect, Oath
I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here): Largely developed by Yahoo engineers, Vespa is our big data processing and serving engine, available as open source on GitHub. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance and Oath Ads Platforms. Vespa use is growing even more rapidly; since it is open source under a permissive Apache license, Vespa can power other external third-party apps as well. A great example is Zedge, which uses Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS, and Web). Zedge uses Vespa in production to serve millions of monthly active users. Visit https://vespa.ai/ to learn more and download the code. We encourage code contributions and welcome opportunities to collaborate.

Open-Sourcing Panoptes, Oath’s distributed network telemetry collector October 4, 2018

Open-Sourcing Panoptes, Oath’s distributed network telemetry collector

By Ian Flint, Network Automation Architect, and Varun Varma, Senior Principal Engineer
The Oath network automation team is proud to announce that we are open-sourcing Panoptes, a distributed system for collecting, enriching and distributing network telemetry. We developed Panoptes to address several issues inherent in legacy polling systems, including overpolling due to multiple point solutions for metrics, a lack of data normalization, and inconsistent data enrichment and integration with infrastructure discovery systems. Panoptes is a pluggable, distributed, high-performance data collection system which supports multiple polling formats, including SNMP and vendor-specific APIs. It is also extensible to support emerging streaming telemetry standards, including gNMI.
Architecture
The following block diagram shows the major components of Panoptes. Panoptes is written primarily in Python, and leverages multiple open-source technologies to provide the most value for the least development effort. At the center of Panoptes is a metrics bus implemented on Kafka. All data plane transactions flow across this bus; discovery publishes devices to the bus, polling publishes metrics to the bus, and numerous clients read the data off of the bus for additional processing and forwarding. This architecture enables easy data distribution and integration with other systems. For example, in preparing for open source, we identified a need for a generally available time series datastore. We developed, tested and released a plugin to push metrics into InfluxDB in under a week. This flexibility allows Panoptes to evolve with industry standards. Check scheduling is accomplished using Celery, a horizontally scalable, open-source scheduler utilizing a Redis data store. Celery’s scalable nature combined with Panoptes’ distributed nature yields excellent scalability.
Across Oath, Panoptes currently runs hundreds of thousands of checks per second, and the infrastructure has been tested to more than one million checks per second. Panoptes ships with a simple, CSV-based discovery system. Integrating Panoptes with a CMDB is as simple as writing an adapter to emit a CSV, and importing that CSV into Panoptes. From there, Panoptes will manage the task of scheduling polling for the desired devices. Users can also develop custom discovery plugins to integrate with their CMDB and other device inventory data sources. Finally, any metrics gathering system needs a place to send the metrics. Panoptes’ initial release includes an integration with InfluxDB, an industry-standard time series store. Combined with Grafana and the InfluxData ecosystem, this gives teams the ability to quickly set up a fully-featured monitoring environment. Deployment at Oath At Oath, we anticipate significant benefits from building Panoptes. We will consolidate four siloed polling solutions into one, reducing overpolling and the associated risk of service interruption. As vendors move toward streaming telemetry, Panoptes’ flexible architecture will minimize the effort required to adopt these new protocols. There is another, less obvious benefit to a system like Panoptes. As is the case with most large enterprises, a massive ecosystem of downstream applications has evolved around our existing polling solutions. Panoptes allows us to continue to populate legacy datastores without continuing to run the polling layers of those systems. This is because Panoptes’ data bus enables multiple metrics consumers, so we can send metrics to both current and legacy datastores. At Oath, we have deployed Panoptes in a tiered, federated model. We install the software in each of our major data centers and proxy checks out to smaller installations such as edge sites.  
All metrics are polled from an instance close to the devices, and metrics are forwarded to a centralized time series datastore. We have also developed numerous custom applications on the platform, including a load balancer monitor, a BGP session monitor, and a topology discovery application. The availability of a flexible, extensible platform has greatly reduced the cost of producing robust network data systems. Easy Setup Panoptes’ open-source release is packaged for easy deployment into any Linux-based environment. Deployment is straightforward, so you can have a working system up in hours, not days. We are excited to share our internal polling solution and welcome engineers to contribute to the codebase, including contributing device adapters, metrics forwarders, discovery plugins, and any other relevant data consumers.   Panoptes is available at https://github.com/yahoo/panoptes, and you can connect with our team at network-automation@oath.com.
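The CMDB integration described above (emit a CSV, import it into Panoptes) can be sketched as a small adapter. Note this is an illustrative Python sketch only: the column names and the device records below are assumptions made for the example, not the actual Panoptes discovery schema, so consult the project documentation for the real format.

```python
import csv
import io

# Hypothetical CMDB records; in a real deployment these would come from
# your inventory system's API. Column names here are illustrative.
DEVICES = [
    {"hostname": "edge-router-1", "ip": "192.0.2.10", "site": "dc1", "device_type": "router"},
    {"hostname": "tor-switch-7", "ip": "192.0.2.22", "site": "dc2", "device_type": "switch"},
]

def emit_discovery_csv(devices, fileobj):
    """Write CMDB records as a CSV that a file-based discovery plugin could read."""
    fields = ["hostname", "ip", "site", "device_type"]
    writer = csv.DictWriter(fileobj, fieldnames=fields)
    writer.writeheader()
    for device in devices:
        writer.writerow({k: device[k] for k in fields})

buf = io.StringIO()
emit_discovery_csv(DEVICES, buf)
print(buf.getvalue().splitlines()[0])  # header row
```

In practice the adapter would query the CMDB and write the CSV to whatever path the discovery plugin is configured to watch; Panoptes then schedules polling for the listed devices.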

Configurable Build Resources October 2, 2018

Configurable Build Resources

We’ve expanded build resource configuration options for Screwdriver! Screwdriver allows users to specify varying tiers of build resources via annotations. Previously, users were able to configure cpu and ram between the three tiers: micro, low (default), and high. In our recent change, we are introducing a new configurable resource, disk, which can be set to either low (default) or high. Furthermore, we are adding an extra tier, turbo, to both the cpu and ram resources! Please note that although Screwdriver provides default values for each tier, their actual values are determined by the cluster admin.
Screwdriver Users
In order to use these new settings, Screwdriver users can configure their screwdriver.yamls using annotations as shown below:
Example:
jobs:
  main:
    annotations:
      screwdriver.cd/cpu: TURBO
      screwdriver.cd/disk: HIGH
      screwdriver.cd/ram: MICRO
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Cluster Admins
Screwdriver cluster admins can refer to the following issues to set up turbo and disk resources properly:
- Turbo resources: https://github.com/screwdriver-cd/screwdriver/issues/1318#issue-364993739
- Disk resources: https://github.com/screwdriver-cd/screwdriver/issues/757#issuecomment-425589405
Compatibility List
In order to use these new features, you will need these minimum versions:
- screwdrivercd/queue-worker: v2.2.2
Contributors
Thanks to the following people for making this feature possible:
- Filbird
- minz1027
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Apache Pulsar graduates to Top-Level Project September 25, 2018

Apache Pulsar graduates to Top-Level Project

By Joe Francis, Director, Storage & Messaging
We’re excited to share that The Apache Software Foundation announced today that Apache Pulsar has graduated from the incubator to a Top-Level Project. Apache Pulsar is an open-source distributed pub-sub messaging system, created by Yahoo in June 2015 and submitted to the Apache Incubator in June 2017. Apache Pulsar is integral to the streaming data pipelines supporting Oath’s core products, including Yahoo Mail, Yahoo Finance, Yahoo Sports and Oath Ad Platforms. It handles hundreds of billions of data events each day and is an integral part of our hybrid cloud strategy. It enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds. Oath continues to support Apache Pulsar, with contributions including best-effort messaging, load balancer and end-to-end encryption. With growing data needs handled by Apache Pulsar at Oath, we’re focused on reducing memory pressure in brokers and bookkeepers, and creating additional connectors to other large-scale systems. Apache Pulsar’s future is bright and we’re thrilled to be part of this great project and community. P.S. We’re hiring! Learn more here.

Pipeline pagination on the Search page September 20, 2018

Pipeline pagination on the Search page

We’ve recently added pagination to the pipelines on the Search page! Before pipeline pagination, when a user visited the Search page (e.g. /search), all pipelines were fetched from the API and sorted alphabetically in the UI. In order to improve the total page load time, we moved the burden of pagination from the UI to the API. Now, when a user visits the Search page, only the first page of pipelines is fetched by default. Clicking the Show More button triggers the fetching of the next page of pipelines. All the pagination and search logic has moved to the datastore, so the overall load time for fetching a page of search results is now under 2 seconds, compared to before, when some search queries could take more than 10 seconds.
Screwdriver Cluster Admins
In order to use these latest changes fully, Screwdriver cluster admins will need to run some SQL queries to migrate data from scmRepo to the new name field. This name field will be used for sorting and searching in the Search UI.
Without migrating
If no migration is done, pipelines will show up sorted by id in the Search page. Pipelines will not be returned in search results until a sync or update is done on them (either directly from the UI or by interacting with the pipeline in some way in the UI).
Steps to migrate
1. Pull in the new API (v0.5.466). This is necessary for the name column to be created in the DB.
2. Take a snapshot or back up your DB.
3. Set the pipeline name. This requires two calls in postgres: one to extract the pipeline name data, and a second to remove the curly braces ({ and }) injected by the regexp call. In postgresql, run:
UPDATE public.pipelines SET name = regexp_matches("scmRepo", '.*name":"(.*)",.*')
UPDATE public.pipelines SET name = btrim(name, '{}')
4. Pull in the new UI (v1.0.331).
5. Optionally, you can post a banner to let users know they might need to sync their pipelines if they are not showing up in search results. Make an API call to POST /banners with proper auth and a body like:
{ "message": "If your pipeline is not showing up in Search results, go to the pipeline Options tab and Sync the pipeline.", "isActive": true, "type": "info" }
Compatibility List
The Search page pipeline pagination requires the following minimum versions of Screwdriver:
- API: v0.5.466
- UI: v1.0.331
Contributors
Thanks to the following people who made this feature possible:
- tkyi
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing HaloDB, a fast, embedded key-value storage engine written in Java September 19, 2018

Introducing HaloDB, a fast, embedded key-value storage engine written in Java

By Arjun Mannaly, Senior Software Engineer
At Oath, multiple ad platforms use a high throughput, low latency distributed key-value database that runs in data centers all over the world. The database stores billions of records and handles millions of read and write requests per second at millisecond latencies. The data we have in this database must be persistent, and the working set is larger than what we can fit in memory. Therefore, a key component of the database performance is a fast storage engine. Our current solution had served us well, but it was primarily designed for a read-heavy workload, and its write throughput started to become a bottleneck as write traffic increased. There were other concerns as well: it took hours to repair a corrupted DB, or to iterate over and delete records, and the storage engine didn’t expose enough operational metrics. The primary concern, though, was the write performance, which, based on our projections, would have been a major obstacle for scaling the database. With these concerns in mind, we began searching for an alternative solution. We searched for a key-value storage engine capable of dealing with IO-bound workloads, with submillisecond read latencies under high read and write throughput. After concluding our research and benchmarking alternatives, we didn’t find a solution that worked for our workload, and thus we were inspired to build HaloDB. Now, we’re glad to announce that it’s also open source and available to use under the terms of the Apache license. HaloDB has given our production boxes a 50% improvement in write capacity while consistently maintaining a submillisecond read latency at the 99th percentile.
Architecture
HaloDB primarily consists of append-only log files on disk and an index of keys in memory. All writes are sequential writes which go to an append-only log file, and the file is rolled over once it reaches a configurable size.
Older versions of records are removed to make space by a background compaction job. The in-memory index in HaloDB is a hash table which stores all keys and their associated metadata. The size of the in-memory index, depending on the number of keys, can be quite large; hence, for performance reasons, it is stored outside the Java heap, in native memory. When looking up the value for a key, the corresponding metadata is first read from the in-memory index and then the value is read from disk. Each lookup request requires at most a single read from disk.
Performance
The chart below shows the results of performance tests with real production data. The read requests were kept at 50,000 QPS while the write QPS was increased. HaloDB scaled very well as we increased the write QPS while consistently maintaining submillisecond read latencies at the 99th percentile. The chart below shows the 99th percentile latency from a production server before and after migration to HaloDB. If HaloDB sounds like a helpful solution to you, please feel free to use it, open issues, and contribute!
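The append-only-log-plus-in-memory-index design described above can be conveyed with a toy sketch. To be clear about assumptions: HaloDB itself is written in Java, keeps its index off the JVM heap, rolls log files over, and compacts stale records in the background; none of that is modeled below, and the record layout is invented for the example. The Python sketch only shows the core read/write path.

```python
import os
import struct
import tempfile

class TinyHaloSketch:
    """Toy sketch: sequential appends to a single log file, with an
    in-memory dict mapping each key to the offset of its latest value,
    so a lookup needs at most one disk read."""

    def __init__(self, path):
        self.log = open(path, "ab+")
        self.index = {}  # key -> (offset, length) of the latest value

    def put(self, key: bytes, value: bytes):
        self.log.seek(0, os.SEEK_END)
        # Record layout (invented for this sketch): key len, value len, key, value.
        self.log.write(struct.pack(">II", len(key), len(value)) + key + value)
        offset = self.log.tell() - len(value)  # value is the last bytes written
        self.index[key] = (offset, len(value))  # older record becomes compaction garbage

    def get(self, key: bytes):
        if key not in self.index:
            return None
        offset, length = self.index[key]  # metadata comes from memory...
        self.log.seek(offset)
        return self.log.read(length)      # ...then a single read from disk

path = os.path.join(tempfile.mkdtemp(), "data.log")
db = TinyHaloSketch(path)
db.put(b"user:1", b"alice")
db.put(b"user:1", b"bob")  # supersedes the first record
print(db.get(b"user:1"))
```

Because every write is a sequential append and the index lives entirely in memory, writes never seek and reads touch the disk at most once, which is the property the HaloDB numbers above reflect.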

Join us in San Francisco on September 26th for a Meetup September 18, 2018

Join us in San Francisco on September 26th for a Meetup

Hi Vespa Community, Several members from our team will be traveling to San Francisco on September 26th for a meetup and we’d love to chat with you there. Jon Bratseth (Distinguished Architect) will present a Vespa overview and answer any questions. To learn more and RSVP, please visit: https://www.meetup.com/SF-Big-Analytics/events/254461052/. Hope to see you! The Vespa Team

Build step logs download September 18, 2018

Build step logs download

Downloading Step Logs
We have added a Download button in the top right corner of the build log console. Upon clicking the button, the browser will query all (or the rest) of the log content from our API and compose a client-side downloadable text blob by leveraging the URL.createObjectURL() Web API.
Minor Improvement on Workflow Graph
Thanks to s-yoshika, the link edge is no longer covering the name text of the build node. Also, build job names that exceed 20 characters are automatically ellipsized to avoid being clipped by the containing DOM element.
Compatibility List
These UI improvements require the following minimum versions of Screwdriver:
- screwdrivercd/ui: v1.0.329
Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- s-yoshika
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing Oak: an Open Source Scalable Key-Value Map for Big Data Analytics September 13, 2018

Introducing Oak: an Open Source Scalable Key-Value Map for Big Data Analytics

By Dmitry Basin, Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, Hagar Meir, Gali Sheffi
Real-time analytics applications are on the rise. Modern decision support and machine intelligence engines strive to continuously ingest large volumes of data while providing up-to-date insights with minimum delay. For example, in Flurry Analytics, an Oath service which provides mobile developers with rich tools to explore user behavior in real time, it only takes seconds to reflect the events that happened on mobile devices in its numerous dashboards. The scalability demand is immense – as of late 2017, the Flurry SDK was installed on 2.6B devices and monitored 1M+ mobile apps. Mobile data hits the Flurry backend at a huge rate, updates statistics across hundreds of dimensions, and becomes queryable immediately. Flurry harnesses the open-source distributed interactive analytics engine named Druid to ingest data and serve queries at this massive rate. In order to minimize delays before data becomes available for analysis, technologies like Druid should avoid maintaining separate systems for data ingestion and query serving, and instead strive to do both within the same system. Doing so is nontrivial since one cannot compromise on overall correctness when multiple conflicting operations execute in parallel on modern multi-core CPUs. A promising approach is using concurrent data structure (CDS) algorithms which adapt traditional data structures to multiprocessor hardware. CDS implementations are thread-safe – that is, developers can use them exactly as sequential code while maintaining strong theoretical correctness guarantees. In recent years, CDS algorithms enabled dramatic application performance scaling and became popular programming tools. For example, Java programmers can use the ConcurrentNavigableMap JDK implementations for the concurrent ordered key-value map abstraction that is instrumental in systems like Druid.
Today, we are excited to share Oak, a new open source project from Oath, available under the Apache License 2.0. The project was created by the Scalable Systems team at Yahoo Research. It extends upon our earlier research work, named KiWi. Oak is a Java package that implements OakMap – a concurrent ordered key-value map. OakMap’s API is similar to Java’s ConcurrentNavigableMap. Java developers will find it easy to switch most of their applications to it. OakMap provides the safety guarantees specified by ConcurrentNavigableMap’s programming model. However, it scales with the RAM and CPU resources well beyond the best-in-class ConcurrentNavigableMap implementations. For example, it compares favorably to Doug Lea’s seminal ConcurrentSkipListMap, which is used by multiple big data platforms, including Apache HBase, Druid, EVCache, etc. Our benchmarks show that OakMap harnesses 3x more memory, and runs 3x-5x faster on analytics workloads. OakMap’s implementation is very different from traditional implementations such as  ConcurrentSkipListMap. While the latter maintains all keys and values as individual Java objects, OakMap stores them in very large memory buffers allocated beyond the JVM-managed memory heap (hence the name Oak - abbr. Off-heap Allocated Keys). The access to the key-value pairs is provided by a lightweight two-level on-heap index. At its lower level, the references to keys are stored in contiguous chunks, each responsible for a distinct key range. The chunks themselves, which dominate the index footprint, are accessed through a lightweight top-level ConcurrentSkipListMap. The figure below illustrates OakMap’s data organization. OakMap structure. The maintenance of OakMap’s chunked index in a concurrent setting is the crux of its complexity as well as the key for its efficiency. Experiments have shown that our algorithm is advantageous in multiple ways: 1. Memory scaling. 
OakMap’s custom off-heap memory allocation alleviates the garbage collection (GC) overhead that plagues Java applications. Despite the permanent progress, modern Java GC algorithms do not practically scale beyond a few tens of GBs of memory, whereas OakMap scales beyond 128GB of off-heap RAM. 2. Query speed. The chunk-based layout increases data locality, which speeds up both single-key lookups and range scans. All queries enjoy efficient, cache-friendly access, in contrast with permanent dereferencing in object-based maps. On top of these basic merits, OakMap provides safe direct access to its chunks, which avoids an extra copy for rebuilding the original key and value objects. Our benchmarks demonstrate OakMap’s performance benefits versus ConcurrentSkipListMap: A) Up to 2x throughput for ascending scans. B) Up to 5x throughput for descending scans. C) Up to 3x throughput for lookups. 3. Update speed. Beyond avoiding the GC overhead typical for write-intensive workloads, OakMap optimizes the incremental maintenance of big complex values – for example, aggregate data sketches, which are indispensable in systems like Druid. It adopts in situ computation on objects embedded in its internal chunks to avoid unnecessary data copy, yet again. In our benchmarks, OakMap achieves up to 1.8x data ingestion rate versus ConcurrentSkipListMap. With key-value maps being an extremely generic abstraction, it is easy to envision a variety of use cases for OakMap in large-scale analytics and machine learning applications – such as unstructured key-value storage, structured databases, in-memory caches, parameter servers, etc. For example, we are already working with the Druid community on rebuilding Druid’s core Incremental Index component around OakMap, in order to boost its scalability and performance. We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. 
If you have any questions, please feel free to send us a note on the Oak developers list: oakproject@googlegroups.com. It would be great to hear from you!
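The two-level chunked index at the heart of OakMap can be conveyed with a small single-threaded sketch. This is a conceptual illustration only: OakMap’s real implementation is a concurrent Java data structure with off-heap key/value buffers and a ConcurrentSkipListMap at the top level, whereas the Python toy below just shows how a sorted top-level index routes lookups to chunks that each cover a distinct key range; the chunk capacity and split policy are invented for the example.

```python
import bisect

class ChunkedMapSketch:
    """Single-threaded sketch of a two-level chunked map: a sorted list of
    chunk minimum keys routes each key to the one chunk covering its range."""

    CHUNK_CAPACITY = 4  # deliberately tiny, for illustration

    def __init__(self):
        self.min_keys = [None]   # min key per chunk; None = unbounded low end
        self.chunks = [{}]       # parallel list of chunk contents

    def _chunk_for(self, key):
        # Rightmost chunk whose min key is <= key (lo=1 skips the None sentinel).
        return bisect.bisect_right(self.min_keys, key, lo=1) - 1

    def put(self, key, value):
        i = self._chunk_for(key)
        self.chunks[i][key] = value
        if len(self.chunks[i]) > self.CHUNK_CAPACITY:
            self._split(i)

    def _split(self, i):
        # Split a full chunk in two, registering the new chunk's min key.
        items = sorted(self.chunks[i].items())
        mid = len(items) // 2
        self.chunks[i] = dict(items[:mid])
        self.min_keys.insert(i + 1, items[mid][0])
        self.chunks.insert(i + 1, dict(items[mid:]))

    def get(self, key):
        return self.chunks[self._chunk_for(key)].get(key)

m = ChunkedMapSketch()
for k in "fcabdehg":
    m.put(k, k.upper())
print(m.get("e"), len(m.chunks))
```

The payoff of this layout, which the benchmarks above reflect, is locality: keys in one range live together in one chunk, so range scans walk contiguous data instead of chasing per-entry object references.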

Improvement on perceived performance September 12, 2018

Improvement on perceived performance

In an effort to improve the Screwdriver user experience, the Screwdriver team identified two major components on the UI that needed improvement with respect to load time — the event pipeline and the build step log. To improve user-perceived performance on those components, we decided to adopt two corresponding UX approaches — pagination and lazy loading.
Event Pipeline
Before our pagination change, when a user visited the pipeline events page (e.g. /pipelines/{id}/events), all events and their builds were fetched from the API and then artificially paginated in the UI. In order to improve the total page load time, it was important to move the burden of pagination from the UI to the API. Now, when a user visits the pipeline events page, only the latest page of events and builds is fetched by default. Clicking the Show More button triggers the fetching of the next page of events and builds. Since there is no further processing of the API data by the UI, the overall load time for fetching a page of events and their corresponding build info is well under a second now, as compared to before, when some pipelines could take more than ten seconds.
Build Step Log
As for the build step log, instead of chronologically fetching pages of completed step logs one page at a time until the entire log is fetched, the log is now fetched in reverse chronological order, and only a reasonable amount of logs is fetched and loaded lazily as the user scrolls up the log console. This change is meant to compensate for builds that generate tens of thousands of lines of logs. Since users had to wait for the entire log to load before they could interact with it, the previous implementation was extremely time-consuming as the size of step logs increased. Now, the first page of a step log takes roughly two seconds or less to load. To put the significance of the change into perspective, consider a step that generates a total of 98743 lines of log: it would have taken 90 seconds to load and almost 10 seconds to fully render on the UI; now it takes less than 2 seconds to load and less than 1 second to render.
Compatibility List
These UI improvements require the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.460
- screwdrivercd/ui: v1.0.327
Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- jithin1987
- minz1027
- tkyi
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Vespa at Zedge - providing personalization content to millions of iOS, Android & web users September 3, 2018

Vespa at Zedge - providing personalization content to millions of iOS, Android & web users

This blog post describes Zedge's use of Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS and Web). Zedge is now using Vespa in production to serve millions of monthly active users. See the architecture below.

What is Zedge?

Zedge's main product is an app - Zedge Ringtones & Wallpapers - that provides wallpapers, ringtones, game recommendations and notification sounds customized for your mobile device. Zedge apps have been downloaded more than 300 million times combined for iOS and Android and are used by millions of people worldwide each month. Zedge is traded on NYSE under the ticker ZDGE. People use Zedge apps for self-expression. Setting a wallpaper or ringtone on your mobile device is in many ways similar to selecting clothes, a hairstyle or other fashion statements. In fact, people try a wallpaper or ringtone much as they would try clothes in a dressing room before making a purchase decision: they try different wallpapers or ringtones before deciding on one they want to keep for a while. The decision to select a wallpaper is not taken lightly, since people interact with and view their mobile device screen (and background wallpaper) a lot - hundreds of times per day.

Why Zedge considered Vespa

Zedge apps - for iOS, Android and Web - depend heavily on search and recommender services to support content discovery. These services had been developed over several years and consisted of multiple subsystems - both internally developed and open source - and technologies for both search and recommender serving. In addition, there were numerous big data processing jobs to build and maintain data for content discovery serving. The time and complexity of improving search and recommender services and the corresponding processing jobs had grown high, so simplification was due.
Vespa seemed like a promising open source technology for Zedge to consider, in particular since it was proven in several ways within Oath (Yahoo):

1. It scales to handle very large systems, e.g.:
   - Flickr, with billions of images, and
   - the Yahoo Gemini Ads Platform, with more than one hundred thousand requests per second to serve ads to 1 billion monthly active users for services such as Techcrunch, Aol, Yahoo!, Tumblr and Huffpost.
2. It runs stably and requires very little operations support - Oath has a few hundred (many of them large) Vespa-based applications that require less than a handful of operations people to run smoothly.
3. It has a rich set of features that Zedge could gain from using:
   - Built-in tensor processing support could simplify calculation and serving of related wallpapers (images) and ringtones/notifications (audio).
   - Built-in support for Tensorflow models could simplify development and deployment of machine-learning-based search and recommender ranking (at that time in development, according to Oath).
   - Search Chains.
4. Help from core developers of Vespa.

The Vespa pilot project

Given the content discovery technology need and the promising characteristics of Vespa, we started a pilot project with a team of software engineers, SREs and data scientists, with the goals of:

1. Learning about Vespa from hands-on development.
2. Creating a realistic proof of concept using Vespa in a Zedge app.
3. Getting initial answers to key questions about Vespa, i.e. enough to decide whether to go for it fully:
   - Which of today's API services can it simplify or replace?
   - What are the (cloud) production costs with Vespa at Zedge's scale? (OPEX)
   - What will maintenance and development look like with Vespa? (future CAPEX)
   - Which new (innovation) opportunities does Vespa open up?

The result of the pilot project was successful: we developed a good proof of concept use of Vespa with one of our Android apps internally and decided to start a project transferring all recommender and search serving to Vespa.
Our impression after the pilot was that the main benefit was making it easier to maintain and develop search/recommender systems, in particular by reducing the amount of code and the complexity of processing jobs.

Autosuggest for search with Vespa

Since autosuggest (for search) required both low latency and high throughput, we decided it was a good candidate to try in production with Vespa first. Configuration-wise it was similar to regular search (from the pilot), but snippet generation (document summary), which requires access to the document store, was superfluous for autosuggest. A good approach for autosuggest was to:

1. Make all document fields searchable with autosuggest of type (in-memory) attribute:
   - https://docs.vespa.ai/documentation/attributes.html
   - https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#attribute
   - https://docs.vespa.ai/documentation/search-definitions.html (basics)
2. Avoid snippet generation and use of the document store by overriding the document-summary setting in the search definition to only access attributes:
   - https://docs.vespa.ai/documentation/document-summaries.html
   - https://docs.vespa.ai/documentation/nativerank.html

The figure above illustrates the autosuggest architecture. When the user starts typing in the search field, we fire a query with the search prefix to the Cloudflare Worker, which in the case of a cache hit returns the result (possible queries) to the client. In the case of a cache miss, the Cloudflare Worker forwards the query to our Vespa instance handling autosuggest. As the external API for autosuggest we use Cloudflare Workers (supporting JavaScript on V8, and later perhaps multiple languages via WebAssembly) to handle API queries from Zedge apps in front of Vespa running in Google Cloud. This setup allows for simple close-to-user caching of autosuggest results.
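The cache-aside pattern the worker applies can be sketched as below. Note this is Python purely for illustration (the actual worker is JavaScript on V8), and `query_backend` is a stand-in for the HTTP call forwarded to Vespa:

```python
def autosuggest(prefix, cache, query_backend):
    """Cache-aside lookup for autosuggest.

    Return cached suggestions on a hit; otherwise ask the backend
    (a stand-in for the Vespa query) and cache the result so later
    requests for the same prefix are served close to the user.
    """
    hit = cache.get(prefix)
    if hit is not None:
        return hit  # cache hit: no backend round trip
    suggestions = query_backend(prefix)
    cache[prefix] = suggestions  # populate cache on miss
    return suggestions
```

In the real deployment the cache lives at the Cloudflare edge, so a hit avoids the round trip to Vespa in Google Cloud entirely.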
Search, Recommenders and Related Content with Vespa

Without going into details, we had several recommender and search services to adapt to Vespa. These services were adapted by writing custom Vespa searchers and, in some cases, search chains:
- https://docs.vespa.ai/documentation/searcher-development.html
- https://docs.vespa.ai/documentation/chained-components.html

The main change compared to our old recommender and related-content services was the degree of dynamicity and freshness of serving: with Vespa, more ranking signals are calculated on the fly using Vespa's tensor support instead of being precalculated and fed into services periodically. Another benefit was that the amount of computational (big data) resources and code for recommender and related-content processing was heavily reduced.

Continuous Integration and Testing with Vespa

A main focus was to enable testing and deployment of Vespa services with continuous integration (see figure below). We found that a combination of Jenkins (or a similar CI product or service) with Docker Compose worked nicely for testing new Vespa applications, their configurations and data (samples) before deploying to the staging cluster with Vespa on Google Cloud. This way we can have a realistic test setup - with Docker Compose - that is close to identical to the production environment (even at hostname level).

Monitoring of Vespa with Prometheus and Grafana

For monitoring we created a tool that continuously reads Vespa metrics, stores them in Prometheus (a time series database) and visualizes them with Grafana. This tool can be found at https://github.com/vespa-engine/vespa_exporter.
More information about Vespa metrics and monitoring:
- https://docs.vespa.ai/documentation/reference/metrics-health-format.html
- https://docs.vespa.ai/documentation/jdisc/metrics.html
- https://docs.vespa.ai/documentation/operations/admin-monitoring.html

Conclusion

The team quickly got up to speed with Vespa thanks to its good documentation and examples, and it has been running like a clock since we started using it for real loads in production. But this was only our first step with Vespa: consolidating existing search and recommender technologies into a more homogeneous and easier-to-maintain form. With Vespa as part of our architecture, we see many possible paths for evolving our search and recommendation capabilities (e.g. machine-learning-based ranking such as integration with Tensorflow and ONNX). Best regards, Zedge Content Discovery Team

Private channel support for Slack notifications August 27, 2018

Private channel support for Slack notifications

In January, we introduced Slack notifications for build statuses in public channels. This week, we are happy to announce that we support Slack notifications for private channels as well!

Usage for a Screwdriver.cd User

Slack notifications can be configured the exact same way as before, but private channels are now supported. First, you must invite the Screwdriver Slack bot (most likely screwdriver-bot), created by your admin, to your Slack channel(s). Then, you must configure your screwdriver.yaml file, which stores all your build settings:

settings:
  slack:
    channels:
      - channel_A # public
      - channel_B # private
    statuses: # statuses to notify on
      - SUCCESS
      - FAILURE
      - ABORTED

statuses denotes the build statuses that trigger a notification. The full list of possible statuses to listen on can be found in our data-schema. If omitted, it defaults to only notifying you when a build returns a FAILURE status. See our previous Slack blog post, the Slack user documentation, and the cluster admin documentation for more information.

Compatibility List

Private channel support for Slack notifications requires the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.451

Contributors

Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

User configurable shell August 22, 2018

User configurable shell

Previously, Screwdriver ran builds in sh. This caused problems for users who have bash syntax in their steps. With launcher v5.0.13 and above, users can run builds in the shell of their choice by setting the environment variable USER_SHELL_BIN. This value can also be a full path such as /bin/bash. Example screwdriver.yaml (can be found in the screwdriver-cd-test/user-shell-example repo):

shared:
  image: node:6

jobs:
  # This job will fail because `source` is not available in sh
  test-sh:
    steps:
      - fail: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]

  # This job will pass because `source` is available in bash
  test-bash:
    # Set USER_SHELL_BIN to bash to run the steps in a bash environment
    environment:
      USER_SHELL_BIN: bash
    steps:
      - pass: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]

Compatibility List

User-configurable shell support requires the following minimum versions of Screwdriver:
- screwdrivercd/launcher: v5.0.13

Contributors

Thanks to the following people for making this feature possible:
- d2lam

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing JSON queries August 8, 2018

Introducing JSON queries

We recently introduced a new addition to the Search API: JSON queries. The search request can now be executed with a POST request that includes the query parameters in its payload. Along with this new query type, we also introduce a new parameter, SELECT, with the sub-parameters WHERE and GROUPING, which is equivalent to YQL.

The new query

With the Search API's newest addition, it is now possible to send queries with HTTP POST. The query parameters have been moved out of the URL and into a POST request body - therefore, no more URL encoding. You also avoid getting all the queries in the log, which can be an advantage. This is how a GET query looks:

GET /search/?param1=value1&param2=value2&...

The general form of the new POST query is:

POST /search/
{
  param1: value1,
  param2: value2,
  ...
}

The dot notation is gone, and the query parameters are now nested under the same key instead. Let's take this query:

GET /search/?yql=select+%2A+from+sources+%2A+where+default+contains+%22bad%22%3B&ranking.queryCache=false&ranking.profile=vespaProfile&ranking.matchPhase.ascending=true&ranking.matchPhase.maxHits=15&ranking.matchPhase.diversity.minGroups=10&presentation.bolding=false&presentation.format=json&nocache=true

and write it in the new POST request format, which looks like this:

POST /search/
{
  "yql": "select * from sources * where default contains \"bad\";",
  "ranking": {
    "queryCache": "false",
    "profile": "vespaProfile",
    "matchPhase": {
      "ascending": "true",
      "maxHits": 15,
      "diversity": {
        "minGroups": 10
      }
    }
  },
  "presentation": {
    "bolding": "false",
    "format": "json"
  },
  "nocache": true
}

With Vespa running (see Quick Start or the Blog Search Tutorial), you can try building POST queries with the new querybuilder GUI at http://localhost:8080/querybuilder/, which can help you build queries with e.g. autocompletion of YQL:

The Select-parameter

The SELECT parameter is used with POST queries and is the JSON equivalent of YQL queries, so the two cannot be used together.
The query parameter will overwrite SELECT and decide the query's query tree.

Where

The SQL-like syntax is gone and the tree syntax has been enhanced. If you are used to the query-parameter syntax, you will feel right at home with this new language. YQL is a regular language and is parsed into a query tree in Vespa. You can now build that tree directly in the WHERE parameter with JSON. Let's look at the YQL select * from sources * where default contains foo and rank(a contains "A", b contains "B");, which will create the following query tree:

You can build the tree above with the WHERE parameter, like this:

{
  "and": [
    { "contains": ["default", "foo"] },
    { "rank": [
      { "contains": ["a", "A"] },
      { "contains": ["b", "B"] }
    ]}
  ]
}

which is equivalent to the YQL.

Grouping

Grouping can now be written in JSON, with structure instead of everything on one line. Instead of parentheses, we now use curly brackets to symbolize the tree structure between the different grouping/aggregation functions, and colons to assign function arguments. A grouping that groups first by year and then by month can be written as:

all(group(time.year(a)) each(output(count()) all(group(time.monthofyear(a)) each(output(count())))))

and equivalently with the new GROUPING parameter:

"grouping": [
  {
    "all": {
      "group": "time.year(a)",
      "each": { "output": "count()" },
      "all": {
        "group": "time.monthofyear(a)",
        "each": { "output": "count()" }
      }
    }
  }
]

Wrapping it up

In this post we have provided a gentle introduction to the new Vespa POST query feature and the SELECT parameter. You can read more about writing POST queries in the Vespa documentation. More examples of POST queries can be found in the Vespa tutorials. Please share your experiences. Happy searching!
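The relationship between the old flat dot-notation GET parameters and the new nested POST body can be sketched with a small helper. This is a hypothetical illustration, not part of Vespa or its client libraries:

```python
def nest_params(flat):
    """Convert flat dot-notation query parameters (old GET style)
    into the nested dict structure of the new POST body.

    Illustrative helper only; shows the shape of the transformation,
    not an actual Vespa API.
    """
    nested = {}
    for key, value in flat.items():
        node = nested
        parts = key.split(".")
        # Walk/create intermediate dicts for each dotted segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested
```

For example, `{"ranking.matchPhase.maxHits": 15}` becomes `{"ranking": {"matchPhase": {"maxHits": 15}}}`, mirroring how the dotted GET parameters above map onto the nested JSON body.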

Introducing Screwdriver Commands for sharing binaries July 30, 2018

Introducing Screwdriver Commands for sharing binaries

Oftentimes, there are small scripts or commands that people use in multiple jobs which are not complex enough to warrant creating a Screwdriver template. Options such as Git repositories, yum packages, or node modules exist, but there was no clear way to share binaries or scripts across multiple jobs. Recently, we released Screwdriver Commands (also known as sd-cmd), which solves this problem by allowing users to easily share binary commands or scripts across multiple containers and jobs.

Using a command

The following is an example of using an sd-cmd. You can configure any commands or scripts in screwdriver.yaml like this:

jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - exec: sd-cmd exec foo/bar@1 -baz sample

Format for using sd-cmd: sd-cmd exec namespace/name@version arguments

- namespace/name - the fully-qualified command name
- version - a semver-compatible format or tag
- arguments - passed directly to the underlying command

In this example, Screwdriver will download the command "foobar.sh", identified by namespace, name, and version, from the Store, and will execute it with the arguments "-baz sample". The actual command will be run as:

$ /opt/sd/commands/foo/bar/1.0.1/foobar.sh -baz sample

Creating a command

Next, this section covers how to publish your own binary commands or scripts. Commands or scripts must be published using a Screwdriver pipeline. The command will then be available in the same Screwdriver cluster.

Writing a command yaml

To create a command, create a repo with an sd-command.yaml file. The file should contain a namespace, name, version, description, maintainer email, format, and a config that depends on the format. Optionally, you can set the usage field, which will replace the default usage shown in the documentation in the UI.

Example sd-command.yaml (binary example):

namespace: foo # Namespace for the command
name: bar # Command name
version: '1.0' # Major and Minor version number (patch is automatic), must be a string
description: |
  Lorem ipsum dolor sit amet.
usage: | # Optional usage field for documentation purposes
  sd-cmd exec foo/bar@
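The namespace/name@version format above can be parsed as sketched below. This is a hypothetical helper for illustration only; it mirrors the documented format, not Screwdriver's actual implementation:

```python
def parse_command_spec(spec):
    """Split a command spec like 'foo/bar@1.0.1' into
    (namespace, name, version).

    Illustrative only; mirrors the namespace/name@version format
    described above, not Screwdriver's own parser.
    """
    full_name, _, version = spec.partition("@")
    namespace, _, name = full_name.partition("/")
    if not (namespace and name and version):
        raise ValueError("expected namespace/name@version, got %r" % spec)
    return namespace, name, version
```

Note that the version part may be a full semver string ("1.0.1"), a major version ("1"), or a tag, per the format description above.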

User teardown steps July 12, 2018

User teardown steps

Users can now specify their own teardown steps in Screwdriver, which will always run regardless of build status. These steps need to be defined at the end of the job and their names must start with teardown-. Note: these steps run in separate shells. As a result, environment variables set by previous steps will not be available. Update 8/22/2018: environment variables set by user steps are now available in teardown steps.

Example screwdriver.yaml:

jobs:
  main:
    image: node:8
    steps:
      - fail: command-does-not-exist
      - teardown-step1: echo hello
      - teardown-step2: echo goodbye
    requires:
      - ~commit
      - ~pr

In this example, the steps teardown-step1 and teardown-step2 will run even though the build fails:

Compatibility List

User teardown support requires the following minimum versions of Screwdriver:
- screwdrivercd/launcher: v4.0.116
- screwdrivercd/screwdriver: v0.5.405

Contributors

Thanks to the following people for making this feature possible:
- d2lam
- tk3fftk (from Yahoo! JAPAN)

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Pipeline API Tokens in Screwdriver July 9, 2018

Pipeline API Tokens in Screwdriver

We have released pipeline-scoped API tokens, which enable your scripts to interact with a specific Screwdriver pipeline. You can use these tokens for fine-grained access control per pipeline, instead of User Access Tokens.

Creating Tokens

If you go to Screwdriver's updated pipeline Secrets page, you can find a list of all your pipeline access tokens, along with options to modify, refresh, or revoke them. At the bottom of the list is a form to generate a new token. Enter a name and an optional description, then click Add. Your new pipeline token value will be displayed at the top of the Access Tokens section, but it will only be displayed once, so make sure you save it somewhere safe! This token provides admin-level access to your specific pipeline, so treat it as you would a password.

Using Tokens to Authenticate

To authenticate with your pipeline's newly-created token, make a GET request to https://${API_URL}/v4/auth/token?api_token=${YOUR_PIPELINE_TOKEN_VALUE}. This returns a JSON object with a token field. The value of this field is a JSON Web Token (JWT), which you can use in an Authorization header to make further requests to the Screwdriver API. The JWT is valid for 2 hours, after which you must re-authenticate.

Example: Starting a Specific Pipeline

You can use a pipeline token much like a user token. Here is a short example written in Python showing how to use a Pipeline API token to start a pipeline. This script calls the Screwdriver API directly.
from os import environ
from requests import get, post

# pipeline_id is assumed to be defined elsewhere (e.g. read from configuration)

# Authenticate with token
auth_request = get('https://api.screwdriver.cd/v4/auth/token?api_token=%s' % environ['SD_KEY'])
jwt = auth_request.json()['token']

# Set headers
headers = {
    'Authorization': 'Bearer %s' % jwt
}

# Get the jobs in the pipeline
jobs_request = get('https://api.screwdriver.cd/v4/pipelines/%s/jobs' % pipeline_id, headers=headers)
jobId = jobs_request.json()[0]['id']

# Start the first job
start_request = post('https://api.screwdriver.cd/v4/builds', headers=headers, data=dict(jobId=jobId))

Compatibility List

For pipeline tokens to work, you will need these minimum versions:
- screwdrivercd/screwdriver: v0.5.389
- screwdrivercd/ui: v1.0.290

Contributors

Thanks to the following people for making this feature possible:
- kumada626 (from Yahoo! JAPAN)
- petey
- s-yoshika (from Yahoo! JAPAN)

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Multibyte Artifact Name Support July 6, 2018

Multibyte Artifact Name Support

A multibyte character is a character composed of a sequence of one or more bytes; such characters are often used in Asian languages (e.g. Japanese, Chinese, Thai). Screwdriver now supports reading artifacts whose names contain multibyte characters.

Example screwdriver.yaml:

jobs:
  main:
    image: node:8
    requires: [ ~pr, ~commit ]
    steps:
      - touch_multibyte_artifact: echo 'foo' > $SD_ARTIFACTS_DIR/日本語ファイル名さんぷる.txt

In this example, we write an artifact named 日本語ファイル名さんぷる, which means "Japanese file name sample". The artifact name includes Kanji, Katakana, and Hiragana, which are multibyte characters. The artifacts of this example pipeline:

The result from clicking the artifact link:

Compatibility List

Multibyte artifact name support requires the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.309

Contributors

Thanks to the following people for making this feature possible:
- minz1027
- sakka2 (from Yahoo! JAPAN)
- Zhongtang

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing Template Namespaces June 29, 2018