
Latest Blogposts

Stories and updates you can see

Deep Dive into Yahoo's Semantic Search Suggestions: From Challenges to Effective Implementation

October 26, 2023

The Pervasive Problem of Semantic Search

In the expansive digital age, where information is not only vast but grows at an exponential rate, the quest for accurate and relevant search results has never been more critical. Within this context, Yahoo Mail, serving millions of users, understood the transformative potential of semantic search. By leveraging the prowess of OpenAI embeddings, we embarked on a journey to provide search results that would understand and match user intent, going beyond the conventional keyword-based approach. And while the results were commendable, they weren't devoid of hurdles:

1. Performance Bottlenecks: The integration of OpenAI embeddings, though powerful, significantly slowed down our search process.
2. User Experience: The new system required users to type extensively, often more than they were used to, leading to potential user dissatisfaction.
3. Habit Change: Introducing a paradigm shift in search behaviors meant we were not just altering algorithms but challenging years of user habits.

Our objective was crystal clear yet daunting: we wanted to augment the semantic search with suggestions that were rapid, economically viable, and seamlessly integrated into the user's natural search behavior.

Approach: Exploration Phase

Enticed by the idea of real-time suggestions via large language models (LLMs), we soon realized the impracticality of such an approach, primarily due to speed constraints. The challenge demanded a solution that operated offline but mirrored the capabilities of real-time systems. Our exploration led us to task the LLM with framing and answering all conceivable questions for every email a user received. While theoretically sound, the financial implications were prohibitive. Moreover, the risk of the LLM generating "hallucinations," or inaccurate results, couldn't be ignored. It was amidst this exploration that a revelatory idea emerged: we were already equipped with a sophisticated extraction pipeline capable of gleaning crucial information from emails, built from a blend of human-curated regex parsing and meticulously fine-tuned AI models. This became the key to powering our search suggestions.

Implementation Challenges: Transitioning from Conceptualization to Real-World Application

1. The Intricacies of Indexing: One of the more pronounced challenges we encountered revolved around over-indexing. Consider a hypothetical yet common scenario: a user intends to search for the term "staples." As they begin typing "sta", an all-encompassing approach to indexing, which takes into account every conceivable keyword, might mistakenly steer the user towards unrelated terms like "statement." Such deviations, although seemingly minor, can significantly hamper the user experience. Recognizing how important it was to keep our suggestions precise and relevant, we chose to handpick and index only a curated set of keywords, ensuring that every suggestion offered was aligned with the user's intent.

2. The Quest for Relevance in Suggestions: Another recurring challenge was ensuring the highest degree of relevance in our suggestions. This becomes particularly pronounced when a user's inbox is populated with multiple items that resemble each other, say multiple flight confirmations. The conundrum we faced was discerning which of these similar items was of immediate interest to the user. Our breakthrough came in the form of an approach centered on the extraction card date. Rather than basing our suggestions on the date the email was received, we shifted our focus to the date of the event described within the email, such as a flight's departure date. This nuanced change enabled us to consistently prioritize the most timely and pertinent result for the user.

3. Embracing Dynamism and Adaptability: When we first conceptualized our approach, our methodology was anchored in generating questions and answers at email delivery time, which were then indexed. As we delved deeper, it became evident that this approach, while robust, was somewhat inflexible and lacked the dynamism that modern search demands. Determined to make the system more adaptable, we built a just-in-time question generation mechanism: the foundational search indexes are still crafted at the point of delivery, but the actual questions are constructed dynamically in real time, tailored to the user's specific query and the prevailing temporal context. This made the system more flexible, improved operational efficiency, and ensured that users always received the most pertinent suggestions.

Implementation

- At delivery time: we extract important information, create cards from the emails, and save them in our backend store.
- Semantic search indexing: fetch or update the extracted cards from the backend database, extract their keywords, and store them in the semantic search index DB.
- Retrieval: when the user searches, we make a server call which in turn finds the best-matching extraction card for the query. That card is then used to generate the suggestions for the semantic search.

Conclusion

Our foray into enhancing search suggestions bore fruit in a remarkably short span of 30 days, even as we navigated the intricacies of a completely new tech stack. The benefits were manifold: an enriched user experience, and 10% of semantic search traffic handled by search suggestions. In the rapidly evolving realm of AI, challenges are omnipresent. However, our journey at Yahoo underscores the potential of lateral thinking and a commitment to user experience. Through our experiences, we hope to galvanize the broader tech community, encouraging them to ideate and implement solutions that are not just effective, but also economically prudent.

Contributors

Kevin Patel (patelkev@yahooinc.com) and Renganathan Dhanogopal (renga@yahooinc.com) - Architecture and Tech Implementation
Josh Jacobson and Sam Bouguerra (sbouguerra@yahooinc.com) - Product

Author

Kevin Patel (patelkev@yahooinc.com) - Director of Engineering, Yahoo
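
To make the extraction-card-date ranking concrete, here is a minimal sketch in Go; the Card type, the scoring factor, and the sample data are illustrative assumptions rather than Yahoo's actual implementation. Given several similar cards, it prefers the one whose described event is closest to now, favoring upcoming events over past ones.

rank_cards.go
package main

import (
	"fmt"
	"time"
)

// Card is a stand-in for an extraction card produced at delivery time.
type Card struct {
	Subject   string
	EventDate time.Time // date of the event described in the email, e.g. a flight's departure
}

// bestCard ranks by the event date on the card rather than the date the email
// was received: the card whose event is nearest to "now" wins, and past events
// are penalized so upcoming ones are preferred.
func bestCard(cards []Card, now time.Time) *Card {
	var best *Card
	var bestScore time.Duration
	for i := range cards {
		d := cards[i].EventDate.Sub(now)
		if d < 0 {
			d = -d * 4 // arbitrary penalty for events already in the past
		}
		if best == nil || d < bestScore {
			best, bestScore = &cards[i], d
		}
	}
	return best
}

func main() {
	now := time.Date(2023, 10, 26, 0, 0, 0, 0, time.UTC)
	cards := []Card{
		{"Flight to SFO (last month)", now.AddDate(0, -1, 0)},
		{"Flight to JFK (tomorrow)", now.AddDate(0, 0, 1)},
		{"Flight to LAX (in three weeks)", now.AddDate(0, 0, 21)},
	}
	fmt.Println(bestCard(cards, now).Subject) // prints the flight leaving tomorrow
}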

Latest updates - March 2023

March 28, 2023

Happy Spring! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features

UI
- UI codebase has been upgraded to use Ember.js 4.4
- Build detail page to display the Template in use
- Links in the event label are now clickable
- PR title shows on PR build page
- Job list to display a build's start & end times on hover

Bug Fixes

UI
- Job list view to handle job display name as expected
- Artifacts with & in name are now loaded properly

API
- Fixed data loss when adding Templates from multiple browser tabs
- Add API endpoints to add or remove one or more pipelines in a collection

Internals
- Fix for Launcher putting invalid characters on log lines

Compatibility List

In order to have these improvements, you will need these minimum versions:
- API - v6.0.9
- UI - v1.0.790
- Store - v5.0.2
- Queue-Service - v3.0.2
- Launcher - v6.0.180
- Build Cluster Worker - v3.0.3

Contributors

Thanks to the following contributors for making these features possible: Alan, Anusha, Haruka, Ibuki, Keisuke, Pritam, Sagar, Yuki, Yuta

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Author

Jithin Emmanuel, Director of Engineering, Yahoo

Latest updates - December 2022

December 30, 2022

Happy Holidays! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features

UI
- Enable deleting disconnected Child Pipelines from the UI. This gives users more awareness and control over SCM URLs that are removed from the child pipelines list.

API
- Cluster admins can configure different bookends for individual build clusters.
- Add more audit logs for Cluster admins to track API usage.

Bug Fixes

UI
- Collections sorting enhancements.
- Create Pipeline flow now displays all Templates properly.

API
- Pipeline badges have been refactored to reduce resource usage.
- Prevent artifact upload errors due to incorrect retry logic.

Queue Service
- Prevent archived jobs from running periodic jobs if cleanup fails at any point.

Internals
- Update golang version to 1.19 across all golang projects.
- Node.js has been upgraded to v18 for Store, Queue Service & Build Cluster Worker.
- Feature flag added to Queue Service to control Redis Table usage to track periodic builds.

Compatibility List

In order to have these improvements, you will need these minimum versions:
- API - v5.0.12
- UI - v1.0.759
- Store - v5.0.2
- Queue-Service - v3.0.0
- Launcher - v6.0.178
- Build Cluster Worker - v3.0.2

Contributors

Thanks to the following contributors for making these features possible: Alan, Anusha, Kevin, Haruka, Ibuki, Masataka, Pritam, Sagar, Tiffany, Yoshiyuki, Yuki, Yuta

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Author

Jithin Emmanuel, Director of Engineering, Yahoo

New bug fixes and features - October 2022

October 31, 2022

Latest Updates - October 2022

Happy Halloween! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- Add sorting on branch and status for Collections
- Able to select timestamp format in user preferences: click on User profile in the upper right corner, select User Settings, select the dropdown for Timestamp Format, pick the preferred format, and click Save
- Soft delete for child pipelines - you still need to ask a Screwdriver admin to remove them completely
- Notify Screwdriver pipeline developers if a pipeline is missing an admin
- Add audit log of operations performed on the Pipeline Options page - Screwdriver admins should see more information in API logs
- API to reset user settings
- Support Redis cluster connection
- Add default event meta in launcher - set event.creator properly
- New gitversion binary with multiple branch support - added homebrew formula and added parameter --merged (to consider only versions on the current branch)

Bug Fixes

UI
- Show error message when unauthorized users change job state
- Job state should be updated properly for delayed API response
- Gray out the Restart button for jobs that are disabled
- Modify toggle text to work in both directions
- Display full pipeline name in Collections
- Allow reset of Pipeline alias
- Remove default pipeline alias name
- Add tooltip for build history in Collections

API
- Admins can sync on any pipeline
- Refactor unzipArtifactsEnabled configuration
- Check permissions before running startAll on child pipelines
- ID schema for pipeline get latestBuild

Internals

Models
- Refactor syncStages to fail early
- Pull Request sync only returns PRs relevant to the pipeline
- Add more logs to stage creation

Data-schema
- Display JobNameLength in user settings
- Remove old unique constraint for stages table

SCM GitHub
- Get open pull requests - override the default limit (30) to return up to 100
- Change wget to curl for downloading sd-repo
- Builds cannot be started if a pipeline has more than 5 invalid admins

Coverage-sonar
- Use correct job name for PR with job scope

Queue-Service
- Remove laabr

Launcher
- Update Github link for grep
- Update build status if SIGTERM is received - build status will be updated to Failure when soft evicted, so buildCluster-queue-worker can send a delete request to clean up the build pod

Compatibility List

In order to have these improvements, you will need these minimum versions:
- API - v4.1.297
- UI - v1.0.732
- Store - v4.2.5
- Queue-Service - v2.0.42
- Launcher - v6.0.171
- Build Cluster Worker - v2.24.3

Contributors

Thanks to the following contributors for making these features possible: Alan, Anusha, Kevin, Haruka, Ibuki, Masataka, Pritam, Sagar, Sheridan, Shota, Tiffany, Yoshiyuki, Yuki, Yuta

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Author

Tiffany Kyi, Sr Software Dev Engineer, Yahoo

Open Sourcing Subdomain Sleuth

October 21, 2022

Subdomain Sleuth is a new open source project built by the Yahoo DNS team, designed to help you defend your infrastructure against subdomain takeover attacks. This type of attack is especially dangerous because it enables phishing attacks and cookie theft. Subdomain Sleuth reads your zone files, identifies multiple types of possible takeovers, and generates a report of the dangerous records. If you work with DNS or security, I encourage you to keep reading.

A subdomain takeover is when an attacker is able to take control of the target of an existing DNS record. This is normally the result of what is called a "dangling record", which is a record that points to something that doesn't exist. That could be a broken CNAME or a bad NS record. It could also be a reference to a service that resolves but that you don't manage. In either case, a successful takeover can allow the attacker to serve any content they want under that name. The surface area for these attacks grows proportionally to the adoption of cloud and other managed services.

Let's consider an example. One of your teams creates an exciting new app called groundhog, with the web site at groundhog.example.com. The content for the site is hosted in a public AWS S3 bucket, and groundhog.example.com is a CNAME to the bucket name. Now the product gets rebranded, and the team creates all new web site content. The old S3 bucket gets deleted, but nobody remembers to remove the CNAME. If an attacker finds it, they can register the old bucket name in their account and host their own content under groundhog.example.com. They could then launch a phishing campaign against the users, using the original product name.

We've always had some subdomain takeover reports come through our Bug Bounty program. We couldn't find many tools intended for defenders - most were built for either security researchers or attackers, focused on crawling web sites or other data sources for hostnames to check, or focused on specific cloud providers. We asked ourselves "how hard could it be to automatically detect these?". That question ultimately led to Subdomain Sleuth.

Subdomain Sleuth reads your zone files and performs a series of checks against each individual record. It can handle large zone files with hundreds of thousands of records, as well as tens of thousands of individual zones. We regularly scan several million records at a time. The scan produces a JSON report, which includes the name of each failed record, the target resource, which check it failed, and a description of the failure.

We currently support three different check types. The CNAME check looks for broken CNAMEs. CNAMEs can be chained together, so the check will identify a break at any CNAME in the chain. The NS check looks for bad delegations where the server doesn't exist, isn't reachable, or doesn't answer for the particular zone that was delegated. The HTTP check looks for references to known external resources that could be claimed by an attacker. It does this by sending an HTTP request and looking for known signatures of unclaimed resources. For example, if it sees a CNAME that points to an AWS S3 bucket, it will send an HTTP request to the name. If the response contains "no such bucket", it is a target for an attacker.

Subdomain Sleuth is easy to use. All you need is a recent Go compiler and a copy of your zone files. The extra utilities require a Python 3 interpreter. The README contains details about how to build the tools and examples of how to use them.
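
The CNAME check described above comes down to asking whether a record's target still resolves. Below is a minimal sketch of that idea in Go, the language Subdomain Sleuth itself is written in; the function name and the example target are hypothetical, and this is not the project's actual code, which also handles chained CNAMEs, NS delegations, and HTTP fingerprints.

check_cname.go
package main

import (
	"errors"
	"fmt"
	"net"
)

// cnameTargetDangles reports whether target (the right-hand side of a CNAME
// record read from a zone file) gets an authoritative "no such host" answer.
// Transient resolver errors are deliberately not treated as dangling.
func cnameTargetDangles(target string) bool {
	_, err := net.LookupHost(target)
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		return dnsErr.IsNotFound
	}
	return false
}

func main() {
	// Hypothetical CNAME target whose destination host has been decommissioned.
	target := "retired-app.example.net"
	if cnameTargetDangles(target) {
		fmt.Printf("possible dangling CNAME target: %s\n", target)
	}
}
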
If you’re interested in contributing to the project, we’d love to hear from you. We’re always open to detecting new variations of subdomain takeovers, whether by new checks or new HTTP fingerprints. If you participate in a bug bounty program, we’d especially love to have you feeding your findings back to the project. We’re also open to improvements in the core code, whether it’s bug fixes, unit tests, or efficiency improvements. We would also welcome improvements to the supporting tools. We hope that you take a few minutes to give the tools a try. The increase in cloud-based services calls for more vigilance than ever. Together we can put an end to subdomain takeovers. https://github.com/yahoo/SubdomainSleuth

Moving from Mantle to Swift for JSON Parsing

October 10, 2022

We recently converted one of our internal libraries from all Objective-C to all Swift. Along the way, we refactored how we parse JSON, moving from using the third-party Mantle library to the native JSON decoding built into the Swift language and standard library. In this post, I'll talk about the motivation for converting, the similarities and differences between the two tools, and challenges we faced, including: - Handling nested JSON objects - Dealing with JSON objects of unknown types - Performing an incremental conversion - Continuing to support Objective-C users - Dealing with failures Introduction Swift is Apple's modern programming language for building applications on all of their platforms. Introduced in June 2014, it succeeds Objective-C, an object-oriented superset of the C language from the early 80's. The design goals for Swift were similar to a new crop of modern languages, such as Rust and Go, that provide a safer way to build applications, where the compiler plays a larger role in enforcing correct usage of types, memory access, collections, nil pointers, and more. At Yahoo, adoption of Swift started slow, judiciously waiting for the language to mature. But in the last few years, Swift has become the primary language for new code across the company. This is important not only for the safety reasons mentioned, but also for a better developer experience. Many that started developing for iOS after 2014 have been using primarily Swift, and it's important to offer employees modern languages and codebases to work in. In addition to new code, the mobile org has been converting existing code when possible, both in apps and SDK's. One recent migration was the MultiplexStream SDK. MultiplexStream is an internal library that fetches, caches, and merges streams of content. There is a subspec of the library specialized to fetch streams of Yahoo news articles and convert the returned JSON to data models. During a Swift conversion, we try to avoid any refactoring or re-architecting, and instead aim for a line-for-line port. Even a one-to-one translation can introduce new bugs, and adding a refactor at the same time is risky. But sometimes rewriting can be unavoidable. JSON Encoding and Decoding The Swift language and its standard library have evolved to add features that are practical for application developers. One addition is native JSON encoding and decoding support. Creating types that can be automatically encoded and decoded from JSON is a huge productivity boost. Previously, developers would either manually parse JSON or use a third-party library to help reduce the tedious work of unpacking values, checking types, and setting the values on native object properties.Mantle MultiplexStream relied on the third-party Mantle SDK to help with parsing JSON to native data model objects. And Mantle is great -- it has worked well in a number of Yahoo apps for a long time. However, Mantle relies heavily on the dynamic features of the Objective-C language and runtime, which are not always available in Swift, and can run counter to the static, safe, and strongly-typed philosophy of Swift. In Objective-C, objects can be dynamically cast and coerced from one type to another. In Swift, the compiler enforces strict type checking and type inference, making such casts impossible. In Objective-C, methods can be called on objects at runtime whether they actually respond to them or not. In Swift, the compiler ensures that types will implement methods being called. 
In Objective-C, collections, such as Arrays and Dictionaries, can hold any type of object. In Swift, collections are homogeneous and the compiler guarantees they will only hold values of a pre-declared type. For example, in Objective-C, every object has a -(id)getValueForKey:(NSString*)key method that, given a string matching a property name of the object, returns the value for the property from the instance. But two things can go wrong here: 1. The string may not reference an actual property of the object. This crashes at runtime. 2. Notice the id return type. This is the generic "could be anything" placeholder. The caller must cast the id to what they expect it to be. But if you expect it to be a string, yet somehow it is a number instead, calling string methods on the number will crash at runtime. Similarly, every Objective-C object has a -(void)setValue:(id)value, forKey:(NSString*)key method that, again, takes a string property name and an object of any type. But use the wrong string or wrong value type and, again, boom. Mantle uses these dynamic Objective-C features to support decoding from JSON payloads, essentially saying, "provide me with the string keys you expect to see in your JSON, and I'll call setValueForKey on your objects for each value in the JSON." Whether it is the type you are expecting is another story. Back-end systems work hard to fulfill their API contracts, but it isn't unheard of in a JSON object to receive a string instead of a float. Or to omit keys you expected to be present. Swift wanted to avoid these sorts of problems. Instead, the deserialization code is synthesized by the compiler at compile time, using language features to ensure safety. Nested JSON Types Our primary data model object, Article, represents a news article. Its API includes all the things you might expect, such as: Public interface: class Article { var id: String var headline: String var author: String var imageURL: String } The reality is that these values come from various objects deeply nested in the JSON object structure. JSON: { "id": "1234", "content": { "headline":"Apple Introduces Swift Language", "author": { "name":"John Appleseed", "imageURL":"..." }, "image": { "url":"www..." } } } In Mantle, you would supply a dictionary of keypaths that map JSON names to property names: { "id":"id", "headline":"content.headline", "author":"content.author.name", "imageURL":"content.image.url" } In Swift, you have multiple objects that match 1:1 the JSON payload: class Article: Codable { var id: String var content: Content } class Content: Codable { var headline: String var author: Author var image: Image } class Author: Codable { var name: String var imageURL: String } class Image: Codable { var url: String } We wanted to keep the Article interface the same, so we provide computed properties to surface the same API and handle the traversal of the object graph: class Article { var id: String private var content: Content var headline: String { content.headline } var author: String { content.author.name } var imageURL: String { content.image.url } } This approach increases the number of types you create, but gives a clearer view of what the entities look like on the server. But for the client, the end result is the same: Values are easy to access on the object, abstracting away the underlying data structure. JSON Objects of Unknown Type In a perfect world, we know up front the keys and corresponding types of every value we might receive from the server. However, this is not always the case. 
In Mantle, we can specify a property to be of type NSDictionary and call it a day. We could receive a dictionary of [String:String], [String:NSNumber], or even [String: NSDictionary]. Using Swift’s JSON decoding, the types need to be specified up front. If we say we expect a Dictionary, we need to specify "a dictionary of what types?" Others have faced this problem, and one of the solutions that has emerged in the Swift community is to create a type that can represent any type of JSON value. Your first thought might be to write a Dictionary of [String:Any]. But for a Dictionary to be Codable, its keys and values must also be Codable. Any is not Codable: it could be a UIView, which clearly can't be decoded from JSON. So instead we want to say, “we expect any type that is itself Codable.” Unfortunately there is no AnyCodable type in Swift. But we can write our own! There are a finite number of types the server can send as JSON values. What is good for representing finite choices in Swift? Enums. Let’s model those cases first: enum AnyDecodable { case int case float case bool case string case array case dictionary case none } So we can say we expect a Dictionary of String: AnyDecodable. The enum case will describe the type that was in the field. But what is the actual value? Enums in Swift can have associated values! So now our enum becomes: enum AnyDecodable { case int(Int) case float(Float) case bool(Bool) case string(String) case array([AnyDecodable]) case dictionary([String:AnyDecodable]) case none } We're almost done. Just because we have described what we would like to see, doesn't mean the system can just make it happen. We're outside the realm of automatic synthesis here. We need to implement the manual encode/decode functions so that when the JSONDecoder encounters a type we've said to be AnyDecodable, it can call the encode or decode method on the type, passing in what is essentially the untyped raw data: extension AnyDecodable: Codable { init(from decoder: Decoder) throws { let container = try decoder.singleValueContainer() if let int = try? container.decode(Int.self) { self = .int(int) } else if let string = try? container.decode(String.self) { self = .string(string) } else if let bool = try? container.decode(Bool.self) { self = .bool(bool) } else if let float = try? container.decode(Float.self) { self = .float(float) } else if let array = try? container.decode([AnyDecodable].self) { self = .array(array) } else if let dict = try? container.decode([String:AnyDecodable].self) { self = .dictionary(dict) } else { self = .none } } func encode(to encoder: Encoder) throws { var container = encoder.singleValueContainer() switch self { case .int(let int): try container.encode(int) case .float(let float): try container.encode(float) case .bool(let bool): try container.encode(bool) case .string(let string): try container.encode(string) case .array(let array): try container.encode(array) case .dictionary(let dictionary): try container.encode(dictionary) case .none: try container.encodeNil() } } } We've implemented functions that, at runtime, can deal with a value of unknown type, test to find out what type it actually is, and then associate it into an instance of our AnyDecodable type, including the actual value. We can now create a Codable type such as: struct Article: Codable { var headline: String var sportsMetadata: AnyDecodable } In our use case, as a general purpose SDK, we don't know much about sportsMetadata. 
It is a part of the payload defined between the Sports app and their editorial staff. When the Sports app wants to use the sportsMetadata property, they must switch over it and unwrap the associated value. So if they expect it to be a String: switch article.metadata { case .string(let str): label.text = str default: break } Or using "if case let" syntax: if case let AnyDecodable.string(str) = article.metadata { label.text = str } Incremental Conversion During conversion it was important to migrate incrementally. Pull requests should be fairly small, tests should continue to run and pass, build systems should continue to verify building on all supported platforms in various configurations. We identified the tree structure of the SDK and began converting the leaf nodes first, usually converting a class or two at a time. But for the data models, converting the leaf nodes from using Mantle to Codable was not possible. You cannot easily mix the two worlds: specifying a root object as Mantle means all of the leaves need to use Mantle also. Likewise for Codable objects. Instead, we created a parallel set of Codable models with an _Swift suffix, and as we added them, we also added unit tests to verify our work in progress. Once we finished creating a parallel set of objects, we deleted the old objects and removed the Swift suffix from the new. Because the public API remained the same, the old tests didn’t need to change. Bridging Some Swift types cannot be represented in Objective-C: @objcMembers class Article: NSObject { ... var readTime: Int? } Bridging the Int to Obj-C results in a value type of NSInteger. But optionality is expressed in Objective-C with nil pointers, and only NSObjects, as reference types, have pointers. So the existing Objective-C API might look like this: @property (nonatomic, nullable, strong) NSNumber *readTime; Since we can't write var readTime: Int?, and NSNumber isn't Codable, we can instead write a computed property to keep the same API: @objcMembers class Article: NSObject { private var _readTime: Int? public var readTime: NSNumber? { if let time = _readTime { return NSNumber(integerLiteral: time) } else { return nil } } } Lastly, we need to let the compiler know to map our private _readTime variable to the readTime key in the JSON dictionary. We achieve this using CodingKeys: @objcMembers class Article: NSObject { private var _readTime: Int? public var readTime: NSNumber? { if let time = _readTime { return NSNumber(integerLiteral: time) } else { return nil } } enum CodingKeys: String, CodingKey { case _readTime = "readTime" ... } } Failures Swift's relentless focus on safety means there is no room for error. An article struct defined as having a non-optional headline must have one. And if one out of 100 articles in a JSON response is missing a headline, the entire parsing operation will fail. People may think (myself included), "just omit the one article that failed." But there are cases where the integrity of the data falls apart if it is incomplete. A bank account payload that states a balance of $100, yet the list of transactions sums to $99 because we skipped one that didn't have a location field, would be a bad experience. The solution here is to mark fields that may or may not be present as optional. It can lead to messier code, with users constantly unwrapping values, but it better reflects the reality that fields can be missing. 
If a type declares an article identifier to be an integer, and the server sends a String instead, the parsing operation will throw an error. Swift will not do implicit type conversion. The good news is that these failures do not crash, but instead throw (and provide excellent error diagnostics about what went wrong).

Conclusion

A conversion like this really illustrates some of the fundamental differences between Objective-C and Swift. While some things may appear to be easier in Objective-C, such as dealing with unknown JSON types, the cost is in sharp edges that can cut in production. I do not mind paying a bit more at development time to save in the long run. The unit tests around our model objects were a tremendous help. Because we kept the same API, once the conversion was complete, they verified everything worked as before. These tests used static JSON files of server responses and validated our objects contained correct values. The Swift version of MultiplexStream shipped in the Yahoo News app in April 2022. So far, no one has noticed (which was the goal). But hopefully the next developer that goes in to work on MultiplexStream will.

Resources

Apple Article on Encoding and Decoding Custom Types
Apple Migration Doc
Obj-C to Swift Interop
Swift to Obj-C Interop

Author

Jason Howlin, Senior Software Mobile Apps Engineer

New bug fixes and features - August 2022

August 30, 2022

Latest Updates - August 2022

The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- Collections supports sorting by: last time a job was run in a pipeline, or build history based on number of failed events/jobs. To sort by one of these fields, click the up/down caret to the right of the field names.
- Collections supports displaying a human-readable alias for a Pipeline (in List view). To set the alias for a pipeline, go to your pipeline Options tab. Under Pipeline Preferences, type the alias in the Rename pipeline field and hit enter. Go to your Collections dashboard to see the new alias.
- Screwdriver Admins can perform Sync on any pipeline from the pipeline options UI
- If there is no pipeline admin, periodic build jobs will not run and Screwdriver will notify (if Slack or email notifications are configured)
- Pull Request Comments are now supported from individual PR jobs
- Support for self-hosted SonarQube for individual Pipelines

Meta CLI
- Meta CLI can now be installed as a homebrew formula
- Allow shebang lua commands to have parameters with dashes in them

Updates
- User preference to display job name length has now been moved under User Settings, so you can configure your preference globally for all pipelines. Click on your username in the top right corner to show the dropdown and select User Settings (alternatively, navigate directly to https://YOUR_URL/user-settings/preferences). Under the User Preferences tab, click the arrows or type to adjust the preferred Display Name Length.

Bug Fixes

API
- Pull Request jobs added via a pull request should work
- Prevent disabled Pull Request jobs from executing
- Prevent API crash for Pipelines with a large number of Pull Requests

queue-service
- Prevent periodic jobs getting dropped due to API connection instabilities and improve error handling

UI
- Job states show up in the workflow graph even for PRs
- Build not found redirects to intended pipeline page
- Improve the description of the parameter
- More consistent restart method when using list view
- Display message when manually executing jobs for non-latest events
- Emphasize non-latest sha warning when manually executing jobs
- Use openresty as base image for M1 use
- Show error message when unauthorized users change job state
- Gray out restart button for jobs that are disabled
- Modify toggle text to work in both directions
- Collections and pipeline options improvements

Launcher
- Add SD_STEP_NAME env variable

Internals

sd-cmd
- Create command binary atomically
- Add configuration to README.md, local configuration improvements
- Fix sd-cmd not to slurp all input

buildcluster-queue-worker
- Upgrade amqplib from 0.8.0 to 0.10.0

Compatibility List

In order to have these improvements, you will need these minimum versions:
- API - v4.1.282
- UI - v1.0.718
- Store - v4.2.5
- Queue-Service - v2.0.40
- Launcher - v6.0.165
- Build Cluster Worker - v2.24.3

Contributors

Thanks to the following contributors for making these features possible: Alan, Anusha, Haruka, Ibuki, Jacob, Jithin, Kazuyuki, Keisuke, Kenta, Kevin, Naoaki, Pritam, Sagar, Sheridan, Tatsuya, Tiffany, Yoshiyuki, Yuichi, Yuki, Yuta

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Author

Alan Dong, Sr Software Dev Engineer, Yahoo

Writing Lua scripts with meta

August 24, 2022

Sheridan Rawlins, Architect, Yahoo

Summary

In any file ending in .lua with the executable bit set (chmod a+x), putting a "shebang" line like the following lets you run it, and even pass arguments to the script that won't be swallowed by meta.

hello-world.lua
#!/usr/bin/env meta
print("hello world")

Screwdriver's meta tool is provided to every job, regardless of which image you choose. This means that you can write Screwdriver commands or helper scripts as Lua programs. It was inspired by (but unrelated to) etcd's bolt, as meta is a key-value store of sorts, and its boltcli, which also provides a lua runner that interfaces with bolt.

Example script or sd-cmd

run.lua
#!/usr/bin/env meta
meta.set("a-plain-string-key", "somevalue")
meta.set("a-key-for-json-value", { name = "thename", num = 123, array = { "foo", "bar", "baz" } })

What is included?

1. A Lua 5.1 interpreter written in go (gopher-lua)
2. meta CLI commands are exposed as methods on the meta object:
   - meta get: local foo_value = meta.get('foo')
   - meta set: meta.set('key', 'value') for a plain string, meta.set('key', 123) for a json number, meta.set('key', { 'foo', 'bar', 'baz' }) for a json array, meta.set('key', { foo = 'bar', bar = 'baz' }) for a json map
   - meta dump: local entire_meta_tree = meta.dump()
3. Libraries (aka "modules") included in gopher-lua-libs - while there are many to choose from here, some highlights include:
   - argparse - when writing scripts, this is a nice CLI parser inspired from the python one.
   - Encoding modules: json, yaml, and base64 allow you to decode or encode values as needed.
   - String helper modules: strings, and shellescape
   - http client - helpful if you want to use the Screwdriver REST API, possibly using os.getenv with the environment vars provided by Screwdriver - SD_API_URL, SD_TOKEN, SD_BUILD_ID can be very useful.
   - plugin - an advanced technique for parallelism by firing up several "workers" or "threads" as "goroutines" under the hood and communicating via go channels. More than likely overkill for normal use-cases, but it may come in handy, such as fetching all artifacts from another job by reading its manifest.txt and fetching in parallel.

Why is this interesting/useful?

meta is atomic

When invoked, meta obtains an advisory lock via flock.
However, if you wanted to update a value from the shell, you might perform two commands and lose the atomicity: # Note, to treat the value as an integer rather than string, use -j to indicate json declare -i foo_count="$(meta get -j foo_count)" meta set -j foo_count "$((++foo_count))" While uncommon, if you write builds that do several things in parallel (perhaps a Makefile run with make -j $(nproc)), making such an update in parallel could hit race conditions between the get and set. Instead, consider this script (or sd-cmd) increment-key.lua #!/usr/bin/env meta local argparse = require 'argparse' local parser = argparse(arg[0], 'increment the value of a key') parser:argument('key', 'The key to increment') local args = parser:parse() local value = tonumber(meta.get(args.key)) or 0 value = value + 1 meta.set(args.key, value) print(value) Which can be run like so, and will be atomic ./increment-key.lua foo 1 ./increment-key.lua foo 2 ./increment-key.lua foo 3 meta is provided to every job The meta tool is made available to all builds, regardless of the image your build chooses - including minimal jobs intended for fanning in several jobs to a single one for further pipeline job-dependency graphs (i.e. screwdrivercd/noop-container) Screwdrivers commands can help share common tasks between jobs within an organization. When commands are written in bash, then any callouts it makes such as jq must either exist on the images or be installed by the sd-cmd. While writing in meta’s lua is not completely immune to needing “other things”, at least it has proper http and json support for making and interpreting REST calls. running “inside” meta can workaround system limits Occasionally, if the data you put into meta gets very large, you may encounter Limits on size of arguments and environment, which comes from UNIX systems when invoking executables. Imagine, for instance, wanting to put a file value into meta (NOTE: this is not a recommendation to put large things in meta, but, on the occasions where you need to, it can be supported). Say I have a file foobar.txt and want to put it into some-key. This code: foobar="$(< foobar.txt)" meta set some-key "$foobar" May fail to invoke meta at all if the args get too big. 
If, instead, the contents are passed over redirection rather than an argument, this limit can be avoided: load-file.lua #!/usr/bin/env meta local argparse = require 'argparse' local parser = argparse(arg[0], 'load json from a file') parser:argument('key', 'The key to put the json in') parser:argument('filename', 'The filename') local args = parser:parse() local f, err = io.open(args.filename, 'r') assert(not err, err) local value = f:read("*a") -- Meta set the key to the contents of the file meta.set(args.key, value) May be invoked with either the filename or, if the data is in memory with the named stdin device # Direct from the file ./load-file.lua some-key foobar.txt # If in memory using "Here String" (https://www.gnu.org/software/bash/manual/bash.html#Here-Strings) foobar="$(< foobar.txt)" ./load-file.lua some-key /dev/stdin <<<"$foobar" Additional examples Using http module to obtain the parent id get-parent-build-id.lua #!/usr/bin/env meta local http = require 'http' local json = require 'json' SD_BUILD_ID = os.getenv('SD_BUILD_ID') or error('SD_BUILD_ID environment variable is required') SD_TOKEN = os.getenv('SD_TOKEN') or error('SD_TOKEN environment variable is required') SD_API_URL = os.getenv('SD_API_URL') or error('SD_API_URL environment variable is required') local client = http.client({ headers = { Authorization = "Bearer " .. SD_TOKEN } }) local url = string.format("%sbuilds/%d", SD_API_URL, SD_BUILD_ID) print(string.format("fetching buildInfo from %s", url)) local response, err = client:do_request(http.request("GET", url)) assert(not err, err) assert(response.code == 200, "error code not ok " .. response.code) local buildInfo = json.decode(response.body) print(tonumber(buildInfo.parentBuildId) or 0) Invocation examples: # From a job that is triggered from another job declare -i parent_build_id="$(./get-parent-build-id.lua)" echo "$parent_build_id" 48242862 # From a job that is not triggered by another job declare -i parent_build_id="$(./get-parent-build-id.lua)" echo "$parent_build_id" 0 Larger example to pull down manifests from triggering job in parallel This advanced script creates 3 argparse “commands” (manifest, copy, and parent-id) to help copying manifest files from parent job (the job that triggers this one). it demonstrates advanced argparse features, http client, and the plugin module to create a “boss + workers” pattern for parallel fetches: - Multiple workers fetch individual files requested by a work channel - The “boss” (main thread) filters relevent files from the manifest which it sends down the work channel - The “boss” closes the work channel, then waits for all workers to complete tasks (note that a channel will still deliver any elements before a receive() call reports not ok This improves throughput considerably when fetching many files - from a worst case of the sum of all download times with one at a time, to a best case of just the maximum download time when all are done in parallel and network bandwidth is sufficient. 
manifest.lua #!/usr/bin/env meta -- Imports argparse = require 'argparse' plugin = require 'plugin' http = require 'http' json = require 'json' log = require 'log' strings = require 'strings' filepath = require 'filepath' goos = require 'goos' -- Parse the request parser = argparse(arg[0], 'Artifact operations such as fetching manifest or artifacts from another build') parser:option('-l --loglevel', 'Set the loglevel', 'info') parser:option('-b --build-id', 'Build ID') manifestCommand = parser:command('manifest', 'fetch the manifest') manifestCommand:option('-k --key', 'The key to set information in') copyCommand = parser:command('copy', 'Copy from and to') copyCommand:option('-p --parallelism', 'Parallelism when copying multiple artifacts', 4) copyCommand:flag('-d --dir') copyCommand:argument('source', 'Source file') copyCommand:argument('dest', 'Destination file') parentIdCommand = parser:command("parent-id", "Print the parent-id of this build") args = parser:parse() -- Setup logs is shared with workers when parallelizing fetches function setupLogs(args) -- Setup logs log.debug = log.new('STDERR') log.debug:set_prefix("[DEBUG] ") log.debug:set_flags { date = true } log.info = log.new('STDERR') log.info:set_prefix("[INFO] ") log.info:set_flags { date = true } -- TODO(scr): improve log library to deal with levels if args.loglevel == 'info' then log.debug:set_output('/dev/null') elseif args.loglevel == 'warning' or args.loglevel == 'warning' then log.debug:set_output('/dev/null') log.info:set_output('/dev/null') end end setupLogs(args) -- Globals from env function setupGlobals() SD_API_URL = os.getenv('SD_API_URL') assert(SD_API_URL, 'missing SD_API_URL') SD_TOKEN = os.getenv('SD_TOKEN') assert(SD_TOKEN, 'missing SD_TOKEN') client = http.client({ headers = { Authorization = "Bearer " .. SD_TOKEN } }) end setupGlobals() -- Functions -- getBuildInfo gets the build info json object from the buildId function getBuildInfo(buildId) if not buildInfo then local url = string.format("%sbuilds/%d", SD_API_URL, buildId) log.debug:printf("fetching buildInfo from %s", url) local response, err = client:do_request(http.request("GET", url)) assert(not err, err) assert(response.code == 200, "error code not ok " .. 
response.code) buildInfo = json.decode(response.body) end return buildInfo end -- getParentBuildId gets the parent build ID from this build’s info function getParentBuildId(buildId) local parentBuildId = getBuildInfo(buildId).parentBuildId assert(parentBuildId, string.format("could not get parendId for %d", buildId)) return parentBuildId end -- getArtifact gets and returns the requested artifact function getArtifact(buildId, artifact) local url = string.format("%sbuilds/%d/artifacts/%s", SD_API_URL, buildId, artifact) log.debug:printf("fetching artifact from %s", url) local response, err = client:do_request(http.request("GET", url)) assert(not err, err) assert(response.code == 200, string.format("error code not ok %d for url %s", response.code, url)) return response.body end -- getManifestLines returns an iterator for the lines of the manifest and strips off leading ./ function getManifestLines(buildId) return coroutine.wrap(function() local manifest = getArtifact(buildId, 'manifest.txt') local manifest_lines = strings.split(manifest, '\n') for _, line in ipairs(manifest_lines) do line = strings.trim_prefix(line, './') if line ~= '' then coroutine.yield(line) end end end) end -- fetchArtifact fetches the artifact "source" and writes to a local file "dest" function fetchArtifact(buildId, source, dest) log.info:printf("Copying %s to %s", source, dest) local sourceContent = getArtifact(buildId, source) local dest_file = io.open(dest, 'w') dest_file:write(sourceContent) dest_file:close() end -- fetchArtifactDirectory fetches all the artifacts matching "source" from the manifest and writes to a folder "dest" function fetchArtifactDirectory(buildId, source, dest) -- Fire up workers to run fetches in parallel local work_body = [[ http = require 'http' json = require 'json' log = require 'log' strings = require 'strings' filepath = require 'filepath' goos = require 'goos' local args, workCh setupLogs, setupGlobals, fetchArtifact, getArtifact, args, workCh = unpack(arg) setupLogs(args) setupGlobals() log.debug:printf("Starting work %p", _G) local ok, work = workCh:receive() while ok do log.debug:print(table.concat(work, ' ')) fetchArtifact(unpack(work)) ok, work = workCh:receive() end log.debug:printf("No more work %p", _G) ]] local workCh = channel.make(tonumber(args.parallelism)) local workers = {} for i = 1, tonumber(args.parallelism) do local worker_plugin = plugin.do_string(work_body, setupLogs, setupGlobals, fetchArtifact, getArtifact, args, workCh) local err = worker_plugin:run() assert(not err, err) table.insert(workers, worker_plugin) end -- Send workers work to do log.info:printf("Copying directory %s to %s", source, dest) local source_prefix = strings.trim_suffix(source, filepath.separator()) .. filepath.separator() for line in getManifestLines(buildId) do log.debug:print(line, source_prefix) if source == '.' 
or source == '' or strings.has_prefix(line, source_prefix) then local dest_dir = filepath.join(dest, filepath.dir(line)) goos.mkdir_all(dest_dir) workCh:send { buildId, line, filepath.join(dest, line) } end end -- Close the work channel to signal workers to exit log.debug:print('Closing workCh') err = workCh:close() assert(not err, err) -- Wait for workers to exit log.debug:print('Waiting for workers to finish') for _, worker in ipairs(workers) do local err = worker:wait() assert(not err, err) end log.info:printf("Done copying directory %s to %s", source, dest) end -- Normalize/help the buildId by getting the parent build id as a convenience if not args.build_id then SD_BUILD_ID = os.getenv('SD_BUILD_ID') assert(SD_BUILD_ID, 'missing SD_BUILD_ID') args.build_id = getParentBuildId(SD_BUILD_ID) end -- Handle the command if args.manifest then local value = {} for line in getManifestLines(args.build_id) do table.insert(value, line) if not args.key then print(line) end end if args.key then meta.set(args.key, value) end elseif args.copy then if args.dir then fetchArtifactDirectory(args.build_id, args.source, args.dest) else fetchArtifact(args.build_id, args.source, args.dest) end elseif args['parent-id'] then print(getParentBuildId(args.build_id)) end Testing In order to test this, bats testing system was used to invoke manifest.lua with various arguments and the return code, output, and side-effects checked. For unit tests, an http server was fired up to serve static files in a testdata directory, and manifest.lua was actually invoked within this test.lua file so that the http server and the manifest.lua were run in two separate threads (via the plugin module) but the same process (to avoid being blocked by meta’s locking mechanism, if run in two processes) test.lua #!/usr/bin/env meta -- Because Meta locks, run the webserver as a plugin in the same process, then invoke the actual file under test. 
local plugin = require 'plugin' local filepath = require 'filepath' local argparse = require 'argparse' local http = require 'http' local parser = argparse(arg[0], 'Test runner that serves http test server') parser:option('-d --dir', 'Dir to serve', filepath.join(filepath.dir(arg[0]), "testdata")) parser:option('-a --addr', 'Address to serve on', "localhost:2113") parser:argument('rest', "Rest of the args") :args '*' local args = parser:parse() -- Run an http server on the requested (or default) addr and dir local http_plugin = plugin.do_string([[ local http = require 'http' local args = unpack(arg) http.serve_static(args.dir, args.addr) ]], args) http_plugin:run() -- Wait for http server to be running and serve status.html local wait_plugin = plugin.do_string([[ local http = require 'http' local args = unpack(arg) local client = http.client() local url = string.format("http://%s/status.html", args.addr) repeat local response, err = client:do_request(http.request("GET", url)) until not err and response.code == 200 ]], args) wait_plugin:run() -- Wait for it to finish up to 2 seconds local err = wait_plugin:wait(2) assert(not err, err) -- With the http server running, run the actual file under test -- Run with a plugin so that none of the plugins used by _this file_ are loaded before invoking dofile local run_plugin = plugin.do_string([[ arg[0] = table.remove(arg, 1) dofile(arg[0]) ]], unpack(args.rest)) run_plugin:run() -- Wait for the run to complete and report errors, if any local err = run_plugin:wait() assert(not err, err) -- Stop the http server for good measure http_plugin:stop() And the bats test looked something like: #!/usr/bin/env bats load test-helpers function setup() { mk_temp_meta_dir export SD_META_DIR="$TEMP_SD_META_DIR" export SD_API_URL="http://localhost:2113/" export SD_TOKEN=SD_TOKEN export SD_BUILD_ID=12345 export SERVER_PID="$!" } function teardown() { rm_temp_meta_dir } @test "artifacts with no command is an error" { run "${BATS_TEST_DIRNAME}/run.lua" echo "$status" echo "$output" ((status)) } @test "manifest gets a few files" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" manifest echo "$status" echo "$output" ((!status)) grep foo.txt <<<"$output" grep bar.txt <<<"$output" grep manifest.txt <<<"$output" } @test "copy foo.txt myfoo.txt writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" copy foo.txt "${TEMP_SD_META_DIR}/myfoo.txt" echo "$status" echo "$output" ((!status)) [[ $(<"${TEMP_SD_META_DIR}/myfoo.txt") == "foo" ]] } @test "copy bar.txt mybar.txt writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" copy bar.txt "${TEMP_SD_META_DIR}/mybar.txt" echo "$status" echo "$output" ((!status)) [[ $(<"${TEMP_SD_META_DIR}/mybar.txt") == "bar" ]] } @test "copy -b 101010 -d somedir mydir writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d somedir "${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) } @test "copy -b 101010 -d . mydir gets all artifacts" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d . 
"${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) [[ $(<"${TEMP_SD_META_DIR}/mydir/abc.txt") == abc ]] [[ $(<"${TEMP_SD_META_DIR}/mydir/def.txt") == def ]] (($(find "${TEMP_SD_META_DIR}/mydir" -type f | wc -l) == 5)) } @test "copy -b 101010 -d . -p 1 mydir gets all artifacts" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d . -p 1 "${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) [[ $(<"${TEMP_SD_META_DIR}/mydir/abc.txt") == abc ]] [[ $(<"${TEMP_SD_META_DIR}/mydir/def.txt") == def ]] (($(find "${TEMP_SD_META_DIR}/mydir" -type f | wc -l) == 5)) } @test "parent-id 12345 gets 99999" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" parent-id -b 12345 echo "$status" echo "$output" ((!status)) (( $output == 99999 )) }

Writing Lua scripts with meta

August 24, 2022
New bug fixes and features - May 2022 May 3, 2022

New bug fixes and features - May 2022

Latest Updates - May 2022

Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- Show base branch name on pipeline graph nav
- Relaxing blockedBy for same job: you can optionally run the same job at the same time in different events using the annotations `screwdriver.cd/blockedBySameJob` and `screwdriver.cd/blockedBySameJobWaitTime`
- Add resource limit environment variables to build pod template: `CONTAINER_CPU_LIMIT`, `CONTAINER_MEMORY_LIMIT`
- Add environment variable for private pipeline: `SD_PRIVATE_PIPELINE` will be set to `true` if the pipeline is private, otherwise `false`
- Add job enable or disable toggle on pipeline tooltip
- Option to filter out events that have no builds from workflow graph in UI

Bug Fixes
- API: Use non-readOnly DB to get latest build for join
- API: Return 404 error when GitHub api returns 404
- API: Multi-platform builds
- API: The build parameter should not be polluted by another pipeline
- API: Return 404 in openPr branch not found
- API: Update promster hapi version
- queue-service: Multi-platform builds
- UI: Multi-platform builds
- UI: Unify checkbox expansion behavior on pipeline creation page
- UI: Switch from power icon to info icon
- UI: Wait for rendering
- UI: Toggle checkbox when label text clicked
- Store: Multi-platform builds
- Store: Add function to delete zip files
- Store: Enable to Upload and Download artifact files by unzip worker scope token
- Launcher: Support ARM64 binary for sd-step
- Launcher: Build docker image for multiple platforms
- Launcher: Add buildkit flag
- Launcher: Use automatic platform args
- Launcher: Make launcher docker file multi-arch compatible

Internals
- homepage: Use tinyurl instead of git.io
- sd-cmd: Support arm64
- sd-local: Use latest patch version of golang 1.17
- meta-cli: Ensure that the jobName exists (before it was looking up “null”)
- meta-cli: Make meta get parameters behave like it does for children (i.e. apply the job overrides)
- meta-cli: Upgrade gopher-lua-libs for base64 support (and json/yaml file-io encoder/decoder)

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.239
- UI - v1.0.687
- Store - v4.2.5
- Queue-Service - v2.0.35
- Launcher - v6.0.161
- Build Cluster Worker - v2.24.0

Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Haruka
- Hiroki
- Kazuyuki
- Keisuke
- Kevin
- Naoaki
- Pritam
- Sheridan
- Teppei
- Tiffany
- Yuki
- Yuta

Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Author
Tiffany Kyi, Sr Software Dev Engineer, Yahoo

New Bug Fixes and Features - March 2022 March 30, 2022

New Bug Fixes and Features - March 2022

Latest Updates - March 2022 Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - (GitLab) Group owners can create pipelines for projects they have admin access to - Option to filter out events that have no builds from workflow graph in UI Bug Fixes - API: Error fix in removeJoinBuilds - API: Error code when parseUrl failed - API: Source directory can be 2 characters or less - API: New functional tests for parent event, source directory, branch-specific job, restrict PR setting, skip build - queue-service: Region map value name - queue-service: Do not retry when processHooks times out - UI: Update validator with provider field - UI: Change color code to be more colorblind-friendly - UI: Properly prompt and sync no-admin pipelines - UI: Add string case for provider for validator - UI: Disable click Start when set annotations - Launcher: Do not include parameters from external builds during remote join - buildcluster-queue-worker: Create package-lock.json - buildcluster-queue-worker: Fix health check processing error - buildcluster-queue-worker: Do not requeue when executor returns 403 or 404 error Internals - sd-cmd: Restrict debug store access log by verbose option - template-main: Requires >=node:12 - toolbox: Add logs to troubleshoot release files - guide: Update Gradle example Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.224 - UI - v1.0.680 - Store - v4.2.3 - Queue-Service - v2.0.30 - Launcher - v6.0.149 - Build Cluster Worker - v2.23.3 Contributors Thanks to the following contributors for making this feature possible: - Alan - Harura - Ibuki - Jithin - Joe - Keisuke - Kenta - Naoaki - Pritam - Ryosuke - Sagar - Shota - Tiffany - Teppei - Yoshiyuki - Yuki - Yuta Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack. Author Tiffany Kyi, Sr Software Dev Engineer, Yahoo

Latest Updates - February 2022 February 15, 2022

Latest Updates - February 2022

Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - Multi-tenant AWS Builds using AWS CodeBuild or EKS - Micro Service to process SCM webhooks asynchronously. Bug Fixes - UI: Hide stop button for unknown events. - UI: Properly update workflow graph for a running pipeline - API: Prevent status change for a finished build. - API: Return proper response code when Pipeline has no admins. - API: Pull Request which spans multiple pipelines sometimes fail to start all jobs. - API: Blocked By for the same job is not always working. - API: Restarting build can fail sometimes when job parameters are used. - API: Join job did not start when restarting a failed job. - sd-local: Support for changing build user. Internals - API: Reduce Database calls during workflow processing. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.206 - UI - v1.0.670 - Store - v4.2.3 - Queue-Service - v2.0.26 - Launcher - v6.0.147 - Build Cluster Worker - v2.23.0 Contributors Thanks to the following contributors for making this feature possible: - Alan - Jithin - Ibuki - Harura - Kenta - Keisuke - Kevin - Naoaki - Pritam - Sagar - Tiffany - Yoshiyuki - Yuichi - Yuki - Yuta Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack. Author Jithin Emmanuel, Director of Engineering, Yahoo

Introducing YChaos - The resilience testing framework January 27, 2022

Introducing YChaos - The resilience testing framework

Shashank Sharma, Software Engineer, Yahoo

We, the resilience team, are glad to announce the release of YChaos, an end-to-end resilience testing framework that injects real-time failures into systems and verifies a system's readiness to handle those failures. YChaos provides an easy-to-understand, quick-to-set-up tool to perform a predefined chaos on a system. YChaos started as “Gru”, a tool that uses Yahoo’s internal technologies to run “Minions” on a predefined target system, creating a selected chaos on the system and restoring the system to its normal state once the testing is complete. YChaos has evolved a lot since then with a better architecture, keeping the essence of Gru, catering to the use cases of open source enthusiasts while also supporting technologies used widely in Yahoo like Screwdriver CI/CD, Athenz, etc.

Get Started

The term chaos is intriguing. To know more about YChaos, you can start by installing the YChaos package:

pip install ychaos[chaos]

The above installs the latest stable YChaos package (chaos subpackage) on your machine. To install the latest beta version of the package, you can install from the test.pypi index:

pip install -i https://test.pypi.org/simple/ ychaos[chaos]

To install the actual attack modules that cause chaos on the system, install the agents subpackage. If you are planning to create chaos on a remote target, this is not needed.

pip install ychaos[agents]

That’s all. You are now ready to create your first test plan and run the tool. To know more, head over to our documentation.

Design and Architecture

YChaos is developed keeping in mind the Chaos Engineering principles. The framework provides a method to verify that a system is in a condition that supports performing chaos on it, along with providing “Agents” that are the actual chaos modules that inject a predefined failure on the system. The tool can also effectively be used to monitor and verify that the system is back to normal once the chaos is complete.

YChaos Test Plan

Most of the modules of YChaos require a structured document that defines the actual chaos/verification plan that the user wants to perform. This is termed the test plan. The test plan can be written in JSON or YAML format, adhering to the schema given by the tool. The test plan provides a number of attributes that can be configured, including verification plugins, agents, etc. Once the tool is fed with this test plan, it takes this configuration for anything it wants to do going forward. If you have installed YChaos, you can check the validity of the test plan you have created by running:

ychaos testplan validate /tmp/testplan.yaml

YChaos Verification Plugins

YChaos provides various plugins within the framework to verify the system state before, during, and after the chaos. This can be used to determine if the system is in a state good enough to perform an attack, verify the system is behaving as expected during the attack, and confirm the system has returned to normal once the attack is done. YChaos currently bundles the following plugins, ready to be used:
1. Python Module: self-configured plugin
2. Requests: verify the latency of an API call
3. SDv4: remotely trigger a configured Screwdriver v4 pipeline and mark its completion as a criterion of verification

We are currently working on adding metrics-based verification to verify a specific metric from the OpenTSDB server and to provide different criteria (Numerical and Relative) to verify the system is in an expected state.
To know more about YChaos Verification and how to run verification, visit our documentation. The documentation provides a way to configure a simple python_module plugin and run verification.

YChaos Target Executor

The target executor, or just Executor, is the one determining the necessary steps to run the Agent Coordinator. The target defines the place where the chaos takes place. The Executor determines the right configuration to reach the actual target, thereby making the target available for the Agent Coordinator to run the Agents. Currently, YChaos supports the MachineTarget executor to SSH to a particular host and run the Agents on it. Other targets like Kubernetes/Docker and Self are also under consideration.

YChaos Agent Coordinator

The agent coordinator prepares the agents configured in the test plan to run on the target. It also takes care of monitoring the lifecycle of each agent so that all of the agents run in a structured way, and it ensures the agents are torn down before ending the execution. The agent coordinator acts as a single point of control for all the agents running on the target.

YChaos Agents (Formerly Minions)

The agents are the actual attack modules that can be configured to create a specific chaos on the target. For example, the CPU Burn Agent is specifically designed to burn up the CPU cores for a configured amount of time. The agents are bundled with an Agent Configuration that provides attributes that can be configured by the user. For example, the CPU Burn Agent configuration provides cores_pct, which the user can configure to run the process on a percentage of the CPU cores on the target. YChaos Agents are designed in such a way that it is possible to run them independently without any intermediates like a coordinator. This helps in quick development and testing of agents. Agents follow a sequence in their execution called lifecycle methods: setup, run, teardown, and monitor. The setup initializes the prerequisites for an agent to execute. Run contains the program logic required to perform a chaos on the system. Once the run executes successfully, the teardown can be triggered to restore the system from the chaos created by that particular agent.

Acknowledgement

We would like to thank all the contributors to YChaos as an idea, concept, or code. We extend our gratitude to all those supporting the project from “Gru” to “YChaos”.

Summary

This post introduced a new Chaos Engineering and resilience testing tool, YChaos, showed how to get started with it, and briefly discussed the design and architecture of the components that make up YChaos, along with some quick examples to start your journey with YChaos.

References and Links
1. YChaos Codebase: https://github.com/yahoo/ychaos
2. YChaos Documentation: https://yahoo.github.io/ychaos
3. Our Presence on PyPi
4. https://test.pypi.org/project/ychaos/
5. https://pypi.org/project/ychaos/

Latest Updates - December 2021 December 22, 2021

Latest Updates - December 2021

Happy Holidays. Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - Build parameters can be defined for jobs. - UI: Show confirmation dialog when setting private pipelines public. - UI: Option to always show pipeline triggers. - UI: Option to display events by chronological order. - UI: Unified UX for Pull Requests. - Executor: Cluster admins can provide data into the build environment. Bug Fixes - UI: Properly start jobs in list view with parameters. - UI: Properly close tool tips. - API: Builds in blocked status can sometimes appear stuck - API: Cleanup `subscribe` configuration properly. - API: Speed up large pipeline deletion - API: Pipeline creation sometimes fails due to job name length in “requires” configuration. - API: Sonarqube configuration was not automatically created - API: Redlock setting customization was not working. - API: Template/Command publish was failing without specifying minor version. - API: Unable to publish latest tag to template in another namespace. - Queue Service: Properly handle API failures. - Launcher: Handle jq install properly. Internals - Remove dependency on deprecated “request” npm package. - Meta-cli download via go get now works as expected. - Semantic release library updated to v17 - Launcher: Support disabling habitat in build environment. - Adding more functional tests to the API. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.179 - UI - v1.0.668 - Store - v4.2.3 - Queue-Service - v2.0.18 - Launcher - v6.0.147 - Build Cluster Worker - v2.23.0 Contributors Thanks to the following contributors for making this feature possible: - Alan - Jithin - Ibuki - Harura - Kazuyuki - Kenta - Keisuke - Kevin - Naoaki - Om - Pritam - Ryosuke - Sagar - Tiffany - Yoshiyuki - Yuichi - Yuki Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack. Author Jithin Emmanuel, Director of Engineering, Yahoo

Apache Pulsar: Seamless Storage Evolution and Ultra-High Performance with Persistent Memory October 26, 2021

Apache Pulsar: Seamless Storage Evolution and Ultra-High Performance with Persistent Memory

Rajan Dhabalia, Sr. Principal Software Engineer, Yahoo Joe Francis, Apache Pulsar PMC Introduction  We have been using Apache Pulsar as a managed service in Yahoo! since 2014. After open-sourcing Pulsar in 2016, entering the Apache Incubator in 2017, and graduating as an Apache Top-Level Project in 2018, there have been a lot of improvements made and many companies have started using Pulsar for their messaging and streaming needs. At Yahoo, we run Pulsar as a hosted service, and more and more use cases run on Pulsar for different application requirements such as low latency, retention, cold reads, high fanout, etc. With the rise of the number of tenants and traffic in the cluster, we are always striving for a system that is both multi-tenant and can use the latest storage technologies to enhance performance and throughput without breaking the budget. Apache Pulsar provides us that true multi-tenancy by handling noisy-neighbor syndrome and serving users to achieve their SLA without impacting each other in a shared environment. Apache Pulsar also has a distinct architecture that allows Pulsar to adopt the latest storage technologies from time to time to enhance system performance by utilizing the unique characteristics of each technology to get the best performance out of it.  In this blog post, we are going to discuss two important characteristics of Apache Pulsar, multi-tenancy and adoption of next-generation storage technologies like NVMe and Persistent memory to achieve optimum performance with very low-cost overhead. We will also discuss benchmark testing of Apache Pulsar with persistent memory that shows we have achieved 5x more throughput with Persistent memory and also reduced the overall cost of the storage cluster.   What is Multi-Tenancy? Multi-tenancy can be easily understood with the real-estate analogy and by understanding the difference between an apartment building and a single residence home. In apartment buildings, resources (exterior wall, utility, etc.) are shared among multiple tenants whereas in a single residence only one tenant consumes all resources of the house. When we use this analogy in technology, it describes multi-tenancy in a single instance of hardware or software that has more than one resident. And it's important that all residents on a shared platform operate their services without impacting each other.  Apache Pulsar has an architecture distinct from other messaging systems. There is a clear separation between the compute layer (which does message processing and dispatching) and the storage layer (that handles persistent storage for messages using Apache BookKeeper). In BookKeeper, bookies (individual BookKeeper storage nodes) are designed to use three separate I/O paths for writes, tailing reads, and backlog reads. Separating these paths is important because writes and tailing reads use-cases require predictable low latency while throughput is more important for backlog reads use cases.  Real-time applications such as databases and mission-critical online services need predictable low latency. These systems depend on low-latency messaging systems. In most messaging systems, under normal operating conditions, dispatch of messages occurs from in-memory caches. But when a message consumer falls behind, multiple interdependent factors get triggered. The first is storage backlog. Since the system guarantees delivery, messages need to be persistently stored until delivery, and a slow reader starts building a storage backlog. 
Second, when the slow consumer comes back online, it starts to consume messages from where it left off. Since this consumer is now behind, and older messages have been aged out of the in-memory cache, messages need to be read back from disk storage, and cold reads on the message store will occur. This backlog reads on the storage device will cause I/O contention with writes to persist incoming messages to storage getting published currently. This leads to general performance degradation for both reads and writes. In a system that handles many independent message topics, the backlog scenario is even more relevant, as backlogged topics will cause unbalanced storage across topics and I/O contention. Slow consumers force the storage system to read the data from the persistent storage medium, which could lead to I/O thrashing and page cache swap-in-and-out. This is worse when the storage I/O component shares a single path for writes, caught-up reads, and backlog reads.  A true test of any messaging system should be a test of how it performs under backlog conditions. In general, published throughput benchmarks don't seem to account for these conditions and tend to produce wildly unrealistic numbers that cannot be scaled or related to provisioning a production system. Therefore, the benchmark testing that we are presenting in this blog is performed with random cold reads by draining backlog across multiple topics.   BookKeeper and I/O Isolation Apache BookKeeper stores log streams as segmented ledgers in bookie hosts. These segments (ledgers) are replicated to multiple bookies. This maximizes data placement options, which yields several benefits, such as high write availability, I/O load balancing, and a simplified operational experience. Bookies manage data in a log-structured way using three types of files: Journal contains BookKeeper transaction logs. Before any update to a ledger takes place, the bookie ensures that a transaction describing the update is written to non-volatile storage. Entry log (Data-File) aggregates entries from different ledgers (topics) and writes sequentially and asynchronously. It is also known as Data File. Entry log index manages an index of ledger entries so that when a reader wants to read an entry, the BookKeeper locates the entry in the appropriate entry log and offset using this index. With two separate file systems, Journal and Data-file, BookKeeper is designed to use separate I/O paths for writes, caught-up reads, and backlog reads. BookKeeper does sequential writes into journal files and performs cold reads from data files for the backlog draining.   [Figure 1: Pulsar I/O Isolation Architecture Diagram]   Adoption of Next-Generation Storage Technologies In the last decade, storage technologies have evolved with different types of devices such as HDD, SSD, NVMe, persistent memory, etc. and we have been using these technologies for Pulsar storage as time changes. Adoption of the latest technologies is helpful in Pulsar to enhance system performance but it’s also important to design a system that can fully use a storage device based on its characteristics and squeeze the best performance out of each kind of storage. Table 2. shows how each device can fit into the BookKeeper model to achieve optimum performance.  [Table 2: BookKeeper adaptation based on characteristics of storage devices]   Hard Disk Drive (HDD) From the 80s until a couple of years ago, database systems have relied on magnetic disks as secondary storage. 
The primary advantages of a hard disk drive are affordability from a capacity perspective and reasonably good sequential performance. As we have already discussed, bookies append transactions to journals and always write to journals sequentially. So, a bookie can use hard disk drives (HDDs) with a RAID controller and a battery-backed write cache to achieve writes at lower latency than latency expectations from a single HDD. Bookie also writes entry log files sequentially to the data device. Bookies do random reads when multiple Pulsar topics are trying to read backlogged messages. So, in total, there will be an increased I/O load when multiple topics read backlog messages from bookies. Having journal and entry log files on separate devices ensures that this read I/O is isolated from writes. Thus Pulsar can always achieve higher effective throughput and low latency writes with HDDs.

There are other messaging systems that use a single file to write and read data for a given stream. Such systems have to do a lot of random reads if consumers from multiple streams start reading backlog messages at the same time. In a multi-tenant environment, it’s not feasible for such systems to use HDDs to achieve consistent low-write latency along with backlog consumer reads because in HDD, random reads can directly impact both write and read latencies and eventually writes have to suffer due to random cold reads on the disk.

SATA Solid State Drives (SSD)

Solid-state disks (SSDs) based on NAND flash media have transformed the performance characteristics of secondary storage. SSDs are built from multiple individual flash chips wired in parallel to deliver tens of thousands of IOPS and latency in the hundred-microsecond range, as opposed to HDDs with hundreds of IOPS and latencies in milliseconds. Our experience (Figure 3) shows that SSD provides higher throughput and better latency for sequential writes compared to HDDs. We have seen significant bookie throughput improvements by replacing HDDs with SSDs for just the journal devices.

Non-Volatile Memory Express (NVMe) SSD

Non-Volatile Memory Express (NVMe) is another of the current technology industry storage choices. The reason is that NVMe creates parallel, low-latency data paths to underlying media to provide substantially higher performance and lower latency. NVMe can support multiple I/O queues, up to 64K, with each queue having 64K entries. So, NVMe’s extreme performance and peak bandwidth will make it the protocol of choice for today’s latency-sensitive applications. However, in order to fully utilize the capabilities of NVMe, an application has to perform parallel I/O by spreading I/O loads to parallel processes. With BOOKKEEPER-963 [2], the bookie can be configured with multiple journals. Each individual thread sequentially writes to its dedicated journal. So, bookies can write into multiple journals in parallel and achieve parallel I/O based on NVMe capabilities. Pulsar performs 2x-3x better with NVMe compared to SATA/SAS drives when the bookie is configured to write to multiple journals.

Persistent Memory

There is a large performance gap between DRAM memory technology and the highest-performing block storage devices currently available in the form of solid-state drives. This gap can be reduced by a novel memory module solution called Intel Optane DC Persistent Memory (DCPMM) [1]. The DCPMM is a byte-addressable cache coherent memory module device that exists on the DDR4 memory bus and permits Load/Store accesses without page caching.
DCPMM is a comparatively expensive technology on unit storage cost to use for the entirety of durable storage. However, BookKeeper provides a near-perfect option to use this technology in a very cost-effective manner. Since the journal is short-lived and does not demand much storage, a small-sized DCPMM can be leveraged as the journal device. Since journal entries are going to be ultimately flushed to ledgers, the size of the journal device and hence the amount of persistent memory needed is in the tens of GB. Adding a small capacity DCPMM on bookie increases the total cost of bookie 5 - 10%, but it gives significantly better performance by giving more than 5x throughput while maintaining low write latency.   Endurance Considerations of Persistent Memory vs SSD Due to the guarantees needed on the data persistence, journals need to be synced often. On a high-performance Pulsar cluster, with SSDs as the journal device to achieve lower latencies, this eats into the endurance budget, thus shortening the useful lifespan of NAND flash-based media. So for high performance and low latency Pulsar deployment, storage media needs to be picked carefully. This issue can, however, be easily addressed by taking advantage of persistent memory. Persistent memory has significantly higher endurance, and the write throughput required for a journal should be handled by this device. A small amount of persistent memory is cheaper than an SSD with equivalent endurance. So from the endurance perspective, Pulsar can take advantage of persistent memory technology at a lower cost.  [Figure 3: Latency vs Throughput with Different Journal Device in Bookie] Figure 3 shows the latency vs performance graph when we use different types of storage devices to store journal files. It illustrates that the Journal with NVMe device gives 350MB throughput and the PMEM device gives 900MB throughput by maintaining consistently low latency p99 5ms. As we discussed earlier, this benchmark testing is performed under a real production situation and the test was performed under backlog conditions. Our primary focus for this test is (a) system throughput and (b) system latency. Most of the applications in our production environment have SLA of p99 5ms publish latency. Therefore, our benchmark setup tests throughput and latency of Apache Pulsar with various storage devices (HDD, SSD, NVMe, and Persistent memory) and with a mixed workload of writes, tail reads, and random cold reads across multiple topics. In the next section, let’s discuss the benchmark test setup and performance results in detail.   Benchmarking Pulsar Performance for Production Use Cases   Workload  We measured the performance of Pulsar for a typical mixed workload scenario. In terms of throughput, higher numbers are achievable (up to the network limit), but those numbers don't help in decision-making for building production systems. There is no one-size-fits-all recommended configuration available for any system. The configuration depends on various factors such as hardware resources of brokers (memory, CPU, network bandwidth, etc.) and bookies (storage disk types, network bandwidth, memory, CPU, etc.), replication configurations (ensembleSize, writeQuorum, ackQuorum), traffic pattern, etc.  The benchmark test configuration is set up to fully utilize system capabilities. Pulsar benchmark test includes various configurations such as a number of topics, message size, number of producers, and consumer processes. 
More importantly, we make an effort to ensure that cold-reads occur, which forces the system to read messages from the disk. This is typical for systems that do a replay, have downstream outages, and have multiple use cases with different consumption patterns. In Verizon Media (Yahoo), most of our use cases are latency-sensitive and they have a publish latency SLA of p99 5ms. Hence these results are indicative of the throughput limits with that p99 limit, and not the absolute throughput that can be achieved with the setup. We evaluated the performance of Pulsar using different types of storage devices (HDD, SSD, NVMe, and PMEM) for BookKeeper journal devices. However, NVMe and PMEM are more relevant to current storage technology trends. Therefore, our benchmark setup and results will be more focused on using NVMe and PMEM for BookKeeper journal devices.

Quorum Count, Write Availability, and Device Tail Latencies

Pulsar has various settings to ensure durability vs availability tradeoffs. Unlike other messaging systems, Pulsar does not halt writes to do recovery in a w=2/a=2 setup. It does not require a w=3/a=2 setup to ensure write availability during upgrades or single node failure. Writing to 2 nodes (writeQuorum=2) and waiting for 2 acknowledgements (ackQuorum=2) provides write availability in Pulsar under those scenarios. In this setup (w=2/a=2), when a single node fails, writes can proceed without interruption instantaneously, while recovery executes in the background to restore the replication factor. Other messaging systems halt writes while doing recovery under these scenarios. While failure may be rare, the much more common scenario of a rolling upgrade is seamlessly possible with a Pulsar configuration of (w=2/a=2). We consider this a marked benefit out of the box, as we are able to get by with a data replication factor of 2 instead of 3 to handle these occasions, with storage provisioned for 2 copies.

Test Setup

We use 3 Brokers, 3 Bookies, and 3 application clients.

Application Configuration:
- 3 Namespaces, 150 Topics
- Producer payload 100KB
- Consumers: 100 topics with consumers doing hot reads, 50 topics with consumers doing cold reads (disk access)

Broker Configuration:
- 96GB RAM, 25Gb NIC
- Pulsar settings: bookkeeperNumberOfChannelsPerBookie=200 [4]
- JVM settings: -XX:MaxDirectMemorySize=60g -Xmx30g

Bookie Configuration 1 (Journal Device: NVMe (Device-1), Ledger/Data Device: NVMe (Device-2)):
- 64GB RAM, 25Gb NIC
- bookkeeperNumberOfChannelsPerBookie=200
- Journal disk: Micron NVMe SSD 9300
- Journal directories: 2 (Bookie configuration: journalDirectories)
- Data disk: Micron NVMe SSD 9300
- Ledger directories: 2 (Bookie configuration: ledgerDirectories)
- JVM settings: -XX:MaxDirectMemorySize=30g -Xmx30g

Bookie Configuration 2 (Journal Device: PMEM, Ledger/Data Device: NVMe):
- 64GB RAM, 25Gb NIC
- bookkeeperNumberOfChannelsPerBookie=200
- PMEM journal device: 2 DIMMs, each with 120GB, mounted as 2 devices
- Journal directories: 4 (2 on each device) (Bookie configuration: journalDirectories)
- Data disk: Micron NVMe SSD 9300
- Ledger directories: 2 (Bookie configuration: ledgerDirectories)
- JVM settings: -XX:MaxDirectMemorySize=30g -Xmx30g

Client Setup

The Pulsar performance tool [3] was used to run the benchmark test.

Results

The performance test was performed on two separate Bookie configurations: Bookie Configuration 1 uses two separate NVMe devices, one each for the journal and data device, and Bookie Configuration 2 uses PMEM as the journal device and NVMe as the data device (a rough bookkeeper.conf sketch of these directory layouts follows below).
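For concreteness, here is a rough sketch of how the journal and ledger directory counts above could be written in a bookie's bookkeeper.conf. The mount paths are illustrative placeholders rather than the ones used in this benchmark; journalDirectories and ledgerDirectories are the settings referenced in the configurations listed earlier, and multiple comma-separated journal directories rely on the multi-journal support from BOOKKEEPER-963 [2]:

# Bookie Configuration 1 style: two NVMe-backed journal directories, two ledger directories
journalDirectories=/mnt/nvme-journal-0,/mnt/nvme-journal-1
ledgerDirectories=/mnt/nvme-ledger-0,/mnt/nvme-ledger-1

# Bookie Configuration 2 style: four journal directories spread across the two PMEM devices
journalDirectories=/mnt/pmem0/journal-0,/mnt/pmem0/journal-1,/mnt/pmem1/journal-0,/mnt/pmem1/journal-1
ledgerDirectories=/mnt/nvme-ledger-0,/mnt/nvme-ledger-1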
[Table 4: Pulsar Performance Evaluation]

As noted before, read/write latency variations occur when an NVMe SSD controller is busy with media management tasks such as Garbage Collection, Wear Leveling, etc. The p99 NVMe disk latency goes high with certain workloads, and that impacts the Pulsar p99 latency under a replication configuration of e=2, w=2, a=2. (The p95 NVMe disk latency is not affected, and so Pulsar p95 latencies are still under 5ms.) The impact of the NVMe wear leveling and garbage collection can be mitigated by a replication configuration of e=3, w=3, and a=2, which helps flatten out the Pulsar p99 latency graph across 3 bookies and achieves higher throughput while maintaining low 5ms p99 latency. We don’t see such improvements in the PMEM journal device setup with such a replication configuration. The results demonstrate that a bookie with NVMe or PMEM storage devices gives fairly high throughput at around 900MB while maintaining low 5ms p99 latency. While performing benchmark tests on the NVMe journal device setup with replication configuration e=3, w=3, ack=2, we captured io-stats of each bookie. Figure 5 shows that a bookie with a PMEM device provides 900MB write throughput with consistent low latency (< 5ms).

[Figure 5: Latency Vs Time (PMEM Journal Device with 900MB Throughput)]

[Figure 6: Pulsar Bookie IO Stats]

IO stats (Figure 6) show that the journal device serves around 900MB writes and no reads. The data device also serves 900MB avg writes while serving 350MB reads from each bookie.

Performance & User Impact

The potential user impact of software-defined storage is best understood in the context of the performance, scale, and latency that characterize most distributed systems today. You can determine if a software solution is using storage resources optimally in several different ways, and two important metrics are throughput and latency. We have been using bookies with PMEM journal devices in production for some time by replacing HDD-RAID devices. Figure 7 shows the write throughput vs latency bucket graph for bookies with an HDD-RAID journal device, and Figure 8 shows it for a PMEM journal device. Bookies with the HDD-RAID configuration have high write latency when traffic spikes, and the graph shows that requests with > 50ms write latency increase with higher traffic. On the other hand, bookies with a PMEM journal device provide stable and consistent low latency with higher traffic and serve user requests within SLA. These graphs explain the user impact of PMEM, which allows bookies to serve latency-sensitive applications and meet their SLA even with a spike in traffic.

[Figure 7. Bookie Publish Latency Buckets with HDD-RAID Bookie Journal Device]

[Figure 8. Bookie Publish Latency Buckets with PMEM Bookie Journal Device]

Final Thoughts

Pulsar architecture can accommodate different types of hardware, which allows users to balance performance and cost based on required throughput and latency. Pulsar has the capability to adapt to the next generation of storage devices to achieve better performance. We have also seen that persistent memory excels in the race to achieve higher write throughput while maintaining low latency.

Appendix
[1] DC Persistent Memory Module: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html
[2] Multiple Journal Support: https://issues.apache.org/jira/browse/BOOKKEEPER-963.
[3] Pulsar Performance Tool: http://pulsar.apache.org/docs/en/performance-pulsar-perf/.
[4] Per Bookie Configurable Number of Channels: https://github.com/apache/pulsar/pull/7910.

Latest Updates - August 2021 August 27, 2021

Latest Updates - August 2021

Jithin Emmanuel, Director of Engineering, Verizon Media Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - Pipeline Visualizer tool to view connected pipelines in a single UI. - Offline queue processing to detect and fail builds early with Kubernetes executor. - Screwdriver now uses Docker in Docker to publish images to Docker Hub. - Build artifacts to be streamed via API to speed up artifact rendering. - Update eslint rules to latest across libraries and applications. - Executors should be able to mount custom data into the build environment. - UI to streamline display of Start/Stop buttons for Pull Requests. Bug Fixes - Launcher: Fix for not able to update build parameters when restarting builds. - Queue Service: QUEUED notification is sent twice. - API: QUEUED build status notification was not being sent. - API: Validate input when updating user settings. - UI: Fix for Template/Command title breadcrumbs not working. - UI: Validate event URL path parameters. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.140 - UI - v1.0.655 - Store - v4.2.2 - Queue-Service - v2.0.11 - Launcher - v6.0.137 - Build Cluster Worker - v2.20.2 Contributors Thanks to the following contributors for making this feature possible: - Alan - Jithin - Ibuki - Harura - Kazuyuki - Kenta - Keisuke  - Kevin - Naoaki - Mansoor - Om - Pritam - Tiffany - Yoshiyuki - Yuichi Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Latest Updates - June 2021 June 30, 2021

Latest Updates - June 2021

Jithin Emmanuel, Engineering Manager, Verizon Media

Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- Read protections for Pipelines for Private SCM Repositories.
- Support Read-Only SCM for mirrored Source Repositories.
- API: Allow Pipeline tokens to list secrets.
- API: Support PENDING Pull Request status check using metadata.
- UI: Link to the Pipeline which published a Template/Command.
- Queue Worker: Add offline processing to verify if builds have started.

Bug Fixes
- Launcher: Fix metadata getting overwritten.
- Launcher: Fix broken builds.
- Launcher: Prevent Shared Commands from logging by default.
- Launcher: UUID library used is no longer supported.
- UI: PR job name in the build page should link to the Pull Requests tab.
- UI: Do not show remove option for Pipelines in default collections.
- UI: List view is slow for pipelines with large numbers of jobs.
- Store: Return proper Content-Type for artifacts.
- API: Fix broken tests due to higher memory usage.
- API: Job description is missing when templates are used.

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.120
- UI - v1.0.644
- Store - v4.1.11
- Queue-Service - v2.0.7
- Launcher - v6.0.133
- Build Cluster Worker - v2.15.0

Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Naoaki
- Mansoor
- Om
- Pritam
- Tiffany
- Yoshiyuki
- Yuichi

Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Latest Updates - May 2021 June 2, 2021

Latest Updates - May 2021

Jithin Emmanuel, Engineering Manager, Verizon Media

Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- Ability to Abort Frozen builds.
- Badges are supported natively without having to connect to an external service.

Bug Fixes
- UI should skip builds with `CREATED` status when computing event status.
- UI setting for graph job name adjustment was not working.
- UI: Fix event label overflowing for large values and update Stop button position.
- UI: Show restart option for builds in PR Chain.
- UI: Tone down the color when the build parameters are changed from default value.
- UI: Stop rendering files with binary content.
- UI: Fix validation for Git repository during pipeline creation.
- API: Fix for trusted templates not getting carried over to new versions.
- API: Fix validation when a step name is defined that duplicates an automatically generated step.
- API: Streamline remove command tag API response.
- Store: Fix for large file downloads failing.
- Launcher: Fix metadata getting overwritten.

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.96
- Queue-Service - v2.0.6
- UI - v1.0.629
- Store - v4.1.7
- Launcher - v6.0.128
- Build Cluster Worker - v2.10.0

Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kazuki
- Kenta
- Keisuke
- Kevin
- Lakshminarasimhan
- Naoaki
- Mansoor
- Pritam
- Shu
- Tiffany
- Yoshiyuki
- Yuichi

Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Latest Updates - April 2021 April 22, 2021

Latest Updates - April 2021

Jithin Emmanuel, Engineering Manager, Verizon Media Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - External config to have support for Source Directory in child pipelines. - Removing expiry of shared commands. - API to support OR workflow for jobs. - Collections UX improvements. Part of Yahoo Hack Together.        - Proper validation of modal. - Make sure mandatory fields are filled in. - UI: Option to hide PR jobs in event workflow - UI: Hide builds in `CREATED` status to avoid confusion. - store-cli: Support for parallel writes to build cache with locking. - Improvements to sd-local log format - Fix for broken lines. - Non-verbose logging for interactive mode. - New API to remove a command tag. - Warn users if build parameters are different from default values. Bug Fixes - API: Fix for a join build stuck in “CREATED” status due to missing join data. - Queue Service: Enhanced error handling to reduce errors in build periodic processing. - API: Prevent users from overwriting job audit data. - UI : properly validate templates even if there are extra lines above config. - UI : Fix for duplicate event displayed in event list. - Launcher: Support setting pushgateway protocol schema  - Launcher: Enable builds to read metadata from the entire event in addition to immediate parent builds. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.84 - Queue-Service - v2.0.6 - UI - v1.0.618 - Store - v4.1.3 - Launcher - v6.0.128 - Build Cluster Worker - v2.10.0 Contributors Thanks to the following contributors for making this feature possible: - Alan - Dekus - Jithin - Ibuki - Harura - Kazuyuki - Kazuki - Kenta - Keisuke  - Krishna - Kevin - Lakshminarasimhan - Naoaki - Mansoor - Pritam - Rakshit - Shu - Tiffany - Yoshiyuki - Yuichi Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Latest Updates - March 2021 March 8, 2021

Latest Updates - March 2021

Jithin Emmanuel, Engineering Manager, Verizon Media

Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.

New Features
- sd-local cli enhancements:
  - Compatibility with podman.
  - Make ssh-agent work with non-root user containers.
  - Added User-Agent info on requests from sd-local to the API to track usage.
- Template owners can lock template steps to prevent step override.
- UI can be prevented from restarting specific jobs using the “manualStartEnabled” annotation.
- Added health check endpoint for “buildcluster-queue-worker”.

Bug Fixes
- Fix for trusted templates not showing up in UI.
- Launcher was not terminating the current running step on timeout or abort.
- API can now start with default configuration.
- Store not starting with memory strategy.
- Local development setup was broken.

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.66
- Queue-Service - v2.0.5
- UI - v1.0.604
- Store - v4.1.1
- Launcher - v6.0.122
- Build Cluster Worker - v2.9.0

Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Lakshminarasimhan
- Naoaki
- Pritam
- Tiffany
- Yoshiyuki

Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Join Screwdriver at Yahoo Hack Together (Virtual Open Source Hackathon), March 21 - 28 March 4, 2021

Join Screwdriver at Yahoo Hack Together (Virtual Open Source Hackathon), March 21 - 28

We’re thrilled to be participating in Yahoo Hack Together, a virtual, open source hackathon, running from March 21 through 28. In addition to Screwdriver, there are several other awesome projects participating. Themes include Data, Design, and Information Security (Defense). The hackathon also includes: - Suggested topics/issues for you to get started - Support channels to reach out to project maintainers - Office Hours to ask questions and get feedback - Verizon Media swag & prizes Eligible contributions include accessibility reviews, coding, design, documentation, translations, user experience, and social media suggestions. We’d love to invite you to join us!

cdCon 2021 - Call for Screwdriver Proposals February 17, 2021

cdCon 2021 - Call for Screwdriver Proposals

Dear Screwdriver Community,

cdCon 2021 (the Continuous Delivery Foundation’s annual flagship event) is happening June 23-24 and its call for papers is open! This is your chance to share what you’ve been doing with Screwdriver. Are you building something cool? Using it to solve real-world problems? Are you making things fast? Secure? Or maybe you’re a contributor and want to share what’s new. In all cases, we want to hear from you! Submit your talk for cdCon 2021 to be part of the conversation driving the future of software delivery for technology teams, enterprise leadership, and open-source communities.

Submission Deadline
Final Deadline: Friday, March 5 at 11:59 PM PST

Topics
Here are the suggested tracks:
- Continuous Delivery Ecosystem – This track spans the entire Continuous Delivery ecosystem, from workflow orchestration, configuration management, testing, security, release automation, deployment strategies, developer experience, and more.
- Advanced Delivery Techniques – For talks on the very cutting edge of continuous delivery and emerging technology, for example, progressive delivery, observability, and MLOps.
- GitOps & Cloud-Native CD – Submit to this track for talks related to continuous delivery involving containers, Kubernetes, and cloud-native technologies. This includes GitOps, cloud-native CD pipelines, chatops, best practices, etc.
- Continuous Delivery in Action – This track is for showcasing real-world continuous delivery addressing challenges in specific domains e.g. fintech, embedded, healthcare, retail, etc. Talks may cover topics such as governance, compliance, security, etc.
- Leadership Track – Talks for leaders and decision-makers on topics such as measuring DevOps, build vs buy, scaling, culture, security, FinOps, and developer productivity.
- Community Track – There is more to open source than code contributions. This track covers topics such as growing open source project communities, diversity & inclusion, measuring community health, project roadmaps, and any other topic around sustaining open source and open source communities.
- Singular project focus and/or interoperability between:
  - Jenkins
  - Jenkins X
  - Ortelius
  - Spinnaker
  - Screwdriver
  - Tekton
  - Other – e.g. Keptn, Flagger, Argo, Flux

View all tracks and read CFP details here. We look forward to reading your proposal! Submit here [https://events.linuxfoundation.org/cdcon/program/cfp/]

Latest Updates February 15, 2021

Latest Updates

Jithin Emmanuel, Engineering Manager, Verizon Media Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - Group Events in UI to visualize events started by restarting jobs in one place.  - Template Composition to enable Template authors to inherit job configuration from an existing Template - Lua scripting support in meta cli. - Launcher now bundles skopeo binary. - UI to highlight the latest event in the events list. - Notification configuration validations errors can be made into warnings. - Build cache performance enhancement by optimizing compression algorithms.  - Streamline Collection deletion UI flow. Bug Fixes - SonarQube PR analysis setting is not always added. - Session timeout was leading to 404. - Templates & Commands UI page load time is now significantly faster. - Fix Templates permalink. - Clarify directions for build cluster queue setup.  Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.53 - Queue-Service - v2.0.0 - UI - v1.0.598 - Store - v4.0.2 - Launcher - v6.0.115 - Build Cluster Worker - v2.9.0 Contributors Thanks to the following contributors for making this feature possible: - Alan - Dekus - Jithin - Ibuki - Kkawahar - Keisuke  - Kevin - Lakshminarasimhan - Pritam - Sheridan C Rawlins - Tiffany - Yoshiyuki  Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Improvements and updates. January 7, 2021

Improvements and updates.

Jithin Emmanuel, Engineering Manager, Verizon Media Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components. New Features - sd-local tool support for mounting of local ssh-agent & custom volumes. - Teardown steps will run for aborted builds. Users can control duration via annotation terminationGracePeriodSeconds - Properly validate settings configuration in `screwdriver.yaml` - This will break existing pipelines if the setting value is already wrong. - Support exclusions in source paths. - Warn users if template version is not specified when creating pipeline. Bug Fixes - Meta-cli now works with strings of common logarithms. - Jobs with similar names were breaking the pipeline detail page. - Pipeline list view to lazy load data for improved performance. - Fix for slow rendering when rendering Pipeline workflow graph. Compatibility List In order to have these improvements, you will need these minimum versions: - API - v4.1.36 - Queue-Service - v2.0.0 - UI - v1.0.590 - Store - v4.0.2 - Launcher - v6.0.106 - Build Cluster Worker - v2.3.3 Contributors Thanks to the following contributors for making this feature possible: - Alan - Dekus - Jithin - Ibuki - Kenta - Kkawahar - Keisuke  - Kevin - Lakshminarasimhan - Pritam - Tiffany - Yoshiyuki  Questions and Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Explore Screwdriver at CDCon 2020 September 30, 2020

Explore Screwdriver at CDCon 2020

Screwdriver is an open-source build platform for Continuous Delivery. Using Screwdriver, you can easily define the path that your code takes from Pull Request to Production. The Screwdriver team will be presenting three talks at CDCon (Oct 7-8) and would love to have you join! Register to attend CDCon. CDCon has pledged to donate 100% of the proceeds received from CDCon 2020 registration to charitable causes: Black Girls Code, Women Who Code and the CDF Diversity Fund. Registrants indicate which charitable fund they want their 25 USD registration fees to go to during registration. Hope to see you at CDCon! - - - Screwdriver UI Walkthrough Oct 7, 12:40 PM PDT Speakers: Alan Dong, Software Engineer, Verizon Media In this session, Alan will cover the fundamental parts of Screwdriver:  - What is a pipeline?  - How to use a screwdriver to set up a pipeline from scratch  - Integrate with SCM (i.e. GitHub)  - Setup collections for personal preferences  - How to get involved with Screwdriver.cd to get help and contribute back to the community  Case Study: How Yahoo! Japan Uses and Contributes to Screwdriver at Scale Oct 7, 2:20 PM PDT Speakers: Hiroki Takatsuka, Engineering Manager, Yahoo! Japan & Jithin Emmanuel, Sr Mgr, Software Dev Engineering, Verizon Media Yahoo! Japan will share how they use and contribute to Screwdriver, an open-source build platform designed for Continuous Delivery, at scale. Several topics will be covered including: architecture, use cases, usage stats, customization, operational tips, and collaborating directly with Verizon Media’s Screwdriver team to constantly evolve Screwdriver. CI/CD with Open Source Screwdriver Oct 8, 3:50 PM PDT Speakers: Jithin Emmanuel, Sr Mgr, Software Dev Engineering & Tiffany Kyi, Software Development Engineer, Verizon Media Now part of the Continuous Delivery Foundation, Screwdriver is an open source CI/CD platform, originally created and open-sourced by Yahoo/Verizon Media. At Yahoo/Verizon Media, Screwdriver is used to run more than 60,000 software builds every day. Yahoo! Japan also uses and contributes to Screwdriver. In this session, core contributors to Screwdriver will provide an overview of features and capabilities, and how it is used at scale covering use-cases across mobile, web applications, and library development across various programming languages.

SonarQube Enterprise Edition Support August 17, 2020

SonarQube Enterprise Edition Support

Tiffany Kyi, Software Engineer, Verizon Media

We have recently added SonarQube Enterprise Edition support to Screwdriver, which unlocks powerful Pull Request workflows and improves build analysis performance. Cluster admins can follow instructions in the Cluster Admin Configuration section below to use SonarQube Enterprise. In order to make use of these new Pull Request features and to better utilize our SonarQube license, we will be making the following changes:
1. The Sonar Project Key for your build will change from “job:” to “pipeline:”.
2. If your project still needs multiple analyses at the job level, we will provide you with a job-level annotation to get a Sonar Project Key scoped to a job.

These changes will enable Screwdriver to provide a Pull Request Analysis feature for all builds. Note: This will create a new SonarQube project for your pipeline; however, your existing analysis data will not be migrated over to the new SonarQube project.

User configuration
1. If you are relying on the Screwdriver SonarQube integration to publish and view test coverage results in the Screwdriver build detail page, then no change is required.
2. If you have a custom integration where you are manually constructing SonarQube scanner parameters, then you need to rely on $SD_SONAR_PROJECT_KEY & $SD_SONAR_PROJECT_NAME for scanner parameters, which will be available in builds based on your project configuration (a short scanner sketch follows at the end of this post). We have also added $SD_SONAR_ENTERPRISE to indicate whether the cluster is using the Enterprise (true) or open source (false) edition of SonarQube.
3. If you absolutely need to have a separate SonarQube analysis for each job, you need to add the annotation screwdriver.cd/coverageScope: job to your job configuration in your “screwdriver.yaml” file:

jobs:
  main:
    annotations:
      screwdriver.cd/coverageScope: job
    requires: [~pr, ~commit]
    image: node:12
    steps:
      - install: npm install
      - test: npm test

Cluster Admin configuration
In order to enable SonarQube Enterprise edition with Screwdriver, do the following steps:
1. Get a SonarQube Enterprise license.
2. Update the SonarQube Enterprise license in the SonarQube UI (https://SONAR_URL/admin/extension/license/app).
3. Then, set COVERAGE_SONAR_ENTERPRISE: true in your config file.

Pull Request Decoration
To set up Pull Request Decoration in your Github PRs, follow these steps in the link below:
https://docs.sonarqube.org/latest/analysis/pr-decoration/
Note: Users will need to manually install the newly created Github app in their organizations and repos, and these will need to be manually configured in SonarQube projects. You should see something like this:

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v0.5.972
- Queue-Service - v1.0.22
- UI - v1.0.539
- Launcher - v6.0.87
- Build Cluster Worker - v1.18.8

Contributors
Thanks to the following contributors for making this feature possible:
- Jithin
- Lakshminarasimhan
- Tiffany

Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
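As mentioned in item 2 of the user configuration above, here is a minimal sketch of a custom scanner step that picks up the Screwdriver-provided values; the sonar-scanner invocation and the host URL variable below are illustrative assumptions, not part of the Screwdriver integration itself:

# Pass the Screwdriver-provided project key/name instead of hard-coding them
sonar-scanner \
  -Dsonar.projectKey="$SD_SONAR_PROJECT_KEY" \
  -Dsonar.projectName="$SD_SONAR_PROJECT_NAME" \
  -Dsonar.host.url="$MY_SONAR_HOST_URL"  # placeholder for your SonarQube server URL

# $SD_SONAR_ENTERPRISE is "true" when the cluster runs SonarQube Enterprise, "false" otherwise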

SonarQube Enterprise Edition Support

August 17, 2020
Latest Product Updates August 13, 2020
August 13, 2020

Latest Product Updates

Jithin Emmanuel, Engineering Manager, Verizon Media
The Screwdriver team is pleased to announce our newest release, which brings new features and bug fixes across various components.

New Features
- SonarQube Enterprise support #1314
- Automatic deploy key setup for GitHub SCM pipelines #1079
- Support for filtering on tag and release names #1994
- The notification Slack channel can be set dynamically in a build. Usage instructions here.
- Build parameters now support drop-down selections #2092
- Confirmation dialog when deleting pipeline secrets #2117
- Added the "PR_BASE_BRANCH_NAME" environment variable for determining a Pull Request's base branch #2153
- Upgraded Ember.js to the latest LTS for the Screwdriver UI

Bug Fixes
- Child pipelines now work without having to override config pipeline secrets #2125
- Periodic build configs are now cleaned up on removal #2138
- The template list in the "Create Pipeline" view now displays namespaces #2140
- Remote triggers now work for child pipelines #2148

Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v0.5.964
- Queue-Service - v1.0.22
- UI - v1.0.535
- Launcher - v6.0.87
- Build Cluster Worker - v1.18.8

Contributors
Thanks to the following contributors for making this release possible: Alan, Jithin, Joerg, Ibuki, Kevin, Keisuke, Kenta, Lakshminarasimhan, Pritam, Teppei, Tiffany, Yoshiyuki, and Yuichi.

Questions and Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Latest Product Updates

August 13, 2020
Behold! Big Data at Fast Speed! August 6, 2020
August 6, 2020

Behold! Big Data at Fast Speed!

Oak0.2 Release: Significant Improvements to Throughput, Memory Utilization, and User Interface
By Anastasia Braginsky, Sr. Research Scientist, Verizon Media Israel

Creating open source software is an ongoing and exciting process. Recently, the Oak open-source library delivered a new release, Oak0.2, which summarizes a year of collaboration. Oak0.2 makes significant improvements in throughput, memory utilization, and user interface.

OakMap is a highly scalable key-value map that keeps all keys and values off-heap. The Oak project is designed for Big Data real-time analytics. Moving data off-heap enables working with huge memory sizes (above 100GB), where the JVM struggles to manage such heap sizes. OakMap implements the industry-standard Java 8 ConcurrentNavigableMap API and more. It provides strong (atomic) semantics for read, write, and read-modify-write, as well as (non-atomic) range query (scan) operations, both forward and backward. OakMap is optimized for big keys and values, in particular for incremental maintenance of objects (update in-place). It is faster and scales better with additional CPU cores than ConcurrentSkipListMap, the popular Java ConcurrentNavigableMap implementation.

Oak data is written to off-heap buffers and thus needs to be serialized (converting an object in memory into a stream of bytes). For retrieval, data might be deserialized (an object created from the stream of bytes). In addition, to save the cycles spent on deserialization, we allow reading/updating the data directly via OakBuffers. Oak provides this functionality under the ZeroCopy API. If you aren't already familiar with Oak, this is an excellent starting point to use it! Check it out and let us know if you have any questions.

Oak keeps getting better: Introducing Oak0.2
We have made a ton of great improvements in Oak0.2: adding new stream scanning for improved performance, releasing a ground-up rewrite of our ZeroCopy API's buffers to increase safety and performance, and decreasing the on-heap memory requirement to less than 3% of the raw data! As an exciting bonus, this release also includes a new version of our off-heap memory management, eliminating memory fragmentation. Below we dive deeper into the sub-projects that are part of the release.

Stream Data Faster
When scanned data is held in an on-heap data structure, each next-step is very easy: get to the next object and return it. To retrieve data held off-heap, even when using the ZeroCopy API, a new OakBuffer object has to be created and returned on each next step. Scanning Big Data that way creates millions of ephemeral objects, possibly unnecessarily, since the application only accesses each object for a short, scoped time during execution.

To avoid this issue, the user can use our new Stream Scan API, where the same OakBuffer object is reused and redirected to different keys or values. This way, only one element can be observed at a time. A stream view of the data is frequently used for flushing in-memory data to disk, copying, analytics search, etc.

Oak's Stream Scan API outperforms CSLM by nearly 4x for the ascending case. For the descending case, Oak outperforms CSLM by more than 8x even with the less optimized non-stream API. With the Stream API, Oak's throughput doubles. More details about the performance evaluation can be found here.

Safety or Performance? Both!
OakBuffers are the core ZeroCopy API primitives. Previously, alongside OakBuffers, OakMap exposed the underlying ByteBuffers directly to the user for performance. This could cause data safety issues such as erroneously reading the wrong data, unintentionally corrupting the data, etc. We couldn't choose between safety and performance, so we strove to have both! With Oak0.2, ByteBuffer is never exposed to the user. Users can choose to work either with OakBuffer, which is safe, or with OakUnsafeDirectBuffer, which gives faster access but must be used carefully. With OakUnsafeDirectBuffer, it is the user's responsibility to synchronize and not access deleted data; if the user is aware of those issues, OakUnsafeDirectBuffer is safe as well. Our safe OakBuffer works with the same great, known OakMap performance, which wasn't easy to achieve. However, if the user is interested in even superior speed of operations, any OakBuffer can be cast to OakUnsafeDirectBuffer.

Less (metadata) is more (data)
In the initial version of OakMap we had an object named handler that was a gateway to access any value. The handler was used for synchronization and memory management. It took about 256 bytes per value and imposed a dereference on each value access. The handler is now replaced with an 8-byte header located off-heap, next to the value. No dereferencing is needed, and all information needed for synchronization and memory management is kept there. In addition, to keep metadata even smaller, we eliminated the majority of the ephemeral object allocations that were used for internal calculations. This means less memory is used for metadata, and what was saved goes directly toward keeping more user data in the same memory budget. More than that, the JVM GC has far fewer reasons to steal memory and CPU cycles, even when working with hundreds of GBs.

Fully Reusable Memory for Values
As explained above, 8-byte off-heap headers were introduced ahead of each value. The headers are used for memory reclamation and synchronization, and to hold lock data. As a thread may hold the lock after a value is deleted, the header's memory couldn't be reused. Initially the header's memory was abandoned, causing a memory leak. The space allocated for a value is exactly the value size plus the header size. Leaving the header unreclaimed creates a memory "hole" that a new value of the same size cannot fit into. As values are usually of the same size, this was causing fragmentation: more memory was consumed, leaving unused spaces behind. We added the possibility to reuse deleted headers for new values by introducing a sophisticated memory management and locking mechanism, so new values can take the place of old, deleted values. With Oak0.2, a scenario of 50% puts and 50% deletes runs with a stable amount of memory and performs twice as well as CSLM.

We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. If you have any questions, please feel free to send us a note. It would be great to hear from you!

Acknowledgements: Liran Funaro, Eshcar Hilel, Eran Meir, Yoav Zuriel, Edward Bortnikov, Yonatan Gottesman

Behold! Big Data at Fast Speed!

August 6, 2020
Apache Storm 2.2.0 Improvements - NUMA Support, Auto Refreshing SSL Certificates for All Daemons, V2 Tick Backwards Compatibility, Scheduler Improvements, & OutputCollector Thread Safety July 31, 2020
July 31, 2020

Apache Storm 2.2.0 Improvements - NUMA Support, Auto Refreshing SSL Certificates for All Daemons, V2 Tick Backwards Compatibility, Scheduler Improvements, & OutputCollector Thread Safety

Kishor Patil, PMC Chair Apache Storm & Sr. Principal Software Systems Engineer, Verizon Media
Last year, we shared with you many of the Apache Storm 2.0 improvements contributed by Verizon Media. At Yahoo/Verizon Media, we've been committing to Storm for many years. Today, we're excited to explore a few of the new features, improvements, and bug fixes we've contributed to Storm 2.2.0.

NUMA Support
Server hardware is getting beefier and requires worker JVMs to be NUMA (Non-uniform memory access) aware. Without constraining JVMs to NUMA zones, we noticed dramatic degradation in JVM performance, specifically for Storm, where most JVM objects are short-lived and continuous GC cycles perform a complete heap scan. This feature enables maximizing hardware utilization and consistent performance on asymmetric clusters. For more information please refer to [STORM-3259].

Auto Refreshing SSL Certificates for All Daemons
At Verizon Media, as part of maintaining thousands of Storm nodes, refreshing SSL/TLS certificates without any downtime is a priority, so we implemented auto refreshing SSL certificates for all daemons without outages. This is a very useful feature for operations teams that monitor and update certificates as part of hassle-free continuous maintenance. The security-related critical bug fixes the Verizon Media team noticed and fixed include:
- Kerberos connectivity from worker to Nimbus/Supervisor for RPC heartbeats [STORM-3579]
- Worker token refresh causing authentication failure [STORM-3578]
- Use UserGroupInformation to login to HDFS only once per process [STORM-3494]
- AutoTGT shouldn't invoke TGT renewal thread [STORM-3606]

V2 Tick Backwards Compatibility
This allows deprecated worker-level metrics to utilize messaging and capture V1 metrics. It is a stop-gap that gives topology developers sufficient time to switch from the V1 metrics API to the V2 metrics API. The Verizon Media Storm team also contributed shortened metrics names, so that names conform to more aggregation strategies by dimension [STORM-3627]. We've also started removing deprecated metrics API usage within the storm-core and storm-client modules and adding new metrics at the nimbus/supervisor daemon level to monitor activity.

Scheduler Improvements
ConstraintSolverStrategy now allows a max co-location count at the component level, which allows for better spread [STORM-3585]. Both ResourceAwareScheduler and ConstraintSolverStrategy have been refactored for faster performance: a large topology of 2,500 components requesting complex constraints or resources can now be scheduled in less than 30 seconds. This improvement helps lower downtime during topology relaunch [STORM-3600]. Also, the blacklisting feature that lets nimbus detect supervisor daemon unavailability is useful for failure detection in this release [STORM-3596].

OutputCollector Thread Safety
For messaging infrastructure, data corruption can happen when components are multi-threaded because of non-thread-safe serializers. The patch [STORM-3620] allows Bolt implementations that use OutputCollector in threads other than the executor to emit tuples, with the limitation of batch size 1. This important implementation change avoids data corruption without any performance overhead.

Noteworthy Bug Fixes
- For LoadAwareShuffle grouping, we were seeing workers overloaded and tuples timing out with load-aware shuffle enabled. The patch checks for low watermark limits before switching from host-local to worker-local [STORM-3602].
- For the Storm UI, topology visualization bugs are fixed so the topology DAG can be viewed more easily.
- A bug fix allows administrators access to topology logs from the UI and logviewer.
- storm CLI bug fixes to accurately process command line options.

What's Next
In the next release, Verizon Media plans to contribute container support with the Docker and RunC container managers. This should be a major boost, with three important benefits: customization of system-level dependencies for each topology with container images, better isolation of resources from other processes running on the bare metal, and allowing each topology to choose its worker OS and Java version across the cluster.

Contributors
Aaron Gresch, Ethan Li, Govind Menon, Bipin Prasad, Rui Li

Apache Storm 2.2.0 Improvements - NUMA Support, Auto Refreshing SSL Certificates for All Daemons, V2 Tick Backwards Compatibility, Scheduler Improvements, & OutputCollector Thread Safety

July 31, 2020
Announcing RDFP for Zeek - Enabling Client Telemetry to the Remote Desktop Protocol July 23, 2020
July 23, 2020

Announcing RDFP for Zeek - Enabling Client Telemetry to the Remote Desktop Protocol

Jeff Atkinson, Principal Security Engineer, Verizon Media
We are pleased to announce RDFP for Zeek. This project is based on 0x4D31's work on FATT Remote Desktop Client fingerprinting. The technique analyzes client payloads during the RDP negotiation to build a profile of the client software. RDFP extends RDP protocol parsing and gives security analysts a method of profiling software used on the network. BlueKeep exposed some gaps in visibility, spurring us to contribute to Zeek's RDP protocol analyzer to extract additional details. Please share your questions and suggestions by filing an issue on Github.

Technical Details
RDFP extracts the following key elements and then generates an MD5 hash:
- Client Core Data
- Client Cluster Data
- Client Security Data
- Client Network Data
Here is how the RDFP hash is created (a small sketch of this construction follows at the end of this post):
md5(verMajor;verMinor;clusterFlags;encryptionMethods;extEncMethods;channelDef)

Client Core Data
The first data block handled is Client Core Data, from which the client major and minor versions are extracted. Other information can be found in this datagram, but it is specific to the client configuration rather than the client software.

Client Cluster Data
The Client Cluster Data datagram contains the Cluster Flags. These are added in the order they are seen and provide information about session redirection and other items, e.g. whether a smart card was used.

Client Security Data
The Client Security Data datagram provides the encryptionMethods and extEncryptionMethods. The encryptionMethods field details the key that is used and the message authentication code. The extEncryptionMethods field is a specific flag designated for the French locale.

Client Network Data
The Client Network Data datagram contains the Channel Definition Structure (Channel_Def), which provides configuration information about how the virtual channel with the server should be set up. This datagram provides details on compression, MCS priority, and channel persistence across transactions.

Here is the example rdfp.log generated by the rdfp.zeek script. The log provides all of the details along with the client rdfp_hash. The technique works well, but note that RDP clients can require TLS encryption; reference the JA3 fingerprinting technique for TLS traffic analysis. Please refer to Adel's blog post for additional details and examples of ways to leverage RDP fingerprinting on the network.

Conclusion
Zeek RDFP extends network visibility into client software configurations. Analysts can apply logic and detection techniques to these extended fields, as well as anomaly detection and additional algorithms to profile and alert on suspicious network patterns. Please share your questions and suggestions by filing an issue on Github.

Additional Reading
- John B. Althouse, Jeff Atkinson and Josh Atkins, "JA3 — a method for profiling SSL/TLS clients"
- Ben Reardon and Adel Karimi, "HASSH — a profiling method for SSH clients and servers"
- Microsoft Corporation, "[MS-RDPBCGR]: Remote Desktop Protocol: Basic Connectivity and Graphics Remoting"
- Adel Karimi, "Fingerprint All the Things!"
- Matt Bromiley and Aaron Soto, "What Happens Before Hello?"
- John Althouse, "TLS Fingerprinting with JA3 and JA3S"
- Zeek Package Contest 3rd Place Winner

Acknowledgments
Special thanks to Adel, #JA3, #HASSH, and W for reminding me there's always more on the wire.
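To make the hash construction above concrete, here is a small illustrative sketch in Go. The joining order follows the formula quoted in this post; the exact field encoding (flag formatting, channel ordering inside channelDef) is defined by the rdfp.zeek script itself, and the values below are hypothetical placeholders, not real capture data.

package main

import (
	"crypto/md5"
	"fmt"
	"strings"
)

// rdfpHash mirrors the construction described above:
// md5(verMajor;verMinor;clusterFlags;encryptionMethods;extEncMethods;channelDef)
func rdfpHash(verMajor, verMinor, clusterFlags, encMethods, extEncMethods string, channelDefs []string) string {
	input := strings.Join([]string{
		verMajor,
		verMinor,
		clusterFlags,
		encMethods,
		extEncMethods,
		strings.Join(channelDefs, ","), // channel definitions in the order they were seen
	}, ";")
	return fmt.Sprintf("%x", md5.Sum([]byte(input)))
}

func main() {
	// Hypothetical values for a single RDP client negotiation.
	fmt.Println(rdfpHash("4", "8", "15", "3", "0", []string{"rdpdr", "rdpsnd", "cliprdr"}))
}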

Announcing RDFP for Zeek - Enabling Client Telemetry to the Remote Desktop Protocol

July 23, 2020
Vespa Product Updates, June 2020: Support for Approximate Nearest Neighbor Vector Search, Streaming Search Speedup, Rank Features, & GKE Sample Application July 16, 2020
July 16, 2020

Vespa Product Updates, June 2020: Support for Approximate Nearest Neighbor Vector Search, Streaming Search Speedup, Rank Features, & GKE Sample Application

Kristian Aune, Tech Product Manager, Verizon Media In the previous update, we mentioned Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, and Increased Tensor Performance. This month, we’re excited to share the following updates: Support for Approximate Nearest Neighbor Vector Search  Vespa now supports approximate nearest neighbor search which can be combined with filters and text search. By using a native implementation of the HNSW algorithm, Vespa provides state of the art performance on vector search: Typical single digit millisecond response time, searching hundreds of millions of documents per node, but also uniquely allows vector query operators to be combined efficiently with filters and text search - which is usually a requirement for real-world applications such as text search and recommendation. Vectors can be updated in real-time with a sustained write rate of a few thousand vectors per node per second. Read more in the documentation on nearest neighbor search.  Streaming Search Speedup Streaming Search is a feature unique to Vespa. It is optimized for use cases like personal search and e-mail search - but is also useful in high-write applications querying a fraction of the total data set. With #13508, read throughput from storage increased up to 5x due to better parallelism. Rank Features - The (Native)fieldMatch rank features are optimized to use less CPU query time, improving query latency for Text Matching and Ranking.  - The new globalSequence rank feature is an inexpensive global ordering of documents in a system with stable system state. For a system where node indexes change, this is inaccurate. See globalSequence documentation for alternatives. GKE Sample Application Thank you to Thomas Griseau for contributing a new sample application for Vespa on GKE, which is a great way to start using Vespa on Kubernetes. … About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, June 2020: Support for Approximate Nearest Neighbor Vector Search, Streaming Search Speedup, Rank Features, & GKE Sample Application

July 16, 2020
Aggregated Job list view for Pipeline details July 11, 2020
July 11, 2020

Aggregated Job list view for Pipeline details

Inderbir Singh Hair, student at the University of Waterloo
We have recently added a new feature: an aggregated job list view for pipeline details. It shows the status of each job in a pipeline as a list, and thus provides a view of the overall status of the pipeline. The list view can be seen by clicking the view toggle (highlighted in red) on the pipeline events tab. It consists of 6 columns: Job, History, Duration, Start Time, Coverage, and Actions.
The Job column (highlighted in red) displays the most recent build status for a job along with the job's name.
The History column (highlighted in red) provides a summary of the last 5 build statuses for the job, with the most recent build on the right. Clicking on a status bubble, whether in the history column or the job column, takes you to the related build's status page.
The Duration column (highlighted in red) displays how long the most recent build for the associated job took to run.
The Start Time column (highlighted in red) displays when the most recent build for the associated job was started.
The Coverage column (highlighted in red) gives the SonarQube coverage for the associated job.
The Actions column (highlighted in red) allows 3 actions to be run for each job: starting a new build for the associated job (left), aborting the most recent build if it has not yet completed (center), and restarting the associated job from its latest build (right).
The list view does not update its data in real time; instead, the refresh button (highlighted in red) can be used to refresh it.

Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.924
- UI - v1.0.521
- Store - v3.11.1
- Launcher - v6.0.73

Contributors
Thanks to the following contributors for making this feature possible: InderH, adong, jithine, tkyi

Questions & Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out through our various support channels. You can also visit us on Github and Slack.

Aggregated Job list view for Pipeline details

July 11, 2020
Announcing Spicy Noise - Identify and Monitor WireGuard at Wire Speed July 8, 2020
July 8, 2020

Announcing Spicy Noise - Identify and Monitor WireGuard at Wire Speed

Jeff Atkinson, Principal Security Engineer, Verizon Media
Today we are excited to announce the release of Spicy Noise. This open source project was developed to address the need to identify and monitor WireGuard traffic at line speed with Zeek. The Spicy framework was chosen to build the protocol parser needed for this project. Please share your questions and suggestions by filing an issue on Github.

WireGuard was implemented on the Noise Protocol Framework to provide simple, fast, and secure cryptographic communication. Its popularity started within the Linux community due to its ability to run on Raspberry Pi and high-end servers, and the protocol has since been adopted cross-platform. To explain how Spicy Noise works, let's look at how Zeek and Spicy help monitor traffic. Zeek is a robust and highly scalable network monitoring project. It supports multiple protocol analyzers on a standard install, provides invaluable telemetry for threat hunting and investigations, and has been deployed on 100 gigabit networks. Spicy is a framework provided by the Zeek community to build new protocol analyzers. It is replacing Binpac as a much simpler method to build protocol parsers, and it has built-in integration with Zeek to enable analysis at line speed.

How it works
Zeek's architecture begins by reading packets from the network. The packets are then routed to "Event Engines" which parse the packets and forward events containing details of each packet. These events are presented to the "Policy Script Interpreter" where the details from the event can be acted upon by Zeek scripts. Many scripts ship with Zeek to generate logs and raise notifications, and many of these logs and notifications are forwarded to the SIEM of a SOC for analysis. To parse WireGuard traffic, a new "Event Engine" has been created. This is done with Spicy by defining how a packet is parsed and how events are created. Packet parsing is defined in a .spicy file. Events are defined in a .evt file, which forwards the details extracted by the .spicy parser to the "Policy Script Interpreter". A dynamic protocol detection signature also has to be defined so Zeek knows how to route packets to the new Event Engine. Refer to the diagram below to understand the role of the .spicy and .evt files of the new WireGuard parser or "Event Engine".

Technical Implementation
The first step in building a new "Event Engine" is to define how the packet is to be parsed. Referring to the WireGuard protocol specification, there are four main UDP datagram structures: Handshake Initiation, Handshake Response, Cookie Reply, and Transport Data. The diagram below depicts how the client and server communicate. We will focus on the first, Handshake Initiation, but the same method applies to the other three packet structures. The following diagram from the WireGuard whitepaper illustrates the structure of the Handshake Initiation packet. The sections of the packet are defined with their respective sizes. These details are used in the .spicy file to define how Spicy will handle the packet. Note that the first field is the packet type, and a value of 1 defines it as a Handshake Initiation structured packet. Below is a code snippet of wg-1.spicy from the repository. A type is created to define the fields and their sizes or delimiters. Spicy uses wg-1.spicy as the first part of the "Event Engine" to parse packets.
The next part needed is to define events in the .evt file. An event is created for each packet type to pass values from the "Event Engine" to the "Policy Script Interpreter". The .evt file also includes an "Analyzer Setup" which defines the Analyzer_Name, Transport_Protocol, and additional details if needed. The Analyzer_Name is used by dynamic protocol detection (DPD). Zeek reads packets and compares them against DPD signatures to identify which Analyzer or "Event Engine" to use. The WireGuard DPD signature looks for the first byte of a UDP datagram to be 1, followed by the reserved zeros, as defined in the protocol specification (an illustrative sketch of this check follows at the end of this post). Below is the DPD signature created for matching on the WireGuard Handshake_Initiation packet, which is the first in the session. Now, as Spicy or Zeek parse packets, any packet parsed by the Handshake_Initiation type generates an event. The event includes connection details stored in the $conn variable, which is passed from the stream processor portion of the "Event Engine". The additional fields are extracted from the packet as defined in the corresponding .spicy file type. These events are received by the "Policy Script Interpreter" and can be acted upon to create logs or raise notifications. Zeek scripts define which events to receive and what action is to be taken. The example below shows how the WireGuard::Initiation event can be used to set the service field in Zeek's conn.log. The conn.log file will now have events with a service of WireGuard.

Conclusion
WireGuard provides an encrypted tunnel which can be used to circumvent security controls. Zeek and Spicy provide a solution to enhance network telemetry, allowing better understanding of the traffic. Standard network analysis can still be applied, with the understanding that WireGuard is in use and encrypting the traffic.
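The DPD heuristic described above boils down to a simple byte check: the first byte of the UDP payload is message type 1 (Handshake Initiation) and the following reserved bytes are zero. The real detection is a Zeek DPD signature, not application code; the Go sketch below only illustrates that byte-level check on a hypothetical datagram.

package main

import "fmt"

// looksLikeWGHandshakeInitiation applies the same heuristic as the DPD
// signature described above: message type 1 followed by three reserved
// zero bytes at the start of the UDP payload.
func looksLikeWGHandshakeInitiation(payload []byte) bool {
	if len(payload) < 4 {
		return false
	}
	return payload[0] == 0x01 &&
		payload[1] == 0x00 && payload[2] == 0x00 && payload[3] == 0x00
}

func main() {
	// Hypothetical handshake-initiation-shaped datagram (type byte, reserved zeros, body).
	sample := append([]byte{0x01, 0x00, 0x00, 0x00}, make([]byte, 144)...)
	fmt.Println(looksLikeWGHandshakeInitiation(sample)) // true
}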

Announcing Spicy Noise - Identify and Monitor WireGuard at Wire Speed

July 8, 2020
Bindable: Open Source Themeable Design System Built in Aurelia JS for Faster and Easier Web Development July 7, 2020
July 7, 2020

Bindable: Open Source Themeable Design System Built in Aurelia JS for Faster and Easier Web Development

Joe Ipson, Software Dev Engineer, Verizon Media  Luke Larsen, Sr Software Dev Engineer, Verizon Media As part of the Media Platform Video Team we build and maintain a set of web applications that allow customers to manage their video content. We needed a way to be consistent with how we build these applications. Creating consistent layouts and interfaces can be a challenge. There are many areas that can cause bloat or duplication of code. Some examples of this are, coding multiple ways to build the same layout in the app, slight variations of the same red color scattered all over, multiple functions being used to capitalize data returned from the database. To avoid cases like this we built Bindable. Bindable is an open source design system that makes it possible to achieve consistency in colors, fonts, spacing, sizing, user actions, user permissions, and content conversion. We’ve found it helps us be consistent in how we build layouts, components, and share code across applications. By making Bindable open source we hope it will do the same for others. Theming One problem with using a design system or library is that you are often forced to use the visual style that comes with it. With Bindable you can customize how it looks to fit your visual style. This is accomplished through CSS custom properties. You can create your own custom theme by setting these variables and you will end up with your own visual style. Modular Scale Harmony in an application can be achieved by setting all the sizing and spacing to a value on a scale. Bindable has a modular scale built in. You can set the scale to whatever you wish and it will adjust. This means your application will have visual harmony. When you need, you can break out of the modular scale for custom sizing and spacing. Aurelia Aurelia is a simple, powerful, and unobtrusive javascript framework. Using Aurelia allows us to take advantage of its high performance and extensibility when creating components. Many parts of Bindable have added features thanks to Aurelia. Tokens Tokens are small building blocks that all other parts of Bindable use. They are CSS custom properties and set things like colors, fonts, and transitions. Layouts The issue of creating the same layout using multiple methods is solved by Layouts in Bindable. Some of the Layouts in Bindable make it easy to set a grid, sidebar, or cluster of items in a row. Layouts also handle all the spacing between components. This keeps all your spacing orderly and consistent.  Components Sharing these components was one of the principal reasons the library exists. There are over 40 components available, and they are highly customizable depending on your needs. Access Modifiers Bindable allows developers to easily change the state of a component on a variety of conditions. Components can be hidden or disabled if a user lacks permission for a particular section of a page. Or maybe you just need to add a loading indicator to a button. These attributes make it easy to do either (or both!). Value Converters We’ve included a set of value converters that will take care of some of the most basic conversions for you. Things like sanitizing HTML, converting CSV data into an array, escaping a regex string, and even more simple things like capitalizing a string or formatting an ISO Date string. Use, Contribute, & Reach Out Explore the Bindable website for helpful details about getting started and to see detailed info about a given component. We are excited to share Bindable with the open source community. 
We look forward to seeing what others build with Bindable, especially Aurelia developers. We welcome pull requests and feedback! Watch the project on GitHub for updates. Thanks! Acknowledgements Cam Debuck, Ajit Gauli, Harley Jessop, Richard Austin, Brandon Drake, Dustin Davis

Bindable: Open Source Themeable Design System Built in Aurelia JS for Faster and Easier Web Development

July 7, 2020
Change Announcement - JSON Web Key (JWK) for Public Elliptic-curve (EC) Key July 3, 2020
July 3, 2020

Change Announcement - JSON Web Key (JWK) for Public Elliptic-curve (EC) Key

Ashish Maheshwari, Software Engineer, Verizon Media
In this post, we will outline a change in the way we expose the JSON Web Key (JWK) for our public Elliptic-curve (EC) key at this endpoint: https://api.login.yahoo.com/openid/v1/certs, as well as immediate steps users should take. Impacted users are any clients who parse our JWK to extract the EC public key to perform actions such as verifying a signed token. The X and Y coordinates of our EC public key were padded with a sign bit, which caused them to overflow from 32 to 33 bytes. While most of the commonly used libraries that parse a JWK into a public key can handle the extra length, others might expect a length strictly equal to 32 bytes. For those, this can be a breaking change. Here are the steps affected users should take:
- Any code/flow which needs to extract our EC public key from the JWK needs to be tested for this change. Below are our pre- and post-change production JWKs for the EC public key. Please verify that your code can successfully parse the new JWK. Notice the change in the base64url value of the Y coordinate in the new JWK.
We are planning to make this change live on July 20th, 2020. If you have any questions/comments, please tweet @YDN or email us.
Current production EC JWK:
{"keys":[{"kty":"EC","alg":"ES256","use":"sig","crv":"P-256","kid":"3466d51f7dd0c780565688c183921816c45889ad","x":"cWZxqH95zGdr8P4XvPd_jgoP5XROlipzYxfC_vWC61I","y":"AK8V_Tgg_ayGoXiseiwLOClkekc9fi49aYUQpnY1Ay_y"}]}
EC JWK after the change is live:
{"keys":[{"kty":"EC","alg":"ES256","use":"sig","crv":"P-256","kid":"3466d51f7dd0c780565688c183921816c45889ad","x":"cWZxqH95zGdr8P4XvPd_jgoP5XROlipzYxfC_vWC61I","y":"rxX9OCD9rIaheKx6LAs4KWR6Rz1-Lj1phRCmdjUDL_I"}]}
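For readers who want to verify their parser against the change, here is a minimal sketch in Go that base64url-decodes the "y" coordinate of the JWK and prints its length; strict parsers expect exactly 32 bytes for a P-256 coordinate. The JSON below is the post-change key from this post, trimmed to the fields the sketch uses; in practice you would fetch the full document from the certs endpoint above.

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
)

// jwks models only the JWK fields needed for this length check.
type jwks struct {
	Keys []struct {
		Kty string `json:"kty"`
		Crv string `json:"crv"`
		X   string `json:"x"`
		Y   string `json:"y"`
	} `json:"keys"`
}

func main() {
	// Post-change key from this post, trimmed to the relevant fields.
	body := `{"keys":[{"kty":"EC","crv":"P-256","x":"cWZxqH95zGdr8P4XvPd_jgoP5XROlipzYxfC_vWC61I","y":"rxX9OCD9rIaheKx6LAs4KWR6Rz1-Lj1phRCmdjUDL_I"}]}`

	var doc jwks
	if err := json.Unmarshal([]byte(body), &doc); err != nil {
		log.Fatal(err)
	}
	for _, k := range doc.Keys {
		// base64url without padding, as used in JWKs.
		y, err := base64.RawURLEncoding.DecodeString(k.Y)
		if err != nil {
			log.Fatal(err)
		}
		// 32 bytes is what strict parsers expect for a P-256 coordinate;
		// the pre-change key decoded to 33 bytes because of the sign-bit padding.
		fmt.Printf("crv=%s len(y)=%d bytes\n", k.Crv, len(y))
	}
}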

Change Announcement - JSON Web Key (JWK) for Public Elliptic-curve (EC) Key

July 3, 2020
Introducing vSSH - Go Library to Execute Commands Over SSH at Scale June 30, 2020
June 30, 2020

Introducing vSSH - Go Library to Execute Commands Over SSH at Scale

Mehrdad Arshad Rad, Sr. Principal Software Engineer, Verizon Media
vSSH is a high performance Go library designed to execute shell commands remotely on tens of thousands of network devices or servers over the SSH protocol. The vSSH high-level API provides additional functionality for developing network or server automation. It supports persistent SSH connections to execute shell commands over a warm connection and return data quickly. If you manage multiple Linux machines or devices, you know how difficult it is to run commands on many machines every day, and you appreciate the significant value of automation. There are other open source SSH libraries available in a variety of languages, but vSSH has great features like persistent SSH connections, the ability to limit sessions, the ability to limit the amount of data transferred, and it handles many SSH connections concurrently while using resources efficiently. Go developers can quickly create network device or server automation tools by using this library and focusing on the business logic instead of handling SSH connections. vSSH can run asynchronously inside your application, and you can then call its APIs/methods from your application (safe concurrency).

To start, load your clients' information and add them to vSSH using a simple method. You can add labels and other optional attributes to each client. By calling the run method, vSSH sends the given command to all available clients, or, based on your query, runs the command on specific clients; the results can be received as a stream (real-time) or as a final result.

One of the main features of vSSH is a persistent connection to all devices and the ability to manage them. It can stay connected to all the configured devices/servers all the time. The connections are simple authenticated connections without a session at the first stage. When vSSH needs to run a command, it creates a session and closes the session when the command completes. If you don't need the persistence feature you can disable it, which results in the connection closing at the end. The main advantage of persistence is that it works as a warm connection: once a run is requested, it only needs to create a session. The main use cases are when you need to run commands on the clients continuously or when response time is important. In both cases, vSSH multiplexes sessions over one connection.

vSSH provides a DSL query feature based on the provided labels that you can use to select/filter clients. It supports operators like == and !=, or you can create your own logic. I wrote this feature with the Go abstract syntax tree (AST). This feature is very useful as you can add many clients to the library at the same time and run different commands based on the labels.

Here are three features that you can use to control the load on a client and to force-terminate a running command:
- Limit the returned data that comes from stdout or stderr, in bytes
- Terminate the command after a defined timeout
- Limit the concurrent sessions on the client

Use & Contribute
To learn more about vSSH, explore github.com/yahoo/vssh and try the vSSH examples at https://pkg.go.dev/github.com/yahoo/vssh.
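To give a feel for the flow described above (start vSSH, add clients, run a command, stream the results), here is a short Go sketch modeled on the project's basic usage examples. Treat the exact function names and signatures as an approximation and defer to the repository's examples and godoc for the authoritative API; the addresses and credentials are placeholders.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/yahoo/vssh"
)

func main() {
	// Start the vSSH engine, then register clients (persistent connections by default).
	vs := vssh.New().Start()
	config := vssh.GetConfigUserPass("myuser", "mypassword") // placeholder credentials
	for _, addr := range []string{"192.0.2.10:22", "192.0.2.11:22"} { // placeholder hosts
		vs.AddClient(addr, config, vssh.SetMaxSessions(4)) // limit concurrent sessions per client
	}
	vs.Wait() // wait for the initial connections to be established

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Run a command on all clients with a timeout; results arrive on a channel.
	timeout, _ := time.ParseDuration("6s")
	respChan := vs.Run(ctx, "uname -a", timeout)

	for resp := range respChan {
		if err := resp.Err(); err != nil {
			log.Println(err)
			continue
		}
		outTxt, errTxt, err := resp.GetText(vs)
		if err != nil {
			log.Println(err)
			continue
		}
		fmt.Println(outTxt, errTxt, resp.ExitStatus())
	}
}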

Introducing vSSH - Go Library to Execute Commands Over SSH at Scale

June 30, 2020
Data Disposal - Open Source Java-based Big Data Retention Tool June 15, 2020
June 15, 2020

Data Disposal - Open Source Java-based Big Data Retention Tool

By Sam Groth, Senior Software Engineer, Verizon Media Do you have data in Apache Hadoop using Apache HDFS that is made available with Apache Hive? Do you spend too much time manually cleaning old data or maintaining multiple scripts? In this post, we will share why we created and open sourced the Data Disposal tool, as well as, how you can use it. Data retention is the process of keeping useful data and deleting data that may no longer be proper to store. Why delete data? It could be too old, consume too much space, or be subject to legal retention requirements to purge data within a certain time period of acquisition. Retention tools generally handle deleting data entities (such as files, partitions, etc.) based on: duration, granularity, or date format. 1. Duration: The length of time before the current date. For example, 1 week, 1 month, etc. 2. Granularity: The frequency that the entity is generated. Some entities like a dataset may generate new content every hour and store this in a directory partitioned by date. 3. Date Format: Data is generally partitioned by a date so the format of the date needs to be used in order to find all relevant entities. Introducing Data Disposal We found many of the existing tools we looked at lacked critical features we needed, such as configurable date format for parsing from the directory path or partition of the data and extensible code base for meeting the current, as well as, future requirements. Each tool was also built for retention with a specific system like Apache Hive or Apache HDFS instead of providing a generic tool. This inspired us to create Data Disposal. The Data Disposal tool currently supports the two main use cases discussed below but the interface is extensible to any other data stores in your use case. 1. File retention on the Apache HDFS. 2. Partition retention on Apache Hive tables. Disposal Process The basic process for disposal is 3 steps: - Read the provided yaml config files. - Run Apache Hive Disposal for all Hive config entries. - Run Apache HDFS Disposal for all HDFS config entries. The order of the disposals is significant in that if Apache HDFS disposal ran first, it would be possible for queries to Apache Hive to have missing data partitions. Key Features The interface and functionality is coded in Java using Apache HDFS Java API and Apache Hive HCatClient API. 1. Yaml config provides a clean interface to create and maintain your retention process. 2. Flexible date formatting using Java’s SimpleDateFormat when the date is stored in an Apache HDFS file path or in an Apache Hive partition key. 3. Flexible granularity using Java’s ChronoUnit. 4. Ability to schedule with your preferred scheduler. The current use cases all use Screwdriver, which is an open source build platform designed for continuous delivery, but using other schedulers like cron, Apache Oozie, Apache Airflow, or a different scheduler would be fine. Future Enhancements We look forward to making the following enhancements: 1. Retention for other data stores based on your requirements. 2. Support for file retention when configuring Apache Hive retention on external tables. 3. Any other requirements you may have. Contributions are welcome! The Data team located in Champaign, Illinois, is always excited to accept external contributions. Please file an issue to discuss your requirements.
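The three retention inputs above (duration, granularity, date format) boil down to a simple decision per entity: parse the date embedded in the file path or partition name and compare it against a cutoff. The Go sketch below is only a generic illustration of that rule; it is not the Data Disposal tool's own code, which is written in Java and driven by YAML configuration.

package main

import (
	"fmt"
	"time"
)

// expired reports whether a date-named partition is older than the retention window.
// layout is the date format baked into the partition or path name.
func expired(partition, layout string, retention time.Duration, now time.Time) (bool, error) {
	t, err := time.Parse(layout, partition)
	if err != nil {
		return false, err
	}
	return now.Sub(t) > retention, nil
}

func main() {
	now := time.Date(2020, 6, 15, 0, 0, 0, 0, time.UTC)
	retention := 30 * 24 * time.Hour // duration: keep 30 days of data
	// Daily granularity, date format "2006-01-02" (Go's reference layout for yyyy-MM-dd).
	for _, p := range []string{"2020-06-10", "2020-04-01"} {
		old, _ := expired(p, "2006-01-02", retention, now)
		fmt.Printf("partition %s expired: %v\n", p, old)
	}
}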

Data Disposal - Open Source Java-based Big Data Retention Tool

June 15, 2020
Local build June 3, 2020
June 3, 2020

Local build

Screwdriver.cd offers powerful features such as templates, commands, secrets, and metadata, which can be used to simplify build settings or as build parameters. However, it's difficult to reproduce equivalent features for local builds. Although you can use these features by uploading your changes to an SCM such as GitHub, you may feel it is a pain to upload your changes over and over in order to get a successful build. With sd-local, you can easily make sure the build is not broken before uploading changes to the SCM, and debug the build locally if it fails. Note: sd-local works together with Screwdriver.cd; it does not work by itself. If you don't have a Screwdriver.cd cluster, you need to set it up first. See the documentation at https://docs.screwdriver.cd/cluster-management/.

How to Install
sd-local uses Docker internally, so make sure you have Docker Engine installed locally (https://www.docker.com/). The next step is to install sd-local. Download the latest version of sd-local from the GitHub release page (https://github.com/screwdriver-cd/sd-local/releases) and grant execute permission to it:
$ mv sd-local_*_amd64 /usr/local/bin/sd-local
$ chmod +x /usr/local/bin/sd-local

Build configuration
Configure sd-local to use the templates and commands registered in your Screwdriver.cd cluster. sd-local communicates with the following Screwdriver components:
- API: validating screwdriver.yaml, getting templates
- Store: getting commands
$ sd-local config set api-url https:// # e.g. https://api.screwdriver.cd
$ sd-local config set store-url https:// # e.g. https://store.screwdriver.cd
Set the API token for the above components to authenticate. Please refer to the guide on how to get an API token.
$ sd-local config set token

Execute build
Create the following screwdriver.yaml in the current directory:
jobs:
  main:
    image: node:12
    steps:
      - hello: echo -n "Hello, world!"
Run the build with the following command, specifying the job name. Note: builds can only be run at the job level.
$ sd-local build main
INFO   [0000] Prepare to start build...
INFO   [0017] Pulling docker image from node:12...
sd-setup-launcher: Screwdriver Launcher information
sd-setup-launcher: Version:        v6.0.70
sd-setup-launcher: Pipeline:       #0
sd-setup-launcher: Job:            main
sd-setup-launcher: Build:          #0
sd-setup-launcher: Workspace Dir:  /sd/workspace
sd-setup-launcher: Checkout Dir:   /sd/workspace/src/screwdriver.cd/sd-local/local-build
sd-setup-launcher: Source Dir:     /sd/workspace/src/screwdriver.cd/sd-local/local-build
sd-setup-launcher: Artifacts Dir:  /sd/workspace/artifacts
sd-setup-launcher: set -e && export PATH=$PATH:/opt/sd && finish() { EXITCODE=$?; tmpfile=/tmp/env_tmp; exportfile=/tmp/env_export; export -p | grep -vi "PS1=" > $tmpfile && mv $tmpfile $exportfile; echo $SD_STEP_ID $EXITCODE; } && trap finish EXIT;
sd-setup-launcher: echo ;
hello: $ echo -n Hello, world!
hello: Hello, world!
See the User Guide for more details about the commands.

Design Document
For more details, check out our design spec.

Compatibility List
In order to use this feature, you will need these minimum versions:
- sd-local - v1.0.1
- launcher - v6.0.70

Contributors
Thanks to the following contributors for making this feature possible: sugarnaoming, s-yoshika, kkisic, MysticDoll, yuichi10, kkokufud, wahapo, tk3fftk, sakka2, cappyzawa, kumada626

Questions and Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Local build

June 3, 2020
Vespa Product Updates, May 2020: Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, & Increased Tensor Performance May 29, 2020
May 29, 2020

Vespa Product Updates, May 2020: Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, & Increased Tensor Performance

Kristian Aune, Tech Product Manager, Verizon Media In the April updates, we mentioned Improved Performance for Large Fan-out Applications, Improved Node Auto-fail Handling, CloudWatch Metric Import and CentOS 7 Dev Environment. This month, we’re excited to share the following updates: Improved Slow Node Tolerance To improve query scaling, applications can group content nodes to balance static and dynamic query cost. The largest Vespa applications use a few hundred nodes. This is a great feature to optimize cost vs performance in high-query applications. Since Vespa-7.225.71, the adaptive dispatch policy is made default. This balances load to the node groups based on latency rather than just round robin - a slower node will get less load and overall latency is lower. Multi-Threaded Rank Profile Compilation Queries are using a rank profile to score documents. Rank profiles can be huge, like machine learned models. The models are compiled and validated when deployed to Vespa. Since Vespa-7.225.71, the compilation is multi-threaded, cutting compile time to 10% for large models. This makes content node startup quicker, which is important for rolling upgrades. Reduced Peak Memory at Startup Attributes is a unique Vespa feature used for high feed performance for low-latency applications. It enables writing directly to memory for immediate serving. At restart, these structures are reloaded. Since Vespa-7.225.71, the largest attribute is loaded first, to minimize temporary memory usage. As memory is sized for peak usage, this cuts content node size requirements for applications with large variations in attribute size. Applications should keep memory at less than 80% of AWS EC2 instance size. Feed Performance Improvements At times, batches of documents are deleted. This subsequently triggers compaction. Since Vespa-7.227.2, compaction is blocked at high removal rates, reducing overall load. Compaction resumes once the remove rate is low again.  Increased Tensor Performance  Tensor is a field type used in advanced ranking expressions, with heavy CPU usage. Simple tensor joins are now optimized and more optimizations will follow in June. … About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, May 2020: Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, & Increased Tensor Performance

May 29, 2020
Kata Containers in Screwdriver May 28, 2020
May 28, 2020

Kata Containers in Screwdriver

Screwdriver is a scalable CI/CD solution which uses Kubernetes to manage user builds. Screwdriver build workers interface with Kubernetes using either "executor-k8s" or "executor-k8s-vm", depending on the required build isolation. executor-k8s runs builds directly as Kubernetes pods, while executor-k8s-vm uses HyperContainers along with Kubernetes for stricter build isolation with containerized Virtual Machines (VMs). This setup was ideal for running builds in an isolated, ephemeral, and lightweight environment. However, HyperContainer is now deprecated, has no support, is based on an older Docker runtime, and also required a non-native Kubernetes setup for build execution. Therefore, it was time to find a new solution.

Why Kata Containers?
Kata Containers is an open source project and community that builds a standard implementation of lightweight virtual machines (VMs) that perform like containers but provide the workload isolation and security advantages of VMs. It combines the benefits of using a hypervisor, such as enhanced security, with the container orchestration capabilities provided by Kubernetes. It comes from the same team behind HyperD, which successfully merged the best parts of Intel Clear Containers with Hyper.sh RunV. As a Kubernetes runtime, Kata enables us to deprecate executor-k8s-vm and use executor-k8s exclusively for all Kubernetes based builds.

Screwdriver journey to Kata
As we faced a growing number of instabilities with the current HyperD - like network and devicemapper issues and IP cleanup workarounds - we started our initial evaluation of Kata in early 2019 (https://github.com/screwdriver-cd/screwdriver/issues/818#issuecomment-482239236) and identified two major blockers to moving ahead with Kata:
1. Security concern for privileged mode (required to run the docker daemon in Kata)
2. Disk performance
We started reevaluating Kata in early 2020, based on a fix to "add flag to overload default privileged host device behaviour" provided by Containerd/cri (https://github.com/containerd/cri/pull/1225). We still faced issues with disk performance, but switching from overlayfs to devicemapper yielded significant improvement. With our two major blockers resolved and initial tests with Kata looking promising, we moved ahead with Kata.

Screwdriver Build Architecture
Replacing Hyper with Kata led to a simpler build architecture. We were able to remove the custom build setup scripts to launch the Hyper VM and rely on the native Kubernetes setup.

Setup
To use Kata containers for running user builds in a Screwdriver Kubernetes build cluster, a cluster admin needs to configure Kubernetes to use the Containerd container runtime with the CRI plugin.
Components: Screwdriver build Kubernetes cluster (minimum version: 1.14+) nodes must have the following components set up for using Kata containers for user builds.
Containerd: a container runtime that helps with management of the complete lifecycle of the container. Reference: https://containerd.io/docs/getting-started/
CRI-Containerd plugin: a containerd plugin which implements the Kubernetes container runtime interface. The CRI plugin interacts with containerd to manage the containers. Reference: https://github.com/containerd/cri (architecture diagram image credit: containerd/cri, licensed under CC-BY-4.0). Installation: https://github.com/containerd/cri/blob/master/docs/installation.md, https://github.com/containerd/containerd/blob/master/docs/ops.md
Crictl: used to debug, inspect, and manage pods, containers, and container images. Reference: https://github.com/containerd/cri/blob/master/docs/crictl.md
Kata: builds lightweight virtual machines that seamlessly plug in to the containers ecosystem (architecture diagram image credit: the kata-containers project, licensed under Apache License Version 2.0). Installation:
- https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#run-kata-containers-with-kubernetes
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md
- https://github.com/kata-containers/documentation/blob/master/how-to/how-to-use-k8s-with-cri-containerd-and-kata.md
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md#kubernetes-runtimeclass
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md#configuration

Routing builds to Kata in the Screwdriver build cluster
Screwdriver uses a Runtime Class to route builds to Kata nodes in Screwdriver build clusters. The Screwdriver executor-k8s plugin config handles this based on:
1. Pod configuration:
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
  namespace: sd-build-namespace
  labels:
    sdbuild: "sd-kata-build"
    app: screwdriver
    tier: builds
spec:
  runtimeClassName: kata
  containers:
  - name: "sd-build-container"
    image: <>
    imagePullPolicy: IfNotPresent
2. Update the plugin to use k8s in your buildcluster-queue-worker configuration: https://github.com/screwdriver-cd/buildcluster-queue-worker/blob/master/config/custom-environment-variables.yaml#L4-L83

Performance
The tables below compare build setup and overall execution time for Kata and Hyper when the image is pre-cached or not cached.

Known problems
While the new Kata implementation offers a lot of advantages, there are some known problems we are aware of, with fixes or workarounds:
- Images based on RHEL6 containers (pre-2.15 glibc) don't start and immediately exit: enable kernel_params = "vsyscall=emulate"; refer to kata issue https://github.com/kata-containers/runtime/issues/1916 if you have trouble running pre-2.15 glibc.
- Yum install hangs forever: enable kernel_params = "init=/usr/bin/kata-agent"; refer to kata issue https://github.com/kata-containers/runtime/issues/1916 to get better boot time and a small footprint.
- 32-bit executables cannot be loaded (refer to kata issue https://github.com/kata-containers/runtime/issues/886): to mitigate this, we maintain a container exclusion list and route those builds to the current hyperd setup; we plan to EOL these containers by Q4 of this year.
- Containerd IO snapshotter - overlayfs vs devicemapper for the storage driver: devicemapper gives better performance. Overlayfs took 19.325605 seconds to write 1GB, while devicemapper took only 5.860671 seconds.

Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.902
- UI - v1.0.515
- Build Cluster Queue Worker - v1.18.0
- Launcher - v6.0.71

Contributors
Thanks to the following contributors for making this feature possible: Lakshminarasimhan Parthasarathy, Suresh Visvanathan, Nandhakumar Venkatachalam, Pritam Paul, Chester Yuan, Min Zhang

Questions & Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Kata Containers in Screwdriver

May 28, 2020
Vespa Product Updates, April 2020: Improved Performance for Large Fan-out Applications, Improved Node Auto-fail Handling, CloudWatch Metric Import, & CentOS 7 Dev Environment May 5, 2020
May 5, 2020

Vespa Product Updates, April 2020: Improved Performance for Large Fan-out Applications, Improved Node Auto-fail Handling, CloudWatch Metric Import, & CentOS 7 Dev Environment

Kristian Aune, Tech Product Manager, Verizon Media
In the previous update, we mentioned Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder and Hadoop Integration. This month, we're excited to share the following updates:

Improved Performance for Large Fan-out Applications
Vespa container nodes execute queries by fanning out to a set of content nodes evaluating parts of the data in parallel. When the fan-out or the partial results from each node are large, this can cause bandwidth to run out. Vespa now provides an optimization which lets you control the tradeoff between the size of the partial results and the probability of getting a 100% global result. As it works out, tolerating a small probability of less than 100% correctness gives a large reduction in network usage. Read more.

Improved Node Auto-fail Handling
Whenever content nodes fail, data is auto-migrated to other nodes. This consumes resources on both sender and receiver nodes, competing with the resources used for processing client operations. Starting with Vespa-7.197, we have improved operation and thread scheduling, which reduces the impact on client document API operation latencies when a node is under heavy migration load.

CloudWatch Metric Import
Vespa metrics can now be pushed or pulled into AWS CloudWatch. Read more in monitoring.

CentOS 7 Dev Environment
A development environment for Vespa on CentOS 7 is now available. This ensures that the turnaround time between code changes and running unit tests and system tests is short, and makes it easier to contribute to Vespa.

About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.

Yahoo Knowledge Graph Announces COVID-19 Dataset, API, and Dashboard with Source Attribution April 27, 2020

Yahoo Knowledge Graph Announces COVID-19 Dataset, API, and Dashboard with Source Attribution

Amit Nagpal, Sr. Director, Software Development Engineering, Verizon Media Among many interesting teams at Verizon Media is the Yahoo Knowledge (YK) team. We build the Yahoo Knowledge Graph, one of the few web-scale knowledge graphs in the world. Our graph contains billions of facts and entities that enrich user experiences and power AI across Verizon Media properties. At the onset of the COVID-19 pandemic we felt the need and responsibility to put our web-scale extraction technologies to work, to see how we could help. We have started to extract COVID-19 statistics from hundreds of sources around the globe into what we call the YK-COVID-19 dataset. The YK-COVID-19 dataset provides data and knowledge that help inform our readers on Yahoo News, Yahoo Finance, Yahoo Weather, and Yahoo Search. We created this dataset by carefully combining and normalizing raw data provided entirely by government and public health authorities. We provide website-level provenance for every single statistic in our dataset, so our community has the confidence it needs to use it scientifically and report with transparency. After weeks of hard work, we are ready to make this data public in an easily consumable format at the YK-COVID-19-Data GitHub repo. A dataset alone does not always tell the full story. We reached out to teams across Verizon Media to get their help in building a set of tools that can help us, and you, build dashboards and analyze the data. Engineers from the Verizon Media Data team in Champaign, Illinois volunteered to build an API and dashboard. The API was constructed using a previously published Verizon Media open source platform called Elide. The dashboard was constructed using Ember.js, Leaflet and the Denali design system. We still needed a map tile server and were able to use the Verizon Location Technology team's map tile service powered by HERE. We leveraged Screwdriver.cd, our open source CI/CD platform, to build our code assets, and our open source Athenz.io platform to secure our applications running in our Kubernetes environment. We did this using our open source K8s-athenz-identity control plane project. You can see the result of this incredible team effort today at https://yahoo.github.io/covid-19-dashboard. Build With Us You can build applications that take advantage of the YK-COVID-19 dataset and API yourself. The YK-COVID-19 dataset is made available under a Creative Commons CC-BY-NC 4.0 license. Anyone seeking to use the YK-COVID-19 dataset for other purposes is encouraged to submit a request. Feature Roadmap Updated multiple times a day, the YK-COVID-19 dataset provides reports of country, state, and county-level data based on the availability of data from our many sources. We plan to offer more coverage, granularity, and metadata in the coming weeks. Why a Knowledge Graph? A knowledge graph is information about real world entities, such as people, places, organizations, and events, along with their relations, organized as a graph. We at Yahoo Knowledge have the capability to crawl, extract, combine, and organize information from thousands of sources. We create refined information used by our brands and our readers on Yahoo Finance, Yahoo News, Yahoo Search, and other sites too. We built our web-scale knowledge graph by extracting information from web pages around the globe. We apply information retrieval techniques, natural language processing, and computer vision to extract facts from a variety of formats such as HTML, tables, PDFs, images, and videos.
These facts are then reconciled and integrated into our core knowledge graph that gets richer every day. We applied some of these techniques and processes relevant in the COVID-19 context to help gather information from hundreds of public and government authoritative websites. We then blend and normalize this information into a single combined COVID-19 specific dataset with some human oversight for stability and accuracy. In the process, we preserve provenance information, so our users know where each statistic comes from and have the confidence to use it for scientific and reporting purposes with attribution. We then pull basic metadata such as latitude, longitude, and population for each location from our core knowledge graph. We also include a Wikipedia id for each location, so it is easy for our community to attach additional metadata, as needed, from public knowledge bases such as Wikimedia or Wikipedia. We’re in this together. So we are publishing our data along with a set of tools that we’re contributing to the open source community. We offer these tools, data, and an invitation to work together on getting past the raw numbers. Yahoo, Verizon Media, and Verizon Location Technology are all part of the family at Verizon.

Dash Open 21: Athenz - Open Source Platform for X.509 Certificate-based Service AuthN & AuthZ April 15, 2020

Dash Open 21: Athenz - Open Source Platform for X.509 Certificate-based Service AuthN & AuthZ

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda (Sr. Director, Open Source) interviews Mujib Wahab (Sr. Director, Software Dev Engineering) and Henry Avetisyan (Distinguished Software Dev Engineer). Mujib and Henry discuss why Verizon Media open sourced Athenz, a platform for X.509 Certificate-based Service Authentication and Authorization. They also share how others can use and contribute to Athenz. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Introducing Queue Service April 2, 2020

Introducing Queue Service

Pritam Paul, Software Engineer, Verizon Media
We have recently made changes to the underlying Screwdriver architecture for build processing. Previously, the executor-queue was tightly coupled to the SD API and worked by constantly polling for messages at specific intervals. Due to this design, the queue would block API requests. Furthermore, if the API crashed, scheduled jobs might not be added to the queue, causing cascading failures. Hence, keeping the principles of separation of concerns and abstraction in mind, we designed a more resilient REST-API-based queueing system: the Queue Service. This new service reads, writes, and deletes messages from the queue after processing. It also encompasses the former capability of the queue-worker and acts as a scheduler.
Authentication
The SD API and Queue Service communicate bidirectionally using signed JWT tokens sent via the auth headers of each request.
Build Sequence
Design Document
For more details, check out our design spec.
Using Queue Service
As a cluster admin, to configure using the queue as an executor, you can deploy the queue-service as a REST API using a screwdriver.yaml and update the configuration in the SD API to point to the new service endpoint:
    # config/default.yaml
    ecosystem:
        # Externally routable URL for the User Interface
        ui: https://cd.screwdriver.cd
        # Externally routable URL for the Artifact Store
        store: https://store.screwdriver.cd
        # Badge service (needs to add a status and color)
        badges: https://img.shields.io/badge/build–.svg
        # Internally routable FQDNS of the queue service
        queue: http://sdqueuesvc.screwdriver.svc.cluster.local
    executor:
        plugin: queue
        queue: “
For more configuration options, see the queue-service documentation.
Compatibility List
In order to use the new workflow features, you will need these minimum versions:
- UI - v1.0.502
- API - v0.5.887
- Launcher - v6.0.56
- Queue-Service - v1.0.11
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- klu909
- jithine
- parthasl
- pritamstyz4ever
- tkyi
Questions and Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
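To make the authentication handshake above concrete, here is a minimal sketch of a service signing a short-lived JWT and sending it in the Authorization header of a request to the Queue Service. The endpoint path, claims, and key handling are illustrative assumptions, not Screwdriver's actual wire format.

    # Illustrative sketch only: sign a short-lived JWT and pass it in the
    # Authorization header. Endpoint path, claims, and key handling are
    # assumptions, not Screwdriver's actual implementation.
    import time
    import jwt        # PyJWT
    import requests

    private_key = open("jwt-private-key.pem").read()   # hypothetical key file

    token = jwt.encode(
        {"iss": "sd-api", "exp": int(time.time()) + 60},
        private_key,
        algorithm="RS256",
    )

    resp = requests.post(
        "http://sdqueuesvc.screwdriver.svc.cluster.local/v1/queue/message",  # hypothetical path
        json={"buildId": 12345, "status": "QUEUED"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()

Because each request carries its own signed, expiring token, either side can verify the caller without sharing session state, which fits the isolated, REST-based design described above.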

Dash Open 20: The Benefits of Presenting at Meetups March 29, 2020

Dash Open 20: The Benefits of Presenting at Meetups

By Rosalie Bartlett, Open Source Community, Verizon Media In this episode, Ashley Wolf, Open Source Program Manager, interviews Eran Shapira, Software Development Engineering Manager, Verizon Media. Based in Tel Aviv, Israel, Eran manages the video activation team. Eran shares about his team’s focus, which technology he’s most excited about right now, the value of presenting at meetups, and his advice for being a great team member.  Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify. P.S. Learn more about job opportunities (backend engineer, product manager, research scientist, and many others!) at our Tel Aviv and Haifa offices here.

Search COVID-19 Open Research Dataset (CORD-19) using Vespa - Open Source Big Data Serving Engine March 28, 2020

Search COVID-19 Open Research Dataset (CORD-19) using Vespa - Open Source Big Data Serving Engine

Kristian Aune, Tech Product Manager, Verizon Media After being made aware of the COVID-19 Open Research Dataset Challenge (CORD-19), where AI experts have been asked to create text and data mining tools that can help the medical community, the Vespa team wanted to contribute.  Given our experience with big data at Yahoo (now Verizon Media) and creating Vespa (open source big data serving engine), we thought the best way to help was to index the dataset, which includes over 44,000 scholarly articles, and to make it available for searching via Vespa Cloud. Now live at https://cord19.vespa.ai, you can get started with a few of the sample queries or for more advanced queries, visit CORD-19 API Query. Feel free to tweet us @vespaengine or submit an issue, if you have any questions or suggestions. Please expect daily updates to the documentation and query features. Contributions are appreciated - please refer to our contributing guide and submit PRs. You can also download the application, index the data set, and improve the service. More info here on how to run Vespa.ai on your own computer.
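If you prefer to query the index programmatically instead of through the web UI, a sketch like the one below works against Vespa's standard /search/ API. The host path, parameters, and result fields shown are assumptions about this particular deployment; the sample queries on the site are the authoritative starting point.

    # Sketch of querying the CORD-19 Vespa instance via its HTTP search API.
    # The /search/ path is Vespa's standard query endpoint; the parameters and
    # result fields below are assumptions about this particular application.
    import requests

    resp = requests.get(
        "https://cord19.vespa.ai/search/",
        params={"query": "coronavirus temperature sensitivity", "hits": 5},
        timeout=10,
    )
    resp.raise_for_status()

    for hit in resp.json().get("root", {}).get("children", []):
        print(hit.get("relevance"), hit.get("fields", {}).get("title"))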

Dash Open 19: KDD - Understanding Consumer Journey using Attention-based Recurrent Neural Networks March 23, 2020

Dash Open 19: KDD - Understanding Consumer Journey using Attention-based Recurrent Neural Networks

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Shaunak Mishra, Sr. Research Scientist, Verizon Media. Shaunak discusses two papers he presented at Knowledge Discovery and Data Mining (KDD) - “Understanding Consumer Journey using Attention-based Recurrent Neural Networks” and “Learning from Multi-User Activity Trails for B2B Ad Targeting”.  Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Introducing Accessible Audio Charts - An Open Source Initiative for Android Apps March 16, 2020

Introducing Accessible Audio Charts - An Open Source Initiative for Android Apps

Sukriti Chadha, Senior Product Manager, Verizon Media Finance charts quickly render hundreds of data points making it seamless to analyze a stock’s performance. Charts are great for people who can see well. Those who are visually impaired often use screen readers. For them, the readers announce the data points in a table format. Beyond a few data points, it becomes difficult for users to create a mental image of the chart’s trend. The audio charts project started with the goal of making Yahoo Finance charts accessible to users with visual impairment. With audio charts, data points are converted to tones with haptic feedback and are easily available through mobile devices where users can switch between tones and spoken feedback. The idea for the accessible charts solution was first discussed during a conversation between Sukriti Chadha, from the Yahoo Finance team, and Jean-Baptiste Queru, a mobile architect. After building an initial prototype, they worked with Mike Shebanek, Darren Burton and Gary Moulton from the Accessibility team to run user studies and make improvements based on feedback. The most important lesson learned through research and development was that users want a nuanced, customizable solution that works for them in their unique context, for the given product. Accessible charts were launched on the production versions of the Yahoo Finance Android and iOS apps in 2019 and have since seen positive reception from screen reader users. The open source effort was led by Yatin Kaushal and Joao Birk on engineering, Kisiah Timmons on the Verizon Media accessibility team, and Sukriti Chadha on product. We would love for other mobile app developers to have this solution, adapt to their users’ needs and build products that go from accessible to truly usable. We also envision applications of this approach in voice interfaces and contextual vision limitation scenarios. Open sourcing this version of the solution marks an important first step in this initiative. To integrate the SDK, simply clone or fork the repository. The UI components and audio conversion modules can be used separately and modified for individual use cases. Please refer to detailed instructions on integration in the README. This library is the Android version of the solution, which can be replicated on iOS with similar logic. While this implementation is intended to serve as reference for other apps, we will review requests and comments on the repository. We are so excited to make this available to the larger developer community and can’t wait to see how other applications take the idea forward! Please reach out to finance-android-dev@verizonmedia.com for questions and requests.

Vespa Product Updates, February 2020: Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder, and Hadoop Integration March 3, 2020

Vespa Product Updates, February 2020: Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder, and Hadoop Integration

Kristian Aune, Tech Product Manager, Verizon Media In the January Vespa product update, we mentioned Tensor Operations, New Sizing Guides, Performance Improvements for Matched Elements in Map/Array-of-Struct, and Boolean Query Optimizations. This month, we’re excited to share the following updates: Ranking with LightGBM Models Vespa now supports LightGBM machine learning models in addition to ONNX, Tensorflow and XGBoost. LightGBM is a gradient boosting framework that trains fast, has a small memory footprint, and provides similar or improved accuracy to XGBoost. LightGBM also supports categorical features. Matrix Multiplication Performance Vespa now uses OpenBLAS for matrix multiplication, which improves performance in machine-learned models using matrix multiplication. Benchmarking Guide Teams use Vespa to implement applications with strict latency requirements and minimal cost. In January, we released a new sizing guide. This month, we’re adding a benchmarking guide that you can use to find the perfect spot between cost and performance. Query Builder Thanks to contributions from yehzu, Vespa now has a fluent library for composing queries - explore the client module for details. Hadoop Integration Vespa is integrated with Hadoop and easy to feed from a grid. The grid integration now also supports conditional writes, see #12081.  We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request. About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
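As a rough illustration of the kind of model Vespa can now evaluate, here is a sketch of training a small LightGBM model in Python and dumping it to JSON. The feature names, synthetic data, and output file name are made up; the Vespa documentation is the reference for how to package the exported model and reference it from a rank profile.

    # Sketch: train a small LightGBM model and dump it to JSON. Feature names
    # and the output file name are illustrative; see the Vespa documentation
    # for how to reference the exported model from a rank profile.
    import json
    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(1000, 3)                      # stand-in query/document features
    y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # synthetic labels

    model = lgb.train(
        {"objective": "binary", "learning_rate": 0.1, "num_leaves": 15},
        lgb.Dataset(X, label=y, feature_name=["bm25_title", "bm25_body", "freshness"]),
        num_boost_round=50,
    )

    with open("lightgbm_model.json", "w") as f:
        json.dump(model.dump_model(), f)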

Introducing Proxy Verifier - Open Source Tool for Testing HTTP Based Proxies March 2, 2020

Introducing Proxy Verifier - Open Source Tool for Testing HTTP Based Proxies

Alan M. Carroll and Brian Neradt, Software Engineers, Verizon Media We’re pleased to announce Proxy Verifier - an open source tool for testing HTTP based proxies. Originally built as part of Verizon Media’s support for Apache Traffic Server (ATS) to improve testability and reliability, Proxy Verifier generates traffic through a proxy and verifies the behavior of the proxy. A key difference between Proxy Verifier and existing HTTP based test tools is Proxy Verifier verifies traffic to and from the proxy. This bi-directional ability was a primary motivation. In addition, handling traffic on both sides of the proxy means a Proxy Verifier setup can run in a network disconnected environment, which was an absolute requirement for this work - no other servers are required, and the risk of hitting production servers with test traffic is eliminated. After sharing the idea for Proxy Verifier with the Apache Traffic Server community, we’ve received significant external interest. We are pleased to have achieved a level of maturity with the tool’s development that we can now share it with the world by open sourcing it. As a related benefit, by open sourcing Proxy Verifier we will also be able to use it as a part of Traffic Server’s end-to-end test automation. Within Verizon Media, Proxy Verifier serves to support correctness, production simulation, and load testing. Generated and captured replay files are used for production simulation and load testing. Handbuilt replay files are used for debugging and correctness testing. Replay files are easily constructed by hand based on use cases or packet capture files, and also easily edited and extended later. Proxy Verifier is being integrated into the AuTest framework used in ATS for automated end-to-end testing. Proxy Verifier builds two executables, the client and server, which are used to test the proxy: The client sends requests to the proxy under test, which in turn is configured to send them to the server. The server parses the request from the proxy, sends a response, which the proxy then sends to the client. This traffic is controlled by a “replay file”, which is a YAML formatted configuration file. This contains the transactions as four messages - client to proxy, proxy to server, server to proxy, and proxy to client. Transactions can be grouped into sessions, each of which represents a single connection from the client to the proxy. This set of events are depicted in the following sequence diagram: Because the Proxy Verifier server needs only the replay file and no other configuration, it is easy for a developer to use it as a test HTTP server instead of setting up and configuring a full web server. Other key features: - Fine-grained control of what is sent from the client and server, along with what is expected from the proxy. - Specific fields in the proxy request or response can be checked against one of three criteria: the presence of a field, the absence of a field, or the presence of a field with a specific value. - Transactions in the config can be run once or repeatedly a specified number of times. - Sessions allow control of how much a client session is reused. - Transactions can be sent at a fixed rate to help simulate production level loads. Proxy Verifier has been tested up to over 10K RPS sustained. - The “traffic_dump” plugin for ATS can be used to capture production traffic for later testing with Proxy Verifier. - Protocol support: - IPv4 and IPv6 support. - HTTP/1.x support for both the Verifier client and server. 
- The Verifier client supports HTTP/2 but the server currently does not. We have plans to support server-side HTTP/2 sometime before the end of Q2 2020. - HTTPS with TLS 1.3 support (assuming Proxy Verifier is linked against OpenSSL 1.1.1 or higher).  For build and installation instructions, explore the github README page. Please file github issues for bugs or enhancement requests. Acknowledgments We would like to thank several people whose work contributed to this project: - Syeda “Persia” Aziz, initial work and proof of concept for the replay server. - Jesse Zhang, previous generation prototype and the schema. - Will Wendorf, initial verification logic. - Susan Hinrichs, implemented the client side HTTP/2 support.

Remote Join February 28, 2020

Remote Join

Tiffany Kyi, Software Engineer, Verizon Media
We have recently rolled out a new feature: Remote Join. Previously, with remote triggers, users could kick off jobs in external pipelines by requiring a job from another one. With this new remote join feature, users can do parallel forks and joins with jobs from external pipelines. An example of an external parallel fork join in the Screwdriver UI:
User configuration
Make sure your cluster admin has the proper configuration set to support this feature. In order to use this new feature, you can configure your screwdriver.yaml similar to how remote triggers are done today. Just as with normal jobs, remote triggers will follow these rules:
- ~ tilde prefix denotes logical [OR]
- Omitting the ~ tilde prefix denotes logical [AND]
Example
Pipeline 3 screwdriver.yaml:
    shared:
        image: node:12
        steps:
            - echo: echo hi
    jobs:
        main:
            requires: [~commit, ~pr]
        internal_fork:
            requires: [main]
        join_job:
            requires: [internal_fork, sd@2:external_fork, sd@4:external_fork]
Pipeline 2 screwdriver.yaml:
    shared:
        image: node:12
        steps:
            - echo: echo hi
    jobs:
        external_fork:
            requires: [~sd@3:main]
Pipeline 4 screwdriver.yaml:
    shared:
        image: node:12
        steps:
            - echo: echo hi
    jobs:
        external_fork:
            requires: [~sd@3:main]
Caveats
- In the downstream remote job, you'll need to use the ~ tilde prefix for the external requires.
- This feature is only guaranteed one external dependency level deep.
- This feature currently does not work with PR chains.
- The event list on the right side of the UI might not show the complete mini-graph for the event.
Cluster Admin configuration
In order to enable this feature in your cluster, you'll need to make changes to your Screwdriver cluster's configuration by setting the EXTERNAL_JOIN custom environment variable to true.
Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.877
- UI - v1.0.494
- Store - v3.10.5
- Launcher - v6.0.12
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- d2lam
- jithine
- klu909
- tkyi
Questions & Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Dash Open 18: A chat with Joshua Simmons, Vice President, Open Source Initiative February 24, 2020

Dash Open 18: A chat with Joshua Simmons, Vice President, Open Source Initiative

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Joshua Simmons, Vice President, Open Source Initiative (OSI). Joshua discusses the Open Source Initiative (OSI), a global non-profit championing software freedom in society through education, collaboration, and infrastructure. Joshua also highlights trends in the open source landscape and potential future changes.  Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 17: A chat with Neil McGovern, Executive Director, GNOME Foundation February 22, 2020

Dash Open 17: A chat with Neil McGovern, Executive Director, GNOME Foundation

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Neil McGovern, Executive Director, GNOME Foundation. Neil shares how he originally became involved with open source, the industry changes he has observed, and his focus at the GNOME Foundation, a non-profit organization that furthers the goals of the GNOME Project, helping it to create a free software computing platform for the general public that is designed to be elegant, efficient, and easy to use. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 16: OSCON 2019 - A chat with Rachel Roumeliotis, VP Content Strategy, O'Reilly Media February 19, 2020

Dash Open 16: OSCON 2019 - A chat with Rachel Roumeliotis, VP Content Strategy, O'Reilly Media

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Rachel Roumeliotis, Vice President of Content Strategy at O'Reilly Media. Rachel reflects on OSCON 2019 themes, what to expect at OSCON 2020, where the industry is going, and how she empowers her team to be great storytellers. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Improvements and Fixes February 14, 2020

Improvements and Fixes

Screwdriver Team from Verizon Media
UI
- Enhancement: Upgrade to node.js v12.
- Enhancement: Users can now link to custom test & coverage URL via metadata.
- Enhancement: Reduce number of API calls to fetch active build logs.
- Enhancement: Display proper title for Commands and Templates pages.
- Bug fix: Hide "My Pipelines" from Add to collection dialogue.
- Enhancement: Display usage stats for a template.
API
- Enhancement: Upgrade to node.js v12.
- Enhancement: Reduce DB size by removing steps column from builds.
- Enhancement: New API to display usage metrics of a template.
- Bug fix: Restarting builds multiple times now carries over proper context.
Store
- Enhancement: Upgrade to node.js v12.
- Enhancement: Support for private AWS S3 buckets.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- UI - v1.0.491
- API - v0.5.851
- Store - v3.10.5
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- jithine
- klu909
- InderH
- djerraballi
- tkyi
- wahapo
- tk3fftk
- sugarnaoming
- kkisic
- kkokufud
- sakka2
- yuichi10
- s-yoshika
Questions and Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Build cache - Disk strategy January 30, 2020

Build cache - Disk strategy

Build cache - Disk strategy
Screwdriver now has the ability to cache and restore files and directories from your builds to either S3 or disk-based storage. All other aspects of the cache feature remain the same; only a new storage option has been added. Please DO NOT USE this cache feature to store any SENSITIVE data or information. The graphs below compare build-cache performance on our internal Screwdriver instance for the disk-based strategy vs. AWS S3:
- Build cache - get cache - (disk strategy)
- Build cache - get cache - (s3)
- Build cache - set cache - (disk strategy)
- Build cache - set cache - (s3)
Why a disk-based strategy? Based on the cache analysis: 1. the majority of time was spent pushing data from the build to S3, and 2. at times the cache push fails if the cache size is big (ex: >1 GB). So we simplified the storage part by using a disk cache strategy, with a filer/storage mount as the disk. Each cluster will have its own filer/storage disk mount. NOTE: When a cluster becomes unavailable and the requested cache is not available in the new cluster, the cache will be rebuilt once as part of the build.
Cache size: the max size limit per cache is configurable by cluster admins.
Retention policy: cluster admins are responsible for enforcing the retention policy.
Cluster Admins: a Screwdriver cluster admin can specify the cache storage strategy along with other options like compression, md5 check, and cache max limit in MB.
Reference:
1. https://github.com/screwdriver-cd/screwdriver/blob/master/config/default.yaml#L280
2. https://github.com/screwdriver-cd/executor-k8s-vm/blob/master/index.js#L336
3. Issue: https://github.com/screwdriver-cd/screwdriver/issues/1830
Compatibility List:
In order to use this feature, you will need these minimum versions:
- API - v0.5.835
- Buildcluster queue worker - v1.4.7
- Launcher - v6.0.42
- Store-cli - v0.0.50
- Store - v3.10.3
Contributors:
Thanks to the following people for making this feature possible:
- parthasl
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Vespa Product Updates, January 2020: Tensor Functions, New Sizing Guides, Performance Improvement for Matched Elements in Map/Array-of-Struct, Boolean Field Query Optimization January 28, 2020

Vespa Product Updates, January 2020: Tensor Functions, New Sizing Guides, Performance Improvement for Matched Elements in Map/Array-of-Struct, Boolean Field Query Optimization

Kristian Aune, Tech Product Manager, Verizon Media In the December Vespa product update, we mentioned improved ONNX support, new rank feature attributeMatch().maxWeight, free lists for attribute multivalue mapping, faster updates for out-of-sync documents, and ZooKeeper 3.5.6. This month, we’re excited to share the following updates: Tensor Functions The tensor language has been extended with functions to allow the representation of very complex neural nets, such as BERT models, and better support for working with mapped (sparse) tensors: - Slice makes it possible to extract values and subspaces from tensors. - Literal tensors make it possible to create tensors on the fly, for instance from values sliced out of other tensors or from a list of scalar attributes or functions. - Merge produces a new tensor from two mapped tensors of the same type, where a lambda to resolve is invoked only for overlapping values. This can be used, for example, to supply default values which are overridden by an argument tensor. New Sizing Guides Vespa is used for applications with high performance or cost requirements. New sizing guides for queries and writes are now available to help teams use Vespa optimally. Performance Improvement for Matched Elements in Map/Array-of-Struct As maps or arrays in documents can often grow large, applications use matched-elements-only to return only matched items. This also simplifies application code. Performance for this feature is now improved - ex: an array or map with 20.000 elements is now 5x faster. Boolean Field Query Optimization Applications with strict latency requirements, using boolean fields and concurrent feed and query load, have a latency reduction since Vespa 7.165.5 due to an added bitCount cache. For example, we realized latency improvement from 3ms to 2ms for an application with a 30k write rate. Details in #11879. About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Recent Enhancements and bug fixes January 22, 2020

Recent Enhancements and bug fixes

Screwdriver Team from Verizon Media
UI
- Bugfix: Artifact images are now displayed correctly in the Firefox browser.
- Feature: Deep linking to an artifact for a specific build. You can now share a link directly to an artifact, for example: https://cd.screwdriver.cd/pipelines/3709/builds/168862/artifacts/artifacts/dog.jpeg
- Enhancement: Can override Freeze Window to start a build. Previously, users could not start builds during a freeze window unless they made changes to the freeze window setting in the screwdriver.yaml configuration. Now, you can start a build by entering a reason in the confirmation modal. This can be useful for users needing to push out an urgent patch or hotfix during a freeze window.
Store
- Feature: Build cache now supports local disk-based cache in addition to S3 cache.
Queue Worker
- Bugfix: Periodic build timeout check.
- Enhancement: Prevent re-enqueue of builds from the same event.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- UI - v1.0.479
- API - v0.5.835
- Store - v3.10.3
- Launcher - v6.0.42
- Queue-Worker - v2.9.0
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- jithine
- klu909
- parthasl
- pritamstyz4ever
- tk3fftk
- tkyi
Questions and Suggestions
We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Speak at the 1st Annual Pulsar Summit - April 28th, San Francisco January 13, 2020

Speak at the 1st Annual Pulsar Summit - April 28th, San Francisco

Sijie Guo, Founder, StreamNative The first-ever Pulsar Summit will bring together an international audience of CTOs, CIOs, developers, data architects, data scientists, Apache Pulsar committers/contributors, and the messaging and streaming community, to share experiences, exchange ideas and knowledge, and receive hands-on training sessions led by Apache Pulsar experts. Talk submissions, pre-registration, and sponsorship opportunities are now open for the conference! Speak at Pulsar Summit Submit a presentation or a lightning talk. Suggested topics cover Pulsar use cases, operations, technology deep dives, and ecosystem. Submissions are open until January 31, 2020. If you would like feedback or advice on your proposal, please reach out to sf-2020@pulsar-summit.org. We’re happy to help! - CFP Closes: January 31, 2020 - 23:59 PST - Speakers Notified: February 21, 2020 - Schedule Announced: February 24, 2020 Speaker Benefits Accepted speakers will enjoy: - Conference pass & speaker swag - Name, title, company, and bio will be featured on the Summit website - Professionally produced video of your presentation - Session recording added to the Pulsar Summit YouTube Channel - Session recording promoted on Twitter and LinkedIn Pre-registration Pre-registration is now open! After submitting the pre-registration form, you will be added to the Pulsar Summit waitlist. Once registration is open, we’ll email you. Sponsor Pulsar Summit Pulsar Summit is a community-run conference and your support is appreciated. Sponsoring this event will provide a great opportunity for your organization to further engage with the Apache Pulsar community. Contact us to learn more. Follow us on Twitter @pulsarsummit to receive the latest conference updates. Hope to see you there!

Dash Open 15: The Virtues and Pitfalls of Contributor License Agreements January 9, 2020

Dash Open 15: The Virtues and Pitfalls of Contributor License Agreements

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Michael Martin, Associate General Counsel and Head of Patents at Verizon Media. Mike shares why Contributor License Agreements (also known as CLAs) came to be and some of the reasons they don’t work as well as we’d hope. Fundamentally, we need to foster trust among people who don’t know each other and have no reason to trust each other. Without it, we’re not going to be able to build these incredibly complex things that require us to work together. Do CLAs do that? Listen and find out. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify. P.S. If you enjoyed this podcast then you might be interested in this Open Source Developer Lead position.

Omid Graduates from the Apache Incubator Program! January 6, 2020

Omid Graduates from the Apache Incubator Program!

Ohad Shacham, Sr. Research Scientist Yonatan Gottesman, Sr. Research Engineer Edward Bortnikov, Sr. Director, Research Scalable Systems, Yahoo Research Haifa, Verizon Media We have awesome news to share with you about Apache Omid, a scalable transaction processing platform for Apache HBase developed and open sourced by Yahoo. Omid has just graduated from the Apache Incubator program and is now part of Apache Phoenix. Phoenix is a real-time SQL database for OLTP and real-time analytics over HBase. It is widely employed by the industry, powering products in Alibaba, Bloomberg, Salesforce, and many others. Omid means ‘hope’ in Farsi, and as we hoped, it has proven to be a successful project.  In 2011, a team of scientists at Yahoo Research incepted Omid in anticipation of the need for low-latency OLTP applications within the Hadoop ecosystem. It has been powering real-time content indexing for Yahoo Search since 2015. In the same year, Omid entered the Apache Incubator program, taking the path towards wider adoption, community development, and code maturity. A year ago, Omid hit another major milestone when the Apache Phoenix project community selected it as the default provider of ACID transaction technology. We worked hard to make Omid’s recent major release match Phoenix’s requirements - flexibility,  high speed, and SQL extensions. Our work started when Phoenix was already using the Tephra technology to power its transaction backend. In order to provide backward compatibility, we contributed a brand new transaction adaptation layer (TAL) to Phoenix, which enables a configurable choice for the transaction technology provider. Our team performed extensive benchmarks, demonstrating Omid’s excellent scalability and reliability, which led to its adoption as the default transaction processor for Phoenix. With Omid’s support, Phoenix now features consistent secondary indexes and extended query semantics. The Phoenix-Omid integration is now generally available (release 4.15). Notwithstanding this integration, Omid can still be used as a standalone service by NoSQL HBase applications. In parallel with Phoenix adoption, Omid’s code and documentation continuously improved to meet the Apache project standards. Recently, the Apache community suggested that Omid (as well as Tephra) becomes an integral part of Phoenix going forward. The community vote ratified this decision. Omid’s adoption by the top-level Apache project is a huge success. We could not imagine a better graduation for Omid, since it will now enjoy a larger developer community and will be used in even more real-world applications. As we celebrate Omid’s Apache graduation, it’s even more exciting to see new products using it to run their data platforms at scale. Omid could not have been successful without its wonderful developer community at Yahoo and beyond. Thank you Maysam Yabandeh, Flavio Junqueira, Ben Reed, Ivan Kelly, Francisco Perez-Sorrosal, Matthieu Morel, Sameer Paranjpye, Igor Katkov, James Taylor, and Lars Hofhansl for your numerous contributions. Thank you also to the Apache community for your commitment to open source and for letting us bring our technology to benefit the community at large. We invite future contributors to explore Omid’s new home repository at https://gitbox.apache.org/repos/asf?p=phoenix-omid.git.

Documentation for Panoptes - Open Source Global Scale Network Telemetry Ecosystem December 30, 2019

Documentation for Panoptes - Open Source Global Scale Network Telemetry Ecosystem

By James Diss, Software Systems Engineer, Verizon Media Documentation is important to the success of any project. Panoptes, which we open-sourced in October 2018, is no exception, due to its distribution of concerns and plugin architecture. Because of this, there are inherent complexities in implementing and personalizing the framework for individual users. While the code provides clarity, it’s the documentation that supplies the map for exploration. In recognition of this, we’ve split out the documentation for the Panoptes project, and will update it separately from now on. Expanding the documentation contained within the project and separating out the documentation from the actual framework code gives us a little more flexibility in expanding and contextualizing the documentation, but also gets it away from the code that would be deployed to production hosts. We’re also using an internal template and the docusaurus.io project to produce a website that will be updated at the same time as the project documentation at https://getpanoptes.io. Panoptes Resources - Panoptes Documentation Repo - https://github.com/yahoo/panoptes_documentation - Panoptes in Docker Image  - https://hub.docker.com/r/panoptes/panoptes_docker - Panoptes in Docker GitHub Repo  - https://github.com/yahoo/panoptes_docker - Panoptes GitHub Repo                      - https://github.com/yahoo/panoptes/ Questions, Suggestions, & Contributions Your feedback and contributions are appreciated! Explore Panoptes, use and help contribute to the project, and chat with us on Slack.

Vespa Product Updates, December 2019: Improved ONNX Support, New Rank Feature attributeMatch().maxWeight, Free Lists for Attribute Multivalue Mapping, Faster Updates for Out-of-Sync Documents, and ZooKeeper 3.5.6 Support December 18, 2019

Vespa Product Updates, December 2019: Improved ONNX Support, New Rank Feature attributeMatch().maxWeight, Free Lists for Attribute Multivalue Mapping, Faster Updates for Out-of-Sync Documents, and ZooKeeper 3.5.6 Support

Kristian Aune, Tech Product Manager, Verizon Media In the November Vespa product update, we mentioned Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance and Datadog Monitoring Support. Today, we're excited to share the following updates: Improved ONNX Support Vespa has added more operations to its ONNX model API, such as GEneral Matrix to Matrix Multiplication (GEMM) - see the list of supported opsets. Vespa has also improved support for PyTorch through ONNX, see the pytorch_test.py example. New Rank Feature attributeMatch().maxWeight attributeMatch(name).maxWeight was added in Vespa-7.135.5. The value is the maximum weight of the attribute keys matched in a weighted set attribute. Free Lists for Attribute Multivalue Mapping Since Vespa-7.141.8, multivalue attributes use a free list to improve performance. This reduces CPU (no compaction jobs) and approximately 10% memory. This primarily benefits applications with a high update rate to such attributes. Faster Updates for Out-of-Sync Documents Vespa handles replica consistency using bucket checksums. Updating documents can be cheaper than putting a new document, due to fewer updates to posting lists. For updates to documents in inconsistent buckets, a GET-UPDATE is now used instead of a GET-PUT whenever the document to update is consistent across replicas. This is the common case when only a subset of the documents in the bucket are out of sync. This is useful for applications with high update rates, updating multi-value fields with large sets. Explore details here. ZooKeeper 3.5.6 Vespa now uses Apache ZooKeeper 3.5.6 and can encrypt communication between ZooKeeper servers. About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.
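To illustrate the PyTorch-through-ONNX path mentioned above, here is a minimal sketch of exporting a toy PyTorch model to an ONNX file. The model, input shape, and file name are made up; the Vespa documentation covers how to reference an exported ONNX model from a rank profile.

    # Sketch: export a small PyTorch model to ONNX. Model, input shape, and
    # file name are illustrative; see the Vespa documentation for how to use
    # the exported model in a rank profile.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    model.eval()

    dummy_input = torch.randn(1, 4)   # one example with 4 features
    torch.onnx.export(
        model,
        dummy_input,
        "ranking_model.onnx",
        input_names=["features"],
        output_names=["score"],
    )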

Learning to Rank with Vespa – Getting started with Text Search December 4, 2019

Learning to Rank with Vespa – Getting started with Text Search

Vespa.ai has just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa. The tutorials are based on the full document ranking task released by Microsoft's MS MARCO dataset team. The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse, and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa. The second tutorial shows how to create a training dataset containing Vespa ranking features that allows you to start training ML models to improve the app's ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context. Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.
Basic text search app in a nutshell
The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query. Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and a body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching query words in both the title and body fields. Finally, we have defined two rank-profiles, which control how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.
    search msmarco {
        document msmarco {
            field title type string
            field body type string
        }
        fieldset default {
            fields: title, body
        }
        rank-profile default {
            first-phase {
                expression: nativeRank(title, body)
            }
        }
        rank-profile bm25 inherits default {
            first-phase {
                expression: bm25(title) + bm25(body)
            }
        }
    }
When we have more than one rank-profile defined, we can choose which one to use at query time by including the ranking parameter in the query:
    curl -s "/search/?query=what+is+dad+bod"
    curl -s "/search/?query=what+is+dad+bod&ranking=bm25"
The first query above does not specify the ranking parameter and will therefore use the default rank-profile. The second query explicitly asks for the bm25 rank-profile to be used instead. Having multiple rank-profiles allows us to experiment with different ranking functions. There is one relevant document for each query in the MS MARCO dataset. The figure below is the result of an evaluation script that sent more than 5,000 queries to our application and asked for results using both rank-profiles described above. We then tracked the position of the relevant document for each query and plotted the distribution for the first 10 positions. It is clear that the bm25 rank-profile does a much better job in this case. It places the relevant document in the first positions much more often than the default rank-profile.
Data collection sanity check
After setting up a basic application, we likely want to collect rank feature data to help improve our ranking functions.
Vespa allows us to return rank features along with query results, which enables us to create training datasets that combine relevance information with search engine rank information. There are different ways to create a training dataset in this case. Because of this, we believe it is a good idea to have a sanity check established before we start to collect the dataset. The goal of such a sanity check is to increase the likelihood that we catch bugs early and create datasets containing the right information associated with our task of improving ranking functions. Our proposal is to use the dataset to train a model using the same features and functional form used by the baseline you want to improve upon. If the dataset is well built and contains useful information about the task you are interested in, you should be able to get results at least as good as the ones obtained by your baseline on a separate test set. Since our baseline in this case is the bm25 rank-profile, we should fit a linear model containing only the bm25 features:
    a + b * bm25(title) + c * bm25(body)
Having this simple procedure in place helped us catch a few silly bugs in our data collection code and got us on the right track faster than would have happened otherwise. Bugs in your data are hard to catch once you begin experimenting with complex models, as you never know whether a problem comes from the data or the model. So this is a practice we highly recommend.
How to create a training dataset with Vespa
Asking Vespa to return ranking features in the result set is as simple as setting the ranking.listFeatures parameter to true in the request. Below is the body of a POST request that specifies the query in YQL format and enables the rank feature dumping.
    body = {
        "yql": 'select * from sources * where (userInput(@userQuery));',
        "userQuery": "what is dad bod",
        "ranking": {"profile": "bm25", "listFeatures": "true"},
    }
Vespa returns a bunch of ranking features by default, but we can explicitly define which features we want by creating a rank-profile, asking it to ignore-default-rank-features, and listing the features we want with the rank-features keyword, as shown below. The random first phase will be used when sampling random documents to serve as a proxy for non-relevant documents.
    rank-profile collect_rank_features inherits default {
        first-phase {
            expression: random
        }
        ignore-default-rank-features
        rank-features {
            bm25(title)
            bm25(body)
            nativeRank(title)
            nativeRank(body)
        }
    }
We want a dataset that will help train models that will generalize well when running on a Vespa instance. This implies that we are only interested in collecting documents that are matched by the query, because those are the documents that would be presented to the first-phase model in a production environment. Here is the data collection logic:
    hits = get_relevant_hit(query, rank_profile, relevant_id)
    if hits:
        hits.extend(get_random_hits(query, rank_profile, n_samples))
        data = annotate_data(hits, query_id, relevant_id)
        append_data(file, data)
For each query, we first send a request to Vespa to get the relevant document associated with the query. If the relevant document is matched by the query, Vespa will return it, and we will expand the number of documents associated with the query by sending a second request to Vespa. The second request asks Vespa to return a number of random documents sampled from the set of documents that were matched by the query.
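For context, a single data-collection request could look like the sketch below, which posts a query body like the one shown above to a running Vespa instance and reads the dumped rank features from the result JSON. The host is a placeholder, and the parsing follows Vespa's default result layout.

    # Sketch of one data-collection request: fetch hits plus the rank features
    # listed in the collect_rank_features profile. The host is a placeholder;
    # parsing follows Vespa's default JSON result layout.
    import requests

    body = {
        "yql": "select * from sources * where (userInput(@userQuery));",
        "userQuery": "what is dad bod",
        "ranking": {"profile": "collect_rank_features", "listFeatures": "true"},
        "hits": 10,
    }

    resp = requests.post("http://localhost:8080/search/", json=body, timeout=10)
    resp.raise_for_status()

    for hit in resp.json().get("root", {}).get("children", []):
        features = hit.get("fields", {}).get("rankfeatures", {})
        print(hit.get("id"), features.get("bm25(title)"), features.get("bm25(body)"))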
We then parse the hits returned by Vespa and organize the data into a tabular form containing the rank features and a binary variable indicating whether the query-document pair is relevant or not. At the end we have a dataset in tabular form; more details can be found in our second tutorial.
Beyond pointwise loss functions
The most straightforward way to train the linear model suggested in our data collection sanity check would be to use a vanilla logistic regression, since our target variable relevant is binary. The most commonly used loss function in this case (binary cross-entropy) is referred to as a pointwise loss function in the LTR literature, as it does not take the relative order of documents into account. However, as we described in our first tutorial, the metric that we want to optimize in this case is the Mean Reciprocal Rank (MRR). The MRR is affected by the relative order of the relevance we assign to the list of documents generated by a query and not by their absolute magnitudes. This disconnect between the characteristics of the loss function and the metric of interest might lead to suboptimal results. For ranking search results, it is preferable to use a listwise loss function when training our model, which takes the entire ranked list into consideration when updating the model parameters. To illustrate this, we trained linear models using the TF-Ranking framework. The framework is built on top of TensorFlow and allows us to specify pointwise, pairwise, and listwise loss functions, among other things. We have made available the script that we used to train the two models that generated the results displayed in the figure below. The script uses simple linear models but can be useful as a starting point to build more complex ones. Overall, on average, there is not much difference between those models (with respect to MRR), which was expected given the simplicity of the models described here. However, we can see that a model based on a listwise loss function allocates more documents to the first two positions of the ranked list when compared to the pointwise model. We expect the difference in MRR between pointwise and listwise loss functions to increase as we move on to more complex models. The main goal here was simply to show the importance of choosing better loss functions when dealing with LTR tasks and to give a quick start for those who want to give it a shot in their own Vespa applications. Now it is up to you: check out the tutorials, build something, and let us know how it went. Feedback is welcome!
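As a concrete starting point for the pointwise sanity check described above, here is a minimal sketch that fits the a + b * bm25(title) + c * bm25(body) baseline with scikit-learn. The CSV file name and column names are assumptions about how the collected dataset was saved.

    # Pointwise sanity-check baseline: logistic regression on the two BM25
    # features, i.e. the same functional form as a + b*bm25(title) + c*bm25(body).
    # The CSV file name and column names are assumptions about the saved dataset.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("collected_rank_features.csv")   # hypothetical file
    X = data[["bm25_title", "bm25_body"]]
    y = data["relevant"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LogisticRegression().fit(X_train, y_train)
    print("intercept a:", model.intercept_[0])
    print("coefficients b, c:", model.coef_[0])
    # The real check is to rank a held-out query set with this model and
    # compare its MRR against the bm25 rank-profile baseline.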

Dash Open 14: How Verizon Media’s Data Platforms and Systems Engineering Team Uses and Contributes to Open Source December 3, 2019
December 3, 2019

Dash Open 14: How Verizon Media’s Data Platforms and Systems Engineering Team Uses and Contributes to Open Source

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Tom Miller, Director of Software Development Engineering on the Data Platforms and Systems Engineering Team at Verizon Media. Tom shares how his team uses and contributes to open source. Tom also chats about empowering his team to do great work and what it’s like to live and work in Champaign, IL. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify. P.S. If you enjoyed this podcast then you might be interested in this Software Development Engineer position in Champaign!

E-commerce search and recommendation with Vespa.ai November 29, 2019
November 29, 2019

E-commerce search and recommendation with Vespa.ai

Introduction

Holiday shopping season is upon us and it's time for a blog post on E-commerce search and recommendation using Vespa.ai. Vespa.ai is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia, like tw.buy.yahoo.com. This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.ai. Photo by Jonas Leupe on Unsplash

Text matching and ranking in e-commerce search

E-commerce search has text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25 see the influence of TF-IDF algorithms in e-commerce search. One example from the blog post is a search for ipad 2 which with traditional TF-IDF ranking will rank 'black mini ipad cover, compatible with ipad 2' higher than 'Ipad 2' as the former product description has several occurrences of the query terms Ipad and 2. Vespa allows developers and relevancy engineers to fine tune the text ranking features to meet domain specific ranking challenges. For example, developers can control whether multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. The Vespa text ranking features also take text proximity into account in the relevancy calculation, i.e. how close the query terms appear in the matched text. BM25/TF-IDF on the other hand does not take query term proximity into account at all. Vespa also implements BM25, but it's up to the relevancy engineer to choose which of the rich set of built-in text ranking features in Vespa to use. Vespa uses OpenNLP for linguistic processing like tokenization and stemming, with support for multiple languages (as supported by OpenNLP).

Custom ranking business logic in e-commerce search

Your manager might tell you that these items of the product catalog should be prominent in the search results. How to tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search, or even with a key value store which rarely is in sync with the product catalog search index? With Vespa it's easy to promote content, as Vespa's ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles, which enables full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time, so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See the later section on query classification for more details on how query classification can be done with Vespa. A sample ranking profile which implements a tiered relevance scoring function, where sponsored or promoted items are always ranked above non-sponsored documents, is shown below. The ranking profile is applied to all documents which match the query formulation, and the relevance score of the hit is assigned the value of the first-phase expression. Vespa also supports multi-phase ranking. Sample hand crafted ranking profile defined in the Vespa application package.
The above example is hand crafted, but for optimal relevance we do recommend looking at learning to rank (LTR) methods. See Learning to Rank using TensorFlow Ranking and Learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic. In the example above we could replace the default-ranking function with the trained MLR model, hence combining business logic with MLR models.

Facets and grouping in e-commerce search

Guiding the user through the product catalog by guided navigation or faceted search is a feature which users expect from an e-commerce search solution today, and with Vespa, facets and guided navigation are easily implemented by the powerful Vespa Grouping Language. Sample screenshot from Vespa e-commerce sample application UI demonstrating search facets using Vespa Grouping Language. The Vespa grouping language supports deep nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example, if grouping hits by category and displaying the top 3 ranking hits per category, the language allows paginating to render more hits from a specified category group.

The vocabulary mismatch problem in e-commerce search

Studies (e.g. this study from FlipKart) find that there is a significant fraction of queries in e-commerce search which suffer from vocabulary mismatch between the user query formulation and the relevant product descriptions in the product catalog. For example, the query "ladies pregnancy dress" would not match a product with description "women maternity gown" due to vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail to retrieve the relevant product right off the bat. Most techniques currently used to try to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models. Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem, as the retrieval is not based on query terms alone, but instead on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval. If we take our query and product title example from above, which suffers from vocabulary mismatch, and move from the textual representation to the respective dense tensor embedding representations, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields over which one can perform tensor operations (e.g. cosine similarity) for ranking; this functionality is demonstrated in the mentioned sample application. Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description).
The tensor embeddings of the textual representation are obtained with the Universal Sentence Encoder from Google. Semantic similarity matrix of different queries and product descriptions. The Universal Sentence Encoder model from Google is multilingual, as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval, so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages. This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.

Query classification and query rewriting in e-commerce search

Vespa supports deploying stateless machine learned (ML) models, which comes in handy when doing query classification. Machine learned models which classify the query are commonly used in e-commerce search solutions, and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g. text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with the mentioned tools can successfully be used for various query classification tasks with high accuracy. In e-commerce search, classifying the intent of the query or query session can help rank the results by using an intent-specific ranking profile which is tailored to the specific query intent. The intent classification can also determine how the result page is displayed and organised. Consider a category browse intent query like 'shoes for men'. Such a query intent might benefit from a query rewrite which limits the result set to items matching the unambiguous category id, instead of just searching the product description or category fields for 'shoes for men'. Ranking could also change based on the query classification by using a ranking profile which gives higher weight to signals like popularity or price than to text ranking features. Vespa also features a powerful query rewriting language which supports rule based query rewrites, synonym expansion and query phrasing.

Product recommendation in e-commerce search

Vespa is commonly used for recommendation use cases and e-commerce is no exception. Vespa is able to evaluate complex Machine Learned (ML) models over many data points (documents, products) in user time, which allows the ML model to use real time signals derived from the current user's online shopping session (e.g. products browsed, queries performed, time of day) as model features. An offline batch oriented inference architecture would not be able to use these important real time signals. By batch oriented inference architecture we mean pre-computing the inference offline for a set of users or products and storing the model inference results in a key-value store for online retrieval. In our blog recommendation tutorial we demonstrate how to apply a collaborative filtering model for content recommendation, and in part 2 of the blog recommendation tutorial we show how to use a neural network trained with TensorFlow to serve recommendations in user time.
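As an aside on the semantic similarity scores discussed above, here is a minimal NumPy sketch of the cosine similarity computation between a query embedding and a product description embedding. The vectors below are made up for illustration; in the post the embeddings come from the Universal Sentence Encoder.

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors: values near 1.0 mean similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = np.array([0.12, -0.45, 0.33, 0.08])    # toy embedding of "ladies pregnancy dress"
product_embedding = np.array([0.10, -0.40, 0.35, 0.05])  # toy embedding of "women maternity gown"

print(cosine_similarity(query_embedding, product_embedding))  # high value for semantically similar text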
Similar recommendation approaches are used with success in e-commerce.

Keeping your e-commerce index up to date with real time updates

Vespa is designed for horizontal scaling with high sustainable write and read throughput and low predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications, as the real time information is used in retrieval filters and also as ranking signals. The product description or product title rarely changes, but meta information like inventory status, price and popularity are real time signals which will improve relevance when used in ranking. Having the inventory status reflected in the search index also avoids retrieving content which is out of stock. Vespa has true native support for partial updates, where there is no need to re-index the entire document but only a subset of the document (i.e. fields in the document). Real time partial updates can be done at scale against attribute fields which are stored and updated in memory. Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.

Campaigns in e-commerce search

Using Vespa's support for predicate fields it's easy to control when content is surfaced in search results and when it is not. The predicate field type allows the content (e.g. a document) to express if it should match the query, instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:
- Only match and retrieve the document if the time of day is in the range 8–16 or 19–20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status are passed with the query.
- Represent recurring campaigns with multiple time ranges.
The above examples are by no means exhaustive, as predicates can be used for multiple campaign related use cases where the filtering logic is expressed in the content.

Scaling & performance for high availability in e-commerce search

Are you worried that your current search installation will break under the traffic surge associated with the holiday shopping season? Are your cloud VMs running high on disk busy metrics already? What about those long GC pauses in the JVM old generation causing your 95th percentile latency to go through the roof? Needless to say, any downtime due to a slow search backend causing a denial of service situation in the middle of the holiday shopping season will have a catastrophic impact on revenue and customer experience. Photo by Jon Tyson on Unsplash The heart of the Vespa serving stack is written in C++ and doesn't suffer from issues related to long JVM GC pauses. The indexing and search component in Vespa is significantly different from Lucene based engines like SOLR/Elasticsearch, which are IO intensive due to the many Lucene segments within an index shard. A query in a Lucene based engine will need to perform lookups in dictionaries and posting lists across all segments across all shards. Optimising the search access pattern by merging the Lucene segments will further increase the IO load during the merge operations. With Vespa you don't need to define the number of shards for your index prior to indexing a single document, as Vespa allows adaptive scaling of the content cluster(s) and there is no shard concept in Vespa.
Content nodes can be added and removed as you wish and Vespa will re-balance the data in the background without having to re-feed the content from the source of truth. In ElasticSearch, changing the number of shards to scale with changes in data volume requires an operator to perform a multi-step procedure that sets the index into read-only mode and splits it into an entirely new index. Vespa is designed to allow cluster resizing while being fully available for reads and writes. Vespa splits, joins and moves parts of the data space to ensure an even distribution with no intervention needed. At the scale we operate Vespa at Verizon Media, requiring more than 2X footprint during content volume expansion or reduction would be prohibitively expensive. Vespa was designed to allow content cluster resizing while serving traffic without noticeable serving impact. Adding or removing content nodes is handled by adjusting the node count in the application package and re-deploying the application package. The shard concept in ElasticSearch and SOLR also impacts search latency incurred by cpu processing in the matching and ranking loops, as the concurrency model in ElasticSearch/SOLR is one thread per search per shard. Vespa on the other hand allows a single search to use multiple threads per node, and the number of threads can be controlled at query time by a rank-profile setting: num-threads-per-search. Partitioning the matching and ranking by dividing the document volume between searcher threads reduces the overall latency at the cost of more cpu threads, but makes better use of multi-core cpu architectures. If your search servers' cpu usage is low and search latency is still high, you now know the reason. In a recently published benchmark which compared the performance of Vespa versus ElasticSearch for dense vector ranking, Vespa was 5x faster than ElasticSearch. The benchmark used 2 shards for ElasticSearch and 2 threads per search in Vespa. The holiday season online query traffic can be very spiky, a query traffic pattern which can be difficult to predict and plan for. For instance, price comparison sites might direct more user traffic to your site unexpectedly at times you did not plan for. Vespa supports graceful quality of search degradation, which comes in handy for those cases where traffic spikes reach levels not anticipated in the capacity planning. These soft degradation features allow the search service to operate within acceptable latency levels, but with less accuracy and coverage. These soft degradation mechanisms help avoid a denial of service situation where all searches become slow due to overload caused by unexpected traffic spikes. See details in the Vespa graceful degradation documentation.

Summary

In this post we have explored some of the challenges in e-commerce search and recommendation and highlighted some of the features of Vespa which can be used to tackle e-commerce search and recommendation use cases. If you want to try Vespa for your e-commerce application you can check out our e-commerce sample application found here. The sample application can be scaled to full production size using our hosted Vespa Cloud Service at https://cloud.vespa.ai/. Happy Holiday Shopping Season!

YAML tip: Using anchors for shared steps & jobs November 26, 2019
November 26, 2019

YAML tip: Using anchors for shared steps & jobs

Sheridan Rawlins, Architect, Verizon Media

Overview

Occasionally, a pipeline needs several similar but different jobs. When these jobs are specific to a single pipeline, it would not make much sense to create a Screwdriver template. In order to reduce copy/paste issues and facilitate sharing jobs and steps in a single YAML, the tips shared in this post will hopefully be as helpful to you as they were to us. Below is a condensed example showcasing some techniques or patterns that can be used for sharing steps.

Example of desired use

jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2

Complete working example at the end of this post.

Defining shared steps

What is a step? First, let us define a step. Steps of a job look something like the following, and each step is an array element with an object with only one key and corresponding value. The key is the step name and the value is the cmd to be run. More details can be found in the SD Guide.

jobs:
  job1:
    steps:
      - step1: echo "do step 1"
      - step2: echo "do step 2"

What are anchors and aliases? Second, let me describe YAML anchors and aliases. An anchor may only be placed between an object key and its value. An alias may be used to copy or merge the anchor value.

Recommendation for defining shared steps and jobs

While an anchor can be defined anywhere in a YAML, defining shared things in the shared section makes intuitive sense. As annotations can contain freeform objects in addition to documented ones, we recommend defining annotations in the "shared" section. Now, I'll show an example and explain the details of how it works:

shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"

Explanation of how the step anchor declaration patterns work: In order to reduce redundancy, annotations allow users to define one shared configuration with an "alias" that can be referenced multiple times, such as *some-step in the following example, used by job1 and job2.

jobs:
  job1:
    steps:
      - *some-step
  job2:
    steps:
      - *some-step

To use the alias, the anchor &some-step must result in an object with a single key (also some-step) and a value which is the shell code to execute. Because an anchor can only be declared between a key and a value, we use an array with a single object with the single key . (short to type). The array allows us to use . again without conflict - if it were in an object, we might need to repeat some-step three times such as:

# Anti-pattern: do not use as it is too redundant.
some-step: &some-step
  some-step: |
    # shell instructions

The following is an example of a reasonably short pattern that can be used to define the steps, with the only redundancy being the anchor name and the step name:

shared:
  annotations:
    steps:
      - .: &some-step
          some-step: |
            echo "do some step"

When using *some-step, you alias to the anchor, which is an object with the single key some-step and the value echo "do some step", which is exactly what you want/need.

FAQ

Why the | character after some-step:? While you could just write some-step: echo "do some step", I prefer to use the | notation for describing shell code because it allows you to do multiline shell scripting.
Even for one-liners, you don't have to reason about the escape rules - as long as the commands are indented properly, they will be passed to the $USER_SHELL_BIN correctly, allowing your shell to deal with escaping naturally.

set-dryrun: |
  DRYRUN=false
  if [[ -n $SD_PULL_REQUEST ]]; then
    DRYRUN=true
  fi

Why that syntax for environment variables?
1. By using environment variables for shared steps, the variables can be altered by the specific jobs that invoke them.
2. The syntax "${VARIABLE:?}" is useful for a step that needs a value - it will cause an error if the variable is undefined or empty.

Why split CMD into array assignment and invocation? The style of defining an array and then invoking it helps readability by putting each logical flag on its own line. It can be digested by a human very easily and also copy/pasted to other commands or deleted with ease as a single line. Assigning to an array allows multiple lines, as bash will not complete the statement until the closing parenthesis.

Why does one flag have --flag=value and another have --flag value? Most CLI parsers treat boolean flags as a flag without an expected value - omission of the flag is false, existence is true. However, many CLI parsers also accept the --flag=value syntax for boolean flags and, in my opinion, it is far easier to debug and reason about a variable (such as false) than to know that the flag exists and is false when not provided.

Defining shared jobs

What is a job? A job in Screwdriver is an object with many fields, described in the SD Guide.

Job anchor declaration patterns

To use a shared job effectively, it is helpful to use a feature of YAML that is documented outside of the YAML 1.2 Spec called Merge Key. The syntax <<: *some-object-anchor lets you merge the keys of an anchor that has an object as its value into another object and then add or override keys as necessary.

Recommendation for defining shared jobs

shared:
  annotations:
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy

If you browse back to the previous example of desired use (also copied here), you can see the use of <<: *deploy-job to start with the deploy-job keys/values, and then add requires and environment overrides to customize the concrete instances of the deploy job.

jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2

FAQ

Why is environment put in the shared section and not included with the shared job? The answer to that is quite subtle. The Merge Key merges top level keys; if you were to put defaults in a shared job, overriding environment: would end up clobbering all of the provided values. However, Screwdriver follows up the YAML parsing phase with its own logic to merge things from the shared section at the appropriate depth.

Why not just use shared.steps? As noted above, Screwdriver does additional work to merge annotations, environment, and steps into each job after the YAML parsing phase. The logic for steps goes like this:
1. If a job has NO steps key, then it inherits ALL shared steps.
2. If a job has at least one step, then only matching wrapping steps (steps starting with pre or post) are copied in at the right place (before or after steps that the job provides, matching the remainder of the step name after pre or post).
While the above pattern might be useful for some pipelines, complex pipelines typically have a few job types and may want to share some but not all steps.

Complete Example

Copy/paste the following into the validator:

shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy

jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
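As a side note, here is a minimal Python sketch of how a plain YAML parser resolves the anchors, aliases and merge keys in the complete example above (assuming it was saved locally as screwdriver.yaml, a hypothetical filename). This only demonstrates YAML semantics; Screwdriver still applies its own shared/job merging after parsing, as described in the FAQ.

import yaml  # PyYAML

with open("screwdriver.yaml") as f:   # the complete example above, saved to an assumed filename
    config = yaml.safe_load(f)

deploy_location1 = config["jobs"]["deploy-location1"]
print(deploy_location1["image"])        # -> the-deploy-image, merged in via <<: *deploy-job
print(deploy_location1["environment"])  # -> {'LOCATION': 'location1'}
print([list(step)[0] for step in deploy_location1["steps"]])
# -> ['set-dryrun', 'deploy'] - each step resolves to a single-key object, as described in the post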

Dash Open 13: Using and Contributing to Hadoop at Verizon Media November 24, 2019
November 24, 2019

Dash Open 13: Using and Contributing to Hadoop at Verizon Media

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Eric Badger, Software Development Engineer, about using and contributing to Hadoop at Verizon Media.  Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Build Parameters November 13, 2019
November 13, 2019

Build Parameters

Alan Dong, Software Engineer, Verizon Media

The Screwdriver team is constantly evolving and building new features for its users. Today, we are announcing a nuanced feature: Build Parameters, aka Parameterized Builds, which enables users to have more control over build pipelines.

Purpose

The Build Parameters feature allows users to define a set of parameters at the pipeline level; users can customize runtime parameters either through the UI or the API to kick off builds. This means users can now implement reactive behaviors based on the parameters passed in as well.

Definition

There are 2 ways of defining parameters; see:

parameters:
  nameA: "value1"
  nameB:
    value: "value2"
    description: "description of nameB"

Parameters is a dictionary which expects key:value pairs. The key: string form

nameA: "value1"

is shorthand for writing key: value as

nameA:
  value: "value1"
  description: ""

These two are identical, with description being an empty string.

Example

See Screwdriver pipeline:

shared:
  image: node:8
parameters:
  region: "us-west-1"
  az:
    value: "1"
    description: "default availability zone"
jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - step1: 'echo "Region: $(meta get parameters.region.value)"'
      - step2: 'echo "AZ: $(meta get parameters.az.value)"'

You can also preview the parameters that are being used during a build in the Setup -> sd-setup-init step. Pipeline Preview Screenshot:

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.780
- UI - v1.0.466

Contributors

Thanks to the following contributors for making this feature possible:
- adong

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Vespa Product Updates, October/November 2019: Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, 
Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance, and Datadog Monitoring Support November 5, 2019
November 5, 2019

Vespa Product Updates, October/November 2019: Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance, and Datadog Monitoring Support

Kristian Aune, Tech Product Manager, Verizon Media

In the September Vespa product update, we mentioned Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container. This month, we're excited to share the following updates:

Nearest Neighbor and Tensor Ranking

Tensors are native to Vespa. We compared elastic.co to vespa.ai testing nearest neighbor ranking using dense tensor dot product. The result of an out-of-the-box configuration demonstrated that Vespa performed 5 times faster than Elastic. View the test results.

Optimized JSON Tensor Feed Format

A tensor is a data type used for advanced ranking and recommendation use cases in Vespa. This month, we released an optimized tensor format, enabling a more than 10x improvement in feed rate. Read more.

Matched Elements in Complex Multi-value Fields

Vespa is used in many use cases with structured data - documents can have arrays of structs or maps. Such arrays and maps can grow large, and often only the entries matching the query are relevant. You can now use the recently released matched-elements-only setting to return matches only. This increases performance and simplifies front-end code.

Large Weighted Set Update Performance

Weighted sets in documents are used to store a large number of elements used in ranking. Such sets are often updated at high volume, in real-time, enabling online big data serving. Vespa-7.129 includes a performance optimization for updating large sets. E.g. a set with 10K elements, without fast-search, is 86.5% faster to update.

Datadog Monitoring Support

Vespa is often used in large scale mission-critical applications. For easy integration into dashboards, Vespa is now in Datadog's integrations-extras GitHub repository. Existing Datadog users will now find it easy to monitor Vespa. Read more.

About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.

Collection Page Redesign November 4, 2019
November 4, 2019

Collection Page Redesign

Yufeng Gao, Software Engineer Intern, Verizon Media

We would like to introduce our new collections dashboard page. Users can now know more about the statuses of pipelines and have more flexibility when managing pipelines within a collection.

Main Features

View Modes

The new collection dashboard provides two view options - card mode and list mode. Both modes display pipeline repo names, branches, histories, and latest event info (such as commit sha, status, start date, duration). However, card mode also shows the latest events while the list mode doesn't. Users can switch between the two modes using the toggle on the top right corner of the dashboard's main panel.

Collection Operations

To create or delete a collection, users can use the left sidebar of the new collections page. For a specific existing collection, the dashboard offers three operations which can be found to the right of the title of the current collection:
1. Search all pipelines that the current collection doesn't contain, then select and add some of them into the current collection;
2. Change the name and description of the current collection;
3. Copy and share the link of the current collection.
Additionally, the dashboard also provides useful pipeline management operations:
1. Easily remove a single pipeline from the collection;
2. Remove multiple pipelines from the collection;
3. Copy and add multiple pipelines of the current collection to another collection.

Default Collection

Another new feature is the default collection, a collection where users can find all pipelines created by themselves. Note: Users have limited powers when it comes to the default collection; that is, they cannot conduct most operations they can do on normal collections. Users can only copy and share default collection links.

Compatibility List

In order to see the collection page redesign, you will need these minimum versions:
- API: v0.5.781
- UI: v1.0.466

Contributors

- code-beast
- adong
- jithine

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.

Recent Updates October 21, 2019
October 21, 2019

Recent Updates

Jithin Emmanuel, Engineering Manager, Verizon Media

Recent bug fixes in Screwdriver:

Meta
- skip-store option to prevent caching external meta.
- meta cli is now concurrency safe.
- When caching external metadata, meta-cli will not store existing cached data present in external metadata.

API
- Users can use SD_COVERAGE_PLUGIN_ENABLED environment variable to skip Sonarqube coverage bookend.
- Screwdriver admins can now update build status to FAILURE through the API.
- New API endpoint for fetching latest build for a job is now available.
- Fix for branch filtering not working for PR builds.
- Fix for branch filtering not working for triggered jobs.

Compatibility List

In order to have these improvements, you will need these minimum versions:
- API - v0.5.773
- Launcher - v6.0.23

Contributors

Thanks to the following contributors for making this feature possible:
- adong
- klu909
- scr-oath
- kumada626
- tk3fftk

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Database schema migrations October 18, 2019
October 18, 2019

Database schema migrations

Lakshminarasimhan Parthasarathy, Verizon Media

Database schema migrations

Screwdriver now supports database schema migrations using sequelize-cli migrations. When adding any fields to models in the data-schema, you will need to add a migration file. Sequelize-cli migrations keep track of changes to the database, helping with adding and/or reverting the changes to the DB. They also ensure models and migration files are in sync.

Why schema migrations?

Database schema migrations help to manage the state of schemas. Screwdriver originally did schema deployments during API deployments; while this was helpful for low-scale deployments, it also led to unexpected issues for high-scale deployments. For such high-scale deployments, migrations are more effective as they ensure quicker and more consistent schema deployment outside of API deployments. Moreover, API traffic is not served until database schema changes are applied and ready.

Cluster Admins

In order to run schema migrations, DDL sync via API should be disabled using the DATASTORE_DDL_SYNC_ENABLED environment variable, since this option is enabled by default.
- Both schema migrations and DDL sync via API should not be run together. Either option should suffice based on the scale of the Screwdriver deployment.
- Always create new migration files for any new DDL changes.
- Do not edit or remove migration files even after they have been migrated and are available in the database.
Screwdriver cluster admins can refer to the following documentation for more details on database schema migrations:
- README: https://github.com/screwdriver-cd/data-schema/blob/master/CONTRIBUTING.md#migrations
- Issue: https://github.com/screwdriver-cd/screwdriver/issues/1664
- Disable DDL sync via API: https://github.com/screwdriver-cd/screwdriver/pull/1756

Compatibility List

In order to use this feature, you will need these minimum versions:
- API (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.752

Contributors

Thanks to the following people for making this feature possible:
- parthasl
- catto
- dekus

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don't hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Improving Screwdriver’s meta tool October 2, 2019
October 2, 2019

Improving Screwdriver’s meta tool

Sheridan Rawlins, Architect, Verizon Media

Improving Screwdriver's meta tool

Over the past month there have been a few changes to the meta tool, mostly focused on using external metadata, but also on helping to identify and self-diagnose a few silent gotchas we found. Metadata is a structured key/value data store that gives you access to details about your build. Metadata is carried over to all builds that are part of the same event. At the end of each build, metadata is merged back into the event the build belongs to. This allows builds to share their metadata with other builds in the same event or with external builds.

External metadata

External metadata can be populated for a job in your pipeline using the requires clause that refers to it in the form sd@${pipelineID}:${jobName} (such as sd@12345:main). If sd@12345:main runs to completion and "triggers" a job or jobs in your pipeline, a file will be created with meta from that build in /sd/meta/sd@12345:main.json, and you can refer to any of its data with commands such as meta get someKey --external sd@12345:main. The above feature has existed for some time, but there were several corner cases that made it challenging to use external metadata:
1. External metadata was not provided when the build was not triggered from the external pipeline, such as by clicking the "Start" button, or via a scheduled trigger (i.e., through using the buildPeriodically annotation).
2. Restarting a previously externally-triggered build would not provide the external metadata. The notion of rollback can be facilitated by retriggering a deployment job, but if that deployment relies on metadata from an external trigger, it wouldn't be there before these improvements.

Fetching from lastSuccessfulMeta

Screwdriver has an API endpoint called lastSuccessfulMeta. While it is possible to use this endpoint directly, providing this ability directly in the meta tool makes it a little easier to just "make it so". By default now, if external metadata does not exist in the file /sd/meta/sd@12345:main.json, it is fetched from that job's lastSuccessfulMeta via the API call. Should this behavior not be desired, the flag --skip-fetch can be used to skip fetching. For rollback behavior, however, this feature by itself wasn't enough - consider a good deployment followed by a bad deployment. The "bad" deployment would most likely have deployed what, from a build standpoint, was "successful". When retriggering the previous job, because it is a manual trigger, there will be no external metadata, the lastSuccessfulMeta will most likely be fetched, and the newer, "bad" code would just get re-deployed again. For this reason the next feature was also added to meta - "caching external data in the local meta".

Caching external data in the local meta

External metadata for sd@12345:main (whether from a trigger or fetched from lastSuccessfulMeta) will now be stored into and searched first from the local meta under the key sd.12345.main. Note: no caching will be done when --skip-fetch is passed. This caching of external meta helps with a few use cases:
1. Rollback is facilitated because the external metadata at the time a job was first run is now saved and used when "Restart" is pressed for a job.
2. External metadata is now available throughout all jobs of an event - previously, only the triggered job or jobs would receive the external metadata, but because the local meta is propagated to every job in an event, the sd.12345.main key will be available to all jobs.
Since meta will look there first, any job in a workflow can use that same --external sd@12345:main with confidence that it will get the same metadata which was received by the triggered job.

Self-diagnosing gotchas

1. Meta uses a CLI parser called urfave/cli. Previously, it configured its CLI flags in both the "global" and "subcommand-specific" locations; this led to being able to pass flags in either location: before the subcommand like get or set, or after it. However, they would only be honored in the latter position. Instead, only --meta-space is global, and all other flags are per-subcommand. It is no longer possible to pass --external to the set subcommand.
2. Number of arguments - previously, if extra arguments were passed to flags that didn't take them, or if arguments were forgotten for flags that expected an argument, then it was possible to become confused about the key and/or value vs flags. Now, the arguments are strictly counted - 1 ("key") for get, and 2 ("key", "value") for set.

Vespa Product Updates, September 2019: Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container September 28, 2019
September 28, 2019

Vespa Product Updates, September 2019: Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container

Kristian Aune, Tech Product Manager, Verizon Media

In the August Vespa product update, we mentioned BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we're excited to share the following updates with you:

Tensor Float Support

Tensors now support float cell values, for example tensor<float>(key{}, x[100]). Using the 32 bit float type cuts memory footprint in half compared to the 64 bit double, and can increase ranking performance by up to 30%. Vespa's TensorFlow and ONNX integration now converts to float tensors for higher performance. Read more.

Reduced Memory Use for Text Attributes

Attributes in Vespa are fields stored in columnar form in memory for access during ranking and grouping. From Vespa 7.102, the enum store used to hold attribute data uses a set of smaller buffers instead of one large buffer. This typically cuts static memory usage by 5%, but more importantly reduces peak memory usage (during background compaction) by 30%.

Prometheus Monitoring Support

Integrating with the Prometheus open source monitoring solution is now easy to do using the new interface to Vespa metrics. Read more.

Query Dispatch Integrated in Container

The Vespa query flow is optimized for multi-phase evaluation over a large set of search nodes. Since Vespa 7.109.10, the dispatch function is integrated into the Vespa Container process, which simplifies the architecture with one less service to manage. Read more.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.

Bug fixes and improvements September 26, 2019
September 26, 2019

Bug fixes and improvements

Tiffany Kyi, Software Engineer, Verizon Media

Over the last month, we've made changes to improve Screwdriver performance, enhance the meta-cli, and fix some feature bugs.

Performance
- Removed aggregate view - this feature was making calls to get all builds for all jobs in a pipeline; removing this view greatly decreases the load on our API
- Added indexes for querying builds table - this change should speed up API calls to get jobs

Meta
- Allow json values in meta get/set - can set json values in meta using --json-value or -j
- --external works even if the external job did not trigger the current one; it will fetch meta from the external job's last successful build

Bugs
- Fix for ignoreCommitsBy - will skip ci for any authors that match the ignoreCommitsBy field (set by the cluster admin)
- Fix cmdPath when running with sourceDir - working directory is fixed now
- Version or tag-specific template and command URLs - Now, when switching to a different version or tag in the UI, the URL will update accordingly (i.e.: clicking on the latest tag for the python/validate template -> https://cd.screwdriver.cd/templates/python/validate_type/0.2.120 (corresponding version))
- Use new circuit-fuses - the latest package has enhanced logging options (https://github.com/screwdriver-cd/circuit-fuses/pull/23)
- PR authors should not be able to restart builds if restrictPR is on

Compatibility List

Note: You will need to pull in the buildcluster-queue-worker/queue-worker first, then the API, otherwise you will get data-schema failures. In order to have these improvements, you will need these minimum versions (please read the note above):
- API - v0.5.751
- UI - v1.0.447
- Store - v3.10.0
- Launcher - v6.0.18
- Build cluster queue worker - v1.3.7
- Queue worker - v2.7.11

Contributors

Thanks to the following contributors for making this feature possible:
- adong
- d2lam
- ibu1224
- jithin1987
- klu909
- parthasl
- scr-oath

Questions and Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System September 25, 2019
September 25, 2019

Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Kishor Patil, Sr. Principal Software Systems Engineer on the Verizon Media team. Kishor shares what’s new in Storm 2.0, an open source distributed real-time computation system, as well as how Verizon Media uses and contributes to Storm. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export August 19, 2019
August 19, 2019

Vespa Product Updates, August 2019: BM25 Rank Feature, Searchable Parent References, Tensor Summary Features, and Metrics Export

Kristian Aune, Tech Product Manager, Verizon Media

In the recent Vespa product update, we mentioned Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It's in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we're excited to share the following feature updates with you:

BM25 Rank Feature

The BM25 rank feature implements the Okapi BM25 ranking function and is a great candidate to use in a first phase ranking function when you're ranking text documents. Read more.

Searchable Reference Attribute

A reference attribute field can be searched using the document id of the parent document type instance as query term, making it easy to find all children for a parent document. Learn more.

Tensor in Summary Features

A tensor can now be returned in summary features. This makes rank tuning easier and can be used in custom Searchers when generating result sets. Read more.

Metrics Export

To export metrics out of Vespa, you can now use the new node metric interface. Aliasing metric names is possible and metrics are assigned to a namespace. This simplifies integration with monitoring products like CloudWatch and Prometheus. Learn more about this update.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you'd like to request.

Dash Open 11: Elide - Open Source Java Library - Easily Stand Up a JSON API or GraphQL Web Service August 14, 2019
August 14, 2019

Dash Open 11: Elide - Open Source Java Library - Easily Stand Up a JSON API or GraphQL Web Service

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Aaron Klish, a Distinguished Architect on the Verizon Media team. Aaron shares why Elide, an open source Java library that enables you to stand up a JSON API or GraphQL web service with minimal effort, was built and how others can use and contribute to Elide. Learn more at http://elide.io/. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Artifacts Preview August 7, 2019
August 7, 2019

Artifacts Preview

Alan Dong, Software Engineer, Verizon Media

We have recently rolled out a highly requested feature: Artifacts Preview. With this feature, users can view unit test results and click through to other files without needing to download the files locally from $SD_ARTIFACTS_DIR. An example of unit tests in the Screwdriver UI: We also made artifacts a separate route so users can share artifact links with teammates. You can see the live demo at: https://cd.screwdriver.cd/pipelines/1/builds/141890/artifacts

Implementation

We have gone through multiple iterations of design prior to implementation to reach the above result. We have also redesigned the look and feel based on user feedback. First, the main concern in our design process was security. We wanted to make sure the artifact viewer was code-safe in the unlikely case that a generated artifact contained malicious code. In order to protect our users, we decided to embed html inside an iframe with the sandbox attribute turned on. An iframe, or inline frame, already serves as a layer of separation between our application and the artifacts we're trying to load. Content in an iframe is able to access content from the parent frame only through a specific attribute. Using the sandbox attribute of an iframe allows for greater granularity and control over the framed content. Second, the next consideration in our design was authentication. We architected Screwdriver to be cloud-ready, with horizontal scalability in mind; thus, the main workhorses of the application, the UI, API, and Store, were split into microservices (see Overall Architecture). Due to this setup, all artifacts are stored in the Store, all data is shown in the UI, and the API acts as both the gateway and mediator between the UI and Store. The diagram below reflects this relationship: the UI communicates with the API, and the API sends back a 302 redirect to a Store link issued with a short-lived JWT token. After the link is returned, the UI makes a request with the link to get the appropriate artifacts from the Store. Third, the last main concern was user experience. We wanted to be able to preserve the user's content type when possible so users could view their artifacts natively in their proper format. The Store generally returns HTML with images, anchor links, CSS, or Javascript as relative paths, as shown in the following examples:

<img src="example.png" alt="image">
<a href="./example.html">
<link href="example.css">
<script src="example.js">

Our solution was to inject a customized script when the query parameter is ?type=preview, to replace the relative paths so their URLs are prefixed by the API. This change allowed us to only inject code if the user is previewing artifacts through the Screwdriver UI. Otherwise, we return the user's original content. One caveat due to this design is that since we don't override CSS content, some background URLs will not load correctly.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.722
- UI - v1.0.440
- Store - v3.10.0
- Launcher - v6.0.12

Contributors

Thanks to the following contributors for making this feature possible:
- adong

Questions & Suggestions

We'd love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Meet Yahoo Research at KDD 2019 August 2, 2019
August 2, 2019

Meet Yahoo Research at KDD 2019

By Kim Capps-Tanaka, Chief of Staff, Yahoo Research If you’re attending KDD in Anchorage, Alaska, the Yahoo Research team would love to meet you! Send us an email or tweet to discuss research or job opportunities on the team. In addition to hosting a booth, we’re excited to present papers, posters, and talks.  Sunday, August 4th - “Modeling and Applications for Temporal Point Processes”, Junchi Yan, Hongteng Xu, Liangda Li -  8am - 12pm, Summit 8-Ground Level, Egan Monday, August 5th - “Time-Aware Prospective Modeling of Users for Online Display Advertising”, Djordje Gligorijevic, Jelena Gligorijevic, Aaron Flores - 8:40am - 9am, Kahtnu 2 - Level 2, Dena’ina - “The Future of Ads”, Brendan Kitts - 3pm-3:30pm, Kahtnu 2 - Level 2, Dena’ina - “Learning from Multi-User Activity Trails for B2B Ad Targeting”, Shaunak Mishra, Jelena Gligorijevic, Narayan Bhamidipati - 4:35pm-4:55pm, Kahtnu 2- Level 2, Dena’ina - “Automatic Feature Engineering From Very High Dimensional Event Logs Using Deep Neural Networks”, Kai Hu, Joey Wang, Yong Liu, Datong Chen  - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall Tuesday, August 6th - “Predicting Different Type of Conversions using Multi-Task Learning”, Junwei Pan, Yizhi Mao, Alfonso Ruiz, Yu Sun, Aaron Flores - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Carousel Ads Optimization in Yahoo Gemini Native”, Oren Somekh, Michal Aharon, Avi Shahar, Assaf Singer, Boris Trayvas, Hadas Vogel, Dobri Dobrev - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Understanding Consumer Journey using Attention-based Recurrent Neural Networks”, Yichao Zhou, Shaunak Mishra, Jelena Gligorijevic, Tarun Bhatia, Narayan Bhamidipati - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall - “Recurrent Neural Networks for Stochastic Control in Real-Time Bidding”, Nicolas Grislain, Nicolas Perrin, Antoine Thabault - 7pm-9:30pm, Section 3 of Idlughet (Eklutna) Exhibit Hall * Bold authors denotes Yahoo Researchers Hope to see you at KDD!

Introducing Denali: An Open Source Themeable Design System July 31, 2019
July 31, 2019

Introducing Denali: An Open Source Themeable Design System

By Jazmin Orozco, Product Designer, Verizon Media As designers on the Platforms and Technology team at Yahoo (now Verizon Media), we understand firsthand that creating polished and intuitive user interfaces (UI) is a difficult task - even more so when projects do not have dedicated resources for UI design. In order to provide a solution to this, we created an easy plug and play approach called Denali. Denali is a comprehensive and customizable design system and CSS framework that provides a scalable approach to efficient UI design and development. Denali has worked so well for us internally that we’ve decided to share it with the open source community in the hope that your projects may also benefit from its use. Denali is rooted in our experience designing for a wide variety of platform interfaces including monitoring dashboards, CI/CD tools, security authentication, and localization products. Some of these platforms, such as Screwdriver and Athenz, are even open source projects themselves. When creating Denali we audited these platforms to create a library of visually consistent and reusable UI components. These became the core of our design system. We then translated the components into a CSS framework and applied the design system across our products. In doing so we were able to quickly create consistent experiences across our product family. As a whole, Denali allows us to unify the visual appearance of our platform product family, enhance our user’s experience, and facilitate efficient cross-functional front-end development. We encourage you to use Denali as the UI framework for your own open source projects. We look forward to your feedback, ideas, and contributions as we continue to improve and expand Denali. The Denali Design System simplifies the UI design and development process by providing: - A component library with corresponding front-end frameworks - Customization for themes - An icon library with a focus on data and engineering topics such as data visualization, CI/CD, and security - Design principles Component Library and Frameworks Denali’s component library contains 20+ individual component types with a corresponding CSS framework. Components are framework independent allowing you to use only what you need. Additionally, we’ve started building out other industry-leading frameworks such as Angular, Ember, and React. Theme Customization Denali’s components support theming through custom variables. This means their visual appearance can be adapted easily to fit the visual style of any brand or catered towards specific use cases while maintaining the same structure. Data and Engineering Focused Icon Library Denali’s custom icon library offers over 800 solid and outline icons geared towards engineering and data. Icons are available for use as svg, png, and as a font. We also welcome icon requests through GitHub. Design Principles Denali’s comprehensive Design Principles provide guidelines and examples on the proper implementation of components within a product’s UI to create the best user experience. Additionally, our design principles have a strong focus on accessibility best practices. We are excited to share Denali with the open source community. We look forward to seeing what you build with Denali as well as your contributions and feedback! Stay tuned for exciting updates and reach out to us on twitter @denali_design or via email. Acknowledgments Jay Torres, Chas Turansky, Marco Sandoval, Chris Esler, Dennis Chen, Jon Kilroy, Gil Yehuda, Ashley Wolf, Rosalie Bartlett

Introducing Denali: An Open Source Themeable Design System

July 31, 2019
Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem July 30, 2019
July 30, 2019

Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Ian Holmes, James Diss, and Varun Varma, from the Verizon Media team. Learn why they built and open sourced Panoptes, a global scale network telemetry ecosystem, and how others can use and contribute to Panoptes. Learn more at https://getpanoptes.io/. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 09: Panoptes - Open Source Global Scale Network Telemetry Ecosystem

July 30, 2019
Introducing Ariel: Open Source AWS Reserved Instances Management Tooling July 1, 2019
July 1, 2019

Introducing Ariel: Open Source AWS Reserved Instances Management Tooling

Sean Bastille, Chief Architect, Verizon Media Effectively using Reserved Instances (RIs) is a cornerstone component for managing costs in AWS. Proper evaluations of RIs can be challenging. There are many tools, each with their own nuances, that help evaluate RI needs. At Verizon Media, we built a tool to help manage our RIs, called Ariel, and today we are pleased to announce that we have open-sourced Ariel so that you can use and customize it for your own needs. Why We Built Ariel The main reason we chose to build Ariel was due to the limitations of currently available solutions. Amazon provides RI recommendations, both as an executive service, and through Cost Explorer, however, these tools: - Target break-even RI Utilization, without the flexibility to tune - Evaluate per-account RI need, or company-wide RI need, but are not capable of combining the views Additionally, Ariel has a sophisticated configuration allowing for multiple passes of RI evaluations targeting usage slope based thresholds and allowing for simultaneous classic and convertible RI recommendations. Whereas there are 3rd party vendor tools that help optimize RI utilization, we did not find an open source solution that was free to use and could be expanded upon by a community. How Ariel Reduces EC2 Costs RIs are a core component of cost management at Verizon Media. By using RIs, we reduce EC2 costs in some workloads by as much as 70%, which in turn reduces our AWS bill by about 25%.   Ariel helps us evaluate RI purchases by determining: - What our current RI demand is, taking into consideration existing RIs - Floating RIs, which are not used in the purchasing account, but are available to the company - Which specific accounts to make purchases in so the costs can be more closely aligned with P&Ls Explore and Contribute We invite you to use Ariel and join our community by developing more features and contributing to the code. If you have any questions, feel free to email my team.
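To make the tunable-utilization idea concrete, here is a minimal, hypothetical sketch (not Ariel's actual code or API) of how an RI recommendation could be derived from hourly on-demand usage with an adjustable utilization target instead of a fixed break-even point:

```python
# Hypothetical sketch, not Ariel's implementation: pick the largest RI count
# for one instance family whose average utilization stays above a tunable target.
from typing import List

def recommend_ri_count(hourly_usage: List[int], target_utilization: float = 0.8) -> int:
    """Return the largest RI count whose average utilization meets the target."""
    if not hourly_usage:
        return 0
    best = 0
    for count in range(1, max(hourly_usage) + 1):
        covered_hours = sum(min(count, used) for used in hourly_usage)
        utilization = covered_hours / (count * len(hourly_usage))
        if utilization >= target_utilization:
            best = count
    return best

# Example: usage that fluctuates between 4 and 10 instances over the day.
usage = [4, 5, 6, 10, 9, 8, 7, 6] * 3
print(recommend_ri_count(usage, target_utilization=0.8))  # -> 8
```

Ariel's multi-pass, slope-threshold configuration is a far more sophisticated version of this kind of tuning, but the core trade-off between coverage and utilization is the same.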

Introducing Ariel: Open Source AWS Reserved Instances Management Tooling

July 1, 2019
Trusted templates and commands June 26, 2019
June 26, 2019

Trusted templates and commands

Dekus Lam, Developer Advocate, Verizon Media Tiffany Kyi, Software Engineer, Verizon Media Currently, Screwdriver offers templates and commands which help simplify configurations across pipelines by encapsulating sets of predefined steps for jobs and save time by reusing prebuilt sets of instructions on jobs. Since these templates and commands are mostly sourced and powered by the developer community, it is better to have a standard process to promote some as certified, or as the Screwdriver team calls it, “Trusted”. Although the certification process may vary among teams and companies, Screwdriver provides the abstraction for system administrators to easily promote/demote partnering templates and commands. Certified, or “Trusted”, templates or commands will receive a special badge next to their name on both the search listing as well as the detailed page. “Trusted” toggle button for Screwdriver Admins: Compatibility List In order to use this feature, you will need these minimum versions: - [API] (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.705 - [UI] (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.432 Contributors Thanks to the following contributors for making this feature possible: - DekusDenial - tkyi Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Trusted templates and commands

June 26, 2019
Dash Open 08: Bullet - Open Source Real-Time Query Engine for Large Data Streams June 12, 2019
June 12, 2019

Dash Open 08: Bullet - Open Source Real-Time Query Engine for Large Data Streams

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Nate Speidel, a Software Engineer at Verizon Media. Nate shares why his team built and open sourced Bullet, a real-time query engine for very large data streams, and how others can use and contribute to Bullet. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 08: Bullet - Open Source Real-Time Query Engine for Large Data Streams

June 12, 2019
Shared Verizon Media’s AI Solutions at the AI Accelerator Summit - Automobile Traffic Flow Monitoring, Cellular Network Performance Prediction, IoT Analytics, and Threat Detection June 8, 2019
June 8, 2019

Shared Verizon Media’s AI Solutions at the AI Accelerator Summit - Automobile Traffic Flow Monitoring, Cellular Network Performance Prediction, IoT Analytics, and Threat Detection

Chetan Trivedi, Head of Technical Program Management (Verizon Solutions Team), Verizon Media I recently spoke at the AI Accelerator Summit in San Jose. During my presentation, I shared a few of Verizon Media’s AI Solutions via four machine learning use cases, including: - Cellular Network Performance Prediction - Our team implemented a time series prediction model for the performance of base station parameters such as bearer drop, SIP, and handover failure. - Threat Detection System - DDoS (Distributed Denial of Service) use case where we implemented a real-time threat detection capability using time series data. - Automobile Traffic Flow Monitoring - A collaboration with a city to identify traffic patterns at certain traffic junctions and streets to provide insights so they can improve traffic patterns and also address safety concerns. - IoT Analytics - Detecting vending machine anomalies and addressing them before dispatching the service vehicle with personnel to fix the problem which is very costly for the businesses. During the conference, I heard many talks that reinforced common machine learning and AI industry themes. These included: - Key factors to consider when selecting the right use cases for your AI/ML efforts include understanding your error tolerance and ensuring you have sufficient training data. - Implementing AI/ML at scale (with a high volume of data) and moving towards deep learning for supported use cases, where data is highly dimensional and/or higher prediction accuracy is required with enough data to train deep learning models. - Using ensemble learning techniques such as bagging, boosting or other variants of these methods. At Verizon Media, we’ve built and open sourced several helpful tools that are focused on big data, machine learning, and AI, including: - DataSketches - high-performance library of stochastic streaming algorithms commonly called “sketches” in the data sciences. - TensorFlowOnSpark - brings TensorFlow programs to Apache Spark clusters. - Trapezium - framework to build batch, streaming and API services to deploy machine learning models using Spark and Akka compute. - Vespa - big data serving engine. If you’d like to discuss any of the above use cases or open source projects, feel free to email me.
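As a generic illustration of the bagging technique mentioned above (this is a textbook example, not one of the Verizon Media models from the talk), an ensemble of decision trees trained on bootstrap samples takes only a few lines with scikit-learn:

```python
# Generic bagging example on synthetic data; not related to any production model.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging averages many trees trained on bootstrap samples to reduce variance.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.3f}")
```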

Shared Verizon Media’s AI Solutions at the AI Accelerator Summit - Automobile Traffic Flow Monitoring, Cellular Network Performance Prediction, IoT Analytics, and Threat Detection

June 8, 2019
Apache Storm 2.0 Improvements May 30, 2019
May 30, 2019

Apache Storm 2.0 Improvements

By Kishor Patil, Principal Software Systems Engineer at Verizon Media, and PMC member of Apache Storm & Bobby Evans, Apache Member and PMC member of Apache Hadoop, Spark, Storm, and Tez We are excited to be part of the new release of Apache Storm 2.0.0. The open source community has been working on this major release, Storm 2.0, for quite some time. At Yahoo we had a long time and strong commitment to using and contributing to Storm; a commitment we continue as part of Verizon Media. Together with the Apache community, we’ve added more than 1000 fixes and improvements to this new release. These improvements include sending real-time infrastructure alerts to the DevOps folks running Storm and the ability to augment ingested content with related content, thereby giving the users a deeper understanding of any one piece of content.   Performance Performance and utilization are very important to us, so we developed a benchmark to evaluate various stream processing platforms and the initial results showed Storm to be among the best. We expect to release new numbers by the end of June 2019, but in the interim, we ran some smaller Storm specific tests that we’d like to share. Storm 2.0 has a built-in load generation tool under examples/storm-loadgen. It comes with the requisite word count test, which we used here, but also has the ability to capture a statistical representation of the bolts and spouts in a running production topology and replay that load on another topology, or another version of Storm. For this test, we backported that code to Storm 1.2.2. We then ran the ThroughputVsLatency test on both code bases at various throughputs and different numbers of workers to see what impact Storm 2.0 would have. These were run out of the box with no tuning to the default parameters, except to set max.spout.pending in the topologies to be 1000 sentences, as in the past that has proven to be a good balance between throughput and latency while providing flow control in the 1.2.2 version that lacks backpressure. In general, for a WordCount topology, we noticed 50% - 80% improvements in latency for processing a full sentence. Moreover, 99 percentile latency in most cases, is lower than the mean latency in the 1.2.2 version. We also saw the maximum throughput on the same hardware more than double. Why did this happen? STORM-2306 redesigned the threading model in the workers, replaced disruptor queues with JCTools queues, added in a new true backpressure mechanism, and optimized a lot of code paths to reduce the overhead of the system. The impact on system resources is very promising. Memory usage was untouched, but CPU usage was a bit more nuanced. At low throughput (< 8000 sentences per second) the new system uses more CPU than before. This can be tuned as the system does not auto-tune itself yet. At higher rates, the slope of the line is much lower which means Storm has less overhead than before resulting in being able to process more data with the same hardware. This also means that we were able to max out each of these configurations at > 100,000 sentences per second on 2.0.0 which is over 2x the maximum 45,000 sentences per second that 1.2.2 could do with the same setup. Note that we did nothing to tune these topologies on either setup. With true backpressure, a WordCount Topology could consistently process 230,000 sentences per second by disabling the event tracking feature. 
That equates to over 2 million messages per second being processed on a single node. Scalability In 2.0, we have laid the groundwork to make Storm even more scalable. Workers and supervisors can now heartbeat directly into Nimbus instead of going through ZooKeeper, resulting in the ability to run much larger clusters out of the box. Developer Friendly Prior to 2.0, Storm was primarily written in Clojure. Clojure is a wonderful language with many advantages over pure Java, but its prevalence in Storm became a hindrance for many developers who weren’t very familiar with it and didn’t have the time to learn it. Due to this, the community decided to port all of the daemon processes over to pure Java. We still maintain a backward compatible storm-clojure package for those that want to continue using Clojure for topologies. Split Classpath In older versions, Storm was a single jar that included code for the daemons as well as the user code. We have now split this up, and storm-client provides everything needed for your topology to run. Storm-core can still be used as a dependency for tests that want to run a local mode cluster, but it will pull in more dependencies than you might expect. To upgrade your topology to 2.0, you’ll just need to switch your dependency from storm-core-1.2.2 to storm-client-2.0.0 and recompile. Backward Compatible Even though Storm 2.0 is API compatible with older versions, upgrades can be difficult when running a hosted multi-tenant cluster. Coordinating upgrading the cluster with recompiling all of the topologies can be a massive task. Starting in 2.0.0, Storm has the option to run workers for topologies submitted with an older version with a classpath for a compatible older version of Storm. This important feature, which was developed by our team, allows you to upgrade your cluster to 2.0 while still letting you upgrade your topologies whenever they are recompiled to use newer dependencies. Generic Resource Aware Scheduling With the newer generic resource aware scheduling strategy, it is now possible to specify generic resources along with CPU and memory, such as network, GPU, and any other generic cluster-level resource. This allows topologies to specify such generic resource requirements for components, resulting in better scheduling and stability. More To Come Storm is a secure, enterprise-ready stream processing platform, but there is always room for improvement, which is why we’re adding support to run workers in isolated, locked-down containers so there is less chance of malicious code using a zero-day exploit in the OS to steal data. We are working on redesigning metrics and heartbeats to be able to scale even better and, more importantly, automatically adjust your topology so it can run optimally on the available hardware. We are also exploring running Storm on other systems, to provide a clean base to run not just on Mesos but also on YARN and Kubernetes. If you have any questions or suggestions, please feel free to reach out via email. P.S. We’re hiring! Explore the Big Data Open Source Distributed System Developer opportunity here.

Apache Storm 2.0 Improvements

May 30, 2019
Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements May 29, 2019
May 29, 2019

Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements

Kristian Aune, Tech Product Manager, Verizon Media In a recent post, we mentioned Tensor updates, Query tracing and coverage. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to evolve. For May, we’re excited to share the following feature updates with you: Multithreaded Disk Index Fusion Content nodes are now able to sustain a higher feed rate by using multiple threads for disk index fusion. Read more. Feeding Improvements Cluster-internal communications are now multithreaded out of the box, for high throughput feeding operations. This fully utilizes a 10 Gbps network and improves utilization of high-CPU content nodes. Ideal State Optimizations Whenever the content cluster state changes, the ideal state is calculated. This is now optimized (faster and runs less often) and state transitions like node up/down will have less impact on read and write operations. Learn more in the dynamic data distribution documentation. Download Machine Learning Models During Deploy One procedure for using/importing ML models to Vespa is to put them in the application package in the models directory. Applications where models are trained frequently in some external system can refer to the model by URL rather than including it in the application package. This use case is now documented in deploying remote models, and solves the challenge of deploying huge models. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, May 2019: Deploy Large Machine Learning Models, Multithreaded Disk Index Fusion, Ideal State Optimizations, and Feeding Improvements

May 29, 2019
Custom Source Directory May 24, 2019
May 24, 2019

Custom Source Directory

Min Zhang, Software Engineer, Verizon Media Tiffany Kyi, Software Engineer, Verizon Media Previously, you were limited to having one screwdriver.yaml at the root of each SCM repository. This prevented users from running workflows based on subdirectories in a monorepo. Now, you can specify a custom source directory for your pipeline, which means you can create multiple pipelines on a single repository! Usage The directory path is relative to the root of the repository. You must have a screwdriver.yaml under your source directory. Example Given a repository with the file structure depicted below:
┌── README.md
├── screwdriver.yaml
├── myapp1/
│   └── test.js
...
├── myapp2/
│   ├── app/
│   │   ├── main.js
│   │   ├── ...
│   │   └── package.json
│   └── screwdriver.yaml
│   ...
Create pipeline with source directory Update pipeline with source directory In this example, jobs that have requires: [~commit, ~pr] will be triggered if there are any changes to files under myapp2. Caveats - This feature is only available for the Github SCM right now. - If you use sourcePaths together with a custom source directory, the scope of the sourcePaths is limited to your source directory. You cannot listen on changes that are outside your source directory. Note that the path for your sourcePaths is relative to the root of the repository, not your source directory. For example, if you want to add sourcePaths to listen on changes to main.js and screwdriver.yaml, you should set: sourcePaths: [myapp2/app/main.js, myapp2/screwdriver.yaml] If you try to set sourcePaths: [app/main.js], it will not work, as it is missing the source dir myapp2 and you cannot set a relative source path. If you try to set sourcePaths: [myapp1/test.js], it will not work, as it is outside the scope of your source directory, myapp2. - The screwdriver.yaml must be located at the root of your custom source directory. Compatibility List In order to use this feature, you will need these minimum versions: - [API] (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.692 - [UI] (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.425 - [Launcher] (https://hub.docker.com/r/screwdrivercd/launcher) - v6.0.4 Contributors Thanks to the following contributors for making this feature possible: - minz1027 - tkyi Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Custom Source Directory

May 24, 2019
Announcing Prototrain-ranker: Open Source Search and Ranking Framework May 16, 2019
May 16, 2019

Announcing Prototrain-ranker: Open Source Search and Ranking Framework

Huy Nguyen, Research Engineer, Verizon Media & Eric Dodds, Research Scientist, Verizon Media E-commerce fashion and furniture sites use a fundamentally different way of searching for content based on visual similarity. We call this “Search 2.0” in homage to Andrej Karpathy’s Software 2.0 essay. Today we’re announcing the release of an open source ranking framework called prototrain-ranker which you can use in your modern search projects. This is based on our extensive research in search technology and ranking optimizations. We’ll describe the visual search problem, how it fits into a developing trend of search engines and the evolving technologies surrounding the industry, and why we open sourced our model and machine learning framework, inviting you to use and work with us to improve. The Search 1.0 stack is one that many engineers and search practitioners are familiar with. It involves indexing documents and relies upon matching keywords to terms in a collection of documents to surface relevant content at query time. In contrast, Search 2.0 relies upon “embeddings” rather than documents, and k-nearest-neighbors retrieval rather than term matching to surface relevant content. The programmer does not directly specify the map from content to embeddings. Instead, the programmer specifies how this map is derived from data. Think of embeddings as points in a high-dimensional space that are used to represent some piece of content or metadata. In a Search 2.0 system, embeddings lying close to each other in this space are more highly “related” or “relevant” than points that are far apart. Instead of parsing a query for specific terms and then matching for those terms in our document index, a Search 2.0 system would encode the query into an embedding space and retrieve the data associated with nearby embeddings. Prototrain-ranker provides two things: (1) a “ranker” model for mapping content to embeddings and performing search, and (2) our “prototrain” framework for training prototype machine learning models like our ranker Search 2.0 system. Why Search 2.0 Whether we’re searching over videos, images, text, or other media, we can represent each type of data as an embedding using the proper deep learning techniques. Representing metadata as embeddings in high-dimensional space opens up the world of search to the powerful machinery of deep learning tools. We can learn ranking functions and directly encode “relevance” into embeddings, avoiding the need for brittle and hand-engineered ranking functions. For example, it would be error-prone and tedious to program a Search 1.0 search engine to respond to queries like “images with a red bird in the upper right-hand corner”. Certainly one could build specific classifiers for each one of these attributes (color, object, location) and index them. But each individual classifier and rule to parse the results would take work to build and test, with any new attribute entailing additional work and opportunities for errors and brittleness. Instead one could build a Search 2.0 system by obtaining pairs of images and descriptions that directly capture one’s notion of “relevance” to train an end-to-end ranking model. The flexibility of this approach – defining relevance as an abstract distance using examples rather than potentially brittle rules – allows several other capabilities in a straightforward manner. These capabilities include multi-modal input (e.g. 
text with an image), interpolating between queries (“something between this sofa and that one”), and conditioning a query (“a dress like this, but in any color”). Reframing search as a nearest-neighbor retrieval also has other benefits. We separate the process of ranking from the process of storing data. In doing so, we are able to reduce rules and logic of Search 1.0 matching and ranking into a portable matrix multiplication routine. This makes the search engine massively parallel and allows it to take advantage of GPU hardware which has been optimized over decades to efficiently execute matrix multiplication. Why we open sourced prototrain-ranker The code we open source today enables a key component in the Search 2.0 system. It allows one to “learn” embeddings by defining pairs of relevant and irrelevant data items. We provide as an example the necessary processing to train the model on the Stanford Online Products dataset, which provides multiple images of each of the thousands of products. The notion of relevance here is that two images contain the same item. We also use the prototrain framework for training other machine learning models such as image classifiers. You can too. Please check out the framework and/or the ranker model. We hope you will have questions or comments, and will want to contribute to the project. Engage with us via GitHub or email directly if you have questions.
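To make the retrieval step concrete, here is a minimal sketch (independent of the prototrain-ranker code) of how Search 2.0 ranking reduces to a matrix product plus a top-k selection, assuming item embeddings have already been produced by a trained encoder:

```python
# Illustrative Search 2.0 retrieval: rank a catalog of embeddings against a query.
# The random vectors stand in for the output of a trained model such as the ranker.
import numpy as np

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(10_000, 128)).astype(np.float32)   # the "index"
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def search(query_embedding: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k nearest items by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = item_embeddings @ q          # one matrix-vector product scores everything
    return np.argsort(-scores)[:k]

query = rng.normal(size=128).astype(np.float32)  # would come from encoding the user's query
print(search(query, k=5))
```

Because the entire ranking step is a dense matrix operation, it parallelizes trivially and maps naturally onto GPU hardware, which is the portability benefit described above.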

Announcing Prototrain-ranker: Open Source Search and Ranking Framework

May 16, 2019
Announcing the new Bay Area CI/CD and DevOps Meetup - Join us on May 21st at the Yahoo Campus in Sunnyvale May 13, 2019
May 13, 2019

Announcing the new Bay Area CI/CD and DevOps Meetup - Join us on May 21st at the Yahoo Campus in Sunnyvale

By Ashley Wolf, Open Source Program Manager, Verizon Media Continuous Delivery (CD) enables software development teams to move faster and adapt to users’ needs quicker by reducing the inherent friction associated with releasing software changes. Releasing new software was once considered risky. By implementing CD, we confront the fragility and improve engineering resilience by delivering new software constantly and automatically. At Yahoo, we built a tool called Screwdriver that enabled us to implement CD at incredible scale. In 2016, Yahoo open sourced Screwdriver.cd, a streamlined build system designed to enable Continuous Delivery to production at scale for dynamic infrastructure. In the spirit of open source and community, Verizon Media’s Open Source and Screwdriver teams (formerly Yahoo) started the Bay Area CI/CD and DevOps Meetup to build and grow community around continuous delivery. We are planning frequent meetups hosted at Yahoo where industry experts from the Bay Area will be invited to share stories about their CI/CD and DevOps experiences. We invite you to join us. Our first meetup is on May 21st, 5pm to 8:30pm, at Yahoo in Sunnyvale. Learn from speakers at Walmart Labs, SMDV, and Aeris. RSVP here. Agenda 5-5:45: Pizza & Networking. 5:45-6: Welcome & Introductions. 6-6:30: “Supercharging the CI/CD pipelines at Walmart with Concord” Vilas Veeraraghavan, Director of Engineering, Walmart Labs This talk will focus on how Concord (https://concord.walmartlabs.com/) – an open source workflow orchestration tool built at Walmart – helped supercharge the continuous delivery pipelines used by application teams. We will start with an overview of the state of CI/CD in the industry and then showcase the progress made at Walmart and the upcoming innovations we are working on. 6:30-7: “Successful Continuous Build, Integration & Deployment + Continuous or Controlled Delivery?” Karthi Sadasivan, Director of Engineering (DevOps), Aeris This talk will explore ways to improve speed, quality, and security, as well as, how to align tools, processes, and people. 7-7:30: “Practical CI/CD for React Native Apps” Ariya Hidayat, EIR, SMDV React Native emerges as a popular solution to build Android and iOS applications from a single code base written in JavaScript/TypeScript. For teams just starting to embrace React Native, the best practices to ensure rock-solid development and deployment are not widely covered yet. In this talk, we will discuss practical CI/CD techniques that allow your team to accelerate the process towards the development of world-class, high-quality React Native apps: - Automated build and verification for every single revision - Continuous check for code quality metrics - Easy deployment to the QA/QE/Verification team 7:30-8:30: Open Discussion. Using one of the available microphones, share a question or thought with attendees. Collectively, let’s share and discuss CICD/DevOps struggles and opportunities. Speakers - Vilas Veeraraghavan, Director of Engineering, Walmart Labs Vilas joined Walmart Labs in 2017 and leads the teams responsible for the continuous integration, testing, and deployment pipelines for eCommerce and Stores. Prior to joining Walmart Labs, he had long stints at Comcast and Netflix where he wore many hats as automation, performance, and failure testing lead. - Karthi Sadasivan, Director of Engineering (DevOps), Aeris Karthi heads the DevOps Practice at Aeris Communications. 
She has 18+ years of global IT industry experience with expertise in Product Engineering Services, DevOps, Agile Engineering, and Continuous Delivery. Karthi is a DevOps evangelist, practitioner, and enabler. She enjoys architecting, implementing, and delivering end-to-end DevOps solutions across multiple industry domains. A thought leader and solution finder, she has a strong passion for solving business problems by bringing people, processes, and technologies together. - Ariya Hidayat, EIR, SMDV Ariya’s official day job is to run start-up engineering teams, and he has done that a couple of times already. Yet he’s equally excited about building open-source tools such as PhantomJS (the world’s first headless browser) and Esprima (one of the most popular npm modules). Through his active involvement in the development communities, Ariya is on a mission to spread the gospel of engineering excellence, and so far he has delivered over a hundred tech talks on various subjects. Meetups are a great way to learn about the latest technology trends, open source projects you can join, and networking opportunities that can turn into your next great job. So come for the talks, stay for the conversation, pizza, refreshments, and cookies. Invest in your tech career and meet people who care about the things you care about. Want to get involved? - Join the Meetup group to find out about upcoming events. - RSVP for the May 21st Meetup. - Volunteer at an upcoming meetup. - Apply to be a speaker. - Ask us about anything, we’re open to working with you.

Announcing the new Bay Area CI/CD and DevOps Meetup - Join us on May 21st at the Yahoo Campus in Sunnyvale

May 13, 2019
Expanding Environment Variables May 13, 2019
May 13, 2019

Expanding Environment Variables

Dao Lam, Software Engineer, Verizon Media Previously, Screwdriver users had to rely on string substitutions or create a step to export variables to get environment variables evaluated. For example: RESTORE_STATEFILE=`echo ${RESTORE_STATEFILE} | envsubst '${SD_ARTIFACTS_DIR}'` Or steps: - export-env: export RESTORE_STATEFILE=${SD_ARTIFACTS_DIR}/statefile With this change, users can now expand environment variables within the “environment” field in screwdriver.yaml like below: jobs: main: image: node:10 environment: FOO: hello BAR: "${FOO} world" steps: - echo: echo $BAR # prints “hello world” requires: [~pr, ~commit] Setting default cluster environment variables (for cluster admins) Cluster admins can now set default environment variables to be injected into user builds. They can be configured in the API config under the field builds: { environment: {} } or via CLUSTER_ENVIRONMENT_VARIABLES. Read more about how to configure the API here. Order of evaluation Environment variables are now evaluated in this order: - User secrets - Base ENV set by launcher such as SD_PIPELINE_ID, SD_JOB_ID, etc. - Cluster ENV set by cluster admin - Build ENV set by user in the screwdriver.yaml environment field Important note when pulling in this feature This new version of API v0.5.677 needs to be pulled in together with the new version of LAUNCHER 6.0.1 because it includes a breaking change (GET /v4/builds now returns environment as an array to ensure in-order evaluation inside launcher). Please schedule a short downtime when pulling this feature into your cluster to ensure the API and LAUNCHER are on compatible versions. The versions working before this change would be: API (v0.5.667) and launcher (v5.0.75). Compatibility List In order to use this feature, you will need these minimum versions (please read the note above): - [API] (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.677 - [LAUNCHER] (https://hub.docker.com/r/screwdrivercd/launcher) - v6.0.1 Contributors Thanks to the following contributors for making this feature possible: - d2lam Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
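The in-order evaluation above can be pictured with a small, hypothetical sketch (not Screwdriver's launcher code): each variable is expanded against the values resolved so far, which is why BAR can reference FOO:

```python
# Hypothetical illustration of in-order environment expansion; not the launcher's code.
import re

def expand_in_order(env_pairs):
    resolved = {}
    for name, value in env_pairs:
        resolved[name] = re.sub(
            r"\$\{(\w+)\}",
            lambda m: resolved.get(m.group(1), m.group(0)),  # unknown names are left as-is
            value,
        )
    return resolved

print(expand_in_order([("FOO", "hello"), ("BAR", "${FOO} world")]))
# {'FOO': 'hello', 'BAR': 'hello world'}
```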

Expanding Environment Variables

May 13, 2019
Vespa use case: shopping May 3, 2019
May 3, 2019

Vespa use case: shopping

Imagine you are tasked with creating a shopping website. How would you proceed? What tools and technologies would you choose? You need a technology that allows you to create data-driven navigational views as well as search and recommend products. It should be really fast, and able to scale easily as your site grows, both in number of visitors and products. And because good search relevance and product recommendation drive sales, it should be possible to use advanced features such as machine-learned ranking to implement them. Vespa - the open source big data serving engine - allows you to implement all these use cases in a single backend. As it is a general engine for low-latency computation, it can be hard to know where to start. To help with that, we have provided a detailed shopping use case with a sample application. This sample application contains a fully-functional shopping-like front-end with reasonably advanced functionality right out of the box, including sample data. While this is an example of a searchable product catalog, with customization it could be used for other application types as well, such as video and social sites. The features highlighted in this use case are: - Grouping - used for instance in search to aggregate the results of the query into categories, brands, item ratings, and price ranges. - Partial update - used in liking product reviews. - Custom document processors - used to intercept the feeding of product reviews to update the product itself. - Custom handlers and configuration - used to power the front-end of the site. The goal with this is to start a new series of example applications that each showcase different features of Vespa and show them in the context of practical applications. The use cases can be used as starting points for new applications, as they contain fully-functional Vespa application packages, including sample data for getting started quickly. The use cases come in addition to the quick start guide, which gives a very basic introduction to get up and running with Vespa, and the tutorial, which is much more in-depth. With the use case series we want to fill the gap between these two with something closer to the practical problems users want to solve with Vespa. Take a look for yourself. More information can be found at https://docs.vespa.ai/documentation/use-case-shopping.html
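As a rough sketch of what the partial-update feature looks like in practice, liking a review could be a single-field update against Vespa's Document V1 HTTP API; the namespace, document type, document id, and field name below are hypothetical, not the sample application's actual schema:

```python
# Hedged sketch of a Vespa partial update (Document V1 API); names are hypothetical.
import requests

doc_url = ("http://localhost:8080/document/v1/"
           "shopping/review/docid/review-123")       # hypothetical namespace/type/id
resp = requests.put(doc_url, json={
    "fields": {
        "likes": {"increment": 1}                    # update one field, not the whole document
    }
})
resp.raise_for_status()
print(resp.json())
```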

Vespa use case: shopping

May 3, 2019
Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics May 1, 2019
May 1, 2019

Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Paul Donnelly, a Principal Engineer at Verizon Media, interviews Eddie Bortnikov, Senior Director of Research, and Eshcar Hillel, Senior Research Scientist. Eddie and Eshcar share how Druid (open source data store designed for sub-second queries on real-time and historical data) inspired their team to build Oak, an open source scalable concurrent key-value map for big data analytics, and how companies can use and contribute to Oak. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics

May 1, 2019
Dash Open 06: Apache Omid - Open Source Transaction Processing Platform for Big Data April 15, 2019
April 15, 2019

Dash Open 06: Apache Omid - Open Source Transaction Processing Platform for Big Data

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Paul Donnelly, a Principal Engineer at Verizon Media, interviews Eddie Bortnikov, Senior Director of Research and Ohad Shacham, Senior Research Scientist. Eddie and Ohad share the inspiration behind Omid, an open source transaction processing platform for Big Data, and how companies can use and contribute to Omid. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 06: Apache Omid - Open Source Transaction Processing Platform for Big Data

April 15, 2019
Pull Request chain April 15, 2019
April 15, 2019

Pull Request chain

The pull request chaining feature expands the workflow capabilities available when running pull request builds. By default, during a pull request, Screwdriver will run only those jobs that have ~pr in their requires field. But with the pull request chain feature turned on, there is no such restriction. shared: image: node:8 annotations: screwdriver.cd/chainPR: true jobs: first-job: requires: [ ~pr, ~commit ] steps: - echo: echo "this is first job." second-job: requires: [ first-job ] steps: - echo: echo "this is second job." With the chainPR annotation set to true, the pipeline workflow config above will run second-job after first-job on a pull request. Compatibility List In order to use this feature, you will need these minimum versions: - API - v0.5.641 - UI - v1.0.396 Contributors Thank you to the following contributors for making this feature possible: - Hiroki tktk, Software Engineer, Yahoo Japan - Yomei K, Software Engineer, Yahoo Japan - Teppei Minegishi, Software Engineer, Yahoo Japan - Yuichi Sawada, Software Engineer, Yahoo Japan - Yoshika Shota, Software Engineer, Yahoo Japan Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Pull Request chain

April 15, 2019
Build Metrics April 11, 2019
April 11, 2019

Build Metrics

Dao Lam, Software Engineer, Verizon Media Dekus Lam, Software Engineer, Verizon Media Screwdriver just released a new feature called Build Metrics, which gives users more insight into their pipeline, build, and step trends. Viewing the metrics You can now navigate to the Metrics tab in your pipeline to view these metrics graphs or navigate to https://${SD_UI_URL}/pipelines/${PIPELINE_ID}/metrics. The first graph shows metrics across different events for the pipeline. An event is a series of builds triggered by a single action, which could be a commit, an external pipeline trigger, or a manual start. (Read more about workflow). The graph illustrates the following data about your pipeline: - Total duration of each event - Total time it takes to pull images across builds in each event - Total time the builds spend in the queue in each event The second graph shows a build duration breakdown for corresponding events from the first graph. The third graph shows the step breakdown across multiple builds for a specific job. Chart Interactions: - Legend to filter visibility of data - Bar graph tooltip on hover for more details about the selected metric data - Copy-to-clipboard button inside tooltip - Preset time ranges & custom date ranges - Toggle between UTC and Local date time - Toggle for trendline view - Toggle for viewing only successful build data - Drag-and-zoom & button to reset zoom level - Deep links to step or build logsCompatibility List In order to use this feature, you will need these minimum versions: - [UI] (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.408 - [API] (https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.641Contributors Thanks to the following contributors for making this feature possible: - chasturansky - d2lam - dekuslam - parthasl Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Build Metrics

April 11, 2019
Dash Open 05: Makeskill Design Kit, the Open Source Multimodal Rapid Prototyping Suite for Alexa April 10, 2019
April 10, 2019

Dash Open 05: Makeskill Design Kit, the Open Source Multimodal Rapid Prototyping Suite for Alexa

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, I interview Lauren Tsung, who was previously a Sr. Designer for Yahoo Mail and Anna Shainskaya, a Sr. Designer for Yahoo Mail at Verizon Media. Lauren and Anna share their journey from designing chatbots to publishing Makeskill, an open source project for rapid prototyping Alexa Skills. Audio and transcript available here. You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 05: Makeskill Design Kit, the Open Source Multimodal Rapid Prototyping Suite for Alexa

April 10, 2019
Meta Event Label and Stop a Build April 9, 2019
April 9, 2019

Meta Event Label and Stop a Build

Tiffany Kyi, Software Engineer, Verizon Media We’ve introduced new UI features in Screwdriver to the pipeline events page! You can now: - use meta to label events - stop a running build from the pipeline event graph Meta Event Label You can label your events using the label key in metadata. This label can be useful when trying to identify which event to rollback. To label an event, set the meta label key in your screwdriver.yaml. It will appear on the UI after the build is complete. Example screwdriver.yaml: jobs: main: steps: - set-label: | meta set label VERSION_3.0 # this will show up in your pipeline events page Example result: Stop a Build When a build is running or queued, you can now stop the build using the dropdown from the pipeline events graph. Compatibility List In order to use this feature, you will need these minimum versions: - API - v0.5.639 - UI - v1.0.402Contributors Thanks to the following contributors for making this feature possible: - tkyi Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

Meta Event Label and Stop a Build

April 9, 2019
Dash Open 04: Frode Lundgren - Building and Open Sourcing Vespa, the Big Data Serving Engine April 4, 2019
April 4, 2019

Dash Open 04: Frode Lundgren - Building and Open Sourcing Vespa, the Big Data Serving Engine

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Amber Wilson interviews Frode Lundgren, Director of Engineering for Vespa at Verizon Media. Frode discusses the inspiration behind building Vespa and shares thoughts on personalized search. Audio and transcript available here. You can also listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.

Dash Open 04: Frode Lundgren - Building and Open Sourcing Vespa, the Big Data Serving Engine

April 4, 2019
Panoptes, an open source distributed network telemetry ecosystem is now available on Docker April 1, 2019
April 1, 2019

Panoptes, an open source distributed network telemetry ecosystem is now available on Docker

James Diss, Software Systems Engineer, Verizon Media Panoptes is an open source network telemetry system that we built to replace a myriad of tools and scripts built up over time to monitor the fleet of hosts worldwide. The core framework is designed to be extended through the use of plugins which allows the many different devices and device types to be monitored. Within Verizon Media, Panoptes provides the data collection layer that we feed to other projects to allow for visualization of device health according to need, alerting and collation of information. Normally the components of Panoptes run as a distributed system that allows for horizontal scaling and sharding to different geographical and virtual locations, handling thousands of endpoints, but this can be a difficult environment to simulate. Therefore, we have created a docker image which holds the entire structure of Panoptes in a single container. I hasten to add that this is not a production instance of Panoptes, and we would not recommend trying to use the docker container “as-is”. Rather, it is more of a workbench installation to examine in motion. The container is entirely self-contained and builds Panoptes using the freely available package with pip install yahoo-panoptes; it is open-source and built on the ubiquitous Ubuntu 18.04 (Bionic Beaver). A set of scripts are supplied that allow for examination of the running container, and it also runs a Grafana instance on the container to see the data being collected. A dashboard already exists and is connected to the internal data store (influxdb) collecting the metrics. If you would like to get started right away, you can have Panoptes running very easily by using the prebuilt docker hub image; docker pull panoptes/panoptes_docker docker run -d \ --sysctl net.core.somaxconn=511 \ --name="panoptes_docker" \ --shm-size=2G \ -p 127.0.0.1:8080:3000/tcp \ panoptes/panoptes_docker This pulls the docker image from docker hub, then runs the image with a couple of parameters. In order, the “sysctl” command allows redis to run, the “name” is the name that the running container will be given (docker ps shows the currently running containers), and the “shm-size” reserves memory for the container. The “p” parameter exposes port 3000 inside the container to port 8080 on the outside; this is to allow the Grafana library to communicate outside of the container. If you’re more interested in building the image yourself (which would allow you to play with the configuration files that are dropped in place during the build), clone the repo and build from source. git clone https://github.com/yahoo/panoptes_docker.git && cd panoptes_docker docker build . -t panoptes_docker Once the image is built, run with: docker run -d \ --sysctl net.core.somaxconn=511 \ --name="panoptes_docker" \ --shm-size=2G \ -p 127.0.0.1:8080:3000/tcp \ panoptes_docker Here are a few useful links and references: Docker Resources - Docker desktop for Mac https://docs.docker.com/docker-for-mac/install/ - Docker desktop for Windows https://docs.docker.com/docker-for-windows/install/ - Docker Hub - prebuilt images https://hub.docker.com Panoptes Resources - Panoptes in Docker prebuilt image https://hub.docker.com/r/panoptes/panoptes_docker - Panoptes in Docker GitHub repo https://github.com/yahoo/panoptes_docker - Panoptes GitHub repo https://github.com/yahoo/panoptes/ Questions, Suggestions & Contributions Your feedback and contributions are appreciated! 
Explore Panoptes, use and help contribute to the project, and chat with us on Slack.
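Once the container is running, a quick way to confirm that the bundled Grafana instance is reachable on the forwarded port is to hit its standard health endpoint; this snippet is our own illustration rather than part of the Panoptes tooling:

```python
# Check the Grafana instance exposed by the panoptes_docker container on port 8080.
import requests

resp = requests.get("http://127.0.0.1:8080/api/health", timeout=5)  # Grafana health endpoint
resp.raise_for_status()
print(resp.json())  # e.g. {"database": "ok", ...}
```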

Panoptes, an open source distributed network telemetry ecosystem is now available on Docker

April 1, 2019
Vespa Product Updates, March 2019: Tensor updates, Query tracing and coverage March 29, 2019
March 29, 2019

Vespa Product Updates, March 2019: Tensor updates, Query tracing and coverage

In last month’s Vespa update, we mentioned Boolean Field Type, Environment Variables, and Advanced Search Core Tuning. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to grow. This month, we’re excited to share the following updates with you: Tensor update Easily update individual tensor cells. Add, remove, and modify cell is now supported. This enables high throughput and continuous updates as tensor values can be updated without writing the full tensor. Advanced Query Trace Query tracing now includes matching and ranking execution information from content nodes - Query Explain,  is useful for performance optimization. Search coverage in access log Search coverage is now available in the access log. This enables operators to track the fraction of queries that are degraded with lower coverage. Vespa has features to gracefully reduce query coverage in overload situations and now it’s easier to track this. Search coverage is a useful signal to reconfigure or increase the capacity for the application. Explore the access log documentation to learn more. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.

Vespa Product Updates, March 2019: Tensor updates, Query tracing and coverage

March 29, 2019
User Teardown Steps in Templates and Manual Start for [skip ci] and restrictPR March 22, 2019
March 22, 2019

User Teardown Steps in Templates and Manual Start for [skip ci] and restrictPR

Dao Lam, Software Engineer, Verizon Media Dekus Lam, Software Engineer, Verizon Media Screwdriver V4 user teardown steps now work for templates. In the example below, teardown-write will be injected at the end of the build (before Screwdriver teardown steps) and will run regardless of build status. If the template has the same teardown step, it will be overwritten by the user’s teardown step. jobs: main: image: node:8 template: template_namespace/nodejs_main@1.2.0 steps: - teardown-write: echo hello requires: [~pr, ~commit] Additionally, we added the ability to manually start skip ci or restrictPR events. For skip ci, users can now hover over the empty build and select “Start pipeline from here” to trigger the build manually. For restrictPR, users can now click on the Start button to manually start the build. Note: if a skip ci commit is made by a bot under the ignoreCommitsBy configured by the cluster, skip ci will take precedence and users can still manually start the build. Compatibility List In order to use this feature, you will need these minimum versions: - API - v0.5.624 - [UI] (https://hub.docker.com/r/screwdrivercd/ui) - v1.0.389 Contributors Thanks to the following contributors for making this feature possible: - d2lam - dekus Questions & Suggestions We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.

User Teardown Steps in Templates and Manual Start for [skip ci] and restrictPR

March 22, 2019
Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server March 20, 2019
March 20, 2019

Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server

By Ashley Wolf, Open Source Program Manager, Verizon Media In this episode, Gil Yehuda (Sr. Director of Open Source at Verizon Media) interviews Alan Carroll, PhD, Senior Software Engineer for Global Networking / Edge at Verizon Media. Alan discusses networking at Verizon Media and how user traffic and proxy happens through Apache Traffic Server. He also shares his love of model rockets. Audio and transcript available here. You can also listen to this episode of Dash Open on iTunes or SoundCloud.

Dash Open 03: Alan Carroll - Networking On The Edge: IPv6, HTTP2, Apache Traffic Server

March 20, 2019
Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More March 6, 2019
March 6, 2019
Share

Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More

By Akshay Sarma, Principal Engineer, Verizon Media & Brian Xiao, Software Engineer, Verizon Media

This is the first of an ongoing series of blog posts sharing releases and announcements for Bullet, an open-sourced lightweight, scalable, pluggable, multi-tenant query system. Bullet allows you to query any data flowing through a streaming system without having to store it first, through its UI or API. The queries are injected into the running system and have minimal overhead. Running hundreds of queries generally fits into the overhead of just reading the streaming data. Bullet requires running an instance of its backend on your data. This backend runs on common stream processing frameworks (Storm and Spark Streaming currently supported). The data on which Bullet sits determines what it is used for. For example, our team runs an instance of Bullet on user engagement data (~1M events/sec) to let developers find their own events to validate their code that produces this data. We also use this instance to interactively explore data, throw up quick dashboards to monitor live releases, count unique users, debug issues, and more. Since open sourcing Bullet in 2017, we’ve been hard at work adding many new features! We’ll highlight some of these here and continue sharing update posts for future releases.

Windowing

Bullet used to operate in a request-response fashion - you would submit a query and wait for the query to meet its termination conditions (usually duration) before receiving results. For short-lived queries of, say, a few seconds, this was fine. But as we started fielding more interactive and iterative queries, waiting even a minute for results became too cumbersome. Enter windowing! Bullet now supports time and record-based windowing. With time windowing, you can break up your query into chunks of time over its duration and retrieve results for each chunk. For example, you can calculate the average of a field and stream back results every second. In that case, the aggregation is operating on all the data since the beginning of the query, but you can also do aggregations on just the windows themselves. This is often called a Tumbling window. With record windowing, you can get the intermediate aggregation for each record that matches your query (a Sliding window). Or you can do a Tumbling window on records rather than time, for example getting results back every three records. Overlapping windows in other ways (Hopping windows) or windows that reset based on different criteria (Session windows, Cascading windows) are currently being worked on. Stay tuned!

Apache Pulsar support as a native PubSub

Bullet uses a PubSub (publish-subscribe) message queue to send queries and results between the Web Service and Backend. As with everything else in Bullet, the PubSub is pluggable. You can use your favorite PubSub by implementing a few interfaces if you don’t want to use the ones we provide. Until now, we’ve maintained and supported a REST-based PubSub and an Apache Kafka PubSub. Now we are excited to announce supporting Apache Pulsar as well! Bullet Pulsar will be useful to those users who want to use Pulsar as their underlying messaging service. If you aren’t familiar with Pulsar, setting up a local standalone is very simple, and by default, any Pulsar topics written to will automatically be created. Setting up an instance of Bullet with Pulsar instead of REST or Kafka is just as easy. You can refer to our documentation for more details.
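To make the time-windowing described above more concrete, here is a rough sketch of what a windowed average query could look like in Bullet's JSON query format. Treat it as illustrative only: the key names (aggregation, window, emit, include) and the field duration_ms are assumptions based on the description above, not copied from the Bullet documentation, so check the docs for the exact schema.

{
  "aggregation": {
    "type": "GROUP",
    "attributes": {
      "operations": [
        { "type": "AVG", "field": "duration_ms", "newName": "avg_duration_ms" }
      ]
    }
  },
  "window": {
    "emit": { "type": "TIME", "every": 1000 },
    "include": { "type": "ALL" }
  },
  "duration": 30000
}

Here the window emits a result every second over all data seen so far; restricting the include clause to just the current window (rather than ALL) would correspond to the Tumbling behavior described above.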
Plug your data into Bullet without code

While Bullet worked on any data source located in any persistence layer, you still had to implement an interface to connect your data source to the Backend and convert it into a record container format that Bullet understands. For instance, your data might be located in Kafka and be in the Avro format. If you were using Bullet on Storm, you would perhaps write a Storm Spout to read from Kafka, deserialize, and convert the Avro data into the Bullet record format. This was the only interface in Bullet that required our customers to write their own code. Not anymore! Bullet DSL is a text/configuration-based format for users to plug their data into the Bullet Backend without having to write a single line of code. Bullet DSL abstracts away the two major components for plugging data into the Bullet Backend: a Connector piece to read from arbitrary data sources, and a Converter piece to convert the data that was read into the Bullet record container. We currently support and maintain a few of these - Kafka and Pulsar for Connectors, and Avro, Maps, and arbitrary Java POJOs for Converters. The Converters understand typed data and can even do a bit of minor ETL (Extract, Transform and Load) if you need to change your data around before feeding it into Bullet. As always, the DSL components are pluggable and you can write your own (and contribute it back!) if you need one that we don’t support.

We appreciate your feedback and contributions! Explore Bullet on GitHub, use and help contribute to the project, and chat with us on Google Groups. To get started, try our Quickstarts on Spark or Storm to set up an instance of Bullet on some fake data and play around with it.

Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More

March 6, 2019
Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning February 28, 2019
February 28, 2019
Share

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning

In last month’s Vespa update, we mentioned Parent/Child, Large File Config Download, and a Simplified Feeding Interface. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to helpful feedback and contributions from the community, Vespa continues to grow. This month, we’re excited to share the following updates:

Boolean field type

Vespa has released a boolean field type in #6644. This feature was requested by the open source community and is targeted at applications that have many boolean fields. It reduces the memory footprint of these fields to 1/8 (compared to the byte type), which increases query throughput and cuts latency. Learn more about choosing the field type here.

Environment variables

The Vespa Container now supports setting environment variables in services.xml. This is useful if the application uses libraries that read environment variables.

Advanced search core tuning

You can now configure index warmup, which reduces high-latency requests at startup. You can also reduce spiky memory usage when attributes grow by tuning resizing-amortize-count; the default has been changed to provide smoother memory usage, which uses less transient memory in growing applications. More details surrounding search core configuration can be explored here.

We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to see.
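As a small sketch of the boolean field type described above, a document schema could declare such a field as follows (the schema and field names are placeholders):

search newsletter {
    document newsletter {
        # A bool attribute uses roughly 1/8 of the memory of a byte field,
        # which is where the throughput and latency gains come from
        field is_subscribed type bool {
            indexing: summary | attribute
        }
    }
}

Queries can then filter on is_subscribed like any other attribute field.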

Vespa Product Updates, February 2019: Boolean Field Type, Environment Variables, and Advanced Search Core Tuning

February 28, 2019
Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications February 27, 2019
February 27, 2019
Share

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications

By Arun Gupta

Effective monitoring of applications depends on high-quality instrumentation. By measuring key metrics for your applications, you can identify performance characteristics and bottlenecks, detect failures, and plan for growth. Here are some examples of metrics that you might want about your applications:
- How much processing is being done, which could be in terms of requests, queries, transactions, records, backend calls, etc.
- How long a particular part of the code is taking (i.e., latency), which could be in the form of total time spent as well as statistics like weighted average (based on sum and count), min, max, percentiles, and histograms.
- How many resources are being utilized, like memory, entries in a hashmap, length of an array, etc.
Further, you might want to know details about your service, such as:
- How many users are querying the service?
- Latency experienced by users, sliced by users’ device types, countries of origin, operating system versions, etc.
- Number of errors encountered by users, sliced by types of errors.
- Sizes of responses returned to users.
At Verizon Media, we have applications and services that run at a very large scale, and metrics are critical for driving business and operational insights. We set out to find a good metrics library for our Java services that provides lots of features yet performs well at scale. After evaluating available options, we realized that existing libraries did not meet our requirements:
- Support for dynamic dimensions (i.e., tags)
- Metrics need to support associative operations
- Works well in very high traffic applications
- Minimal garbage collection pressure
- Report metrics to multiple monitoring systems
As a result, we built and open sourced UltraBrew Metrics, which is a Java library for instrumenting very large-scale applications.

Performance

UltraBrew Metrics can operate at millions of requests per second per JVM without measurably slowing the application down. We currently use the library to instrument multiple applications at Verizon Media, including one that uses this library 20+ million times per second on a single JVM. Here are some of the techniques that allowed us to achieve our performance target:
- Minimize the need for synchronization by:
  - Using Java’s Unsafe API for atomic operations.
  - Aligning data fields to L1/L2-cache line size.
  - Tracking state over 2 time-intervals to prevent contention between writes and reads.
- Reduce the creation of objects, including avoiding the use of Java HashMaps.
- Writes happen on the caller thread rather than dedicated threads. This avoids the need for a buffer between threads.

Questions or Contributions

To learn more about this library, please visit our GitHub. Feel free to also tweet or email us with any questions or suggestions.

Acknowledgments

Special thanks to my colleagues who made this possible:
- Matti Oikarinen
- Mika Mannermaa
- Smruti Ranjan Sahoo
- Ilpo Ruotsalainen
- Chris Larsen
- Rosalie Bartlett
- The Monitoring Team at Verizon Media
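To give a feel for the "tracking state over 2 time-intervals" technique listed above, here is a simplified, self-contained Java sketch of the idea. This is not the UltraBrew Metrics API (see the GitHub repository for the real library); it only illustrates how keeping writers on the slot for the current interval, while the reporter reads and resets the previous interval's slot, avoids contention between application threads and the reporting thread.

import java.util.concurrent.atomic.AtomicLongArray;

public class DoubleBufferedCounter {

    // One slot per time interval; writers and the reader touch different slots
    private final AtomicLongArray slots = new AtomicLongArray(2);
    private final long intervalMillis;

    public DoubleBufferedCounter(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    private int currentSlot() {
        return (int) ((System.currentTimeMillis() / intervalMillis) & 1);
    }

    // Called on the application (caller) thread for every event
    public void increment(long delta) {
        slots.addAndGet(currentSlot(), delta);
    }

    // Called by the reporting thread once per interval: read and reset the
    // slot of the previous interval, which writers are no longer touching
    public long readAndResetPrevious() {
        int previous = 1 - currentSlot();
        return slots.getAndSet(previous, 0);
    }
}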

Open-sourcing UltraBrew Metrics, a Java library for instrumenting very large-scale applications

February 27, 2019
Freeze Windows and Collapsed Builds February 25, 2019
February 25, 2019
Share

Freeze Windows and Collapsed Builds

Min Zhang, Software Dev Engineer, Verizon Media
Pranav Ravichandran, Software Dev Engineer, Verizon Media

Freeze Windows

Want to prevent your deployment jobs from running on weekends? You can now freeze your Screwdriver jobs and prevent them from running during specific time windows using the freezeWindows feature. Screwdriver will collapse all the frozen jobs inside the window to a single job and run it as soon as the window expires. The job will be run from the last commit within the window.

Screwdriver Users

The freezeWindows setting takes a cron expression or a list of them as the value. Caveats:
- Unlike buildPeriodically, freezeWindows should not use hashed time; therefore, the symbol H for hash is disabled.
- The combination of day of week and day of month is invalid, so only one of day of week and day of month can be specified. The other field should be set to ?.
- All times are in UTC.
In the following example, job1 will be frozen during the month of March, job2 will be frozen on weekends, and job3 will be frozen from 10 PM to 10 AM.
shared:
  image: node:6
jobs:
  job1:
    freezeWindows: ['* * ? 3 *']
    requires: [~commit]
    steps:
      - build: echo "build"
  job2:
    freezeWindows: ['* * ? * 0,6,7']
    requires: [~job1]
    steps:
      - build: echo "build"
  job3:
    freezeWindows: ['* 0-10,22-23 ? * *']
    requires: [~job2]
    steps:
      - build: echo "build"
In the UI, jobs within the freeze window appear as below (deploy and auxiliary):

Collapsed Builds

Screwdriver now supports collapsing all BLOCKED builds of the same job into a single build (the latest one). With this feature, users with concurrent builds no longer need to wait until all of them finish in series to get the latest release out.

Screwdriver Users

To opt in to collapseBuilds, Screwdriver users can configure their screwdriver.yaml using annotations as shown below:
jobs:
  main:
    annotations:
      screwdriver.cd/collapseBuilds: true
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
In the UI, a collapsed build appears as below:

Cluster Admin

Cluster admins can configure the default behavior (collapsed or not) in the queue-worker configuration.

Compatibility List

In order to use freeze windows and collapsed builds, you will need these minimum versions:
- API - v0.5.578
- Queue-worker - v2.5.2
- Buildcluster-queue-worker - v1.1.8

Contributors

Thank you to the following contributors for making this feature possible:
- minz1027
- pranavrc

Questions & Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.
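For readers less used to cron syntax, the freezeWindows expressions in the example above can be read as standard five-field cron (minute, hour, day of month, month, day of week, evaluated in UTC); this breakdown is derived from the examples themselves:

# minute  hour         day-of-month  month  day-of-week
# *       *            ?             3      *        -> job1: frozen for all of March
# *       *            ?             *      0,6,7    -> job2: frozen on weekends (0 and 7 are Sunday, 6 is Saturday)
# *       0-10,22-23   ?             *      *        -> job3: frozen from 10 PM to 10 AM UTC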

Freeze Windows and Collapsed Builds

February 25, 2019
Restrict PRs from forked repository February 21, 2019
February 21, 2019
Share

Restrict PRs from forked repository

Dao Lam, Software Engineer, Verizon Media

Previously, any Screwdriver V4 user could start PR jobs (jobs configured to run on ~pr) by forking the repository and creating a PR against it. For many pipelines, this is not desirable behavior for security reasons, since secrets and other sensitive data might get exposed in the PR builds. Screwdriver V4 now allows users to specify whether they want to restrict forked PRs or all PRs using the pipeline-level annotation screwdriver.cd/restrictPR. Example:
annotations:
  screwdriver.cd/restrictPR: fork
shared:
  image: node:8
jobs:
  main:
    requires:
      - ~pr
      - ~commit
    steps:
      - echo: echo test
Cluster admins can set the default behavior for the cluster by setting the environment variable RESTRICT_PR. Explore the guide here.

Compatibility List

In order to use this feature, you will need these minimum versions:
- API - v0.5.581

Contributors

Thanks to the following contributors for making this feature possible:
- d2lam
- stjohnjohnson

Questions & Suggestions

We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.

Restrict PRs from forked repository

February 21, 2019
Shared “Verizon Media Case Study: Zero Trust Security With Athenz” at the OpenStack Summit in Berlin February 20, 2019
February 20, 2019
Share

Shared “Verizon Media Case Study: Zero Trust Security With Athenz” at the OpenStack Summit in Berlin

By James Penick, Architect Director, Verizon Media At Verizon Media, we’ve developed and open sourced a platform for X.509 certificate-based service authentication and fine-grained access control in dynamic infrastructures called Athenz. Athenz addresses zero trust principles, including situations where authenticated clients require explicit authorization to be allowed to perform actions, and authorization needs to always be limited to the least privilege required. During the OpenStack Summit in Berlin, I discussed Athenz and its integration with OpenStack for fully automated role-based authorization and identity provisioning. We are using Athenz to bootstrap our instances deployed in both private and public clouds with service identities in the form of short-lived X.509 certificates that allow one service to securely communicate with another. Our OpenStack instances are powered by Athenz identities at scale. To learn more about Athenz, give feedback, or contribute, please visit our Github and chat with us on Slack.

Shared “Verizon Media Case Study: Zero Trust Security With Athenz” at the OpenStack Summit in Berlin

February 20, 2019
Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine February 13, 2019
February 13, 2019
Share

Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine

Jon Bratseth, Distinguished Architect, Verizon Media

Vespa, the open source big data serving engine, includes a mode which provides personal search at scale for a fraction of the cost of alternatives. In this article, we explain streaming search and discuss how to use it. Imagine you are tasked with building the next email service, a massive personal data store centered around search. How would you do it? An obvious answer is to just use a regular search engine, write all documents to a big index and simply restrict queries to match documents belonging to a single user. Although this works, it’s incredibly costly. Successful personal data stores have a tendency to become massive — the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space and the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems need to handle billions of writes per day, so this quickly becomes the dominating cost of the entire system. However, when you think about it, there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we instead just maintain a separate small index per user? This makes both index updates and queries cheaper but leads to a new problem: writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale either means using a few thousand spinning disks, or storing all data on SSDs. Both options are expensive. Large-scale data stores already solve this problem for appending writes, by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store? That helps, but it means we need to do our own development to put these systems together in a way that performs at scale every time for both queries and writes. And we still need to pay the cost of storing the indexes in addition to the raw user data. Do we need indexes at all though? It turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, at the considerable cost of maintaining those indexes. In personal search, however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together, we can achieve search with low latency by simply reading the data at query time — what we call streaming search. In most cases, subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.

Numbers

How many documents can be searched per node per second with this solution? Assuming a node with a 500 MB/sec read speed (either from an SSD or multiple spinning disks), and a 1 KB average compressed document size, the disk can search at most 500 MB/sec ÷ 1 KB/doc = 500,000 docs/sec. If each user stores 1000 documents on average, this gives a max throughput per node of 500 queries/second.
This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective. What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99% latency (or similar). This will be driven by large users which need to be split among multiple nodes streaming in parallel. The max data size per node is then a trade-off between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store at most 50,000 documents per user per node such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total per cluster, we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.

Streaming search

Alright, we now have a cost-effective solution to implement the next email provider: store just the raw data of users in a log-structured store. Locate the data of each user on a single node in the system for locality (or 2–3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be inexpensive and efficient, but it sounds like a lot of work! It would be great if somebody already did all of this, ran it at scale for years and then released it as open source. Well, as luck would have it, we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa or more complicated sharding solutions in Elasticsearch for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of emails, personal typeahead suggestions, personal image collections, and private forum group content.

Streaming search on Vespa

The steps to using streaming search on Vespa are:
- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in e.g. id:mynamespace:mydocumenttype:g=user123:doc123
- Pass the group id in queries by setting the query property streaming.groupname.
That’s it!
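As a minimal sketch of the first step, streaming mode is selected per document type in services.xml; the cluster and document type names below are placeholders:

<content id="personal" version="1.0">
  <redundancy>2</redundancy>
  <documents>
    <!-- Streaming mode: no global index is built for this type -->
    <document type="mail" mode="streaming"/>
  </documents>
  <nodes>
    <node hostalias="node1" distribution-key="0"/>
  </nodes>
</content>

A query scoped to one user then simply carries the group name used in the document ids, for example /search/?query=coffee&streaming.groupname=user123.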
By following the above steps, you’ll have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any available alternative, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details see the documentation on streaming search. Have fun using Vespa and let us know (tweet or email) what you’re building and any features you’d like to see.

Efficient Personal Search at Scale with Vespa, the Open Source Big Data Serving Engine

February 13, 2019
Serving article comments using reinforcement learning of a neural net February 12, 2019
February 12, 2019
Share

Serving article comments using reinforcement learning of a neural net

Don’t look at the comments. When you allow users to make comments on your content pages, you face the problem that not all of them are worth showing — a difficult problem to solve, hence the saying. In this article I’ll show how this problem has been attacked using reinforcement learning at serving time on Yahoo content sites, using the Vespa open source platform to create a scalable production solution. Yahoo properties such as Yahoo Finance, News and Sports allow users to comment on the articles, similar to many other apps and websites. To support this, the team needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. Here I’ll explain how this problem was solved for Yahoo properties by using Vespa — the open source big data serving engine. I’ll start with the basics and then show how comment selection using a neural net and reinforcement learning was implemented.

Real-time comment serving

As mentioned, the team needed a system that can add, find, count, and serve comments at scale in real time. The team chose Vespa, the open source big data serving engine, for this, as it supports both basic serving and incorporating machine learning at serving time (which we’ll get to below). By storing each comment as a separate document in Vespa, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself, the team could issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article. In addition, this document structure allowed less-used operations such as showing all the articles of a given user and similar. The Vespa instance used at Yahoo for this stores about a billion comments at any time, serves about 12,000 queries per second, and handles about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, 32 stateless and 96 stateful nodes are spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6–12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles on Yahoo pages have a very large number of comments — up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore it is necessary to pick the best comments to show each time someone views an article. Vespa does this by finding all the comments for the article, computing a score for each, and picking the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution.
This allows executing these queries with low latency and ensures that more comments can be handled by adding more content nodes, without causing an increase in latency. The input to the ranking function is a set of features which are typically stored in the document (here: a comment) or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, the system keeps track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed. The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to have to update each comment every time there is new information about the author. Instead, the author information is stored in a separate document type (one document per author), and a document reference in Vespa is used to import that author feature into each comment. This allows updating the author information once and having it automatically take effect for all comments by that author. With these features, it’s possible in Vespa to configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments.

Using a neural net and reinforcement learning

The team used to rank comments with a handwritten ranking expression having hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it, they needed to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc). The team knew they wanted user interest to go up on average, but there is no way to know what the correct value of the measure of interest might be for any single given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so reinforcement learning was chosen instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal used as a proxy for user interest, and use this to converge on a model that increases it. The model chosen here was a neural net with multiple hidden layers. The advantage of using a neural net compared to a simple function such as linear regression is that it can capture non-linear relationships in the feature data without anyone having to guess which relationships exist and hand-write functions to capture them (feature engineering). To explore the space of possible rankings, the team implemented a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. They logged the ranking information and user interest signals such as dwell time to their Hadoop grid where they are joined. This generates a training set each hour which is used to retrain the model using TensorFlow-on-Spark, which produces a new model for the next iteration of the reinforcement learning cycle. To implement this on Vespa, the team configured the neural net as the ranking function for comments.
This was done as a manually written ranking function over tensors in a rank profile. Here is the production configuration used:

rank-profile neuralNet {
    function get_model_weights(field) {
        expression: if(query(field) == 0, constant(field), query(field))
    }
    function layer_0() { # returns tensor(hidden[9])
        expression: elu(xw_plus_b(nn_input,
                                  get_model_weights(W_0),
                                  get_model_weights(b_0),
                                  x))
    }
    function layer_1() { # returns tensor(out[9])
        expression: elu(xw_plus_b(layer_0,
                                  get_model_weights(W_1),
                                  get_model_weights(b_1),
                                  hidden))
    }
    # xw_plus_b returns tensor(out[1]), so sum converts to double
    function layer_out() {
        expression: sum(xw_plus_b(layer_1,
                                  get_model_weights(W_out),
                                  get_model_weights(b_out),
                                  out))
    }
    first-phase {
        expression: freshnessRank
    }
    second-phase {
        expression: layer_out
        rerank-count: 2000
    }
}

More recently Vespa added support for deploying TensorFlow SavedModels directly (as well as similar support for tools saving in the ONNX format), which would also be a good option here since the training happens in TensorFlow. Neural nets have a pair of weight and bias tensors for each layer, which is what the team wanted the training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, with reinforcement learning it is necessary to be able to update these tensor parameters frequently. This could be achieved by redeploying the application package frequently, as Vespa allows that to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavyweight process, so another approach was chosen: store the neural net parameters as tensors in a separate document type in Vespa, and create a Searcher component which looks up this document on each incoming query and adds the parameter tensors to it before it’s passed to the content nodes for evaluation.
Here is the full production code needed to accomplish this serving-time operation:

import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {

    private static final String VESPA_ID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
    private static final String FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from Vespa index */
    private final SyncSession fetchDocumentSession;

    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession =
                DocumentAccess.createDefault()
                              .createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // Fetch model document from Vespa
        String id = String.format(VESPA_ID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(id));
        // Add it to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query)
            );
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field,
                                                     FieldValue value,
                                                     Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(FEATURE_FORMAT, field),
                                                 tensor);
        }
    }
}

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net (where each field below is configured with "indexing: attribute | summary"):

document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … }
    field W_1 type tensor(hidden[9],out[9]) { … }
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … }
    field b_out type tensor(out[1]) { … }
}

Since updating documents is a lightweight operation, it is now possible to make frequent changes to the neural net to implement the reinforcement learning process.

Results

Switching to the neural net model with reinforcement learning has already led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. This avoids the bottleneck of sending the data for each comment to be evaluated over the network and allows increasing parallelization indefinitely by adding more content nodes.
However, evaluating the neural net for all comments for outlier articles which have hundreds of thousands of comments would still be very costly. If you read the rank profile configuration shown above, you’ll have noticed the solution to this: two-phase ranking was used, where the comments are first selected by a cheap rank function (termed freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

In this article I have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa this can be accomplished relatively easily with a standard open source technology. The team working on this plans to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Serving article comments using reinforcement learning of a neural net

February 12, 2019
Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving February 8, 2019
February 8, 2019
Share

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

Online evaluation of machine-learned models (model serving) is difficult to scale to large datasets. Vespa.ai is an open source big data serving solution used to solve this problem and in use today on some of the largest such systems in the world. These systems evaluate models over millions of data points per request for hundreds of thousands of requests per second. If you’re in Warsaw on February 27th, please join Jon Bratseth (Distinguished Architect, Verizon Media) at the Big Data Technology Warsaw Summit, where he’ll share “Scalable machine-learned model serving” and answer any questions. Big Data Technology Warsaw Summit is a one-day conference with technical content focused on big data analysis, scalability, storage, and search. There will be 27 presentations and more than 500 attendees are expected. Jon’s talk will explore the problem and architectural solution, show how Vespa can be used to achieve scalable serving of TensorFlow and ONNX models, and present benchmarks comparing performance and scalability to TensorFlow Serving. Hope to see you there!

Join us at the Big Data Technology Warsaw Summit on February 27th for Scalable Machine-Learned Model Serving

February 8, 2019
Meta Pull Request Checks February 8, 2019
February 8, 2019
Share

Meta Pull Request Checks

Screwdriver now supports adding extra status checks on pull requests through Screwdriver build meta. This feature allows users to add custom checks such as coverage results to the Git pull request. Note: This feature is only available for the GitHub plugin at the moment.

Screwdriver Users

To add a check to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:
jobs:
  main:
    steps:
      - status: |
          meta set meta.status.findbugs '{"status":"FAILURE","message":"923 issues found. Previous count: 914 issues.","url":"http://findbugs.com"}'
          meta set meta.status.coverage '{"status":"SUCCESS","message":"Coverage is above 80%."}'
These commands will result in a status check on the Git pull request. For more details, see our documentation.

Compatibility List

In order to use the new meta PR checks feature, you will need these minimum versions:
- API: v0.5.559

Contributors

Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Meta Pull Request Checks

February 8, 2019
Serving article comments using neural nets and reinforcement learning February 4, 2019
February 4, 2019
Share

Serving article comments using neural nets and reinforcement learning

Yahoo properties such as Yahoo Finance, Yahoo News, and Yahoo Sports allow users to comment on the articles, similar to many other apps and websites. To support this we needed a system that can add, find, count and serve comments at scale in real time. Not all comments are equally interesting or relevant though, and some articles can have hundreds of thousands of comments, so a good commenting system must also choose the right comments among these to show to users viewing the article. To accomplish this, the system must observe what users are doing and learn how to pick comments that are interesting. In this blog post, we’ll explain how we’re solving this problem for Yahoo properties by using Vespa - the open source big data serving engine. We’ll start with the basics and then show how comment selection using a neural net and reinforcement learning has been implemented.

Real-time comment serving

As mentioned, we need a system that can add, find, count, and serve comments at scale in real time. Vespa allows us to do this easily by storing each comment as a separate document, containing the ID of the article commented upon, the ID of the user commenting, various comment metadata, and the comment text itself. Vespa then allows us to issue queries to quickly retrieve the comments on a given article for display, or to show a comment count next to the article. In addition, we can show all the articles of a given user and similar less-used operations. We store about a billion comments at any time, serve about 12,000 queries per second, and handle about twice as many writes (new comments + comment metadata updates). Average latency for queries is about 4 ms, and write latency roughly 1 ms. Nodes are organized in two tiers as a single Vespa application: a single stateless cluster handling incoming queries and writes, and a content cluster storing the comments, maintaining indexes and executing the distributed part of queries in parallel. In total, we use 32 stateless and 96 stateful nodes spread over 5 regional data centers. Data is automatically sharded by Vespa in each datacenter, in 6-12 shards depending on the traffic patterns of that region.

Ranking comments

Some articles have a very large number of comments - up to hundreds of thousands are not uncommon, and no user is going to read all of them. Therefore we need to pick the best comments to show each time someone views an article. To do this, we let Vespa find all the comments for the article, compute a score for each, and pick the comments with the best scores to show to the user. This process is called ranking. By configuring the function to compute for each comment as a ranking expression in Vespa, the engine will compute it locally on each data partition in parallel during query execution. This allows us to execute these queries with low latency and ensures that we can handle more comments by adding more content nodes, without causing an increase in latency. The input to the ranking function is features which are typically stored in the comment or sent with the query. Comments have various features indicating how users interacted with the comment, as well as features computed from the comment content itself. In addition, we keep track of the reputation of each comment author as a feature. User actions are sent as update operations to Vespa as they are performed.
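As an illustration of what such an update operation can look like, a user action like an upvote can be applied as a partial update through Vespa's /document/v1 HTTP API. The namespace, document type, document id, and the upvotes field below are hypothetical, and the exact endpoint shape should be checked against the Vespa documentation:

PUT /document/v1/comments/comment/docid/comment-12345
{
    "fields": {
        "upvotes": { "increment": 1 }
    }
}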
The information about authors is also continuously changing, but since each author can write many comments it would be wasteful to have to update each comment every time we have new information about the author. Instead, we store the author information in a separate document type (one document per author) and use a document reference in Vespa to import that author feature into each comment. This allows us to update author information once and have it automatically take effect for all comments by that author. With these features, we can configure a mathematical function as a ranking expression which computes the rank score of each comment to produce a ranked list of the top comments.

Using a neural net and reinforcement learning

We used to rank comments using a handwritten ranking expression with hardcoded weighting of the features. This is a good way to get started but obviously not optimal. To improve it we need to decide on a measurable target and use machine learning to optimize towards it. The ultimate goal is for users to find the comments interesting. This cannot be measured directly, but luckily we can define a good proxy for interest based on signals such as dwell time (the amount of time the users spend on the comments of an article) and user actions (whether users reply to comments, provide upvotes and downvotes, etc). We know that we want user interest to go up on average, but we don’t know what the correct value of this measure of interest might be for any given list of comments. Therefore it’s hard to create a training set of interest signals for articles (supervised learning), so we chose to use reinforcement learning instead: let the system make small changes to the live machine-learned model iteratively, observe the effect on the signal we use as a proxy for user interest, and use this to converge on a model that increases it. The model chosen is a neural net with multiple hidden layers. The advantage of using a neural net compared to a simple function such as linear regression is that we can capture non-linear relationships in the feature data without having to guess which relationships exist and hand-write functions to capture them (feature engineering). To explore the space of possible rankings, we implement a sampling algorithm in a Searcher to perturb the ranking of comments returned from each query. We log the ranking information and our user interest signals such as dwell time to our Hadoop grid where they are joined. This generates a training set each hour which we use to retrain the model using TensorFlow-on-Spark, which generates a new model for the next iteration of the reinforcement learning cycle. To implement this on Vespa, we configure the neural net as the ranking function for comments.
This was done as a manually written ranking function over tensors in a rank profile:    rank-profile neuralNet {        function get_model_weights(field) {            expression: if(query(field) == 0, constant(field), query(field))        }        function layer_0() {  # returns tensor(hidden[9])            expression: elu(xw_plus_b(nn_input,                                      get_model_weights(W_0),                                      get_model_weights(b_0),                                      x))        }        function layer_1() {  # returns tensor(out[9])            expression: elu(xw_plus_b(layer_0,                                      get_model_weights(W_1),                                      get_model_weights(b_1),                                     hidden))        }        function layer_out() {  # xw_plus_b returns tensor(out[1]), so sum converts to double            expression: sum(xw_plus_b(layer_1,                                      get_model_weights(W_out),                                      get_model_weights(b_out),                                      out))        }        first-phase {            expression: freshnessRank        }        second-phase {            expression: layer_out            rerank-count: 2000        }    } More recently Vespa added support for deploying TensorFlow SavedModels directly, which would also be a good option since the training happens in TensorFlow. Neural nets have a pair of weight and bias tensors for each layer, which is what we want our training process to optimize. The simplest way to include the weights and biases in the model is to add them as constant tensors to the application package. However, to do reinforcement learning we need to be able to update them frequently. We could achieve this by redeploying the application package frequently, as Vespa allows this to be done without restarts or disruption to ongoing queries. However, it is still a somewhat heavy-weight process, so we chose another approach: Store the neural net parameters as tensors in a separate document type, and create a Searcher component which looks up this document on each incoming query, and adds the parameter tensors to it before it’s passed to the content nodes for evaluation. 
Here is the full code needed to accomplish this:

import com.yahoo.document.Document;
import com.yahoo.document.DocumentId;
import com.yahoo.document.Field;
import com.yahoo.document.datatypes.FieldValue;
import com.yahoo.document.datatypes.TensorFieldValue;
import com.yahoo.documentapi.DocumentAccess;
import com.yahoo.documentapi.SyncParameters;
import com.yahoo.documentapi.SyncSession;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.tensor.Tensor;
import java.util.Map;

public class LoadRankingmodelSearcher extends Searcher {

    private static final String VESPA_DOCUMENTID_FORMAT = "id:canvass_search:rankingmodel::%s";
    // https://docs.vespa.ai/documentation/ranking.html#using-query-variables:
    private static final String QUERY_FEATURE_FORMAT = "query(%s)";

    /** To fetch model documents from Vespa index */
    private final SyncSession fetchDocumentSession;

    public LoadRankingmodelSearcher() {
        this.fetchDocumentSession = DocumentAccess.createDefault()
                                                  .createSyncSession(new SyncParameters.Builder().build());
    }

    @Override
    public Result search(Query query, Execution execution) {
        // fetch model document from Vespa
        String documentId = String.format(VESPA_DOCUMENTID_FORMAT, query.getRanking().getProfile());
        Document modelDoc = fetchDocumentSession.get(new DocumentId(documentId));
        // Add it to the query
        if (modelDoc != null) {
            modelDoc.iterator().forEachRemaining((Map.Entry<Field, FieldValue> e) ->
                addTensorFromDocumentToQuery(e.getKey().getName(), e.getValue(), query)
            );
        }
        return execution.search(query);
    }

    private static void addTensorFromDocumentToQuery(String field, FieldValue value, Query query) {
        if (value instanceof TensorFieldValue) {
            Tensor tensor = ((TensorFieldValue) value).getTensor().get();
            query.getRanking().getFeatures().put(String.format(QUERY_FEATURE_FORMAT, field), tensor);
        }
    }
}

The model weight document definition is added to the same content cluster as the comment documents and simply contains attribute fields for each weight and bias tensor of the neural net:

document rankingmodel {
    field modelTimestamp type long { … }
    field W_0 type tensor(x[9],hidden[9]) { … }
    field b_0 type tensor(hidden[9]) { … }
    field W_1 type tensor(hidden[9],out[9]) { … }
    field b_1 type tensor(out[9]) { … }
    field W_out type tensor(out[9]) { … }
    field b_out type tensor(out[1]) { … }
}

Since updating documents is a lightweight operation, we can now make frequent changes to the neural net to implement the reinforcement learning.

Results

Switching to the neural net model with reinforcement learning led to a 20% increase in average dwell time. The average response time when ranking with the neural net increased to about 7 ms since the neural net model is more expensive. The response time stays low because in Vespa the neural net is evaluated on all the content nodes (partitions) in parallel. We avoid the bottleneck of sending the data for each comment to be evaluated over the network and can increase parallelization indefinitely by adding more content nodes. However, evaluating the neural net for all comments for outlier articles which have hundreds of thousands of comments would still be very costly.
If you read the rank profile configuration shown above, you’ll have noticed the solution to this: we use two-phase ranking, where the comments are first selected by a cheap rank function (which we term freshnessRank) and the highest scoring 2000 documents (per content node) are re-ranked using the neural net. This caps the max CPU spent on evaluating the neural net per query.

Conclusion and future work

We have shown how to implement a real comment serving and ranking system on Vespa. With reinforcement learning gaining popularity, the serving system needs to become a more integrated part of the machine learning stack, and by using Vespa and TensorFlow-on-Spark, this can be accomplished relatively easily with standard open source technology. We plan to expand on this work by applying it to other domains such as content recommendation, incorporating more features in a larger network, and exploring personalized comment ranking.

Acknowledgments

Thanks to Aaron Nagao, Sreekanth Ramakrishnan, Zhi Qu, Xue Wu, Kapil Thadani, Akshay Soni, Parikshit Shah, Troy Chevalier, Jon Bratseth, Lester Solbakken and Håvard Pettersen for their contributions to this work.

Serving article comments using neural nets and reinforcement learning

February 4, 2019
Vespa 7 is released! February 1, 2019
February 1, 2019
Share

Vespa 7 is released!

This week we rolled the major version of Vespa over from 6 to 7. The releases we make public already run a large number of high traffic production applications on our Vespa cloud, and the 7 versions are no exception. There are no new features on version 7 since we release all new features incrementally on minors. Instead, the major version change is used to mark the point where we remove legacy features marked as deprecated and change some default settings. We only do this on major version changes, as Vespa uses semantic versioning. Before upgrading, go through the list of changes in the release notes to make sure your application and usage is ready. Upgrading can be done by following the regular live upgrade procedure.

Vespa 7 is released!

February 1, 2019
Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine) January 31, 2019
January 31, 2019
Share

Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine)

Nate Speidel, Software Engineer, Verizon Media

In December, I joined Michael Natkovich (Director, Software Dev Engineering, Verizon Media) at a Bay Area Hadoop meetup to share about Bullet. Created by Yahoo, Bullet is an open-source multi-tenant query system. It’s lightweight, scalable and pluggable, and allows you to query any data flowing through a streaming system without having to store it. Bullet queries look forward in time, and we use Sketch-based algorithms to support otherwise intractable Big Data aggregations like Top K, Counting Distincts, and Windowing efficiently without a storage layer. Jon Bratseth, Distinguished Architect at Verizon Media, joined us at the meetup and presented “Big Data Serving with Vespa”. Largely developed by engineers from Yahoo, Vespa is a big data processing and serving engine, available as open source on GitHub. Vespa allows you to search, organize, and evaluate machine-learned models from TensorFlow over large, evolving data sets, with latencies in the tens of milliseconds. Many of our products — such as Yahoo News, Yahoo Sports, Yahoo Finance and Oath Ads Platforms — currently employ Vespa. To learn about future product updates from Bullet or Vespa, follow YDN on Twitter or LinkedIn.

Bay Area Hadoop Meetup Recap - Bullet (Open Source Real-Time Data Query Engine) & Vespa (Open Source Big Data Serving Engine)

January 31, 2019
Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes January 29, 2019
January 29, 2019
Share

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes

By Jithin Emmanuel, Sr. Software Dev Manager, Verizon Media On Tuesday, December 4th, I joined speakers from Spotinst, Nirmata, CloudYuga, and MayaData, at the Microservices and Cloud Native Apps Meetup in Sunnyvale. We shared how Screwdriver is used for CI/CD at Verizon Media. Created by Yahoo and open-sourced in 2016, Screwdriver is a build platform designed for continuous delivery at scale. Screwdriver supports an expanding list of source code services, execution engines, and databases since it is not tied to any specific compute platform. Moreover, it has a fully documented API and growing open source community base. The meetup also featured very interesting CI/CD presentations including these: - A Quick Overview of Intro to Kubernetes Course, by Neependra Khare, Founder, CloudYuga Neependra discussed his online course which includes some of Kubernetes’ basic concepts, architecture, the problems it solves, and the model that it uses to handle containerized deployments and scaling. Additionally, CloudYuga provides training in Docker, Kubernetes, Mesos Marathon, Container Security, GO Language, Advanced Linux Administration, and more. - Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, by Amiram Shachar, CEO & Founder, Spotinst Amiram discussed two important concepts of Kubernetes: Headroom and 2 Levels Scaling. Amiram also reviewed the different Kubernetes deployment tools, including Kubernetes Operations (Kops). Ritesh Patel, Founder and VP Products at Nirmata, demoed Spotinst and Nirmata. Nirmata provides a complete solution for Kubernetes deployment and management for cloud-based app containerization. Spotinst is workload automation software that’s focused on helping enterprises save time and costs on their cloud compute infrastructure.  - Data Agility for Stateful Workloads in Kubernetes, by Murat Karslioglu, VP Products, MayaData MayaData is focused on freeing DevOps and Kubernetes from storage constraints with OpenEBS. Murat discussed accelerating CI/CD Pipelines and DevOps, using chaos engineering and containerized storage. Murat also explored some of the open source tools available from MayaData and introduced the MayaData Agility Platform (MDAP). Murat’s presentation ended with a live demo of OpenEBS and Litmus. To learn about future meetups, follow us on Twitter at @YDN or on LinkedIn.

Musings from our CI/CD Meetup: Using Screwdriver, Achieving a Serverless Experience While Scaling with Kubernetes or Amazon ECS, and Data Agility for Stateful Workloads in Kubernetes

January 29, 2019
Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface January 28, 2019
January 28, 2019
Share

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

In last month’s Vespa update, we mentioned ONNX integration, precise transaction log pruning, grouping on maps, and improvements to streaming search performance. Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and Oath Ads Platforms. Thanks to feedback and contributions from the community, Vespa continues to evolve. This month, we’re excited to share the following updates with you: Parent/Child We’ve added support for multiple levels of parent-child document references. Documents with references to parent documents can now import fields, with minimal impact on performance. This simplifies updates to parent data, as no denormalization is needed, and supports use cases with many-to-many relationships, like Product Search. Read more in parent-child. File URL references in application packages Serving nodes sometimes require data files which are so large that it doesn’t make sense for them to be stored and deployed in the application package. Such files can now be included in application packages by using a URL reference. When the application is redeployed, the files are automatically downloaded and injected into the components that depend on them. Batch feed in Java client The new SyncFeedClient provides a simplified API for feeding batches of data with high performance using the Java HTTP client. This is convenient when feeding from systems without full streaming support, such as Kafka and DynamoDB. We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to see.
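As a rough illustration of batch feeding (not the SyncFeedClient API itself, which is documented with the Vespa Java client), here is a sketch that feeds a small batch of documents over Vespa's HTTP document API using the standard Java 11 HTTP client. The endpoint, namespace, document type, and field names are assumptions made for the example; adjust them to your own application.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class BatchFeedSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Hypothetical endpoint and document type; replace with your own.
        String base = "http://localhost:8080/document/v1/mynamespace/music/docid/";
        List<String> docIds = List.of("doc1", "doc2", "doc3");
        for (String docId : docIds) {
            String json = "{\"fields\": {\"title\": \"Example " + docId + "\"}}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(base + docId))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(docId + " -> " + response.statusCode());
        }
    }
}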

Vespa Product Updates, January 2019: Parent/Child, Large File Config Download, and a Simplified Feeding Interface

January 28, 2019
Pipeline page redesign January 25, 2019
January 25, 2019
Share

Pipeline page redesign

Check out Screwdriver’s redesigned UI for the pipeline page! In addition to a smoother interface and easier navigation, here are some utility fixes: Disabled jobs We’ve changed disabled job icons to stand out more in the pipeline graph. Also, you can now: - Hover over a disabled job in the pipeline graph to view its details (who disabled it). - Add a reason when you disable a job from the Pipeline Options tab. This information will be displayed on the same page. Disabled job confirmation: Disabled job reason display: Pipeline events The event list has now been conveniently shifted to the right sidebar! The sidebar now shows only minimal data, including a minified version of the parts of your workflow that ran, to make for quicker information processing. This change gives more space for large workflow graphs and makes for less scrolling on the page. Pull requests can be accessed by switching from the Events tab to the Pull Requests tab on the top right. Old and new pipeline page comparison: Pull requests sidebar: Compatibility List In order to see the new pipeline redesign, you will need these minimum versions: - API: v0.5.551 - UI: v1.0.365 Contributors Thanks to the following people for making this feature possible: - DekusDenial - tkyi Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Pipeline page redesign

January 25, 2019
Efficient personal search at large scale January 21, 2019
January 21, 2019
Share

Efficient personal search at large scale

Vespa includes a relatively unknown mode which provides personal search at massive scale for a fraction of the cost of alternatives: streaming search. In this article we explain streaming search and how to use it. Imagine you are tasked with building the next Gmail, a massive personal data store centered around search. How do you do it? An obvious answer is to just use a regular search engine, write all documents to a big index and simply restrict queries to match documents belonging to a single user. This works, but the problem is cost. Successful personal data stores have a tendency to become massive — the amount of personal data produced in the world outweighs public data by many orders of magnitude. Storing indexes in addition to raw data means paying for extra disk space for all this data and paying for the overhead of updating this massive index each time a user changes or adds data. Index updates are costly, especially when they need to be handled in real time, which users often expect for their own data. Systems like Gmail handle billions of writes per day so this quickly becomes the dominating cost of the entire system. However, when you think about it there’s really no need to go through the trouble of maintaining global indexes when each user only searches her own data. What if we just maintain a separate small index per user? This makes both index updates and queries cheaper, but leads to a new problem: writes will arrive randomly over all users, which means we’ll need to read and write a user’s index on every update without help from caching. A billion writes per day translates to about 25k read-and-write operations per second at peak. Handling traffic at that scale either means using a few thousand spinning disks, or storing all data on SSDs. Both options are expensive. Large-scale data stores already solve this problem for appending writes, by using some variant of multilevel log storage. Could we leverage this to layer the index on top of a data store like that? That helps, but means we need to do our own development to put these systems together in a way that performs at scale every time for both queries and writes. And we still pay the cost of storing the indexes in addition to the raw user data. Do we need indexes at all though? With some reflection, it turns out that we don’t. Indexes consist of pointers from words/tokens to the documents containing them. This allows us to find those documents faster than would be possible if we had to read the content of the documents to find the right ones, of course at the considerable cost of maintaining those indexes. In personal search however, any query only accesses a small subset of the data, and the subsets are known in advance. If we take care to store the data of each subset together we can achieve search with low latency by simply reading the data at query time — what we call streaming search. In most cases, most subsets of data (i.e. most users) are so small that this can be done serially on a single node. Subsets of data that are too large to stream quickly on a single node can be split over multiple nodes streaming in parallel.

Numbers

How many documents can be searched per node per second with this solution? Assuming a node with 500 Mb/sec read speed (either from an SSD or multiple spinning disks), and 1k average compressed document size, the disk can search at most 500 Mb/sec / 1k/doc = 500,000 docs/sec. If each user stores 1000 documents on average, this gives a max throughput per node of 500 queries/second.
This is not an exact computation, since we disregard time used to seek and write, and inefficiency from reading non-compacted data on one hand, and assume an overly pessimistic zero effect from caching on the other, but it is a good indication that our solution is cost effective. What about latency? From the calculation above we see that the latency from finding the matching documents will be 2 ms on average. However, we usually care more about the 99% latency (or similar). This will be driven by large users, which need to be split among multiple nodes streaming in parallel. The max data size per node is then a tradeoff between latency for such users and the overall cost of executing their queries (fewer nodes per query is cheaper). For example, we can choose to store a maximum of 50,000 documents per user per node, such that we get a max latency of 100 ms per query. Lastly, the total number of nodes decides the max parallelism and hence latency for the very largest users. For example, with 20 nodes in total in a cluster we can support 20 * 50k = 1 million documents for a single user with 100 ms latency.

Streaming search

All right — with this we have our cost-effective solution to implement the next Gmail: Store just the raw data of users, in a log-level store. Locate the data of each user on a single node in the system for locality (or, really 2–3 nodes for redundancy), but split over multiple nodes for users that grow large. Implement a fully functional search and relevance engine on top of the raw data store, which distributes queries to the right set of nodes for each user and merges the results. This will be cheap and efficient, but it sounds like a lot of work! It sure would be nice if somebody already did all of it, ran it at large scale for years and then released it as open source. Well, as luck would have it we already did this in Vespa. In addition to the standard indexing mode, Vespa includes a streaming mode for documents which provides this solution, implemented by layering the full search engine functionality over the raw data store built into Vespa. When this solution is compared to indexed search in Vespa, or to more complicated sharding solutions in Elasticsearch, for personal search applications, we typically see about an order of magnitude reduction in the cost of achieving a system which can sustain the query and update rates needed by the application with stable latencies over long time periods. It has been used to implement various applications such as storing and searching massive amounts of mails, personal typeahead suggestions, personal image collections, and private forum group content.

Using streaming search on Vespa

The steps to use streaming search on Vespa are:
- Set streaming mode for the document type(s) in question in services.xml.
- Write documents with a group name (e.g. a user id) in their id, by setting g=[groupid] in the third part of the document id, as in e.g. id:mynamespace:mydocumenttype:g=user123:doc123
- Pass the group id in queries by setting the query property streaming.groupname.
That’s it! With those steps you have created a scalable, battle-proven personal search solution which is an order of magnitude cheaper than any alternative out there, with full support for structured and text search, advanced relevance including natural language and machine-learned models, and powerful grouping and aggregation for features like faceting. For more details see the documentation on streaming search. Have fun with it, and as usual let us know what you are building!
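To make the back-of-envelope numbers above easy to replay, here is a small self-contained Java sketch of the same capacity arithmetic. The input figures (500 Mb/sec read speed, 1k per document, 1000 documents per user, 50,000 documents per user per node) come straight from the post and are assumptions you should replace with your own measurements.

public class StreamingSearchCapacity {
    public static void main(String[] args) {
        double readBytesPerSec = 500_000_000.0;   // 500 Mb/sec sequential read speed
        double bytesPerDoc = 1_000.0;             // 1k average compressed document size
        double docsPerUser = 1_000.0;             // average documents per user
        long docsPerUserPerNode = 50_000;         // chosen max documents per user per node

        double docsScannedPerSec = readBytesPerSec / bytesPerDoc;            // 500,000 docs/sec
        double queriesPerSec = docsScannedPerSec / docsPerUser;              // ~500 queries/sec per node
        double avgLatencyMs = docsPerUser / docsScannedPerSec * 1000;        // ~2 ms to scan one user
        double maxLatencyMs = docsPerUserPerNode / docsScannedPerSec * 1000; // ~100 ms for a full node share

        System.out.printf("Docs scanned per node per second: %.0f%n", docsScannedPerSec);
        System.out.printf("Max queries per node per second:  %.0f%n", queriesPerSec);
        System.out.printf("Average latency per user:         %.1f ms%n", avgLatencyMs);
        System.out.printf("Max latency per node share:       %.1f ms%n", maxLatencyMs);
    }
}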

Efficient personal search at large scale

January 21, 2019
Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays January 17, 2019
January 17, 2019
Share

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

By Ashley Wolf, Open Source Program Manager, Verizon Media. The second installment of Dash Open is ready for you to tune in! In this episode, Gil Yehuda, Sr. Director of Open Source at Verizon Media, interviews Dav Glass, Distinguished Architect of IaaS and Node.js at Verizon Media. Dav discusses how open source inspired him to start HackSI, a Hack Day for all ages, as well as robotics mentorship programs for the Southern Illinois engineering community. Listen now on iTunes or SoundCloud. Dash Open is your place for interesting conversations about open source and other technologies, from the open source program office at Verizon Media. Verizon Media is the home of many leading brands including Yahoo, Aol, Tumblr, TechCrunch, and many more. Follow us on Twitter @YDN and on LinkedIn.

Dash Open Podcast: Episode 02 - Building Community and Mentorship around Hackdays

January 17, 2019
Meta PR Comments January 10, 2019
January 10, 2019
Share

Meta PR Comments

Screwdriver now supports commenting on pull requests through Screwdriver build meta. This feature allows users to add custom data such as coverage results to the Git pull request. Screwdriver Users To add a comment to a pull request build, Screwdriver users can configure their screwdriver.yaml with steps as shown below:
jobs:
  main:
    steps:
      - postdeploy: |
          meta set meta.summary.coverage "Coverage increased by 15%"
          meta set meta.summary.markdown "this markdown comment is **bold** and *italic*"
These commands will result in a comment in Git that will look something like: Cluster Admins In order to enable meta PR comments, you’ll need to create a bot user in Git with a personal access token with the public_repo scope. In GitHub, create a new user. Follow the instructions to create a personal access token and set its scope to public_repo. Copy this token and set it as commentUserToken in your scms settings in your API config yaml. You need this headless user for commenting since GitHub requires the public_repo scope in order to comment on pull requests (https://github.community/t5/How-to-use-Git-and-GitHub/Why-does-GitHub-API-require-admin-rights-to-leave-a-comment-on-a/td-p/357). For more information about GitHub scopes, see https://developer.github.com/apps/building-oauth-apps/understanding-scopes-for-oauth-apps. Compatibility List In order to use the new meta PR comments feature, you will need these minimum versions: - API: v0.5.545 Contributors Thanks to the following people for making this feature possible: - tkyi Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Meta PR Comments

January 10, 2019
Multiple Build Cluster January 3, 2019
January 3, 2019
Share

Multiple Build Cluster

Screwdriver now supports running builds across multiple build clusters. This feature allows Screwdriver to provide a native hot/hot HA solution with multiple clusters on standby. This also opens up the possibility for teams to run their builds in their own infrastructure. Screwdriver Users To specify a build cluster, Screwdriver users can configure their screwdriver.yamls using annotations as shown below:
jobs:
  main:
    annotations:
      screwdriver.cd/buildClusters: us-west-1
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Users can view a list of available build clusters at /v4/buildclusters. Without the annotation, Screwdriver assigns builds to a default cluster that is managed by the Screwdriver team. Users can assign their build to run in any cluster they have access to (the default cluster or any external cluster that their repo is allowed to use, which is indicated by the field scmOrganizations). Contact your cluster admin if you want to onboard your own build cluster. Cluster Admins Screwdriver cluster admins can refer to the following issues and design doc to set up multiple build clusters properly. - Design: https://github.com/screwdriver-cd/screwdriver/blob/master/design/build-clusters.md - Feature issue: https://github.com/screwdriver-cd/screwdriver/issues/1319 Compatibility List In order to use the new build clusters feature, you will need these minimum versions: - API: v0.5.537 - Scheduler: v2.4.2 - Buildcluster-queue-worker: v1.1.3 Contributors Thanks to the following people for making this feature possible: - minz1027 - parthasl - tkyi Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Multiple Build Cluster

January 3, 2019
Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More December 27, 2018
December 27, 2018
Share

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

By Chris Larsen, Architect. OpenTSDB is one of the first dedicated open source time series databases built on top of Apache HBase and the Hadoop Distributed File System. Today, we are proud to share that version 2.4.0 is now available and has many new features developed in-house and with contributions from the open source community. This release would not have been possible without support from our monitoring team, the Hadoop and HBase developers, as well as contributors from other companies like Salesforce, Alibaba, JD.com, Arista and more. Thank you to everyone who contributed to this release! A few of the exciting new features include: Rollup and Pre-Aggregation Storage As time series data grows, storing the original measurements becomes expensive. Particularly in the case of monitoring workflows, users rarely care about last year’s high-fidelity data. It’s more efficient to store lower resolution “rollups” for longer periods, discarding the original high-resolution data. OpenTSDB now supports storing and querying such data so that the raw data can expire from HBase or Bigtable, and the rollups can stick around longer. Querying for long time ranges will read from the lower resolution data, fetching fewer data points and speeding up queries. Likewise, when a user wants to query tens of thousands of time series grouped by, for example, data centers, the TSD will have to fetch and process a significant amount of data, making queries painfully slow. To improve query speed, pre-aggregated data can be stored and queried to fetch much less data at query time, while still retaining the raw data. We have an Apache Storm pipeline that computes these rollups and pre-aggregates, and we intend to open source that code in 2019. For more details, please visit http://opentsdb.net/docs/build/html/user_guide/rollups.html. Histograms and Sketches When monitoring or performing data analysis, users often like to explore percentiles of their measurements, such as the 99.9th percentile of website request latency to detect issues and determine what consumers are experiencing. Popular metrics collection libraries will happily report percentiles for the data they collect. Yet while querying for the original percentile data for a single time series is useful, trying to query and combine the data from multiple series is mathematically incorrect, leading to errant observations and problems. For example, if you want the 99.9th percentile of latency in a particular region, you can’t just sum or recompute the 99.9th of the 99.9th percentile. To solve this issue, we needed a complex data structure that can be combined to calculate an accurate percentile. One such structure that has existed for a long time is the bucketed histogram, where measurements are sliced into value ranges and each range maintains a count of measurements that fall into that bucket. These buckets can be sized based on the required accuracy and the counts from multiple sources (sharing the same bucket ranges) combined to compute an accurate percentile. Bucketed histograms can be expensive to store for highly accurate data, as many buckets and counts are required. Additionally, many measurements don’t have to be perfectly accurate but they should be precise. Thus another class of algorithms could be used to approximate the data via sampling and provide highly precise data with a fixed interval.
Data scientists at Yahoo (now part of Oath) implemented a great Java library called Data Sketches that implements the Stochastic Streaming Algorithms to reduce the amount of data stored for high-throughput services. Sketches have been a huge help for the OLAP storage system Druid (also sponsored by Oath) and Bullet, Oath’s open source real-time data query engine. The latest TSDB version supports bucketed histograms, Data Sketches, and T-Digests. Some additional features include: - HBase Date Tiered Compaction support to improve storage efficiency. - A new authentication plugin interface to support enterprise use cases. - An interface to support fetching data directly from Bigtable or HBase rows using a search index such as ElasticSearch. This improves queries for small subsets of high cardinality data and we’re working on open sourcing our code for the ES schema. - Greater UID cache controls and an optional LRU implementation to reduce the amount of JVM heap allocated to UID to string mappings. - Configurable query size and time limits to avoid OOMing a JVM with large queries. Try the releases on GitHub and let us know of any issues you run into by posting on GitHub issues or the OpenTSDB Forum. Your feedback is appreciated! OpenTSDB 3.0 Additionally, we’ve started on 3.0, which is a rewrite that will support a slew of new features including: - Querying and analyzing data from the plethora of new time series stores. - A fully configurable query graph that allows for complex queries OpenTSDB 1x and 2x couldn’t support. - Streaming results to improve the user experience and avoid overwhelming a single query node. - Advanced analytics including support for time series forecasting with Yahoo’s EGADs library. Please join us in testing out the current 3.0 code, reporting bugs, and adding features.
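Coming back to the bucketed-histogram idea described above, the following toy Java sketch (not OpenTSDB code) shows why mergeable summaries matter: per-source bucket counts can simply be summed and a percentile estimated from the merged counts, which is exactly what you cannot do by averaging per-source percentiles. The bucket bounds and sample data are made up for the example.

import java.util.Arrays;

public class BucketedHistogram {
    // Shared bucket upper bounds in milliseconds (an assumption for the example).
    static final double[] BOUNDS = {1, 2, 5, 10, 25, 50, 100, 250, 500, 1000};
    final long[] counts = new long[BOUNDS.length];

    void record(double valueMs) {
        for (int i = 0; i < BOUNDS.length; i++) {
            if (valueMs <= BOUNDS[i]) { counts[i]++; return; }
        }
        counts[BOUNDS.length - 1]++; // clamp to the last bucket
    }

    // Histograms from different sources (sharing the same bucket ranges) merge by adding counts.
    void merge(BucketedHistogram other) {
        for (int i = 0; i < counts.length; i++) counts[i] += other.counts[i];
    }

    // Estimate a percentile (e.g. 0.999) as the upper bound of the bucket containing it.
    double percentile(double p) {
        long total = Arrays.stream(counts).sum();
        long target = (long) Math.ceil(p * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) return BOUNDS[i];
        }
        return BOUNDS[BOUNDS.length - 1];
    }

    public static void main(String[] args) {
        BucketedHistogram dcWest = new BucketedHistogram();
        BucketedHistogram dcEast = new BucketedHistogram();
        for (int i = 0; i < 10_000; i++) dcWest.record(5 + Math.random() * 20);
        for (int i = 0; i < 10_000; i++) dcEast.record(50 + Math.random() * 400);
        dcWest.merge(dcEast); // combined view across data centers
        System.out.println("Estimated 99.9th percentile: " + dcWest.percentile(0.999) + " ms");
    }
}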

Announcing OpenTSDB 2.4.0: Rollup and Pre-Aggregation Storage, Histograms, Sketches, and More

December 27, 2018
Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping December 14, 2018
December 14, 2018
Share

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping

Hi Vespa Community! Today we’re kicking off a blog post series of need-to-know updates on Vespa, summarizing the features and fixes detailed in Github issues. We welcome your contributions and feedback about any new features or improvements you’d like to see. For December, we’re excited to share the following product news: Streaming Search Performance Improvement Streaming Search is a solution for applications where each query only searches a small, statically determined subset of the corpus. In this case, Vespa searches without building reverse indexes, reducing storage cost and making writes more efficient. With the latest changes, the document type is used to further limit data scanning, resulting in lower latencies and higher throughput. Read more here. ONNX Integration ONNX is an open ecosystem for interchangeable AI models. Vespa now supports importing models in the ONNX format and transforming the models into Tensors for use in ranking. This adds to the TensorFlow import included earlier this year and allows Vespa to support many training tools. While Vespa’s strength is real-time model evaluation over large datasets, to get started using single data points, try the stateless model evaluation API. Explore this integration more in Ranking with ONNX models. Precise Transaction Log Pruning Vespa is built for large applications running continuous integration and deployment. This means nodes restart often for software upgrades, and node restart time matters. A common pattern is serving while restarting hosts one by one. Vespa has optimized transaction log pruning with prepareRestart, due to flushing as much as possible before stopping, which is quicker than replaying the same data after restarting. This feature is on by default. Learn more in live upgrade and prepareRestart. Grouping on Maps Grouping is used to implement faceting. Vespa has added support to group using map attribute fields, creating a group for values whose keys match the specified key, or field values referenced by the key. This support is useful to create indirections and relations in data and is great for use cases with structured data like e-commerce. Leverage key values instead of field names to simplify the search definition. Read more in Grouping on Map Attributes. Questions or suggestions? Send us a tweet or an email.

Vespa Product Updates, December 2018: ONNX Import and Map Attribute Grouping

December 14, 2018
A New Chapter for Omid December 6, 2018
December 6, 2018
Share

A New Chapter for Omid

By Ohad Shacham, Yonatan Gottesman, and Edward Bortnikov, Scalable Systems Research, Verizon/Oath. Omid, an open source transaction processing platform for Big Data, was born as a research project at Yahoo (now part of Verizon), and became an Apache Incubator project in 2015. Omid complements Apache HBase, a distributed key-value store in the Apache Hadoop suite, with a capability to clip multiple operations into logically indivisible (atomic) units named transactions. This programming model has been extremely popular since the dawn of SQL databases, and has more recently become indispensable in the NoSQL world. For example, it is the centerpiece for dynamic content indexing of search and media products at Verizon, powering a web-scale content management platform since 2015. Today, we are excited to share a new chapter in Omid’s history. Thanks to its scalability, reliability, and speed, Omid has been selected as the transaction management provider for Apache Phoenix, a real-time converged OLTP and analytics platform for Hadoop. Phoenix provides a standard SQL interface to HBase key-value storage, which is much simpler and in many cases more performant than the native HBase API. With Phoenix, big data and machine learning developers get the best of all worlds: increased productivity coupled with high scalability. Phoenix is designed to scale to 10,000 query processing nodes in one instance and is expected to process hundreds of thousands or even millions of transactions per second (tps). It is widely used in the industry, including by Alibaba, Bloomberg, PubMatic, Salesforce, Sogou and many others. We have just released a new and significantly improved version of Omid (1.0.0), the first major release since its original launch. We have extended the system with multiple functional and performance features to power a modern SQL database technology, ready for deployment on both private and public cloud platforms. A few of the significant innovations include: Protocol re-design for low latency The early version of Omid was designed for use in web-scale data pipeline systems, which are throughput-oriented by nature. We re-engineered Omid’s internals to now support new ultra-low-latency OLTP (online transaction processing) applications, like messaging and algo-trading. The new protocol, Omid Low Latency (Omid LL), dissipates Omid’s major architectural bottleneck. It reduces the latency of short transactions by 5 times under light load, and by 10 to 100 times under heavy load. It also scales the overall system throughput to 550,000 tps while remaining within real-time latency SLAs. Figure 1 (throughput vs. latency, for transaction sizes of 1 op and 10 ops) illustrates Omid LL scaling versus legacy Omid for short and long transactions: the throughput scales beyond 550,000 tps while the latency remains flat (low milliseconds). ANSI SQL support Phoenix provides secondary indexes for SQL tables — a centerpiece tool for efficient access to data by multiple keys. The CREATE INDEX command is on-demand; it is not allowed to block already deployed applications. We added Omid support for accomplishing this without impeding concurrent database operations or sacrificing consistency. We further introduced a mechanism to avoid recursive read-your-own-writes scenarios in complex queries, like “INSERT INTO T … SELECT FROM T …” statements.
This was achieved by extending Omid’s traditional Snapshot Isolation consistency model, which provides single-read-point, single-write-point semantics, with multiple read and write points. Performance improvements Phoenix extensively employs stored procedures implemented as HBase filters in order to eliminate the overhead of multiple round-trips to the data store. We integrated Omid’s code within such HBase-resident procedures, allowing for a smooth integration with Phoenix and also reducing the overhead of transactional reads (for example, filtering out redundant data versions). We collaborated closely with the Phoenix developer community while working on this project, and contributed code to Phoenix that made Omid’s integration possible. We look forward to seeing Omid’s adoption through a wide range of Phoenix applications. We always welcome new developers to join the community and help push Omid forward!

A New Chapter for Omid

December 6, 2018
Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th November 27, 2018
November 27, 2018
Share

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

Hi Vespa Community, If you are in Seattle on November 29th, please join Jon Bratseth (Distinguished Architect, Oath) at a machine learning meetup hosted by Zillow. Jon will share a Vespa overview and answer any questions about Oath’s open source big data serving engine. Eric Ringger (Director of Machine Learning for Personalization, Zillow) will discuss some of the models used to help users find homes, including collaborative filtering, a content-based model, and deep learning. Learn more and RSVP here. Hope you can join! The Vespa Team

Join us at the Machine Learning Meetup hosted by Zillow in Seattle on November 29th

November 27, 2018
Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference November 26, 2018
November 26, 2018
Share

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

By Ganesh Harinath, VP Engineering, AI Platform & Applications, Oath. If you’re attending the upcoming Telco Data Analytics and AI Conference in San Francisco, make sure to join my keynote talk. I’ll be presenting “Building a Terabyte Scale Machine Learning Application” on November 28th at 10:10 am PST. You’ll learn about how Oath builds AI platforms at scale. My presentation will focus on our approach and experience at Oath in architecting and using frameworks to build machine learning models at terabyte scale, in near real-time. I’ll also highlight Trapezium, an open source framework based on Spark, developed by Oath’s Big Data and Artificial Intelligence (BDAI) team. I hope to catch you at the conference. If you would like to connect, reach out to me. If you’re unable to attend the conference and are curious about the topics shared in my presentation, follow @YDN on Twitter and we’ll share highlights during and after the event.

Oath’s VP of AI invites you to learn how to build a Terabyte Scale Machine Learning Application at TDA Conference

November 26, 2018
Introducing the Dash Open Podcast, sponsored by Yahoo Developer... November 19, 2018
November 19, 2018
Share

Introducing the Dash Open Podcast, sponsored by Yahoo Developer...

Introducing the Dash Open Podcast, sponsored by Yahoo Developer Network By Ashley Wolf, Principal Technical Program Manager, Oath Is open source the wave of the future, or has it seen its best days already? Which Big Data and AI trends should you be aware of and why? What is 5G and how will it impact the apps you enjoy using? You’ve got questions and we know smart people; together we’ll get answers. Introducing the Dash Open podcast, sponsored by the Yahoo Developer Network and produced by the Open Source team at Oath. Dash Open will share interesting conversations about tech and the people who spend their day working in tech. We’ll look at the state of technology through the lens of open source; keeping you up-to-date on the trends we’re seeing across the internet. Why Dash Open? Because it’s like a command line argument reminding the command to be open. What can you expect from Dash Open? Interviews with interesting people, occasional witty banter, and a catchy theme song. In the first episode, Rosalie Bartlett, Open Source community manager at Oath, interviews Gil Yehuda, Senior Director of Open Source at Oath. Tune in to hear one skeptic’s journey from resisting the open source movement to heading one of the more prolific Open Source Program Offices (OSPO). Gil highlights the benefits of open source to companies and provides actionable advice on how technology companies can start or improve their OSPO. Give Dash Open a listen and tell us what topics you’d like to hear next. – Ashley Wolf manages the Open Source Program at Oath/Verizon Media Group.

Introducing the Dash Open Podcast, sponsored by Yahoo Developer...

November 19, 2018
Git Shallow Clone November 12, 2018
November 12, 2018
Share

Git Shallow Clone

Previously, Screwdriver would clone the entire commit tree of a Git repository. In most cases, this was unnecessary since most builds only require the latest single commit. For repositories containing immense commit trees, this behavior led to unnecessarily long build times. To address this issue, Screwdriver now defaults to shallow cloning Git repositories with a depth of 50. Screwdriver will also enable the --no-single-branch flag by default in order to enable access to other branches in the repository. To disable shallow cloning, simply set the GIT_SHALLOW_CLONE environment variable to false. Example:
jobs:
  main:
    environment:
      GIT_SHALLOW_CLONE: false
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Here is a comparison of the build speed improvement for a repository containing roughly 160k commits. Before: After: For more information, please consult the Screwdriver V4 FAQ. Compatibility List In order to use the new Git shallow clone feature, you will need these minimum versions: - screwdrivercd/screwdriver: v0.5.501 Contributors Thanks to the following people for making this feature possible: - Filbird Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Git Shallow Clone

November 12, 2018
Hadoop Contributors Meetup at Oath November 8, 2018
November 8, 2018
Share

Hadoop Contributors Meetup at Oath

By Scott Bush, Director, Hadoop Software Engineering, Oath. On Tuesday, September 25, we hosted a special day-long Hadoop Contributors Meetup at our Sunnyvale, California campus. Much of the early Hadoop development work started at Yahoo, now part of Oath, and has continued over the past decade. Our campus was the perfect setting for this meetup, as we continue to make Hadoop a priority. More than 80 Hadoop users, contributors, committers, and PMC members gathered to hear talks on key issues facing the Hadoop user community. Speakers from Ampool, Cloudera, Hortonworks, Microsoft, Oath, and Twitter detailed some of the challenges and solutions pertinent to their parts of the Hadoop ecosystem. The talks were followed by a number of parallel, birds of a feather breakout sessions to discuss HDFS, Tez, containers and low latency processing. The day ended with a reception and consensus that the event went well and should be repeated in the near future. Presentation recordings (YouTube playlist) and slides (links included in the video description) are available here: - Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda Tan, Hortonworks - Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara, Botong Huang - “HDFS Scalability and Security”, Daryn Sharp, Senior Engineer, Oath - The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool - Moving the Oath Grid to Docker, Eric Badger, Software Developer Engineer, Oath - Vespa: Open Source Big Data Serving Engine, Jon Bratseth, Distinguished Architect, Oath - Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shane Kumpf, Hortonworks - How Twitter Hadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu Thank you to all the presenters and the attendees both in person and remote! P.S. We’re hiring! Learn more about career opportunities at Oath.

Hadoop Contributors Meetup at Oath

November 8, 2018
Build Cache November 7, 2018
November 7, 2018
Share

Build Cache

Screwdriver now has the ability to cache and restore files and directories from your builds for use in other builds! This feature gives you the option to cache artifacts in builds using Gradle, NPM, Maven etc. so subsequent builds can save time on commonly-run steps such as dependency installation and package build. You can now specify a top-level setting in your screwdriver.yaml called cache that contains file paths from your build that you would like to cache. You can limit access to the cache at a pipeline, event, or job-level scope. Scope guide
- pipeline-level: all builds in the same pipeline (across different jobs and events)
- event-level: all builds in the same event (across different jobs)
- job-level: all builds for the same job (across different events in the same pipeline)
Example
cache:
  event:
    - $SD_SOURCE_DIR/node_modules
  pipeline:
    - ~/.gradle
  job:
    test-job: [/tmp/test]
In the above example, we cache the .gradle folder so that subsequent builds in the pipeline can save time on gradle install. Without cache: With cache: Compatibility List In order to use the new build cache feature, you will need these minimum versions: - screwdrivercd/queue-worker:v2.2.2 - screwdrivercd/screwdriver:v0.5.492 - screwdrivercd/launcher:v5.0.37 - screwdrivercd/store:v3.3.11 Note: Please ensure the store service has sufficient available memory to handle the payload. For cache cleanup, we use AWS S3 Lifecycle Management. If your store service is not configured to use S3, you might need to add a cleanup mechanism. Contributors Thanks to the following people for making this feature possible: - d2lam - pranavrc Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Build Cache

November 7, 2018
Sharing Vespa at the SF Big Analytics Meetup October 19, 2018
October 19, 2018
Share

Sharing Vespa at the SF Big Analytics Meetup

By Jon Bratseth, Distinguished Architect, Oath. I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here):

Sharing Vespa at the SF Big Analytics Meetup

October 19, 2018
Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup October 17, 2018
October 17, 2018
Share

Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup

By Jon Bratseth, Distinguished Architect, Oath. I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here): Largely developed by Yahoo engineers, Vespa is our big data processing and serving engine, available as open source on GitHub. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance and Oath Ads Platforms. Vespa use is growing even more rapidly; since it is open source under a permissive Apache license, Vespa can power other external third-party apps as well. A great example is Zedge, which uses Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS, and Web). Zedge uses Vespa in production to serve millions of monthly active users. Visit https://vespa.ai/ to learn more and download the code. We encourage code contributions and welcome opportunities to collaborate.

Sharing Vespa (Open Source Big Data Serving Engine) at the SF Big Analytics Meetup

October 17, 2018
Open-Sourcing Panoptes, Oath’s distributed network telemetry collector October 4, 2018
October 4, 2018
Share

Open-Sourcing Panoptes, Oath’s distributed network telemetry collector

By Ian Flint, Network Automation Architect, and Varun Varma, Senior Principal Engineer. The Oath network automation team is proud to announce that we are open-sourcing Panoptes, a distributed system for collecting, enriching and distributing network telemetry. We developed Panoptes to address several issues inherent in legacy polling systems, including overpolling due to multiple point solutions for metrics, and a lack of data normalization, consistent data enrichment, and integration with infrastructure discovery systems. Panoptes is a pluggable, distributed, high-performance data collection system which supports multiple polling formats, including SNMP and vendor-specific APIs. It is also extensible to support emerging streaming telemetry standards, including gNMI. Architecture The following block diagram shows the major components of Panoptes: Panoptes is written primarily in Python, and leverages multiple open-source technologies to provide the most value for the least development effort. At the center of Panoptes is a metrics bus implemented on Kafka. All data plane transactions flow across this bus; discovery publishes devices to the bus, polling publishes metrics to the bus, and numerous clients read the data off of the bus for additional processing and forwarding. This architecture enables easy data distribution and integration with other systems. For example, in preparing for open-source, we identified a need for a generally available time series datastore. We developed, tested and released a plugin to push metrics into InfluxDB in under a week. This flexibility allows Panoptes to evolve with industry standards. Check scheduling is accomplished using Celery, a horizontally scalable, open-source scheduler utilizing a Redis data store. Celery’s scalable nature combined with Panoptes’ distributed nature yields excellent scalability. Across Oath, Panoptes currently runs hundreds of thousands of checks per second, and the infrastructure has been tested to more than one million checks per second. Panoptes ships with a simple, CSV-based discovery system. Integrating Panoptes with a CMDB is as simple as writing an adapter to emit a CSV, and importing that CSV into Panoptes. From there, Panoptes will manage the task of scheduling polling for the desired devices. Users can also develop custom discovery plugins to integrate with their CMDB and other device inventory data sources. Finally, any metrics gathering system needs a place to send the metrics. Panoptes’ initial release includes an integration with InfluxDB, an industry-standard time series store. Combined with Grafana and the InfluxData ecosystem, this gives teams the ability to quickly set up a fully-featured monitoring environment. Deployment at Oath At Oath, we anticipate significant benefits from building Panoptes. We will consolidate four siloed polling solutions into one, reducing overpolling and the associated risk of service interruption. As vendors move toward streaming telemetry, Panoptes’ flexible architecture will minimize the effort required to adopt these new protocols. There is another, less obvious benefit to a system like Panoptes. As is the case with most large enterprises, a massive ecosystem of downstream applications has evolved around our existing polling solutions. Panoptes allows us to continue to populate legacy datastores without continuing to run the polling layers of those systems.
This is because Panoptes’ data bus enables multiple metrics consumers, so we can send metrics to both current and legacy datastores. At Oath, we have deployed Panoptes in a tiered, federated model. We install the software in each of our major data centers and proxy checks out to smaller installations such as edge sites.  All metrics are polled from an instance close to the devices, and metrics are forwarded to a centralized time series datastore. We have also developed numerous custom applications on the platform, including a load balancer monitor, a BGP session monitor, and a topology discovery application. The availability of a flexible, extensible platform has greatly reduced the cost of producing robust network data systems. Easy Setup Panoptes’ open-source release is packaged for easy deployment into any Linux-based environment. Deployment is straightforward, so you can have a working system up in hours, not days. We are excited to share our internal polling solution and welcome engineers to contribute to the codebase, including contributing device adapters, metrics forwarders, discovery plugins, and any other relevant data consumers.   Panoptes is available at https://github.com/yahoo/panoptes, and you can connect with our team at network-automation@oath.com.
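Panoptes itself is written in Python, but the metrics-bus pattern described above is easy to picture with a generic consumer. The sketch below (Java, using the standard Kafka client library as a build dependency) shows a downstream consumer reading metrics off a bus and forwarding them; the broker address, consumer group, and topic name are assumptions for the example, not Panoptes' actual configuration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetricsBusConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "influxdb-forwarder");       // one consumer group per downstream system
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics")); // assumed topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real forwarder this is where you would enrich the metric
                    // and write it to a time series store such as InfluxDB.
                    System.out.printf("device=%s metric=%s%n", record.key(), record.value());
                }
            }
        }
    }
}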

Open-Sourcing Panoptes, Oath’s distributed network telemetry collector

October 4, 2018
Configurable Build Resources October 2, 2018
October 2, 2018
Share

Configurable Build Resources

We’ve expanded build resource configuration options for Screwdriver! Screwdriver allows users to specify varying tiers of build resources via annotations. Previously, users were able to configure cpu and ram between the three tiers: micro, low (default), and high. In our recent change, we are introducing a new configurable resource, disk, which can be set to either low (default) or high. Furthermore, we are adding an extra tier, turbo, to both the cpu and ram resources! Please note that although Screwdriver provides default values for each tier, their actual values are determined by the cluster admin. Resources tier: Screwdriver Users In order to use these new settings, Screwdriver users can configure their screwdriver.yamls using annotations as shown below: Example:
jobs:
  main:
    annotations:
      screwdriver.cd/cpu: TURBO
      screwdriver.cd/disk: HIGH
      screwdriver.cd/ram: MICRO
    image: node:8
    steps:
      - hello: echo hello
    requires: [~pr, ~commit]
Cluster Admins Screwdriver cluster admins can refer to the following issues to set up turbo and disk resources properly. - Turbo resources: https://github.com/screwdriver-cd/screwdriver/issues/1318#issue-364993739 - Disk resources: https://github.com/screwdriver-cd/screwdriver/issues/757#issuecomment-425589405 Compatibility List In order to use these new features, you will need these minimum versions: - screwdrivercd/queue-worker: v2.2.2 Contributors Thanks to the following people for making this feature possible: - Filbird - minz1027 Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support

Configurable Build Resources

October 2, 2018
Apache Pulsar graduates to Top-Level Project September 25, 2018
September 25, 2018
Share

Apache Pulsar graduates to Top-Level Project

By Joe Francis, Director, Storage & Messaging. We’re excited to share that The Apache Software Foundation announced today that Apache Pulsar has graduated from the incubator to a Top-Level Project. Apache Pulsar is an open-source distributed pub-sub messaging system, created by Yahoo in June 2015 and submitted to the Apache Incubator in June 2017. Apache Pulsar is integral to the streaming data pipelines supporting Oath’s core products including Yahoo Mail, Yahoo Finance, Yahoo Sports and Oath Ad Platforms. It handles hundreds of billions of data events each day and is an integral part of our hybrid cloud strategy. It enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds. Oath continues to support Apache Pulsar, with contributions including best-effort messaging, the load balancer and end-to-end encryption. With growing data needs handled by Apache Pulsar at Oath, we’re focused on reducing memory pressure in brokers and bookkeepers, and creating additional connectors to other large-scale systems. Apache Pulsar’s future is bright and we’re thrilled to be part of this great project and community. P.S. We’re hiring! Learn more here.

Pipeline pagination on the Search page
September 20, 2018

We’ve recently added pagination to the pipelines on the Search page! Before pipeline pagination, when a user visited the Search page (e.g. /search), all pipelines were fetched from the API and sorted alphabetically in the UI. In order to improve the total page load time, we moved the burden of pagination from the UI to the API. Now, when a user visits the Search page, only the first page of pipelines is fetched by default. Clicking the Show More button triggers the fetching of the next page of pipelines. All the pagination and search logic has moved to the datastore, so the overall load time for fetching a page of search results is now under 2 seconds, compared to before, when some search queries could take more than 10 seconds.

Screwdriver Cluster Admins
In order to use these latest changes fully, Screwdriver cluster admins will need to run some SQL queries to migrate data from scmRepo to the new name field. This name field will be used for sorting and searching in the Search UI.

Without migrating
If no migration is done, pipelines will show up sorted by id in the Search page. Pipelines will not be returned in search results until a sync or update is done on them (either directly from the UI or by interacting with the pipeline in some way in the UI).

Steps to migrate
1. Pull in the new API (v0.5.466). This is necessary for the name column to be created in the DB.
2. Take a snapshot or back up your DB.
3. Set the pipeline name. This requires two calls in postgres: one to extract the pipeline name data, the second to remove the curly braces ({ and }) injected by the regexp call. In postgresql, run:
UPDATE public.pipelines SET name = regexp_matches("scmRepo", '.*name":"(.*)",.*')
UPDATE public.pipelines SET name = btrim(name, '{}')
4. Pull in the new UI (v1.0.331).
5. Optionally, you can post a banner to let users know they might need to sync their pipelines if they are not showing up in search results. Make an API call to POST /banners with proper auth and a body like:
{
  "message": "If your pipeline is not showing up in Search results, go to the pipeline Options tab and Sync the pipeline.",
  "isActive": true,
  "type": "info"
}

Compatibility List
The Search page pipeline pagination requires the following minimum versions of Screwdriver:
- API: v0.5.466
- UI: v1.0.331

Contributors
Thanks to the following people who made this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
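As a sketch of the banner call from step 5 of the migration above, the script below posts the banner with the requests library; it assumes the banners route lives under /v4 like the rest of the API and that you already hold an admin JWT (for example, obtained via /v4/auth/token), so adjust both for your cluster.

from requests import post

API_URL = 'https://api.screwdriver.cd/v4'   # adjust to your cluster's API
JWT = 'your-admin-jwt-here'                 # obtained via /v4/auth/token

banner = {
    'message': 'If your pipeline is not showing up in Search results, '
               'go to the pipeline Options tab and Sync the pipeline.',
    'isActive': True,
    'type': 'info'
}

# POST the banner with admin auth, as described in step 5.
resp = post('%s/banners' % API_URL,
            headers={'Authorization': 'Bearer %s' % JWT},
            json=banner)
print(resp.status_code, resp.json())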

Introducing HaloDB, a fast, embedded key-value storage engine written in Java
September 19, 2018

By Arjun Mannaly, Senior Software Engineer

At Oath, multiple ad platforms use a high throughput, low latency distributed key-value database that runs in data centers all over the world. The database stores billions of records and handles millions of read and write requests per second at millisecond latencies. The data we have in this database must be persistent, and the working set is larger than what we can fit in memory. Therefore, a key component of the database performance is a fast storage engine. Our current solution had served us well, but it was primarily designed for a read-heavy workload and its write throughput started to become a bottleneck as write traffic increased. There were additional concerns as well: it took hours to repair a corrupted DB or to iterate over and delete records. The storage engine also didn’t expose enough operational metrics. The primary concern, though, was write performance, which, based on our projections, would have been a major obstacle for scaling the database. With these concerns in mind, we began searching for an alternative solution. We searched for a key-value storage engine capable of dealing with IO-bound workloads, with submillisecond read latencies under high read and write throughput. After concluding our research and benchmarking alternatives, we didn’t find a solution that worked for our workload, so we were inspired to build HaloDB. Now, we’re glad to announce that it’s also open source and available to use under the terms of the Apache license. HaloDB has given our production boxes a 50% improvement in write capacity while consistently maintaining a submillisecond read latency at the 99th percentile.

Architecture
HaloDB primarily consists of append-only log files on disk and an index of keys in memory. All writes are sequential writes which go to an append-only log file, and the file is rolled over once it reaches a configurable size. Older versions of records are removed to make space by a background compaction job. The in-memory index in HaloDB is a hash table which stores all keys and their associated metadata. The size of the in-memory index, depending on the number of keys, can be quite large, hence for performance reasons it is stored outside the Java heap, in native memory. When looking up the value for a key, the corresponding metadata is first read from the in-memory index and then the value is read from disk. Each lookup request requires at most a single read from disk.

Performance
The chart below shows the results of performance tests with real production data. The read requests were kept at 50,000 QPS while the write QPS was increased. HaloDB scaled very well as we increased the write QPS while consistently maintaining submillisecond read latencies at the 99th percentile. The chart below shows the 99th percentile latency from a production server before and after migration to HaloDB. If HaloDB sounds like a helpful solution to you, please feel free to use it, open issues, and contribute!

Join us in San Francisco on September 26th for a Meetup
September 18, 2018

Hi Vespa Community, Several members from our team will be traveling to San Francisco on September 26th for a meetup and we’d love to chat with you there. Jon Bratseth (Distinguished Architect) will present a Vespa overview and answer any questions. To learn more and RSVP, please visit: https://www.meetup.com/SF-Big-Analytics/events/254461052/. Hope to see you! The Vespa Team

Build step logs download
September 18, 2018

Downloading Step Logs
We have added a Download button in the top right corner of the build log console. Upon clicking the button, the browser will query all or the rest of the log content from our API and compose a client-side downloadable text blob by leveraging the URL.createObjectURL() Web API.

Minor Improvement On Workflow Graph
Thanks to s-yoshika, the link edge is no longer covering the name text of the build node. Also, build jobs with names that exceed 20 characters will be automatically ellipsized to avoid being clipped off by the containing DOM element.

Compatibility List
These UI improvements require the following minimum versions of Screwdriver:
- screwdrivercd/ui: v1.0.329

Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- s-yoshika

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing Oak: an Open Source Scalable Key-Value Map for Big Data Analytics
September 13, 2018

By Dmitry Basin, Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, Hagar Meir, Gali Sheffi

Real-time analytics applications are on the rise. Modern decision support and machine intelligence engines strive to continuously ingest large volumes of data while providing up-to-date insights with minimum delay. For example, in Flurry Analytics, an Oath service which provides mobile developers with rich tools to explore user behavior in real time, it only takes seconds to reflect the events that happened on mobile devices in its numerous dashboards. The scalability demand is immense – as of late 2017, the Flurry SDK was installed on 2.6B devices and monitored 1M+ mobile apps. Mobile data hits the Flurry backend at a huge rate, updates statistics across hundreds of dimensions, and becomes queryable immediately. Flurry harnesses the open-source distributed interactive analytics engine named Druid to ingest data and serve queries at this massive rate.

In order to minimize delays before data becomes available for analysis, technologies like Druid should avoid maintaining separate systems for data ingestion and query serving, and instead strive to do both within the same system. Doing so is nontrivial since one cannot compromise on overall correctness when multiple conflicting operations execute in parallel on modern multi-core CPUs. A promising approach is using concurrent data structure (CDS) algorithms which adapt traditional data structures to multiprocessor hardware. CDS implementations are thread-safe – that is, developers can use them exactly as sequential code while maintaining strong theoretical correctness guarantees. In recent years, CDS algorithms enabled dramatic application performance scaling and became popular programming tools. For example, Java programmers can use the ConcurrentNavigableMap JDK implementations for the concurrent ordered key-value map abstraction that is instrumental in systems like Druid.

Today, we are excited to share Oak, a new open source project from Oath, available under the Apache License 2.0. The project was created by the Scalable Systems team at Yahoo Research. It extends upon our earlier research work, named KiWi. Oak is a Java package that implements OakMap – a concurrent ordered key-value map. OakMap’s API is similar to Java’s ConcurrentNavigableMap. Java developers will find it easy to switch most of their applications to it. OakMap provides the safety guarantees specified by ConcurrentNavigableMap’s programming model. However, it scales with the RAM and CPU resources well beyond the best-in-class ConcurrentNavigableMap implementations. For example, it compares favorably to Doug Lea’s seminal ConcurrentSkipListMap, which is used by multiple big data platforms, including Apache HBase, Druid, EVCache, etc. Our benchmarks show that OakMap harnesses 3x more memory, and runs 3x-5x faster on analytics workloads.

OakMap’s implementation is very different from traditional implementations such as ConcurrentSkipListMap. While the latter maintains all keys and values as individual Java objects, OakMap stores them in very large memory buffers allocated beyond the JVM-managed memory heap (hence the name Oak - abbr. Off-heap Allocated Keys). The access to the key-value pairs is provided by a lightweight two-level on-heap index. At its lower level, the references to keys are stored in contiguous chunks, each responsible for a distinct key range.
The chunks themselves, which dominate the index footprint, are accessed through a lightweight top-level ConcurrentSkipListMap. The figure below illustrates OakMap’s data organization. OakMap structure. The maintenance of OakMap’s chunked index in a concurrent setting is the crux of its complexity as well as the key for its efficiency. Experiments have shown that our algorithm is advantageous in multiple ways: 1. Memory scaling. OakMap’s custom off-heap memory allocation alleviates the garbage collection (GC) overhead that plagues Java applications. Despite the permanent progress, modern Java GC algorithms do not practically scale beyond a few tens of GBs of memory, whereas OakMap scales beyond 128GB of off-heap RAM. 2. Query speed. The chunk-based layout increases data locality, which speeds up both single-key lookups and range scans. All queries enjoy efficient, cache-friendly access, in contrast with permanent dereferencing in object-based maps. On top of these basic merits, OakMap provides safe direct access to its chunks, which avoids an extra copy for rebuilding the original key and value objects. Our benchmarks demonstrate OakMap’s performance benefits versus ConcurrentSkipListMap: A) Up to 2x throughput for ascending scans. B) Up to 5x throughput for descending scans. C) Up to 3x throughput for lookups. 3. Update speed. Beyond avoiding the GC overhead typical for write-intensive workloads, OakMap optimizes the incremental maintenance of big complex values – for example, aggregate data sketches, which are indispensable in systems like Druid. It adopts in situ computation on objects embedded in its internal chunks to avoid unnecessary data copy, yet again. In our benchmarks, OakMap achieves up to 1.8x data ingestion rate versus ConcurrentSkipListMap. With key-value maps being an extremely generic abstraction, it is easy to envision a variety of use cases for OakMap in large-scale analytics and machine learning applications – such as unstructured key-value storage, structured databases, in-memory caches, parameter servers, etc. For example, we are already working with the Druid community on rebuilding Druid’s core Incremental Index component around OakMap, in order to boost its scalability and performance. We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. If you have any questions, please feel free to send us a note on the Oak developers list: oakproject@googlegroups.com. It would be great to hear from you!

Improvement on perceived performance
September 12, 2018

In an effort to improve the Screwdriver user experience, the Screwdriver team identified two major components on the UI that needed improvement with respect to load time — the event pipeline and the build step log. To improve user-perceived performance on those components, we decided to adopt two corresponding UX approaches — pagination and lazy loading.

Event Pipeline
Before our pagination change, when a user visited the pipeline events page (e.g. /pipelines/{id}/events), all events and their builds were fetched from the API and then artificially paginated in the UI. In order to improve the total page load time, it was important to move the burden of pagination from the UI to the API. Now, when a user visits the pipeline events page, only the latest page of events and builds is fetched by default. Clicking the Show More button triggers the fetching of the next page of events and builds. Since there is no further processing of the API data by the UI, the overall load time for fetching a page of events and their corresponding build info is now well under a second, compared to before, when some pipelines could take more than ten seconds.

Build Step Log
As for the build step log, instead of chronologically fetching pages of completed step logs one page at a time until the entire log is fetched, the log is now fetched in reverse chronological order and only a reasonable amount of it is fetched and loaded lazily as the user scrolls up the log console. This change is meant to compensate for builds that generate tens of thousands of lines of logs. Since users had to wait for the entire log to load before they could interact with it, the previous implementation was extremely time consuming as the size of step logs increased. Now, the first page of a step log takes roughly two seconds or less to load. To put the significance of the change into perspective, consider a step that generates a total of 98743 lines of logs: it would have taken 90 seconds to load and almost 10 seconds to fully render on the UI; now it takes less than 2 seconds to load and less than 1 second to render.

Compatibility List
These UI improvements require the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.460
- screwdrivercd/ui: v1.0.327

Contributors
Thanks to the following people for making this feature possible:
- DekusDenial
- jithin1987
- minz1027
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
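As a rough sketch of what page-by-page fetching of events looks like from a client’s point of view, the loop below walks the events endpoint one page at a time; the page and count query parameter names, the API URL, and the pipeline id are illustrative assumptions rather than a documented contract, so check your cluster’s API docs before relying on them.

from requests import get

API_URL = 'https://api.screwdriver.cd/v4'   # adjust to your cluster's API
PIPELINE_ID = 12345                          # placeholder pipeline id
HEADERS = {'Authorization': 'Bearer your-jwt-here'}

page = 1
while True:
    # Fetch one page of events at a time instead of everything up front.
    resp = get('%s/pipelines/%d/events' % (API_URL, PIPELINE_ID),
               params={'page': page, 'count': 10},
               headers=HEADERS)
    events = resp.json()
    if not events:
        break
    for event in events:
        print(event.get('id'), event.get('causeMessage'))
    page += 1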

Vespa at Zedge - providing personalization content to millions of iOS, Android & web users
September 3, 2018

This blog post describes Zedge’s use of Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS and Web). Zedge is now using Vespa in production to serve millions of monthly active users. See the architecture below.

What is Zedge?
Zedge’s main product is an app - Zedge Ringtones & Wallpapers - that provides wallpapers, ringtones, game recommendations and notification sounds customized for your mobile device. Zedge apps have been downloaded more than 300 million times combined for iOS and Android and are used by millions of people worldwide each month. Zedge is traded on NYSE under the ticker ZDGE. People use Zedge apps for self-expression. Setting a wallpaper or ringtone on your mobile device is in many ways similar to selecting clothes, a hairstyle or other fashion statements. In fact, people try a wallpaper or ringtone in a similar manner as they would try clothes in a dressing room before making a purchase decision: they try different wallpapers or ringtones before deciding on one they want to keep for a while. The decision to select a wallpaper is not taken lightly, since people interact with and view their mobile device screen (and background wallpaper) a lot (hundreds of times per day).

Why Zedge considered Vespa
Zedge apps - for iOS, Android and Web - depend heavily on search and recommender services to support content discovery. These services have been developed over several years and consisted of multiple subsystems - both internally developed and open source - and technologies for both search and recommender serving. In addition, there were numerous big data processing jobs to build and maintain data for content discovery serving. The time and complexity of improving search and recommender services and corresponding processing jobs started to become high, so simplification was due. Vespa seemed like a promising open source technology to consider for Zedge, in particular since it was proven in several ways within Oath (Yahoo):
1. Scales to handle very large systems, e.g. Flickr with billions of images and Yahoo Gemini Ads Platform with more than one hundred thousand requests per second to serve ads to 1 billion monthly active users for services such as Techcrunch, Aol, Yahoo!, Tumblr and Huffpost.
2. Runs stable and requires very little operations support - Oath has a few hundred - many of them large - Vespa based applications requiring less than a handful of operations people to run smoothly.
3. Rich set of features that Zedge could gain from using:
   - Built-in tensor processing support could simplify calculation and serving of related wallpapers (images) & ringtones/notifications (audio).
   - Built-in support for Tensorflow models to simplify development and deployment of machine learning based search and recommender ranking (at that time in development according to Oath).
   - Search Chains.
4. Help from core developers of Vespa.

The Vespa pilot project
Given the content discovery technology need and promising characteristics of Vespa, we started out with a pilot project with a team of software engineers, SRE and data scientists with the goals of:
1. Learn about Vespa from hands-on development.
2. Create a realistic proof of concept using Vespa in a Zedge app.
3. Get initial answers to key questions about Vespa, i.e. enough to decide to go for it fully:
   - Which of today’s API services can it simplify and replace?
   - What are the (cloud) production costs with Vespa at Zedge’s scale? (OPEX)
   - How will maintenance and development look with Vespa? (future CAPEX)
   - Which new (innovation) opportunities does Vespa give?

The result of the pilot project was successful - we developed a good proof of concept use of Vespa with one of our Android apps internally and decided to start a project transferring all recommender and search serving to Vespa. Our impression after the pilot was that the main benefit was making it easier to maintain and develop search/recommender systems, in particular by reducing the amount of code and the complexity of processing jobs.

Autosuggest for search with Vespa
Since autosuggest (for search) required both low latency and high throughput, we decided that it was a good candidate to try in production with Vespa first. Configuration-wise it was similar to regular search (from the pilot), but snippet generation (document summary), which requires access to the document store, was superfluous for autosuggest. A good approach for autosuggest was to:
1. Make all document fields searchable with autosuggest of type (in-memory) attribute:
   - https://docs.vespa.ai/documentation/attributes.html
   - https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#attribute
   - https://docs.vespa.ai/documentation/search-definitions.html (basics)
2. Avoid snippet generation and using the document store by overriding the document-summary setting in search definitions to only access attributes:
   - https://docs.vespa.ai/documentation/document-summaries.html
   - https://docs.vespa.ai/documentation/nativerank.html

The figure above illustrates the autosuggest architecture. When the user starts typing in the search field, we fire a query with the search prefix to the Cloudflare worker - which in case of a cache hit returns the result (possible queries) to the client. In case of a cache miss the Cloudflare worker forwards the query to our Vespa instance handling autosuggest. Regarding the external API for autosuggest, we use Cloudflare Workers (supporting Javascript on V8 and later perhaps multiple languages with Webassembly) to handle API queries from Zedge apps in front of Vespa running in Google Cloud. This setup allows for simple close-to-user caching of autosuggest results.

Search, Recommenders and Related Content with Vespa
Without going into details, we had several recommender and search services to adapt to Vespa. These services were adapted by writing custom Vespa searchers and in some cases search chains:
- https://docs.vespa.ai/documentation/searcher-development.html
- https://docs.vespa.ai/documentation/chained-components.html
The main change compared to our old recommender and related content services was the degree of dynamicity and freshness of serving, i.e. with Vespa more ranking signals are calculated on the fly using Vespa’s tensor support instead of being precalculated and fed into services periodically. Another benefit of this was that the amount of computational (big data) resources and code for recommender & related content processing was heavily reduced.

Continuous Integration and Testing with Vespa
A main focus was to enable testing and deployment of Vespa services with continuous integration (see figure below). We found that a combination of Jenkins (or a similar CI product or service) with Docker Compose worked nicely in order to test new Vespa applications, corresponding configurations and data (samples) before deploying to the staging cluster with Vespa on Google Cloud. This way we can have a realistic test setup - with Docker Compose - that is close to being exactly the same as the production environment (even at hostname level).

Monitoring of Vespa with Prometheus and Grafana
For monitoring we created a tool that continuously reads Vespa metrics, stores them in Prometheus (a time series database) and visualizes them with Grafana. This tool can be found at https://github.com/vespa-engine/vespa_exporter. More information about Vespa metrics and monitoring:
- https://docs.vespa.ai/documentation/reference/metrics-health-format.html
- https://docs.vespa.ai/documentation/jdisc/metrics.html
- https://docs.vespa.ai/documentation/operations/admin-monitoring.html

Conclusion
The team quickly got up to speed with Vespa thanks to its good documentation and examples, and it has been running like a clock since we started using it for real loads in production. But this was only our first step with Vespa - i.e. consolidating existing search and recommender technologies into a more homogeneous and easier to maintain form. With Vespa as part of our architecture we see many possible paths for evolving our search and recommendation capabilities (e.g. machine learning based ranking such as integration with Tensorflow and ONNX).

Best regards,
Zedge Content Discovery Team
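To make the autosuggest flow above a bit more concrete, here is a minimal sketch of the kind of request a worker could forward to Vespa’s HTTP search API; the host, port, and plain query parameter are illustrative assumptions, and a real setup would rely on the prefix-matching attributes and ranking configuration described earlier rather than default matching.

from requests import get

VESPA_ENDPOINT = 'http://localhost:8080/search/'   # placeholder Vespa instance

def autosuggest(prefix, hits=5):
    # Forward the user's typed prefix to Vespa and return suggestion candidates.
    resp = get(VESPA_ENDPOINT, params={'query': prefix, 'hits': hits})
    resp.raise_for_status()
    children = resp.json().get('root', {}).get('children', [])
    return [child.get('fields', {}) for child in children]

if __name__ == '__main__':
    for suggestion in autosuggest('sta'):
        print(suggestion)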

Private channel support for Slack notifications
August 27, 2018

In January, we introduced Slack notifications for build statuses in public channels. This week, we are happy to announce that we now support Slack notifications for private channels as well!

Usage for a Screwdriver.cd User
Slack notifications can be configured the exact same way as before, but private channels are now supported. First, you must invite the Screwdriver Slack bot (most likely screwdriver-bot), created by your admin, to your Slack channel(s). Then, you must configure your screwdriver.yaml file, which stores all your build settings:

settings:
  slack:
    channels:
      - channel_A # public
      - channel_B # private
    statuses: # statuses to notify on
      - SUCCESS
      - FAILURE
      - ABORTED

statuses denote the build statuses that trigger a notification. The full list of possible statuses to listen on can be found in our data-schema. If omitted, it defaults to only notifying you when a build returns a FAILURE status. See our previous Slack blog post, Slack user documentation, and cluster admin documentation for more information.

Compatibility List
Private channel support for Slack notifications requires the following minimum versions of Screwdriver:
- screwdrivercd/screwdriver: v0.5.451

Contributors
Thanks to the following people for making this feature possible:
- tkyi

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

User configurable shell
August 22, 2018

Previously, Screwdriver ran builds in sh. This caused problems for some users that have bash syntax in their steps. With launcher v5.0.13 and above, users can run builds in the shell of their choice by setting the environment variable USER_SHELL_BIN. This value can also be a full path such as /bin/bash.

Example screwdriver.yaml (can be found under the screwdriver-cd-test/user-shell-example repo):

shared:
  image: node:6

jobs:
  # This job will fail because `source` is not available in sh
  test-sh:
    steps:
      - fail: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]

  # This job will pass because `source` is available in bash
  test-bash:
    # Set USER_SHELL_BIN to bash to run the steps in bash
    environment:
      USER_SHELL_BIN: bash
    steps:
      - pass: echo "echo hello" > /tmp/test && source /tmp/test
    requires: [~pr, ~commit]

Compatibility List
User-configurable shell support requires the following minimum versions of Screwdriver:
- screwdrivercd/launcher: v5.0.13

Contributors
Thanks to the following people for making this feature possible:
- d2lam

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing JSON queries
August 8, 2018

We recently introduced a new addition to the Search API - JSON queries. The search request can now be executed with a POST request, which includes the query-parameters within its payload. Along with this new query we also introduce a new parameter SELECT with the sub-parameters WHERE and GROUPING, which is equivalent to YQL.

The new query
With the Search API’s newest addition, it is now possible to send queries with HTTP POST. The query-parameters have been moved out of the URL and into a POST request body - therefore, no more URL-encoding. You also avoid getting all the queries in the log, which can be an advantage. This is how a GET query looks:

GET /search/?param1=value1&param2=value2&...

The general form of the new POST query is:

POST /search/ { param1 : value1, param2 : value2, ... }

The dot-notation is gone, and the query-parameters are now nested under the same key instead. Let’s take this query:

GET /search/?yql=select+%2A+from+sources+%2A+where+default+contains+%22bad%22%3B&ranking.queryCache=false&ranking.profile=vespaProfile&ranking.matchPhase.ascending=true&ranking.matchPhase.maxHits=15&ranking.matchPhase.diversity.minGroups=10&presentation.bolding=false&presentation.format=json&nocache=true

and write it in the new POST request-format, which will look like this:

POST /search/
{
  "yql": "select * from sources * where default contains \"bad\";",
  "ranking": {
    "queryCache": "false",
    "profile": "vespaProfile",
    "matchPhase": {
      "ascending": "true",
      "maxHits": 15,
      "diversity": {
        "minGroups": 10
      }
    }
  },
  "presentation": {
    "bolding": "false",
    "format": "json"
  },
  "nocache": true
}

With Vespa running (see Quick Start or Blog Search Tutorial), you can try building POST-queries with the new querybuilder GUI at http://localhost:8080/querybuilder/, which can help you build queries with e.g. autocompletion of YQL.

The Select-parameter
The SELECT-parameter is used with POST queries and is the JSON equivalent of YQL queries, so they cannot be used together. The query-parameter will overwrite SELECT and decide the query’s query tree.

Where
The SQL-like syntax is gone and the tree-syntax has been enhanced. If you’re used to the query-parameter syntax you’ll feel right at home with this new language. YQL is a regular language and is parsed into a query tree in Vespa. You can now build that tree in the WHERE-parameter with JSON. Let’s take a look at the YQL:

select * from sources * where default contains foo and rank(a contains "A", b contains "B");

which will create the following query tree. You can build this tree with the WHERE-parameter, like this:

{
  "and" : [
    { "contains" : ["default", "foo"] },
    { "rank" : [
      { "contains" : ["a", "A"] },
      { "contains" : ["b", "B"] }
    ]}
  ]
}

This is equivalent to the YQL.

Grouping
The grouping can now be written in JSON, with structure, instead of on the same line. Instead of parentheses, we now use curly brackets to symbolise the tree-structure between the different grouping/aggregation-functions, and colons to assign function-arguments.
A grouping that will group first by year and then by month can be written as such:

| all(group(time.year(a)) each(output(count()) all(group(time.monthofyear(a)) each(output(count())))))

and equivalently with the new GROUPING-parameter:

"grouping" : [
  {
    "all" : {
      "group" : "time.year(a)",
      "each" : { "output" : "count()" },
      "all" : {
        "group" : "time.monthofyear(a)",
        "each" : { "output" : "count()" }
      }
    }
  }
]

Wrapping it up
In this post we have provided a gentle introduction to the new Vespa POST query feature and the SELECT-parameter. You can read more about writing POST queries in the Vespa documentation. More examples of the POST query can be found in the Vespa tutorials. Please share your experiences. Happy searching!
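As a quick way to try the POST format, the snippet below sends the example request body from this post to a running Vespa instance; the localhost endpoint mirrors the Quick Start setup and is an assumption, so point it at your own instance.

import json
from requests import post

SEARCH_ENDPOINT = 'http://localhost:8080/search/'   # assumes a local Vespa instance

query = {
    "yql": 'select * from sources * where default contains "bad";',
    "ranking": {
        "queryCache": "false",
        "profile": "vespaProfile",
        "matchPhase": {
            "ascending": "true",
            "maxHits": 15,
            "diversity": {"minGroups": 10}
        }
    },
    "presentation": {"bolding": "false", "format": "json"},
    "nocache": True
}

# The whole query is sent as the JSON body of a POST request - no URL-encoding needed.
resp = post(SEARCH_ENDPOINT, json=query)
print(json.dumps(resp.json(), indent=2))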

Introducing Screwdriver Commands for sharing binaries
July 30, 2018

Oftentimes, there are small scripts or commands that people will use in multiple jobs that are not complex enough to warrant creating a Screwdriver template. Options such as Git repositories, yum packages, or node modules exist, but there was no clear way to share binaries or scripts across multiple jobs. Recently, we have released Screwdriver Commands (also known as sd-cmd), which solves this problem, allowing users to easily share binary commands or scripts across multiple containers and jobs.

Using a command
The following is an example of using an sd-cmd. You can configure any commands or scripts in screwdriver.yaml like this:

Example:
jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - exec: sd-cmd exec foo/bar@1 -baz sample

Format for using sd-cmd:
sd-cmd exec <namespace>/<name>@<version> <arguments>
- namespace/name - the fully-qualified command name
- version - a semver-compatible format or tag
- arguments - passed directly to the underlying command

In this example, Screwdriver will download the command “foobar.sh” from the Store, which is defined by namespace, name, and version, and will execute it with args “-baz sample”. The actual command will be run as:
$ /opt/sd/commands/foo/bar/1.0.1/foobar.sh -baz sample

Creating a command
Next, this section covers how to publish your own binary commands or scripts. Commands or scripts must be published using a Screwdriver pipeline. The command will then be available in the same Screwdriver cluster.

Writing a command yaml
To create a command, create a repo with a sd-command.yaml file. The file should contain a namespace, name, version, description, maintainer email, format, and a config that depends on the format. Optionally, you can set the usage field, which will replace the default usage set in the documentation in the UI.

Example sd-command.yaml:

Binary example:
namespace: foo # Namespace for the command
name: bar # Command name
version: '1.0' # Major and Minor version number (patch is automatic), must be a string
description: |
  Lorem ipsum dolor sit amet.
usage: | # Optional usage field for documentation purposes
  sd-cmd exec foo/bar@

User teardown steps
July 12, 2018

Users can now specify their own teardown steps in Screwdriver, which will always run regardless of build status. These steps need to be defined at the end of the job and start with teardown-. Note: These steps run in separate shells. As a result, environment variables set by previous steps will not be available. Update 8/22/2018: Environment variables set by user steps are now available in teardown steps. Example screwdriver.yaml jobs: main: image: node:8 steps: - fail: command-does-not-exist - teardown-step1: echo hello - teardown-step2: echo goodbye requires: - ~commit - ~pr In this example, the steps teardown-step1 and teardown-step2 will run even though the build fails: Compatibility List User teardown support requires the following minimum versions of Screwdriver: - screwdrivercd/launcher: v4.0.116 - screwdrivercd/screwdriver: v0.5.405Contributors Thanks to the following people for making this feature possible: - d2lam - tk3fftk (from Yahoo! JAPAN) Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Pipeline API Tokens in Screwdriver
July 9, 2018

We released pipeline-scoped API Tokens, which enable your scripts to interact with a specific Screwdriver pipeline. You can use these tokens with fine-grained access control for each pipeline instead of User Access Tokens.

Creating Tokens
If you go to Screwdriver’s updated pipeline Secrets page, you can find a list of all your pipeline access tokens along with the option to modify, refresh, or revoke them. At the bottom of the list is a form to generate a new token. Enter a name and optional description, then click Add. Your new pipeline token value will be displayed at the top of the Access Tokens section, but it will only be displayed once, so make sure you save it somewhere safe! This token provides admin-level access to your specific pipeline, so treat it as you would a password.

Using Tokens to Authenticate
To authenticate with your pipeline’s newly-created token, make a GET request to https://${API_URL}/v4/auth/token?api_token=${YOUR_PIPELINE_TOKEN_VALUE}. This returns a JSON object with a token field. The value of this field will be a JSON Web Token, which you can use in an Authorization header to make further requests to the Screwdriver API. This JWT will be valid for 2 hours, after which you must re-authenticate.

Example: Starting a Specific Pipeline
You can use a pipeline token similar to how you would use a user token. Here’s a short example written in Python showing how you can use a Pipeline API token to start a pipeline. This script will directly call the Screwdriver API.

# Requires the requests library; the pipeline token is read from the SD_KEY environment variable
from os import environ
from requests import get, post

pipeline_id = 1234  # the pipeline to start (placeholder id)

# Authenticate with token
auth_request = get('https://api.screwdriver.cd/v4/auth/token?api_token=%s' % environ['SD_KEY'])
jwt = auth_request.json()['token']

# Set headers
headers = { 'Authorization': 'Bearer %s' % jwt }

# Get the jobs in the pipeline
jobs_request = get('https://api.screwdriver.cd/v4/pipelines/%s/jobs' % pipeline_id, headers=headers)
jobId = jobs_request.json()[0]['id']

# Start the first job
start_request = post('https://api.screwdriver.cd/v4/builds', headers=headers, data=dict(jobId=jobId))

Compatibility List
For pipeline tokens to work, you will need these minimum versions:
- screwdrivercd/screwdriver: v0.5.389
- screwdrivercd/ui: v1.0.290

Contributors
Thanks to the following people for making this feature possible:
- kumada626 (from Yahoo! JAPAN)
- petey
- s-yoshika (from Yahoo! JAPAN)

Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Multibyte Artifact Name Support
July 6, 2018

A multibyte character is a character composed of sequences of one or more bytes. It’s often used in Asia (e.g. Japanese, Chinese, Thai). Screwdriver now supports reading artifacts that contain multibyte characters.Example screwdriver.yaml jobs: main: image: node:8 requires: [ ~pr, ~commit ] steps: - touch_multibyte_artifact: echo 'foo' > $SD_ARTIFACTS_DIR/日本語ファイル名さんぷる.txt In this example, we are writing an artifact, 日本語ファイル名さんぷる, which means Japanese file name sample. The artifact name includes Kanji, Katakana, and Hiragana, which are multibyte characters. The artifacts of this example pipeline: The result from clicking the artifact link:Compatibility List Multibyte artifact name support requires the following minimum versions of Screwdriver: - screwdrivercd/screwdriver: v0.5.309Contributors Thanks to the following people for making this feature possible: - minz1027 - sakka2 (from Yahoo! JAPAN) - Zhongtang Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing Template Namespaces
June 29, 2018

We’ve reworked templates to filter by namespace! Namespaces are meant for easier grouping in the UI. From a template creator perspective, template creation still works the same; however, you now have the ability to explicitly define a template namespace. For Screwdriver cluster admins, you will need to migrate existing templates in your database to the new schema in order for them to be displayed correctly in the new UI. These steps will be covered below.Screwdriver Users For Screwdriver template users, you can still use templates the same way. jobs: main: template: templateNamespace/templateName@1.2.3 requires: [~pr, ~commit] In the UI, you can navigate to the template namespace page by clicking on the namespace or going to /templates/namespaces/. Any templates with no defined template namespace will be available at /templates/namespaces/default.Template owners To create a template with a designated namespace, you can either: Implicitly define a namespace (same as before) Explicitly define a namespace Use the default namespace (same as before) Templates will still be used by users the same way. Implicit namespace Screwdriver will interpret anything before a template name’s slash (/) as the namespace. If you define a sd-template.yaml with name: nodejs/lib, Screwdriver will store namespace: nodejs and name: lib. User’s screwdriver.yaml: jobs: main: template: nodejs/lib@1.2.3 requires: [~pr, ~commit] Explicit namespace You can explicitly define a template namespace. If you do, you cannot have any slashes (/) in your template name. Template yaml snippet: namespace: nodejs name: lib ... User’s screwdriver.yaml: jobs: main: template: nodejs/lib@1.2.3 requires: [~pr, ~commit] Default namespace If you don’t explicitly or implicitly define a namespace, Screwdriver will assign namespace: default to your template. Users will still use your template as you defined it, but it will be grouped with other templates with default namespaces in the UI. Template yaml snippet: name: lib User’s screwdriver.yaml: jobs: main: template: lib@1.2.3 requires: [~pr, ~commit] Screwdriver Cluster Admins Database Migration This feature has breaking changes that will affect your DB if you already have existing templates. In order to migrate your templates properly, you will need to do the following steps: 1. Make sure you’ve updated your unique constraints on your templates and templateTags tables to include namespace. 2. Set a default namespace when no namespace exists. In postgresql, run: UPDATE public."templates" SET namespace = 'default' WHERE name !~ '.*/.*' 3. Set implicit namespaces if users defined them. This requires two calls in postgres, one to split by namespace and name, the second to remove the curly braces ({ and }) injected by the regexp call. UPDATE public."templates" SET namespace = regexp_matches(name, '(.*)/.*'), name = regexp_matches(name, '.*/(.*)') WHERE name ~ '.*/.*' UPDATE public."templates" SET namespace = btrim(namespace, '{}'), name = btrim(name, '{}') Compatibility List Template namespaces require the following minimum versions of Screwdriver: - screwdrivercd/screwdriver:v0.5.396 - screwdrivercd/ui:v1.0.297 - screwdrivercd/launcher:v4.0.117 - screwdrivercd/store:v3.1.2 - screwdrivercd/queue-worker:v1.12.18Contributors Thanks to the following people for making this feature possible: - jithin1987 - lusol - tkyi Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. 
Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.

Introducing ONNX support
June 25, 2018

ONNX (Open Neural Network eXchange) is an open format for the sharing of neural network and other machine learned models between various machine learning and deep learning frameworks. As the open big data serving engine, Vespa aims to make it simple to evaluate machine learned models at serving time at scale. By adding ONNX support in Vespa in addition to our existing TensorFlow support, we’ve made it possible to evaluate models from all the commonly used ML frameworks with low latency over large amounts of data. With the rise of deep learning in the last few years, we’ve naturally enough seen an increase of deep learning frameworks as well: TensorFlow, PyTorch/Caffe2, MxNet etc. One reason for these different frameworks to exist is that they have been developed and optimized around some characteristic, such as fast training on distributed systems or GPUs, or efficient evaluation on mobile devices. Previously, complex projects with non-trivial data pipelines have been unable to pick the best framework for any given subtask due to lacking interoperability between these frameworks. ONNX is a solution to this problem. ONNX is an open format for AI models, and represents an effort to push open standards in AI forward. The goal is to help increase the speed of innovation in the AI community by enabling interoperability between different frameworks and thus streamlining the process of getting models from research to production. There is one commonality between the frameworks mentioned above that enables an open format such as ONNX, and that is that they all make use of dataflow graphs in one way or another. While there are differences between each framework, they all provide APIs enabling developers to construct computational graphs and runtimes to process these graphs. Even though these graphs are conceptually similar, each framework has been a siloed stack of API, graph and runtime. The goal of ONNX is to empower developers to select the framework that works best for their project, by providing an extensible computational graph model that works as a common intermediate representation at any stage of development or deployment. Vespa is an open source project which fits well within such an ecosystem, and we aim to make the process of deploying and serving models to production that have been trained on any framework as smooth as possible. Vespa is optimized toward serving and evaluating over potentially very large datasets while still responding in real time. In contrast to other ML model serving options, Vespa can more efficiently evaluate models over many data points. As such, Vespa is an excellent choice when combining model evaluation with serving of various types of content. Our ONNX support is quite similar to our TensorFlow support. Importing ONNX models is as simple as adding the model to the Vespa application package (under “models/”) and referencing the model using the new ONNX ranking feature: expression: sum(onnx("my_model.onnx")) The above expression runs the model and sums it to a single scalar value to use in ranking. You will have to provide the inputs to the graph. Vespa expects you to provide a macro with the same name as the input tensor. In the macro you can specify where the input should come from, be it a document field, constant or a parameter sent along with the query. More information can be had in the documentation about ONNX import. Internally, Vespa converts the ONNX operations to Vespa’s tensor API. We do the same for TensorFlow import. 
So the cost of evaluating ONNX and TensorFlow models is the same. We have put a lot of effort into optimizing the evaluation of tensors, and evaluating neural network models can be quite efficient. ONNX support is also quite new to Vespa, so we do not support all current ONNX operations. Part of the reason we don’t support all operations yet is that some are potentially too expensive to evaluate per document, such as convolutional neural networks and recurrent networks (LSTMs etc). ONNX also contains an extension, ONNX-ML, which contains additional operations for non-neural-network cases. Support for this extension will come at a later point. We are continually working to add functionality, so please reach out to us if there is something you would like to have added. Going forward, we will keep improving performance as well as supporting more of the ONNX (and ONNX-ML) standard. You can read more about ranking with ONNX models in the Vespa documentation. We are excited to announce ONNX support. Let us know what you are building with it!
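To illustrate the framework-interoperability idea, here is a small sketch of exporting a PyTorch model to the ONNX format; the tiny linear network and the my_model.onnx filename are only examples, and the exported file would then be placed under models/ in the Vespa application package as described above.

import torch

# A tiny example model; in practice this would be your trained network.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)
model.eval()

# ONNX export traces the model with a dummy input of the expected shape.
dummy_input = torch.randn(1, 4)
torch.onnx.export(model, dummy_input, "my_model.onnx")
print("wrote my_model.onnx")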

Innovating on Authentication Standards
June 25, 2018

By George Fletcher and Lovlesh Chhabra

When Yahoo and AOL came together a year ago as a part of the new Verizon subsidiary Oath, we took on the challenge of unifying their identity platforms based on current identity standards. Identity standards have been a critical part of the Internet ecosystem over the last 20+ years. From single-sign-on and identity federation with SAML; to the newer identity protocols including OpenID Connect, OAuth2, JOSE, and SCIM (to name a few); to the explorations of “self-sovereign identity” based on distributed ledger technologies; standards have played a key role in providing a secure identity layer for the Internet.

As we navigated this journey, we ran across a number of different use cases where there was either no standard or no best practice available for our varied and complicated needs. Instead of creating entirely new standards to solve our problems, we found it more productive to use existing standards in new ways.

One such use case arose when we realized that we needed to migrate the identity stored in mobile apps from the legacy identity provider to the new Oath identity platform. For most browser (mobile or desktop) use cases, this doesn’t present a huge problem; some DNS magic and HTTP redirects and the user will sign in at the correct endpoint. It is also expected for users accessing services via their browser to have to sign in now and then. However, for mobile applications it’s a completely different story. The normal user pattern for mobile apps is for the user to sign in (via OpenID Connect or OAuth2) and for the app to then be issued long-lived tokens (well, the refresh token is long lived), and the user never has to sign in again on the device (entering a password on the device is NOT a good experience for the user). So the issue is, how do we allow the mobile app to move from one identity provider to another without the user having to re-enter their credentials?

The solution came from researching what standards currently exist that might address this use case (see figure “Standards Landscape” below) and finding the OAuth 2.0 Token Exchange draft specification (https://tools.ietf.org/html/draft-ietf-oauth-token-exchange-13). The Token Exchange draft allows a given token to be exchanged for new tokens in a different domain. This could be used, for example, to manage the “audience” of a token that needs to be passed among a set of microservices to accomplish a task on behalf of the user. For the use case at hand, we created a specific implementation of the Token Exchange specification (a profile) to allow the refresh token from the originating Identity Provider (IDP) to be exchanged for new tokens from the consolidated IDP. By profiling this draft standard we were able to create a much better user experience for our consumers and do so without inventing proprietary mechanisms.

During this identity technical consolidation we also had to address how to support sharing signed-in users across mobile applications written by the same company (technically, signed with the same vendor signing key). Specifically, how can a user signed in to Yahoo Mail avoid having to sign in again when they start using the Yahoo Sports app? The current best practice for this is captured in OAuth 2.0 for Native Apps (RFC 8252). However, the flow described by this specification requires that the mobile device system browser hold the user’s authenticated sessions.
This has some drawbacks, such as users clearing their cookies, or using private browsing mode, or even worse, requiring the IDPs to support multiple users signed in at the same time (not something most IDPs support). While RFC 8252 provides a mechanism for single-sign-on (SSO) across mobile apps provided by any vendor, we wanted a better solution for apps provided by Oath. So we looked at how we could enable mobile apps signed by the same vendor to share the signed-in state in a more “back channel” way.

One important fact is that mobile apps cryptographically signed by the same vendor can securely share data via the device keychain on iOS and Account Manager on Android. Using this as a starting point, we defined a new OAuth2 scope, device_sso, whose purpose is to require the Authorization Server (AS) to return a unique “secret” assigned to that specific device. The precedent for using a scope to define specification behaviour is OpenID Connect itself, which defines the “openid” scope as the trigger for the OpenID Provider (an OAuth2 AS) to implement the OpenID Connect specification. The device_secret is returned to a mobile app when the OAuth2 code is exchanged for tokens and is then stored by the mobile app in the device keychain along with the id_token identifying the user who signed in. At this point, a second mobile app signed by the same vendor can look in the keychain and find the id_token, ask the user if they want to use that identity with the new app, and then use a profile of the token exchange spec to obtain tokens for the second mobile app based on the id_token and the device_secret. The full sequence of steps looks like this:

As a result of our identity consolidation work over the past year, we derived a set of principles identity architects should find useful for addressing use cases that don’t have a known specification or best practice. Moreover, these are applicable in many contexts outside of identity standards:
1. Spend time researching the existing set of standards and draft standards. As the diagram shows, there are a lot of standards out there already, so understanding them is critical.
2. Don’t invent something new if you can just profile or combine already existing specifications.
3. Make sure you understand the spirit and intent of the existing specifications.
4. For those cases where an extension is required, make sure to extend the specification based on its spirit and intent.
5. Ask the community for clarity regarding any existing specification or draft.
6. Contribute back to the community via blog posts, best practice documents, or a new specification.

As we learned during the consolidation of our Yahoo and AOL identity platforms, and as demonstrated in our examples, there is no need to resort to proprietary solutions for use cases that at first look do not appear to have a standards-based solution. Instead, it’s much better to follow these principles, avoid the NIH (not-invented-here) syndrome, and invest the time to build solutions on standards.
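For readers who want to see what a token exchange call looks like on the wire, below is a rough sketch of the request shape defined by the OAuth 2.0 Token Exchange draft; the token endpoint, client id, and token values are placeholders, and any Oath-specific profile parameters are intentionally omitted.

from requests import post

# Placeholder values; a real client would use its own IDP endpoint and credentials.
TOKEN_ENDPOINT = 'https://idp.example.com/oauth2/token'
CLIENT_ID = 'example-client-id'
LEGACY_REFRESH_TOKEN = 'refresh-token-from-the-originating-idp'

# Parameter names come from the OAuth 2.0 Token Exchange draft specification.
payload = {
    'grant_type': 'urn:ietf:params:oauth:grant-type:token-exchange',
    'subject_token': LEGACY_REFRESH_TOKEN,
    'subject_token_type': 'urn:ietf:params:oauth:token-type:refresh_token',
    'client_id': CLIENT_ID,
}

resp = post(TOKEN_ENDPOINT, data=payload)
# On success, the consolidated IDP returns new tokens for this client.
print(resp.json())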

Parent-child in Vespa
June 5, 2018

Parent-child relationships let you model hierarchical relations in your data. This blog post talks about why and how we added this feature to Vespa, and how you can use it in your own applications. We’ll show some performance numbers and discuss practical considerations. Introduction The shortest possible background Traditional relational databases let you perform joins between tables. Joins enable efficient normalization of data through foreign keys, which means any distinct piece of information can be stored in one place and then referred to (often transitively), rather than to be duplicated everywhere it might be needed. This makes relational databases an excellent fit for a great number of applications. However, if we require scalable, real-time data processing with millisecond latency our options become more limited. To see why, and to investigate how parent-child can help us, we’ll consider a hypothetical use case. A grand business idea Let’s assume we’re building a disruptive startup for serving the cutest possible cat picture advertisements imaginable. Advertisers will run multiple campaigns, each with their own set of ads. Since they will (of course) pay us for this privilege, campaigns will have an associated budget which we have to manage at serving time. In particular, we don’t want to serve ads for a campaign that has spent all its money, as that would be free advertising. We must also ensure that campaign budgets are frequently updated when their ads have been served. Our initial, relational data model might look like this: Advertiser: id: (primary key) company_name: string contact_person_email: string Campaign: id: (primary key) advertiser_id: (foreign key to advertiser.id) name: string budget: int Ad: id: (primary key) campaign_id: (foreign key to campaign.id) cuteness: float cat_picture_url: string This data normalization lets us easily update the budgets for all ads in a single operation, which is important since we don’t want to serve ads for which there is no budget. We can also get the advertiser name for all individual ads transitively via their campaign. Scaling our expectations Since we’re expecting our startup to rapidly grow to a massive size, we want to make sure we can scale from day one. As the number of ad queries grow, we ideally want scaling up to be as simple as adding more server capacity. Unfortunately, scaling joins beyond a single server is a significant design and engineering challenge. As a consequence, most of the new data stores released in the past decade have been of the “NoSQL” variant (which might also be called “non-relational databases”). NoSQL’s horizontal scalability is usually achieved by requiring an application developer to explicitly de-normalize all data. This removes the need for joins altogether. For our use case, we have to store budget and advertiser name across multiple document types and instances (duplicated data here marked with bold text): Advertiser: id: (primary key) company_name: string contact_person_email: string Campaign: id: (primary key) advertiser_company_name: string name: string budget: int Ad: id: (primary key) campaign_budget: int campaign_advertiser_company_name: string cuteness: float cat_picture_url: string Now we can scale horizontally for queries, but updating the budget of a campaign requires updating all its ads. This turns an otherwise O(1) operation into O(n), and we likely have to implement this update logic ourselves as part of our application. 
We’ll be expecting thousands of budget updates to our cat ad campaigns per second. Multiplying this by an unknown number of ads per campaign is likely to overload our servers or lose us money. Or both at the same time. A pragmatic middle ground Between these two extremes of “arbitrary joins” and “no joins at all” we have parent-child relationships. These enable a subset of join functionality, but with enough restrictions that they can be implemented efficiently at scale. One core restriction is that your data relationships must be possible to represent as a directed acyclic graph (DAG). As it happens, this is the case with our cat picture advertisement use case: Advertiser is a parent to 0-n Campaigns, each of which in turn is a parent to 0-n Ads. Being able to represent this natively in our application would get us functionally very close to the original, relational schema. We’ll see very shortly how this can be directly mapped to Vespa’s parent-child feature support. Parent-child support in Vespa Creating the data model Vespa’s fundamental data model is that of documents. Each document belongs to a particular schema and has a user-provided unique identifier. Such a schema is known as a document type and is specified in a search definition file. A document may have an arbitrary number of fields of different types. Some of these may be indexed, some may be kept in memory, all depending on the schema. A Vespa application may contain many document types. Here’s how the Vespa equivalent of the above denormalized schema could look (again bolding where we’re duplicating information): advertiser.sd: search advertiser { document advertiser { field company_name type string { indexing: attribute | summary } field contact_person_email type string { indexing: summary } } } campaign.sd: search campaign { document campaign { field advertiser_company_name type string { indexing: attribute | summary } field name type string { indexing: attribute | summary } field budget type int { indexing: attribute | summary } } } ad.sd: search ad { document ad { field campaign_budget type int { indexing: attribute | summary attribute: fast-search } field campaign_advertiser_company_name type string { indexing: attribute | summary } field cuteness type float { indexing: attribute | summary attribute: fast-search } field cat_picture_url type string { indexing: attribute | summary } } } Note that since all documents in Vespa must already have a unique ID, we do not need to model the primary key IDs explicitly. We’ll now see how little it takes to change this to its normalized equivalent by using parent-child. Parent-child support adds two new types of declared fields to Vespa: references and imported fields. A reference field contains the unique identifier of a parent document of a given document type. It is analogous to a foreign key in a relational database, or a pointer in Java/C++. A document may contain many reference fields, with each potentially referencing entirely different documents. We want each ad to reference its parent campaign, so we add the following to ad.sd: field campaign_ref type reference<campaign> { indexing: attribute } We also add a reference from a campaign to its advertiser in campaign.sd: field advertiser_ref type reference<advertiser> { indexing: attribute } Since a reference just points to a particular document, it cannot be directly used in queries. Instead, imported fields are used to access a particular field within a referenced document.
Imported fields are virtual; they do not take up any space in the document itself and they cannot be directly written to by put or update operations. Add to search campaign in campaign.sd: import field advertiser_ref.company_name as campaign_company_name {} Add to search ad in ad.sd: import field campaign_ref.budget as ad_campaign_budget {} You can import a parent field which itself is an imported field. This enables transitive field lookups. Add to search ad in ad.sd: import field campaign_ref.campaign_company_name as ad_campaign_company_name {} After removing the now redundant fields, our normalized schema looks like this: advertiser.sd: search advertiser { document advertiser { field company_name type string { indexing: attribute | summary } field contact_person_email type string { indexing: summary } } } campaign.sd: search campaign { document campaign { field advertiser_ref type reference<advertiser> { indexing: attribute } field name type string { indexing: attribute | summary } field budget type int { indexing: attribute | summary } } import field advertiser_ref.company_name as campaign_company_name {} } ad.sd: search ad { document ad { field campaign_ref type reference<campaign> { indexing: attribute } field cuteness type float { indexing: attribute | summary attribute: fast-search } field cat_picture_url type string { indexing: attribute | summary } } import field campaign_ref.budget as ad_campaign_budget {} import field campaign_ref.campaign_company_name as ad_campaign_company_name {} } Feeding with references When feeding documents to Vespa, references are assigned like any other string field: [ { "put": "id:test:advertiser::acme", "fields": { "company_name": "ACME Inc. cats and rocket equipment", "contact_person_email": "wile-e@example.com" } }, { "put": "id:acme:campaign::catnip", "fields": { "advertiser_ref": "id:test:advertiser::acme", "name": "Most excellent catnip deals", "budget": 500 } }, { "put": "id:acme:ad::1", "fields": { "campaign_ref": "id:acme:campaign::catnip", "cuteness": 100.0, "cat_picture_url": "/acme/super_cute.jpg" } } ] We can efficiently update the budget of a single campaign, immediately affecting all its child ads: [ { "update": "id:acme:campaign::catnip", "fields": { "budget": { "assign": 450 } } } ] Querying using imported fields You can use imported fields in queries as if they were regular fields. Here are some examples using YQL: Find all ads that still have a budget left in their campaign: select * from ad where ad_campaign_budget > 0; Find all ads that have less than $500 left in their budget and belong to an advertiser whose company name contains the word “ACME”: select * from ad where ad_campaign_budget < 500 and ad_campaign_company_name contains "ACME"; Note that imported fields are not part of the default document summary, so you must add them explicitly to a separate summary if you want their values returned as part of a query result: document-summary my_ad_summary { summary ad_campaign_budget type int {} summary ad_campaign_company_name type string {} summary cuteness type float {} summary cat_picture_url type string {} } Add summary=my_ad_summary as a query HTTP request parameter to use it.
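To make the summary parameter concrete, here is a minimal sketch (not from the original post) of issuing such a query against the Vespa search API; the hostname, port and result-field handling are placeholders for illustration:

import requests

params = {
    "yql": 'select * from ad where ad_campaign_budget > 0;',
    "summary": "my_ad_summary",   # select the summary that exposes the imported fields
}
response = requests.get("http://localhost:8080/search/", params=params)
for hit in response.json().get("root", {}).get("children", []):
    fields = hit.get("fields", {})
    print(fields.get("ad_campaign_budget"), fields.get("cat_picture_url"))

Any HTTP client works equally well; the key point is simply that summary=my_ad_summary asks Vespa to return the document summary that contains the imported fields.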
Global documents One of the primary reasons why distributed, generalized joins are so hard to do efficiently is that performing a join on node A might require looking at data that is only found on node B (or node C, or D…). Vespa gets around this problem by requiring that all documents that may be joined against are always present on every single node. This is achieved by marking parent documents as global in the services.xml declaration. Global documents are automatically distributed to all nodes in the cluster. In our use case, both advertisers and campaigns are used as parents; a sketch of such a services.xml declaration is shown further below. You cannot deploy an application containing reference fields pointing to non-global document types. Vespa verifies this at deployment time. Performance Feeding of campaign budget updates Scenario: feed 2 million ad documents 4 times to a content cluster with one node, each time with a different ratio between ads and parent campaigns. Treat 1:1 as baseline (i.e. 2 million ads, 2 million campaigns). Measure relative speedup as the ratio of how many fewer campaigns must be fed to update the budget in all ads. Results - 1 ad per campaign: 35000 campaign puts/second - 10 ads per campaign: 29000 campaign puts/second, 8.2x relative speedup - 100 ads per campaign: 19000 campaign puts/second, 54x relative speedup - 1000 ads per campaign: 6200 campaign puts/second, 177x relative speedup Note that there is some cost associated with higher fan-outs due to internal management of parent-child mappings, so the speedup is not linear with the fan-out. Searching on ads based on campaign budgets Scenario: we want to search for all ads having a specific budget value. First measure with all ad budgets denormalized, then using an imported budget field from the ads’ referenced campaign documents. As with the feeding benchmark, we’ll use 1, 10, 100 and 1000 ads per campaign with a total of 2 million ads combined across all campaigns. Measure average latency over 5 runs. In each case, the budget attribute is declared as fast-search, which means it has a B-tree index. This allows for efficient value and range searches. Results - 1 ad per campaign: denormalized 0.742 ms, imported 0.818 ms, 10.2% slowdown - 10 ads per campaign: denormalized 0.986 ms, imported 1.186 ms, 20.2% slowdown - 100 ads per campaign: denormalized 0.830 ms, imported 0.958 ms, 15.4% slowdown - 1000 ads per campaign: denormalized 0.936 ms, imported 0.922 ms, 1.5% speedup The observed speedup for the biggest fan-out is likely an artifact of measurement noise. We can see that although there is generally some cost associated with the extra indirection, it is dwarfed by the speedup we get at feeding time. Practical concerns Although a powerful feature, parent-child does not make sense for every use case. Prefer to use parent-child if the relationships between your data items can be naturally represented with such a hierarchy. The 3-level ad → campaign → advertiser example we’ve covered is such a use case. Parent-child is limited to DAG relations and therefore can’t be used to model an arbitrary graph. Parent-child in Vespa is currently only useful when searching in child documents. Queries can follow references from children to parents, but can’t go from parents to children. This is due to how Vespa maintains its internal reference mappings. You CAN search for - “All campaigns with advertiser name X” (campaign → advertiser) - “All ads with a campaign whose budget is greater than X” (ad → campaign) - “All ads with advertiser name X” (ad → campaign → advertiser, via transitive import) You CAN’T search for - “All advertisers with campaigns that have a budget greater than X” (campaign ← advertiser) - “All campaigns that have more than N ads” (ad ← campaign) Parent-child references do not enforce referential integrity constraints. You can feed a child document containing a reference to a parent document that does not exist. Note that you can feed the missing parent document later. Vespa will automatically resolve references from existing child documents.
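As promised under Global documents above, here is a minimal services.xml sketch (not from the original post; the cluster id, redundancy and node layout are assumptions for illustration) marking the two parent types as global while leaving the child type normally distributed:

<content id="ads" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="advertiser" mode="index" global="true"/>
        <document type="campaign" mode="index" global="true"/>
        <document type="ad" mode="index"/>
    </documents>
    <nodes>
        <node hostalias="node1" distribution-key="0"/>
    </nodes>
</content>

With this in place, every content node holds all advertiser and campaign documents, which is what allows the imported fields above to be resolved locally at query time.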
A lot of work has gone into minimizing the performance impact of using imported fields, but there is still some performance penalty introduced by the parent-child indirection. This means that using a denormalized data model may still be faster at search time, while a normalized parent-child model will generally be faster to feed. You must determine what you expect to be the bottleneck in your application and perform benchmarks for your particular use case. There is a fixed per-document memory cost associated with maintaining the internal parent-child mappings. Fields that are imported from a parent must be declared as attribute in the parent document type. As mentioned in the Global documents section, all parent documents must be present on all nodes. This is one of the biggest caveats with the parent-child feature: all nodes must have sufficient capacity for all parents. A core assumption that we have made for the use of this feature is that the number of parent documents is much lower than the number of child documents. At least an order of magnitude fewer documents per parent level is a reasonable heuristic. Comparisons with other systems ElasticSearch ElasticSearch also offers native support for parent-child in its data and query model. There are some distinct differences: - In ElasticSearch it’s the user’s responsibility to ensure child documents are explicitly placed on the same shard as their parents (source). This trades off ease of use with not requiring all parents on all nodes. - Changing a parent reference in ElasticSearch requires a manual delete of the child before it can be reinserted in the new parent shard (source). Parent references in Vespa can be changed with ordinary field updates at any point in time. - In ElasticSearch, referencing fields in parents is done explicitly with “has_parent” query operators (source), while Vespa abstracts this away as regular field accesses. - ElasticSearch has a “has_child” query operator which allows for querying parents based on properties of their children (source). Vespa does not currently support this. - ElasticSearch reports query slowdowns of 500-1000% when using parent-child (source), while the expected overhead when using parent-child attribute fields in Vespa is on the order of 10-20%. - ElasticSearch uses a notion of a “global ordinals” index which must be rebuilt upon changes to the parent set. This may take several seconds and introduce latency spikes (source). All parent-child reference management in Vespa is fully real-time with no additional rebuild costs at feeding or serving time. Distributed SQL stores In recent years there has been a lot of promising development happening on the distributed SQL database (“NewSQL”) front. In particular, both the open-source CockroachDB and Google’s proprietary Spanner architectures offer distributed transactions and joins at scale. As these are both aimed primarily at solving OLTP use cases rather than realtime serving, we will not cover them any further here. Summary In this blog post we’ve looked at Vespa’s new parent-child feature and how it can be used to normalize common data models. We’ve demonstrated how introducing parents both greatly speeds up and simplifies updating information shared between many documents. We’ve also seen that doing so introduces only a minor performance impact on search queries.
Have an exciting use case for parent-child that you’re working on? Got any questions? Let us know! vespa.ai Vespa Engine on GitHub Vespa Engine on Gitter

A Peek Behind the Mail Curtain May 18, 2018
May 18, 2018
marcelatoath
Share

A Peek Behind the Mail Curtain

USE IMAP TO ACCESS SOME UNIQUE FEATURES By Libby Lin, Principal Product Manager Well, we actually won’t show you how we create the magic in our big OATH consumer mail factory. But nevertheless we wanted to share how interested developers could leverage some of our unique features we offer for our Yahoo and AOL Mail customers. To drive experiences like our travel and shopping smart views or message threading, we tag qualified mails with something we call DECOS and THREADID. While we will not indulge in explaining how exactly we use them internally, we wanted to share how they can be used and accessed through IMAP. So let’s just look at a sample IMAP command chain. We’ll just assume that you are familiar with the IMAP protocol at this point and you know how to properly talk to an IMAP server. So here’s how you would retrieve DECO and THREADIDs for specific messages: 1. CONNECT    openssl s_client -crlf -connect imap.mail.yahoo.com:993 2. LOGIN    a login username password    a OK LOGIN completed 3. LIST FOLDERS    a list “” “*”    * LIST (\Junk \HasNoChildren) “/” “Bulk Mail”    * LIST (\Archive \HasNoChildren) “/” “Archive”    * LIST (\Drafts \HasNoChildren) “/” “Draft”    * LIST (\HasNoChildren) “/” “Inbox”    * LIST (\HasNoChildren) “/” “Notes”    * LIST (\Sent \HasNoChildren) “/” “Sent”    * LIST (\Trash \HasChildren) “/” “Trash”    * LIST (\HasNoChildren) “/” “Trash/l2”    * LIST (\HasChildren) “/” “test level 1”    * LIST (\HasNoChildren) “/” “test level 1/nestedfolder”    * LIST (\HasNoChildren) “/” “test level 1/test level 2”    * LIST (\HasNoChildren) “/” “&T2BZfXso-”    * LIST (\HasNoChildren) “/” “&gQKAqk7WWr12hA-”    a OK LIST completed 4.SELECT FOLDER    a select inbox    * 94 EXISTS    * 0 RECENT    * OK [UIDVALIDITY 1453335194] UIDs valid    * OK [UIDNEXT 40213] Predicted next UID    * FLAGS (\Answered \Deleted \Draft \Flagged \Seen $Forwarded $Junk $NotJunk)    * OK [PERMANENTFLAGS (\Answered \Deleted \Draft \Flagged \Seen $Forwarded $Junk $NotJunk)] Permanent flags    * OK [HIGHESTMODSEQ 205]    a OK [READ-WRITE] SELECT completed; now in selected state 5. SEARCH FOR UID    a uid search 1:*    * SEARCH 1 2 3 4 11 12 14 23 24 75 76 77 78 114 120 121 124 128 129 130 132 133 134 135 136 137 138 40139 40140 40141 40142 40143 40144 40145 40146 40147 40148     40149 40150 40151 40152 40153 40154 40155 40156 40157 40158 40159 40160 40161 40162 40163 40164 40165 40166 40167 40168 40172 40173 40174 40175 40176     40177 40178 40179 40182 40183 40184 40185 40186 40187 40188 40190 40191 40192 40193 40194 40195 40196 40197 40198 40199 40200 40201 40202 40203 40204     40205 40206 40207 40208 40209 40211 40212    a OK UID SEARCH completed 6. FETCH DECOS BASED ON UID    a uid fetch 40212 (X-MSG-DECOS X-MSG-ID X-MSG-THREADID)    * 94 FETCH (UID 40212 X-MSG-THREADID “108” X-MSG-ID “ACfIowseFt7xWtj0og0L2G0T1wM” X-MSG-DECOS (“FTI” “F1” “EML”))    a OK UID FETCH completed
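For developers who prefer scripting this, here is a minimal Python sketch of the same command chain using the standard imaplib module (credentials are placeholders; X-MSG-DECOS, X-MSG-ID and X-MSG-THREADID are the Yahoo/AOL-specific fetch items described above):

import imaplib

conn = imaplib.IMAP4_SSL("imap.mail.yahoo.com", 993)
conn.login("username", "password")
conn.select("Inbox")
status, data = conn.uid("SEARCH", None, "ALL")      # equivalent of "a uid search 1:*"
uids = data[0].split()
status, response = conn.uid("FETCH", uids[-1].decode(), "(X-MSG-DECOS X-MSG-ID X-MSG-THREADID)")
print(response)                                      # raw FETCH response containing the DECOS and THREADID
conn.logout()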

Scaling TensorFlow model evaluation with Vespa May 7, 2018
May 7, 2018
Share

Scaling TensorFlow model evaluation with Vespa

In this blog post we’ll explain how to use Vespa to evaluate TensorFlow models over arbitrarily many data points while keeping total latency constant. We provide benchmark data from our performance lab where we compare evaluation using TensorFlow serving with evaluating TensorFlow models in Vespa. We recently introduced a new feature that enables direct import of TensorFlow models into Vespa for use at serving time. As mentioned in a previous blog post, our approach to support TensorFlow is to extract the computational graph and parameters of the TensorFlow model and convert it to Vespa’s tensor primitives. We chose this approach over attempting to integrate our backend with the TensorFlow runtime. There were a few reasons for this. One was that we would like to support other frameworks than TensorFlow. For instance, our next target is to support ONNX. Another was that we would like to avoid the inevitable overhead of such an integration, both on performance and code maintenance. Of course, this means a lot of optimization work on our side to make this as efficient as possible, but we do believe it is a better long term solution. Naturally, we thought it would be interesting to set up some sort of performance comparison between Vespa and TensorFlow for cases that use a machine learning ranking model. Before we get to that however, it is worth noting that Vespa and TensorFlow serving has an important conceptual difference. With TensorFlow you are typically interested in evaluating a model for a single data point, be that an image for an image classifier, or a sentence for a semantic representation etc. The use case for Vespa is when you need to evaluate the model over many data points. Examples are finding the best document given a text, or images similar to a given image, or computing a stream of recommendations for a user. So, let’s explore this by setting up a typical search application in Vespa. We’ve based the application in this post on the Vespa blog recommendation tutorial part 3. In this application we’ve trained a collaborative filtering model which computes an interest vector for each existing user (which we refer to as the user profile) and a content vector for each blog post. In collaborative filtering these vectors are commonly referred to as latent factors. The application takes a user id as the query, retrieves the corresponding user profile, and searches for the blog posts that best match the user profile. The match is computed by a simple dot-product between the latent factor vectors. This is used as the first phase ranking. We’ve chosen vectors of length 128. In addition, we’ve trained a neural network in TensorFlow to serve as the second-phase ranking. The user vector and blog post vector are concatenated and represents the input (of size 256) to the neural network. The network is fully connected with 2 hidden layers of size 512 and 128 respectively, and the network has a single output value representing the probability that the user would like the blog post. In the following we set up two cases we would like to compare. The first is where the imported neural network is evaluated on the content node using Vespa’s native tensors. In the other we run TensorFlow directly on the stateless container node in the Vespa 2-tier architecture. In this case, the additional data required to evaluate the TensorFlow model must be passed back from the content node(s) to the container node. We use Vespa’s fbench utility to stress the system under fairly heavy load. 
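For reference, an fbench run matching the setup described below can be launched with something like the following (a sketch only; the query file, run length and host/port are placeholders rather than the exact parameters from our lab):

vespa-fbench -n 64 -c 0 -s 300 -q queries.txt localhost 8080

Here -n is the number of parallel clients, -c 0 tells each client to issue the next query as soon as the previous one completes, -s is the duration of the run in seconds, and -q points to a file with one query URL per line.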
In this first test, we set up the system on a single host. This means the container and content nodes are running on the same host. We set up fbench so it uses 64 clients in parallel to query this system as fast as possible. 1000 documents per query are evaluated in the first phase and the top 200 documents are evaluated in the second phase. In the following, latency is measured in ms at the 95th percentile and QPS is the actual query rate in queries per second: - Baseline: 19.68 ms / 3251.80 QPS - Baseline with additional data: 24.20 ms / 2644.74 QPS - Vespa ranking: 42.8 ms / 1495.02 QPS - TensorFlow batch ranking: 42.67 ms / 1499.80 QPS - TensorFlow single ranking: 103.23 ms / 619.97 QPS Some explanation is in order. The baseline here is the first phase ranking only without returning the additional data required for full ranking. The baseline with additional data is the same but returns the data required for ranking. Vespa ranking evaluates the model on the content backend. Both TensorFlow tests evaluate the model after content has been sent to the container. The difference is that batch ranking evaluates the model in one pass by batching the 200 documents together in a larger matrix, while single evaluates the model once per document, i.e. 200 evaluations. The reason why we test this is that Vespa evaluates the model once per document to be able to evaluate during matching, so in terms of efficiency this is a fairer comparison. We see in the numbers above for this application that Vespa ranking and TensorFlow batch ranking achieve similar performance. This means that the gains in ranking batch-wise is offset by the cost of transferring data to TensorFlow. This isn’t entirely a fair comparison however, as the model evaluation architecture of Vespa and TensorFlow differ significantly. For instance, we measure that TensorFlow has a much lower degree of cache misses. One reason is that batch-ranking necessitates a more contiguous data layout. In contrast, relevant document data can be spread out over the entire available memory on the Vespa content nodes. Another significant reason is that Vespa currently uses double floating point precision in ranking and in tensors. In the above TensorFlow model we have used floats, resulting in half the required memory bandwidth. We are considering making the floating point precision in Vespa configurable to improve evaluation speed for cases where full precision is not necessary, such as in most machine learned models. So we still have some work to do in optimizing our tensor evaluation pipeline, but we are pleased with our results so far. Now, the performance of the model evaluation itself is only a part of the system-wide performance. In order to rank with TensorFlow, we need to move data to the host running TensorFlow. This is not free, so let’s delve a bit deeper into this cost. The locality of data in relation to where the ranking computation takes place is an important aspect and indeed a core design point of Vespa. If your data is too large to fit on a single machine, or you want to evaluate your model on more data points faster than is possible on a single machine, you need to split your data over multiple nodes. Let’s assume that documents are distributed randomly across all content nodes, which is a very reasonable thing to do. Now, when you need to find the globally top-N documents for a given query, you first need to find the set of candidate documents that match the query. 
In general, if ranking is done on some node other than where the content is, all the data required for the computation obviously needs to be transferred there. Usually, the candidate set can be large, so this incurs a significant cost in network activity, particularly as the number of content nodes increases. This approach can become infeasible quite quickly. This is why a core design aspect of Vespa is to evaluate models where the content is stored. This is illustrated in the figure above. The problem of transferring data for ranking is compounded as the number of content nodes increases, because to find the global top-N documents, the top-K documents of each content node need to be passed to the external ranker. This means that, if we have C content nodes, we need to transfer C*K documents over the network. This runs into hard network limits as the number of documents and data size for each document increases. Let’s see the effect of this when we change the setup of the same application to run on three content nodes and a single stateless container which runs TensorFlow. In the following graph we plot the 95th percentile latency as we increase the number of parallel requests (clients) from 1 to 30: Here we see that with low traffic, TensorFlow and Vespa are comparable in terms of latency. When we increase the load however, the cost of transmitting the data is the driver for the increase in latency for TensorFlow, as seen in the red line in the graph. The differences between batch and single mode TensorFlow evaluation become smaller as the system as a whole becomes largely network-bound. In contrast, the Vespa application scales much better. Now, as we increase traffic even further, will the Vespa solution likewise become network-bound? In the following graph we plot the sustained requests per second as we increase clients to 200: Vespa ranking is unable to sustain the same amount of QPS as just transmitting the data (the blue line), which is a hint that the system has become CPU-bound on the evaluation of the model on Vespa. While Vespa can sustain around 3500 QPS, the TensorFlow solution maxes out at 350 QPS, which is reached quite early as we increase traffic. As the system is unable to transmit data fast enough, the latency naturally has to increase, which is the cause of the linearity in the latency graph above. At 200 clients the average latency of the TensorFlow solution is around 600 ms, while Vespa is around 60 ms. So, the obvious key takeaway here is that from a scalability point of view it is beneficial to avoid sending data around for evaluation. That is both a key design point of Vespa and a key reason why we implemented TensorFlow support in the first place. Running the models where the content is allows for better utilization of resources, but perhaps the more interesting aspect is the ability to run more complex or deeper models while still being able to scale the system.

Achieving Major Stability and Performance Improvements in Yahoo Mail with a Novel Redux Architecture April 18, 2018
April 18, 2018
mikesefanov
Share

Achieving Major Stability and Performance Improvements in Yahoo Mail with a Novel Redux Architecture

yahoodevelopers: By Mohit Goenka, Gnanavel Shanmugam, and Lance Welsh At Yahoo Mail, we’re constantly striving to upgrade our product experience. We do this not only by adding new features based on our members’ feedback, but also by providing the best technical solutions to power the most engaging experiences. As such, we’ve recently introduced a number of novel and unique revisions to the way in which we use Redux that have resulted in significant stability and performance improvements. Developers may find our methods useful in achieving similar results in their apps. Improvements to product metrics Last year Yahoo Mail implemented a brand new architecture using Redux. Since then, we have transformed the overall architecture to reduce latencies in various operations, reduce JavaScript exceptions, and better synchronized states. As a result, the product is much faster and more stable. Stability improvements: - when checking for new emails – 20% - when reading emails – 30% - when sending emails – 20% Performance improvements: - 10% improvement in page load performance - 40% improvement in frame rendering time We have also reduced API calls by approximately 20%. How we use Redux in Yahoo Mail Redux architecture is reliant on one large store that represents the application state. In a Redux cycle, action creators dispatch actions to change the state of the store. React Components then respond to those state changes. We’ve made some modifications on top of this architecture that are atypical in the React-Redux community. For instance, when fetching data over the network, the traditional methodology is to use Thunk middleware. Yahoo Mail fetches data over the network from our API. Thunks would create an unnecessary and undesirable dependency between the action creators and our API. If and when the API changes, the action creators must then also change. To keep these concerns separate we dispatch the action payload from the action creator to store them in the Redux state for later processing by “action syncers”. Action syncers use the payload information from the store to make requests to the API and process responses. In other words, the action syncers form an API layer by interacting with the store. An additional benefit to keeping the concerns separate is that the API layer can change as the backend changes, thereby preventing such changes from bubbling back up into the action creators and components. This also allowed us to optimize the API calls by batching, deduping, and processing the requests only when the network is available. We applied similar strategies for handling other side effects like route handling and instrumentation. Overall, action syncers helped us to reduce our API calls by ~20% and bring down API errors by 20-30%. Another change to the normal Redux architecture was made to avoid unnecessary props. The React-Redux community has learned to avoid passing unnecessary props from high-level components through multiple layers down to lower-level components (prop drilling) for rendering. We have introduced action enhancers middleware to avoid passing additional unnecessary props that are purely used when dispatching actions. Action enhancers add data to the action payload so that data does not have to come from the component when dispatching the action. This avoids the component from having to receive that data through props and has improved frame rendering by ~40%. 
The use of action enhancers also avoids writing utility functions to add commonly-used data to each action from action creators. In our new architecture, the store reducers accept the dispatched action via action enhancers to update the state. The store then updates the UI, completing the action cycle. Action syncers then initiate the call to the backend APIs to synchronize local changes. Conclusion Our novel use of Redux in Yahoo Mail has led to significant user-facing benefits through a more performant application. It has also reduced development cycles for new features due to its simplified architecture. We’re excited to share our work with the community and would love to hear from anyone interested in learning more.

Secure Images March 20, 2018
March 20, 2018
marcelatoath
Share

Secure Images

oath-postmaster: By Marcel Becker The mail team at OATH is busy  integrating  Yahoo and AOL technology to deliver an even better experience across all our consumer mail products. While privacy and security are top priority for us, we also want to improve the experience and remove unnecessary clutter across all of our products. Starting this week we will be serving images in mails via our own secure proxy servers. This will not only increase speed and security in our own mail products and reduce the risk of phishing and other scams,  but it will also mean that our users don’t have to fiddle around with those “enable images” settings. Messages and inline images will now just show up as originally intended. We are aware that commercial mail senders are relying on images (so-called pixels) to track delivery and open rates. Our proxy solution will continue to support most of these cases and ensure that true mail opens are recorded. For senders serving dynamic content based on the recipient’s location (leveraging standard IP-based browser and app capabilities) we recommend falling back on other tools and technologies which do not rely on IP-based targeting. All of our consumer mail applications (Yahoo and AOL) will benefit from this change. This includes our desktop products as well as our mobile applications across iOS and Android. If you have any feedback or want to discuss those changes with us personally, just send us a note to mail-questions@oath.com.

Introducing TensorFlow support March 14, 2018
March 14, 2018
Share

Introducing TensorFlow support

In previous blog posts we have talked about Vespa’s tensor API which enables some advanced ranking capabilities. The primary use case is for machine learned ranking, where you train your models using some machine learning framework, convert the models to Vespa’s tensor format, and deploy them to Vespa. This works well, but converting trained models to Vespa form is cumbersome. We are now happy to announce a new feature that makes this process a lot easier: TensorFlow import. With this feature you can directly deploy models you’ve trained in TensorFlow to Vespa, and use these models during ranking. This means that the models are executed in parallel over multiple threads and machines for a single query, which makes it possible to evaluate the model over any number of data items and still bound the total response time. In addition, the data items to evaluate with the TensorFlow model can be selected dynamically with a query, and with a cheaper first-phase rank function if needed. Since the TensorFlow models are evaluated on the nodes storing the data, we avoid sending any data over the wire for evaluation. In this post we’d like to introduce this new feature by discussing how it works, some assumptions behind working with TensorFlow and Vespa, and how to use the feature. Vespa is optimized to evaluate models repeatedly over many data items (documents). To do this efficiently, we do not evaluate the model using the TensorFlow inference engine. TensorFlow adds a non-trivial amount of overhead and instrumentation which it uses to manage potentially large scale computations. This is significant in our case, since we need to evaluate models on a microsecond scale. Hence our approach is to extract the parameters (weights) into Vespa tensors, and use the model specification in the TensorFlow graph to generate efficient Vespa tensor expressions. Importing TensorFlow models is as simple as saving the TensorFlow model using the SavedModel API, adding those files to the Vespa application package, and referencing the model using the new TensorFlow ranking feature. For instance, if your files are in models/my_model in the application package: first-phase {     expression: sum(tensorflow("my_model/saved")) } The above expression runs the model and sums its output to a single scalar value to use in ranking. One thing you will have to provide is the input(s), or feed, to the graph. Vespa expects you to provide a macro with the same name as the input placeholder. In the macro you can specify where the input should come from, be it a parameter sent along with the query, a document field (possibly in a parent document) or a constant. As mentioned, Vespa evaluates the imported models once per document. Depending on the requirements of the application, this can impose some natural limitations on the size and complexity of the models that can be evaluated. However, Vespa has a number of other search and rank features that can be used to reduce the search space before running the machine learned models. Typically, one would use the search and first ranking phases to select a relatively small number of candidate documents, which are then given their final rank score in the more computationally expensive second phase model evaluation. Also note that TensorFlow import is new to Vespa, and we currently only support a subset of the TensorFlow operations.
While the supported operations should suffice for many relevant use cases, there are some that are not supported yet due to potentially being too expensive to evaluate per document. For instance, convolutional networks and recurrent networks (LSTMs etc) are not supported. We are continually working to add functionality, if you find that we have some glaring omissions, please let us know. Going forward we are focusing on further improving performance of our tensor framework for important use cases. We’ll follow up this post with one showing how the performance of evaluation in Vespa compares with TensorFlow serving. We will also add more supported frameworks and our next target is ONNX. You can read more about this feature in the ranking with TensorFlow model in Vespa documentation. We are excited to announce the TensorFlow support, and we’re eager to hear what you are building with it.
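As a closing illustration of the input-macro convention described earlier, here is a minimal rank-profile sketch (not from this post; the placeholder name "input", the query feature and the model path are assumptions for illustration, and the query tensor type must be declared separately):

rank-profile tf_rank {
    # Macro named after the model's input placeholder; it feeds the
    # placeholder from a tensor sent along with the query.
    macro input() {
        expression: query(user_vector)
    }
    first-phase {
        expression: sum(tensorflow("my_model/saved"))
    }
}

The same macro could instead read a document attribute or a constant, which is what makes it possible to combine document-side and query-side data as model input.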

Success at Apache: A Newbie’s Narrative February 5, 2018
February 5, 2018
mikesefanov
Share

Success at Apache: A Newbie’s Narrative

yahoodevelopers: Kuhu Shukla (bottom center) and team at the 2017 DataWorks Summit By Kuhu Shukla This post first appeared here on the Apache Software Foundation blog as part of ASF’s “Success at Apache” monthly blog series. As I sit at my desk on a rather frosty morning with my coffee, looking up new JIRAs from the previous day in the Apache Tez project, I feel rather pleased. The latest community release vote is complete, the bug fixes that we so badly needed are in and the new release that we tested out internally on our many thousand strong cluster is looking good. Today I am looking at a new stack trace from a different Apache project process and it is hard to miss how much of the exceptional code I get to look at every day comes from people all around the globe. A contributor leaves a JIRA comment before he goes on to pick up his kid from soccer practice while someone else wakes up to find that her effort on a bug fix for the past two months has finally come to fruition through a binding +1. Yahoo – which joined AOL, HuffPost, Tumblr, Engadget, and many more brands to form the Verizon subsidiary Oath last year – has been at the frontier of open source adoption and contribution since before I was in high school. So while I have no historical trajectories to share, I do have a story on how I found myself in an epic journey of migrating all of Yahoo jobs from Apache MapReduce to Apache Tez, a then-new DAG based execution engine. Oath grid infrastructure is through and through driven by Apache technologies be it storage through HDFS, resource management through YARN, job execution frameworks with Tez and user interface engines such as Hive, Hue, Pig, Sqoop, Spark, Storm. Our grid solution is specifically tailored to Oath’s business-critical data pipeline needs using the polymorphic technologies hosted, developed and maintained by the Apache community. On the third day of my job at Yahoo in 2015, I received a YouTube link on An Introduction to Apache Tez. I watched it carefully trying to keep up with all the questions I had and recognized a few names from my academic readings of Yarn ACM papers. I continued to ramp up on YARN and HDFS, the foundational Apache technologies Oath heavily contributes to even today. For the first few weeks I spent time picking out my favorite (necessary) mailing lists to subscribe to and getting started on setting up on a pseudo-distributed Hadoop cluster. I continued to find my footing with newbie contributions and being ever more careful with whitespaces in my patches. One thing was clear – Tez was the next big thing for us. By the time I could truly call myself a contributor in the Hadoop community nearly 80-90% of the Yahoo jobs were now running with Tez. But just like hiking up the Grand Canyon, the last 20% is where all the pain was. Being a part of the solution to this challenge was a happy prospect and thankfully contributing to Tez became a goal in my next quarter. The next sprint planning meeting ended with me getting my first major Tez assignment – progress reporting. The progress reporting in Tez was non-existent – “Just needs an API fix,”  I thought. Like almost all bugs in this ecosystem, it was not easy. How do you define progress? How is it different for different kinds of outputs in a graph? The questions were many. I, however, did not have to go far to get answers. The Tez community actively came to a newbie’s rescue, finding answers and posing important questions. 
I started attending the bi-weekly Tez community sync up calls and asking existing contributors and committers for course correction. Suddenly the team was much bigger, the goals much more chiseled. This was new to anyone like me who came from the networking industry, where the most open part of the code are the RFCs and the implementation details are often hidden. These meetings served as a clean room for our coding ideas and experiments. Ideas were shared, to the extent of which data structure we should pick and what a future user of Tez would take from it. In between the usual status updates and extensive knowledge transfers were made. Oath uses Apache Pig and Apache Hive extensively and most of the urgent requirements and requests came from Pig and Hive developers and users. Each issue led to a community JIRA and as we started running Tez at Oath scale, new feature ideas and bugs around performance and resource utilization materialized. Every year most of the Hadoop team at Oath travels to the Hadoop Summit where we meet our cohorts from the Apache community and we stand for hours discussing the state of the art and what is next for the project. One such discussion set the course for the next year and a half for me. We needed an innovative way to shuffle data. Frameworks like MapReduce and Tez have a shuffle phase in their processing lifecycle wherein the data from upstream producers is made available to downstream consumers. Even though Apache Tez was designed with a feature set corresponding to optimization requirements in Pig and Hive, the Shuffle Handler Service was retrofitted from MapReduce at the time of the project’s inception. With several thousands of jobs on our clusters leveraging these features in Tez, the Shuffle Handler Service became a clear performance bottleneck. So as we stood talking about our experience with Tez with our friends from the community, we decided to implement a new Shuffle Handler for Tez. All the conversation points were tracked now through an umbrella JIRA TEZ-3334 and the to-do list was long. I picked a few JIRAs and as I started reading through I realized, this is all new code I get to contribute to and review. There might be a better way to put this, but to be honest it was just a lot of fun! All the whiteboards were full, the team took walks post lunch and discussed how to go about defining the API. Countless hours were spent debugging hangs while fetching data and looking at stack traces and Wireshark captures from our test runs. Six months in and we had the feature on our sandbox clusters. There were moments ranging from sheer frustration to absolute exhilaration with high fives as we continued to address review comments and fixing big and small issues with this evolving feature. As much as owning your code is valued everywhere in the software community, I would never go on to say “I did this!” In fact, “we did!” It is this strong sense of shared ownership and fluid team structure that makes the open source experience at Apache truly rewarding. This is just one example. A lot of the work that was done in Tez was leveraged by the Hive and Pig community and cross Apache product community interaction made the work ever more interesting and challenging. Triaging and fixing issues with the Tez rollout led us to hit a 100% migration score last year and we also rolled the Tez Shuffle Handler Service out to our research clusters. As of last year we have run around 100 million Tez DAGs with a total of 50 billion tasks over almost 38,000 nodes. 
In 2018 as I move on to explore Hadoop 3.0 as our future release, I hope that if someone outside the Apache community is reading this, it will inspire and intrigue them to contribute to a project of their choice. As an astronomy aficionado, going from a newbie Apache contributor to a newbie Apache committer was very much like looking through my telescope - it has endless possibilities and challenges you to be your best. About the Author: Kuhu Shukla is a software engineer at Oath and did her Masters in Computer Science at North Carolina State University. She works on the Big Data Platforms team on Apache Tez, YARN and HDFS with a lot of talented Apache PMCs and Committers in Champaign, Illinois. A recent Apache Tez Committer herself she continues to contribute to YARN and HDFS and spoke at the 2017 Dataworks Hadoop Summit on “Tez Shuffle Handler: Shuffling At Scale With Apache Hadoop”. Prior to that she worked on Juniper Networks’ router and switch configuration APIs. She likes to participate in open source conferences and women in tech events. In her spare time she loves singing Indian classical and jazz, laughing, whale watching, hiking and peering through her Dobsonian telescope.

Optimizing realtime evaluation of neural net models on Vespa January 5, 2018
January 5, 2018
Share

Optimizing realtime evaluation of neural net models on Vespa

In this blog post we describe how we recently made neural network evaluation over 20 times faster on Vespa’s tensor framework. Vespa is the open source platform for building applications that carry out scalable real-time data processing, for instance search and recommendation systems. These require significant amounts of computation over large data sets. With advances in machine learning, it is desirable to run more advanced ranking models such as large linear or logistic regression models and artificial neural networks. Because of the tight computational budget at serving time, the evaluation of such models must be done in an efficient and scalable manner. We introduced the tensor API to help solve such problems. The tensor API allows the concise expression of general computations on many-dimensional data, while simultaneously leaving room for deep optimizations on the platform side.  What we mean by this is that the tensor API is very expressive and supports a large range of model types. The general evaluation of tensors is not necessarily efficient in all cases, so in addition to continually working to increase the baseline performance, we also perform specific optimizations for important use cases. In this blog post we will describe one such important optimization we recently did, which improved neural network evaluation performance by over 20x. To illustrate the types of optimization we can do, consider the following tensor expression representing a dot product between vectors v1 and v2: reduce(join(v1, v2, f(x, y)(x * y)), sum) The dot product is calculated by multiplying the vectors together by using the join operation, then summing the elements in the vector together using the reduce operation. The result is a single scalar. A naive implementation would first calculate the join and introduce a temporary tensor before the reduce sums up the cells to a single scalar. Particularly for large tensors with many dimensions, such a temporary tensor can be large and require significant memory allocations. This is obviously not the most efficient path to calculate the resulting tensor.  A general improvement would be to avoid the temporary tensor and reduce to the single scalar directly as the tensors are iterated through. In Vespa, when ranking expressions are compiled, the abstract syntax tree (AST) is analyzed for such optimizations. When known cases are recognized, the most efficient implementation is selected. In the above example, assuming the vectors are dense and they share dimensions, Vespa has optimized hardware accelerated code for doing dot products on vectors. For sparse vectors, Vespa falls back to a implementation for weighted sets which build hash tables for efficient lookups.  This method allows recognition of both large and small optimizations, from simple dot products to specialized implementations for more advanced ranking models. Vespa currently has a few optimizations implemented, and we are adding more as important use cases arise. We recently set out to improve the performance of evaluating simple neural networks, a case quite similar to the one presented in the previous blog post. The ranking expression to optimize was:    macro hidden_layer() {        expression: elu(xw_plus_b(nn_input, constant(W_fc1), constant(b_fc1), x))    }    macro final_layer() {        expression: xw_plus_b(hidden_layer, constant(W_fc2), constant(b_fc2), hidden)    }    first-phase {        expression: final_layer    } This represents a simple two-layer neural network.  
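To make the join/reduce notation above concrete, here is a small numpy sketch (an illustration, not Vespa code) of what reduce(join(v1, v2, f(x, y)(x * y)), sum) computes for two dense vectors:

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 5.0, 6.0])
joined = v1 * v2                  # join with f(x, y)(x * y): element-wise multiplication
result = joined.sum()             # reduce over the shared dimension with sum
assert result == np.dot(v1, v2)   # i.e. a plain dot product

A naive evaluator materializes the intermediate joined tensor; the optimizations described in this post recognize the whole pattern and go straight to a fused, hardware-accelerated dot product.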
Whenever a new version of Vespa is built, a large suite of integration and performance tests are run. When we want to optimize a specific use case, we first create a performance test to set a baseline.  With the performance tests we get both historical graphs as well as detailed profiling information and performance statistics sampled from the system under load.  This allows us to identify and optimize any bottlenecks. Also, it adds a bit of gamification to the process. The graph below shows the performance of a test where 10 000 random documents are ranked according to the evaluation of a simple two-layer neural network: Here, the x-axis represent builds, and the y-axis is the end-to-end latency as measured from a machine firing off queries to a server running the test on Vespa. As can be seen, over the course of optimization the latency was reduced from 150-160 ms to 7 ms, an impressive 20x end-to-end latency improvement. When a query is received by Vespa, it is first processed in the stateless container. This is usually where applications would process the query, possibly enriching it with additional information. Vespa does a bit of default work here as well, and also transforms the query a bit. For this test, no specific handling was done except this default handling. After initial processing, the query is dispatched to each node in the stateful content layer. For this test, only a single node is used in the content layer, but applications would typically have multiple. The query is processed in parallel on each node utilizing multiple cores and the ranking expression gets executed once for each document that matches the query. For this test with 10 000 documents, the ranking expression and thus the neural network gets evaluated in total 10 000 times before the top N documents are returned to the container layer. The following steps were taken to optimize this expression, with each step visible as a step in the graph above: 1. Recognize join with multiplication as part of an inner product. 2. Optimize for bias addition. 3. Optimize vector concatenation (which was part of the input to the neural network) 4. Replace appropriate sub-expressions with the dense vector-matrix product. It was particularly the final step which gave the biggest percent wise performance boost. The solution in total was to recognize the vector-matrix multiplication done in the neural network layer and replace that with specialized code that invokes the existing hardware accelerated dot product code. In the expression above, the operation xw_plus_b is replaced with a reduce of the multiplicative join and additive join. This is what is recognized and performed in one step instead of three. This strategy of optimizing specific use cases allows for a more rapid application development for users of Vespa. Consider the case where some exotic model needs to be run on Vespa. Without the generic tensor API users would have to implement their own custom rank features or wait for the Vespa core developers to implement them. In contrast, with the tensor API, teams can continue their development without external dependencies to the Vespa team.  If necessary, the Vespa team can in parallel implement the optimizations needed to meet performance requirements, as we did in this case with neural networks.

Blog recommendation with neural network models December 15, 2017
December 15, 2017
Share

Blog recommendation with neural network models

Introduction The main objective of this post is to show how to deploy neural network models in Vespa using our Tensor Framework. In fact, any model that can be represented by a series of Tensor operations can be deployed in Vespa. Neural networks are just a popular example. In addition, we will introduce the multi-phase ranking model available in Vespa that can be used to run more expensive models in a phase based on a reduced number of documents returned by previous phases. This feature allows us to run models that would be prohibitively expensive to use if we had to run them at query-time across all the documents indexed in Vespa. Model Training In this section, we will define a neural network model, show how we created a suitable dataset to train the model and train the model using TensorFlow. The neural network model In the previous blog post, we computed latent factors for each user and each document and then used a dot-product between user and document vectors to rank the documents available for recommendation to a specific user. In this tutorial we will train a 2-layer fully connected neural network model that will take the same user (u) and document (d) latent factors as input and will output the probability of that specific user liking the document. More technically, our previous rank function r was given by r(u,d)=u∗d while in this tutorial it will be given by r(u,d,θ)=f(u,d,θ) where f represents the neural network model described below and θ represents the neural network parameter values that we need to learn from training data. The specific form of the neural network model used here is p = sigmoid(h1×W2+b2) h1 = ReLU(x×W1+b1) where x=[u,d] is the concatenation of the user and document latent factors, ReLU is the rectifier activation function, sigmoid represents the sigmoid function, p is the output of the model and in this case can be interpreted as the probability of the user u liking a blog post d. The parameters of the model are represented by θ=(W1,W2,b1,b2). Training data For the training dataset, we will start with the (user_id, post_id) rows from the “training_set_ids” generated previously. Then, we remove every row for which there are no latent factors for the user_id or post_id contained in that row. This gives us a dataset with only positive feedback (label = 1), since each row represents one instance of a user_id liking a post_id. In order to train our model, we need to generate negative feedback (label = 0). So, for each row (user_id, post_id) in the current dataset we will generate N negative feedback rows by randomly sampling post_id_fake from the pool of post_id’s available in the current set, so that for each (user_id, post_id) row with label = 1 we augment the dataset with N (user_id, post_id_fake) rows with label = 0. Find code to generate the dataset in the utility scripts. Training with TensorFlow With the training data in hand, we split it into an 80% training set and a 20% validation set and used TensorFlow to train the model.
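Before turning to the training script, here is a small numpy sketch of the forward pass defined by the equations above (illustration only, not the tutorial’s code; the shapes are chosen to match the constants used later in this post, namely a 20-dimensional input, 40 hidden units and a single output):

import numpy as np

def forward(x, W1, b1, W2, b2):
    # h1 = ReLU(x·W1 + b1), p = sigmoid(h1·W2 + b2)
    h1 = np.maximum(0.0, x @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h1 @ W2 + b2)))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 20))                     # concatenated user and document factors
W1, b1 = rng.normal(size=(20, 40)), np.zeros(40)
W2, b2 = rng.normal(size=(40, 1)), np.zeros(1)
print(forward(x, W1, b1, W2, b2))                # a probability-like value in (0, 1)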
The script used can be found in the utility scripts and executed by $ python vespaModel.py --product_features_file_path vespa_tutorial_data/user_item_cf_cv/product.json \                       --user_features_file_path vespa_tutorial_data/user_item_cf_cv/user.json \                       --dataset_file_path vespa_tutorial_data/nn_model/training_set.txt The progress of your training can be visualized using Tensorboard $ tensorboard --logdir runs/*/summaries/ Model deployment in Vespa Two Phase Ranking When a query is sent to Vespa, it will scan all documents available and select the ones (possibly all) that match the query. When the set of documents matching a query is found, Vespa must decide the order of these documents. Unless explicit sorting is used, Vespa decides this order by calculating a number for each document, the rank score, and sorts the documents by this number. The rank score can be any function that takes as arguments parameters sent by the query, document attributes defined in search definitions and global parameters not directly linked to query or document parameters. One example of rank score is the output of the neural network model defined in this tutorial. The model takes the latent factor u associated with a specific user_id (query parameter), the latent factor dd associated with document post_id (document attribute) and learned model parameters (global parameters not related to a specific query nor document) and returns the probability of user u to like document d. However, even though Vespa is designed to carry out such calculations optimally, complex expressions becomes expensive when they must be calculated over every one of a large set of matching documents. To relieve this, Vespa can be configured to run two ranking expressions - a smaller and less accurate one on all hits during the matching phase, and a more expensive and accurate one only on the best hits during the reranking phase. In general this allows a more optimal usage of the cpu budget by dedicating more of the total cpu towards the best candidate hits. The reranking phase, if specified, will by default be run on the 100 best hits on each search node, after matching and before information is returned upwards to the search container. The number of hits to rerank can be turned up or down as needed. Below is a toy example showing how to configure first and second phase ranking expressions in the rank profile section of search definitions where the second phase rank expression is run on the 200 best hits from first phase on each search node. search myapp {    …    rank-profile default inherits default {        first-phase {            expression: nativeRank + query(deservesFreshness) * freshness(timestamp)        }        second-phase {            expression {                0.7 * ( 0.7*fieldMatch(title) + 0.2*fieldMatch(description) + 0.1*fieldMatch(body) ) +                0.3 * attributeMatch(keywords)            }            rerank-count: 200        }    } } Constant Tensor files Once the model has been trained in TensorFlow, export the model parameters (W1,W2,b1,b2) to the application folder as Tensors according to the Vespa Document JSON format. 
The complete code to serialize the model parameters using the Vespa Tensor format can be found in the utility scripts, but the following code snippet shows how to serialize the hidden layer weights W1:

serializer.serialize_to_disk(variable_name = "W_hidden", dimension_names = ['input', 'hidden'])

Note that Vespa currently requires dimension names for all the Tensor dimensions (in this case W1 is a matrix, therefore the number of dimensions is 2). In the following section, we will use the following code in the blog_post search definition in order to be able to use the constant tensor W_hidden in our ranking expression.

    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }

A constant tensor is data that is not specific to a given document type. In the case above we define W_hidden to be a tensor with two dimensions (a matrix), where the first dimension is named input and has size 20 and the second dimension is named hidden and has size 40. The data were serialized to a JSON file located at constants/W_hidden.json relative to the application package folder.

Vespa ranking expressions

In order to evaluate the neural network model trained with TensorFlow in the previous section, we need to translate the model structure to a Vespa ranking expression to be defined in the blog_post search definition. To keep response latency low, we will take advantage of the Two Phase Ranking available in Vespa and define the first phase ranking to be the same ranking function used in the previous blog post, which is a dot-product between the user and document latent factors. After the documents have been sorted by the first phase ranking function, we will rerank the top 200 documents from each search node using the second phase ranking given by the neural network model presented above. Note that we define two ranking profiles in the search definition below. This allows us to decide which ranking profile to use at query time. We defined a ranking profile named tensor, which only applies the dot-product between user and document latent factors for all matching documents, and a ranking profile named nn_tensor, which reranks the top 200 documents using the neural network model discussed in the previous section. We will walk through each part of the blog_post search definition, see blog_post.sd. As always, we start a search definition with the following line

search blog_post {

We define the document type blog_post the same way we have done in the previous tutorial.

    document blog_post {
      # Field definitions
      # Examples:
      field date_gmt type string {
          indexing: summary
      }
      field language type string {
          indexing: summary
      }
      # Remaining fields as found in previous tutorial
    }

We define a ranking profile named tensor which ranks all the matching documents by the dot-product between the document latent factor and the user latent factor. This is the same ranking expression used in the previous tutorial, which includes code to retrieve the user latent factor based on the user_id sent by the query to Vespa.

    # Simpler ranking profile without
    # second-phase ranking
    rank-profile tensor {
      first-phase {
          expression {
              sum(query(user_item_cf) * attribute(user_item_cf))
          }
      }
    }

Since we want to evaluate the neural network model we have trained, we need to define where to find the model parameters (W1, W2, b1, b2). See the previous section for how to write the TensorFlow model parameters to the Vespa Tensor format.
    # We need to specify the type and the location
    # of the files storing tensor values for each
    # Variable in our TensorFlow model. In this case,
    # W_hidden, b_hidden, W_final, b_final
    constant W_hidden {
        file: constants/W_hidden.json
        type: tensor(input[20],hidden[40])
    }
    constant b_hidden {
        file: constants/b_hidden.json
        type: tensor(hidden[40])
    }
    constant W_final {
        file: constants/W_final.json
        type: tensor(hidden[40], final[1])
    }
    constant b_final {
        file: constants/b_final.json
        type: tensor(final[1])
    }

Now, we specify a second rank-profile called nn_tensor that will use the same first phase as the rank-profile tensor but will rerank the top 200 documents using the neural network model as the second phase. We refer to the Tensor Reference document for more information regarding the Tensor operations used in the code below.

    # rank profile with neural network model as
    # second phase
    rank-profile nn_tensor {
        # The input to the neural network is the
        # concatenation of the document and query vectors.
        macro nn_input() {
            expression: concat(attribute(user_item_cf), query(user_item_cf), input)
        }
        # Computes the hidden layer
        macro hidden_layer() {
            expression: relu(sum(nn_input * constant(W_hidden), input) + constant(b_hidden))
        }
        # Computes the output layer
        macro final_layer() {
            expression: sigmoid(sum(hidden_layer * constant(W_final), hidden) + constant(b_final))
        }
        # First-phase ranking:
        # Dot-product between user and document latent factors
        first-phase {
            expression: sum(query(user_item_cf) * attribute(user_item_cf))
        }
        # Second-phase ranking:
        # Neural network model based on the user and document latent factors
        second-phase {
            rerank-count: 200
            expression: sum(final_layer)
        }
    }
}

Offline evaluation

We will now query Vespa and obtain 100 blog post recommendations for each user_id in our test set. Below, we query Vespa using the tensor ranking function, which contains the simpler ranking expression involving the dot-product between user and document latent factors.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/cf-metric

We perform the same query routine below, but now using the ranking-profile nn_tensor, which reranks the top 200 documents using the neural network model.

pig -x local -f tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=vespa-hadoop.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param ENDPOINT=$(hostname):8080 \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=nn_tensor \
  -param OUTPUT=blog-job/cf-metric

The tutorial_compute_metric.pig script can be found in our repo. Comparing the recommendations obtained by those two ranking profiles with our test set, we see that by deploying a more complex and accurate model in the second phase ranking, we increased the number of relevant documents (documents read by the user) retrieved from 11948 to 12804 (more than a 7% increase), and those documents appeared higher up in the list of recommendations, as shown by the expected percentile ranking metric introduced in the Vespa tutorial pt.
2, which decreased from 37.1% to 34.5%.
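For reference, here is a hedged Python re-statement of the expected percentile ranking metric used above (following Hu et al. 2008): 0% means every relevant blog post sits at the very top of its user's recommendation list, while roughly 50% is what random ranking would give. The real computation is done by tutorial_compute_metric.pig; this simplified version only considers the relevant documents that were actually retrieved.

def expected_percentile_ranking(recommendations, relevant):
    """recommendations: {user_id: [post_id, ...]} ordered best-first.
    relevant: {user_id: set of post_ids the user actually read}."""
    total, count = 0.0, 0
    for user, ranked in recommendations.items():
        n = len(ranked)
        for position, post in enumerate(ranked):
            if post in relevant.get(user, set()):
                total += position / max(n - 1, 1) * 100.0   # percentile rank of this hit
                count += 1
    return total / count if count else float("nan")

recs = {"u1": ["a", "b", "c", "d"], "u2": ["c", "a", "d", "b"]}
read = {"u1": {"a"}, "u2": {"d"}}
print(expected_percentile_ranking(recs, read))   # prints roughly 33.3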

Blog recommendation with neural network models

December 15, 2017
How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study December 13, 2017
December 13, 2017
mikesefanov
Share

How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study

yahoodevelopers: By Murali Krishna Bachhu, Anurag Damle, and Utkarsh Shrivastava As engineers on the Yahoo Mail team at Oath, we pride ourselves on the things that matter most to developers: faster development cycles, more reliability, and better performance. Users don’t necessarily see these elements, but they certainly feel the difference they make when significant improvements are made. Recently, we were able to upgrade all three of these areas at scale by adopting webpack® as Yahoo Mail’s underlying module bundler, and you can do the same for your web application. What is webpack? webpack is an open source module bundler for modern JavaScript applications. When webpack processes your application, it recursively builds a dependency graph that includes every module your application needs. Then it packages all of those modules into a small number of bundles, often only one, to be loaded by the browser. webpack became our choice module bundler not only because it supports on-demand loading, multiple bundle generation, and has a relatively low runtime overhead, but also because it is better suited for web platforms and NodeJS apps and has great community support. Comparison of webpack to other open source bundlers How did we integrate webpack? Like any developer does when integrating a new module bundler, we started integrating webpack into Yahoo Mail by looking at its basic config file. We explored available default webpack plugins as well as third-party webpack plugins and then picked the plugins most suitable for our application. If we didn’t find a plugin that suited a specific need, we wrote the webpack plugin ourselves (e.g., We wrote a plugin to execute Atomic CSS scripts in the latest Yahoo Mail experience in order to decrease our overall CSS payload**). During the development process for Yahoo Mail, we needed a way to make sure webpack would continuously run in the background. To make this happen, we decided to use the task runner Grunt. Not only does Grunt keep the connection to webpack alive, but it also gives us the ability to pass different parameters to the webpack config file based on the given environment. Some examples of these parameters are source map options, enabling HMR, and uglification. Before deployment to production, we wanted to optimize the javascript bundles for size to make the Yahoo Mail experience faster. webpack provides good default support for this with the UglifyJS plugin. Although the default options are conservative, they give us the ability to configure the options. Once we modified the options to our specifications, we saved approximately 10KB. Code snippet showing the configuration options for the UglifyJS plugin Faster development cycles for developers While developing a new feature, engineers ideally want to see their code changes reflected on their web app instantaneously. This allows them to maintain their train of thought and eventually results in more productivity. Before we implemented webpack, it took us around 30 seconds to 1 minute for changes to reflect on our Yahoo Mail development environment. webpack helped us reduce the wait time to 5 seconds. More reliability Consumers love a reliable product, where all the features work seamlessly every time. Before we began using webpack, we were generating javascript bundles on demand or during run-time, which meant the product was more prone to exceptions or failures while fetching the javascript bundles. 
With webpack, we now generate all the bundles during build time, which means that all the bundles are available whenever consumers access Yahoo Mail. This results in significantly fewer exceptions and failures and a better experience overall. Better Performance We were able to attain a significant reduction of payload after adopting webpack. 1. Reduction of about 75 KB gzipped Javascript payload 2. 50% reduction on server-side render time 3. 10% improvement in Yahoo Mail’s launch performance metrics, as measured by render time above the fold (e.g., Time to load contents of an email). Below are some charts that demonstrate the payload size of Yahoo Mail before and after implementing webpack. Payload before using webpack (JavaScript Size = 741.41KB) Payload after switching to webpack (JavaScript size = 669.08KB) Conclusion Shifting to webpack has resulted in significant improvements. We saw a common build process go from 30 seconds to 5 seconds, large JavaScript bundle size reductions, and a halving in server-side rendering time. In addition to these benefits, our engineers have found the community support for webpack to have been impressive as well. webpack has made the development of Yahoo Mail more efficient and enhanced the product for users. We believe you can use it to achieve similar results for your web application as well. **Optimized CSS generation with Atomizer Before we implemented webpack into the development of Yahoo Mail, we looked into how we could decrease our CSS payload. To achieve this, we developed an in-house solution for writing modular and scoped CSS in React. Our solution is similar to the Atomizer library, and our CSS is written in JavaScript like the example below: Sample snippet of CSS written with Atomizer Every React component creates its own styles.js file with required style definitions. React-Atomic-CSS converts these files into unique class definitions. Our total CSS payload after implementing our solution equaled all the unique style definitions in our code, or only 83KB (21KB gzipped). During our migration to webpack, we created a custom plugin and loader to parse these files and extract the unique style definitions from all of our CSS files. Since this process is tied to bundling, only CSS files that are part of the dependency chain are included in the final CSS.
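As a conceptual illustration of why the final CSS payload ends up equal to the set of unique style definitions, here is a small Python sketch (not the actual React-Atomic-CSS implementation; the component names and declarations are made up): duplicate declarations coming from many per-component style files collapse into a single atomic class each.

# Hypothetical per-component style definitions, standing in for styles.js files.
component_styles = {
    "MessageList": {"display": "flex", "color": "#333", "padding": "8px"},
    "Toolbar":     {"display": "flex", "color": "#333", "margin": "4px"},
}

unique = {}   # declaration -> generated atomic class name
for styles in component_styles.values():
    for prop, value in styles.items():
        unique.setdefault(f"{prop}:{value}", f"c{len(unique)}")

# Only one rule per unique declaration, however many components reuse it.
css = "\n".join(f".{cls} {{ {decl} }}" for decl, cls in unique.items())
print(css)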

How to Make Your Web App More Reliable and Performant Using webpack: a Yahoo Mail Case Study

December 13, 2017
Vespa Meetup in Sunnyvale December 1, 2017
December 1, 2017
Share

Vespa Meetup in Sunnyvale

vespaengine: WHAT: Vespa meetup with various presentations from the Vespa team. Several Vespa developers from Norway are in Sunnyvale - use this opportunity to learn more about the open big data serving engine Vespa and meet the team behind it.
WHEN: Monday, December 4th, 6:00pm - 8:00pm PDT
WHERE: Oath/Yahoo Sunnyvale Campus, Building E, Classroom 9 & 10, 700 First Avenue, Sunnyvale, CA 94089
MANDATORY REGISTRATION: https://goo.gl/forms/7kK2vlaipgsSSSH42
Agenda
6.00 pm: Welcome & Intro
6.15 pm: Vespa tips and tricks
7.00 pm: Tensors in Vespa, intro and use cases
7.45 pm: Vespa future and roadmap
7.50 pm: Q&A
This meetup is a good arena for sharing experiences, getting good tips, getting inside details on Vespa, and discussing and influencing the roadmap, and it is a great opportunity for the Vespa team to meet our users. Hope to see many of you!

Vespa Meetup in Sunnyvale

December 1, 2017
Blog recommendation in Vespa December 1, 2017
December 1, 2017
Share

Blog recommendation in Vespa

Introduction

This post builds upon the previous blog search application and extends the basic search engine to include machine learned models to help us recommend blog posts to users who arrive at our application. Assume that once a user arrives, we obtain their user identification number, denoted here by user_id, and that we will send this information down to Vespa and expect to obtain a blog post recommendation list containing 100 blog posts tailored for that specific user.

Prerequisites:
- Install and build files - source code and build instructions for sbt and Spark are found at Vespa Tutorial pt. 2
- Install Pig and Hadoop
- Put trainPosts.json in $VESPA_SAMPLE_APPS, the directory with the clone of vespa sample apps
- Put vespa-hadoop.jar in $VESPA_SAMPLE_APPS
- Docker, as in the blog search tutorial

Collaborative Filtering

We will start our recommendation system by implementing the collaborative filtering algorithm for implicit feedback described in (Hu et al. 2008). The data is said to be implicit because the users did not explicitly rate each blog post they have read. Instead, they have “liked” blog posts they have likely enjoyed (positive feedback) but did not have the chance to “dislike” blog posts they did not enjoy (absence of negative feedback). Because of that, implicit feedback is said to be inherently noisy, and the fact that a user did not “like” a blog post might have many different reasons not related to negative feelings about that blog post. In terms of modeling, a big difference between explicit and implicit feedback datasets is that the ratings for explicit feedback are typically unknown for the majority of user-item pairs and are treated as missing values and ignored by the training algorithm. For an implicit dataset, we would assume a rating of zero in case the user has not liked a blog post. To encode the fact that a value of zero could come from different reasons, we will use the concept of confidence as introduced by (Hu et al. 2008), which causes the positive feedback to have a higher weight than negative feedback. Once we train the collaborative filtering model, we will have one vector representing a latent factor for each user and item contained in the training set. Those vectors will later be used in the Vespa ranking framework to make recommendations to a user based on the dot product between the user and document latent factors. An obvious problem with this approach is that new users and new documents will not have those latent factors available to them. This is what is called a cold start problem and will be addressed with content-based techniques described in future posts.

Evaluation metrics

The evaluation metric used by Kaggle for this challenge was the Mean Average Precision at 100 (MAP@100). However, since we do not have information about which blog posts the users did not like (that is, we have only positive feedback), and since we cannot observe user reactions to the recommendations we make (this is an offline evaluation, different from the usual A/B testing performed by companies that use recommendation systems), we offer a similar remark as the one included in (Hu et al. 2008) and prefer recall-oriented measures. Following (Hu et al. 2008) we will use the expected percentile ranking.

Evaluation Framework

Generate training and test sets

In order to evaluate the gains obtained by the recommendation system when we start to improve it with more accurate algorithms, we will split the dataset we have available into training and test sets.
The training set will contain document (blog post) and user action (likes) pairs, as well as any information available about the documents contained in the training set. There is no additional information about the users besides the blog posts they have liked. The test set will be formed by a series of documents available to be recommended and a set of users to whom we need to make recommendations. This list of test set documents constitutes the Vespa content pool, which is the set of documents stored in Vespa that are available to be served to users. The user actions will be hidden from the test set and used later to evaluate the recommendations made by Vespa. To create an application that more closely resembles the challenges faced by companies when building their recommendation systems, we decided to construct the training and test sets in such a way that:
- There will be blog posts that had been liked in the training set by a set of users and that had also been liked in the test set by another set of users, even though this information will be hidden in the test set. Those cases are interesting to evaluate whether the exploitation (as opposed to exploration) component of the system is working well. That is, whether we are able to identify high quality blog posts based on the available information during training and exploit this knowledge by recommending those high quality blog posts to another set of users who might like them as well.
- There will be blog posts in the test set that had never been seen in the training set. Those cases are interesting in order to evaluate how the system deals with the cold-start problem. Systems that are too biased towards exploitation will fail to recommend new and unexplored blog posts, leading to a feedback loop that will cause the system to focus on a small share of the available content.
A key challenge faced by recommender system designers is how to balance the exploitation/exploration components of their system, and our training/test set split outlined above will try to replicate this challenge in our application. Notice that this split is different from the approach taken by the Kaggle competition, where the blog posts available in the test set had never been seen in the training set, which removes the exploitation component of the equation.

The Spark job uses trainPosts.json and creates the folders blog-job/training_set_ids and blog-job/test_set_ids containing files with post_id and user_id pairs:

$ cd blog-recommendation; export SPARK_LOCAL_IP="127.0.0.1"
$ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task split_set --input_file ../trainPosts.json \
  --test_perc_stage1 0.05 --test_perc_stage2 0.20 --seed 123 \
  --output_path blog-job/training_and_test_indices

- test_perc_stage1: The percentage of the blog posts that will be located only in the test set (exploration component).
- test_perc_stage2: The percentage of the remaining (post_id, user_id) pairs that should be moved to the test set (exploitation component).
- seed: seed value used in order to replicate results if required.

Compute user and item latent factors

Use the complete training set to compute user and item latent factors. We will leave the discussion about tuning and performance improvement of the model used to the section about model tuning and offline evaluation.
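For intuition about what this step computes, here is a hedged PySpark sketch of an implicit-feedback factorization with rank 10; the tutorial's real implementation is the Scala BlogRecommendationApp invoked below, so the column names, the toy rows, and the alpha confidence weight are illustrative assumptions only.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("blog-cf-sketch").getOrCreate()

# Assume a DataFrame of (user_id, post_id) "like" events with an implicit
# rating of 1.0, mirroring the rows in training_set_ids.
likes = spark.createDataFrame(
    [(270, 20, 1.0), (270, 21, 1.0), (31, 20, 1.0)],
    ["user_id", "post_id", "rating"],
)

als = ALS(userCol="user_id", itemCol="post_id", ratingCol="rating",
          rank=10, maxIter=10, regParam=0.01,
          implicitPrefs=True, alpha=40.0)   # confidence weighting for implicit feedback
model = als.fit(likes)
model.userFactors.show(3)    # one 10-element latent factor per user
model.itemFactors.show(3)    # one 10-element latent factor per post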
Submit the Spark job to compute the user and item latent factors:

$ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task collaborative_filtering \
  --input_file blog-job/training_and_test_indices/training_set_ids \
  --rank 10 --numIterations 10 --lambda 0.01 \
  --output_path blog-job/user_item_cf

Verify the vectors for the latent factors for users and posts:

$ head -1 blog-job/user_item_cf/user_features/part-00000 | python -m json.tool
{
    "user_id": 270,
    "user_item_cf": {
        "user_item_cf:0": -1.750116e-05,
        "user_item_cf:1": 9.730623e-05,
        "user_item_cf:2": 8.515047e-05,
        "user_item_cf:3": 6.9297894e-05,
        "user_item_cf:4": 7.343942e-05,
        "user_item_cf:5": -0.00017635927,
        "user_item_cf:6": 5.7642872e-05,
        "user_item_cf:7": -6.6685796e-05,
        "user_item_cf:8": 8.5506894e-05,
        "user_item_cf:9": -1.7209566e-05
    }
}

$ head -1 blog-job/user_item_cf/product_features/part-00000 | python -m json.tool
{
    "post_id": 20,
    "user_item_cf": {
        "user_item_cf:0": 0.0019320602,
        "user_item_cf:1": -0.004728486,
        "user_item_cf:2": 0.0032499845,
        "user_item_cf:3": -0.006453364,
        "user_item_cf:4": 0.0015929453,
        "user_item_cf:5": -0.00420313,
        "user_item_cf:6": 0.009350027,
        "user_item_cf:7": -0.0015649397,
        "user_item_cf:8": 0.009262732,
        "user_item_cf:9": -0.0030964287
    }
}

At this point, the vectors with latent factors can be added to posts and users.

Add vectors to search definitions using tensors

Modern machine learning applications often make use of large, multidimensional feature spaces and perform complex operations on those features, such as in large logistic regression and deep learning models. It is therefore necessary to have an expressive framework to define and evaluate ranking expressions of such complexity at scale. Vespa comes with a Tensor framework, which unifies and generalizes scalar, vector and matrix operations, handles the sparseness inherent to most machine learning applications (most cases evaluated by the model lack values for most of the features), and allows models to be continuously updated. Additional information about the Tensor framework can be found in the tensor user guide. We want to have those latent factors available in a Tensor representation to be used during ranking by the Tensor framework.
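As a quick sanity check on the latent factors shown above, the first-phase ranking expression sum(query(user_item_cf) * attribute(user_item_cf)) used later is just a dot product, which can be reproduced in a few lines of Python; the file paths follow the Spark output above.

import json

def load_factor(line, id_field):
    record = json.loads(line)
    cf = record["user_item_cf"]
    return record[id_field], [cf[f"user_item_cf:{i}"] for i in range(len(cf))]

with open("blog-job/user_item_cf/user_features/part-00000") as f:
    user_id, u = load_factor(f.readline(), "user_id")
with open("blog-job/user_item_cf/product_features/part-00000") as f:
    post_id, d = load_factor(f.readline(), "post_id")

# Same value the first-phase rank expression would produce for this pair.
score = sum(ui * di for ui, di in zip(u, d))
print(f"first-phase score for user {user_id} and post {post_id}: {score:.6f}")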
A tensor field user_item_cf is added to blog_post.sd to hold the blog post latent factor:

field user_item_cf type tensor(user_item_cf[10]) {
    indexing: summary | attribute
    attribute: tensor(user_item_cf[10])
}
field has_user_item_cf type byte {
    indexing: summary | attribute
    attribute: fast-search
}

A new search definition user.sd defines a document type named user to hold information for users:

search user {
    document user {
        field user_id type string {
            indexing: summary | attribute
            attribute: fast-search
        }
        field has_read_items type array {
            indexing: summary | attribute
        }
        field user_item_cf type tensor(user_item_cf[10]) {
            indexing: summary | attribute
            attribute: tensor(user_item_cf[10])
        }
        field has_user_item_cf type byte {
            indexing: summary | attribute
            attribute: fast-search
        }
    }
}

Where:
- user_id: unique identifier for the user
- user_item_cf: tensor that will hold the user latent factor
- has_user_item_cf: flag to indicate that the user has a latent factor

Join and feed data

Build and deploy the application:

$ mvn install

Deploy the application (in the Docker container):

$ vespa-deploy prepare /vespa-sample-apps/blog-recommendation/target/application && \
  vespa-deploy activate

Wait for the app to activate (200 OK):

$ curl -s --head http://localhost:8080/ApplicationStatus

The code to join the latent factors in blog-job/user_item_cf into blog_post and user documents is implemented in tutorial_feed_content_and_tensor_vespa.pig. After joining in the new fields, a Vespa feed is generated and fed to Vespa directly from Pig:

$ pig -Dvespa.feed.defaultport=8080 -Dvespa.feed.random.startup.sleep.ms=0 \
  -x local \
  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \
  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \
  -param DATA_PATH=../trainPosts.json \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param BLOG_POST_FACTORS=blog-job/user_item_cf/product_features \
  -param USER_FACTORS=blog-job/user_item_cf/user_features \
  -param ENDPOINT=localhost

A successful data join and feed will output:

Input(s):
Successfully read 1196111 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/trainPosts.json"
Successfully read 341416 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids"
Successfully read 323727 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/product_features"
Successfully read 6290 records from: "file:///Users/kraune/github/vespa-engine/sample-apps/blog-recommendation/blog-job/user_item_cf/user_features"

Output(s):
Successfully stored 286237 records in: "localhost"

Sample blog post and user:
- localhost:8080/document/v1/blog-recommendation/user/docid/22702951
- localhost:8080/document/v1/blog-recommendation/blog_post/docid/1838008

Ranking

Set up a rank function to return the best matching blog posts given some user latent factor. Rank the documents using a dot product between the user and blog post latent factors, i.e.
the query tensor and blog post tensor dot product (sum of the product of the two tensors) - from blog_post.sd:

rank-profile tensor {
    first-phase {
        expression {
            sum(query(user_item_cf) * attribute(user_item_cf))
        }
    }
}

Configure the ranking framework to expect that query(user_item_cf) is a tensor, and that it is compatible with the attribute, in a query profile type - see search/query-profiles/types/root.xml and search/query-profiles/default.xml. This configures a ranking feature named query(user_item_cf) with type tensor(user_item_cf[10]), which defines it as an indexed tensor with 10 elements. This is the same type as the attribute, hence the dot product can be computed.

Query Vespa with a tensor

Test recommendations by sending a tensor with latent factors:

localhost:8080/search/?yql=select%20*%20from%20sources%20blog_post%20where%20has_user_item_cf%20=%201;&ranking=tensor&ranking.features.query(user_item_cf)=%7B%7Buser_item_cf%3A0%7D%3A0.1%2C%7Buser_item_cf%3A1%7D%3A0.1%2C%7Buser_item_cf%3A2%7D%3A0.1%2C%7Buser_item_cf%3A3%7D%3A0.1%2C%7Buser_item_cf%3A4%7D%3A0.1%2C%7Buser_item_cf%3A5%7D%3A0.1%2C%7Buser_item_cf%3A6%7D%3A0.1%2C%7Buser_item_cf%3A7%7D%3A0.1%2C%7Buser_item_cf%3A8%7D%3A0.1%2C%7Buser_item_cf%3A9%7D%3A0.1%7D

The query string, decomposed:
- yql=select * from sources blog_post where has_user_item_cf = 1 - this selects all documents of type blog_post which have a latent factor tensor
- restrict=blog_post - search only in blog_post documents
- ranking=tensor - use the rank-profile tensor in blog_post.sd.
- ranking.features.query(user_item_cf) - send the tensor as user_item_cf. As this tensor is defined in the query-profile-type, the ranking framework knows its type (i.e. dimensions) and is able to do a dot product with the attribute of the same type.

The tensor before URL-encoding:

{
  {user_item_cf:0}:0.1,
  {user_item_cf:1}:0.1,
  {user_item_cf:2}:0.1,
  {user_item_cf:3}:0.1,
  {user_item_cf:4}:0.1,
  {user_item_cf:5}:0.1,
  {user_item_cf:6}:0.1,
  {user_item_cf:7}:0.1,
  {user_item_cf:8}:0.1,
  {user_item_cf:9}:0.1
}

Query Vespa with user id

The next step is to query Vespa by user id, look up the user profile for the user, get the tensor from it, and recommend documents based on this tensor (like the query in the previous section). The user profiles are fed to Vespa in the user_item_cf field of the user document type. In short, set up a searcher to retrieve the user profile by user id - then run the query. When the Vespa Container receives a request, it will create a Query representing it and execute a configured list of such Searcher components, called a search chain. The Query object contains all the information needed to create a result for the request, while the Result encapsulates all the data generated from a Query.
The Execution object keeps track of the call state for an execution of the searchers of a search chain:

package com.yahoo.example;

import com.yahoo.data.access.Inspectable;
import com.yahoo.data.access.Inspector;
import com.yahoo.prelude.query.IntItem;
import com.yahoo.prelude.query.NotItem;
import com.yahoo.prelude.query.WordItem;
import com.yahoo.processing.request.CompoundName;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.querytransform.QueryTreeUtil;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;
import com.yahoo.search.searchchain.SearchChain;
import com.yahoo.tensor.Tensor;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class UserProfileSearcher extends Searcher {

    public Result search(Query query, Execution execution) {
        // Get tensor and read items from user profile
        Object userIdProperty = query.properties().get("user_id");
        if (userIdProperty != null) {
            Hit userProfile = retrieveUserProfile(userIdProperty.toString(), execution);
            if (userProfile != null) {
                addUserProfileTensorToQuery(query, userProfile);

                NotItem notItem = new NotItem();
                notItem.addItem(new IntItem(1, "has_user_item_cf"));
                for (String item : getReadItems(userProfile.getField("has_read_items"))) {
                    notItem.addItem(new WordItem(item, "post_id"));
                }
                QueryTreeUtil.andQueryItemWithRoot(query, notItem);
            }
        }

        // Restrict to search in blog_posts
        query.getModel().setRestrict("blog_post");

        // Rank blog posts using tensor rank profile
        if (query.properties().get("ranking") == null) {
            query.properties().set(new CompoundName("ranking"), "tensor");
        }

        return execution.search(query);
    }

    private Hit retrieveUserProfile(String userId, Execution execution) {
        Query query = new Query();
        query.getModel().setRestrict("user");
        query.getModel().getQueryTree().setRoot(new WordItem(userId, "user_id"));
        query.setHits(1);

        SearchChain vespaChain = execution.searchChainRegistry().getComponent("vespa");
        Result result = new Execution(vespaChain, execution.context()).search(query);

        execution.fill(result); // This is needed to get the actual summary data

        Iterator<Hit> hiterator = result.hits().deepIterator();
        return hiterator.hasNext() ? hiterator.next() : null;
    }

    private void addUserProfileTensorToQuery(Query query, Hit userProfile) {
        Object userItemCf = userProfile.getField("user_item_cf");
        if (userItemCf != null) {
            if (userItemCf instanceof Tensor) {
                query.getRanking().getFeatures().put("query(user_item_cf)", (Tensor) userItemCf);
            }
            else {
                query.getRanking().getFeatures().put("query(user_item_cf)", Tensor.from(userItemCf.toString()));
            }
        }
    }

    private List<String> getReadItems(Object readItems) {
        List<String> items = new ArrayList<>();
        if (readItems instanceof Inspectable) {
            for (Inspector entry : ((Inspectable) readItems).inspect().entries()) {
                items.add(entry.asString());
            }
        }
        return items;
    }
}

The searcher is configured in services.xml. Deploy, then query a user to get blog recommendations: localhost:8080/search/?user_id=34030991&searchChain=user.
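For completeness, a hedged client-side sketch of the query above: it calls the search API with user_id and searchChain=user and prints the ids and relevance scores of the returned hits. The endpoint and parameters follow the tutorial's examples; the response traversal assumes Vespa's default JSON result layout.

import requests

params = {"user_id": "34030991", "searchChain": "user", "hits": 10}
response = requests.get("http://localhost:8080/search/", params=params)
response.raise_for_status()

# Walk the default result structure: root -> children -> one object per hit.
for hit in response.json().get("root", {}).get("children", []):
    print(hit.get("id"), hit.get("relevance"))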
To refine recommendations, add query terms:

localhost:8080/search/?user_id=34030991&searchChain=user&yql=select%20*%20from%20sources%20blog_post%20where%20content%20contains%20%22pegasus%22;

Model tuning and offline evaluation

We will now optimize the latent factors using the training set instead of manually picking hyperparameter values as was done in Compute user and item latent factors:

$ spark-submit --class "com.yahoo.example.blog.BlogRecommendationApp" \
  --master local[4] ../blog-tutorial-shared/target/scala-*/blog-support*.jar \
  --task collaborative_filtering_cv \
  --input_file blog-job/training_and_test_indices/training_set_ids \
  --numIterations 10 --output_path blog-job/user_item_cf_cv

Feed the newly computed latent factors to Vespa as before. Note that we need to update the tensor specification in the search definition in case the size of the latent vectors changes. We used size 10 (rank = 10) in the Compute user and item latent factors section, but our cross-validation algorithm above tries different values for rank (10, 50, 100).

$ pig -Dvespa.feed.defaultport=8080 -Dvespa.feed.random.startup.sleep.ms=0 \
  -x local \
  -f ../blog-tutorial-shared/src/main/pig/tutorial_feed_content_and_tensor_vespa.pig \
  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \
  -param DATA_PATH=../trainPosts.json \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param BLOG_POST_FACTORS=blog-job/user_item_cf_cv/product_features \
  -param USER_FACTORS=blog-job/user_item_cf_cv/user_features \
  -param ENDPOINT=localhost

Run the following script, which uses the Java UDF VespaQuery from vespa-hadoop to query Vespa for a specific number of blog post recommendations for each user_id in our test set. With the list of recommendations for each user, we can then compute the expected percentile ranking as described in the section Evaluation metrics:

$ pig \
  -x local \
  -f ../blog-tutorial-shared/src/main/pig/tutorial_compute_metric.pig \
  -param VESPA_HADOOP_JAR=../vespa-hadoop*.jar \
  -param TEST_INDICES=blog-job/training_and_test_indices/testing_set_ids \
  -param BLOG_POST_FACTORS=blog-job/user_item_cf_cv/product_features \
  -param USER_FACTORS=blog-job/user_item_cf_cv/user_features \
  -param NUMBER_RECOMMENDATIONS=100 \
  -param RANKING_NAME=tensor \
  -param OUTPUT=blog-job/metric \
  -param ENDPOINT=localhost:8080

At completion, observe:

Input(s):
Successfully read 341416 records from: "file:/sample-apps/blog-recommendation/blog-job/training_and_test_indices/testing_set_ids"

Output(s):
Successfully stored 5174 records in: "file:/sample-apps/blog-recommendation/blog-job/metric"

In the next post we will improve accuracy using a simple neural network.

Vespa and Hadoop

Vespa was designed to keep low-latency performance even at Yahoo-like web scale. This means supporting a large number of concurrent requests as well as a very large number of documents. In the previous tutorial we used a data set that was approximately 5 GB. Data sets of this size do not require a distributed file system for data manipulation. However, we assume that most Vespa users would like at some point to scale their applications up. Therefore, this tutorial uses tools such as Apache Hadoop, Apache Pig and Apache Spark. These can be run locally on a laptop, like in this tutorial.
In case you would like to use HDFS (Hadoop Distributed File System) for storing the data, it is just a matter of uploading it to HDFS with the following command:

$ hadoop fs -put trainPosts.json blog-app/trainPosts.json

If you go with this approach, you need to replace the local file paths with the equivalent HDFS file paths in this tutorial. Vespa has a set of tools to facilitate the interaction between Vespa and the Hadoop ecosystem. These can also be used locally. A Pig script example of feeding to Vespa is as simple as:

REGISTER vespa-hadoop.jar
DEFINE VespaStorage com.yahoo.vespa.hadoop.pig.VespaStorage();
A = LOAD '' [USING ] [AS ];
-- apply any transformations
STORE A INTO '$ENDPOINT' USING VespaStorage();

Use Pig to feed a file into Vespa:

$ pig -x local -f feed.pig -p ENDPOINT=endpoint-1,endpoint-2

Here, the -x local option is added to specify that this script is run locally, and will not attempt to retrieve scripts and data from HDFS. You need both the Pig and Hadoop libraries installed on your machine to run this locally, but you don’t need to install and start a running instance of Hadoop. More examples of feeding to Vespa from Pig are found in the sample apps.
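For quick experiments it can also be handy to feed a single document over HTTP instead of going through Pig. The sketch below is hedged: the document/v1 path mirrors the sample URLs shown earlier, the field names follow the tutorial's search definitions, and the tensor cell layout is an assumption to verify against the Vespa documentation.

import json

import requests

# Assumed field layout for a blog_post document with its latent factor tensor.
doc = {
    "fields": {
        "post_id": "1838008",
        "has_user_item_cf": 1,
        "user_item_cf": {
            "cells": [{"address": {"user_item_cf": str(i)}, "value": 0.1} for i in range(10)]
        },
    }
}

url = "http://localhost:8080/document/v1/blog-recommendation/blog_post/docid/1838008"
response = requests.post(url, data=json.dumps(doc), headers={"Content-Type": "application/json"})
print(response.status_code, response.json())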

Blog recommendation in Vespa

December 1, 2017
Yahoo Mail’s New Tech Stack, Built for Performance and Reliability June 27, 2017
June 27, 2017
mikesefanov
Share

Yahoo Mail’s New Tech Stack, Built for Performance and Reliability

By Suhas Sadanandan, Director of Engineering  When it comes to performance and reliability, there is perhaps no application where this matters more than with email. Today, we announced a new Yahoo Mail experience for desktop based on a completely rewritten tech stack that embodies these fundamental considerations and more. We built the new Yahoo Mail experience using a best-in-class front-end tech stack with open source technologies including React, Redux, Node.js, react-intl (open-sourced by Yahoo), and others. A high-level architectural diagram of our stack is below. New Yahoo Mail Tech Stack In building our new tech stack, we made use of the most modern tools available in the industry to come up with the best experience for our users by optimizing the following fundamentals: Performance A key feature of the new Yahoo Mail architecture is blazing-fast initial loading (aka, launch). We introduced new network routing which sends users to their nearest geo-located email servers (proximity-based routing). This has resulted in a significant reduction in time to first byte and should be immediately noticeable to our international users in particular. We now do server-side rendering to allow our users to see their mail sooner. This change will be immediately noticeable to our low-bandwidth users. Our application is isomorphic, meaning that the same code runs on the server (using Node.js) and the client. Prior versions of Yahoo Mail had programming logic duplicated on the server and the client because we used PHP on the server and JavaScript on the client.    Using efficient bundling strategies (JavaScript code is separated into application, vendor, and lazy loaded bundles) and pushing only the changed bundles during production pushes, we keep the cache hit ratio high. By using react-atomic-css, our homegrown solution for writing modular and scoped CSS in React, we get much better CSS reuse.   In prior versions of Yahoo Mail, the need to run various experiments in parallel resulted in additional branching and bloating of our JavaScript and CSS code. While rewriting all of our code, we solved this issue using Mendel, our homegrown solution for bucket testing isomorphic web apps, which we have open sourced.   Rather than using custom libraries, we use native HTML5 APIs and ES6 heavily and use PolyesterJS, our homegrown polyfill solution, to fill the gaps. These factors have further helped us to keep payload size minimal. With all the above optimizations, we have been able to reduce our JavaScript and CSS footprint by approximately 50% compared to the previous desktop version of Yahoo Mail, helping us achieve a blazing-fast launch. In addition to initial launch improvements, key features like search and message read (when a user opens an email to read it) have also benefited from the above optimizations and are considerably faster in the latest version of Yahoo Mail. We also significantly reduced the memory consumed by Yahoo Mail on the browser. This is especially noticeable during a long running session. Reliability With this new version of Yahoo Mail, we have a 99.99% success rate on core flows: launch, message read, compose, search, and actions that affect messages. Accomplishing this over several billion user actions a day is a significant feat. Client-side errors (JavaScript exceptions) are reduced significantly when compared to prior Yahoo Mail versions. Product agility and launch velocity We focused on independently deployable components. 
As part of the re-architecture of Yahoo Mail, we invested in a robust continuous integration and delivery flow. Our new pipeline allows for daily (or more) pushes to all Mail users, and we push only the bundles that are modified, which keeps the cache hit ratio high. Developer effectiveness and satisfaction In developing our tech stack for the new Yahoo Mail experience, we heavily leveraged open source technologies, which allowed us to ensure a shorter learning curve for new engineers. We were able to implement a consistent and intuitive onboarding program for 30+ developers and are now using our program for all new hires. During the development process, we emphasise predictable flows and easy debugging. Accessibility The accessibility of this new version of Yahoo Mail is state of the art and delivers outstanding usability (efficiency) in addition to accessibility. It features six enhanced visual themes that can provide accommodation for people with low vision and has been optimized for use with Assistive Technology including alternate input devices, magnifiers, and popular screen readers such as NVDA and VoiceOver. These features have been rigorously evaluated and incorporate feedback from users with disabilities. It sets a new standard for the accessibility of web-based mail and is our most-accessible Mail experience yet. Open source  We have open sourced some key components of our new Mail stack, like Mendel, our solution for bucket testing isomorphic web applications. We invite the community to use and build upon our code. Going forward, we plan on also open sourcing additional components like react-atomic-css, our solution for writing modular and scoped CSS in React, and lazy-component, our solution for on-demand loading of resources. Many of our company’s best technical minds came together to write a brand new tech stack and enable a delightful new Yahoo Mail experience for our users. We encourage our users and engineering peers in the industry to test the limits of our application, and to provide feedback by clicking on the Give Feedback call out in the lower left corner of the new version of Yahoo Mail.

Yahoo Mail’s New Tech Stack, Built for Performance and Reliability

June 27, 2017
Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline June 27, 2017
June 27, 2017
mikesefanov
Share

Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline

By Mohit Goenka, Senior Engineering Manager Building the technology powering the best consumer email inbox in the world is no easy task. When you start on such a journey, it is important to consider how to deliver such an experience to the users. After all, any consumer feature we build can only make a difference after it is delivered to everyone via the tech pipeline.  As we began building out the new version of Yahoo Mail, we wanted to ensure that our internal developer productivity would not be hindered by how our pipelines work. Keeping this in mind, we identified the following principles as most important while designing the delivery pipeline for the new Yahoo Mail experience:  - Product updates are pushed at regular intervals - Releases are stable - Builds are not blocked by irrational test failures - Developers are notified of code pushes - Hotfixes - Rollbacks - Heartbeat pushes  Product updates are pushed at regular intervals  We ensure that our engineers can push any code changes to all Mail users everyday, with the ability to push multiple times a day, if necessary or desired. This is possible because of the time we spent building a solid testing infrastructure, which continues to evolve as we scale to new users and add new features to the product. Every one of our builds runs 10,000+ unit tests and 5,000+ integration tests on various combinations of operating systems and browsers. It is important to push product updates regularly as it allows all our users to get the best Mail experience possible.  Releases are stable  Every code release starts with the company’s internal audience first, where all our employees get to try out the latest changes before they go out to production. This begins with our alpha and beta environments that our Mail engineers use by default. Our build then goes out to the canary environment, which is a small subset of production users, before making it to all users. This gives us the ability to analyze quality metrics on internal and canary servers before rolling the build out to 100% of users in production. Once we go through this process, the code pushed to all our users is thoroughly baked and tested.  Builds are not blocked by irrational test failures  Running tests using web drivers on multiple browsers, as is standard when testing frontend code, comes with the problem of tests irrationally failing. As part the Yahoo Mail continuous delivery pipeline, we employ various novel strategies to recover from such failures. One such strategy is recording the data related to failed tests in the first pass of a build, and then rerunning only the failed tests in the subsequent passes. This is achieved by creating a metadata file that stores all our build-related information. As part of this process, a new bundle is created with a new set of code changes. Once a bundle is created with build metadata information, the same build job can be rerun multiple times such that subsequent reruns would only run the failing tests. This significantly improves rerun times and eliminates the chances of build detentions introduced by irrational test failures. The recorded test information is analyzed independently to understand the pattern of failing tests. This helps us in improving the stability of those intermittently failing tests.  Developers are notified of code pushes  Our build and deployment pipelines collect data related to all the authors contributing to any release through code commits or by merging various pull requests. 
This enables the build pipeline to send out email notifications to all our Mail developers as their code flows through each environment in our build pipeline (alpha, beta, canary, and production). With this ability, developers are well aware of where their code is in the pipeline and can test their changes as needed.  Hotfixes  We have also created a pipeline to deploy major code fixes directly to production. This is needed even after the existence of tens of thousands of tests and multitudes of checks. Every now and then, a bug may make its way into production. For such instances, we have hotfixes that are very useful. These are code patches that we quickly deploy on top of production code to address critical issues impacting large sets of users.  Rollbacks  If we find any issues in production, we do our best to minimize the impact on users by swiftly utilizing rollbacks, ensuring there is zero to minimal impact time. In order to do rollbacks, we maintain lists of all the versions pushed to production along with their release bundles and change logs. If needed, we pick the stable version that was previously pushed to production and deploy it directly on all the machines running our production instance.  Heartbeat pushes As part of our continuous delivery efforts, we have also developed a concept we call heartbeat pushes. Heartbeat pushes are notifications we send users to refresh their browsers when we issue important builds that they should immediately adopt. These can include bug fixes, product updates, or new features. Heartbeat allows us to dynamically update the latest version of Yahoo Mail when we see that a user’s current version needs to be updated. Yahoo Mail Continuous Delivery Flow In building the new Yahoo Mail experience, we knew that we needed to revamp from the ground up, starting with our continuous integration and delivery pipeline. The guiding principles of our new, forward-thinking infrastructure allow us to deliver new features and code fixes at a very high launch velocity and ensure that our users are always getting the latest and greatest Yahoo Mail experience.
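As a conceptual illustration of the rerun strategy described above (not the actual Yahoo Mail pipeline), the sketch below records failing tests in a build metadata file on the first pass and re-runs only those tests on subsequent passes of the same build job; run-tests.sh and its --only flag are hypothetical stand-ins for a real test runner.

import json
import pathlib
import subprocess

METADATA = pathlib.Path("build-metadata.json")

def run_tests(selection=None):
    # Hypothetical test command; "--only" stands in for a runner-specific flag.
    cmd = ["./run-tests.sh"] + (["--only", ",".join(selection)] if selection else [])
    result = subprocess.run(cmd, capture_output=True, text=True)
    failed = [line for line in result.stdout.splitlines() if line.startswith("FAIL ")]
    return [line.split(" ", 1)[1] for line in failed]

if METADATA.exists():                      # subsequent pass: rerun only recorded failures
    failing = json.loads(METADATA.read_text())["failing_tests"]
    still_failing = run_tests(failing)
else:                                      # first pass: run the full suite
    still_failing = run_tests()

METADATA.write_text(json.dumps({"failing_tests": still_failing}, indent=2))
print("still failing:", still_failing)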

Speed and Stability: Yahoo Mail’s Forward-Thinking Continuous Integration and Delivery Pipeline

June 27, 2017
Open Sourcing Bullet, Yahoo’s Forward-Looking Query Engine for Streaming Data June 15, 2017
June 15, 2017
mikesefanov
Share

Open Sourcing Bullet, Yahoo’s Forward-Looking Query Engine for Streaming Data

By Michael Natkovich, Akshai Sarma, Nathan Speidel, Marcus Svedman, and Cat Utah Big Data is no longer just Apache server logs. Nowadays, the data may be user engagement data, performance metrics, IoT (Internet of Things) data, or something else completely atypical. Regardless of the size of the data, or the type of querying patterns on it (exploratory, ad-hoc, periodic, long-term, etc.), everyone wants queries to be as fast as possible and cheap to run in terms of resources. Data can be broadly split into two kinds: the streaming (generally real-time) kind or the batched-up-over-a-time-interval (e.g., hourly or daily) kind. The batch version is typically easier to query since it is stored somewhere like a data warehouse that has nice SQL-like interfaces or an easy-to-use UI provided by tools such as Tableau, Looker, or Superset. Running arbitrary queries on streaming data quickly and cheaply, though, is generally much harder… until now. Today, we are pleased to share our newly open sourced, forward-looking general purpose query engine, called Bullet, with the community on GitHub. With Bullet, you can:
- Perform powerful and nested filtering
- Fetch raw data records
- Aggregate data using Group Bys (Sum, Count, Average, etc.), Count Distincts, and Top Ks
- Get distributions of fields, like Percentiles or Frequency histograms
One of the key differences between how Bullet queries data and the standard querying paradigm is that Bullet does not store any data. In most other systems where you have a persistence layer (including in-memory storage), you are doing a look-back when you query the layer. Instead, Bullet operates on data flowing through the system after the query is started – it’s a look-forward system that doesn’t need persistence. On a real-time data stream, this means that Bullet is querying data after the query is submitted. This also means that Bullet does not query any data that has already passed through the stream. The fact that Bullet does not rely on a persistence layer is exactly what makes it extremely lightweight and cheap to run. To see why this is better for the kinds of use cases Bullet is meant for – such as quickly looking at some metric, checking some assumption, iterating on a query, checking the status of something right now, etc. – consider the following: if you had 1,000 queries in a traditional query system that operated on the same data, that system would most likely scan the data 1,000 times. By the very virtue of it being forward looking, 1,000 queries in Bullet scan the data only once because the arrival of the query determines and fixes the data that it will see. Essentially, the data is coming to the queries instead of the queries being farmed out to where the data is. When the conditions of the query are satisfied (usually a time window or a number of events), the query terminates and returns you the result.
A Brief Architecture Overview
High Level Bullet Architecture
The Bullet architecture is multi-tenant, can scale linearly for more queries and/or more data, and has been tested to handle 700+ simultaneous queries on a data stream that had up to 1.5 million records per second, or 5-6 GB/s. Bullet is currently implemented on top of Storm and can be extended to support other stream processing engines as well, like Spark Streaming or Flink. Bullet is pluggable, so you can plug in any source of data that can be read in Storm by implementing a simple data container interface to let Bullet work with it.
The UI, web service, and the backend layers constitute your standard three-tier architecture. The Bullet backend can be split into three main subsystems: 1. Request Processor – receives queries, adds metadata, and sends it to the rest of the system 2. Data Processor – reads data from an input stream, converts it to a unified data format, and matches it against queries 3. Combiner – combines results for different queries, performs final aggregations, and returns results  The web service can be deployed on any servlet container, like Jetty. The UI is a Node-based Ember application that runs in the client browser. Our full documentation contains all the details on exactly how we perform computationally-intractable queries like Count Distincts on fields with cardinality in the millions, etc. (DataSketches).  Usage at Yahoo  An instance of Bullet is currently running at Yahoo in production against a small subset of Yahoo’s user engagement data stream. This data is roughly 100,000 records per second and is about 130 MB/s compressed. Bullet queries this with about 100 CPU Virtual Cores and 120 GB of RAM. This fits on less than 2 of our (64 Virtual Cores, 256 GB RAM each) test Storm cluster machines.  One of the most popular use cases at Yahoo is to use Bullet to manually validate the instrumentation of an app or web application. Instrumentation produces user engagement data like clicks, views, swipes, etc. Since this data powers everything we do from analytics to personalization to targeting, it is absolutely critical that the data is correct. The usage pattern is generally to:  1. Submit a Bullet query to obtain data associated with your mobile device or browser (filter on a cookie value or mobile device ID) 2. Open and use the application to generate the data while the Bullet query is running 3. Go back to Bullet and inspect the data  In addition, Bullet is also used programmatically in continuous delivery pipelines for functional testing instrumentation on product releases. Product usage is simulated, then data is generated and validated in seconds using Bullet. Bullet is orders of magnitude faster to use for this kind of validation and for general data exploration use cases, as opposed to waiting for the data to be available in Hive or other systems. The Bullet UI supports pivot tables and a multitude of charting options that may speed up analysis further compared to other querying options.  We also use Bullet to do a bunch of other interesting things, including instances where we dynamically compute cardinalities (using a Count Distinct Bullet query) of fields as a check to protect systems that can’t support extremely high cardinalities for fields like Druid.  What you do with Bullet is entirely determined by the data you put it on. If you put it on data that is essentially some set of performance metrics (data center statistics for example), you could be running a lot of queries that find the 95th and 99th percentile of a metric. If you put it on user engagement data, you could be validating instrumentation and mostly looking at raw data.  We hope you will find Bullet interesting and tell us how you use it. If you find something you want to change, improve, or fix, your contributions and ideas are always welcome! You can contact us here.  Helpful Links  - Quick Start - UI Querying Demo - Full Documentation - GitHub Links - DataSketches
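To illustrate the look-forward model described above (this is a conceptual sketch, not Bullet's actual API): a query only sees records that arrive after it is submitted and terminates once its time window or record limit is reached, so one pass over the stream can serve every registered query.

import time

class LookForwardQuery:
    def __init__(self, predicate, duration_s=5.0, max_records=100):
        self.predicate = predicate
        self.deadline = time.time() + duration_s   # time window fixed at submission
        self.max_records = max_records
        self.results = []

    def offer(self, record):
        """Called for each record flowing through the stream after submission."""
        if not self.done() and self.predicate(record):
            self.results.append(record)

    def done(self):
        return time.time() >= self.deadline or len(self.results) >= self.max_records

# One pass over the stream serves every registered query, so 1,000 concurrent
# queries still read each record only once.
queries = [LookForwardQuery(lambda r: r["device_id"] == "abc", max_records=3)]
stream = ({"device_id": "abc" if i % 2 else "xyz", "event": i} for i in range(10))
for record in stream:
    for q in queries:
        q.offer(record)
print(queries[0].results)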

Open Sourcing Bullet, Yahoo’s Forward-Looking Query Engine for Streaming Data

June 15, 2017
HBase Goes Fast and Lean with the Accordion Algorithm June 12, 2017
June 12, 2017
Share

HBase Goes Fast and Lean with the Accordion Algorithm

By Edward Bortnikov, Anastasia Braginsky, and Eshcar Hillel

Modern products powered by NoSQL key-value (KV-)storage technologies exhibit ever-increasing performance expectations. Ideally, NoSQL applications would like to enjoy the speed of in-memory databases without giving up on reliable persistent storage guarantees. Our Scalable Systems research team has implemented a new algorithm, named Accordion, that takes a significant step toward this goal; it is part of the forthcoming release of Apache HBase 2.0.

HBase, a distributed KV-store for Hadoop, is used by many companies every day to scale products seamlessly with huge volumes of data and deliver real-time performance. At Yahoo, HBase powers a variety of products, including Yahoo Mail, Yahoo Search, Flurry Analytics, and more. Accordion is a complete re-write of core parts of the HBase server technology, named RegionServer. It improves the server scalability via a better use of RAM. Namely, it accommodates more data in memory and writes to disk less frequently. This manifests in a number of desirable phenomena. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and fewer are stalled by disk I/O. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.

We stress-tested Accordion-enabled HBase under a variety of workloads. Our experiments exercised different blends of reads and writes, as well as different key distributions (heavy-tailed versus uniform). We witnessed performance improvements across the board. Namely, we saw write throughput increases of 20% to 40% (depending on the workload), tail read latency reductions of up to 10%, disk write reductions of up to 30%, and also some modest Java garbage collection overhead reduction. The figures below further zoom into Accordion’s performance gains, compared to the legacy algorithm.

Figure 1. Accordion’s write throughput compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf (heavy-tailed) and Uniform primary key distributions.
Figure 2. Accordion’s read latency quantiles compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.
Figure 3. Accordion’s disk I/O compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.

Accordion is inspired by the Log-Structured-Merge (LSM) tree design pattern that governs HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts multi-versioned concurrency control – that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, scans the HFile data in BlockCache, seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called compaction, removes the redundant cells and creates larger files. LSM trees deliver superior write performance by transforming random application-level I/O to sequential disk I/O.
However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed in an age when RAM was in very short supply, and therefore the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore size managed by a RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization.

Accordion reapplies the LSM principle to the MemStore in order to eliminate redundancies and other overhead while the data is still in RAM. The MemStore memory image is therefore “breathing” (periodically expanding and contracting), much like the bellows of an accordion. This work pattern decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint.

With fewer flushes, write operations are stalled less frequently by MemStore overflow, and as a result, write performance is improved. Less data on disk also implies less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, having fewer disk writes also means having less compaction happening in the background, i.e., fewer cycles are stolen from productive (read and write) work. All in all, the effect of in-memory compaction can be thought of as a catalyst that enables the system to move faster as a whole.

Accordion currently provides two levels of in-memory compaction: basic and eager. The former applies generic optimizations that are good for all data update patterns. The latter is most useful for applications with high data churn, like producer-consumer queues, shopping carts, shared counters, etc. All these use cases feature frequent updates of the same keys, which generate multiple redundant versions that the algorithm takes advantage of to provide more value. Future implementations may tune the optimal compaction policy automatically.

Accordion replaces the default MemStore implementation in the production HBase code. Contributing its code to production HBase could not have happened without intensive work with the open source Hadoop community, with contributors stretched across companies, countries, and continents. The project took almost two years to complete, from inception to delivery.

Accordion will become generally available in the upcoming HBase 2.0 release. We can’t wait to see it power existing and future products at Yahoo and elsewhere.
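For readers who want a feel for what "compacting in memory before flushing" means, here is a small conceptual sketch in Python. It only illustrates the basic-versus-eager idea described above, with invented thresholds; real MemStore segments are concurrent skip-lists and flattened cell arrays inside the RegionServer, not Python lists.

    FLUSH_THRESHOLD = 8      # flush to "disk" once this many cells are in memory
    ACTIVE_LIMIT = 4         # compact the active segment when it reaches this size

    class CompactingMemStore:
        def __init__(self, eager=False):
            self.eager = eager
            self.active = []         # mutable segment: list of (key, version, value)
            self.pipeline = []       # immutable in-memory segments awaiting flush
            self.version = 0

        def put(self, key, value):
            self.version += 1
            self.active.append((key, self.version, value))
            if len(self.active) >= ACTIVE_LIMIT:
                self._in_memory_compact()
            if self._size() >= FLUSH_THRESHOLD:
                self.flush()

        def _in_memory_compact(self):
            segment = self.active
            if self.eager:
                # Eager mode: drop redundant versions, keep only the latest per key.
                latest = {}
                for key, ver, value in segment:
                    if key not in latest or ver > latest[key][1]:
                        latest[key] = (key, ver, value)
                segment = sorted(latest.values())
            self.pipeline.append(tuple(segment))   # becomes an immutable segment
            self.active = []

        def _size(self):
            return len(self.active) + sum(len(s) for s in self.pipeline)

        def flush(self):
            # In HBase this would write an HFile; here we just report and reset.
            print("flushing", self._size(), "cells to disk")
            self.active, self.pipeline = [], []

With high-churn keys (counters, shopping carts), the eager variant retains far fewer cells in memory, so it crosses the flush threshold much less often, which is the effect the post describes.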

HBase Goes Fast and Lean with the Accordion Algorithm

June 12, 2017
Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15) May 23, 2017
May 23, 2017
Share

Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

We’re excited to co-host the 10th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 13 – 15 at the San Jose Convention Center. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the DataWorks Summit. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event: - Familiarize yourself with the cutting edge in Apache project developments from the committers - Learn from your peers and industry experts about innovative and real-world use cases, development and administration tips and tricks, success stories and best practices to leverage all your data – on-premise and in the cloud – to drive predictive analytics, distributed deep-learning and artificial intelligence initiatives - Attend one of our more than 170 technical deep dive breakout sessions from nearly 200 speakers across eight tracks - Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more - Attend the community showcase where you can network with sponsors and industry experts, including a host of startups and large companies like Microsoft, IBM, Oracle, HP, Dell EMC and Teradata Similar to previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the breakout sessions below. Or, stop by Yahoo kiosk #700 at the community showcase. Also, as a co-host of the event, Yahoo is pleased to offer a 20% discount for the summit with the code YAHOO20. Register here for Hadoop Summit, San Jose, California! DAY 1. TUESDAY June 13, 2017 12:20 - 1:00 P.M. TensorFlowOnSpark - Scalable TensorFlow Learning On Spark Clusters Andy Feng - VP Architecture, Big Data and Machine Learning Lee Yang - Sr. Principal Engineer In this talk, we will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, that was open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training & inferencing on Spark clusters. It supports all TensorFlow functionalities including synchronous & asynchronous learning, model & data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow and network protocols for server-to-server communication. With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application. 2:10 - 2:50 P.M. Handling Kernel Upgrades at Scale - The Dirty Cow Story Samy Gawande - Sr. Operations Engineer Savitha Ravikrishnan - Site Reliability Engineer Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of 100s of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. 
In this talk, we will share methods, tips, and tricks to deal with large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege escalation vulnerability found in the Linux kernel in late 2016). 5:00 – 5:40 P.M. Data Highway Rainbow - Petabyte Scale Event Collection, Transport, and Delivery at Yahoo Nilam Sharma - Sr. Software Engineer Huibing Yin - Sr. Software Engineer This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers & aggregated delivery in primary Yahoo data centers which provide a big data computing cluster. From a delivery perspective, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm & Kafka endpoints tailored towards latency-sensitive consumers. DAY 2. WEDNESDAY June 14, 2017 9:05 - 9:15 A.M. Yahoo General Session - Shaping Data Platform for Lasting Value Sumeet Singh – Sr. Director, Products With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo. 12:20 - 1:00 P.M. CaffeOnSpark Update - Recent Enhancements and Use Cases Mridul Jain - Sr. Principal Engineer Jun Shi - Principal Engineer By combining salient features from the deep learning framework Caffe and the big-data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update audiences about the recent development of CaffeOnSpark. We will highlight new features and capabilities: a unified data layer that supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment. 12:20 - 1:00 P.M. Tez Shuffle Handler - Shuffling at Scale with Apache Hadoop Jon Eagles - Principal Engineer Kuhu Shukla - Software Engineer In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, which has support for multi-partition fetch to mitigate performance slowdown, and provides deletion APIs to reduce disk usage for long-running Tez sessions. As this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and provide performance evaluation results from real-world jobs at scale. 2:10 - 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes Thiruvel Thirumoolan – Principal Engineer Francis Liu – Sr. Principal Engineer At Yahoo, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad-hoc, etc.).
We will walk through multi-tenancy features, explaining our motivation, how they work, as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0. 2:10 - 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse Nick Huang – Director, Data Engineering, Yahoo Mail Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail Since 2014, the Yahoo Mail Data Engineering team took on the task of revamping the Mail data warehouse and analytics infrastructure in order to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, and surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this 3-year journey, from the system architecture and the analytics systems built, to the learnings from development and the drive for adoption. DAY 3. THURSDAY June 15, 2017 2:10 – 2:50 P.M. OracleStore - A Highly Performant RawStore Implementation for Hive Metastore Chris Drome - Sr. Principal Engineer Jin Sun - Principal Engineer Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad-hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. In order for Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof-of-concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. Then, we implemented OracleStore with specific goals built-in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains that were realized with this alternative implementation. These include a reduction of 97%+ in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL. 3:00 P.M. - 3:40 P.M. Bullet - A Real Time Data Query Engine Akshai Sarma - Sr. Software Engineer Michael Natkovich - Director, Engineering Bullet is an open sourced, lightweight, pluggable querying system for streaming data without a persistence layer, implemented on top of Storm. It allows you to filter, project, and aggregate on data in transit. It includes a UI and WS. Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, our queries can be executed against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system. Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source. It can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations such as distinct, count distinct, sum, count, min, max, and average. 3:00 P.M. - 3:40 P.M. Yahoo - Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez Rohini Palaniswamy - Sr.
Principal Engineer Last year at Yahoo, we put great effort into scaling, stabilizing, and making Pig on Tez production ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration and the improved performance, we shifted our focus to addressing some of the bottlenecks we identified and to new optimization ideas that we came up with to make it go even faster. We will go over the new features and work done in Tez to make that happen, such as a custom YARN ShuffleHandler, reworked DAG scheduling order, serialization changes, etc. We will also cover exciting new features that were added to Pig for performance, such as bloom join and bytecode generation. 4:10 P.M. - 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning Evans Ye, Software Engineer Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platform as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well. In this presentation, we’ll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. At the core are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. The content of this talk includes the containerized CI framework, technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation. Register here for Hadoop Summit, San Jose, California, with the 20% discount code YAHOO20. Questions? Feel free to reach out to us at bigdata@yahoo-inc.com. Hope to see you there!

Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)

May 23, 2017
Open Sourcing Daytona: A Framework For Automated and Application-agnostic Performance Analysis May 23, 2017
May 23, 2017
Share

Open Sourcing Daytona: A Framework For Automated and Application-agnostic Performance Analysis

By Sapan Panigrahi and Deepesh Mittal

Today, we are pleased to offer Daytona, an open-source framework for automated performance testing and analysis, to the community. Daytona is an application-agnostic framework to conduct integrated performance testing and analysis with repeatable test execution, standardized reporting, and built-in profiling support. Daytona gives you the capability to build a customized test harness in a single, unified framework to test and analyze the performance of any application. You’ll get easy repeatability, consistent reporting, and the ability to capture trends. Daytona’s UI accepts a performance testing script that can run on a command line. This includes websites, databases, networks, or any workload you need to test and tune for performance. You can submit tests to the scheduler queue from the Daytona UI or from your CI/CD tool. You can deploy Daytona as a hosted service in your on-prem environment or on the public cloud of your choice. In fact, you can even host test harnesses for multiple applications with a single centralized service so that developers, architects, and systems engineers from different parts of your organization can work together on a unified view and manage your performance analysis on a continuous basis.

Daytona’s differentiation lies in its ability to aggregate and present essential aspects of application, system, and hardware performance metrics with a simple and unified user interface. This helps you maintain your focus on performance analysis without changing context across various sources and formats of data. The overall goal of performance analysis is to find ways of maximizing application throughput with minimal hardware resources and the best user experience. Metrics and insights from Daytona help achieve this objective.

Prior to Daytona, we created multiple, heterogeneous performance tools to meet the specific needs of various applications. This meant that we often stored test results inconsistently, making it harder to analyze performance in a comprehensive manner. We had a difficult time sharing results and analyzing differences in test runs in a standard manner, which could lead to confusion. With Daytona, we are now able to integrate all our load testing tools under a single framework and aggregate test results in one common central repository. We are gaining insight into the performance characteristics of many of our applications on a continuous basis. These insights help us optimize our applications, which results in better utilization of our hardware resources and helps improve user experience by reducing the latency to serve end-user requests. Ultimately, Daytona helps us reduce capital expenditure on our large-scale infrastructure and makes our applications more robust under load. Sharing performance results in a common format encourages the use of common optimization techniques that we can leverage across many different applications.

Daytona was built knowing that we would want to publish it as open source and share the technology with the community for validation and improvement of the framework. We hope the community can help extend its use cases and make it suitable for an even broader set of applications and workloads.

Architecture

Daytona comprises a centralized scheduler, a distributed set of agents running on SUTs (systems under test), a MySQL database to store all metadata for tests, and a PHP-based UI. A test harness can be customized by answering a simple set of questions about the application/workload.
A test can be submitted to Daytona’s queue through the UI or through a CLI (command line interface) from the CI/CD system. The scheduler process polls the database for a test to be run and sends all the actions associated with the execution of the test to the agent running on a SUT. An agent process executes the test, collects application and system performance metrics, and sends the metrics back as a package to the scheduler. The scheduler saves the test metadata in the database and test results in the local file system. Tests from multiple harnesses proceed concurrently.

Architecture and Life Cycle of a Test

Looking Forward

Our goal is to integrate Daytona with popular open source CI/CD tools and we welcome contributions from the community to make that happen. It is available under Apache License Version 2.0. To evaluate Daytona, we provide simple instructions to deploy it on your in-house bare metal, VM, or public cloud infrastructure. We also provide instructions so you can quickly have a test and development environment up and running on your laptop with Docker. Please join us on the path of making application performance analysis an enjoyable and insightful experience. Visit the Daytona Yahoo repo to get started!
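As noted above, Daytona's UI accepts any performance-testing script that can run on a command line. The snippet below is one example of what such a script might look like; the target URL, request count, and metric names are illustrative choices, not a format Daytona prescribes.

    #!/usr/bin/env python3
    # Example of a simple command-line performance test that a harness like
    # Daytona could schedule on a system under test. The endpoint and the
    # printed metric names are hypothetical.

    import statistics
    import time
    import urllib.request

    TARGET_URL = "http://localhost:8080/health"   # hypothetical service endpoint
    REQUESTS = 200

    latencies_ms = []
    start = time.time()
    for _ in range(REQUESTS):
        t0 = time.time()
        with urllib.request.urlopen(TARGET_URL) as resp:
            resp.read()
        latencies_ms.append((time.time() - t0) * 1000)
    elapsed = time.time() - start

    # Print simple metrics to stdout so the harness (or a human) can collect them.
    print(f"throughput_rps={REQUESTS / elapsed:.1f}")
    print(f"latency_p50_ms={statistics.median(latencies_ms):.2f}")
    print(f"latency_p95_ms={sorted(latencies_ms)[int(0.95 * REQUESTS) - 1]:.2f}")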

Open Sourcing Daytona: A Framework For Automated and Application-agnostic Performance Analysis

May 23, 2017
Understanding Athenz Architecture May 9, 2017
May 9, 2017
Share

Understanding Athenz Architecture

By Mujib Wahab, Henry Avetisyan, and Lee Boynton

Data Model

Having a firm grasp of a few fundamental concepts in the Athenz data model will help you understand the Athenz architecture, the request flow for both centralized and decentralized authorization in the system view, and how to set up role-based authorization.

Domain: Domains are namespaces, strictly partitioned, providing a context for authoritative statements to be made about the entities they contain. Administrative tasks can be delegated to sub-domains to avoid reliance on central “super user” administrative roles.

Resource/Action: Resources and actions aren’t explicitly modeled in Athenz; they are referred to by name. A resource is something that is “owned” and controlled in a specific domain, while the operations one can perform against that resource are defined as actions. A resource could be a concrete object like a machine or an abstract object like a security policy. For example, if a product in the media.finance domain wants to authorize access to a database called “storage” that it owns, the resource name for the database may look like this: media.finance:db.storage, and the supported actions on this resource would be insert, update, and delete.

Policy: To implement access control, we have policies in our domain that govern the use of our resources. A policy is a set of assertions (rules) about granting or denying an operation/action on a resource to all the members in the configured role.

Role: A role can be thought of as a group; anyone in the group can assume the role to take a particular action. Every policy assertion describes what can be done by a role. A role can also delegate the determination of membership to another trusted domain; for example, a netops role managed outside a property domain. This is how we can model tenant relations between a provider domain and tenant domains. Because roles are defined in domains, they can be partitioned by domain, unlike users, which are global. This allows the distributed operation to be more easily scaled.

Principal: The actors in Athenz that can assume a role are called principals. These principals are authenticated and can be users (for example, authenticated by their Unix or Kerberos credentials). Principals can also be services that are authenticated by a service management system. Athenz currently provides service identity and authentication support.

User: Users are actually defined in some external authority, e.g., a Unix or Kerberos system. A special domain is reserved for the purpose of namespacing users; the name of that domain is “user,” so some example users are user.john or user.joe. The credentials that the external system requires are exchanged for an N-Token before operating on any data.

Service: The concept of a Service Identity is introduced as the identity of independent agents of execution. Services have a simple naming scheme; e.g., media.finance.storage identifies a service called “storage” in the domain media.finance. A service may be used as a principal when specifying roles, just like a user. Athenz provides support for registering such a service, in a domain, along with its public key that can be used to later verify an N-Token presented by the service.

System View

Let’s look at all the services and libraries that work together to provide support for the Athenz authorization system.

ZMS (authZ Management System): ZMS is the source of truth for domains, roles, and policies for centralized authorization.
In addition to allowing CRUD operations on the basic entities, ZMS provides an API to replicate the entities, per domain, to ZTS. ZMS supports a centralized call to check if a principal has access to a resource, both for internal management system checks and for a simple centralized deployment. Because ZMS supports service identities, ZMS can authenticate services. For centralized authorization, ZMS may be the only Athenz subsystem that you need to interact with.

ZTS (authZ Token System): ZTS, the authentication token service, is only needed to support decentralized functionality. In many ways, ZTS is like a local replica of ZMS’s data to check a principal’s authentication and confirm membership in roles within a domain. The authentication is in the form of a signed ZToken that can be presented to any decentralized service that wants to authorize access efficiently. Multiple ZTS instances can be distributed to different locations as needed to scale for issuing tokens.

SIA (Service Identity Agent) Provider: The SIA Provider is part of the container, although likely built with Athenz libraries. As services are authenticated by their private keys, the job of the SIA Provider is to generate an N-Token and sign it with the given private key so that the service can present that N-Token to ZMS/ZTS as its identity credentials. The corresponding public key must be registered in ZMS so Athenz services can validate the signature.

ZPE (AuthZ Policy Engine): Like ZTS, ZPE, the authorization policy engine, is only needed to support decentralized authorization. ZPE is the subsystem of Athenz that evaluates policies for a set of roles to yield an allowed or a denied response. ZPE is a library that your service calls, and it only refers to a local policy cache for your service’s domain (a small amount of data).
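To make the decentralized flow concrete, here is a toy sketch of ZPE-style evaluation of locally cached policy assertions. The data layout, the glob matching, and the deny-wins rule are simplifying assumptions for illustration; this is not the actual Athenz ZPE library API.

    import fnmatch

    # Assertions cached locally for one domain: (effect, role, action, resource glob).
    POLICY_CACHE = {
        "media.finance": [
            ("ALLOW", "db_writers", "insert", "media.finance:db.storage"),
            ("ALLOW", "db_writers", "update", "media.finance:db.storage"),
            ("ALLOW", "db_readers", "select", "media.finance:db.*"),
            ("DENY",  "interns",    "*",      "media.finance:db.storage"),
        ]
    }

    def check_access(domain, principal_roles, action, resource):
        """Return 'ALLOW' only if some assertion allows and none denies."""
        allowed = False
        for effect, role, act, res_glob in POLICY_CACHE.get(domain, []):
            if role not in principal_roles:
                continue
            if not fnmatch.fnmatchcase(action, act):
                continue
            if not fnmatch.fnmatchcase(resource, res_glob):
                continue
            if effect == "DENY":
                return "DENY"          # in this sketch, an explicit deny wins
            allowed = True
        return "ALLOW" if allowed else "DENY"

    # Example: a principal whose ZToken carries the db_writers role.
    print(check_access("media.finance", {"db_writers"},
                       "insert", "media.finance:db.storage"))   # ALLOW

Because the policy cache lives on the host, this check involves no network call, which is the latency advantage the decentralized model is designed for.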

Understanding Athenz Architecture

May 9, 2017
Open Sourcing Athenz:    Fine-Grained, Role-Based Access Control May 9, 2017
May 9, 2017
Share

Open Sourcing Athenz:    Fine-Grained, Role-Based Access Control

By Lee Boynton, Henry Avetisyan, Ken Fox, Itsik Figenblat, Mujib Wahab, Gurpreet Kaur, Usha Parsa, and Preeti Somal

Today, we are pleased to offer Athenz, an open-source platform for fine-grained access control, to the community. Athenz is a role-based access control (RBAC) solution, providing trusted relationships between applications and services deployed within an organization requiring authorized access. If you need to grant access to a set of resources that your applications or services manage, Athenz provides both a centralized and a decentralized authorization model to do so. Whether you are using container or VM technology independently or on bare metal, you may need a dynamic and scalable authorization solution. Athenz supports moving workloads from one node to another and gives new compute resources authorization to connect to other services within minutes, as opposed to relying on IP and network ACL solutions that take time to propagate within a large system. Moreover, in very high-scale situations, you may run out of the limited number of network ACL rules that your hardware can support.

Prior to creating Athenz, we had multiple ways of managing permissions and access control across all services within Yahoo. To simplify, we built a fine-grained, role-based authorization solution that would satisfy the feature and performance requirements our products demand. Athenz was built with open source in mind so as to share it with the community and further its development. At Yahoo, Athenz authorizes the dynamic creation of compute instances and containerized workloads, secures builds and deployment of their artifacts to our Docker registry, and, among other uses, manages the data access from our centralized key management system to an authorized application or service.

Athenz provides a REST-based set of APIs modeled in Resource Description Language (RDL) to manage all aspects of the authorization system, and includes Java and Go client libraries to quickly and easily integrate your application with Athenz. It allows product administrators to manage what roles are allowed or denied to their applications or services in a centralized management system through a self-serve UI.

Access Control Models

Athenz provides two authorization access control models based on your applications’ or services’ performance needs. More commonly used, the centralized access control model is ideal for provisioning and configuration needs. In instances where performance is absolutely critical for your applications or services, we provide a unique decentralized access control model that provides on-box enforcement of authorization.

Athenz’s authorization system utilizes two types of tokens: principal tokens (N-Tokens) and role tokens (Z-Tokens). The principal token is an identity token that identifies either a user or a service. A service generates its principal token using that service’s private key. Role tokens authorize a given principal to assume some number of roles in a domain for a limited period of time. Like principal tokens, they are signed to prevent tampering. The name “Athenz” is derived from “Auth” and the ‘N’ and ‘Z’ tokens.

Centralized Access Control: The centralized access control model requires any Athenz-enabled application to contact the Athenz Management Service directly to determine if a specific authenticated principal (user and/or service) has been authorized to carry out the given action on the requested resource. At Yahoo, our internal continuous delivery solution uses this model.
A service receives a simple Boolean answer indicating whether the request should be processed or rejected. In this model, the Athenz Management Service is the only component that needs to be deployed and managed within your environment. Therefore, it is suitable for provisioning and configuration use cases where the number of requests processed by the server is small and the latency for authorization checks is not important. The diagram below shows a typical control plane provisioning request handled by an Athenz-protected service.

Athenz Centralized Access Control Model

Decentralized Access Control: This approach is ideal where the application is required to handle a large number of requests per second and latency is a concern. It’s far more efficient to check authorization on the host itself and avoid the synchronous network call to a centralized Athenz Management Service. Athenz provides a way to do this with its decentralized service using a local policy engine library on the local box. At Yahoo, this is the approach we use for our centralized key management system. The authorization policies defining which roles have been authorized to carry out specific actions on resources are asynchronously updated on application hosts and used by the Athenz local policy engine to evaluate the authorization check. In this model, a principal needs to contact the Athenz Token Service first to retrieve an authorization role token for the request and submit that token as part of its request to the Athenz-protected service. The same role token can then be re-used for its lifetime. The diagram below shows a typical decentralized authorization request handled by an Athenz-protected service.

Athenz Decentralized Access Control Model

With the power of an RBAC system in which you can choose the model to deploy according to your performance and latency needs, and the flexibility to use either or both models across a complex environment of hosting platforms or products, you gain the ability to run your business with agility and scale.

Looking to the Future

We are actively engaged in pushing the scale and reliability boundaries of Athenz. As we enhance Athenz, we look forward to working with the community on the following features:

- Using local CA signed TLS certificates
- Extending Athenz with a generalized model for service providers to launch instances with bootstrapped Athenz service identity TLS certificates
- Integration with public cloud services like AWS. For example, launching an EC2 instance with a configured Athenz service identity or obtaining AWS temporary credentials based on authorization policies defined in ZMS.

Our goal is to integrate Athenz with other open source projects that require authorization support, and we welcome contributions from the community to make that happen. It is available under Apache License Version 2.0. To evaluate Athenz, we provide both AWS AMI and Docker images so that you can quickly have a test development environment up and running with ZMS (Athenz Management Service), ZTS (Athenz Token Service), and UI services. Please join us on the path to making application authorization easy. Visit http://www.athenz.io to get started!
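For a feel of the centralized model from a service's point of view, here is a rough sketch of the access check described above. The endpoint path, header name, and response field are assumptions made for illustration only; consult the Athenz documentation for the real ZMS REST API.

    import json
    import urllib.request

    ZMS_BASE = "https://zms.example.com/zms/v1"   # hypothetical ZMS address

    def is_authorized(principal_token, action, resource):
        # Ask the management service whether the caller may perform the action
        # on the resource. Path, header, and response shape are illustrative.
        url = "{}/access/{}/{}".format(ZMS_BASE, action, resource)
        req = urllib.request.Request(url)
        # Forward the caller's identity token so ZMS can authenticate it.
        req.add_header("Athenz-Principal-Auth", principal_token)
        with urllib.request.urlopen(req) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return bool(body.get("granted", False))

    # Inside the Athenz-protected service handling a request:
    # if not is_authorized(request_ntoken, "insert", "media.finance:db.storage"):
    #     reject the request with HTTP 403

The synchronous round trip to ZMS on every request is exactly the cost the decentralized model avoids by caching signed policies on the host.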

Open Sourcing Athenz:    Fine-Grained, Role-Based Access Control

May 9, 2017
Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters February 13, 2017
February 13, 2017
Share

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

By Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team

Introduction

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters. Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

Last year we addressed scaleout issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters. We use CaffeOnSpark at Yahoo to improve our NSFW image detection, to automatically identify eSports game highlights from live-streamed videos, and more. With the community’s valuable feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on Docker containers. This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale. To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. Databricks proposed TensorFrames to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.

TensorFlowOnSpark

Our new framework, TensorFlowOnSpark (TFoS), enables distributed TensorFlow execution on Spark and Hadoop clusters. As illustrated in Figure 2 above, TensorFlowOnSpark is designed to work along with SparkSQL, MLlib, and other Spark libraries in a single pipeline or program (e.g., a Python notebook). TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters. Any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, fewer than 10 lines of Python code need to change. Many developers at Yahoo who use TensorFlow have easily migrated their TensorFlow programs for execution with TensorFlowOnSpark.
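To give a sense of what that small change typically looks like, here is a sketch of a single-node TensorFlow 1.x-era training script restructured into a per-executor function. The ctx fields and the overall shape are assumptions for illustration rather than the exact TensorFlowOnSpark API; the real conversion examples live in the project's GitHub repository.

    import tensorflow as tf

    def build_model():
        # Stand-in for your existing model code: a trivial quadratic loss.
        global_step = tf.train.get_or_create_global_step()
        w = tf.Variable(0.0)
        loss = tf.square(w - 1.0)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
        return loss, train_op

    def main_fun(argv, ctx):
        # ctx is assumed to describe this executor's slot in the TF cluster:
        # a cluster-spec dict plus this process's job name and task index.
        cluster = tf.train.ClusterSpec(ctx.cluster_spec)
        server = tf.train.Server(cluster, job_name=ctx.job_name,
                                 task_index=ctx.task_index)

        if ctx.job_name == "ps":
            server.join()          # parameter servers only serve variables
            return

        # Workers run the (mostly unchanged) original training code.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % ctx.task_index,
                cluster=cluster)):
            loss, train_op = build_model()

        hooks = [tf.train.StopAtStepHook(last_step=1000)]
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(ctx.task_index == 0),
                                               hooks=hooks) as sess:
            while not sess.should_stop():
                sess.run(train_op)

The point of the exercise is that build_model() and the training loop stay essentially as they were; only the cluster bootstrapping around them is new.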
TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Process-to-process direct communication enables TensorFlowOnSpark programs to scale easily by adding machines. As illustrated in Figure 3, TensorFlowOnSpark doesn’t involve Spark drivers in tensor communication, and thus achieves similar scalability to stand-alone TensorFlow clusters.

TensorFlowOnSpark provides two different modes to ingest data for training and inference:

1. TensorFlow QueueRunners: TensorFlowOnSpark leverages TensorFlow’s file readers and QueueRunners to read data directly from HDFS files. Spark is not involved in accessing data.
2. Spark Feeding: Spark RDD data is fed to each Spark executor, which subsequently feeds the data into the TensorFlow graph via feed_dict.

Figure 4 illustrates how the synchronous distributed training of the Inception image classification network scales in TFoS using QueueRunners with a simple setting: 1 GPU, 1 reader, and batch size 32 for each worker. Four TFoS jobs were launched to train 100,000 steps. When these jobs completed after 2+ days, the top-5 accuracies of these jobs were 0.730, 0.814, 0.854, and 0.879. Reaching a top-5 accuracy of 0.730 takes 46 hours for a 1-worker job, 22.5 hours for a 2-worker job, 13 hours for a 4-worker job, and 7.5 hours for an 8-worker job. TFoS thus achieves near linear scalability for Inception model training. This is very encouraging, although TFoS scalability will vary for different models and hyperparameters.

RDMA for Distributed TensorFlow

In Yahoo’s Hadoop clusters, GPU nodes are connected by both Ethernet and Infiniband. Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband.

In conjunction with our TFoS release, we are introducing a new protocol for TensorFlow servers in addition to the default “grpc” protocol. Any distributed TensorFlow program can leverage our enhancement by specifying protocol="grpc_rdma" in tf.train.ServerDef() or tf.train.Server(). With this new protocol, an RDMA rendezvous manager is created to ensure tensors are written directly into the memory of remote servers. We minimize tensor buffer creation: tensor buffers are allocated once at the beginning, and then reused across all training steps of a TensorFlow job. From our early experimentation with large models like the VGG-19 network, our RDMA implementation has demonstrated a significant speedup on training time compared with the existing gRPC implementation. Since RDMA support is a highly requested capability (see TensorFlow issue #2916), we decided to make our current implementation available as an alpha release to the TensorFlow community. In the coming weeks, we will polish our RDMA implementation further and share detailed benchmark results.

Simple CLI and API

TFoS programs are launched by the standard Apache Spark command, spark-submit. As illustrated below, users can specify the number of Spark executors, the number of GPUs per executor, and the number of parameter servers in the CLI. A user can also state whether they want to use TensorBoard (--tensorboard) and/or RDMA (--rdma).
    spark-submit --master ${MASTER} \
        ${TFoS_HOME}/examples/slim/train_image_classifier.py \
        --model_name inception_v3 \
        --train_dir hdfs://default/slim_train \
        --dataset_dir hdfs://default/data/imagenet \
        --dataset_name imagenet \
        --dataset_split_name train \
        --cluster_size ${NUM_EXEC} \
        --num_gpus ${NUM_GPU} \
        --num_ps_tasks ${NUM_PS} \
        --sync_replicas \
        --replicas_to_aggregate ${NUM_WORKERS} \
        --tensorboard \
        --rdma

TFoS provides a high-level Python API (illustrated in our sample Python notebook):

- TFCluster.reserve() … construct a TensorFlow cluster from Spark executors
- TFCluster.start() … launch the TensorFlow program on the executors
- TFCluster.train() or TFCluster.inference() … feed RDD data to the TensorFlow processes
- TFCluster.shutdown() … shut down TensorFlow execution on the executors

Open Source

Yahoo is happy to release TensorFlowOnSpark at github.com/yahoo/TensorFlowOnSpark and an RDMA enhancement of TensorFlow at github.com/yahoo/tensorflow/tree/yahoo. Multiple example programs (including mnist, cifar10, inception, and VGG) are provided to illustrate the simple conversion process of TensorFlow programs to TensorFlowOnSpark, and to leverage RDMA. An Amazon Machine Image is also available for applying TensorFlowOnSpark on AWS EC2. Going forward, we will advance TensorFlowOnSpark as we continue to do with CaffeOnSpark. We welcome the community’s continued feedback and contributions to CaffeOnSpark, and are interested in thoughts on ways TensorFlowOnSpark can be enhanced.
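Putting the high-level API calls listed above together, a driver program might look roughly like the following sketch. The exact signatures and arguments are assumptions for illustration; refer to the TensorFlowOnSpark examples on GitHub for the real usage.

    from pyspark import SparkContext
    from tensorflowonspark import TFCluster   # import path assumed

    import mnist_dist   # hypothetical module holding main_fun(argv, ctx)

    sc = SparkContext(appName="tfos_sketch")
    num_executors, num_ps = 4, 1

    # Reserve executors for the TensorFlow cluster, then launch the program.
    cluster = TFCluster.reserve(sc, num_executors, num_ps, tensorboard=False)
    cluster.start(mnist_dist.main_fun, argv=None)

    # Spark-feeding mode: stream RDD partitions into the running TF graph.
    train_rdd = sc.textFile("hdfs:///data/mnist/csv/train")
    cluster.train(train_rdd, num_epochs=1)

    cluster.shutdown()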

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

February 13, 2017
Call For Abstracts - DataWorks and Hadoop Summit, San Jose 2017 February 7, 2017
February 7, 2017
Share

Call For Abstracts - DataWorks and Hadoop Summit, San Jose 2017

The deadline to submit DataWorks Summit/Hadoop Summit abstracts for San Jose is Feb 10th. Please consider submitting an abstract and help encourage the community to do the same (tweet, post, blog, …). The key details are below.

DataWorks Summit/Hadoop Summit – Submit Your Abstract Now!

Master the possibilities for next-gen big data. DataWorks Summit/Hadoop Summit is the industry’s premier event focusing on next-generation big data solutions. Join us and learn from industry experts and peers about how open source technologies such as Apache Hadoop, Apache Spark, Apache NiFi, and the extended Big Data ecosystem enable you to leverage data to drive predictive analytics, distributed deep-learning, and artificial intelligence initiatives across global organizations. Would you like to share your knowledge with the best and brightest in the open source Big Data community and be recognized as an industry expert? The DataWorks Summit/Hadoop Summit Organizing Committee invites you to submit an abstract to be considered for the summit in San Jose on June 13-15.

We are looking for abstracts for the following tracks:

Business Focus
- Enterprise Adoption
- Applications

Technical Focus
- Data Processing & Warehousing
- Apache Hadoop
- Governance & Security
- IoT and Streaming
- Cloud & Operations
- Apache Spark & Data Science

To learn more about the process and submit your abstract, please click here. San Jose abstracts deadline: Feb 10, 2017. Submit Abstract Proposal: http://tinyurl.com/dwsj17CFA

Call For Abstracts - DataWorks and Hadoop Summit, San Jose 2017

February 7, 2017
10 Years of Hadoop and its Israeli Pioneering Researchers November 18, 2016
November 18, 2016
Share

10 Years of Hadoop and its Israeli Pioneering Researchers

By Edward Bortnikov

The Apache Hadoop technology suite is the engine behind the Big Data revolution that has been transforming multiple industries over the last decade. Hadoop was born at Yahoo 10 years ago as a pioneering open-source project. It quickly outgrew the company’s boundaries to become a vehicle that powers thousands of businesses, ranging from small enterprises to Web giants. These days, Yahoo runs the largest Hadoop deployment in the industry. We run tens of thousands of Hadoop machines in our datacenters and manage more than 600 petabytes of data. Our products use Hadoop in a variety of ways that reflect a wealth of data processing patterns. Yahoo’s infrastructure harnesses the Hadoop Distributed File System (HDFS) for ultra-scalable storage, Hadoop MapReduce for massive ad-hoc batch processing, Hive and Pig for database-style analytics, HBase for key-value storage, Storm for stream processing, and ZooKeeper for reliable coordination.

Yahoo’s commitment to Hadoop goes far beyond operating the technology at Web scale. The company’s engineers and scientists make contributions to both entrenched and incubating Hadoop projects. Our Scalable Platforms team at Yahoo Research in Haifa has championed multiple innovative efforts that have benefited Yahoo products as well as the entire Hadoop community. Just recently, we contributed new algorithms to HBase, Omid (a transaction processing system for HBase), and ZooKeeper. Our work significantly boosted the performance of these systems and hardened their fault tolerance. For example, the enhancements to Omid were instrumental in turning it into an Apache Incubator project (a candidate for top-level technology status), whereas the work in HBase was named one of its top new features this year.

Our team launched approximately three years ago. Collectively, we add many years of experience in distributed computing research and development to Yahoo and the Hadoop community. We specialize in scalability and high availability, arguably the biggest challenges in big data platforms. We love to identify hard problems in large-scale systems, design algorithms to solve them, develop the code, experiment with it, and finally contribute to the community. The team features researchers with deep theoretical backgrounds as well as the engineering maturity required to deal with complex production code. Our researchers regularly present their innovations at leading industrial conferences (Hadoop Summit and HBaseCon), as well as at top academic venues. The researchers on our team bring a blend of backgrounds in distributed computing, programming languages, and big systems, and most of us hold PhD degrees in these areas.

We are especially proud to be a pioneering team of Hadoop developers in Israel. As such, we teach courses in big data technologies, organize technical meetups, and collaborate with academic colleagues. We are always happy to share our expertise with the ever-growing community of Hadoop users in the local hi-tech industry.

Great contributions to Hadoop from our team in Haifa, Israel!

10 Years of Hadoop and its Israeli Pioneering Researchers

November 18, 2016
Omid’s First Step in the Apache Community September 23, 2016
September 23, 2016
Share

Omid’s First Step in the Apache Community

By Francisco Perez-Sorrosal, Ohad Shacham, Kostas Tsioutsiouliklis, and Edward Bortnikov

We are proud to announce that Omid (“Hope” in Persian), Yahoo’s transaction manager for HBase [1][2], has been accepted as an Apache Incubator project. Yahoo has been a long-time contributor to the Apache community in the Hadoop ecosystem, including HBase, YARN, Storm, and Pig. Our acceptance as an Apache Incubator project is another step forward following the success of ZooKeeper [3] and BookKeeper [4], which were born at Yahoo and graduated to top-level Apache projects.

These days, most NoSQL databases, including HBase, do not provide the OLTP support available in traditional relational databases, forcing the applications running on top of them to trade transactional support for greater agility and scalability. However, transactions are essential in many applications using NoSQL datastores as the main source of data, for example, in incremental content processing systems. Omid enables these applications to benefit from the best of both worlds: the scalability provided by NoSQL datastores, such as HBase, and the concurrency and atomicity provided by transaction processing systems.

Omid provides a high-performance ACID transactional framework with Snapshot Isolation guarantees on top of HBase [5], and it is able to scale to thousands of clients triggering transactions on application data. It is one of the few open-source transactional frameworks that can scale beyond 100K transactions per second on mid-range hardware while incurring minimal impact on the latency of accessing the datastore. At its core, Omid utilizes a lock-free approach to support multiple concurrent clients. Its design relies on a centralized conflict detection component called the Transaction Status Oracle (TSO), which efficiently resolves write-set collisions among concurrent transactions [6]. Another important benefit is that Omid does not require any modification of the underlying key-value datastore – HBase in this case. Moreover, the recently added high-availability algorithm eliminates the single point of failure represented by the TSO in those deployments that require a higher degree of dependability [7]. Last but not least, the API is very simple – mimicking the transaction manager APIs of the relational world: begin, commit, rollback – and the client and server configuration processes have been simplified to help both application developers and system administrators.

Efforts toward growing the community have been underway over the last few months. Apache Hive [8] contributors from Hortonworks expressed interest in storing Hive metadata in HBase using Omid, and this led to a fruitful collaboration that resulted in Omid now supporting HBase 1.x versions. Omid could also be used as the transaction manager in other SQL abstraction layers on top of HBase, such as Apache Phoenix [9], or as the transaction coordinator in distributed systems, such as the Apache DistributedLog project [10] and Pulsar, a distributed pub-sub messaging platform recently open sourced by Yahoo.

Since its inception in 2011 at Yahoo Research, Omid has matured to operate at Web scale in a production environment. For example, since 2014 Omid has been used at Yahoo – along with other Hadoop technologies – to power our incremental content ingestion platform for search and personalization products. In this role, Omid is serving millions of transactions per day over HBase data.
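As a conceptual aside, the snippet below sketches the kind of snapshot-isolation commit check a centralized Transaction Status Oracle performs: hand out a start timestamp, then commit a transaction only if nothing in its write set was committed after that timestamp. It is a toy illustration, not Omid's actual API or protocol.

    import itertools

    class ToyTSO:
        def __init__(self):
            self._clock = itertools.count(1)    # monotonically increasing timestamps
            self._last_commit = {}              # row key -> commit timestamp

        def begin(self):
            """Hand out a start timestamp; reads see data committed before it."""
            return next(self._clock)

        def try_commit(self, start_ts, write_set):
            """Commit iff no row in the write set was committed after start_ts."""
            for key in write_set:
                if self._last_commit.get(key, 0) > start_ts:
                    return None                 # write-write conflict: abort
            commit_ts = next(self._clock)
            for key in write_set:
                self._last_commit[key] = commit_ts
            return commit_ts

    tso = ToyTSO()
    t1, t2 = tso.begin(), tso.begin()
    print(tso.try_commit(t1, {"row:a", "row:b"}))   # commits, returns a timestamp
    print(tso.try_commit(t2, {"row:b"}))            # conflicts with t1, returns None

Because only commit requests flow through this component and no locks are held by clients, the check stays cheap even with many concurrent transactions, which is the property the TSO design relies on.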
We have decided to move the Omid project to “the Apache Way” because we think it is the next logical step after having battle-tested the project in production at Yahoo and having open-sourced the code on Yahoo’s public GitHub in 2012. (The Omid GitHub repository currently has 269 stars and 101 forks, and we were asked by our colleagues in the open source community to release it as an Apache Incubator project.) As we aim to form a larger Omid community outside Yahoo, we think that the Apache Software Foundation is the perfect umbrella to achieve this. We invite the Apache community to contribute by providing patches, reviewing code, proposing new features or improvements, and giving talks at conferences such as Hadoop Summit, HBaseCon, ApacheCon, etc. under the Apache rules. We see Omid being recognized as an Apache Incubator project as the first step in growing a vibrant community around this technology. We are confident that contributors in the Apache community will add more features to Omid and further enhance the current performance and latency. Stay tuned to @ApacheOmid on Twitter!

References

[1] Apache Omid GitHub repo: https://github.com/apache/incubator-omid
[2] Apache Omid documentation: http://omid.incubator.apache.org/
[3] Apache ZooKeeper project: http://zookeeper.apache.org/
[4] Apache BookKeeper project: http://bookkeeper.apache.org/
[5] Blog entry introducing Omid: http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for
[6] Blog entry on Omid’s architecture and protocol: http://yahoohadoop.tumblr.com/post/132695603476/omid-architecture-and-protocol
[7] Blog entry on Omid’s high availability: http://yahoohadoop.tumblr.com/post/138682361161/high-availability-in-omid
[8] Apache Hive project: https://hive.apache.org/
[9] Apache Phoenix project: https://phoenix.apache.org/
[10] Apache DistributedLog project: http://distributedlog.incubator.apache.org/

Omid’s First Step in the Apache Community

September 23, 2016
Presenting the Latest in Hadoop August 25, 2016
August 25, 2016
Share

Presenting the Latest in Hadoop

If you are an avid Hadoop user – or even just getting started – there is a place in Silicon Valley you can go approximately once a quarter to learn and ask questions about the latest in the technology. That place is the Bay Area Hadoop User Group (HUG), and last week we hosted our 53rd meetup. In our get-togethers, we surface recent work in this Big Data space that benefits the entire development and user community. In case you missed this latest installment, or would like a recap, below you’ll find the three major topics we reviewed, complete with the videos and slide presentations. Feel free to keep the conversation going by sharing and/or asking us questions. We’ll get back to you!

Open Source Big Data Ingest with StreamSets Data Collector

Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can “drift” due to infrastructure, OS, and application changes, causing ETL tools and hand-coded solutions to fail. StreamSets Data Collector (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute, and monitor robust data flows. In this session, StreamSets community champion Pat Patterson looks at how SDC’s “intent-driven” approach keeps the data flowing, whether you’re processing data “off-cluster,” in Spark, or in MapReduce.

Better Together: Fast Data with Apache Spark and Apache Ignite

Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next-generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain, explains, in detail, how IgniteRDD — an implementation of native Spark RDD and DataFrame APIs — shares the state of the RDD across other Spark jobs, applications, and workers. Dmitriy also demonstrates how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames.

Recent Development in Apache Oozie

Yahoo Sr. Software Engineer Purshotam Shah gives the first part of this talk and describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, the talk focuses on recent advancements in Oozie for dependency management among pipeline stages, incremental and partial processing, combinatorial, conditional and optional processing, priority processing, late processing, and BCP management. The second part of this talk, given by Yahoo Software Engineer Satish Saley, focuses on out-of-the-box support for Spark jobs.

Celebrate a Decade of Excellence with Apache Hadoop and Save 20% off Registration at Hadoop Summit 2016 San Jose

June 16, 2016

We are excited to co-host the 9th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, taking place on June 28-30 at the McEnery Convention Center in San Jose, California. This year’s Hadoop Summit features more than 200 speakers across 9 tracks and over 170 breakout sessions where attendees will learn about innovative use cases, development and administration tips and tricks, the cutting edge in project developments from the committers, and how the community is driving and accelerating Hadoop’s global adoption. The Summit is expected to bring together more than 5,500 community members, presenting excellent opportunities for software developers, architects, administrators, data analysts, and data scientists to learn from each other in advancing, extending, or implementing Hadoop. Much like in prior years, we continue Yahoo’s decade-long tradition and thought leadership with Apache Hadoop at the 2016 Summit. If you are in attendance, come encourage fellow Yahoos as they showcase their work on the latest in Hadoop and related Big Data technologies such as Apache Storm, Tez, HBase, Hive, Oozie, and Distributed Deep Learning.

DAY 1. TUESDAY, June 28, 2016

12:20 - 1:00 P.M. Faster, Faster, Faster!: The True Story of a Mobile Analytics Data Mart on Hive
Mithun Radhakrishnan – Principal Engineer, Apache Hive Committer; Josh Walters – Sr. Software Engineer
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs, and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. This talk will examine the efficacy of using Hive for large-scale mobile analytics.

3:00 – 4:00 P.M. Investigating the Effects of Over Committing YARN Resources
Jason Lowe – Distinguished Engineer, Apache Hadoop and Tez PMC and Committer
YARN requires applications to specify the size (in MB and VCores) of the containers they wish to utilize. Applications need to request enough resources that their containers never run out, which leads to significant amounts of unutilized capacity. On clusters with thousands of nodes, this can amount to millions of dollars of unused capacity. The YARN community is actively working to address this problem. In the shorter term, Yahoo has developed a simple approach that quickly provides useful insights into both the efficacy of over-committing resources and some of the key issues that may be encountered. This talk will describe the dynamic over-commit implementation that Yahoo is running at scale, along with results and pitfalls encountered.

DAY 2. WEDNESDAY, June 29, 2016

9:00 - 11:00 A.M. Yahoo Keynote
Peter Monaco – VP, Engineering, Communications Products
This keynote will address how Mail and Communications applications at Yahoo have used Hadoop and its ecosystem components successfully.

12:20 - 1:00 P.M. Performance Comparison of Streaming Big Data Platforms
Reza Farivar – Capital One, Apache Storm Contributor; Kyle Nusbaum – Software Engineer, Apache Storm PMC
Yahoo has been using Storm extensively, and the number of nodes running Storm has reached about 2,300 (and is still growing). However, several noteworthy competitors, including Apache Flink and Apache Spark Streaming, are gaining attention.
To choose the best streaming tools for our needs, we decided to write a benchmark as close to real-world use cases as possible. In this session, we will examine how these streaming platforms performed against our benchmark tests, and discuss which is most appropriate for your big data real-time streaming needs.

2:10 - 2:50 P.M. Yahoo’s Next-Generation User Profile Platform
Kai Liu – Sr. Engineering Manager; Lu Niu – Software Engineer
User profiles are crucial to the success of any targeting and personalization system. Hundreds of billions of user events are received at Yahoo every day. These events contain a variety of user activities, including app usage, page views, search queries, ad views, ad clicks, etc. In this presentation, we’ll talk about how we designed a modern user profile system using a hybrid architecture that supports fast data ingestion, random access, and interactive ad-hoc queries. We’ll show you how we built the system with Spark, HBase, and Impala to achieve these goals.

3:00 - 3:40 P.M. Omid: A Transactional Framework for HBase
Francisco Perez-Sorrosal – Research Engineer, Omid Committer; Ohad Shacham – Research Scientist, Omid Committer
Omid is a high-performance ACID transactional framework with Snapshot Isolation for HBase. Omid doesn’t require any HBase modification. Most NoSQL databases do not provide OLTP support, and give up transactional support for greater agility and scalability. However, fault-tolerant transactions are essential in many applications in the Hadoop ecosystem, especially in incremental content processing systems. Omid enables these applications to benefit from both the scalability provided by NoSQL datastores and the concurrency and atomicity provided by transaction processing. Omid is now open source. It provides a reliable, high-performance, and easy-to-program platform, capable of serving transactional web-scale applications based on HBase.

3:00 - 3:40 P.M. Building and Managing Data Pipelines with Complex Dependencies Using Apache Oozie
Purushotam Shah – Senior Software Engineer, Apache Oozie PMC and Committer
At Yahoo, Apache Oozie is the standard for building and operating large-scale data pipelines and is responsible for over 80% of the 34 million monthly jobs processed on the Hadoop platform. In this talk, we will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages, incremental and partial processing, combinatorial, conditional and optional processing, priority processing, late processing and reprocessing, SLA monitoring, administration, and BCP management. We will conclude the talk with enhancement ideas for future releases.

5:50 – 6:30 P.M. Resource Aware Scheduling in Storm
Boyang (Jerry) Peng – Software Engineer, Apache Storm PMC and Committer
Apache Storm is one of the most popular stream processing systems in industry today and is the primary platform used for stream processing at Yahoo. However, Storm, like many other stream processing systems, lacks an intelligent scheduling mechanism, so we designed and implemented resource-aware scheduling in Storm. The Resource-Aware Scheduler (RAS) uses specialized scheduling algorithms to maximize resource utilization while minimizing network latency when scheduling applications. Multi-tenant support has already been added to RAS.
In this presentation, we will introduce Resource-Aware Scheduling (RAS) in Storm and discuss how it has improved the performance of Storm and enabled Yahoo to overcome key challenges in operating stream processing systems in multi-tenant and heterogeneous environments.

DAY 3. THURSDAY, June 30, 2016

9:00 – 11:00 A.M. Yahoo Keynote
Mark Holderbaugh – Sr. Director, Engineering, Hadoop
This keynote will cover interesting features introduced by the Hadoop team at Yahoo, like Dynamic Over Commit for better resource utilization on clusters, Pig-on-Tez, and the Resource Aware Scheduler (RAS) in Storm.

11:30 A.M. - 12:10 P.M. Yahoo’s Experience Running Pig on Tez at Scale
Rohini Palaniswamy – Sr. Principal, Apache Pig, Oozie, Tez PMC; Jon Eagles – Principal, Apache Hadoop, Tez PMC
Yahoo has always been one of the first to adopt emerging Hadoop technologies and to stabilize and run them at web scale in production well ahead of mainstream adoption - first Apache YARN and now Apache Tez. Yahoo has the largest footprint of Apache Pig, with tens of thousands of scripts that power ETL and Machine Learning for major Yahoo properties. We have been migrating our scripts to run on Tez to capitalize on order-of-magnitude performance gains and huge savings in resource consumption. In this session, we will present how the effort paid off, with actual performance and SLA numbers from production jobs, and analyze aggregate cluster utilization graphs from before and after the migration. We will share what we learned from running Tez at scale and our experience in making this paradigm shift from MapReduce to Tez.

12:20 - 1:00 P.M. Distributed Deep Learning on Hadoop Clusters
Andy Feng – VP, Architecture, Apache Storm PMC; Jun Shi – Principal Engineer
At Yahoo we recently introduced distributed deep learning as a new capability of Hadoop clusters. These new clusters augment our existing CPU nodes and Ethernet connectivity with GPU nodes and InfiniBand connectivity. We developed a distributed deep learning solution, CaffeOnSpark, that enables deep learning tasks to be launched via the spark-submit command, as in any Spark application. In this talk, we will provide a technical overview of CaffeOnSpark and explain how it conducts deep learning in a private cloud or a public cloud (such as AWS EC2). We will share our experience at Yahoo through use cases (including photo auto-tagging), and discuss the areas of collaboration with open source communities for Hadoop-based deep learning.

2:10 – 2:50 P.M. Managing Hadoop, HBase, and Storm Clusters at Yahoo Scale
Dheeraj Kapur – Principal Engineer; Savitha Ravikrishnan – Operations Engineer
Hadoop at Yahoo is a massive infrastructure and a challenging platform to manage. We have come a long way, from full downtime for upgrades to now requiring no downtime at all, while catering to massive workloads in our 40+ clusters in the ecosystem, spread across multiple data centers. Things get even more complex with multi-tenancy, differing workload characteristics, and strict SLAs on Hadoop, HBase, Storm, and other Support Services. We will talk about rolling upgrades, and the automation and tools we have built to manage a massive grid infrastructure with support for multi-tenancy and full CI/CD.
3:00 P.M. - 3:40 P.M. A Performance and Scalability Review of Apache Hadoop on Commodity HW Configurations
Sumeet Singh – Sr. Director, Cloud and Big Data Platforms; Rajiv Chittajallu – Sr. Principal Engineer
Since its humble beginnings in 2006, Hadoop has come a long way in the last 10 years in its evolution as an open platform. In this talk, we will present a comprehensive review of Hadoop’s performance and scalability to validate how well the original design goals hold true. We intend to present performance and scale metrics from a representative cluster environment with 120 modern servers, utilizing standard benchmark tests. Our focus will be on HDFS, YARN, MapReduce, Tez, Pig-on-Tez, Hive-on-Tez, and HBase. We intend to showcase Hadoop’s performance and throughput numbers and how Hadoop fares when it comes to utilizing system resources such as CPU, memory, disk, and network to make the best use of what is available. We will also provide similar metrics from a 40,000-server footprint running production workloads, so that our audience walks away with a solid baseline for Hadoop performance metrics.

Register Now with a Yahoo Discount

As a co-host for this event, Yahoo is pleased to offer a 20% discount on the registration price. Enter promotional code 16SJspO20 to receive your discount during the registration process. You, or your department, are responsible for the discounted registration fee and any travel expenses involved with attending the Hadoop Summit. Register here for Hadoop Summit, San Jose, California!

Reinforcing Our Commitment to Hadoop

April 28, 2016

10 years after Hadoop was born right here at Yahoo, we’re as committed as ever to improving the technology and sharing it with the community. The past few weeks highlighted Yahoo’s efforts at two important, open forums: the 2016 Hadoop Summit in Dublin and the 52nd Bay Area Hadoop User Group (HUG) Meetup.

Yahoo co-hosts Hadoop Summits annually with Hortonworks in both Europe and North America. The Dublin Summit earlier this month was our largest ever held in Europe. Yahoo’s keynote and technical sessions were very well received; in those talks, Sr. Director of Cloud and Big Data Platforms Sumeet Singh spoke about our data tech innovations, including CaffeOnSpark (for distributed deep learning), Omid (for managing transactions atop HBase), Data Sketches (for interactive queries), and leading contributions to the advancement of Apache Hadoop (for better cluster efficiencies), Apache Storm (for releasing 1.0), and Apache HBase (for scaling clusters to host millions of regions). Check out Sumeet’s keynote and deep-dive talks here:
- Hadoop Summit Yahoo 2016 Keynote
- Hadoop Platform at Yahoo: A Year in Review

Last week, we also hosted our second HUG Meetup of the year. It was great to have over 150 industry colleagues join us on campus in Sunnyvale for three presentations on various aspects of Hadoop. We strongly believe there is mutual benefit to sharing our latest learnings and innovations, and invite all those in the community to join the group and future meetups. Watch the presentations here:
- CaffeOnSpark: Distributed Deep Learning on Spark Clusters
- The latest of Apache Hadoop YARN and running your docker apps on YARN
- Demystifying Big Data and Apache Spark

Apache Storm New UI Features

April 13, 2016

By Kishorkumar Patil, Apache Storm PMC and Committer

Apache Storm has come a long way since its release in 2011. Over the past year, much work has been done to improve Storm’s scalability and performance. Many new user-facing features have also been added, enhancing ease of use and debuggability in secure multi-tenant environments. Important user-facing features we’ve implemented at Yahoo include:

1. Dynamic Profiling of JVMs: With hundreds of JVMs running across the cluster, it is important to be able to capture jstacks and heap dumps and run a JVM profiler without logging in to each machine. We now support gathering these inputs directly from the UI.

2. Deep Search in Topology Logs: This enhances users’ ability to search for errors across all logs of live and dead topologies directly from the UI.

3. Routing Logs to Mining Systems: With gigabytes of logs being produced, routing logs to mining systems such as Logstash and Elasticsearch enables users to choose the best tool for log mining, analysis, and visualization.

4. Dynamic Log Level for Debugging: The ability to dynamically define new loggers and set log levels from the UI eliminates the time-consuming and expensive step of re-launching topologies for debugging.

Still Time to Save 20% on Hadoop Summit, Dublin Registration

April 6, 2016

We hope you’re as excited as we are about Hadoop Summit Dublin, being held on April 13-14, 2016 at the Convention Centre Dublin. With a record number of registrations still flowing in, this year’s event will be our biggest and best European Summit to date, and we cannot wait to see you there! In appreciation of your registration, we would like to offer a special 20% discount to anyone else in your organization who would like to attend. How can they take advantage of this offer? Follow these simple steps:

1. Click here
2. Enter promotion code: 16DiscSPYah20
3. Complete the online registration form

A little reminder why your colleagues MUST attend this year’s event:

- Over 90 breakout sessions of cutting-edge education
- Solve your big data challenges with the industry’s who’s who at the Community Showcase. Confirmed sponsors include EMC, Microsoft, BMC, HPE, Teradata, Splunk, Cloudera, MapR, Attunity, Pentaho, Platfora, Trifacta, Big Data Partnership, Datameer, Deep Sense, Talend, Engineering Group, ING, Guru Team, and many, many more
- Upskill by attending Pre-Event Training or a Crash Course
- Join like-minded industry colleagues at a Meetup or Birds of a Feather session
- Celebrate 10 years of Hadoop with us at the Guinness Storehouse – you will not want to miss this party!
- Network, connect, and do business!

Time is running out and this special offer ends at 5pm on April 9, so jump online and register today! Register Now

CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters

February 24, 2016

By Andy Feng (@afeng76), Jun Shi, and Mridul Jain (@mridul_jain), Yahoo Big ML Team

Introduction

Deep learning (DL) is a critical capability required by Yahoo product teams (e.g., Flickr, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). The separate clusters require large datasets to be transferred between them, and introduce unwanted system complexity and end-to-end learning latency.

Figure 1: ML Pipeline with multiple programs on separate clusters

As discussed in our earlier Tumblr post, we believe that deep learning should be conducted in the same cluster along with existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2).

Figure 2: ML Pipeline with single program on one cluster

CaffeOnSpark: API & Configuration and CLI

CaffeOnSpark is designed to be a Spark deep learning package. Spark MLlib supports a variety of non-deep-learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capability that Spark MLlib currently lacks, and CaffeOnSpark is designed to fill that gap. The CaffeOnSpark API supports DataFrames so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model, or features from intermediate layers, for results and data analysis using MLlib or SQL.

Figure 3: CaffeOnSpark as a Spark Deep Learning package

1:   def main(args: Array[String]): Unit = {
2:   val ctx = new SparkContext(new SparkConf())
3:   val cos = new CaffeOnSpark(ctx)
4:   val conf = new Config(ctx, args).init()
5:   val dl_train_source = DataSource.getSource(conf, true)
6:   cos.train(dl_train_source)
7:   val lr_raw_source = DataSource.getSource(conf, false)
8:   val extracted_df = cos.features(lr_raw_source)
9:   val lr_input_df = extracted_df.withColumn("Label", cos.floatarray2doubleUDF(extracted_df(conf.label)))
10:     .withColumn("Feature", cos.floatarray2doublevectorUDF(extracted_df(conf.features(0))))
11:  val lr = new LogisticRegression().setLabelCol("Label").setFeaturesCol("Feature")
12:  val lr_model = lr.fit(lr_input_df)
13:  lr_model.write.overwrite().save(conf.outputPath)
14: }

Figure 4: Scala application using both CaffeOnSpark and MLlib

The Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:
- L1-L4 … You initialize a Spark context, and use it to create CaffeOnSpark and configuration objects.
- L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.
- L7-L8 … The learned DL model is applied to extract features from a feature dataset on HDFS.
- L9-L12 … MLlib uses the extracted features to perform non-deep learning (more specifically, logistic regression for classification).
- L13 … You could save the classification model onto HDFS.

As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates unwanted data movement in traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big data and massive computation power are critical for DL to find meaningful insights in a timely manner.
CaffeOnSpark uses the same configuration files for solvers and neural networks as standard Caffe does. As illustrated in our example, the neural network will have a MemoryData layer with 2 extra parameters:

1. source_class, specifying a data source class
2. source, specifying the dataset location

The initial CaffeOnSpark release has several built-in data source classes (including com.yahoo.ml.caffe.LMDB for LMDB databases and com.yahoo.ml.caffe.SeqImageDataSource for Hadoop sequence files). Users could easily introduce customized data source classes to interact with existing data formats.

CaffeOnSpark applications are launched by standard Spark commands, such as spark-submit. Here are two examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model and save it onto HDFS. The second command is a custom Spark application that embeds CaffeOnSpark along with MLlib.

First command:

spark-submit \
    --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
    --num-executors 2 \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train -persistent \
    -conf caffenet_train_solver.prototxt \
    -model hdfs:///sample_images.model \
    -devices 2

Second command:

spark-submit \
    --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \
    --num-executors 2 \
    --class com.yahoo.ml.caffe.examples.MyMLPipeline \
    caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -features fc8 \
    -label label \
    -conf caffenet_train_solver.prototxt \
    -model hdfs:///sample_images.model \
    -output hdfs:///image_classifier_model \
    -devices 2

System Architecture

Figure 5: System Architecture

Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU or CPU devices within the Spark executor, by invoking a JNI layer with fine-grained memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate with each other via an MPI allreduce-style interface over TCP/Ethernet or RDMA/InfiniBand. This Spark+MPI architecture enables CaffeOnSpark to achieve performance similar to dedicated deep learning clusters. Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark allows the training state to be snapshotted periodically, so that we can resume from the previous state after a failure of a CaffeOnSpark job.

Open Source

In the last several quarters, Yahoo has applied CaffeOnSpark to several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements in image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope Flickr Creative Commons 100M dataset on Hadoop clusters. CaffeOnSpark is beneficial to the deep learning community and the Spark community. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at github.com/yahoo/CaffeOnSpark under the Apache 2.0 license. CaffeOnSpark can be tested on an AWS EC2 cloud or on your own Spark clusters. Please find the detailed instructions in the Yahoo GitHub repository, and share your feedback at bigdata@yahoo-inc.com. Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.

Hadoop Turns 10

February 5, 2016

by Peter Cnudde, VP of Engineering

It is hard to believe that 10 years have already passed since Hadoop was started at Yahoo. We initially applied it to web search, but since then, Hadoop has become central to everything we do at the company. Today, Hadoop is the de facto platform for processing and storing big data for thousands of companies around the world, including most of the Fortune 500. It has also given birth to a thriving industry around it, comprised of a number of companies who have built their businesses on the platform and continue to invest and innovate to expand its capabilities.

At Yahoo, Hadoop remains a cornerstone technology on which virtually every part of our business relies to power our world-class products and deliver user experiences that delight more than a billion users worldwide. Whether it is content personalization for increasing engagement, ad targeting and optimization for serving the right ad to the right consumer, new revenue streams from native ads and mobile search monetization, data processing pipelines, mail anti-spam, or search assist and analytics – Hadoop touches them all.

When it comes to scale, Yahoo still boasts one of the largest Hadoop deployments in the world. From a footprint standpoint, we maintain over 35,000 Hadoop servers as a central hosted platform running across 16 clusters with a combined 600 petabytes in storage capacity (HDFS), allowing us to execute 34 million monthly compute jobs on the platform. But we aren’t stopping there, and we actively collaborate with the Hadoop community to further push the scalability boundaries and advance technological innovation. We have historically used MapReduce to power batch-oriented processing, but continue to invest in and adopt low-latency data processing stacks on top of Hadoop, such as Storm for stream processing, and Tez and Spark for faster batch processing.

What’s more, the applications of these innovations have spanned the gamut – from cool and fun features, like Flickr’s Magic View, to one of our most exciting recent projects that involves combining Apache Spark and Caffe. The project allows us to leverage GPUs to power deep learning on Hadoop clusters. This custom deployment bridges the gap between HPC (High Performance Computing) and big data, and is helping position Yahoo as a frontrunner in the next generation of computing and machine learning. We’re delighted by the impact the platform has made on the big data movement, and can’t wait to see what the next 10 years have in store. Cheers!

High Availability in Omid

February 4, 2016

By Edward Bortnikov, Idit Keidar, Ohad Shacham (Search Systems, Yahoo Labs), and Francisco Perez-Sorrosal (Yahoo Search)

Omid, discussed in detail in our previous posts, offers transactional access to data persistently stored in HBase. Here, we explain how Omid is made highly available (HA). Omid’s availability is obviously critical for the smooth operation of its applications, and should thus not be inferior to the availability guarantees of the underlying HBase store. High availability is a brand new feature in Omid. In very high-end Omid-powered applications, the conjunction of Omid and HBase is expected to work round the clock, and exhibit a mean time to recover (MTTR) of just a few seconds. Moreover, any measures taken for high availability should not hamper performance in the normal, fault-free case. Omid supports both HA and non-HA modes. The latter serves settings in which the system administrator prefers manual recovery, and a longer MTTR can be tolerated; for example, these can be non-critical infrastructures where the additional resources for running a backup TSO cannot be spared.

High Availability via Primary-Backup Architecture

Omid is designed around a centralized transaction processing service (the transaction status oracle, or TSO), which is responsible for serializing transaction commit points and resolving inter-transaction conflicts. This design renders the TSO critical for the entire system’s availability. Our focus is thus on the high-availability architecture behind the TSO. As with most HA solutions, it is expected to satisfy two requirements: (1) low MTTR, and (2) negligible impact on the system’s mainstream (failure-free) operation.

Omid’s HA solution is based on the primary-backup paradigm: the TSO is implemented as a process pair consisting of a primary process and a backup process. The former serves all client requests, whereas the latter is in hot-standby mode, ready to take over if the primary fails. The process of transferring the client-serving responsibilities from the primary to the backup is called failover. Failure detection is timeout-based – namely, if the primary TSO does not re-assert its existence within a configured period, it is deemed failed, and the backup starts acting as a new primary.

Note that the primary and backup run independently on different machines, and the time it takes the primary to inform the backup that it is alive can be unpredictable due to processing delays (e.g., garbage-collection stalls, long I/O operations) and unexpected network failures. On the other hand, in order to provide a low MTTR, we cannot set the timeout conservatively so as to ensure that a live primary is never detected as faulty. We therefore have to account for the case that the backup performs a failover and takes over the service while the primary is operational. To this end, we use a Zookeeper object to track the current primary. The primary regularly re-asserts itself, unless it sees that it has been supplanted; the backup constantly tracks this object, and if the current primary becomes stale, updates the object to reflect the fact that it is now the primary. The primary TSO advertises its identity to clients, also via Zookeeper. This way, the Omid library learns about the new primary upon failover and facilitates reconnection. Client applications must learn the outcome of pending commit requests to the old primary before retrying a transaction, in order to avoid data corruption.
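To make the timeout-based detection concrete, here is a minimal, self-contained Scala sketch that models the shared "current primary" record and the re-assert/take-over logic just described. It is an illustration only: the real Omid TSO keeps this record in a ZooKeeper znode, and every name below is invented for the sketch rather than taken from the Omid codebase.

  import java.util.concurrent.atomic.AtomicReference

  // Toy stand-in for the ZooKeeper object that tracks the current primary TSO.
  final case class PrimaryRecord(tsoId: String, lastAssertedMs: Long)

  final class PrimaryTracker(timeoutMs: Long) {
    private val current = new AtomicReference[PrimaryRecord](null)

    // Called periodically by the primary: refresh the record unless it has been supplanted.
    // Returns false if another TSO has taken over (or a race is detected), in which case
    // the caller must halt ("commit suicide"), as described above.
    def reassert(tsoId: String, nowMs: Long): Boolean = {
      val rec = current.get()
      if (rec == null || rec.tsoId == tsoId)
        current.compareAndSet(rec, PrimaryRecord(tsoId, nowMs))
      else
        false
    }

    // Called by the backup: if the primary's record has gone stale, take over (failover).
    def maybeTakeOver(backupId: String, nowMs: Long): Boolean = {
      val rec = current.get()
      val stale = rec == null || nowMs - rec.lastAssertedMs > timeoutMs
      stale && current.compareAndSet(rec, PrimaryRecord(backupId, nowMs))
    }
  }

Because detection is timeout-based, a false takeover (a live primary deemed stale) looks the same to the backup as a genuine one; the sections below explain how Omid stays correct in that window.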
The Failover Challenge

A reliable system must honor all the operations successfully completed in the past, regardless of failovers. Namely, if a transaction receives a success response to its commit request, then future transactions must observe its updates. On the other hand, if a transaction aborts for whatever reason, then no future transaction should see its updates. In Omid, the TSO allocates monotonically increasing commit timestamps to committing transactions. In addition, when a transaction begins, it obtains a read timestamp, which reflects the commit timestamp of the last transaction to commit before it began. The transaction then reads the latest version of each object that does not exceed its read timestamp. As explained in our first post, this yields a correctness semantics called snapshot isolation.

The critical part of system state that affects correctness is the persistent commit table (CT), which reliably stores the mapping from transaction ids (txid) to commit timestamps (commit ts). The state recorded in the CT captures the system’s guarantee to its clients. As described in the previous post, a transaction is committed if and only if a (txid, commit ts) pair for it exists in the CT. Today, we will scrutinize this approach in a failure-prone world.

The key challenge faced by HA systems is known as split brain in the theory of distributed systems - the risk of conflicting updates occurring independently at distinct places. In primary-backup systems, split brain manifests when the backup detects the primary as faulty whereas the latter is either still operating or the operations undertaken by it are in the process of taking effect. If treated naively, such lack of synchronization may lead to race conditions that ultimately affect the system’s correctness. Let us take a closer look at this challenge now.

There are scenarios in which the primary TSO can be falsely detected as failed, for example, due to Java garbage collection stalls. The system can therefore end up with two concurrent TSOs. The primary TSO therefore actively checks whether a backup has replaced it, and if so, “commits suicide”, i.e., halts. However, it is still possible to have a (short) window between the failover and the primary’s discovery of the emergence of a new primary. When a TSO fails, there may be some pending transactions that began with it (i.e., performed their begin transaction using this TSO) and did not commit (they might have either not attempted to commit yet, or may have attempted to commit with the old TSO, but the TSO did not complete logging their commit in the CT). Such pending transactions are deemed aborted.

To prevent new transactions from seeing partial updates of transactions handled by the old TSO, the new TSO needs to employ timestamps that exceed all those committed (or that might still be committed) by the old TSO. However, this separation is challenged by the potential concurrency of two TSOs. For example, if a TSO fails immediately after issuing a write to the CT that takes nonzero time, an old transaction may end up committing after the new TSO has begun handling new transactions. Unless handled carefully, this can cause a new transaction to see partial updates of an old one, as illustrated in the diagram below. To avoid this scenario, we must ensure that once a new transaction obtains a read timestamp, the commit/abort status of all transactions with smaller commit timestamps does not change.
One way to address the above challenge is via mutual exclusion, that is, making sure that at most one TSO commits operations at a time. However, this solution would entail synchronization upon each commit, not only at failover times, which would adversely affect the system’s performance. We therefore forgo this option, and implement a different HA algorithm in Omid. This algorithm does not incur any penalty in failure-free scenarios.

HA Algorithm

The failover algorithm in Omid tolerates temporary overlap between the primary and backup TSOs’ activity periods. To ensure correctness despite such periods, we first have to ensure that the transactions committed by the old TSO and the new TSO are safely separated in time. Namely, (1) all the timestamps assigned by the new TSO exceed all those assigned by the old one, and (2) after a transaction with read timestamp tsr begins, no transaction that will end up with a commit timestamp tsc < tsr can update any additional data items (though it may still commit after this time). Beyond that, we have to allow the new TSO to safely figure out the status of pending transactions served by the old TSO. Recall from our previous post that in Omid, transactions write their updates tentatively before committing, and upon commit, update the written entries with their commit timestamp. Therefore, our failover mechanism has to ensure that (3) when a transaction reads a tentative update, it can determine whether this update will be committed with a timestamp smaller than its read timestamp or not.

One way to meet (1) and (2) is to have the TSO publish the read timestamp it allots as part of initiating a transaction (e.g., via Zookeeper). Before committing, a TSO would check this location. If a timestamp greater than its last committed one is detected, it would deduce that failover has happened, abort the transaction attempting to commit, and halt. This approach is plausible but would impose synchronization overhead on every begin and commit operation. Instead, the HA algorithm implemented in Omid uses locally-checkable leases. Leases are essentially locks that live for a limited time. With them, we can both detect TSO failures and allocate timestamp ranges in big chunks, thereby eliminating the synchronization overhead most of the time.

The challenge in meeting (3) is that transactions cannot consult the old TSO process, as it might have failed. In order to prevent in-flight writes of the old TSO to the CT from “changing the history” retroactively, we allow transactions served by the new TSO to proactively abort ones coming from the previous TSO. Specifically, when a read encounters a tentative update by a transaction that is not present in the CT, it forces that transaction to abort. We call this invalidation, and illustrate it in the following figure. Invalidation is used judiciously, only when failover might be taking place, as discussed in the next section of this post. Technically, the client performs the invalidation using an atomic read-modify-write (RMW) operation (put-if-absent flavor) on the CT, which adds an attribute to the CT record marking that the incomplete transaction has an “invalid” status. Any subsequent attempt to commit it (by adding it to the CT) will see this record, and thus fail. In addition, every read of a tentative update must check its invalid field in the CT, and ignore the update if the transaction has already been invalidated.
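The put-if-absent race between committing and invalidating a transaction can be modeled in a few lines of Scala. This is a simplification for illustration only: in Omid the commit table lives in HBase and the atomic step is the put-if-absent read-modify-write described above; here a ConcurrentHashMap stands in for it, and all names are invented.

  import java.util.concurrent.ConcurrentHashMap

  sealed trait CtEntry
  final case class Committed(commitTs: Long) extends CtEntry
  case object Invalidated extends CtEntry

  final class CommitTableModel {
    private val table = new ConcurrentHashMap[Long, CtEntry]()

    // Writer/TSO side: record the commit point. Fails if a reader already invalidated the txid.
    def tryCommit(txid: Long, commitTs: Long): Boolean =
      table.putIfAbsent(txid, Committed(commitTs)) == null

    // Reader side: a tentative update whose txid is absent from the CT is proactively
    // invalidated, so a late commit by the old TSO can no longer "change history".
    def invalidate(txid: Long): Boolean =
      table.putIfAbsent(txid, Invalidated) == null || table.get(txid) == Invalidated

    // Readers of tentative updates consult this to decide whether to use the value.
    def status(txid: Long): Option[CtEntry] = Option(table.get(txid))
  }

Whichever of tryCommit and invalidate wins the atomic put-if-absent determines the transaction's fate; the loser observes the existing record and fails, which is exactly the property the invalidation mechanism relies on.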
Implementation Details

Let us now dive into some implementation details, and see how they guarantee the system’s safety. The TSO process pair, namely the primary and the backup, coordinate their actions via two shared Zookeeper znodes. One serves for allocating timestamp ranges called epochs. A TSO claims ownership of an epoch before allocating timestamps of this range to transactions. Upon failover, the new primary picks the next epoch in a way that ensures property (1) above. The second znode implements the lease. The lease is active for

Benchmarking Streaming Computation Engines at Yahoo!

December 17, 2015

By the Yahoo Storm Team (in alphabetical order): Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Tom Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Jerry Peng, and Paul Poulosky.

Executive Summary - Due to a lack of real-world streaming benchmarks, we developed one to compare Apache Flink, Apache Storm, and Apache Spark Streaming. Storm 0.10.0, 0.11.0-SNAPSHOT, and Flink 0.10.1 show sub-second latencies at relatively high throughputs, with Storm having the lowest 99th percentile latency. Spark Streaming 1.5.1 supports high throughputs, but at a relatively higher latency.

At Yahoo, we have invested heavily in a number of open source big data platforms that we use daily to support our business. For streaming workloads, our platform of choice has been Apache Storm, which replaced our internally developed S4 platform. We have been using Storm extensively, and the number of nodes running Storm at Yahoo has now reached about 2,300 (and is still growing). Since our initial decision to use Storm in 2012, the streaming landscape has changed drastically. There are now several other noteworthy competitors, including Apache Flink, Apache Spark (Spark Streaming), Apache Samza, Apache Apex, and Google Cloud Dataflow. There is increasing confusion over which package offers the best set of features and which one performs better under which conditions (for instance see here, here, here, and here). To provide the best streaming tools to our internal customers, we wanted to know what Storm is good at and where it needs to be improved compared to other systems. To do this we started looking for stream processing benchmarks we could use for this evaluation, but all of them were lacking in several fundamental areas. Primarily, they did not test with anything close to a real-world use case. So we decided to write one, and released it as open source at https://github.com/yahoo/streaming-benchmarks. In our initial evaluation we decided to limit our test to three of the most popular and promising platforms (Storm, Flink, and Spark), but we welcome contributions for other systems and to expand the scope of the benchmark.

Benchmark Design

The benchmark is a simple advertisement application. There are a number of advertising campaigns, and a number of advertisements for each campaign. The job of the benchmark is to read various JSON events from Kafka, identify the relevant events, and store a windowed count of relevant events per campaign into Redis. These steps attempt to probe some common operations performed on data streams. The flow of operations is as follows (and shown in the following figure):

1. Read an event from Kafka.
2. Deserialize the JSON string.
3. Filter out irrelevant events (based on the event_type field).
4. Take a projection of the relevant fields (ad_id and event_time).
5. Join each event by ad_id with its associated campaign_id. This information is stored in Redis.
6. Take a windowed count of events per campaign and store each window in Redis, along with a timestamp of the time the window was last updated in Redis. This step must be able to handle late events.

The input data has the following schema:

- user_id: UUID
- page_id: UUID
- ad_id: UUID
- ad_type: String in {banner, modal, sponsored-search, mail, mobile}
- event_type: String in {view, click, purchase}
- event_time: Timestamp
- ip_address: String

Producers create events with timestamps marking creation time.
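To make the schema and the processing steps concrete, here is a small Scala sketch of the event record and of steps 3-6 as plain functions. It mirrors the description above rather than the actual benchmark code, so the names (and the choice of which event_type counts as relevant) are illustrative assumptions.

  import java.util.UUID

  final case class AdEvent(
    userId: UUID, pageId: UUID, adId: UUID,
    adType: String, eventType: String,
    eventTime: Long, ipAddress: String)

  final case class ProjectedEvent(adId: UUID, eventTime: Long)

  object BenchmarkSteps {
    // Step 3: keep only the relevant events, based on event_type ("view" chosen for the sketch).
    def relevant(e: AdEvent): Boolean = e.eventType == "view"

    // Step 4: project the two fields the rest of the pipeline needs.
    def project(e: AdEvent): ProjectedEvent = ProjectedEvent(e.adId, e.eventTime)

    // Step 5: join ad_id -> campaign_id (the benchmark looks this mapping up in Redis).
    def joinCampaign(e: ProjectedEvent, adToCampaign: Map[UUID, UUID]): Option[(UUID, Long)] =
      adToCampaign.get(e.adId).map(campaignId => (campaignId, e.eventTime))

    // Step 6: truncate the creation timestamp to the start of its 10-second window,
    // which is the key used for the per-campaign windowed count.
    def windowStart(eventTime: Long, windowMs: Long = 10000L): Long =
      eventTime - (eventTime % windowMs)
  }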
Truncating this timestamp to a particular digit gives the begin-time of the time window the event belongs in. In Storm and Flink, updates to Redis are written periodically, but frequently enough to meet a chosen SLA. Our SLA was 1 second, so once per second we wrote updated windows to Redis. Spark operated slightly differently due to great differences in its design. There are more details on that in the Spark section. Along with the data, we record the time at which each window in Redis was last updated. After each run, a utility reads windows from Redis and compares the windows’ times to their last_updated_at times, yielding a latency data point. Because the last event for a window cannot have been emitted after the window closed, but will be emitted very shortly before, the difference between a window’s time and its last_updated_at time, minus its duration, represents the time it took for the final tuple in a window to go from Kafka to Redis through the application:

window.final_event_latency = (window.last_updated_at – window.timestamp) – window.duration

This is a bit rough, but this benchmark was not designed to produce fine-grained numbers on these engines, but rather to provide a high-level view of their behavior.

Benchmark setup

- 10-second windows
- 1-second SLA
- 100 campaigns
- 10 ads per campaign
- 5 Kafka nodes with 5 partitions
- 1 Redis node
- 10 worker nodes (not including coordination nodes like Storm’s Nimbus)
- 5-10 Kafka producer nodes
- 3 ZooKeeper nodes

Since the Redis node in our architecture only performs in-memory lookups using a well-optimized hashing scheme, it did not become a bottleneck. The nodes are homogeneously configured, each with two Intel E5530 processors running at 2.4GHz, with a total of 16 cores (8 physical, 16 hyperthreading) per node. Each node has 24GiB of memory, and the machines are all located within the same rack, connected through a gigabit Ethernet switch. The cluster has a total of 40 nodes available. We ran multiple instances of the Kafka producers to create the required load, since individual producers begin to fall behind at around 17,000 events per second. In total, we used anywhere between 20 and 25 nodes in this benchmark. The use of 10 workers for a topology is near the average number we see being used by topologies internal to Yahoo. Of course, our Storm clusters are larger in size, but they are multi-tenant and run many topologies.

To begin the benchmarks, Kafka is cleared, Redis is populated with initial data (the ad_id to campaign_id mapping), the streaming job is started, and then, after a bit of time to let the job finish launching, the producers are started with instructions to produce events at a particular rate, giving the desired aggregate throughput. The system was left to run for 30 minutes before the producers were shut down. A few seconds were allowed for all events to be processed before the streaming job itself was stopped. The benchmark utility was then run to generate a file containing a list of window.last_updated_at – window.timestamp numbers. These files were saved for each throughput we tested and then used to generate the charts in this document.

Flink

The benchmark for Flink was implemented in Java using Flink’s DataStream API. The Flink DataStream API has many similarities to Storm’s streaming API. For both Flink and Storm, the dataflow can be represented as a directed graph. Each vertex is a user-defined operator and each directed edge represents a flow of data.
Storm’s API uses spouts and bolts as its operators, while Flink uses map and flatMap, as well as many pre-built operators such as filter, project, and reduce. Flink uses a mechanism called checkpointing to guarantee processing. Unless checkpointing is used in the Flink job, Flink offers at most once processing, similar to Storm with acking turned on. For the Flink benchmark we did not use checkpointing. Notable configs we used in Flink are listed below:

- taskmanager.heap.mb: 15360
- taskmanager.numberOfTaskSlots: 16

The Flink version of the benchmark uses the FlinkKafkaConsumer to read data in from Kafka. The data read in from Kafka, which is a JSON-formatted string, is then deserialized and parsed by a custom-defined flatMap operator. Once deserialized, the data is filtered via a custom-defined filter operator. Afterwards, the filtered data is projected using the project operator. From there, the data is joined with data in Redis by a custom-defined flatMap function. Lastly, the final results are calculated from the data and written to Redis.

The rate at which Kafka emitted data events into the Flink benchmark was varied from 50,000 events/sec to 170,000 events/sec. For each Kafka emit rate, the percentile latency for a tuple to be completely processed in the Flink benchmark is illustrated in the graph below. The percentile latency for all Kafka emit rates is relatively the same. The percentile latency rises linearly until around the 99th percentile, where the latency appears to increase exponentially.

Spark

For the Spark benchmark, the code was written in Scala. Since the micro-batching methodology of Spark is different than the pure streaming nature of Storm, we needed to rethink parts of the benchmark. The Storm and Flink benchmarks would update the Redis database once a second to try and meet our SLA, keeping the intermediate update values in a local cache. As a result, the batch duration in the Spark Streaming version was set to 1 second, at least for smaller amounts of traffic. We had to increase the batch duration for larger throughputs.

The benchmark is written in a typical Spark style using DStreams. DStreams are the streaming equivalent of regular RDDs, and create a separate RDD for every micro batch. Note that in the subsequent discussion, we use the term “RDD” instead of “DStream” to refer to the RDD representation of the DStream in the currently active microbatch. Processing begins with the direct Kafka consumer included with Spark 1.5. Since the Kafka input data in our benchmark is stored in 5 partitions, this Kafka consumer creates a DStream with 5 partitions as well. After that, a number of transformations are applied on the DStreams, including maps and filters. The transformation involving joining data with Redis is a special case. Since we do not want to create a separate connection to Redis for each record, we use a mapPartitions operation that can give control of a whole RDD partition to our code. This way, we create one connection to Redis and use this single connection to query information from Redis for all the events in that RDD partition. The same approach is used later when we update the final results in Redis. It should be noted that our writes to Redis were implemented as a side effect of the execution of the RDD transformation in order to keep the benchmark simple, so this would not be compatible with exactly-once semantics. We found that with high enough throughput, Spark was not able to keep up.
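The mapPartitions pattern mentioned above, where one Redis connection is created per partition rather than per record, looks roughly like the sketch below. The RedisClient class and lookupCampaign method are hypothetical stand-ins for the benchmark's actual Redis code; only the mapPartitions structure is the point.

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical connection wrapper; the real benchmark uses a Redis client library.
  final class RedisClient {
    def lookupCampaign(adId: String): String = "campaign-for-" + adId // stub lookup
    def close(): Unit = ()
  }

  object MapPartitionsSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
      val adIds = sc.parallelize(Seq("ad1", "ad2", "ad3"), numSlices = 2)

      // One connection per partition instead of one per record.
      val joined = adIds.mapPartitions { events =>
        val redis = new RedisClient()              // opened once per partition
        // Materializing keeps the sketch simple; a real job would close the connection
        // only after the iterator is fully consumed.
        val results = events.map(id => (id, redis.lookupCampaign(id))).toList
        redis.close()
        results.iterator
      }

      joined.collect().foreach(println)
      sc.stop()
    }
  }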
At 100,000 messages per second the latency greatly increased. We considered adjustments along two control dimensions to help Spark cope with increasing throughput.

The first is the micro-batch duration. This is a control dimension that is not present in a pure streaming system like Storm. Increasing the duration increases latency while reducing overhead, and therefore increasing maximum throughput. The challenge is that choosing the optimal batch duration that minimizes latency while allowing Spark to handle the throughput is a time-consuming process. Essentially, we have to set a batch duration, run the benchmark for 30 minutes, check the results, and decrease/increase the duration.

The second dimension is parallelism. However, increasing parallelism is easier said than done in the case of Spark. For a true streaming system like Storm, one bolt instance can send its results to any number of subsequent bolt instances by using a random shuffle. To scale, one can increase the parallelism of the second bolt. In the case of a micro-batch system like Spark, we need to perform a reshuffle operation similar to how intermediate data in a Hadoop MapReduce program is shuffled and merged across the cluster. But the reshuffling itself introduces considerable overhead. Initially, we thought our operations were CPU-bound, and so the benefits of reshuffling to a higher number of partitions would outweigh the cost of reshuffling. Instead, we found the bottleneck to be scheduling, and so reshuffling only added overhead. We suspect that at higher throughput rates or with operations that are CPU-bound, the reverse would be true.

The final results are interesting. There are essentially three behaviors for a Spark workload depending on the window duration. First, if the batch duration is set sufficiently large, the majority of the events will be handled within the current micro batch. The following figure shows the resulting percentile processing graph for this case (100K events, 10-second batch duration). But whenever 90% of events are processed in the first batch, there is a possibility of improving latency. By reducing the batch duration sufficiently, we get into a region where the incoming events are processed within 3 or 4 subsequent batches. This is the second behavior, in which the batch duration puts the system on the verge of falling behind, but it is still manageable, and results in better latency. This situation is shown in the following figure for a sample throughput rate (100K events, 3-second batch duration). Finally, the third behavior is when Spark Streaming falls behind. In this case, the benchmark takes a few minutes after the input data finishes to process all of the events. This situation is shown in the following figure. Under this undesirable operating region, Spark spills lots of data onto disks, and in extreme cases we could end up running out of disk space.

One final note is that we tried the new back pressure feature introduced in Spark 1.5. If the system is in the first operating region, enabling back pressure does nothing. In the second operating region, enabling back pressure results in longer latencies. The third operating region is where back pressure shows the most negative impact. It changes the batch length, but Spark still cannot cope with the throughput and falls behind. This is shown in the next figures. Our experiments showed that the current back pressure implementation did not help our benchmark, and as a result we disabled it.
Performance without back pressure (top), and with back pressure enabled (bottom). The latencies with back pressure enabled are worse (70 seconds vs 120 seconds). Note that both of these results are unacceptable for a streaming system, as both fall behind the incoming data. Batch duration was set to 2 seconds for each run, with a throughput of 130,000 events per second.

Storm

Storm’s benchmark was written using the Java API. We tested both the Apache Storm 0.10.0 release and a 0.11.0 snapshot. The snapshot’s commit hash was a8d253a. One worker process per host was used, and each worker was given 16 tasks to run in 16 executors - one for each core.

Storm 0.10.0:

Storm 0.11.0:

Storm compared favorably to both Flink and Spark Streaming. Storm 0.11.0 beat Storm 0.10.0, showing the optimizations that have gone in since the 0.10.0 release. However, at high throughput both versions of Storm struggled. Storm 0.10.0 was not able to handle throughputs above 135,000 events per second. Storm 0.11.0 performed similarly until we disabled acking. In the benchmarking topology, acking was used for flow control but not for processing guarantees. In 0.11.0, Storm added a simple back pressure controller, allowing us to avoid the overhead of acking. With acking enabled, 0.11.0 performed terribly at 150,000/s—slightly better than 0.10.0, but still far worse than anything else. With acking disabled, Storm even beat Flink for latency at high throughput. However, with acking disabled, the ability to report and handle tuple failures is disabled also.

Conclusions and Future Work

It is interesting to compare the behavior of these three systems. Looking at the following figure, we can see that Storm and Flink both respond quite linearly. This is because these two systems try to process an incoming event as it becomes available. On the other hand, Spark Streaming behaves as a stepwise function, a direct result of its micro-batching design. The throughput vs latency graph for the various systems is maybe the most revealing, as it summarizes our findings with this benchmark. Flink and Storm have very similar performance, while Spark Streaming, though it has much higher latency, is expected to be able to handle much higher throughput. We did not include the results for Storm 0.10.0 and 0.11.0 with acking enabled beyond 135,000 events per second, because they could not keep up with the throughput. The resulting graph had the final point for Storm 0.10.0 in the 45,000 ms range, dwarfing every other line on the graph. The longer the topology ran, the higher the latencies got, indicating that it was losing ground.

All of these benchmarks, except where otherwise noted, were performed using default settings for Storm, Spark, and Flink, and we focused on writing correct, easy-to-understand programs without optimizing each to its full potential. Because of this, each of the six steps was a separate bolt or spout. Flink and Spark both do operator combining automatically, but Storm (without Trident) does not. What this means for Storm is that events go through many more steps and have a higher overhead compared to the other systems. In addition to further optimizations to Storm, we would like to expand the benchmark in terms of functionality, and to include other stream processing systems like Samza and Apex. We would also like to take into account fault tolerance, processing guarantees, and resource utilization. The bottom line for us is Storm did great.
Writing topologies is simple, and it’s easy to get low latency comparable to or better than Flink up to fairly high throughputs. Without acking, Storm even beat Flink at very high throughput, and we expect that with further optimizations like combining bolts, more intelligent routing of tuples, and improved acking, Storm with acking enabled would compete with Flink at very high throughput too. The competition between near-real-time streaming systems is heating up, and there is no clear winner at this point. Each of the platforms studied here has its advantages and disadvantages. Performance is but one factor among others, such as security or integration with tools and libraries. Active communities for these and other big data processing projects continue to innovate and benefit from each other’s advancements. We look forward to expanding this benchmark and testing newer releases of these systems as they come out.

Omid Architecture and Protocol

November 7, 2015

By Edward Bortnikov, Idit Keidar (Search Systems, Yahoo Labs), and  Francisco Perez-Sorrosal (Yahoo Search) In our previous post, we introduced Omid, Yahoo’s new efficient and scalable transaction processing platform for Apache HBase. In this post, we first overview Omid’s architecture concepts, and then delve into the system design and protocols. Omid’s client API offers abstractions both for control and for data access. The abstractions are offered by a client library, and transaction consistency is ensured by a central entity called Transaction Status Oracle (TSO), whose operation we explain in the post. The Omid client library accesses the data directly in HBase, and interacts with the TSO only to begin, commit or rollback transactions. This separation between the control and the data planes is instrumental for system scalability. For simplicity, we defer discussion of the TSO’s reliability and internal scalability to future posts; for now, let us assume that this component scales infinitely and never fails. Architecture Overview As we detailed in the previous blog post, Omid provides a lock-free Snapshot Isolation (SI) implementation that scales far better than traditional two-phase locking approaches. Namely, transactions can execute concurrently until commit, at which time write-write conflicts are resolved. Ties between two transactions that overlap both in time and in space (so committing both of them would violate SI) are broken by aborting one of them, usually the one that attempts to commit later. The simplest way to break ties is via a central arbiter that serializes all commit requests and resolves conflicts based on this order. Distributed implementations, for example, two-phase commit, are more expensive, complex, and error-prone. Omid therefore takes the simpler, centralized approach. It employs a centralized management service for transactions, the Transaction Status Oracle, or TSO, which coordinates the actions of clients. The main task of the TSO is to detect write-write conflicts among concurrent transactions, as needed for ensuring SI semantics. Similarly to any centralized service, the TSO is vulnerable to becoming a single-point-of-failure and a performance bottleneck. We will discuss Omid’s approach to high availability (HA) and scalability in a forthcoming post. Here, we describe only the operation of a single TSO in failure-free scenarios. Omid leverages the multi-versioning support in the underlying HBase key-value store, which allows transactions to read consistent snapshots of changing data as needed for SI. Specifically, when the item associated with an existing key is overwritten, a new version (holding the key, its new value, and a new version number)  is created while the previous version persists. An old version might be required as long as there is some active transaction that had begun before the transaction that overwrote this version has committed. Though this may take a while, overwritten versions eventually become obsolete. Omid takes advantage of HBase’s coprocessors to implement a garbage-collecting algorithm in order to free up the disk space taken up by such obsolete versions when doing compactions. In addition to storing the application data, HBase is further used for persistent storage of transaction-related metadata, which is accessed only by the transactional API and not exposed to the user, as will be described shortly. One of the functions of the TSO is generating version numbers (timestamps) for all client transactions. 
This is achieved via a subcomponent, the Timestamp Oracle, which implements a central logical clock. To preserve correctness across shutdown/restart scenarios, the Timestamp Oracle maintains an upper bound (maximum timestamp) of this clock in a reliable repository, which can be either an HBase table or a znode in Apache ZooKeeper.

The following diagram summarizes Omid’s system components and the interactions among them. Note that the TSO is only involved in the control path (for transaction begin/commit/rollback), whereas the Omid clients interact with HBase directly in the data path. This separation is paramount for scalability.

Fig 1: Omid components. Omid clients use the TSO to create transactional contexts. Clients also access data residing in HBase data tables transactionally. The TSO is in the control path for conflict detection when transactions are completed. Data is multi-versioned, and a garbage-collecting coprocessor cleans up obsolete versions. The TSO and the Timestamp Oracle maintain some persistent and transient metadata in HBase (although it can also be stored in other storage systems).

Data and Metadata

As noted above, user data resides in HBase and is multi-versioned. An item’s version number is the transaction identifier, txid, of the transaction that wrote it. The txid is returned by the TSO in response to a begin call.

Omid also exploits HBase for storing persistent metadata, which comes in two flavors. First, it augments each data item with a shadow cell, which indicates the commit status of the transaction that wrote it. Initially, when an item is written during a transaction, its shadow cell is set to tentative, i.e., potentially uncommitted. At commit time, the client obtains from the TSO the transaction’s commit timestamp (commit ts) and writes this timestamp to the shadow cells of its writeset, which contains all the items written by the transaction. In addition, Omid manages a commit table (CT) tracking the commit timestamps of transactions. The data maintained in the CT is transient, being removed by the client when the transaction completes. The diagram below summarizes Omid’s data model and flow.

Fig 2: Omid data model. Clients use the TSO to obtain a transaction identifier (txid) when they begin a transaction, and a commit timestamp (commit ts) when they commit it. Data is stored in HBase with the txid as the version number, and the commit ts in a shadow cell. Before the commit ts is set, the written data is tentative, and the client consults the Commit Table to determine its status.

Transaction Protocol Overview

The begin API produces a unique txid, which is used by all subsequent requests. In Omid, the txid also serves as the read (snapshot) timestamp. The commit API produces a commit ts, which determines the transaction’s order in the sequence of committed transactions. Both timestamps are based on a logical clock maintained by the Timestamp Oracle. Recall from our previous post that SI allows transactions to appear to execute all reads at one logical point and all writes at another (later) point. In Omid, the txid is the time of the logical clock when the transaction begins, and it determines which versions the transaction will read; the commit ts, on the other hand, is the logical time of the commit, and all the transaction’s writes are associated with this time (via the shadow cells).
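To make the roles of the two timestamps concrete, here is a minimal sketch of the snapshot-visibility rule a reader applies to a multi-versioned item. It is an illustration only; the class, record, and method names are hypothetical and are not taken from the Omid code base.

// Illustrative sketch of the snapshot-visibility rule; names are hypothetical.
final class SnapshotVisibility {

    // One version of a data item: the writer's txid (used as the HBase version
    // number) and the commit ts copied into the shadow cell once the writer
    // completes (null while the version is still tentative).
    record Version(long writerTxid, Long shadowCommitTs) {}

    interface CommitTable {
        // Commit ts of a committed-but-not-yet-completed transaction,
        // or null if no commit record exists for this txid.
        Long commitTimestamp(long txid);
    }

    // A version belongs to the snapshot of a reader with snapshot timestamp
    // snapshotTxid iff its writer committed with a commit ts <= snapshotTxid.
    static boolean visibleTo(Version v, long snapshotTxid, CommitTable ct) {
        if (v.shadowCommitTs() != null) {                    // writer already completed
            return v.shadowCommitTs() <= snapshotTxid;
        }
        Long commitTs = ct.commitTimestamp(v.writerTxid());  // tentative: consult the CT
        return commitTs != null && commitTs <= snapshotTxid;
    }
}

The complete read path, including how tentative versions are resolved and healed, is spelled out next.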
Since transaction commit needs to be an atomic step, an application executing a transaction proceeds as follows: it first tentatively writes all its data items through the Omid client API (e.g., using transactional put or delete operations), leaving the corresponding shadow cells without a commit timestamp; it then atomically commits the transaction (provided there are no conflicts) via the TSO; and finally, once the client has received the commit acknowledgment from the TSO, it updates the shadow cells of these data items with the commit ts. A transaction is considered complete once it finishes updating the shadow cells of all items in its writeset. Only at that point is control returned to the application.

The post-commit completion approach creates a window during which the transaction is already committed but its writes are still tentative. Moreover, a client may be delayed or even fail after committing a transaction and before completing it. Consider an incomplete committed transaction T. Transactions that begin during T’s completion phase obtain a txid that is larger than T’s commit ts, and yet during their operation they may encounter in the HBase data tables some items with tentative writes by T and others with completed writes by T. In order to ensure that such transactions see T’s updates consistently, Omid tracks the list of incomplete committed transactions in a persistent Commit Table (CT), which is also stored in HBase. Each entry in the CT maps a committed transaction’s id to its commit timestamp. The act of writing the (txid, commit ts) pair to the CT makes the transaction durable, regardless of subsequent client failures, and is considered the commit point of the transaction. Transactions that encounter tentative writes during their reads refer to the CT in order to find out whether the value has been committed or not. In case it has, they help complete the write. This process is called healing, and is an optimization that may reduce accesses to the commit table by other transactions.

Fig 3: Omid Transaction Flow (TX = Transaction; TS = Timestamp)

Client-Side Operation

The Omid client depicted in the previous figure executes the following actions on behalf of a client application using transactions:

Begin: The client obtains from the TSO a start timestamp that exceeds all the write timestamps of committed transactions. This timestamp becomes the transaction identifier (txid). It is also used to select the appropriate versions when reading data.

Get(txid, key): The get operation performs the following actions (in pseudo-code):

    scan versions of key that are lower than txid, from highest to lowest:
        if the version is not tentative:
            if its commit ts does not exceed txid, return its value
        else:
            look up the version's txid in the CT
            if present (the writing transaction has committed):
                update the version's commit ts (healing process)
                if the commit ts does not exceed txid, return the value
            else:
                re-read the version; return the value if it is no longer tentative
                and its commit ts does not exceed txid
        if no value has been returned, continue to the next version

Put(txid, key/value): Adds a new tentative version of the key/value pair, with the txid as the version number.

Commit(txid, writeset): The client requests commit from the TSO for its txid, and provides in writeset the set of keys it wrote to. The TSO assigns the transaction a new commit timestamp and checks for conflicts in its writeset (a conceptual sketch of this check appears at the end of this post). If there are none, it commits the transaction by writing the (txid, commit ts) pair to the CT.
Then the TSO returns control to the client, also providing this (txid, commit ts) pair. Finally, the client adds the commit ts to all data items it wrote (so its writes are no longer tentative) and deletes the txid entry from the CT.

Summary

At Yahoo, we have recently open-sourced the Omid project, a transactional framework for HBase. In this blog post we discussed the architecture and protocols behind Omid. We described the main components involved and their interactions, which enable our framework to provide transactions on top of HBase in an efficient and scalable manner. The system has been implemented with HBase in mind, and therefore our presentation is in HBase terms. That said, the design principles are generic and database-neutral: Omid can be adapted to work with any persistent key-value store with multi-version concurrency control.
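As referenced in the Commit description above, the following is a minimal conceptual sketch of the write-write conflict check performed by the TSO. The class and method names are hypothetical and this is not the actual Omid implementation; among other simplifications, it never evicts old entries from the conflict map.

import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of the TSO's role: hand out snapshot timestamps (txids) and
// commit timestamps from one logical clock, and detect write-write conflicts.
final class TsoSketch {

    private final Map<ByteBuffer, Long> lastCommit = new HashMap<>(); // key -> last commit ts
    private long logicalClock = 0;

    // Begin: hand out a new snapshot timestamp (txid).
    synchronized long begin() {
        return ++logicalClock;
    }

    // Commit: returns the commit ts on success, or -1 if the transaction must abort.
    synchronized long tryCommit(long txid, List<ByteBuffer> writeset) {
        for (ByteBuffer key : writeset) {
            Long committed = lastCommit.get(key);
            if (committed != null && committed > txid) {
                return -1; // a concurrent transaction already committed a write to this key
            }
        }
        long commitTs = ++logicalClock; // exceeds every txid handed out so far
        for (ByteBuffer key : writeset) {
            lastCommit.put(key, commitTs);
        }
        return commitTs;
    }
}

A transaction whose writeset contains a key committed after its snapshot (commit ts greater than its txid) is aborted; otherwise, the TSO records the new commit ts for every key in the writeset and persists the (txid, commit ts) pair in the CT.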

Yahoo’s Open Source Omid Project Brings Scalable Transaction Processing To HBase October 2, 2015

Yahoo’s Open Source Omid Project Brings Scalable Transaction Processing To HBase

Yahoo’s Open Source Omid Project Brings Scalable Transaction Processing To HBase: By Frederic Lardinois (@fredericl)

A while back, Yahoo quietly made the code to Omid, an open source transaction processing system for the Apache HBase Hadoop big data store, available on GitHub. This is the same software the company uses internally to help it power thousands of search transactions per second. Until now, Yahoo remained rather subdued about this project, but with the latest update, launching today, it feels the service is now robust enough for wider deployment and has proven its ability to scale. It’s also 10 times faster than the first version the company released to the public. Yahoo’s director of engineering Ralph Rabbat and senior director of product management Sumeet Singh told me earlier this week that the company hopes that other platforms in the Hadoop and HBase ecosystem will adopt Omid.

Introducing Omid - Transaction Processing for Apache HBase October 1, 2015

Introducing Omid - Transaction Processing for Apache HBase

By Edward Bortnikov (@ebortnik2), Scalable Search Systems Research; Sameer Paranjpye (@sparanjpye), Sr. Search Architect; Ralph Rabbat, Director of Software Development; and Francisco Perez-Sorrosal (@fperezsorrosal), Research Engineer

Welcome to Omid (Hope, in Persian), an open source transaction processing system for Apache HBase. Omid started as a research project at Yahoo back in 2011. Since then, it has matured in many aspects and has been re-architected for scalability and high availability. We have made this new code publicly available at https://github.com/yahoo/omid. This is the first in a series of blog posts that will shed light on Omid’s APIs, administration, design principles, and internals.

Applications that need to bundle multiple read and write operations on HBase into logically indivisible units of work can use Omid to execute transactions with ACID (Atomicity, Consistency, Isolation, Durability) properties, just as they would use transactions in the relational database world. Omid extends the HBase key-value access API with transaction semantics. It can be exercised either directly or via higher-level data management APIs. For example, Apache Phoenix (SQL-on-top-of-HBase) might use Omid as its transaction management component.

ACID transactions are a hugely popular programming model featured by relational databases. While early NoSQL data store implementations did not include transaction support, the need for transactions soon emerged. Today, they are perceived as essential to modern ultra-scalable, dynamic content processing systems. Omid’s system design is inspired in part by Percolator, Google’s dynamic web indexing technology, which reintroduced transactions to the NoSQL world in 2010.

The current version of Omid provides an easy-to-program, easy-to-operate, reliable, high-performance platform, capable of serving transactional web-scale applications based on HBase. The following features make it an attractive choice for system designers:

- Development. Omid is backward-compatible with HBase APIs, making it developer friendly. Minimal extensions are introduced to enable transactions.
- Semantics. Omid implements the popular, well-understood Snapshot Isolation (SI) consistency paradigm that is supported by major SQL and NoSQL technologies (for example, Percolator).
- Scalability. Omid provides a highly scalable, lock-free implementation of SI. To the best of our knowledge, it is the only open source NoSQL platform that can scale beyond 100K transactions per second.
- Reliability. Omid has a high-availability (HA) mode, in which the core service operates as a primary-backup process pair with automatic failover. The HA support has zero overhead on mainstream operation.
- Simplicity. Omid leverages the HBase infrastructure for managing its own metadata. It entails no additional services apart from those used by HBase.
- Track Record. Omid is already in use by very-large-scale production systems at Yahoo.

To start working with Omid, you should be familiar with the key concepts of transaction processing. Wherever appropriate, we will provide the theoretical background required to explain how Omid works. For a deeper understanding, we recommend Gray and Reuter’s book on transaction processing.

A Web-Scale Use Case

At Yahoo, Omid is a foundational technology for Sieve, a content management platform that powers our next-generation search and personalization products.
Sieve essentially acts as a huge processing hub between content feeds and serving systems. It provides an environment for highly customizable, real-time, streamed information processing, with typical discovery-to-service latencies of just a few seconds. In terms of scale and availability, the development of Omid was largely driven by Sieve’s requirements.

Sieve ingests a variety of content channels (e.g., web crawl, social feeds, and proprietary sources) and applies custom workflows to generate data artifacts for target applications (e.g., search indices). In this context, data is streamed through processing tasks that can form complex topologies. Each task consumes one or more data items and produces one or more new items. For example, the document processing task for Web search consumes an HTML page’s content and produces multiple new features for this page (term postings, hyperlinks, clickable text phrases named anchortext, and more). The subsequent link analysis task consumes the set of links for a given page and produces a reverse anchortext index for all link targets.

Sieve stores and processes billions of items. Thousands of tasks execute concurrently as the processed data streams through the system (resulting in tens of thousands of transactions per second). All tasks typically read and write their artifacts from a multi-petabyte shared HBase instance. Since the execution of individual tasks is completely uncoordinated, it is paramount to ensure that each task executes as a logically indivisible (atomic) unit, in isolation from other units, with predictable (consistent) results. Manually handling all possible race conditions and failure scenarios that could arise in a concurrent execution is prohibitively complex. Sieve developers therefore require a programming model that allows them to focus on business logic rather than on system reliability and consistency issues.

Omid provides precisely this building block. It offers an easy-to-use API with sound and well-understood semantics. From an operations perspective, Omid is easy to administer, highly available, and highly scalable, and hence a natural technology choice for a business-critical service like Sieve.

A Quick Tutorial

Transaction processing systems offer application developers the well-known begin and commit APIs to mark transaction boundaries. An application can also abort a non-committed transaction, e.g., in response to an error situation. All database reads and writes in the scope of a committed transaction appear to execute atomically; all reads and writes in the scope of an aborted transaction appear to have never happened.

We are now ready to tip-toe into Omid’s API and see a simple code snippet. All you need to know are two interfaces: TransactionManager and TTable. TransactionManager handles all control operations (begin(), commit(), and abort()) on behalf of the application. In this context, TransactionManager.begin() returns a unique transaction id (txid), to be used in subsequent calls. TTable is a transactional equivalent of HBase’s HTable. It augments HTable’s data access methods (get(), scan(), put(), and delete()) with the txid parameter, to convey the transaction’s context.

The following example demonstrates how to use Omid’s transactional API to modify two rows in an HBase table with ACID guarantees, which is not possible with the standard HBase API. We assume prior HBase programming experience. For background, please see the HBase programming manual.
Omid Code Example

Configuration conf = HBaseConfiguration.create();
TransactionManager tm = HBaseTransactionManager.newBuilder()
                                               .withConfiguration(conf)
                                               .build();
TTable tt = new TTable(conf, "EXAMPLE_TABLE");
byte[] family = Bytes.toBytes("EXAMPLE_CF");
byte[] qualifier = Bytes.toBytes("EXAMPLE_QUAL");

Transaction tx = tm.begin();

Put row1 = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
row1.add(family, qualifier, Bytes.toBytes("VALUE_1"));
tt.put(tx, row1);

Put row2 = new Put(Bytes.toBytes("EXAMPLE_ROW2"));
row2.add(family, qualifier, Bytes.toBytes("VALUE_2"));
tt.put(tx, row2);

tm.commit(tx);

tt.close();
tm.close();

Please refer to this GitHub page for more code examples.

A Case for Snapshot Isolation

The "I" in ACID is for Isolation - preventing concurrently executing transactions from seeing each other’s partial updates. The isolation property is essential for consistency. Informally, it ensures that the information a transaction reads from the database "makes sense" in that it does not mix old and new values. For example, if a Sieve transaction updates the reverse-anchortext feature of multiple pages linked to by the page being processed, then no concurrent transaction may observe the old value of that feature for some of these pages and the new value for the rest. More formally, a system satisfies a consistency model if every externally observable execution history can be explained as a sequence of legal database state transitions.

Omid employs an intuitive yet scalable Snapshot Isolation model to guarantee data consistency when multiple concurrent transactions access the same data elements. This model is implemented in popular database technologies such as Oracle, PostgreSQL, and Microsoft SQL Server. Explaining this design choice requires a small detour into history, and can be skipped on a first reading.

Over the years, the database community has studied several transaction isolation models. The most intuitive one is serializability - a guarantee that the outcome of any parallel execution of transactions can be explained by some serial order. In other words, transactions seem to execute sequentially without overlapping in time (or alternatively, every transaction can be reduced to a point in time at which it takes effect). Serializability implementations were traditionally based on two-phase locking methods that date back to the 70’s. While offering an extremely simple abstraction, serializable systems have been shown to suffer from significant performance bottlenecks due to their lock-based concurrency control. Since the mid-90’s, researchers set out on a quest for models that translate to scalable implementations. A seminal paper by Bernstein et al. suggested a new model, named Snapshot Isolation (SI), which relaxes the correctness criteria of serializability (most applications do not require such strict criteria) and is suitable for lock-free implementations.

In a nutshell, SI allows transactions to be reduced to two points instead of one: a reading point and a writing point. Inconsistent reads do not occur, since each transaction sees a consistent snapshot of the database. This means that for two concurrent transactions T1 and T2, T1 sees either all of T2’s updates to data items it is reading, or none of T2’s updates.
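To make the snapshot property concrete with the API from the tutorial above, here is a minimal sketch that reuses the placeholder table and column names from that example; it is an illustration under those assumptions, not a test from the Omid repository. A reader that began before a concurrent writer committed keeps seeing its original snapshot:

// Illustration only: the reader's snapshot is fixed at begin(), so it does not
// observe the concurrent writer's committed update.
Transaction reader = tm.begin();                    // reader's snapshot fixed here

Transaction writer = tm.begin();
Put update = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
update.add(family, qualifier, Bytes.toBytes("VALUE_1_NEW")); // hypothetical new value
tt.put(writer, update);
tm.commit(writer);                                  // writer commits after reader began

Get read = new Get(Bytes.toBytes("EXAMPLE_ROW1"));
read.addColumn(family, qualifier);
Result result = tt.get(reader, read);               // still returns the value from the reader's snapshot
tm.commit(reader);                                  // read-only transaction, so no conflicts

A transaction that begins after the writer’s commit would instead see all of its updates.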
More precisely, SI guarantees that (1) all reads in a transaction see a consistent snapshot of the database, and (2) a transaction successfully commits only if no update it has made conflicts with any concurrent updates made since that snapshot. As stated before, while SI is slightly different from serializability in terms of correctness (e.g., SI does not avoid the write-skew anomaly), SI implementations offer better performance than strict serializability implementations.

The diagram below illustrates snapshot isolation. In simple terms, two transactions conflict under SI if and only if: 1) they execute concurrently (overlap in time); and 2) they write to the same element of a particular row (spatial overlap). Here, transactions T1 and T2 overlap in time but not in space (their write sets do not contain modifications to the same items); therefore, they can both commit. T2 and T3 overlap in time and in space (both write to R4); therefore, one of them must be aborted to avoid a consistency violation. T4, on the other hand, does not overlap any other transaction in time and can therefore commit.

Snapshot isolation is amenable to scalable implementations. NoSQL datastores that support multi-version concurrency control (MVCC), namely storing multiple versions capturing the history of an item associated with a unique key, are suitable for implementing SI. In such a system, transactions read data via immutable snapshots based on historic versions, and write data by creating new versions. Therefore, reads and writes need no coordination. Furthermore, writes can proceed without coordinating with concurrent transactions until they attempt to commit, at which time inter-transaction order is established and conflicts are detected. Fortunately, HBase is natively multi-versioned: clients can control version numbers (timestamps) and retrieve data through snapshots associated with a given timestamp. While this API is complex to use in regular applications, it is a perfect fit for Omid, which exploits it to provide a simple and clean SI abstraction. We leave the implementation details for future posts.

Acknowledgements

We would like to acknowledge all the contributions to Omid, in concept and code, since its early days. Our thanks go to Daniel Gomez Ferro, Eshcar Hillel, Flavio Junqueira, Idit Keidar, Ivan Kelly, Francis Christopher Liu, Matthieu Morel, Benjamin (Ben) Reed, Ohad Shacham, Maysam Yabandeh, and the whole Sieve team.

Summary

This post introduced Omid, an open-source transaction processing system for HBase. We presented a web-scale application at Yahoo for which Omid is a critical building block, saw a simple code snippet that exercises Omid, and finally discussed Omid’s snapshot isolation model for preserving data consistency. Our next post in this series will provide an overview of Omid’s architecture and administration basics.

Large Scale Distributed Deep Learning on Hadoop Clusters September 25, 2015

Large Scale Distributed Deep Learning on Hadoop Clusters

By Cyprien Noel, Jun Shi and Andy Feng (@afeng76), Yahoo Big ML Team

Introduction

In the last 10 years, Yahoo has progressively invested in building and scaling Apache Hadoop clusters, with a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters. As discussed at the 2015 Hadoop Summit, we have developed scalable machine learning algorithms on these clusters for classification, ranking, and word embedding based on a home-grown parameter server. Hadoop clusters have now become the preferred platform for large-scale machine learning at Yahoo.

Deep learning (DL) is a critical capability demanded by many Yahoo products. At the 2015 RE.WORK Deep Learning Summit, the Yahoo Flickr team (Simon Osindero and Pierre Garrigues) explained how deep learning is being applied for scene detection, object recognition, and computational aesthetics. Deep learning empowers Flickr to automatically tag all user photos, enabling Flickr end users to organize and find photos easily. To enable more Yahoo products to benefit from the promise of deep learning, we have recently introduced this capability natively into our Hadoop clusters. Deep learning on Hadoop provides the following major benefits:

- Deep learning can be conducted directly on the Hadoop clusters where Yahoo stores most of its data. We avoid unnecessary data movement between Hadoop clusters and separate deep learning clusters.
- Deep learning can be defined as first-class steps in Apache Oozie workflows, with Hadoop for data processing and Spark pipelines for machine learning.
- YARN works well for deep learning. Multiple deep learning experiments can be conducted concurrently on a single cluster, which makes deep learning extremely cost effective compared to conventional approaches. In the past, we had teams use "notepad" to schedule GPU resources manually, which was painful and worked only for a small number of users.

Deep learning on Hadoop is a novel approach to deep learning. Existing approaches in the industry require dedicated clusters, whereas deep learning on Hadoop delivers the same level of performance as dedicated clusters while simultaneously providing all the benefits listed above.

Enhancing Hadoop Clusters

To enable deep learning, we added GPU nodes into our Hadoop clusters (illustrated below). Each of these nodes has 4 Nvidia Tesla K80 cards, each card with two GK210 GPUs. These nodes have 10x the processing power of the traditional commodity CPU nodes we generally use in our Hadoop clusters.

In a Hadoop cluster, GPU nodes have two separate network interfaces, Ethernet and InfiniBand. While Ethernet acts as the primary interface for external communication, InfiniBand provides 10x faster connectivity among the GPU nodes in the cluster and supports direct access to GPU memories over RDMA. By leveraging YARN’s recently introduced node label capabilities (YARN-796), we enable jobs to state whether containers should be launched on CPU or GPU nodes. Containers on GPU nodes use InfiniBand to exchange data at very high speed.

Distributed Deep Learning: Caffe-on-Spark

To enable deep learning on these enhanced Hadoop clusters, we developed a comprehensive distributed solution based upon the open source software libraries Apache Spark and Caffe. One can now submit deep learning jobs onto a cluster of GPU nodes via a command as illustrated below.
spark-submit --master yarn --deploy-mode cluster \
    --files solver.prototxt,net.prototxt \
    --num-executors <# of EXECUTORS> \
    --archives caffe_on_grid.tgz \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./caffe_on_grid.tgz/lib64" \
    --class com.yahoo.ml.CaffeOnSpark \
    caffe-on-spark-1.0-jar-with-dependencies.jar \
    -devices <# of GPUs PER EXECUTOR> \
    -conf solver.prototxt \
    -input hdfs:// \
    -model hdfs://

In the command above, users specify the number of Spark executor processes to be launched (--num-executors), the number of GPUs to be allocated for each executor (-devices), the location of the training data on HDFS (-input), and the HDFS path where the model should be saved (-model). Users use standard Caffe configuration files to specify their Caffe solver and deep network topology (e.g., solver.prototxt, net.prototxt).

As illustrated above, Spark on YARN launches a number of executors. Each executor is given a partition of the HDFS-based training data and launches multiple Caffe-based training threads. Each training thread is executed by a particular GPU. After back-propagation processing of a batch of training examples, these training threads exchange the gradients of the model parameters. The gradient exchange is carried out in an MPI Allreduce fashion across all GPUs on multiple servers (a conceptual sketch of this synchronous averaging step appears after the benchmark discussion below). We have enhanced Caffe to use multiple GPUs on a server and to benefit from RDMA to synchronize DL models.

Caffe-on-Spark enables us to use the best of Spark and Caffe for large-scale deep learning. DL tasks are launched as easily as any other Spark application. Multiple GPUs in a cluster of machines are used to train models from HDFS-based large datasets.

Benchmarks

Caffe-on-Spark enables (a) multiple GPUs, and (b) multiple machines to be used for deep learning. To understand the benefits of our approach, we performed benchmarks on the ImageNet 2012 dataset.

First, we looked at the progress of deep learning for AlexNet with 1, 2, 4, and 8 GPUs within a single Spark executor. As illustrated in the diagram below, training time decreases as we add more GPUs. With 4 GPUs, we reached 50% accuracy in about 15/43 ≈ 35% of the time required by a single GPU. All these runs use an identical total batch size of 256. The setup with 8 GPUs did not show significant improvement over 4, as the per-GPU batch size was too small to use the hardware efficiently.

Next, we conducted a distributed benchmark with GoogLeNet, which is much deeper and uses more convolutions than AlexNet, and thus requires more computation power. In every run, we arrange for each GPU to handle batches of size 32, for an effective batch size of 32n when n GPUs are used. Our distributed algorithm is designed to produce models and end-result precision equivalent to running on a single GPU. 80% top-5 accuracy (20% error) was reached in 10 hours of training with 4 servers (4x8 GPUs). Notice that training with 1 GPU reached only 60% top-5 accuracy (40% error) after 40 hours.

GoogLeNet scales further with the number of GPUs. For 60% top-5 accuracy (40% error), 8 GPUs achieved a 680% speedup over 1 GPU. The table below also shows the speedup for 70% and 80% top-5 accuracy. The speedup could be larger if we adjusted the batch size carefully (instead of a total batch size of 32n).
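As referenced above, here is a conceptual sketch of the synchronous gradient exchange. It is not Caffe-on-Spark code; it simply illustrates the allreduce-style step in which every training thread ends up with the same averaged gradient before applying its weight update. In the real system this aggregation runs across GPUs and servers over InfiniBand/RDMA.

// Conceptual sketch only: an allreduce-style gradient exchange shown as a plain
// in-memory reduction followed by a broadcast of the averaged result.
final class GradientExchange {

    static float[] allreduceAverage(float[][] perWorkerGradients) {
        int numWorkers = perWorkerGradients.length;
        int numParams = perWorkerGradients[0].length;
        float[] aggregated = new float[numParams];

        // Reduce: element-wise sum of the gradients each worker computed
        // on its own partition of the mini-batch.
        for (float[] gradient : perWorkerGradients) {
            for (int i = 0; i < numParams; i++) {
                aggregated[i] += gradient[i];
            }
        }

        // Average, so the update corresponds to one large effective batch.
        for (int i = 0; i < numParams; i++) {
            aggregated[i] /= numWorkers;
        }

        // Broadcast: every worker receives and applies the same aggregated gradient.
        return aggregated;
    }
}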
Open Source

Continuing Yahoo’s commitment to open source, we have released some of our code into github.com/BVLC/caffe:

- #2114 … Allow Caffe to use multiple GPUs within a computer
- #1148 … RDMA transfers across computers
- #2386 … Improved Caffe’s data pipeline and prefetching
- #2395 … Added timing information
- #2402 … Make Caffe’s IO dependencies optional
- #2397 … Refactored Caffe solvers code

In a follow-up post in the coming weeks, we will share the detailed design and implementation of Caffe-on-Spark. If there is enough interest from the community, we may open source our implementation. Please let us know what you think at bigdata@yahoo-inc.com.

Conclusion

This post describes early steps in bringing the Apache Hadoop ecosystem and deep learning together on the same heterogeneous (GPU+CPU) cluster. We are encouraged by the early benchmark results and plan to invest further in Hadoop, Spark, and Caffe to make deep learning more effective on our clusters. We look forward to working closely with the open source communities in related areas.
