Latest Blogposts
Stories and updates you can see
October 26, 2023
Deep Dive into Yahoo's Semantic Search Suggestions: From Challenges to Effective Implementation
The Pervasive Problem of Semantic Search
In the expansive digital age where information is not only vast but grows at an exponential rate, the quest for accurate and relevant search results has never been more critical. Within this context, Yahoo Mail, serving millions of users, understood the transformative potential of semantic search. By leveraging the prowess of OpenAI embeddings, we embarked on a journey to provide search results that would understand and match user intent, going beyond the conventional keyword-based approach. And while the results were commendable, they weren't devoid of hurdles:
1. Performance Bottlenecks: The integration of OpenAI embeddings, though powerful, significantly slowed down our search process.
2. User Experience: The new system required users to type extensively, often more than they were accustomed to, leading to potential user dissatisfaction.
3. Habit Change: Introducing a paradigm shift in search behaviors meant we were not just altering algorithms but challenging years of user habits.
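For readers less familiar with embedding-based matching, the minimal sketch below shows the general idea of ranking candidates by cosine similarity of embedding vectors rather than by keyword overlap. It is an illustration of how such matching typically works, not Yahoo Mail's production code; the function names are ours, and the vectors are assumed to be precomputed (for example, by an embedding model such as OpenAI's).

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity measures how closely two embedding vectors point
    # in the same direction, independent of their magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_semantic_similarity(query_vec, candidates):
    # candidates: mapping of document id -> precomputed embedding vector.
    # Returns (doc_id, score) pairs sorted from most to least similar.
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)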
Our objective was crystal clear yet daunting: We wanted to augment the semantic search with suggestions that were rapid, economically viable, and seamlessly integrated into the user's natural search behavior.
Approach: Exploration Phase
We were initially enticed by the idea of real-time suggestions via large language models (LLMs), but we soon realized the impracticality of such an approach, primarily due to speed constraints. The challenge demanded a solution that operated offline but mirrored the capabilities of real-time systems.
Our exploration led us to task the LLM to frame and answer all conceivable questions for every email a user received. While theoretically sound, the financial implications were prohibitive. Moreover, the risk of the LLM generating "hallucinations" or inaccurate results couldn't be ignored.
It was amidst this exploration that a revelatory idea emerged. We were already equipped with a sophisticated extraction pipeline capable of gleaning crucial information from emails. This was achieved using a blend of human-curated regex parsing and meticulously fine-tuned AI models. This became the key to powering our search suggestions.
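To make "extraction card" concrete, here is a small, hypothetical sketch of the regex side of such a pipeline: a pattern pulls structured fields out of an email body and packages them as a card. The pattern, field names, and card shape are illustrative assumptions, not the actual Yahoo Mail extraction pipeline.

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionCard:
    card_type: str   # e.g. "flight", "package", "invoice"
    fields: dict     # structured values pulled out of the email

# Hypothetical pattern for a flight-confirmation email.
FLIGHT_PATTERN = re.compile(
    r"Flight\s+(?P<flight_no>[A-Z]{2}\d{2,4}).*?"
    r"Departs\s+(?P<depart_date>\d{4}-\d{2}-\d{2})",
    re.IGNORECASE | re.DOTALL,
)

def extract_flight_card(email_body: str) -> Optional[ExtractionCard]:
    # Return a card when the email matches, otherwise None.
    match = FLIGHT_PATTERN.search(email_body)
    if match is None:
        return None
    return ExtractionCard(card_type="flight", fields=match.groupdict())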
Implementation Challenges: Transitioning from Conceptualization to Real-World Application
1. The Intricacies of Indexing: One of the more pronounced challenges we encountered revolved around the intricacies of over-indexing. Let's delve into a hypothetical yet common scenario to elucidate this. Imagine a user intending to search for the term "staples." As they begin their search with the initial letters "sta", an all-encompassing approach to indexing, which takes into account every conceivable keyword, might mistakenly steer the user towards unrelated terms like "statement." Such deviations, although seemingly minor, can significantly hamper the user experience. Recognizing the paramount importance of ensuring that our search suggestions remained razor-sharp in their precision and highly relevant, we embarked on a methodical approach. Our resolution was to meticulously handpick and index only a curated set of keywords, ensuring that every suggestion offered was in perfect alignment with the user's intent.
2. The Quest for Relevance in Suggestions: Another challenge that frequently emerged was ensuring the highest degree of relevance in our search suggestions. This challenge becomes particularly pronounced when one considers a situation where a user's inbox is populated with multiple items that bear a resemblance to each other, say multiple flight confirmations. The conundrum we faced was discerning which of these similar items was of immediate interest to the user. Our breakthrough came in the form of an innovative approach centered on the extraction card date. Rather than basing our suggestions on the date the email was received, we shifted our focus to the date of the event described within the email, like a flight's departure date. This nuanced change enabled us to consistently zero in on and prioritize the most timely and pertinent result for the user.
3. Embracing Dynamism and Adaptability: When we first conceptualized our approach, our methodology was anchored in generating questions and answers during the email delivery phase, which were then indexed. However, as we delved deeper, it became evident that this approach, while robust, was somewhat inflexible and lacked the dynamism that modern search paradigms demand. Determined to infuse our system with greater adaptability, we pioneered a Just-in-Time question generation mechanism. With this refined approach, the foundational search indexes are still crafted at the point of delivery, but the actual questions are constructed dynamically in real time, tailored to the user's specific query and the prevailing temporal context. This rejuvenated approach not only elevated the flexibility of our system but also enhanced operational efficiency, ensuring that users always received the most pertinent suggestions. (A sketch illustrating the indexing and ranking ideas above follows this list.)
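A minimal sketch of the first two resolutions above, under the assumption of a simple in-memory index: only curated keywords are indexed for prefix matching, and candidates are ranked by the extraction card's event date rather than the email's received date. The data structures and ranking rule here are our own illustration, not the production system.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class Card:
    keyword: str      # curated suggestion keyword, stored lowercase (e.g. "staples")
    event_date: date  # date of the event described in the email (e.g. flight departure)
    email_id: str

@dataclass
class SuggestionIndex:
    cards: list = field(default_factory=list)

    def add(self, card: Card) -> None:
        # Only curated keywords are indexed, so a prefix like "sta" cannot
        # drift toward unrelated terms such as "statement".
        self.cards.append(card)

    def suggest(self, prefix: str, today: date, limit: int = 5) -> list:
        matches = [c for c in self.cards if c.keyword.startswith(prefix.lower())]
        # Prefer upcoming events closest to today, then the most recent past ones.
        matches.sort(key=lambda c: (c.event_date < today, abs((c.event_date - today).days)))
        return matches[:limit]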
Implementation
At delivery time
- Here we extract the important information, create cards from the emails, and save them in our BE (backend) store.
Semantic Search Indexing
- Fetch or update the extracted cards from the BE database, then index them by extracting keywords and storing them in the semantic search index database.
Retrieval
- When the user performs a search, we make a server call, which in turn finds the best-matching extraction card for the query.
- This card is then used to generate the suggestions for the semantic search.
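To show how the pieces above can fit together at query time, here is a hypothetical sketch of Just-in-Time suggestion generation: the best-matching extraction card is turned into a suggestion by filling a template at request time rather than at delivery time. The templates and card fields are illustrative assumptions, not the actual suggestion formats used in Yahoo Mail.

from typing import Optional

# Hypothetical suggestion templates, keyed by card type.
SUGGESTION_TEMPLATES = {
    "flight": "When does my flight {flight_no} depart on {depart_date}?",
    "package": "Where is my package from {merchant}?",
}

def generate_suggestion(card_type: str, card_fields: dict) -> Optional[str]:
    # Compose the suggestion text just in time, from the card that best
    # matches the user's partial query.
    template = SUGGESTION_TEMPLATES.get(card_type)
    if template is None:
        return None
    try:
        return template.format(**card_fields)
    except KeyError:
        # The card lacks a field the template needs; skip rather than guess.
        return None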
Conclusion
Our innovative foray into enhancing search suggestions bore fruit in a remarkably short span of 30 days, even as we navigated the intricacies of a completely new tech stack. The benefits were manifold: an enriched user experience, and 10% of semantic search traffic now handled by search suggestions.
In the rapidly evolving realm of AI, challenges are omnipresent. However, our journey at Yahoo underscores the potential of lateral thinking and a commitment to User Experience. Through our experiences, we hope to galvanize the broader tech community, encouraging them to ideate and implement solutions that are not just effective, but also economically prudent.
Contributors
Kevin Patel (patelkev@yahooinc.com) + Renganathan Dhanogopal (renga@yahooinc.com) - Architecture + Tech Implementation
Josh Jacobson + Sam Bouguerra (sbouguerra@yahooinc.com) - Product
Author
Kevin Patel (patelkev@yahooinc.com) - Director of Engineering, Yahoo
March 28, 2023
Latest updates - March 2023
Happy Spring! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
UI
- UI codebase has been upgraded to use Ember.js 4.4
- Build detail page to display the Template in use
- Links in the event label are now clickable
- PR title shows on PR build page
- Job list to display a build’s start & end times on hover
Bug Fixes
UI
- Job list view to handle job display name as expected
- Artifacts with & in name are now loaded properly
API
- Fixed data loss when adding Templates from multiple browser tabs
- Add API endpoints to add or remove one or more pipelines in a collection
Internals
- Fix for Launcher putting invalid characters on log lines
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v6.0.9
- UI - v1.0.790
- Store - v5.0.2
- Queue-Service - v3.0.2
- Launcher - v6.0.180
- Build Cluster Worker - v3.0.3
Contributors
Thanks to the following contributors for making these features possible:
- Alan
- Anusha
- Haruka
- Ibuki
- Keisuke
- Pritam
- Sagar
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Jithin Emmanuel, Director Of Engineering, Yahoo
December 30, 2022
Latest updates - December 2022
Happy Holidays! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
UI
- Enable deleting disconnected Child Pipelines from UI. This will give users more awareness and control over SCM URLs that are removed from child pipelines list.
API
- Cluster admins can configure different bookends for individual build clusters.
- Add more audit logs for Cluster admins to track API usage.
Bug Fixes
UI
- Collections sorting enhancements.
- Create Pipeline flow now displays all Templates properly.
API
- Pipeline badges have been refactored to reduce resource usage.
- Prevent artifact upload errors due to incorrect retry logic.
Queue Service
- Prevent archived jobs from running periodic jobs if cleanup fails at any point.
Internals
- Update golang version to 1.19 across all golang projects.
- Node.js has been upgraded to v18 for Store, Queue Service & Build Cluster Worker.
- Feature flag added to Queue Service to control Redis Table usage to track periodic builds.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v5.0.12
- UI - v1.0.759
- Store - v5.0.2
- Queue-Service - v3.0.0
- Launcher - v6.0.178
- Build Cluster Worker - v3.0.2
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Anusha
- Kevin
- Haruka
- Ibuki
- Masataka
- Pritam
- Sagar
- Tiffany
- Yoshiyuki
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Jithin Emmanuel, Director Of Engineering, Yahoo
October 31, 2022
New bug fixes and features - October 2022
Latest Updates - October 2022
Happy Halloween! The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Add sorting on branch and status for Collections
- Able to select timestamp format in user preferences
- Click on User profile in upper right corner, select User Settings
- Select dropdown for Timestamp Format, pick preferred format
- Click Save
- Soft delete for child pipelines - you still need to ask a Screwdriver admin to remove them completely
- Notify Screwdriver pipeline developers if pipeline is missing admin
- Add audit log of operations performed on the Pipeline Options page - Screwdriver admins should see more information in API logs
- API to reset user settings
- Support Redis cluster connection
- Add default event meta in launcher - set event.creator properly
- New gitversion binary with multiple branch support - added a Homebrew formula and a --merged parameter (to consider only versions on the current branch)
Bug Fixes
- UI
- Show error message when unauthorized users change job state
- Job state should be updated properly for delayed API response
- Gray out the Restart button for jobs that are disabled
- Modify toggle text to work in both directions
- Display full pipeline name in Collections
- Allow reset of Pipeline alias
- Remove default pipeline alias name
- Add tooltip for build history in Collections
- API
- Admins can sync on any pipeline
- Refactor unzipArtifactsEnabled configuration
- Check permissions before running startAll on child pipelines
- ID schema for pipeline get latestBuild
Internals
- Models
- Refactor syncStages to fail early
- Pull Request sync only returns PRs relevant to the pipeline
- Add more logs to stage creation
- Data-schema
- Display JobNameLength in user settings
- Remove old unique constraint for stages table
- SCM GitHub
- Get open pull requests - override the default limit (30) to return up to 100
- Change wget to curl for downloading sd-repo
- Builds cannot be started if a pipeline has more than 5 invalid admins
- Coverage-sonar
- Use correct job name for PR with job scope
- Queue-Service
- Remove laabr
- Launcher
- Update Github link for grep
- Update build status if SIGTERM is received - build status will be updated to Failure on a soft evict. Then buildCluster-queue-worker can send a delete request to clean up the build pod
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.297
- UI - v1.0.732
- Store - v4.2.5
- Queue-Service - v2.0.42
- Launcher - v6.0.171
- Build Cluster Worker - v2.24.3
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Anusha
- Kevin
- Haruka
- Ibuki
- Masataka
- Pritam
- Sagar
- Sheridan
- Shota
- Tiffany
- Yoshiyuki
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Tiffany Kyi, Sr Software Dev Engineer, Yahoo
October 21, 2022
Open Sourcing Subdomain Sleuth
Subdomain Sleuth is a new open source project built by the Yahoo DNS team, designed to help you defend your infrastructure against subdomain takeover attacks. This type of attack is especially dangerous because it enables phishing and cookie theft. The tool reads your zone files, identifies multiple types of possible takeovers, and generates a report of the dangerous records. If you work with DNS or security, I encourage you to keep reading.
A subdomain takeover is when an attacker is able to take control of the target of an existing DNS record. This is normally the result of what is called a “dangling record”, which is a record that points to something that doesn’t exist. That could be a broken CNAME or a bad NS record. It could also be a reference to a service that resolves but that you don’t manage. In any of these cases, a successful takeover can allow the attacker to serve any content they want under that name. The surface area for these attacks grows proportionally to the adoption of cloud and other managed services.
Let’s consider an example. One of your teams creates an exciting new app called groundhog, with the web site at groundhog.example.com. The content for the site is hosted in a public AWS S3 bucket, and groundhog.example.com is a CNAME to the bucket name. Now the product gets rebranded, and the team creates all new web site content. The old S3 bucket gets deleted, but nobody remembers to remove the CNAME. If an attacker finds it, they can register the old bucket name in their account and host their own content under groundhog.example.com. They could then launch a phishing campaign against the users, using the original product name.
We’ve always had some subdomain takeover reports come through our Bug Bounty program. We couldn’t find many tools intended for defenders - most were built for either security researchers or attackers, focused on crawling web sites or other data sources for hostnames to check, or focused on specific cloud providers. We asked ourselves “how hard could it be to automatically detect these?”. That question ultimately led to Subdomain Sleuth.
Subdomain Sleuth reads your zone files and performs a series of checks against each individual record. It can handle large zone files with hundreds of thousands of records, as well as tens of thousands of individual zones. We regularly scan several million records at a time. The scan produces a JSON report, which includes the name of each failed record, the target resource, which check it failed, and a description of the failure.
We currently support three different check types. The CNAME check looks for broken CNAMEs. CNAMEs can be chained together, so the check will identify a break at any CNAME in the chain. The NS check looks for bad delegations where the server doesn’t exist, isn’t reachable, or doesn’t answer for the particular zone that was delegated. The HTTP check looks for references to known external resources that could be claimed by an attacker. It does this by sending an HTTP request and looking for known signatures of unclaimed resources. For example, if it sees a CNAME that points to an AWS S3 bucket, it will send an HTTP request to the name. If the response contains “no such bucket”, it is a target for an attacker.
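As a rough illustration of the HTTP check's fingerprint idea (not Subdomain Sleuth's actual Go implementation), the sketch below requests a hostname and looks for response signatures that suggest an unclaimed resource. The signature strings and helper name are assumptions for the example.

import urllib.error
import urllib.request

# Example response signatures that can indicate an unclaimed, takeover-able resource.
FINGERPRINTS = [
    "NoSuchBucket",                           # e.g. a deleted AWS S3 bucket
    "There isn't a GitHub Pages site here",   # e.g. an unclaimed GitHub Pages name
]

def looks_takeoverable(hostname: str, timeout: float = 5.0) -> bool:
    # Fetch the page body (even on HTTP error responses) and scan it for
    # known "unclaimed resource" signatures.
    url = f"http://{hostname}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        body = err.read().decode("utf-8", errors="replace")
    except OSError:
        return False  # unreachable hosts are a matter for the CNAME/NS checks
    return any(sig in body for sig in FINGERPRINTS)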
Subdomain Sleuth is easy to use. All you need is a recent Go compiler and a copy of your zone files. The extra utilities require a Python 3 interpreter. The README contains details about how to build the tools and examples of how to use them.
If you’re interested in contributing to the project, we’d love to hear from you. We’re always open to detecting new variations of subdomain takeovers, whether by new checks or new HTTP fingerprints. If you participate in a bug bounty program, we’d especially love to have you feeding your findings back to the project. We’re also open to improvements in the core code, whether it’s bug fixes, unit tests, or efficiency improvements. We would also welcome improvements to the supporting tools.
We hope that you take a few minutes to give the tools a try. The increase in cloud-based services calls for more vigilance than ever. Together we can put an end to subdomain takeovers.
https://github.com/yahoo/SubdomainSleuth
October 10, 2022
Moving from Mantle to Swift for JSON Parsing
We recently converted one of our internal libraries from all Objective-C to all Swift. Along the way, we refactored how we parse JSON, moving from using the third-party Mantle library to the native JSON decoding built into the Swift language and standard library.
In this post, I'll talk about the motivation for converting, the similarities and differences between the two tools, and challenges we faced, including:
- Handling nested JSON objects
- Dealing with JSON objects of unknown types
- Performing an incremental conversion
- Continuing to support Objective-C users
- Dealing with failures
Introduction
Swift is Apple's modern programming language for building applications on all of their platforms. Introduced in June 2014, it succeeds Objective-C, an object-oriented superset of the C language from the early 80's. The design goals for Swift were similar to a new crop of modern languages, such as Rust and Go, that provide a safer way to build applications, where the compiler plays a larger role in enforcing correct usage of types, memory access, collections, nil pointers, and more.
At Yahoo, adoption of Swift started slow, judiciously waiting for the language to mature. But in the last few years, Swift has become the primary language for new code across the company. This is important not only for the safety reasons mentioned, but also for a better developer experience. Many that started developing for iOS after 2014 have been using primarily Swift, and it's important to offer employees modern languages and codebases to work in. In addition to new code, the mobile org has been converting existing code when possible, both in apps and SDK's.
One recent migration was the MultiplexStream SDK. MultiplexStream is an internal library that fetches, caches, and merges streams of content. There is a subspec of the library specialized to fetch streams of Yahoo news articles and convert the returned JSON to data models.
During a Swift conversion, we try to avoid any refactoring or re-architecting, and instead aim for a line-for-line port. Even a one-to-one translation can introduce new bugs, and adding a refactor at the same time is risky. But sometimes rewriting can be unavoidable.
JSON Encoding and Decoding
The Swift language and its standard library have evolved to add features that are practical for application developers. One addition is native JSON encoding and decoding support. Creating types that can be automatically encoded and decoded from JSON is a huge productivity boost.
Previously, developers would either manually parse JSON or use a third-party library to help reduce the tedious work of unpacking values, checking types, and setting the values on native object properties.
Mantle
MultiplexStream relied on the third-party Mantle SDK to help with parsing JSON to native data model objects. And Mantle is great -- it has worked well in a number of Yahoo apps for a long time.
However, Mantle relies heavily on the dynamic features of the Objective-C language and runtime, which are not always available in Swift, and can run counter to the static, safe, and strongly-typed philosophy of Swift. In Objective-C, objects can be dynamically cast and coerced from one type to another. In Swift, the compiler enforces strict type checking and type inference, making such casts impossible. In Objective-C, methods can be called on objects at runtime whether they actually respond to them or not. In Swift, the compiler ensures that types will implement methods being called. In Objective-C, collections, such as Arrays and Dictionaries, can hold any type of object. In Swift, collections are homogeneous and the compiler guarantees they will only hold values of a pre-declared type.
For example, in Objective-C, every object has a -(id)valueForKey:(NSString*)key method that, given a string matching a property name of the object, returns the value for the property from the instance.
But two things can go wrong here:
1. The string may not reference an actual property of the object. This crashes at runtime.
2. Notice the id return type. This is the generic "could be anything" placeholder. The caller must cast the id to what they expect it to be. But if you expect it to be a string, yet somehow it is a number instead, calling string methods on the number will crash at runtime.
Similarly, every Objective-C object has a -(void)setValue:(id)value forKey:(NSString*)key method that, again, takes a string property name and an object of any type. But use the wrong string or wrong value type and, again, boom.
Mantle uses these dynamic Objective-C features to support decoding from JSON payloads, essentially saying, "provide me with the string keys you expect to see in your JSON, and I'll call setValue:forKey: on your objects for each value in the JSON." Whether it is the type you are expecting is another story.
Back-end systems work hard to fulfill their API contracts, but it isn't unheard of in a JSON object to receive a string instead of a float. Or to omit keys you expected to be present.
Swift wanted to avoid these sorts of problems. Instead, the deserialization code is synthesized by the compiler at compile time, using language features to ensure safety.
Nested JSON Types
Our primary data model object, Article, represents a news article. Its API includes all the things you might expect, such as:
Public interface:
class Article {
    var id: String
    var headline: String
    var author: String
    var imageURL: String
}
The reality is that these values come from various objects deeply nested in the JSON object structure.
JSON:
{ "id": "1234", "content": { "headline":"Apple Introduces Swift Language", "author": { "name":"John Appleseed", "imageURL":"..." }, "image": { "url":"www..." } } }
In Mantle, you would supply a dictionary of keypaths that map JSON names to property names:
{ "id":"id", "headline":"content.headline", "author":"content.author.name", "imageURL":"content.image.url" }
In Swift, you have multiple objects that match 1:1 the JSON payload:
class Article: Codable {
    var id: String
    var content: Content
}

class Content: Codable {
    var headline: String
    var author: Author
    var image: Image
}

class Author: Codable {
    var name: String
    var imageURL: String
}

class Image: Codable {
    var url: String
}
We wanted to keep the Article interface the same, so we provide computed properties to surface the same API and handle the traversal of the object graph:
class Article {
    var id: String
    private var content: Content

    var headline: String { content.headline }
    var author: String { content.author.name }
    var imageURL: String { content.image.url }
}
This approach increases the number of types you create, but gives a clearer view of what the entities look like on the server. But for the client, the end result is the same: Values are easy to access on the object, abstracting away the underlying data structure.
JSON Objects of Unknown Type
In a perfect world, we know up front the keys and corresponding types of every value we might receive from the server. However, this is not always the case.
In Mantle, we can specify a property to be of type NSDictionary and call it a day. We could receive a dictionary of [String:String], [String:NSNumber], or even [String: NSDictionary].
Using Swift’s JSON decoding, the types need to be specified up front. If we say we expect a Dictionary, we need to specify "a dictionary of what types?"
Others have faced this problem, and one of the solutions that has emerged in the Swift community is to create a type that can represent any type of JSON value.
Your first thought might be to write a Dictionary of [String:Any]. But for a Dictionary to be Codable, its keys and values must also be Codable. Any is not Codable: it could be a UIView, which clearly can't be decoded from JSON. So instead we want to say, “we expect any type that is itself Codable.” Unfortunately there is no AnyCodable type in Swift. But we can write our own!
There are a finite number of types the server can send as JSON values. What is good for representing finite choices in Swift? Enums. Let’s model those cases first:
enum AnyDecodable {
    case int
    case float
    case bool
    case string
    case array
    case dictionary
    case none
}
So we can say we expect a Dictionary of String: AnyDecodable. The enum case will describe the type that was in the field. But what is the actual value?
Enums in Swift can have associated values! So now our enum becomes:
enum AnyDecodable {
    case int(Int)
    case float(Float)
    case bool(Bool)
    case string(String)
    case array([AnyDecodable])
    case dictionary([String: AnyDecodable])
    case none
}
We're almost done. Just because we have described what we would like to see, doesn't mean the system can just make it happen. We're outside the realm of automatic synthesis here. We need to implement the manual encode/decode functions so that when the JSONDecoder encounters a type we've said to be AnyDecodable, it can call the encode or decode method on the type, passing in what is essentially the untyped raw data:
extension AnyDecodable: Codable {
    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        if let int = try? container.decode(Int.self) {
            self = .int(int)
        } else if let string = try? container.decode(String.self) {
            self = .string(string)
        } else if let bool = try? container.decode(Bool.self) {
            self = .bool(bool)
        } else if let float = try? container.decode(Float.self) {
            self = .float(float)
        } else if let array = try? container.decode([AnyDecodable].self) {
            self = .array(array)
        } else if let dict = try? container.decode([String: AnyDecodable].self) {
            self = .dictionary(dict)
        } else {
            self = .none
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .int(let int):
            try container.encode(int)
        case .float(let float):
            try container.encode(float)
        case .bool(let bool):
            try container.encode(bool)
        case .string(let string):
            try container.encode(string)
        case .array(let array):
            try container.encode(array)
        case .dictionary(let dictionary):
            try container.encode(dictionary)
        case .none:
            try container.encodeNil()
        }
    }
}
We've implemented functions that, at runtime, can deal with a value of unknown type, test to find out what type it actually is, and then associate it into an instance of our AnyDecodable type, including the actual value.
We can now create a Codable type such as:
struct Article: Codable {
    var headline: String
    var sportsMetadata: AnyDecodable
}
In our use case, as a general purpose SDK, we don't know much about sportsMetadata. It is a part of the payload defined between the Sports app and their editorial staff.
When the Sports app wants to use the sportsMetadata property, they must switch over it and unwrap the associated value. So if they expect it to be a String:
switch article.sportsMetadata {
case .string(let str):
    label.text = str
default:
    break
}
Or using "if case let" syntax:
if case let AnyDecodable.string(str) = article.sportsMetadata {
    label.text = str
}
Incremental Conversion
During conversion it was important to migrate incrementally. Pull requests should be fairly small, tests should continue to run and pass, build systems should continue to verify building on all supported platforms in various configurations.
We identified the tree structure of the SDK and began converting the leaf nodes first, usually converting a class or two at a time.
But for the data models, converting the leaf nodes from using Mantle to Codable was not possible. You cannot easily mix the two worlds: specifying a root object as Mantle means all of the leaves need to use Mantle also. Likewise for Codable objects.
Instead, we created a parallel set of Codable models with an _Swift suffix, and as we added them, we also added unit tests to verify our work in progress. Once we finished creating a parallel set of objects, we deleted the old objects and removed the Swift suffix from the new. Because the public API remained the same, the old tests didn’t need to change.
Bridging
Some Swift types cannot be represented in Objective-C:
@objcMembers class Article: NSObject {
    ...
    var readTime: Int?
}
Bridging the Int to Obj-C results in a value type of NSInteger. But optionality is expressed in Objective-C with nil pointers, and only NSObjects, as reference types, have pointers.
So the existing Objective-C API might look like this:
@property (nonatomic, nullable, strong) NSNumber *readTime;
Since we can't write var readTime: Int?, and NSNumber isn't Codable, we can instead write a computed property to keep the same API:
@objcMembers class Article: NSObject {
    private var _readTime: Int?

    public var readTime: NSNumber? {
        if let time = _readTime {
            return NSNumber(integerLiteral: time)
        } else {
            return nil
        }
    }
}
Lastly, we need to let the compiler know to map our private _readTime variable to the readTime key in the JSON dictionary. We achieve this using CodingKeys:
@objcMembers class Article: NSObject {
    private var _readTime: Int?

    public var readTime: NSNumber? {
        if let time = _readTime {
            return NSNumber(integerLiteral: time)
        } else {
            return nil
        }
    }

    enum CodingKeys: String, CodingKey {
        case _readTime = "readTime"
        ...
    }
}
Failures
Swift's relentless focus on safety means there is no room for error. An article struct defined as having a non-optional headline must have one. And if one out of 100 articles in a JSON response is missing a headline, the entire parsing operation will fail.
People may think (myself included), "just omit the one article that failed." But there are cases where the integrity of the data falls apart if it is incomplete. A bank account payload that states a balance of $100, yet the list of transactions sums to $99 because we skipped one that didn't have a location field, would be a bad experience.
The solution here is to mark fields that may or may not be present as optional. It can lead to messier code, with users constantly unwrapping values, but it better reflects the reality that fields can be missing.
If a type declares an article identifier to be an integer, and the server sends a String instead, the parsing operation will throw an error. Swift will not do implicit type conversion.
The good news is that these failures do not crash, but instead throw (and provide excellent error diagnostics about what went wrong).
Conclusion
A conversion like this really illustrates some of the fundamental differences between Objective-C and Swift. While some things may appear to be easier in Objective-C, such as dealing with unknown JSON types, the cost is in sharp edges that can cut in production. I do not mind paying a bit more at development time to save in the long run.
The unit tests around our model objects were a tremendous help. Because we kept the same API, once the conversion was complete, they verified everything worked as before. These tests used static JSON files of server responses and validated our objects contained correct values.
The Swift version of MultiplexStream shipped in the Yahoo News app in April 2022. So far, no one has noticed (which was the goal). But hopefully the next developer that goes in to work on MultiplexStream will.
Resources
Apple Article on Encoding and Decoding Custom Types
Apple Migration Doc
Obj-C to Swift Interop
Swift to Obj-C Interop
Author
Jason Howlin
Senior Software Mobile Apps Engineer
August 30, 2022
New bug fixes and features - August 2022
Latest Updates - August 2022
The Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Collections supports sorting by: last time a job was run in a pipeline or build history based on number of failed events/jobs. To sort by one of these fields, click the up/down caret to the right of the field names.
- Collections supports displaying a human-readable alias for a Pipeline (in List view). To set the alias for a pipeline, go to your pipeline Options tab. Under Pipeline Preferences, type the alias in the Rename pipeline field. Hit enter. Go to your Collections dashboard to see the new alias.
- Screwdriver Admins can perform Sync on any pipeline from the pipeline options UI
- If there is no pipeline admin, periodic build jobs will not run and Screwdriver will notify (if Slack or email notifications are configured)
- Pull Request Comments are now supported from individual PR jobs
- Support for self-hosted SonarQube for individual Pipelines
- Meta CLI
- Meta CLI can now be installed as homebrew formula
- Allow shebang lua commands to have parameters with dashes in them
Updates
- User preference to display job name length has now been moved under User Settings. Now you can configure your preference globally for all pipelines. Click on your username in the top right corner to show the dropdown, select User Settings. (Alternatively, navigate directly to https://YOUR_URL/user-settings/preferences). Under the User Preferences tab, click the arrows or type to adjust preferred Display Name Length.
Before:
After:
Bug Fixes
- API
- Pull Requests jobs added via a pull request should work
- Prevent disabled Pull Request jobs from executing
- Prevent API crash for Pipelines with large number of Pull Requests
- queue-service
- Prevent periodic jobs getting dropped due to API connection instabilities and improve error handling
- UI
- Job states now show up in the workflow graph, even for PRs
- Build not found redirects to intended pipeline page
- Improve the description of the parameter
- More consistent restart method when using listView
- Display message when manually executing jobs for non-latest events
- Emphasize non-latest sha warning when manually executing jobs
- Use openresty as base image for M1 use
- Show error message when unauthorized users change job state
- Gray out restart button for jobs that are disabled
- Modify toggle text to work in both directions
- Collections and pipeline options improvements
- Launcher:
- Add SD_STEP_NAME env variable
Internals
- sd-cmd:
- Create command binary atomically
- Add configuration to README.md, local configuration improvements
- Fix sd-cmd not to slurp all input
- buildcluster-queue-worker:
- Upgrade amqplib from 0.8.0 to 0.10.0
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.282
- UI - v1.0.718
- Store - v4.2.5
- Queue-Service - v2.0.40
- Launcher - v6.0.165
- Build Cluster Worker - v2.24.3
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Anusha
- Haruka
- Ibuki
- Jacob
- Jithin
- Kazuyuki
- Keisuke
- Kenta
- Kevin
- Naoaki
- Pritam
- Sagar
- Sheridan
- Tatsuya
- Tiffany
- Yoshiyuki
- Yuichi
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Alan Dong, Sr Software Dev Engineer, Yahoo
August 24, 2022
Writing Lua scripts with meta
Sheridan Rawlins, Architect, Yahoo
Summary
In any file ending in .lua with the executable bit set (chmod a+x), putting a “shebang” line like the following lets you run it and even pass arguments to the script that won’t be swallowed by meta.
hello-world.lua
#!/usr/bin/env meta
print("hello world")
Screwdriver’s meta tool is provided to every job, regardless of which image you choose.
This means that you can write Screwdriver commands or helper scripts as Lua programs.
It was inspired by (but unrelated to) etcd’s bolt, as meta is a key-value store of sorts, and its boltcli, which also provides a lua runner that interfaces with bolt.
Example script or sd-cmd
run.lua
#!/usr/bin/env meta
meta.set("a-plain-string-key", "somevalue")
meta.set("a-key-for-json-value", { name = "thename", num = 123, array = { "foo", "bar", "baz" } })
What is included?
1. A Lua 5.1 interpreter written in go (gopher-lua)
2. meta CLI commands are exposed as methods on the meta object:
- meta get: local foo_value = meta.get('foo')
- meta set:
-- plain string
meta.set('key', 'value')
-- json number
meta.set('key', 123)
-- json array
meta.set('key', { 'foo', 'bar', 'baz' })
-- json map
meta.set('key', { foo = 'bar', bar = 'baz' })
- meta dump: local entire_meta_tree = meta.dump()
3. Libraries (aka “modules”) included in gopher-lua-libs - while there are many to choose from here, some highlights include:
- argparse - when writing scripts, this is a nice CLI parser inspired from the python one.
- Encoding modules: json, yaml, and base64 allow you to decode or encode values as needed.
- String helper modules: strings, and shellescape
- http client - helpful if you want to use the Screwdriver REST API, possibly using os.getenv with the environment vars provided by Screwdriver - SD_API_URL, SD_TOKEN, and SD_BUILD_ID can be very useful.
- plugin - an advanced technique for parallelism that fires up several “workers” or “threads” as “goroutines” under the hood and communicates via go channels. More than likely overkill for normal use-cases, but it may come in handy, such as fetching all artifacts from another job by reading its manifest.txt and fetching in parallel.
Why is this interesting/useful?
meta is atomic
When invoked, meta obtains an advisory lock via flock.
However, if you wanted to update a value from the shell, you might perform two commands and lose the atomicity:
# Note, to treat the value as an integer rather than string, use -j to indicate json
declare -i foo_count="$(meta get -j foo_count)"
meta set -j foo_count "$((++foo_count))"
While uncommon, if you write builds that do several things in parallel (perhaps a Makefile run with make -j $(nproc)), making such an update in parallel could hit race conditions between the get and set.
Instead, consider this script (or sd-cmd)
increment-key.lua
#!/usr/bin/env meta
local argparse = require 'argparse'
local parser = argparse(arg[0], 'increment the value of a key')
parser:argument('key', 'The key to increment')
local args = parser:parse()
local value = tonumber(meta.get(args.key)) or 0
value = value + 1
meta.set(args.key, value)
print(value)
Which can be run like so, and will be atomic
./increment-key.lua foo
1
./increment-key.lua foo
2
./increment-key.lua foo
3
meta is provided to every job
The meta tool is made available to all builds, regardless of the image your build chooses - including minimal jobs intended for fanning in several jobs to a single one for further pipeline job-dependency graphs (i.e. screwdrivercd/noop-container)
Screwdriver commands can help share common tasks between jobs within an organization. When commands are written in bash, any callouts they make, such as jq, must either exist on the images or be installed by the sd-cmd. While writing in meta’s lua is not completely immune to needing “other things”, at least it has proper http and json support for making and interpreting REST calls.
Running “inside” meta can work around system limits
Occasionally, if the data you put into meta gets very large, you may encounter Limits on size of arguments and environment, which comes from UNIX systems when invoking executables.
Imagine, for instance, wanting to put a file value into meta (NOTE: this is not a recommendation to put large things in meta, but, on the occasions where you need to, it can be supported). Say I have a file foobar.txt and want to put it into some-key. This code:
foobar="$(< foobar.txt)" meta set some-key "$foobar"
May fail to invoke meta at all if the args get too big.
If, instead, the contents are passed over redirection rather than an argument, this limit can be avoided:
load-file.lua
#!/usr/bin/env meta
local argparse = require 'argparse'
local parser = argparse(arg[0], 'load json from a file')
parser:argument('key', 'The key to put the json in')
parser:argument('filename', 'The filename')
local args = parser:parse()
local f, err = io.open(args.filename, 'r')
assert(not err, err)
local value = f:read("*a")
-- Meta set the key to the contents of the file
meta.set(args.key, value)
It may be invoked with either the filename or, if the data is in memory, with the named stdin device:
# Direct from the file
./load-file.lua some-key foobar.txt

# If in memory using "Here String" (https://www.gnu.org/software/bash/manual/bash.html#Here-Strings)
foobar="$(< foobar.txt)"
./load-file.lua some-key /dev/stdin <<<"$foobar"
Additional examples
Using http module to obtain the parent id
get-parent-build-id.lua
#!/usr/bin/env meta
local http = require 'http'
local json = require 'json'
SD_BUILD_ID = os.getenv('SD_BUILD_ID') or error('SD_BUILD_ID environment variable is required')
SD_TOKEN = os.getenv('SD_TOKEN') or error('SD_TOKEN environment variable is required')
SD_API_URL = os.getenv('SD_API_URL') or error('SD_API_URL environment variable is required')
local client = http.client({ headers = { Authorization = "Bearer " .. SD_TOKEN } })
local url = string.format("%sbuilds/%d", SD_API_URL, SD_BUILD_ID)
print(string.format("fetching buildInfo from %s", url))
local response, err = client:do_request(http.request("GET", url))
assert(not err, err)
assert(response.code == 200, "error code not ok " .. response.code)
local buildInfo = json.decode(response.body)
print(tonumber(buildInfo.parentBuildId) or 0)
Invocation examples:
# From a job that is triggered from another job
declare -i parent_build_id="$(./get-parent-build-id.lua)"
echo "$parent_build_id"
48242862

# From a job that is not triggered by another job
declare -i parent_build_id="$(./get-parent-build-id.lua)"
echo "$parent_build_id"
0
Larger example to pull down manifests from triggering job in parallel
This advanced script creates 3 argparse “commands” (manifest, copy, and parent-id) to help copying manifest files from parent job (the job that triggers this one).
It demonstrates advanced argparse features, the http client, and the plugin module to create a “boss + workers” pattern for parallel fetches:
- Multiple workers fetch individual files requested by a work channel
- The “boss” (main thread) filters relevant files from the manifest, which it sends down the work channel
- The “boss” closes the work channel, then waits for all workers to complete tasks (note that a channel will still deliver any elements sent before a receive() call reports not ok)
This improves throughput considerably when fetching many files - from a worst case of the sum of all download times with one at a time, to a best case of just the maximum download time when all are done in parallel and network bandwidth is sufficient.
manifest.lua
#!/usr/bin/env meta -- Imports argparse = require 'argparse' plugin = require 'plugin' http = require 'http' json = require 'json' log = require 'log' strings = require 'strings' filepath = require 'filepath' goos = require 'goos' -- Parse the request parser = argparse(arg[0], 'Artifact operations such as fetching manifest or artifacts from another build') parser:option('-l --loglevel', 'Set the loglevel', 'info') parser:option('-b --build-id', 'Build ID') manifestCommand = parser:command('manifest', 'fetch the manifest') manifestCommand:option('-k --key', 'The key to set information in') copyCommand = parser:command('copy', 'Copy from and to') copyCommand:option('-p --parallelism', 'Parallelism when copying multiple artifacts', 4) copyCommand:flag('-d --dir') copyCommand:argument('source', 'Source file') copyCommand:argument('dest', 'Destination file') parentIdCommand = parser:command("parent-id", "Print the parent-id of this build") args = parser:parse() -- Setup logs is shared with workers when parallelizing fetches function setupLogs(args) -- Setup logs log.debug = log.new('STDERR') log.debug:set_prefix("[DEBUG] ") log.debug:set_flags { date = true } log.info = log.new('STDERR') log.info:set_prefix("[INFO] ") log.info:set_flags { date = true } -- TODO(scr): improve log library to deal with levels if args.loglevel == 'info' then log.debug:set_output('/dev/null') elseif args.loglevel == 'warning' or args.loglevel == 'warning' then log.debug:set_output('/dev/null') log.info:set_output('/dev/null') end end setupLogs(args) -- Globals from env function setupGlobals() SD_API_URL = os.getenv('SD_API_URL') assert(SD_API_URL, 'missing SD_API_URL') SD_TOKEN = os.getenv('SD_TOKEN') assert(SD_TOKEN, 'missing SD_TOKEN') client = http.client({ headers = { Authorization = "Bearer " .. SD_TOKEN } }) end setupGlobals() -- Functions -- getBuildInfo gets the build info json object from the buildId function getBuildInfo(buildId) if not buildInfo then local url = string.format("%sbuilds/%d", SD_API_URL, buildId) log.debug:printf("fetching buildInfo from %s", url) local response, err = client:do_request(http.request("GET", url)) assert(not err, err) assert(response.code == 200, "error code not ok " .. 
response.code) buildInfo = json.decode(response.body) end return buildInfo end -- getParentBuildId gets the parent build ID from this build’s info function getParentBuildId(buildId) local parentBuildId = getBuildInfo(buildId).parentBuildId assert(parentBuildId, string.format("could not get parendId for %d", buildId)) return parentBuildId end -- getArtifact gets and returns the requested artifact function getArtifact(buildId, artifact) local url = string.format("%sbuilds/%d/artifacts/%s", SD_API_URL, buildId, artifact) log.debug:printf("fetching artifact from %s", url) local response, err = client:do_request(http.request("GET", url)) assert(not err, err) assert(response.code == 200, string.format("error code not ok %d for url %s", response.code, url)) return response.body end -- getManifestLines returns an iterator for the lines of the manifest and strips off leading ./ function getManifestLines(buildId) return coroutine.wrap(function() local manifest = getArtifact(buildId, 'manifest.txt') local manifest_lines = strings.split(manifest, '\n') for _, line in ipairs(manifest_lines) do line = strings.trim_prefix(line, './') if line ~= '' then coroutine.yield(line) end end end) end -- fetchArtifact fetches the artifact "source" and writes to a local file "dest" function fetchArtifact(buildId, source, dest) log.info:printf("Copying %s to %s", source, dest) local sourceContent = getArtifact(buildId, source) local dest_file = io.open(dest, 'w') dest_file:write(sourceContent) dest_file:close() end -- fetchArtifactDirectory fetches all the artifacts matching "source" from the manifest and writes to a folder "dest" function fetchArtifactDirectory(buildId, source, dest) -- Fire up workers to run fetches in parallel local work_body = [[ http = require 'http' json = require 'json' log = require 'log' strings = require 'strings' filepath = require 'filepath' goos = require 'goos' local args, workCh setupLogs, setupGlobals, fetchArtifact, getArtifact, args, workCh = unpack(arg) setupLogs(args) setupGlobals() log.debug:printf("Starting work %p", _G) local ok, work = workCh:receive() while ok do log.debug:print(table.concat(work, ' ')) fetchArtifact(unpack(work)) ok, work = workCh:receive() end log.debug:printf("No more work %p", _G) ]] local workCh = channel.make(tonumber(args.parallelism)) local workers = {} for i = 1, tonumber(args.parallelism) do local worker_plugin = plugin.do_string(work_body, setupLogs, setupGlobals, fetchArtifact, getArtifact, args, workCh) local err = worker_plugin:run() assert(not err, err) table.insert(workers, worker_plugin) end -- Send workers work to do log.info:printf("Copying directory %s to %s", source, dest) local source_prefix = strings.trim_suffix(source, filepath.separator()) .. filepath.separator() for line in getManifestLines(buildId) do log.debug:print(line, source_prefix) if source == '.' 
or source == '' or strings.has_prefix(line, source_prefix) then local dest_dir = filepath.join(dest, filepath.dir(line)) goos.mkdir_all(dest_dir) workCh:send { buildId, line, filepath.join(dest, line) } end end -- Close the work channel to signal workers to exit log.debug:print('Closing workCh') err = workCh:close() assert(not err, err) -- Wait for workers to exit log.debug:print('Waiting for workers to finish') for _, worker in ipairs(workers) do local err = worker:wait() assert(not err, err) end log.info:printf("Done copying directory %s to %s", source, dest) end -- Normalize/help the buildId by getting the parent build id as a convenience if not args.build_id then SD_BUILD_ID = os.getenv('SD_BUILD_ID') assert(SD_BUILD_ID, 'missing SD_BUILD_ID') args.build_id = getParentBuildId(SD_BUILD_ID) end -- Handle the command if args.manifest then local value = {} for line in getManifestLines(args.build_id) do table.insert(value, line) if not args.key then print(line) end end if args.key then meta.set(args.key, value) end elseif args.copy then if args.dir then fetchArtifactDirectory(args.build_id, args.source, args.dest) else fetchArtifact(args.build_id, args.source, args.dest) end elseif args['parent-id'] then print(getParentBuildId(args.build_id)) end Testing
In order to test this, the bats testing system was used to invoke manifest.lua with various arguments, and the return code, output, and side effects were checked.
For unit tests, an http server was fired up to serve static files in a testdata directory, and manifest.lua was actually invoked within this test.lua file so that the http server and the manifest.lua were run in two separate threads (via the plugin module) but the same process (to avoid being blocked by meta’s locking mechanism, if run in two processes)
test.lua
#!/usr/bin/env meta -- Because Meta locks, run the webserver as a plugin in the same process, then invoke the actual file under test. local plugin = require 'plugin' local filepath = require 'filepath' local argparse = require 'argparse' local http = require 'http' local parser = argparse(arg[0], 'Test runner that serves http test server') parser:option('-d --dir', 'Dir to serve', filepath.join(filepath.dir(arg[0]), "testdata")) parser:option('-a --addr', 'Address to serve on', "localhost:2113") parser:argument('rest', "Rest of the args") :args '*' local args = parser:parse() -- Run an http server on the requested (or default) addr and dir local http_plugin = plugin.do_string([[ local http = require 'http' local args = unpack(arg) http.serve_static(args.dir, args.addr) ]], args) http_plugin:run() -- Wait for http server to be running and serve status.html local wait_plugin = plugin.do_string([[ local http = require 'http' local args = unpack(arg) local client = http.client() local url = string.format("http://%s/status.html", args.addr) repeat local response, err = client:do_request(http.request("GET", url)) until not err and response.code == 200 ]], args) wait_plugin:run() -- Wait for it to finish up to 2 seconds local err = wait_plugin:wait(2) assert(not err, err) -- With the http server running, run the actual file under test -- Run with a plugin so that none of the plugins used by _this file_ are loaded before invoking dofile local run_plugin = plugin.do_string([[ arg[0] = table.remove(arg, 1) dofile(arg[0]) ]], unpack(args.rest)) run_plugin:run() -- Wait for the run to complete and report errors, if any local err = run_plugin:wait() assert(not err, err) -- Stop the http server for good measure http_plugin:stop()
And the bats test looked something like:
#!/usr/bin/env bats load test-helpers function setup() { mk_temp_meta_dir export SD_META_DIR="$TEMP_SD_META_DIR" export SD_API_URL="http://localhost:2113/" export SD_TOKEN=SD_TOKEN export SD_BUILD_ID=12345 export SERVER_PID="$!" } function teardown() { rm_temp_meta_dir } @test "artifacts with no command is an error" { run "${BATS_TEST_DIRNAME}/run.lua" echo "$status" echo "$output" ((status)) } @test "manifest gets a few files" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" manifest echo "$status" echo "$output" ((!status)) grep foo.txt <<<"$output" grep bar.txt <<<"$output" grep manifest.txt <<<"$output" } @test "copy foo.txt myfoo.txt writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" copy foo.txt "${TEMP_SD_META_DIR}/myfoo.txt" echo "$status" echo "$output" ((!status)) [[ $(<"${TEMP_SD_META_DIR}/myfoo.txt") == "foo" ]] } @test "copy bar.txt mybar.txt writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" copy bar.txt "${TEMP_SD_META_DIR}/mybar.txt" echo "$status" echo "$output" ((!status)) [[ $(<"${TEMP_SD_META_DIR}/mybar.txt") == "bar" ]] } @test "copy -b 101010 -d somedir mydir writes it properly" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d somedir "${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) } @test "copy -b 101010 -d . mydir gets all artifacts" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d . "${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) [[ $(<"${TEMP_SD_META_DIR}/mydir/abc.txt") == abc ]] [[ $(<"${TEMP_SD_META_DIR}/mydir/def.txt") == def ]] (($(find "${TEMP_SD_META_DIR}/mydir" -type f | wc -l) == 5)) } @test "copy -b 101010 -d . -p 1 mydir gets all artifacts" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" -l debug copy -b 101010 -d . -p 1 "${TEMP_SD_META_DIR}/mydir" echo "$status" echo "$output" ((!status)) ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep one.txt ls -1 "${TEMP_SD_META_DIR}/mydir/somedir" | grep two.txt (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/one.txt") == 1 )) (( $(<"${TEMP_SD_META_DIR}/mydir/somedir/two.txt") == 2 )) [[ $(<"${TEMP_SD_META_DIR}/mydir/abc.txt") == abc ]] [[ $(<"${TEMP_SD_META_DIR}/mydir/def.txt") == def ]] (($(find "${TEMP_SD_META_DIR}/mydir" -type f | wc -l) == 5)) } @test "parent-id 12345 gets 99999" { run "${BATS_TEST_DIRNAME}/test.lua" -- "${BATS_TEST_DIRNAME}/run.lua" parent-id -b 12345 echo "$status" echo "$output" ((!status)) (( $output == 99999 )) }
May 3, 2022
New bug fixes and features - May 2022
Latest Updates - May 2022
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Show base branch name on pipeline graph nav
- Relaxing blockedBy for same job - You can optionally run the same job at the same time in different events using the annotations `screwdriver.cd/blockedBySameJob` and `screwdriver.cd/blockedBySameJobWaitTime`
- Add resource limit environment variables to build pod template: `CONTAINER_CPU_LIMIT`, `CONTAINER_MEMORY_LIMIT`
- Add environment variable for private pipeline - `SD_PRIVATE_PIPELINE` will be set to `true` for private pipelines, otherwise `false` (a usage sketch follows this list)
- Add job enable or disable toggle on pipeline tooltip
- Option to filter out events that have no builds from workflow graph in UI
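As an illustration of the new build environment variables above, a build step could branch on them in a shell script. This is a minimal sketch; only the variable names come from these release notes, and the echoed messages are hypothetical:
# Illustrative build step (bash); SD_PRIVATE_PIPELINE, CONTAINER_CPU_LIMIT and
# CONTAINER_MEMORY_LIMIT are set by Screwdriver per the notes above.
if [ "${SD_PRIVATE_PIPELINE}" = "true" ]; then
  echo "Running in a private pipeline; skipping public artifact publishing"
fi
echo "Build resource limits: cpu=${CONTAINER_CPU_LIMIT} memory=${CONTAINER_MEMORY_LIMIT}"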
Bug Fixes
- API: Use non-readOnly DB to get latest build for join
- API: Return 404 error when GitHub api returns 404
- API: Multi-platform builds
- API: The build parameter should not be polluted by another pipeline
- API: Return 404 when openPr branch is not found
- API: Update promster hapi version
- queue-service: Multi-platform builds
- UI: Multi-platform builds
- UI: Unify checkbox expansion behavior on pipeline creation page
- UI: Switch from power icon to info icon
- UI: Wait for rendering
- UI: Toggle checkbox when label text clicked
- Store: Multi-platform builds
- Store: Add function to delete zip files
- Store: Enable uploading and downloading artifact files with the unzip worker scope token
- Launcher: Support ARM64 binary for sd-step
- Launcher: Build docker image for multiple platforms
- Launcher: Add buildkit flag
- Launcher: Use automatic platform args
- Launcher: Make launcher docker file multi-arch compatible
Internals
- homepage: Use tinyurl instead of git.io
- sd-cmd: Support arm64
- sd-local: Use latest patch version of golang 1.17
- meta-cli: Ensure that the jobName exists (before it was looking up “null”)
- meta-cli: Make meta get parameters behave like it does for children (i.e. apply the job overrides)
- meta-cli: Upgrade gopher-lua-libs for base64 support (and json/yaml file-io encoder/decoder)
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.239
- UI - v1.0.687
- Store - v4.2.5
- Queue-Service - v2.0.35
- Launcher - v6.0.161
- Build Cluster Worker - v2.24.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Haruka
- Hiroki
- Kazuyuki
- Keisuke
- Kevin
- Naoaki
- Pritam
- Sheridan
- Teppei
- Tiffany
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Tiffany Kyi, Sr Software Dev Engineer, Yahoo
March 30, 2022
New Bug Fixes and Features - March 2022
Latest Updates - March 2022
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- (GitLab) Group owners can create pipelines for projects they have admin access to
- Option to filter out events that have no builds from workflow graph in UI
Bug Fixes
- API: Error fix in removeJoinBuilds
- API: Error code when parseUrl failed
- API: Source directory can be 2 characters or less
- API: New functional tests for parent event, source directory, branch-specific job, restrict PR setting, skip build
- queue-service: Region map value name
- queue-service: Do not retry when processHooks times out
- UI: Update validator with provider field
- UI: Change color code to be more colorblind-friendly
- UI: Properly prompt and sync no-admin pipelines
- UI: Add string case for provider for validator
- UI: Disable click Start when set annotations
- Launcher: Do not include parameters from external builds during remote join
- buildcluster-queue-worker: Create package-lock.json
- buildcluster-queue-worker: Fix health check processing error
- buildcluster-queue-worker: Do not requeue when executor returns 403 or 404 error
Internals
- sd-cmd: Restrict debug store access log by verbose option
- template-main: Requires >=node:12
- toolbox: Add logs to troubleshoot release files
- guide: Update Gradle example
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.224
- UI - v1.0.680
- Store - v4.2.3
- Queue-Service - v2.0.30
- Launcher - v6.0.149
- Build Cluster Worker - v2.23.3
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Harura
- Ibuki
- Jithin
- Joe
- Keisuke
- Kenta
- Naoaki
- Pritam
- Ryosuke
- Sagar
- Shota
- Tiffany
- Teppei
- Yoshiyuki
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Tiffany Kyi, Sr Software Dev Engineer, Yahoo
February 15, 2022
Latest Updates - February 2022
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Multi-tenant AWS Builds using AWS CodeBuild or EKS
- Microservice to process SCM webhooks asynchronously.
Bug Fixes
- UI: Hide stop button for unknown events.
- UI: Properly update workflow graph for a running pipeline
- API: Prevent status change for a finished build.
- API: Return proper response code when Pipeline has no admins.
- API: Pull Request which spans multiple pipelines sometimes fail to start all jobs.
- API: Blocked By for the same job is not always working.
- API: Restarting build can fail sometimes when job parameters are used.
- API: Join job did not start when restarting a failed job.
- sd-local: Support for changing build user.
Internals
- API: Reduce Database calls during workflow processing.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.206
- UI - v1.0.670
- Store - v4.2.3
- Queue-Service - v2.0.26
- Launcher - v6.0.147
- Build Cluster Worker - v2.23.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Harura
- Kenta
- Keisuke
- Kevin
- Naoaki
- Pritam
- Sagar
- Tiffany
- Yoshiyuki
- Yuichi
- Yuki
- Yuta
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Jithin Emmanuel, Director of Engineering, Yahoo
January 27, 2022
Introducing YChaos - The resilience testing framework
Shashank Sharma, Software Engineer, Yahoo
We, the resilience team, are glad to announce the release of YChaos, an end-to-end resilience testing framework that injects real-time failures into systems and verifies the system’s readiness to handle these failures. YChaos provides an easy-to-understand, quick-to-set-up tool to perform a predefined chaos on a system.
YChaos started as “Gru”, a tool that uses Yahoo’s internal technologies to run “Minions” on a predefined target system, creating a selected chaos on the system and restoring it to a normal state once the testing is complete. YChaos has evolved a lot since then, with a better architecture that keeps the essence of Gru, caters to open source enthusiasts, and supports technologies used widely in Yahoo like Screwdriver CI/CD and Athenz.
Get Started
The term chaos is intriguing. To know more about YChaos, you can start by installing the YChaos package
pip install ychaos[chaos]
The above installs the latest stable YChaos package (Chaos subpackage) on your machine. To install the latest beta version of the package, you can install from the test.pypi index
pip install -i https://test.pypi.org/simple/ ychaos[chaos]
To install the actual attack modules that cause chaos on the system, install the agents subpackage. If you are planning to create chaos on a remote target, this is not needed.
pip install ychaos[agents]
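Both subpackages can also be installed in one step using pip's standard extras syntax; this is just the two commands above combined, not an additional package:
pip install 'ychaos[chaos,agents]'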
That’s all. You are now ready to create your first test plan and run the tool. To know more, head over to our documentation.
Design and Architecture
YChaos is developed keeping in mind the Chaos Engineering principles. The framework provides a way to verify that a system is in a condition that supports performing chaos on it, along with “Agents”, the actual chaos modules that inject a predefined failure into the system. The tool can also be used to monitor and verify that the system is back to normal once the chaos is complete.
YChaos Test Plan
Most of the modules of YChaos require a structured document that defines the actual chaos/verification plan the user wants to perform. This is termed the test plan. The test plan can be written in JSON or YAML format, adhering to the schema given by the tool.
The test plan provides a number of attributes that can be configured, including verification plugins, agents, etc. Once the tool is fed this test plan, it uses this configuration for everything it does going forward.
If you have installed YChaos, you can check the validity of the test plan you have created by running
ychaos testplan validate /tmp/testplan.yaml
YChaos Verification Plugins
YChaos provides various plugins within the framework to verify the system state before, during, and after the chaos. These can be used to determine whether the system is in a good enough state to perform an attack, verify that the system is behaving as expected during the attack, and confirm that the system has returned to normal once the attack is done.
YChaos currently bundles the following plugins, ready to be used:
1. Python Module: a self-configured plugin
2. Requests: verify the latency of an API call
3. SDv4: remotely trigger a configured Screwdriver v4 pipeline and mark its completion as a criterion of verification.
We are currently working on adding metrics-based verification to verify a specific metric from the OpenTSDB server and to provide different criteria (Numerical and Relative) to verify that the system is in an expected state.
To know more about YChaos Verification and how to run verification, visit our documentation. The documentation provides a way to configure a simple python_module plugin and run verification.
YChaos Target Executor
The target executor, or just Executor, determines the necessary steps to run the Agent Coordinator. The target defines the place where the chaos takes place. The Executor determines the right configuration to reach the actual target, thereby making the target available for the Agent Coordinator to run the Agents.
Currently, YChaos supports the MachineTarget executor to SSH to a particular host and run the Agents on it. Other targets like Kubernetes/Docker and Self are also under consideration.
YChaos Agent Coordinator
The agent coordinator prepares the agents configured in the test plan to run on the target. It also monitors the lifecycle of each agent so that all of the agents run in a structured way, and it ensures the agents are torn down before ending the execution.
The agent coordinator acts as a single point of control for all the agents running on the target.
YChaos Agents (Formerly Minions)
The agents are the actual attack modules that can be configured to create a specific chaos on the target. For example, CPU Burn Agent is specifically designed to burn up the CPU cores for a configured amount of time.
The agents are bundled with an Agent Configuration that provides attributes that can be configured by the user. For example, CPU Burn Agent configuration provides the cores_pct which can be configured by the user to run the process on a percentage of CPU cores on the target.
YChaos Agents are designed in such a way that it is possible to run them independently without any intermediates like a coordinator. This helps in quick development and testing of agents.
Agents follow a sequence of lifecycle methods during execution: setup, run, teardown, and monitor. Setup initializes the prerequisites for an agent to execute. Run contains the program logic required to perform a chaos on the system. Once run executes successfully, teardown can be triggered to restore the system from the chaos created by that particular agent.
Acknowledgement
We would like to thank all the contributors to YChaos as an idea, concept, or code. We extend our gratitude to all those supporting the project from “Gru” to “YChaos”.
Summary
This post introduced YChaos, a new Chaos Engineering and resilience testing tool, covered how to get started with it, and briefly discussed the design and architecture of the components that make up YChaos, along with some quick examples to begin your journey.
References and Links
1. YChaos Codebase : https://github.com/yahoo/ychaos
2. YChaos Documentation : https://yahoo.github.io/ychaos
3. Our Presence on PyPi:
- https://test.pypi.org/project/ychaos/
- https://pypi.org/project/ychaos/
December 22, 2021
Latest Updates - December 2021
Happy Holidays. Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Build parameters can be defined for jobs.
- UI: Show confirmation dialog when setting private pipelines public.
- UI: Option to always show pipeline triggers.
- UI: Option to display events by chronological order.
- UI: Unified UX for Pull Requests.
- Executor: Cluster admins can provide data into the build environment.
Bug Fixes
- UI: Properly start jobs in list view with parameters.
- UI: Properly close tool tips.
- API: Builds in blocked status can sometimes appear stuck
- API: Cleanup `subscribe` configuration properly.
- API: Speed up large pipeline deletion
- API: Pipeline creation sometimes fails due to job name length in “requires” configuration.
- API: Sonarqube configuration was not automatically created
- API: Redlock setting customization was not working.
- API: Template/Command publish was failing without specifying minor version.
- API: Unable to publish latest tag to template in another namespace.
- Queue Service: Properly handle API failures.
- Launcher: Handle jq install properly.
Internals
- Remove dependency on deprecated “request” npm package.
- Meta-cli download via go get now works as expected.
- Semantic release library updated to v17
- Launcher: Support disabling habitat in build environment.
- Adding more functional tests to the API.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.179
- UI - v1.0.668
- Store - v4.2.3
- Queue-Service - v2.0.18
- Launcher - v6.0.147
- Build Cluster Worker - v2.23.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Naoaki
- Om
- Pritam
- Ryosuke
- Sagar
- Tiffany
- Yoshiyuki
- Yuichi
- Yuki
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Author
Jithin Emmanuel, Director of Engineering, Yahoo
October 26, 2021
Apache Pulsar: Seamless Storage Evolution and Ultra-High Performance with Persistent Memory
Rajan Dhabalia, Sr. Principal Software Engineer, Yahoo
Joe Francis, Apache Pulsar PMC
Introduction
We have been using Apache Pulsar as a managed service in Yahoo! since 2014. Since Pulsar was open-sourced in 2016, entered the Apache Incubator in 2017, and graduated as an Apache Top-Level Project in 2018, a lot of improvements have been made and many companies have started using Pulsar for their messaging and streaming needs. At Yahoo, we run Pulsar as a hosted service, and more and more use cases run on Pulsar for different application requirements such as low latency, retention, cold reads, high fanout, etc. With the rise in the number of tenants and traffic in the cluster, we are always striving for a system that is both multi-tenant and able to use the latest storage technologies to enhance performance and throughput without breaking the budget. Apache Pulsar provides us that true multi-tenancy by handling noisy-neighbor syndrome and letting users achieve their SLAs without impacting each other in a shared environment. Apache Pulsar also has a distinct architecture that allows it to adopt the latest storage technologies from time to time, using the unique characteristics of each technology to get the best performance out of it.
In this blog post, we are going to discuss two important characteristics of Apache Pulsar, multi-tenancy and adoption of next-generation storage technologies like NVMe and Persistent memory to achieve optimum performance with very low-cost overhead. We will also discuss benchmark testing of Apache Pulsar with persistent memory that shows we have achieved 5x more throughput with Persistent memory and also reduced the overall cost of the storage cluster.
What is Multi-Tenancy?
Multi-tenancy can be easily understood with the real-estate analogy and by understanding the difference between an apartment building and a single residence home. In apartment buildings, resources (exterior wall, utility, etc.) are shared among multiple tenants whereas in a single residence only one tenant consumes all resources of the house. When we use this analogy in technology, it describes multi-tenancy in a single instance of hardware or software that has more than one resident. And it's important that all residents on a shared platform operate their services without impacting each other.
Apache Pulsar has an architecture distinct from other messaging systems. There is a clear separation between the compute layer (which does message processing and dispatching) and the storage layer (that handles persistent storage for messages using Apache BookKeeper). In BookKeeper, bookies (individual BookKeeper storage nodes) are designed to use three separate I/O paths for writes, tailing reads, and backlog reads. Separating these paths is important because writes and tailing reads use-cases require predictable low latency while throughput is more important for backlog reads use cases.
Real-time applications such as databases and mission-critical online services need predictable low latency. These systems depend on low-latency messaging systems. In most messaging systems, under normal operating conditions, dispatch of messages occurs from in-memory caches. But when a message consumer falls behind, multiple interdependent factors get triggered. The first is storage backlog. Since the system guarantees delivery, messages need to be persistently stored until delivery, and a slow reader starts building a storage backlog. Second, when the slow consumer comes back online, it starts to consume messages from where it left off. Since this consumer is now behind, and older messages have been aged out of the in-memory cache, messages need to be read back from disk storage, and cold reads on the message store will occur. These backlog reads on the storage device cause I/O contention with the writes persisting the messages currently being published, which leads to general performance degradation for both reads and writes. In a system that handles many independent message topics, the backlog scenario is even more relevant, as backlogged topics will cause unbalanced storage across topics and I/O contention. Slow consumers force the storage system to read the data from the persistent storage medium, which could lead to I/O thrashing and page-cache swap-in-and-out. This is worse when the storage I/O component shares a single path for writes, caught-up reads, and backlog reads.
A true test of any messaging system should be a test of how it performs under backlog conditions. In general, published throughput benchmarks don't seem to account for these conditions and tend to produce wildly unrealistic numbers that cannot be scaled or related to provisioning a production system. Therefore, the benchmark testing that we are presenting in this blog is performed with random cold reads by draining backlog across multiple topics.
BookKeeper and I/O Isolation
Apache BookKeeper stores log streams as segmented ledgers in bookie hosts. These segments (ledgers) are replicated to multiple bookies. This maximizes data placement options, which yields several benefits, such as high write availability, I/O load balancing, and a simplified operational experience. Bookies manage data in a log-structured way using three types of files:
Journal contains BookKeeper transaction logs. Before any update to a ledger takes place, the bookie ensures that a transaction describing the update is written to non-volatile storage.
Entry log (Data-File) aggregates entries from different ledgers (topics) and writes sequentially and asynchronously. It is also known as Data File.
Entry log index manages an index of ledger entries so that when a reader wants to read an entry, the BookKeeper locates the entry in the appropriate entry log and offset using this index.
With two separate file systems, Journal and Data-file, BookKeeper is designed to use separate I/O paths for writes, caught-up reads, and backlog reads. BookKeeper does sequential writes into journal files and performs cold reads from data files for the backlog draining.
[Figure 1: Pulsar I/O Isolation Architecture Diagram]
Adoption of Next-Generation Storage Technologies
In the last decade, storage technologies have evolved with different types of devices such as HDD, SSD, NVMe, persistent memory, etc. and we have been using these technologies for Pulsar storage as time changes. Adoption of the latest technologies is helpful in Pulsar to enhance system performance but it’s also important to design a system that can fully use a storage device based on its characteristics and squeeze the best performance out of each kind of storage.
Table 2 shows how each device can fit into the BookKeeper model to achieve optimum performance.
[Table 2: BookKeeper adaptation based on characteristics of storage devices]
Hard Disk Drive (HDD)
From the 80s until a couple of years ago, database systems have relied on magnetic disks as secondary storage. The primary advantages of a hard disk drive are affordability from a capacity perspective and reasonably good sequential performance. As we have already discussed, bookies append transactions to journals and always write to journals sequentially. So, a bookie can use hard disk drives (HDDs) with a RAID controller and a battery-backed write cache to achieve writes at lower latency than latency expectations from a single HDD.
Bookie also writes entry log files sequentially to the data device. Bookies do random reads when multiple Pulsar topics are trying to read backlogged messages. So, in total, there will be an increased I/O load when multiple topics read backlog messages from bookies. Having journal and entry log files on separate devices ensures that this read I/O is isolated from writes. Thus Pulsar can always achieve higher effective throughput and low latency writes with HDDs.
There are other messaging systems that use a single file to write and read data for a given stream. Such systems have to do a lot of random reads if consumers from multiple streams start reading backlog messages at the same time. In a multi-tenant environment, it’s not feasible for such systems to use HDDs to achieve consistent low-write latency along with backlog consumer reads because in HDD, random reads can directly impact both write and read latencies and eventually writes have to suffer due to random cold reads on the disk.
SATA Solid State Drives (SSD)
Solid-state disks (SSDs) based on NAND flash media have transformed the performance characteristics of secondary storage. SSDs are built from multiple individual flash chips wired in parallel to deliver tens of thousands of IOPS and latency in the hundred-microsecond range, as opposed to HDDs with hundreds of IOPS and latencies in milliseconds. Our experience (Figure 3) shows that SSDs provide higher throughput and better latency for sequential writes compared to HDDs. We have seen significant bookie throughput improvements by replacing HDDs with SSDs for just the journal devices.
Non-Volatile Memory Express (NVMe) SSD
Non-Volatile Memory Express (NVMe) is another of the current technology industry storage choices. The reason is that NVMe creates parallel, low-latency data paths to underlying media to provide substantially higher performance and lower latency. NVMe can support multiple I/O queues, up to 64K with each queue having 64K entries. So, NVMe’s extreme performance and peak bandwidth will make it the protocol of choice for today’s latency-sensitive applications. However, in order to fully utilize the capabilities of NVMe, an application has to perform parallel I/O by spreading I/O loads to parallel processes.
With BOOKKEEPER-963 [2], the bookie can be configured with multiple journals. Each individual thread sequentially writes to its dedicated journal. So, bookies can write into multiple journals in parallel and achieve parallel I/O based on NVMe capabilities. Pulsar performs 2x-3x better with NVMe compared to SATA/SAS drives when the bookie is configured to write to multiple journals.
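As a rough sketch of what that looks like in practice, multiple journal directories (one per NVMe-backed mount) can be listed in the bookie configuration; only the journalDirectories and ledgerDirectories keys come from this post (they also appear in the Test Setup section below), and the mount points here are placeholders:
# conf/bookkeeper.conf (illustrative sketch; paths are placeholders)
journalDirectories=/mnt/nvme0/journal,/mnt/nvme1/journal
ledgerDirectories=/mnt/nvme2/ledgers,/mnt/nvme3/ledgers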
Persistent Memory
There is a large performance gap between DRAM memory technology and the highest-performing block storage devices currently available in the form of solid-state drives. This gap can be reduced by a novel memory module solution called Intel Optane DC Persistent Memory (DCPMM) [1]. The DCPMM is a byte-addressable cache coherent memory module device that exists on the DDR4 memory bus and permits Load/Store accesses without page caching.
DCPMM is a comparatively expensive technology on unit storage cost to use for the entirety of durable storage. However, BookKeeper provides a near-perfect option to use this technology in a very cost-effective manner. Since the journal is short-lived and does not demand much storage, a small-sized DCPMM can be leveraged as the journal device. Since journal entries are going to be ultimately flushed to ledgers, the size of the journal device and hence the amount of persistent memory needed is in the tens of GB.
Adding a small-capacity DCPMM to a bookie increases the total cost of the bookie by 5-10%, but it gives significantly better performance, delivering more than 5x throughput while maintaining low write latency.
Endurance Considerations of Persistent Memory vs SSD
Due to the guarantees needed on the data persistence, journals need to be synced often. On a high-performance Pulsar cluster, with SSDs as the journal device to achieve lower latencies, this eats into the endurance budget, thus shortening the useful lifespan of NAND flash-based media. So for high performance and low latency Pulsar deployment, storage media needs to be picked carefully.
This issue can, however, be easily addressed by taking advantage of persistent memory. Persistent memory has significantly higher endurance, and the write throughput required for a journal should be handled by this device. A small amount of persistent memory is cheaper than an SSD with equivalent endurance. So from the endurance perspective, Pulsar can take advantage of persistent memory technology at a lower cost.
[Figure 3: Latency vs Throughput with Different Journal Device in Bookie]
Figure 3 shows the latency vs. throughput graph when we use different types of storage devices to store journal files. It illustrates that the journal on an NVMe device gives 350MB throughput and on a PMEM device gives 900MB throughput, while maintaining a consistently low p99 latency of 5ms.
As we discussed earlier, this benchmark testing is performed under a real production situation and the test was performed under backlog conditions. Our primary focus for this test is (a) system throughput and (b) system latency. Most of the applications in our production environment have SLA of p99 5ms publish latency. Therefore, our benchmark setup tests throughput and latency of Apache Pulsar with various storage devices (HDD, SSD, NVMe, and Persistent memory) and with a mixed workload of writes, tail reads, and random cold reads across multiple topics. In the next section, let’s discuss the benchmark test setup and performance results in detail.
Benchmarking Pulsar Performance for Production Use Cases
Workload
We measured the performance of Pulsar for a typical mixed workload scenario. In terms of throughput, higher numbers are achievable (up to the network limit), but those numbers don't help in decision-making for building production systems. There is no one-size-fits-all recommended configuration available for any system. The configuration depends on various factors such as hardware resources of brokers (memory, CPU, network bandwidth, etc.) and bookies (storage disk types, network bandwidth, memory, CPU, etc.), replication configurations (ensembleSize, writeQuorum, ackQuorum), traffic pattern, etc.
The benchmark test configuration is set up to fully utilize system capabilities. Pulsar benchmark test includes various configurations such as a number of topics, message size, number of producers, and consumer processes. More importantly, we make an effort to ensure that cold-reads occur, which forces the system to read messages from the disk. This is typical for systems that do a replay, have downstream outages, and have multiple use cases with different consumption patterns.
In Verizon Media (Yahoo), most of our use cases are latency-sensitive and they have a publish latency SLA of p99 5ms. Hence these results are indicative of the throughput limits with that p99 limit, and not the absolute throughput that can be achieved with the setup. We evaluated the performance of Pulsar using different types of storage devices (HDD, SSD, NVMe, and PMEM) for BookKeeper Journal devices. However, NVMe and PMEM are more relevant to current storage technology trends. Therefore, our benchmark setup and results will be more focused on NVMe and PMEM to use them for BookKeeper journal devices.
Quorum Count, Write Availability, and Device Tail Latencies
Pulsar has various settings to ensure durability vs availability tradeoffs.
Unlike other messaging systems, Pulsar does not halt writes to do recovery in a w=2/a=2 setup. It does not require a w=3/a=2 setup to ensure write availability during upgrades or single node failure. Writing to 2 nodes (writeQuorum=2) and waiting for 2 acknowledgements (ackQuorum=2), provides write availability in Pulsar under those scenarios. In this setup (w=2/a=2), when a single node fails, writes can proceed without interruption instantaneously, while recovery executes in the background to restore the replication factor.
Other messaging systems halt writes, while doing recovery under these scenarios.
While failure may be rare, the much more common scenario of a rolling upgrade is seamlessly possible with a Pulsar configuration of (w=2/a=2).
We consider this a marked benefit out of the box, as we are able to get by with a data replication factor of 2 instead of 3 to handle these occasions, with storage provisioned for 2 copies.
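For context, persistence policies of this kind are typically applied per namespace with the pulsar-admin CLI. The following is a hedged sketch, not a configuration taken from this benchmark; the tenant/namespace name is a placeholder:
# Illustrative only: set ensemble/write/ack quorum for one namespace
pulsar-admin namespaces set-persistence my-tenant/my-namespace \
  --bookkeeper-ensemble 2 \
  --bookkeeper-write-quorum 2 \
  --bookkeeper-ack-quorum 2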
Test Setup
We use 3 Brokers, 3 Bookies, and 3 application clients.
Application Configuration:
3 Namespaces, 150 Topics
Producer payload 100KB
Consumers: 100 Topics with consumers doing hot reads, 50 topics with consumers doing cold reads (disk access)
Broker Configuration:
96GB RAM, 25Gb NIC
Pulsar settings: bookkeeperNumberOfChannelsPerBookie=200 [4]
JVM settings: -XX:MaxDirectMemorySize=60g -Xmx30g
Bookie Configuration: 1
(Journal Device: NVMe(Device-1), Ledger/Data Device: NVMe(Device-2))
64GB RAM, 25Gb NIC
bookkeeperNumberOfChannelsPerBookie=200
Journal disk: Micron NVMe SSD 9300
Journal directories: 2 (Bookie configuration: journalDirectories)
Data disk: Micron NVMe SSD 9300
Ledger directories: 2 (Bookie configuration: ledgerDirectories)
JVM settings: -XX:MaxDirectMemorySize=30g -Xmx30g
Bookie Configuration: 2
(Journal Device: PMEM, Ledger/Data Device: NVMe)
64GB RAM, 25Gb NIC
bookkeeperNumberOfChannelsPerBookie=200
PMEM journal device: 2 DIMMs, each with 120GB, mounted as 2 devices
Journal directories: 4 (2 on each device) (Bookie configuration: journalDirectories)
Data disk: Micron NVMe SSD 9300
Ledger directories: 2 (Bookie configuration: ledgerDirectories)
JVM settings: -XX:MaxDirectMemorySize=30g -Xmx30g
Client Setup
The Pulsar performance tool [3] was used to run the benchmark test.
Results
The performance test was performed on two separate bookie configurations: Bookie configuration-1 uses two separate NVMe devices, one each for the journal and data device, and Bookie configuration-2 uses PMEM as the journal device and NVMe as the data device.
[Table 4: Pulsar Performance Evaluation]
Read/write latency variations occur when an NVMe SSD controller is busy with media management tasks such as garbage collection, wear leveling, etc. The p99 NVMe disk latency goes high with certain workloads, and that impacts the Pulsar p99 latency under a replication configuration of e=2, w=2, a=2. (The p95 NVMe disk latency is not affected, so Pulsar p95 latencies are still under 5ms.)
The impact of NVMe wear leveling and garbage collection can be mitigated by a replication configuration of e=3, w=3, and a=2, which helps flatten out the Pulsar p99 latency graph across 3 bookies and achieves higher throughput while maintaining a low 5ms p99 latency. We don’t see such improvements in the PMEM journal device setup with such a replication configuration.
The results demonstrate that a bookie with NVMe or PMEM storage devices gives fairly high throughput, at around 900MB, while maintaining a low 5ms p99 latency. While performing benchmark tests on the NVMe journal device setup with replication configuration e=3, w=3, ack=2, we captured io-stats of each bookie. Figure 5 shows that a bookie with a PMEM device provides 900MB write throughput with consistently low latency (< 5ms).
[Figure 5: Latency Vs Time (PMEM Journal Device with 900MB Throughput)]
[Figure 6: Pulsar Bookie IO Stats]
IO stats (Figure 6) show that the journal device serves around 900MB of writes and no reads. The data device also serves 900MB average writes while serving 350MB reads from each bookie.
Performance & User Impact
The potential user impact of software-defined storage is best understood in the context of the performance, scale, and latency that characterize most distributed systems today. You can determine whether a software solution is using storage resources optimally in several different ways; two important metrics are throughput and latency. We have been using bookies with PMEM journal devices in production for some time, replacing HDD-RAID devices. Figure 7 shows the write throughput vs. latency bucket graph for bookies with an HDD-RAID journal device, and Figure 8 shows the same for a PMEM journal device. Bookies with the HDD-RAID configuration show high write latency when traffic spikes: the number of requests with > 50ms write latency increases with higher traffic. On the other hand, bookies with a PMEM journal device provide stable and consistently low latency at higher traffic and serve user requests within SLA. These graphs illustrate the user impact of PMEM, which allows bookies to serve latency-sensitive applications and meet their SLAs even when traffic spikes.
[Figure 7. Bookie Publish Latency Buckets with HDD-RAID Bookie Journal Device]
[Figure 8. Bookie Publish Latency Buckets with PMEM Bookie Journal Device]
Final Thoughts
Pulsar architecture can accommodate different types of hardware, which allows users to balance performance and cost based on required throughput and latency. Pulsar has the capability to adapt to the next generation of storage devices to achieve better performance. We have also seen that persistent memory excels at achieving higher write throughput while maintaining low latency.
Appendix
[1] DC Persistent Memory Module.
https://www.intel.com/content/www/us/en/architectureand-technology/optane-dc-persistent-memory.html
[2] Multiple Journal Support: https://issues.apache.org/jira/browse/BOOKKEEPER963.
[3] Pulsar Performance Tool: http://pulsar.apache.org/docs/en/performance-pulsar-perf/.
[4] Per Bookie Configurable Number of Channels: https://github.com/apache/pulsar/pull/7910.
August 27, 2021
Latest Updates - August 2021
Jithin Emmanuel, Director of Engineering, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Pipeline Visualizer tool to view connected pipelines in a single UI.
- Offline queue processing to detect and fail builds early with Kubernetes executor.
- Screwdriver now uses Docker in Docker to publish images to Docker Hub.
- Build artifacts to be streamed via API to speed up artifact rendering.
- Update eslint rules to latest across libraries and applications.
- Executors should be able to mount custom data into the build environment.
- UI to streamline display of Start/Stop buttons for Pull Requests.
Bug Fixes
- Launcher: Fix for not being able to update build parameters when restarting builds.
- Queue Service: QUEUED notification is sent twice.
- API: QUEUED build status notification was not being sent.
- API: Validate input when updating user settings.
- UI: Fix for Template/Command title breadcrumbs not working.
- UI: Validate event URL path parameters.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.140
- UI - v1.0.655
- Store - v4.2.2
- Queue-Service - v2.0.11
- Launcher - v6.0.137
- Build Cluster Worker - v2.20.2
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Naoaki
- Mansoor
- Om
- Pritam
- Tiffany
- Yoshiyuki
- Yuichi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
June 30, 2021
Latest Updates - June 2021
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Read protections for Pipelines for Private SCM Repositories.
- Support Read-Only SCM for mirrored Source Repositories.
- API: Allow Pipeline tokens to list secrets.
- API: Support PENDING Pull Request status check using metadata.
- UI: Link to the Pipeline which published a Template/Command
- Queue Worker: Add offline processing to verify if builds have started.
Bug Fixes
- Launcher: Fix metadata getting overwritten
- Launcher: Fix broken builds.
- Launcher: Prevent Shared Commands from logging by default.
- Launcher: UUID library used is no longer supported.
- UI: PR job name in the build page should link to the Pull Requests tab.
- UI: Do not show remove option for Pipelines in default collections.
- UI: List view is slow for pipelines with large numbers of jobs.
- Store: Return proper Content-Type for artifacts.
- API: Fix broken tests due to higher memory usage.
- API: Job description is missing when templates are used.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.120
- UI - v1.0.644
- Store - v4.1.11
- Queue-Service - v2.0.7
- Launcher - v6.0.133
- Build Cluster Worker - v2.15.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Naoaki
- Mansoor
- Om
- Pritam
- Tiffany
- Yoshiyuki
- Yuichi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
June 2, 2021
Latest Updates - May 2021
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Ability to Abort Frozen builds.
- Badges are supported natively without having to connect to an external service.
Bug Fixes
- UI should skip builds with `CREATED` status when computing event status.
- UI setting for graph job name adjustment was not working.
- UI: Fix event label overflowing for large values and update Stop button position.
- UI: Show restart option for builds in PR Chain.
- UI: Tone down the color when the build parameters are changed from default value.
- UI: Stop rendering files with binary content.
- UI: Fix validation for Git repository during pipeline creation.
- API: Fix for trusted templates not getting carried over to new versions.
- API: Fix validation when a step name is defined that duplicates an automatically generated step.
- API: Streamline remove command tag API response.
- Store: Fix for large file downloads failing.
- Launcher: Fix metadata getting overwritten
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.96
- Queue-Service - v2.0.6
- UI - v1.0.629
- Store - v4.1.7
- Launcher - v6.0.128
- Build Cluster Worker - v2.10.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kazuki
- Kenta
- Keisuke
- Kevin
- Lakshminarasimhan
- Naoaki
- Mansoor
- Pritam
- Shu
- Tiffany
- Yoshiyuki
- Yuichi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
April 22, 2021
Latest Updates - April 2021
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- External config to have support for Source Directory in child pipelines.
- Removing expiry of shared commands.
- API to support OR workflow for jobs.
- Collections UX improvements. Part of Yahoo Hack Together.
- Proper validation of modal.
- Make sure mandatory fields are filled in.
- UI: Option to hide PR jobs in event workflow
- UI: Hide builds in `CREATED` status to avoid confusion.
- store-cli: Support for parallel writes to build cache with locking.
- Improvements to sd-local log format
- Fix for broken lines.
- Non-verbose logging for interactive mode.
- New API to remove a command tag.
- Warn users if build parameters are different from default values.
Bug Fixes
- API: Fix for a join build stuck in “CREATED” status due to missing join data.
- Queue Service: Enhanced error handling to reduce errors in build periodic processing.
- API: Prevent users from overwriting job audit data.
- UI: Properly validate templates even if there are extra lines above config.
- UI: Fix for duplicate events displayed in the event list.
- Launcher: Support setting pushgateway protocol schema
- Launcher: Enable builds to read metadata from the entire event in addition to immediate parent builds.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.84
- Queue-Service - v2.0.6
- UI - v1.0.618
- Store - v4.1.3
- Launcher - v6.0.128
- Build Cluster Worker - v2.10.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Jithin
- Ibuki
- Harura
- Kazuyuki
- Kazuki
- Kenta
- Keisuke
- Krishna
- Kevin
- Lakshminarasimhan
- Naoaki
- Mansoor
- Pritam
- Rakshit
- Shu
- Tiffany
- Yoshiyuki
- Yuichi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
March 8, 2021
Latest Updates - March 2021
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- sd-local cli enhancements
- Compatibility with podman
- Make ssh-agent work with non-root user containers.
- Added User-Agent info on requests from sd-local to the API to track usage.
- Template owners can lock template steps to prevent step override.
- UI can be prevented from restarting specific jobs using “manualStartEnabled” annotation.
- Added health check endpoint for “buildcluster-queue-worker”.
Bug Fixes
- Fix for trusted templates not showing up in UI.
- Launcher was not terminating the current running step on timeout or abort.
- API can now start with default configuration.
- Store not starting with memory strategy.
- Local development setup was broken.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.66
- Queue-Service - v2.0.5
- UI - v1.0.604
- Store - v4.1.1
- Launcher - v6.0.122
- Build Cluster Worker - v2.9.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Ibuki
- Kazuyuki
- Kenta
- Keisuke
- Kevin
- Lakshminarasimhan
- Naoaki
- Pritam
- Tiffany
- Yoshiyuki
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
March 4, 2021
Join Screwdriver at Yahoo Hack Together (Virtual Open Source Hackathon), March 21 - 28
We’re thrilled to be participating in Yahoo Hack Together, a virtual, open source hackathon, running from March 21 through 28.
In addition to Screwdriver, there are several other awesome projects participating. Themes include Data, Design, and Information Security (Defense).
The hackathon also includes:
- Suggested topics/issues for you to get started
- Support channels to reach out to project maintainers
- Office Hours to ask questions and get feedback
- Verizon Media swag & prizes
Eligible contributions include accessibility reviews, coding, design, documentation, translations, user experience, and social media suggestions.
We’d love to invite you to join us!
February 17, 2021
cdCon 2021 - Call for Screwdriver Proposals
Dear Screwdriver Community,
cdCon 2021 (the Continuous Delivery Foundation’s annual flagship event) is happening June 23-24 and its call for papers is open!
This is your chance to share what you’ve been doing with Screwdriver. Are you building something cool? Using it to solve real-world problems? Are you making things fast? Secure? Or maybe you’re a contributor and want to share what’s new. In all cases, we want to hear from you!
Submit your talk for cdCon 2021 to be part of the conversation driving the future of software delivery for technology teams, enterprise leadership, and open-source communities.
Submission Deadline
Final Deadline: Friday, March 5 at 11:59 PM PST
Topics
Here are the suggested tracks:
- Continuous Delivery Ecosystem – This track spans the entire Continuous Delivery ecosystem, from workflow orchestration, configuration management, testing, security, release automation, deployment strategies, developer experience, and more.
- Advanced Delivery Techniques – For talks on the very cutting edge of continuous delivery and emerging technology, for example, progressive delivery, observability, and MLOps.
- GitOps & Cloud-Native CD – Submit to this track for talks related to continuous delivery involving containers, Kubernetes, and cloud-native technologies. This includes GitOps, cloud-native CD pipelines, chatops, best practices, etc.
- Continuous Delivery in Action – This track is for showcasing real-world continuous delivery addressing challenges in specific domains e.g. fintech, embedded, healthcare, retail, etc. Talks may cover topics such as governance, compliance, security, etc.
- Leadership Track – Talks for leaders and decision-makers on topics such as measuring DevOps, build vs buy, scaling, culture, security, FinOps, and developer productivity.
- Community Track – There is more to open source than code contributions. This track covers topics such as growing open source project communities, diversity & inclusion, measuring community health, project roadmaps, and any other topic around sustaining open source and open source communities.
Singular project focus and/or interoperability between:
- Jenkins
- Jenkins X
- Ortelius
- Spinnaker
- Screwdriver
- Tekton
- Other – e.g. Keptn, Flagger, Argo, Flux
View all tracks and read CFP details here.
We look forward to reading your proposal!
Submit here [https://events.linuxfoundation.org/cdcon/program/cfp/]
February 15, 2021
Latest Updates
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- Group Events in UI to visualize events started by restarting jobs in one place.
- Template Composition to enable Template authors to inherit job configuration from an existing Template
- Lua scripting support in meta cli.
- Launcher now bundles skopeo binary.
- UI to highlight the latest event in the events list.
- Notification configuration validation errors can be made into warnings.
- Build cache performance enhancement by optimizing compression algorithms.
- Streamline Collection deletion UI flow.
Bug Fixes
- SonarQube PR analysis setting is not always added.
- Session timeout was leading to 404.
- Templates & Commands UI page load time is now significantly faster.
- Fix Templates permalink.
- Clarify directions for build cluster queue setup.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.53
- Queue-Service - v2.0.0
- UI - v1.0.598
- Store - v4.0.2
- Launcher - v6.0.115
- Build Cluster Worker - v2.9.0
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Jithin
- Ibuki
- Kkawahar
- Keisuke
- Kevin
- Lakshminarasimhan
- Pritam
- Sheridan C Rawlins
- Tiffany
- Yoshiyuki
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
January 7, 2021
Improvements and updates.
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- sd-local tool support for mounting of local ssh-agent & custom volumes.
- Teardown steps will run for aborted builds. Users can control duration via annotation terminationGracePeriodSeconds
- Properly validate settings configuration in `screwdriver.yaml`
- This will break existing pipelines if the setting value is already wrong.
- Support exclusions in source paths.
- Warn users if template version is not specified when creating pipeline.
Bug Fixes
- Meta-cli now works with strings of common logarithms.
- Jobs with similar names were breaking the pipeline detail page.
- Pipeline list view to lazy load data for improved performance.
- Fix for slow rendering of the Pipeline workflow graph.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v4.1.36
- Queue-Service - v2.0.0
- UI - v1.0.590
- Store - v4.0.2
- Launcher - v6.0.106
- Build Cluster Worker - v2.3.3
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Dekus
- Jithin
- Ibuki
- Kenta
- Kkawahar
- Keisuke
- Kevin
- Lakshminarasimhan
- Pritam
- Tiffany
- Yoshiyuki
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
September 30, 2020
Explore Screwdriver at CDCon 2020
Screwdriver is an open-source build platform for Continuous Delivery. Using Screwdriver, you can easily define the path that your code takes from Pull Request to Production. The Screwdriver team will be presenting three talks at CDCon (Oct 7-8) and would love to have you join! Register to attend CDCon.
CDCon has pledged to donate 100% of the proceeds received from CDCon 2020 registration to charitable causes: Black Girls Code, Women Who Code and the CDF Diversity Fund. Registrants indicate which charitable fund they want their 25 USD registration fees to go to during registration.
Hope to see you at CDCon!
- - -
Screwdriver UI Walkthrough
Oct 7, 12:40 PM PDT
Speakers: Alan Dong, Software Engineer, Verizon Media
In this session, Alan will cover the fundamental parts of Screwdriver:
- What is a pipeline?
- How to use Screwdriver to set up a pipeline from scratch
- Integrate with SCM (i.e. GitHub)
- Setup collections for personal preferences
- How to get involved with Screwdriver.cd to get help and contribute back to the community
Case Study: How Yahoo! Japan Uses and Contributes to Screwdriver at Scale
Oct 7, 2:20 PM PDT
Speakers: Hiroki Takatsuka, Engineering Manager, Yahoo! Japan & Jithin Emmanuel, Sr Mgr, Software Dev Engineering, Verizon Media
Yahoo! Japan will share how they use and contribute to Screwdriver, an open-source build platform designed for Continuous Delivery, at scale. Several topics will be covered including: architecture, use cases, usage stats, customization, operational tips, and collaborating directly with Verizon Media’s Screwdriver team to constantly evolve Screwdriver.
CI/CD with Open Source Screwdriver
Oct 8, 3:50 PM PDT
Speakers: Jithin Emmanuel, Sr Mgr, Software Dev Engineering & Tiffany Kyi, Software Development Engineer, Verizon Media
Now part of the Continuous Delivery Foundation, Screwdriver is an open source CI/CD platform, originally created and open-sourced by Yahoo/Verizon Media. At Yahoo/Verizon Media, Screwdriver is used to run more than 60,000 software builds every day. Yahoo! Japan also uses and contributes to Screwdriver. In this session, core contributors to Screwdriver will provide an overview of features and capabilities, and how it is used at scale covering use-cases across mobile, web applications, and library development across various programming languages.
August 17, 2020
SonarQube Enterprise Edition Support
Tiffany Kyi, Software Engineer, Verizon Media
We have recently added SonarQube Enterprise Edition support to Screwdriver, which unlocks powerful Pull Request Workflows and improves build analysis performance. Cluster admins can follow instructions in the Cluster Admin Configuration section below to use SonarQube Enterprise.
In order to make use of these new Pull Request features and to better utilize our SonarQube license, we will be making the following changes:
1. Sonar Project Key for your build will change from “job:
August 13, 2020
Latest Product Updates
Jithin Emmanuel, Engineering Manager, Verizon Media
Screwdriver team is pleased to announce our newest release which brings in new features and bug fixes across various components.
New Features
- SonarQube enterprise support #1314
- Automatic Deploy Key setup for Github SCM pipelines #1079
- Support for filtering on tag and release names #1994
- Notification Slack channel can be set dynamically in build (a sketch follows this list). Usage instructions here.
- Build Parameters to support drop-down selections #2092
- Confirmation dialogue when deleting Pipeline secrets #2117
- Added “PR_BASE_BRANCH_NAME” environment variable for determining Pull Request base branch #2153
- Upgraded Ember.js to the latest LTS for Screwdriver UI
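A hedged sketch of the dynamic Slack channel feature mentioned above: inside a build step, the channel would be set through the meta CLI. The exact meta key is an assumption based on the linked usage instructions, and the channel name is a placeholder:
# Assumed meta key; consult the linked usage instructions for the authoritative name.
meta set notification.slack.channels "my-team-alerts"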
Bug Fixes
- Child pipelines to work without having to override config pipeline secrets #2125
- Periodic builds configs were not cleaning up on removal #2138
- Template list in “Create Pipeline” view to display namespaces #2140
- Remote trigger to work for Child Pipelines #2148
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v0.5.964
- Queue-Service - v1.0.22
- UI - v1.0.535
- Launcher - v6.0.87
- Build Cluster Worker - v1.18.8
Contributors
Thanks to the following contributors for making this feature possible:
- Alan
- Jithin
- Joerg
- Ibuki
- Kevin
- Keisuke
- Kenta
- Lakshminarasimhan
- Pritam
- Teppei
- Tiffany
- Yoshiyuki
- Yuichi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
August 6, 2020
Behold! Big Data at Fast Speed!
Oak0.2 Release: Significant Improvements to Throughput, Memory Utilization, and User Interface
By Anastasia Braginsky, Sr. Research Scientist, Verizon Media Israel
Creating open source software is an ongoing and exciting process. Recently, the Oak open source library delivered a new release, Oak0.2, which represents a year of collaboration. Oak0.2 makes significant improvements in throughput, memory utilization, and user interface.
OakMap is a highly scalable key-value map that keeps all keys and values off-heap. The Oak project is designed for Big Data real-time analytics. Moving data off-heap enables working with huge memory sizes (above 100GB), where the JVM struggles to manage such heap sizes. OakMap implements the industry-standard Java 8 ConcurrentNavigableMap API and more. It provides strong (atomic) semantics for read, write, and read-modify-write, as well as (non-atomic) range query (scan) operations, both forward and backward. OakMap is optimized for big keys and values, in particular for incremental maintenance of objects (update in-place). It is faster and scales better with additional CPU cores than Java’s popular ConcurrentNavigableMap implementation, ConcurrentSkipListMap.
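Because OakMap implements the ConcurrentNavigableMap interface, the usual map and range-scan calls apply to it directly. The sketch below is only a rough illustration against that standard interface; the map instance is assumed to come from Oak’s builder, whose exact construction API is documented in the Oak README and is not shown here.

import java.util.concurrent.ConcurrentNavigableMap;

public class OakMapSketch {
    // "events" is assumed to be an OakMap obtained from Oak's builder (construction omitted).
    static void demo(ConcurrentNavigableMap<Integer, String> events) {
        events.put(42, "session-start");      // atomic write
        String value = events.get(42);        // atomic read
        // Forward range scan over keys in [10, 100).
        events.subMap(10, true, 100, false)
              .forEach((k, v) -> System.out.println(k + " = " + v));
        // Backward (descending) view, since Oak also supports reverse scans.
        events.descendingMap().firstEntry();
        System.out.println(value);
    }
}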
Oak data is written to off-heap buffers and thus needs to be serialized (converting an object in memory into a stream of bytes). For retrieval, data may be deserialized (an object created from the stream of bytes). To save the cycles spent on deserialization, Oak also allows reading and updating the data directly via OakBuffers; it provides this functionality under the ZeroCopy API.
If you aren’t already familiar with Oak, this is an excellent starting point to use it! Check it out and let us know if you have any questions.
Oak keeps getting better: Introducing Oak0.2
We have made a ton of great improvements in Oak0.2: adding new stream scanning for improved performance, releasing a ground-up rewrite of our ZeroCopy API’s buffers to increase safety and performance, and decreasing the on-heap memory requirement to less than 3% of the raw data! As an exciting bonus, this release also includes a new version of our off-heap memory management, eliminating memory fragmentation.
Below, we dive deeper into the sub-projects that are part of this release.
Stream Data Faster
When scanned data is held in on-heap data structures, each next step is easy: get the next object and return it. To retrieve data held off-heap, even when using the ZeroCopy API, a new OakBuffer object must be created and returned on each next step. Scanning Big Data that way creates millions of ephemeral objects, often unnecessarily, since the application only accesses each object for a short, scoped time during execution.
To avoid this issue, the user can use our new Stream Scan API, where the same OakBuffer object is reused and redirected to different keys or values. This way, only one element can be observed at a time. A streaming view of the data is frequently used for flushing in-memory data to disk, copying, analytics, search, etc.
Oak’s Stream Scan API outperforms CSLM by nearly 4x for the ascending case. For the descending case, Oak outperforms CSLM by more than 8x, even with the less optimized non-stream API. With the Stream API, Oak’s throughput doubles. More details about the performance evaluation can be found here.
Safety or Performance? Both!
OakBuffers are the core ZeroCopy API primitives. Previously, alongside OakBuffers, OakMap exposed the underlying ByteBuffers directly to the user for performance. This could cause data safety issues such as erroneously reading the wrong data, unintentionally corrupting the data, etc. We couldn’t choose between safety and performance, so we strove to have both!
With Oak0.2, ByteBuffer is never exposed to the user. Users can choose to work either with OakBuffer, which is safe, or with OakUnsafeDirectBuffer, which gives faster access but must be used carefully. With OakUnsafeDirectBuffer, it is the user’s responsibility to synchronize and not to access deleted data; if the user is aware of those issues, OakUnsafeDirectBuffer is safe as well.
Our safe OakBuffer delivers the same great, well-known OakMap performance, which wasn’t easy to achieve. However, if the user is interested in even higher operation speed, any OakBuffer can be cast to OakUnsafeDirectBuffer.
Less (metadata) is more (data)
In the initial version of OakMap we had an object named handler that was a gateway to accessing any value. The handler was used for synchronization and memory management. It took about 256 bytes per value and imposed a dereference on each value access.
The handler is now replaced with an 8-byte header located off-heap, next to the value. No dereferencing is needed. All information needed for synchronization and memory management is kept there. In addition, to keep metadata even smaller, we eliminated the majority of the ephemeral object allocations that were used for internal calculations.
This means less memory is used for metadata, and what was saved goes directly toward keeping more user data in the same memory budget. More than that, the JVM GC has far fewer reasons to steal memory and CPU cycles, even when working with hundreds of GBs.
Fully Reusable Memory for Values
As explained above, 8-byte off-heap headers were introduced ahead of each value. The headers are used for memory reclamation and synchronization, and to hold lock data. Because a thread may still hold the lock after a value is deleted, the header’s memory couldn’t be reused. Initially the header’s memory was abandoned, causing a memory leak.
The space allocated for a value is exactly the value size plus the header size. Leaving the header unreclaimed creates a memory “hole” that a new value of the same size cannot fit into. As values are usually of the same size, this caused fragmentation: more memory was consumed, leaving unused spaces behind.
We added the ability to reuse deleted headers for new values by introducing a sophisticated memory management and locking mechanism, so new values can take the place of old, deleted ones. With Oak0.2, a scenario of 50% puts and 50% deletes runs with a stable amount of memory and performs twice as well as CSLM.
We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. If you have any questions, please feel free to send us a note. It would be great to hear from you!
Acknowledgements:
Liran Funaro, Eshcar Hilel, Eran Meir, Yoav Zuriel, Edward Bortnikov, Yonatan Gottesman
Tags: open source, big data, performance, concurrency, multi-threading, scalability, java off-heap, key-value store, memory utilization
July 31, 2020
Apache Storm 2.2.0 Improvements - NUMA Support, Auto Refreshing SSL Certificates for All Daemons, V2 Tick Backwards Compatibility, Scheduler Improvements, & OutputCollector Thread Safety
Kishor Patil, PMC Chair Apache Storm & Sr. Principal Software Systems Engineer, Verizon Media
Last year, we shared with you many of the Apache Storm 2.0 improvements contributed by Verizon Media. At Yahoo/Verizon Media, we’ve been committing to Storm for many years. Today, we’re excited to explore a few of the new features, improvements, and bug fixes we’ve contributed to Storm 2.2.0.
NUMA Support
Server hardware is getting beefier and requires worker JVMs to be NUMA (non-uniform memory access) aware. Without constraining JVMs to NUMA zones, we noticed dramatic degradation in JVM performance, specifically for Storm, where most JVM objects are short-lived and continuous GC cycles perform a complete heap scan. This feature enables maximizing hardware utilization and consistent performance on asymmetric clusters. For more information, please refer to [STORM-3259].
Auto Refreshing SSL Certificates for All Daemons
At Verizon Media, as part of maintaining thousands of Storm nodes, refreshing SSL/TLS certificates without any downtime is a priority. So we implemented auto-refreshing SSL certificates for all daemons, without outages. This is a very useful feature for operations teams, enabling hassle-free continuous monitoring and maintenance of certificates. The Verizon Media team also noticed and fixed several security-related critical bugs:
- Kerberos connectivity from worker to Nimbus/Supervisor for RPC heartbeats [STORM-3579]
- Worker token refresh causing authentication failure [STORM-3578]
- Use UserGroupInformation to login to HDFS only once per process [STORM-3494]
- AutoTGT shouldn’t invoke TGT renewal thread [STORM-3606]
V2 Tick Backwards Compatibility
This allows deprecated worker-level metrics to utilize messaging and capture V1 metrics. It is a stop-gap giving topology developers sufficient time to switch from the V1 metrics API to the V2 metrics API. The Verizon Media Storm team also contributed shortened metrics names, so that names conform to more aggregation strategies by dimension [STORM-3627]. We’ve also started removing deprecated metrics API usage within the storm-core and storm-client modules and adding new metrics at the nimbus/supervisor daemon level to monitor activity.
Scheduler Improvements
ConstraintSolverStrategy now allows a maximum co-location count at the component level, which allows for better spread - [STORM-3585]. Both ResourceAwareScheduler and ConstraintSolverStrategy were refactored for faster performance; a large topology of 2,500 components requesting complex constraints or resources can now be scheduled in less than 30 seconds. This improvement helps lower downtime during topology relaunch - [STORM-3600]. Also new in this release is a blacklisting feature that lets nimbus detect supervisor daemon unavailability, which is useful for failure detection [STORM-3596].
OutputCollector Thread Safety
For messaging infrastructure, data corruption can happen when components are multi-threaded because of non-thread-safe serializers. The patch [STORM-3620] allows Bolt implementations to use the OutputCollector from threads other than the executor thread to emit tuples. The limitation is a batch size of 1. This important implementation change avoids data corruption without any performance overhead.
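As a rough, hedged sketch of the pattern this enables (illustrative class and field names, not code from the Storm patch itself), a bolt can now hand emits off to a background thread, provided the topology runs with a batch size of 1 as noted above:

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AsyncEmitBolt extends BaseRichBolt {
    private OutputCollector collector;
    private ExecutorService pool;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.pool = Executors.newSingleThreadExecutor();
    }

    @Override
    public void execute(Tuple input) {
        // Hand the tuple to a non-executor thread; emitting from here is what STORM-3620 makes safe.
        pool.submit(() -> {
            collector.emit(input, new Values(input.getString(0).toUpperCase()));
            collector.ack(input);
        });
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}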
Noteworthy Bug Fixes
- For LoadAwareShuffle grouping, we were seeing workers overloaded and tuples timing out with load-aware shuffle enabled. The patch checks for low watermark limits before switching from host-local to worker-local - [STORM-3602].
- For Storm UI, topology visualization bugs were fixed so the topology DAG can be viewed more easily.
- A bug fix allows administrators access to topology logs from the UI and logviewer.
- Storm CLI bug fixes to accurately process command-line options.
What’s Next
In the next release, Verizon Media plans to contribute container support with Docker and RunC container managers. This should be a major boost with three important benefits - customization of system level dependencies for each topology with container images, better isolation of resources from other processes running on the bare metal, and allowing each topology to choose their worker OS and java version across the cluster.
Contributors
Aaron Gresch, Ethan Li, Govind Menon, Bipin Prasad, Rui Li
July 23, 2020
Announcing RDFP for Zeek - Enabling Client Telemetry to the Remote Desktop Protocol
Jeff Atkinson, Principal Security Engineer, Verizon Media
We are pleased to announce RDFP for Zeek. This project is based on 0x4D31’s work on FATT Remote Desktop Client fingerprinting. The technique analyzes client payloads during RDP negotiation to build a profile of client software. RDFP extends RDP protocol parsing and gives security analysts a method of profiling software used on the network. BlueKeep exposed some gaps in visibility, spurring us to contribute to Zeek’s RDP protocol analyzer to extract additional details. Please share your questions and suggestions by filing an issue on Github.
Technical Details
RDFP extracts the following key elements and then generates an MD5 hash.
- Client Core Data
- Client Cluster Data
- Client Security Data
- Client Network Data
Here is how the RDFP hash is created:
md5(verMajor;verMinor;clusterFlags;encryptionMethods;extEncMethods;channelDef)
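As a hedged illustration of that recipe (shown here in Java rather than the Zeek script itself, with placeholder field values), the extracted fields are joined with ';' and the result is MD5-hashed:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RdfpHashSketch {
    // Joins the extracted RDP negotiation fields and returns the MD5 hex digest.
    static String rdfpHash(String verMajor, String verMinor, String clusterFlags,
                           String encryptionMethods, String extEncMethods, String channelDef) throws Exception {
        String joined = String.join(";", verMajor, verMinor, clusterFlags,
                                    encryptionMethods, extEncMethods, channelDef);
        byte[] digest = MessageDigest.getInstance("MD5").digest(joined.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}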
Client Core Data
The first data block handled is Client Core Data. The client major and minor versions are extracted. Other information can be found in this datagram, but it is more specific to the client configuration than to the client software.
Client Cluster Data
The Client Cluster Data datagram contains the Cluster Flags. These are added in the order they are seen and provide information about session redirection and other items, e.g., whether a smart card was used.
Client Security Data
The Client Security Data datagram provides the encryptionMethods and extEncryptionMethods. The encryptionMethods field details the key that is used and the message authentication code. The extEncryptionMethods field is a specific flag designated for the French locale.
Client Network Data
The Client Network Data datagram contains the Channel Definition Structure (Channel_Def). Channel_Def provides configuration information about how the virtual channel with the server should be set up. This datagram provides details on compression, MCS priority, and channel persistence across transactions.
Here is the example rdfp.log generated by the rdfp.zeek script. The log provides all of the details along with the client rdfp_hash.
This technique works well, but note that RDP clients can require TLS encryption; reference the JA3 fingerprinting technique for TLS traffic analysis. Please refer to Adel’s blog post for additional details and examples of ways to leverage RDP fingerprinting on the network.
Conclusion
Zeek RDFP extends network visibility into client software configurations. Analysts can apply logic and detection techniques to these extended fields, and analysts and engineers can also apply anomaly detection and additional algorithms to profile and alert on suspicious network patterns.
Please share your questions and suggestions by filing an issue on Github.
Additional Reading
- John B. Althouse, Jeff Atkinson and Josh Atkins, “JA3 — a method for profiling SSL/TLS clients”
- Ben Reardon and Adel Karimi, “HASSH — a profiling method for SSH clients and servers”
- Microsoft Corporation, “[MS-RDPBCGR]: Remote Desktop Protocol: Basic Connectivity and Graphics Remoting”
- Adel Karimi, “Fingerprint All the Things!”
- Matt Bromiley and Aaron Soto, “What Happens Before Hello?”
- John Althouse, “TLS Fingerprinting with JA3 and JA3S”
- Zeek Package Contest 3rd Place Winner
Acknowledgments
Special thanks to Adel, #JA3, #HASSH, and W for reminding me there’s always more on the wire.
July 16, 2020
Vespa Product Updates, June 2020: Support for Approximate Nearest Neighbor Vector Search, Streaming Search Speedup, Rank Features, & GKE Sample Application
Kristian Aune, Tech Product Manager, Verizon Media
In the previous update, we mentioned Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, and Increased Tensor Performance. This month, we’re excited to share the following updates:
Support for Approximate Nearest Neighbor Vector Search
Vespa now supports approximate nearest neighbor search, which can be combined with filters and text search. Using a native implementation of the HNSW algorithm, Vespa provides state-of-the-art performance on vector search: typical single-digit millisecond response times while searching hundreds of millions of documents per node. Uniquely, Vespa also allows vector query operators to be combined efficiently with filters and text search, which is usually a requirement for real-world applications such as text search and recommendation. Vectors can be updated in real time with a sustained write rate of a few thousand vectors per node per second. Read more in the documentation on nearest neighbor search.
Streaming Search Speedup
Streaming Search is a feature unique to Vespa. It is optimized for use cases like personal search and e-mail search - but is also useful in high-write applications querying a fraction of the total data set. With #13508, read throughput from storage increased up to 5x due to better parallelism.
Rank Features
- The (Native)fieldMatch rank features are optimized to use less CPU query time, improving query latency for Text Matching and Ranking.
- The new globalSequence rank feature is an inexpensive global ordering of documents in a system with stable system state. For a system where node indexes change, this is inaccurate. See globalSequence documentation for alternatives.
GKE Sample Application
Thank you to Thomas Griseau for contributing a new sample application for Vespa on GKE, which is a great way to start using Vespa on Kubernetes.
…
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
July 11, 2020
Aggregated Job list view for Pipeline details
Inderbir Singh Hair, student at the University of Waterloo
We have recently added a new feature: Aggregated Job list view for Pipeline details.
This feature adds a way to view the status of each job in a pipeline as a list, giving an at-a-glance view of the overall status of the pipeline.
An example of the aggregated job list view:
The list view can be seen by clicking the view toggle (highlighted in red) on the pipeline events tab:
The list view consists of 6 columns: Job, History, Duration, Start Time, Coverage, and Actions.
The Job column (highlighted in red) displays the most recent build status for a job along with the job’s name.
The History column (highlighted in red), provides a summary of the last 5 build statuses for the job, with the most recent build on the right:
Clicking on a status bubble, whether it be one from the history column or the one in the job column, will take you to the related build’s status page.
The Duration column (highlighted in red) displays how long it took to run the most recent build for the associated job.
The Start Time column (highlighted in red) displays when the most recent build for the associated job was started.
The Coverage column (highlighted in red) gives the SonarQube coverage for the associated job.
The Actions column (highlighted in red) allows 3 actions to be run for each job: starting a new build for the associated job (left), aborting the most recent build for the associated job (if it has yet to be completed) (center), and restarting the associated job from its latest build (right).
The list view does not have real-time data updates, instead, the refresh button (highlighted in red) can be used to update the list view’s data.
Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.924
- UI - v1.0.521
- Store - v3.11.1
- Launcher - v6.0.73
Contributors
Thanks to the following contributors for making this feature possible:
- InderH
- adong
- jithine
- tkyi
Questions & Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out through our various support channels. You can also visit us on Github and Slack.
July 8, 2020
Announcing Spicy Noise - Identify and Monitor WireGuard at Wire Speed
Jeff Atkinson, Principal Security Engineer, Verizon Media
Today we are excited to announce the release of Spicy Noise. This open source project was developed to address the need to identify and monitor WireGuard traffic at line speed with Zeek. The Spicy framework was chosen to build the protocol parser needed for this project. Please share your questions and suggestions by filing an issue on Github.
WireGuard was implemented on the Noise Protocol Framework to provide simple, fast, and secure cryptographic communication. Its popularity started within the Linux community due to its ability to run on Raspberry Pi and high end servers. The protocol has now been adopted and is being used cross platform. To explain how Spicy Noise works, let’s look at how Zeek and Spicy help monitor traffic.
Zeek is a network monitoring project that is robust and highly scalable. It supports multiple protocol analyzers on a standard install and provides invaluable telemetry for threat hunting and investigations. Zeek has been deployed on 100 gigabit networks.
Spicy is a framework provided by the Zeek community to build new protocol analyzers. It is replacing Binpac as a much simpler method to build protocol parsers. The framework has built-in integration with Zeek to enable analysis at line speed.
How it works
Zeek’s Architecture begins by reading packets from the network. The packets are then routed to “Event Engines” which parse the packets and forward events containing details of the packet. These events are presented to the “Policy Script Interpreter” where the details from the event can be acted upon by Zeek scripts. There are many scripts which ship with Zeek to generate logs and raise notifications. Many of these logs and notifications are forwarded to the SIEM of a SOC for analysis.
To build the capability to parse WireGuard traffic, a new “Event Engine” has been created. This is done with Spicy by defining how a packet is parsed and how events are created. Packet parsing is defined in a .spicy file. Events are defined in a .evt file, which forwards the details extracted by the .spicy parser to the “Policy Script Interpreter”. A dynamic protocol detection signature also has to be defined so Zeek knows how to route packets to the new Event Engine. Refer to the diagram below to understand the role of the .spicy and .evt files in the new WireGuard parser or “Event Engine”.
Technical Implementation
The first step to building a new “Event Engine” is to define how the packet is to be parsed. Referring to the WireGuard protocol specification, there are four main UDP datagram structures. The four datagram structures defined are the Handshake Initiation, Handshake Response, Cookie Reply, and Transport Data. The diagram below depicts how the client and server communicate.
We will focus on the first, the Handshake Initiation, but the same method applies to the other three packet structures. The following diagram from the WireGuard whitepaper illustrates the structure of the Handshake Initiation packet.
The sections of the packet are defined with their respective sizes. These details are used in the .spicy file to define how Spicy will handle the packet. Note that the first field is the packet type and a value of 1 defines it as a Handshake Initiation structured packet. Below is a code snippet of wg-1.spicy from the repository. A type is created to define the fields and their size or delimiters.
Spicy uses wg-1.spicy as the first part of the “Event Engine” to parse packets. The next part needed is to define events in the .evt file. An event is created for each packet type to pass values from the “Event Engine” to the “Policy Script Interpreter”.
The .evt file also includes an “Analyzer Setup” which defines the Analyzer_Name, Transport_Protocol, and additional details if needed.
The Analyzer_Name is used by dynamic protocol detection (DPD). Zeek reads packets and compares them against DPD signatures to identify which Analyzer or “Event Engine” to use. The WireGuard DPD signature looks for the first byte of a UDP datagram to be 1, followed by the reserved zeros as defined in the protocol specification. Below is the DPD signature created for matching on the WireGuard Handshake_Initiation packet, which is the first in the session.
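As a rough illustration of the byte-level check such a signature encodes (written here as a Java sketch rather than Zeek signature syntax, and assuming the three reserved zero bytes that follow the type byte in the WireGuard specification):

public class WireGuardDpdSketch {
    // Returns true when a UDP payload looks like a WireGuard Handshake Initiation:
    // type byte 0x01 followed by the reserved zero bytes.
    static boolean looksLikeHandshakeInitiation(byte[] udpPayload) {
        return udpPayload.length >= 4
                && udpPayload[0] == 0x01
                && udpPayload[1] == 0
                && udpPayload[2] == 0
                && udpPayload[3] == 0;
    }
}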
Now as Spicy or Zeek parse packets, anytime a packet is parsed by the Handshake_Initiation type it will generate an event. The event will include connection details stored in the $conn variable which is passed from the stream processor portion of the “Event Engine.” The additional fields are extracted from the packet as defined in the corresponding .spicy file type. These events are received by the “Policy Script Interpreter” and can be acted upon to create logs or raise notifications. Zeek scripts define which events to receive and what action is to be taken. The example below shows how the WireGuard::Initiation event can be used to set the service field in Zeek’s conn.log.
The conn.log file will now have events with a service of WireGuard.
Conclusion
WireGuard provides an encrypted tunnel which can be used to circumvent security controls. Zeek and Spicy provide a solution to enhance network telemetry, allowing a better understanding of the traffic. Standard network analysis can then be applied with the understanding that WireGuard is in use and encrypting the traffic.
July 7, 2020
Bindable: Open Source Themeable Design System Built in Aurelia JS for Faster and Easier Web Development
Joe Ipson, Software Dev Engineer, Verizon Media
Luke Larsen, Sr Software Dev Engineer, Verizon Media
As part of the Media Platform Video Team, we build and maintain a set of web applications that allow customers to manage their video content. We needed a way to be consistent in how we build these applications. Creating consistent layouts and interfaces can be a challenge, and there are many areas that can cause bloat or duplication of code. Some examples: coding multiple ways to build the same layout in the app, slight variations of the same red color scattered all over, and multiple functions being used to capitalize data returned from the database. To avoid cases like this, we built Bindable. Bindable is an open source design system that makes it possible to achieve consistency in colors, fonts, spacing, sizing, user actions, user permissions, and content conversion. We’ve found it helps us be consistent in how we build layouts and components and share code across applications. By making Bindable open source, we hope it will do the same for others.
Theming
One problem with using a design system or library is that you are often forced to use the visual style that comes with it. With Bindable you can customize how it looks to fit your visual style. This is accomplished through CSS custom properties. You can create your own custom theme by setting these variables and you will end up with your own visual style.
Modular Scale
Harmony in an application can be achieved by setting all the sizing and spacing to a value on a scale. Bindable has a modular scale built in. You can set the scale to whatever you wish and it will adjust. This means your application will have visual harmony. When you need, you can break out of the modular scale for custom sizing and spacing.
Aurelia
Aurelia is a simple, powerful, and unobtrusive javascript framework. Using Aurelia allows us to take advantage of its high performance and extensibility when creating components. Many parts of Bindable have added features thanks to Aurelia.
Tokens
Tokens are small building blocks that all other parts of Bindable use. They are CSS custom properties and set things like colors, fonts, and transitions.
Layouts
The issue of creating the same layout using multiple methods is solved by Layouts in Bindable. Some of the Layouts in Bindable make it easy to set a grid, sidebar, or cluster of items in a row. Layouts also handle all the spacing between components. This keeps all your spacing orderly and consistent.
Components
Sharing these components was one of the principal reasons the library exists. There are over 40 components available, and they are highly customizable depending on your needs.
Access Modifiers
Bindable allows developers to easily change the state of a component on a variety of conditions. Components can be hidden or disabled if a user lacks permission for a particular section of a page. Or maybe you just need to add a loading indicator to a button. These attributes make it easy to do either (or both!).
Value Converters
We’ve included a set of value converters that will take care of some of the most basic conversions for you. Things like sanitizing HTML, converting CSV data into an array, escaping a regex string, and even more simple things like capitalizing a string or formatting an ISO Date string.
Use, Contribute, & Reach Out
Explore the Bindable website for helpful details about getting started and to see detailed info about a given component. We are excited to share Bindable with the open source community. We look forward to seeing what others build with Bindable, especially Aurelia developers. We welcome pull requests and feedback! Watch the project on GitHub for updates. Thanks!
Acknowledgements
Cam Debuck, Ajit Gauli, Harley Jessop, Richard Austin, Brandon Drake, Dustin Davis
July 3, 2020
Change Announcement - JSON Web Key (JWK) for Public Elliptic-curve (EC) Key
Ashish Maheshwari, Software Engineer, Verizon Media
In this post, we will outline a change in the way we expose the JSON Web Key (JWK) for our public Elliptic-curve (EC) key at this endpoint: https://api.login.yahoo.com/openid/v1/certs, as well as the immediate steps users should take. Impacted users are any clients who parse our JWK to extract the EC public key to perform actions such as verifying a signed token.
The X and Y coordinates of our EC public key were padded with a sign bit, which caused them to overflow from 32 to 33 bytes. While most of the commonly used libraries that parse a JWK to a public key can handle the extra length, others might expect a length of exactly 32 bytes. For those, this can be a breaking change.
Here are the steps affected users should take:
- Any code or flow that extracts our EC public key from the JWK needs to be tested against this change. Below are our pre- and post-change production JWKs for the EC public key. Please verify that your code can successfully parse the new JWK. Notice the change in the base64url value of the Y coordinate in the new JWK.
We are planning to make this change live on July 20th, 2020. If you have any questions/comments, please tweet @YDN or email us.
Current production EC JWK:
{"keys":[{"kty":"EC","alg":"ES256","use":"sig","crv":"P-256","kid":"3466d51f7dd0c780565688c183921816c45889ad","x":"cWZxqH95zGdr8P4XvPd_jgoP5XROlipzYxfC_vWC61I","y":"AK8V_Tgg_ayGoXiseiwLOClkekc9fi49aYUQpnY1Ay_y"}]}
EC JWK after change is live:
{"keys":[{"kty":"EC","alg":"ES256","use":"sig","crv":"P-256","kid":"3466d51f7dd0c780565688c183921816c45889ad","x":"cWZxqH95zGdr8P4XvPd_jgoP5XROlipzYxfC_vWC61I","y":"rxX9OCD9rIaheKx6LAs4KWR6Rz1-Lj1phRCmdjUDL_I"}]}
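One hedged way to sanity-check your parsing ahead of the switch, using only the JDK, is to base64url-decode the Y coordinate and confirm its length; per the description above, the old value decodes to 33 bytes and the new one to 32:

import java.util.Base64;

public class JwkCoordinateCheck {
    public static void main(String[] args) {
        // "y" value from the new production JWK shown above.
        String y = "rxX9OCD9rIaheKx6LAs4KWR6Rz1-Lj1phRCmdjUDL_I";
        byte[] decoded = Base64.getUrlDecoder().decode(y);
        System.out.println("y coordinate length in bytes: " + decoded.length); // expect 32
    }
}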
June 30, 2020
Introducing vSSH - Go Library to Execute Commands Over SSH at Scale
Mehrdad Arshad Rad, Sr. Principal Software Engineer, Verizon Media
vSSH is a high-performance Go library designed to execute shell commands remotely on tens of thousands of network devices or servers over the SSH protocol. The vSSH high-level API provides additional functionality for developing network or server automation. It supports persistent SSH connections to execute shell commands over a warm connection and return data quickly.
If you manage multiple Linux machines or devices, you know how difficult it is to run commands on many machines every day, and you appreciate the significant value of automation. There are other open source SSH libraries available in a variety of languages, but vSSH has great features like persistent SSH connections, the ability to limit sessions, and the ability to limit the amount of data transferred, and it handles many SSH connections concurrently while using resources efficiently. Go developers can quickly create network device or server automation and tools by using this library, focusing on the business logic instead of handling SSH connections.
vSSH can run within your application asynchronously, and you can then call its APIs/methods from your application (it is safe for concurrent use). To start, load your clients’ information and add them to vSSH using a simple method. You can add labels and other optional attributes to each client. When you call the run method, vSSH sends the given command to all available clients or, based on your query, runs the command on specific clients; the results of the command can be received as a stream (in real time) or as a final result.
One of the main features of vSSH is a persistent connection to all devices and the ability to manage them. It can connect to all the configured devices/servers, all the time. The connections are simple authenticated connections without a session at the first stage. When vSSH needs to run a command, it creates a session and closes the session when the command completes. If you don’t need the persistence feature, you can disable it, which results in the connection closing at the end. The main advantage of persistence is that it works as a warm connection: once a run command is requested, vSSH only needs to create a session. The main use case is when you need to run commands on the clients continuously or when response time is important. In both cases, vSSH multiplexes sessions over one connection.
vSSH provides a DSL query feature based on the provided labels that you can use to select and filter clients. It supports operators like == and !=, or you can create your own logic. I wrote this feature with the Go abstract syntax tree (AST). This feature is very useful, as you can add many clients to the library at the same time and run different commands based on the labels.
Here are three features that you can use to control the load on the client and force termination of a running command:
- Limit the data returned from stdout or stderr, in bytes
- Terminate the command after a defined timeout
- Limit the number of concurrent sessions on the client
Use & Contribute
To learn more about vSSH, explore github.com/yahoo/vssh and try the vSSH examples at https://pkg.go.dev/github.com/yahoo/vssh.
June 15, 2020
Data Disposal - Open Source Java-based Big Data Retention Tool
By Sam Groth, Senior Software Engineer, Verizon Media
Do you have data in Apache Hadoop using Apache HDFS that is made available with Apache Hive? Do you spend too much time manually cleaning old data or maintaining multiple scripts? In this post, we will share why we created and open sourced the Data Disposal tool, as well as how you can use it.
Data retention is the process of keeping useful data and deleting data that may no longer be proper to store. Why delete data? It could be too old, consume too much space, or be subject to legal retention requirements to purge data within a certain time period of acquisition.
Retention tools generally handle deleting data entities (such as files, partitions, etc.) based on three criteria: duration, granularity, and date format.
1. Duration: The length of time before the current date. For example, 1 week, 1 month, etc.
2. Granularity: The frequency that the entity is generated. Some entities like a dataset may generate new content every hour and store this in a directory partitioned by date.
3. Date Format: Data is generally partitioned by a date so the format of the date needs to be used in order to find all relevant entities.
Introducing Data Disposal
We found that many of the existing tools we looked at lacked critical features we needed, such as a configurable date format for parsing the directory path or partition of the data, and an extensible code base for meeting current as well as future requirements. Each tool was also built for retention with a specific system like Apache Hive or Apache HDFS instead of providing a generic tool. This inspired us to create Data Disposal.
The Data Disposal tool currently supports the two main use cases discussed below but the interface is extensible to any other data stores in your use case.
1. File retention on the Apache HDFS.
2. Partition retention on Apache Hive tables.
Disposal Process
The basic process for disposal is 3 steps:
- Read the provided yaml config files.
- Run Apache Hive Disposal for all Hive config entries.
- Run Apache HDFS Disposal for all HDFS config entries.
The order of the disposals is significant in that if Apache HDFS disposal ran first, it would be possible for queries to Apache Hive to have missing data partitions.
Key Features
The interface and functionality are coded in Java using the Apache HDFS Java API and the Apache Hive HCatClient API.
1. Yaml config provides a clean interface to create and maintain your retention process.
2. Flexible date formatting using Java’s SimpleDateFormat when the date is stored in an Apache HDFS file path or in an Apache Hive partition key.
3. Flexible granularity using Java’s ChronoUnit.
4. Ability to schedule with your preferred scheduler.
The current use cases all use Screwdriver, which is an open source build platform designed for continuous delivery, but using other schedulers like cron, Apache Oozie, Apache Airflow, or a different scheduler would be fine.
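As a rough sketch of how duration, granularity, and date format combine into a disposal decision (a hypothetical helper, not the tool’s actual code; it uses java.time for brevity where the tool itself relies on SimpleDateFormat and ChronoUnit as noted above):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class RetentionSketch {
    // True when a date-partitioned entity (e.g. a directory named 20200401) is older than the cutoff.
    static boolean shouldDispose(String partitionDate, long duration, ChronoUnit granularity, String pattern) {
        DateTimeFormatter format = DateTimeFormatter.ofPattern(pattern);   // e.g. "yyyyMMdd"
        LocalDate partition = LocalDate.parse(partitionDate, format);
        LocalDate cutoff = LocalDate.now().minus(duration, granularity);   // e.g. 30, ChronoUnit.DAYS
        return partition.isBefore(cutoff);
    }
}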
Future Enhancements
We look forward to making the following enhancements:
1. Retention for other data stores based on your requirements.
2. Support for file retention when configuring Apache Hive retention on external tables.
3. Any other requirements you may have.
Contributions are welcome! The Data team located in Champaign, Illinois, is always excited to accept external contributions. Please file an issue to discuss your requirements.
June 3, 2020
Local build
Screwdriver.cd offers powerful features such as templates, commands, secrets, and metadata, which can be used to simplify build settings or as build parameters. However, it’s difficult to reproduce equivalent features for local builds.
Although you can use these features by uploading your changes to an SCM such as GitHub, you may feel like it is a pain to upload your changes over and over in order to get a successful build. With sd-local, you can easily make sure the build is not corrupted before uploading changes to SCM and debug the build locally if it fails.
Note:
Because sd-local works with Screwdriver.cd, it does not work by itself. If you don’t have a Screwdriver.cd cluster, you need to set it up first.
See the documentation at https://docs.screwdriver.cd/cluster-management/.
How to Install
sd-local uses Docker internally, so make sure you have Docker Engine installed locally.
https://www.docker.com/
The next step is to install sd-local. Download the latest version of sd-local from the GitHub release page below and grant execute permission to it.
https://github.com/screwdriver-cd/sd-local/releases
$ mv sd-local_*_amd64 /usr/local/bin/sd-local
$ chmod +x /usr/local/bin/sd-local
Build configuration
Configure sd-local to use the templates and commands registered in your Screwdriver.cd cluster. sd-local communicates with the following SD components:
- API
  - Validating screwdriver.yaml
  - Getting a template
- Store
  - Getting a command
$ sd-local config set api-url https://
May 29, 2020
Vespa Product Updates, May 2020: Improved Slow Node Tolerance, Multi-Threaded Rank Profile Compilation, Reduced Peak Memory at Startup, Feed Performance Improvements, & Increased Tensor Performance
Kristian Aune, Tech Product Manager, Verizon Media
In the April updates, we mentioned Improved Performance for Large Fan-out Applications, Improved Node Auto-fail Handling, CloudWatch Metric Import and CentOS 7 Dev Environment. This month, we’re excited to share the following updates:
Improved Slow Node Tolerance
To improve query scaling, applications can group content nodes to balance static and dynamic query cost. The largest Vespa applications use a few hundred nodes. This is a great feature to optimize cost vs performance in high-query applications. Since Vespa-7.225.71, the adaptive dispatch policy is made default. This balances load to the node groups based on latency rather than just round robin - a slower node will get less load and overall latency is lower.
Multi-Threaded Rank Profile Compilation
Queries are using a rank profile to score documents. Rank profiles can be huge, like machine learned models. The models are compiled and validated when deployed to Vespa. Since Vespa-7.225.71, the compilation is multi-threaded, cutting compile time to 10% for large models. This makes content node startup quicker, which is important for rolling upgrades.
Reduced Peak Memory at Startup
Attributes is a unique Vespa feature used for high feed performance for low-latency applications. It enables writing directly to memory for immediate serving. At restart, these structures are reloaded. Since Vespa-7.225.71, the largest attribute is loaded first, to minimize temporary memory usage. As memory is sized for peak usage, this cuts content node size requirements for applications with large variations in attribute size. Applications should keep memory at less than 80% of AWS EC2 instance size.
Feed Performance Improvements
At times, batches of documents are deleted. This subsequently triggers compaction. Since Vespa-7.227.2, compaction is blocked at high removal rates, reducing overall load. Compaction resumes once the remove rate is low again.
Increased Tensor Performance
Tensor is a field type used in advanced ranking expressions, with heavy CPU usage. Simple tensor joins are now optimized and more optimizations will follow in June.
…
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
May 28, 2020
Kata Containers in Screwdriver
Screwdriver is a scalable CI/CD solution which uses Kubernetes to manage user builds. Screwdriver build workers interface with Kubernetes using either “executor-k8s” or “executor-k8s-vm”, depending on the required build isolation.
executor-k8s runs builds directly as Kubernetes pods, while executor-k8s-vm uses HyperContainers along with Kubernetes for stricter build isolation with containerized Virtual Machines (VMs). This setup was ideal for running builds in an isolated, ephemeral, and lightweight environment. However, HyperContainer is now deprecated and unsupported, is based on an older Docker runtime, and requires a non-native Kubernetes setup for build execution. Therefore, it was time to find a new solution.
Why Kata Containers ?
Kata Containers is an open source project and community that builds a standard implementation of lightweight virtual machines (VMs) that perform like containers but provide the workload isolation and security advantages of VMs. It combines the benefits of using a hypervisor, such as enhanced security, with the container orchestration capabilities provided by Kubernetes. It comes from the same team behind HyperD, which successfully merged the best parts of Intel Clear Containers with Hyper.sh RunV. As a Kubernetes runtime, Kata enables us to deprecate executor-k8s-vm and use executor-k8s exclusively for all Kubernetes-based builds.
Screwdriver journey to Kata
As we faced a growing number of instabilities with the current HyperD - like network and devicemapper issues and IP cleanup workarounds, we started our initial evaluation of Kata in early 2019 (https://github.com/screwdriver-cd/screwdriver/issues/818#issuecomment-482239236) and identified two major blockers to move ahead with Kata:
1. Security concern for privileged mode (required to run docker daemon in kata)
2. Disk performance.
We recently started reevaluating Kata in early 2020, based on a fix to “add flag to overload default privileged host device behaviour” provided by containerd/cri (https://github.com/containerd/cri/pull/1225). We still faced issues with disk performance, but switching from overlayfs to devicemapper yielded significant improvement. With our two major blockers resolved and initial tests with Kata looking promising, we moved ahead with Kata.
Screwdriver Build Architecture
Replacing Hyper with Kata led to a simpler build architecture. We were able to remove the custom build setup scripts to launch Hyper VM and rely on native Kubernetes setup.
Setup
To use Kata containers for running user builds in a Screwdriver Kubernetes build cluster, a cluster admin needs to configure Kubernetes to use the Containerd container runtime with the CRI plugin.
Components:
Screwdriver build Kubernetes cluster nodes (minimum version: 1.14+) must have the following components set up in order to use Kata containers for user builds.
Containerd:
Containerd is a container runtime that helps with management of the complete lifecycle of the container.
Reference: https://containerd.io/docs/getting-started/
CRI-Containerd plugin:
Cri-Containerd is a containerd plugin which implements the Kubernetes container runtime interface (CRI). The CRI plugin interacts with containerd to manage containers.
Reference: https://github.com/containerd/cri
Image credit: containerd / cri. Photo licensed under CC-BY-4.0.
Architecture:
Image credit: containerd / cri. Photo licensed under CC-BY-4.0.
Installation:
Reference: https://github.com/containerd/cri/blob/master/docs/installation.md, https://github.com/containerd/containerd/blob/master/docs/ops.md
Crictl:
Crictl is used to debug, inspect, and manage pods, containers, and container images.
Reference: https://github.com/containerd/cri/blob/master/docs/crictl.md
Kata:
Kata builds lightweight virtual machines that seamlessly plug into the containers ecosystem.
Architecture:
Image credit: kata-containers Project licensed under Apache License Version 2.0
Installation:
- https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#run-kata-containers-with-kubernetes
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md
- https://github.com/kata-containers/documentation/blob/master/how-to/how-to-use-k8s-with-cri-containerd-and-kata.md
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md#kubernetes-runtimeclass
- https://github.com/kata-containers/documentation/blob/master/how-to/containerd-kata.md#configuration
Routing builds to Kata in Screwdriver build cluster
Screwdriver uses Runtime Class to route builds to Kata nodes in Screwdriver build clusters. The Screwdriver plugin executor-k8s config handles this based on:
1. Pod configuration:
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
  namespace: sd-build-namespace
  labels:
    sdbuild: "sd-kata-build"
    app: screwdriver
    tier: builds
spec:
  runtimeClassName: kata
  containers:
    - name: "sd-build-container"
      image: <
May 5, 2020
Vespa Product Updates, April 2020: Improved Performance for Large Fan-out Applications, Improved Node Auto-fail Handling, CloudWatch Metric Import, & CentOS 7 Dev Environment
Kristian Aune, Tech Product Manager, Verizon Media
In the previous update, we mentioned Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder and Hadoop Integration. This month, we’re excited to share the following updates:
Improved Performance for Large Fan-out Applications
Vespa container nodes execute queries by fanning out to a set of content nodes that evaluate parts of the data in parallel. When the fan-out or the partial results from each node are large, this can cause bandwidth to run out. Vespa now provides an optimization which lets you control the tradeoff between the size of the partial results and the probability of getting a 100% global result. As it works out, tolerating a small probability of less than 100% correctness gives a large reduction in network usage. Read more.
Improved Node Auto-fail Handling
Whenever content nodes fail, data is auto-migrated to other nodes. This consumes resources on both sender and receiver nodes, competing with resources used for processing client operations. Starting with Vespa-7.197, we have improved operation and thread scheduling, which reduces the impact on client document API operation latencies when a node is under heavy migration load.
CloudWatch Metric Import
Vespa metrics can now be pushed or pulled into AWS CloudWatch. Read more in monitoring.
CentOS 7 Dev Environment
A development environment for Vespa on CentOS 7 is now available. This ensures that the turnaround time between code changes and running unit tests and system tests is short, and makes it easier to contribute to Vespa.
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
April 27, 2020
Yahoo Knowledge Graph Announces COVID-19 Dataset, API, and Dashboard with Source Attribution
Amit Nagpal, Sr. Director, Software Development Engineering, Verizon Media
Among many interesting teams at Verizon Media is the Yahoo Knowledge (YK) team. We build the Yahoo Knowledge Graph, one of the few web scale knowledge graphs in the world. Our graph contains billions of facts and entities that enrich user experiences and power AI across Verizon Media properties. At the onset of the COVID-19 pandemic, we felt the need and responsibility to put our web scale extraction technologies to work and see how we could help. We have started to extract COVID-19 statistics from hundreds of sources around the globe into what we call the YK-COVID-19 dataset. The YK-COVID-19 dataset provides data and knowledge that help inform our readers on Yahoo News, Yahoo Finance, Yahoo Weather, and Yahoo Search. We created this dataset by carefully combining and normalizing raw data provided entirely by government and public health authorities. We provide website-level provenance for every single statistic in our dataset, so our community has the confidence it needs to use it scientifically and report with transparency. After weeks of hard work, we are ready to make this data public in an easily consumable format at the YK-COVID-19-Data GitHub repo.
A dataset alone does not always tell the full story. We reached out to teams across Verizon Media to get their help in building a set of tools that can help us, and you, build dashboards and analyze the data. Engineers from the Verizon Media Data team in Champaign, Illinois volunteered to build an API and dashboard. The API was constructed using a previously published Verizon Media open source platform called Elide. The dashboard was constructed using Ember.js, Leaflet and the Denali design system. We still needed a map tile server and were able to use the Verizon Location Technology team’s map tile service powered by HERE. We leveraged Screwdriver.cd, our open source CI/CD platform to build our code assets, and our open source Athenz.io platform to secure our applications running in our Kubernetes environment. We did this using our open source K8s-athenz-identity control plane project. You can see the result of this incredible team effort today at https://yahoo.github.io/covid-19-dashboard.
Build With Us
You can build applications that take advantage of the YK-COVID-19 dataset and API yourself. The YK-COVID-19 dataset is made available under a Creative Commons CC-BY-NC 4.0 license. Anyone seeking to use the YK-COVID-19 dataset for other purposes is encouraged to submit a request.
Feature Roadmap
Updated multiple times a day, the YK-COVID-19 dataset provides reports of country, state, and county-level data based on the availability of data from our many sources. We plan to offer more coverage, granularity, and metadata in the coming weeks.
Why a Knowledge Graph?
A knowledge graph is information about real world entities, such as people, places, organizations, and events, along with their relations, organized as a graph. We at Yahoo Knowledge have the capability to crawl, extract, combine, and organize information from thousands of sources. We create refined information used by our brands and our readers on Yahoo Finance, Yahoo News, Yahoo Search, and other sites too.
We built our web scale knowledge graph by extracting information from web pages around the globe. We apply information retrieval techniques, natural language processing, and computer vision to extract facts from a variety of formats such as html, tables, pdf, images and videos. These facts are then reconciled and integrated into our core knowledge graph that gets richer every day. We applied some of these techniques and processes relevant in the COVID-19 context to help gather information from hundreds of public and government authoritative websites. We then blend and normalize this information into a single combined COVID-19 specific dataset with some human oversight for stability and accuracy. In the process, we preserve provenance information, so our users know where each statistic comes from and have the confidence to use it for scientific and reporting purposes with attribution. We then pull basic metadata such as latitude, longitude, and population for each location from our core knowledge graph. We also include a Wikipedia id for each location, so it is easy for our community to attach additional metadata, as needed, from public knowledge bases such as Wikimedia or Wikipedia.
We’re in this together. So we are publishing our data along with a set of tools that we’re contributing to the open source community. We offer these tools, data, and an invitation to work together on getting past the raw numbers.
Yahoo, Verizon Media, and Verizon Location Technology are all part of the family at Verizon.
April 15, 2020
Dash Open 21: Athenz - Open Source Platform for X.509 Certificate-based Service AuthN & AuthZ
By Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Gil Yehuda (Sr. Director, Open Source) interviews Mujib Wahab (Sr. Director, Software Dev Engineering) and Henry Avetisyan (Distinguished Software Dev Engineer). Mujib and Henry discuss why Verizon Media open sourced Athenz, a platform for X.509 Certificate-based Service Authentication and Authorization. They also share how others can use and contribute to Athenz.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
Dash Open 21: Athenz - Open Source Platform for X.509 Certificate-based Service AuthN & AuthZApril 15, 2020
|
|
April 2, 2020 |
April 2, 2020
Introducing Queue ServicePritam Paul, Software Engineer, Verizon Media
We have recently made changes to the underlying Screwdriver Architecture for build processing. Previously, the executor-queue was tightly-coupled to the SD API and worked by constantly polling for messages at specific intervals. Due to this design, the queue would block API requests. Furthermore, if the API crashed, scheduled jobs might not be added to the queue, causing cascading failures.
Hence, keeping the principles of separation of concerns and abstraction in mind, we designed a more resilient REST-API-based queueing system: the Queue Service. This new service reads, writes, and deletes messages from the queue after processing. It also encompasses the former capability of the queue-worker and acts as a scheduler.
Authentication
The SD API and Queue Service communicate bidirectionally using signed JWT tokens sent via the auth headers of each request.
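Below is a minimal sketch of this pattern in Python (the actual services are Node.js); the claim names, key handling, and endpoint path are illustrative assumptions rather than the real Screwdriver implementation.
# Minimal sketch: the caller signs a short-lived JWT and sends it in the
# Authorization header; the receiving service verifies the signature before
# processing the message. Claims, endpoint, and key file are assumptions.
import time
import jwt        # PyJWT
import requests

PRIVATE_KEY = open("sd-api-private.pem").read()   # hypothetical key file

def signed_headers():
    token = jwt.encode(
        {"iss": "sd-api", "scope": ["sdapi"], "exp": int(time.time()) + 60},
        PRIVATE_KEY,
        algorithm="RS256",
    )
    return {"Authorization": f"Bearer {token}"}

# Enqueue a build message on a hypothetical Queue Service endpoint.
resp = requests.post(
    "http://sdqueuesvc.screwdriver.svc.cluster.local/v1/queue/message",
    json={"buildId": 12345, "status": "QUEUED"},
    headers=signed_headers(),
)
resp.raise_for_status()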
Build Sequence
Design Document
For more details, check out our design spec.
Using Queue Service
As a cluster admin, to configure using the queue as an executor, you can deploy the queue-service as a REST API using a screwdriver.yaml and update configuration in SD API to point to the new service endpoint:
# config/default.yaml
ecosystem:
  # Externally routable URL for the User Interface
  ui: https://cd.screwdriver.cd
  # Externally routable URL for the Artifact Store
  store: https://store.screwdriver.cd
  # Badge service (needs to add a status and color)
  badges: https://img.shields.io/badge/build–.svg
  # Internally routable FQDNS of the queue service
  queue: http://sdqueuesvc.screwdriver.svc.cluster.local
executor:
  plugin: queue
  queue: ''
For more configuration options, see the queue-service documentation.
Compatibility List
In order to use the new workflow features, you will need these minimum versions:
- UI - v1.0.502
- API - v0.5.887
- Launcher - v6.0.56
- Queue-Service - v1.0.11
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- klu909
- jithine
- parthasl
- pritamstyz4ever
- tkyi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Introducing Queue ServiceApril 2, 2020
|
|
March 29, 2020 |
March 29, 2020
Dash Open 20: The Benefits of Presenting at MeetupsBy Rosalie Bartlett, Open Source Community, Verizon Media
In this episode, Ashley Wolf, Open Source Program Manager, interviews Eran Shapira, Software Development Engineering Manager, Verizon Media. Based in Tel Aviv, Israel, Eran manages the video activation team. Eran shares about his team’s focus, which technology he’s most excited about right now, the value of presenting at meetups, and his advice for being a great team member.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
P.S. Learn more about job opportunities (backend engineer, product manager, research scientist, and many others!) at our Tel Aviv and Haifa offices here.
Dash Open 20: The Benefits of Presenting at MeetupsMarch 29, 2020
|
|
March 28, 2020 |
March 28, 2020
Search COVID-19 Open Research Dataset (CORD-19) using Vespa - Open Source Big Data Serving EngineKristian Aune, Tech Product Manager, Verizon Media
After being made aware of the COVID-19 Open Research Dataset Challenge (CORD-19), where AI experts have been asked to create text and data mining tools that can help the medical community, the Vespa team wanted to contribute.
Given our experience with big data at Yahoo (now Verizon Media) and creating Vespa (open source big data serving engine), we thought the best way to help was to index the dataset, which includes over 44,000 scholarly articles, and to make it available for searching via Vespa Cloud.
Now live at https://cord19.vespa.ai, you can get started with a few of the sample queries or for more advanced queries, visit CORD-19 API Query. Feel free to tweet us @vespaengine or submit an issue, if you have any questions or suggestions.
Please expect daily updates to the documentation and query features. Contributions are appreciated - please refer to our contributing guide and submit PRs. You can also download the application, index the data set, and improve the service. More info here on how to run Vespa.ai on your own computer.
Search COVID-19 Open Research Dataset (CORD-19) using Vespa - Open Source Big Data Serving EngineMarch 28, 2020
|
|
March 23, 2020 |
March 23, 2020
Dash Open 19: KDD - Understanding Consumer Journey using Attention-based Recurrent Neural NetworksBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Shaunak Mishra, Sr. Research Scientist, Verizon Media. Shaunak discusses two papers he presented at Knowledge Discovery and Data Mining (KDD) - “Understanding Consumer Journey using Attention-based Recurrent Neural Networks” and “Learning from Multi-User Activity Trails for B2B Ad Targeting”.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
Dash Open 19: KDD - Understanding Consumer Journey using Attention-based Recurrent Neural NetworksMarch 23, 2020
|
|
March 16, 2020 |
March 16, 2020
Introducing Accessible Audio Charts - An Open Source Initiative for Android AppsSukriti Chadha, Senior Product Manager, Verizon Media
Finance charts quickly render hundreds of data points making it seamless to analyze a stock’s performance. Charts are great for people who can see well. Those who are visually impaired often use screen readers. For them, the readers announce the data points in a table format. Beyond a few data points, it becomes difficult for users to create a mental image of the chart’s trend. The audio charts project started with the goal of making Yahoo Finance charts accessible to users with visual impairment. With audio charts, data points are converted to tones with haptic feedback and are easily available through mobile devices where users can switch between tones and spoken feedback.
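To illustrate the core idea (this is not the open-sourced SDK, just a sketch in Python), a chart can be sonified by mapping each data point onto an audible frequency range so that an upward trend is heard as rising pitch; the frequency range below is an arbitrary assumption.
# Map raw data points to tone frequencies so a trend can be heard instead of seen.
def values_to_tone_frequencies(values, f_min=200.0, f_max=1000.0):
    """Scale each value into the [f_min, f_max] Hz range."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                       # avoid division by zero on flat data
    return [f_min + (v - lo) / span * (f_max - f_min) for v in values]

closes = [152.3, 150.8, 155.1, 158.7, 157.2]      # e.g. daily closing prices
print(values_to_tone_frequencies(closes))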
The idea for the accessible charts solution was first discussed during a conversation between Sukriti Chadha, from the Yahoo Finance team, and Jean-Baptiste Queru, a mobile architect. After building an initial prototype, they worked with Mike Shebanek, Darren Burton and Gary Moulton from the Accessibility team to run user studies and make improvements based on feedback. The most important lesson learned through research and development was that users want a nuanced, customizable solution that works for them in their unique context, for the given product.
Accessible charts were launched on the production versions of the Yahoo Finance Android and iOS apps in 2019 and have since seen positive reception from screen reader users. The open source effort was led by Yatin Kaushal and Joao Birk on engineering, Kisiah Timmons on the Verizon Media accessibility team, and Sukriti Chadha on product.
We would love for other mobile app developers to have this solution, adapt to their users’ needs and build products that go from accessible to truly usable. We also envision applications of this approach in voice interfaces and contextual vision limitation scenarios. Open sourcing this version of the solution marks an important first step in this initiative.
To integrate the SDK, simply clone or fork the repository. The UI components and audio conversion modules can be used separately and modified for individual use cases. Please refer to detailed instructions on integration in the README. This library is the Android version of the solution, which can be replicated on iOS with similar logic. While this implementation is intended to serve as reference for other apps, we will review requests and comments on the repository.
We are so excited to make this available to the larger developer community and can’t wait to see how other applications take the idea forward! Please reach out to finance-android-dev@verizonmedia.com for questions and requests.
Introducing Accessible Audio Charts - An Open Source Initiative for Android AppsMarch 16, 2020
|
|
March 3, 2020 |
March 3, 2020
Vespa Product Updates, February 2020: Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder, and Hadoop IntegrationKristian Aune, Tech Product Manager, Verizon Media
In the January Vespa product update, we mentioned Tensor Operations, New Sizing Guides, Performance Improvements for Matched Elements in Map/Array-of-Struct, and Boolean Query Optimizations. This month, we’re excited to share the following updates:
Ranking with LightGBM Models
Vespa now supports LightGBM machine learning models in addition to ONNX, Tensorflow and XGBoost. LightGBM is a gradient boosting framework that trains fast, has a small memory footprint, and provides similar or improved accuracy to XGBoost. LightGBM also supports categorical features.
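As a rough illustration, the snippet below trains a toy LightGBM model in Python and dumps it to JSON, the format Vespa consumes; the file name and how the model is referenced from a rank-profile are assumptions, so consult the Vespa documentation for the exact application-package layout.
# Train a toy LightGBM model and export it as JSON for use in a Vespa application package.
import json
import numpy as np
import lightgbm as lgb

X = np.random.rand(1000, 5)                        # toy feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)    # toy relevance labels

train_set = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "binary", "num_leaves": 31}, train_set, num_boost_round=50)

with open("lightgbm_model.json", "w") as f:
    json.dump(booster.dump_model(), f)             # JSON dump is what Vespa reads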
Matrix Multiplication Performance
Vespa now uses OpenBLAS for matrix multiplication, which improves performance in machine-learned models using matrix multiplication.
Benchmarking Guide
Teams use Vespa to implement applications with strict latency requirements and minimal cost. In January, we released a new sizing guide. This month, we’re adding a benchmarking guide that you can use to find the perfect spot between cost and performance.
Query Builder
Thanks to contributions from yehzu, Vespa now has a fluent library for composing queries - explore the client module for details.
Hadoop Integration
Vespa is integrated with Hadoop and easy to feed from a grid. The grid integration now also supports conditional writes, see #12081.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
Vespa Product Updates, February 2020: Ranking with LightGBM Models, Matrix Multiplication Performance, Benchmarking Guide, Query Builder, and Hadoop IntegrationMarch 3, 2020
|
|
March 2, 2020 |
March 2, 2020
Introducing Proxy Verifier - Open Source Tool for Testing HTTP Based ProxiesAlan M. Carroll and Brian Neradt, Software Engineers, Verizon Media
We’re pleased to announce Proxy Verifier - an open source tool for testing HTTP based proxies. Originally built as part of Verizon Media’s support for Apache Traffic Server (ATS) to improve testability and reliability, Proxy Verifier generates traffic through a proxy and verifies the behavior of the proxy. A key difference between Proxy Verifier and existing HTTP based test tools is Proxy Verifier verifies traffic to and from the proxy. This bi-directional ability was a primary motivation. In addition, handling traffic on both sides of the proxy means a Proxy Verifier setup can run in a network disconnected environment, which was an absolute requirement for this work - no other servers are required, and the risk of hitting production servers with test traffic is eliminated.
After sharing the idea for Proxy Verifier with the Apache Traffic Server community, we’ve received significant external interest. We are pleased to have achieved a level of maturity with the tool’s development that we can now share it with the world by open sourcing it. As a related benefit, by open sourcing Proxy Verifier we will also be able to use it as a part of Traffic Server’s end-to-end test automation.
Within Verizon Media, Proxy Verifier serves to support correctness, production simulation, and load testing. Generated and captured replay files are used for production simulation and load testing. Handbuilt replay files are used for debugging and correctness testing. Replay files are easily constructed by hand based on use cases or packet capture files, and also easily edited and extended later. Proxy Verifier is being integrated into the AuTest framework used in ATS for automated end-to-end testing.
Proxy Verifier builds two executables, the client and server, which are used to test the proxy:
The client sends requests to the proxy under test, which in turn is configured to send them to the server. The server parses the request from the proxy and sends a response, which the proxy then sends back to the client. This traffic is controlled by a "replay file", a YAML-formatted configuration file that contains each transaction as four messages: client to proxy, proxy to server, server to proxy, and proxy to client.
Transactions can be grouped into sessions, each of which represents a single connection from the client to the proxy.
This set of events is depicted in the following sequence diagram:
Because the Proxy Verifier server needs only the replay file and no other configuration, it is easy for a developer to use it as a test HTTP server instead of setting up and configuring a full web server.
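As a purely hypothetical illustration of that structure (the authoritative schema is in the Proxy Verifier documentation, and the keys below are invented), a replay file holding one session with a single transaction and its four messages could be generated like this:
# Build a hypothetical replay file: one session, one transaction, four messages.
import yaml   # PyYAML

replay = {
    "sessions": [{
        "transactions": [{
            "client-request":  {"method": "GET", "url": "/index.html",
                                "headers": {"Host": "example.com"}},
            "proxy-request":   {"headers": {"Host": "example.com"}},
            "server-response": {"status": 200,
                                "headers": {"Content-Type": "text/html"}},
            "proxy-response":  {"status": 200},
        }]
    }]
}

with open("replay.yaml", "w") as f:
    yaml.safe_dump(replay, f, sort_keys=False)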
Other key features:
- Fine-grained control of what is sent from the client and server, along with what is expected from the proxy.
- Specific fields in the proxy request or response can be checked against one of three criteria: the presence of a field, the absence of a field, or the presence of a field with a specific value.
- Transactions in the config can be run once or repeatedly a specified number of times.
- Sessions allow control of how much a client session is reused.
- Transactions can be sent at a fixed rate to help simulate production level loads. Proxy Verifier has been tested up to over 10K RPS sustained.
- The “traffic_dump” plugin for ATS can be used to capture production traffic for later testing with Proxy Verifier.
- Protocol support:
- IPv4 and IPv6 support.
- HTTP/1.x support for both the Verifier client and server.
- The Verifier client supports HTTP/2 but the server currently does not. We have plans to support server-side HTTP/2 sometime before the end of Q2 2020.
- HTTPS with TLS 1.3 support (assuming Proxy Verifier is linked against OpenSSL 1.1.1 or higher).
For build and installation instructions, explore the github README page. Please file github issues for bugs or enhancement requests.
Acknowledgments
We would like to thank several people whose work contributed to this project:
- Syeda “Persia” Aziz, initial work and proof of concept for the replay server.
- Jesse Zhang, previous generation prototype and the schema.
- Will Wendorf, initial verification logic.
- Susan Hinrichs, implemented the client side HTTP/2 support.
Introducing Proxy Verifier - Open Source Tool for Testing HTTP Based ProxiesMarch 2, 2020
|
|
February 28, 2020 |
February 28, 2020
Remote JoinTiffany Kyi, Software Engineer, Verizon Media
We have recently rolled out a new feature: Remote Join.
Previously, with remote triggers, users could kick off jobs in external pipelines by requiring a job from another one. With this new remote join feature, users can do parallel forks and join with jobs from external pipelines.
An example of external parallel fork join in the Screwdriver UI:
User configuration
Make sure your cluster admin has the proper configuration set to support this feature.
In order to use this new feature, you can configure your screwdriver.yaml similar to how remote triggers are done today. Just as with normal jobs, remote triggers will follow the rules:
- ~ tilde prefix denotes logical [OR]
- Omitting the ~ tilde prefix denotes logical [AND]
Example
Pipeline 3 screwdriver.yaml:
shared:
  image: node:12
  steps:
    - echo: echo hi
jobs:
  main:
    requires: [~commit, ~pr]
  internal_fork:
    requires: [main]
  join_job:
    requires: [internal_fork, sd@2:external_fork, sd@4:external_fork]
Pipeline 2 screwdriver.yaml:
shared:
  image: node:12
  steps:
    - echo: echo hi
jobs:
  external_fork:
    requires: [~sd@3:main]
Pipeline 4 screwdriver.yaml:
shared:
  image: node:12
  steps:
    - echo: echo hi
jobs:
  external_fork:
    requires: [~sd@3:main]
Caveats
- In the downstream remote job, you’ll need to use ~ tilde prefix for the external requires
- This feature is only guaranteed one external dependency level deep
- This feature currently does not work with PR chain
- The event list on the right side of the UI might not show the complete mini-graph for the event
Cluster Admin configuration
In order to enable this feature in your cluster, you'll need to update your Screwdriver cluster's configuration by setting the EXTERNAL_JOIN custom environment variable to true.
Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.877
- UI - v1.0.494
- Store - v3.10.5
- Launcher - v6.0.12
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- d2lam
- jithine
- klu909
- tkyi
Questions & Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Remote JoinFebruary 28, 2020
|
|
February 24, 2020 |
February 24, 2020
Dash Open 18: A chat with Joshua Simmons, Vice President, Open Source InitiativeBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Joshua Simmons, Vice President, Open Source Initiative (OSI). Joshua discusses the Open Source Initiative (OSI), a global non-profit championing software freedom in society through education, collaboration, and infrastructure. Joshua also highlights trends in the open source landscape and potential future changes.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
Dash Open 18: A chat with Joshua Simmons, Vice President, Open Source InitiativeFebruary 24, 2020
|
|
February 22, 2020 |
February 22, 2020
Dash Open 17: A chat with Neil McGovern, Executive Director, GNOME FoundationBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Neil McGovern, Executive Director, GNOME Foundation. Neil shares how he originally became involved with open source, the industry changes he has observed, and his focus at the GNOME Foundation, a non-profit organization that furthers the goals of the GNOME Project, helping it to create a free software computing platform for the general public that is designed to be elegant, efficient, and easy to use.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
Dash Open 17: A chat with Neil McGovern, Executive Director, GNOME FoundationFebruary 22, 2020
|
|
February 19, 2020 |
February 19, 2020
Dash Open 16: OSCON 2019 - A chat with Rachel Roumeliotis, VP Content Strategy, O'Reilly MediaBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Rachel Roumeliotis, Vice President of Content Strategy at O'Reilly Media. Rachel reflects on OSCON 2019 themes, what to expect at OSCON 2020, where the industry is going, and how she empowers her team to be great storytellers.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
Dash Open 16: OSCON 2019 - A chat with Rachel Roumeliotis, VP Content Strategy, O'Reilly MediaFebruary 19, 2020
|
|
February 14, 2020 |
February 14, 2020
Improvements and FixesScrewdriver Team from Verizon Media
UI
- Enhancement: Upgrade to node.js v12.
- Enhancement: Users can now link to custom test & coverage URL via metadata.
- Enhancement: Reduce number of API calls to fetch active build logs.
- Enhancement: Display proper title for Commands and Templates pages.
- Bug fix: Hide “My Pipelines” from Add to collection dialogue.
- Enhancement: Display usage stats for a template.
API
- Enhancement: Upgrade to node.js v12.
- Enhancement: Reduce DB Size by removing steps column from builds.
- Enhancement: New API to display usage metrics of a template.
- Bug fix: Restarting builds multiple times now carries over proper context.
Store
- Enhancement: Upgrade to node.js v12.
- Enhancement: Support for private AWS S3 buckets.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- UI - v1.0.491
- API - v0.5.851
- Store - v3.10.5
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- jithine
- klu909
- InderH
- djerraballi
- tkyi
- wahapo
- tk3fftk
- sugarnaoming
- kkisic
- kkokufud
- sakka2
- yuichi10
- s-yoshika
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Improvements and FixesFebruary 14, 2020
|
|
January 30, 2020 |
January 30, 2020
Build cache - Disk strategy
Screwdriver now has the ability to cache and restore files and directories from your builds to either S3 or disk-based storage. All other aspects of the cache feature remain the same; only a new storage option has been added. Please DO NOT USE this cache feature to store any SENSITIVE data or information.
The graphs below compare build-cache performance on our internal Screwdriver instance using the disk-based strategy versus AWS S3.
Build cache - get cache - (disk strategy)
Build cache - get cache - (s3)
Build cache - set cache - (disk strategy)
Build cache - set cache - (s3)
Why disk-based strategy?
Our cache analysis showed that (1) the majority of time was spent pushing data from the build to S3, and (2) the cache push sometimes failed when the cache was large (e.g. >1 GB). So we simplified the storage layer by adopting a disk cache strategy, using a filer/storage mount as the disk option. Each cluster has its own filer/storage disk mount.
NOTE: When a cluster becomes unavailable and if the requested cache is not available in the new cluster, the cache will be rebuilt once as part of the build.
Cache Size:
The maximum size per cache is configurable by cluster admins.
Retention policy:
Cluster admins are responsible for enforcing the retention policy.
Cluster Admins:
Screwdriver cluster admins can specify the cache storage strategy along with other options such as compression, md5 check, and the cache max limit in MB.
Reference:
1. https://github.com/screwdriver-cd/screwdriver/blob/master/config/default.yaml#L280
2. https://github.com/screwdriver-cd/executor-k8s-vm/blob/master/index.js#L336
3. Issue: https://github.com/screwdriver-cd/screwdriver/issues/1830
Compatibility List:
In order to use this feature, you will need these minimum versions:
- API - v0.5.835
- Buildcuster queue worker - v1.4.7
- Launcher - v6.0.42
- Store-cli - v0.0.50
- Store - v3.10.3
Contributors:
Thanks to the following people for making this feature possible:
- parthasl
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
Build cache - Disk strategyJanuary 30, 2020
|
|
January 28, 2020 |
January 28, 2020
Vespa Product Updates, January 2020: Tensor Functions, New Sizing Guides, Performance Improvement for Matched Elements in Map/Array-of-Struct, Boolean Field Query OptimizationKristian Aune, Tech Product Manager, Verizon Media
In the December Vespa product update, we mentioned improved ONNX support, new rank feature attributeMatch().maxWeight, free lists for attribute multivalue mapping, faster updates for out-of-sync documents, and ZooKeeper 3.5.6.
This month, we’re excited to share the following updates:
Tensor Functions
The tensor language has been extended with functions to allow the representation of very complex neural nets, such as BERT models, and better support for working with mapped (sparse) tensors:
- Slice makes it possible to extract values and subspaces from tensors.
- Literal tensors make it possible to create tensors on the fly, for instance from values sliced out of other tensors or from a list of scalar attributes or functions.
- Merge produces a new tensor from two mapped tensors of the same type, where a lambda to resolve is invoked only for overlapping values. This can be used, for example, to supply default values which are overridden by an argument tensor.
New Sizing Guides
Vespa is used for applications with high performance or cost requirements. New sizing guides for queries and writes are now available to help teams use Vespa optimally.
Performance Improvement for Matched Elements in Map/Array-of-Struct
As maps or arrays in documents can often grow large, applications use matched-elements-only to return only matched items. This also simplifies application code. Performance for this feature is now improved - e.g. an array or map with 20,000 elements is now 5x faster.
Boolean Field Query Optimization
Applications with strict latency requirements that use boolean fields under concurrent feed and query load have seen a latency reduction since Vespa 7.165.5 due to an added bitCount cache. For example, we observed a latency improvement from 3 ms to 2 ms for an application with a 30k write rate. Details in #11879.
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
Vespa Product Updates, January 2020: Tensor Functions, New Sizing Guides, Performance Improvement for Matched Elements in Map/Array-of-Struct, Boolean Field Query OptimizationJanuary 28, 2020
|
|
January 22, 2020 |
January 22, 2020
Recent Enhancements and bug fixesScrewdriver Team from Verizon Media
UI
- Bugfix: Artifacts images are now displayed correctly in Firefox browser
- Feature: Deep linking to an artifact for a specific build. You can now share a link directly to an artifact, for example: https://cd.screwdriver.cd/pipelines/3709/builds/168862/artifacts/artifacts/dog.jpeg
- Enhancement: Can override Freeze Window to start a build.
Previously, users could not start builds during a freeze window unless they made changes to the freeze window setting in the screwdriver.yaml configuration. Now, you can start a build by entering a reason in the confirmation modal. This can be useful for users needing to push out an urgent patch or hotfix during a freeze window.
Store
- Feature: Build cache now supports local disk-based cache in addition to S3 cache.
Queue Worker
- Bugfix: Periodic build timeout check
- Enhancement: Prevent re-enqueue of builds from same event.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- UI - v1.0.479
- API - v0.5.835
- Store - v3.10.3
- Launcher - v6.0.42
- Queue-Worker - v2.9.0
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- jithine
- klu909
- parthasl
- pritamstyz4ever
- tk3fftk
- tkyi
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
Recent Enhancements and bug fixesJanuary 22, 2020
|
|
January 13, 2020 |
January 13, 2020
Speak at the 1st Annual Pulsar Summit - April 28th, San FranciscoSijie Guo, Founder, StreamNative
The first-ever Pulsar Summit will bring together an international audience of CTOs, CIOs, developers, data architects, data scientists, Apache Pulsar committers/contributors, and the messaging and streaming community, to share experiences, exchange ideas and knowledge, and receive hands-on training sessions led by Apache Pulsar experts.
Talk submissions, pre-registration, and sponsorship opportunities are now open for the conference!
Speak at Pulsar Summit
Submit a presentation or a lightning talk. Suggested topics cover Pulsar use cases, operations, technology deep dives, and ecosystem. Submissions are open until January 31, 2020.
If you would like feedback or advice on your proposal, please reach out to sf-2020@pulsar-summit.org. We’re happy to help!
- CFP Closes: January 31, 2020 - 23:59 PST
- Speakers Notified: February 21, 2020
- Schedule Announced: February 24, 2020
Speaker Benefits
Accepted speakers will enjoy:
- Conference pass & speaker swag
- Name, title, company, and bio will be featured on the Summit website
- Professionally produced video of your presentation
- Session recording added to the Pulsar Summit YouTube Channel
- Session recording promoted on Twitter and LinkedIn
Pre-registration
Pre-registration is now open! After submitting the pre-registration form, you will be added to the Pulsar Summit waitlist. Once registration is open, we’ll email you.
Sponsor Pulsar Summit
Pulsar Summit is a community-run conference and your support is appreciated. Sponsoring this event will provide a great opportunity for your organization to further engage with the Apache Pulsar community. Contact us to learn more.
Follow us on Twitter @pulsarsummit to receive the latest conference updates.
Hope to see you there!
Speak at the 1st Annual Pulsar Summit - April 28th, San FranciscoJanuary 13, 2020
|
|
January 9, 2020 |
January 9, 2020
Dash Open 15: The Virtues and Pitfalls of Contributor License AgreementsBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Gil Yehuda, Sr. Director of Open Source, interviews Michael Martin, Associate General Counsel and Head of Patents at Verizon Media. Mike shares why Contributor License Agreements (also known as CLAs) came to be and some of the reasons they don’t work as well as we’d hope. Fundamentally, we need to foster trust among people who don’t know each other and have no reason to trust each other. Without it, we’re not going to be able to build these incredibly complex things that require us to work together. Do CLAs do that? Listen and find out.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
P.S. If you enjoyed this podcast then you might be interested in this Open Source Developer Lead position.
Dash Open 15: The Virtues and Pitfalls of Contributor License AgreementsJanuary 9, 2020
|
|
January 6, 2020 |
January 6, 2020
Omid Graduates from the Apache Incubator Program!Ohad Shacham, Sr. Research Scientist
Yonatan Gottesman, Sr. Research Engineer
Edward Bortnikov, Sr. Director, Research
Scalable Systems, Yahoo Research Haifa, Verizon Media
We have awesome news to share with you about Apache Omid, a scalable transaction processing platform for Apache HBase developed and open sourced by Yahoo. Omid has just graduated from the Apache Incubator program and is now part of Apache Phoenix. Phoenix is a real-time SQL database for OLTP and real-time analytics over HBase. It is widely employed by the industry, powering products in Alibaba, Bloomberg, Salesforce, and many others. Omid means ‘hope’ in Farsi, and as we hoped, it has proven to be a successful project.
In 2011, a team of scientists at Yahoo Research incepted Omid in anticipation of the need for low-latency OLTP applications within the Hadoop ecosystem. It has been powering real-time content indexing for Yahoo Search since 2015. In the same year, Omid entered the Apache Incubator program, taking the path towards wider adoption, community development, and code maturity. A year ago, Omid hit another major milestone when the Apache Phoenix project community selected it as the default provider of ACID transaction technology.
We worked hard to make Omid’s recent major release match Phoenix’s requirements - flexibility, high speed, and SQL extensions. Our work started when Phoenix was already using the Tephra technology to power its transaction backend. In order to provide backward compatibility, we contributed a brand new transaction adaptation layer (TAL) to Phoenix, which enables a configurable choice for the transaction technology provider. Our team performed extensive benchmarks, demonstrating Omid’s excellent scalability and reliability, which led to its adoption as the default transaction processor for Phoenix. With Omid’s support, Phoenix now features consistent secondary indexes and extended query semantics. The Phoenix-Omid integration is now generally available (release 4.15). Notwithstanding this integration, Omid can still be used as a standalone service by NoSQL HBase applications.
In parallel with Phoenix adoption, Omid’s code and documentation continuously improved to meet the Apache project standards. Recently, the Apache community suggested that Omid (as well as Tephra) becomes an integral part of Phoenix going forward. The community vote ratified this decision. Omid’s adoption by the top-level Apache project is a huge success. We could not imagine a better graduation for Omid, since it will now enjoy a larger developer community and will be used in even more real-world applications. As we celebrate Omid’s Apache graduation, it’s even more exciting to see new products using it to run their data platforms at scale.
Omid could not have been successful without its wonderful developer community at Yahoo and beyond. Thank you Maysam Yabandeh, Flavio Junqueira, Ben Reed, Ivan Kelly, Francisco Perez-Sorrosal, Matthieu Morel, Sameer Paranjpye, Igor Katkov, James Taylor, and Lars Hofhansl for your numerous contributions. Thank you also to the Apache community for your commitment to open source and for letting us bring our technology to benefit the community at large. We invite future contributors to explore Omid’s new home repository at https://gitbox.apache.org/repos/asf?p=phoenix-omid.git.
Omid Graduates from the Apache Incubator Program!January 6, 2020
|
|
December 30, 2019 |
December 30, 2019
Documentation for Panoptes - Open Source Global Scale Network Telemetry EcosystemBy James Diss, Software Systems Engineer, Verizon Media
Documentation is important to the success of any project. Panoptes, which we open-sourced in October 2018, is no exception, due to its distribution of concerns and plugin architecture. Because of this, there are inherent complexities in implementing and personalizing the framework for individual users.
While the code provides clarity, it’s the documentation that supplies the map for exploration. In recognition of this, we’ve split out the documentation for the Panoptes project, and will update it separately from now on. Expanding the documentation contained within the project and separating out the documentation from the actual framework code gives us a little more flexibility in expanding and contextualizing the documentation, but also gets it away from the code that would be deployed to production hosts.
We’re also using an internal template and the docusaurus.io project to produce a website that will be updated at the same time as the project documentation at https://getpanoptes.io.
Panoptes Resources
- Panoptes Documentation Repo
- https://github.com/yahoo/panoptes_documentation
- Panoptes in Docker Image
- https://hub.docker.com/r/panoptes/panoptes_docker
- Panoptes in Docker GitHub Repo
- https://github.com/yahoo/panoptes_docker
- Panoptes GitHub Repo
- https://github.com/yahoo/panoptes/
Questions, Suggestions, & Contributions
Your feedback and contributions are appreciated! Explore Panoptes, use and help contribute to the project, and chat with us on Slack.
Documentation for Panoptes - Open Source Global Scale Network Telemetry EcosystemDecember 30, 2019
|
|
December 18, 2019 |
December 18, 2019
Vespa Product Updates, December 2019: Improved ONNX Support, New Rank Feature attributeMatch().maxWeight, Free Lists for Attribute Multivalue Mapping, Faster Updates for Out-of-Sync Documents, and ZooKeeper 3.5.6 SupportKristian Aune, Tech Product Manager, Verizon Media
In the November Vespa product update, we mentioned Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance and Datadog Monitoring Support.
Today, we’re excited to share the following updates:
Improved ONNX Support
Vespa has added more operations to its ONNX model API, such as GEneral Matrix to Matrix Multiplication (GEMM) - see list of supported opsets. Vespa has also improved support for PyTorch through ONNX, see the pytorch_test.py example.
New Rank Feature attributeMatch().maxWeight
attributeMatch(name).maxWeight was added in Vespa-7.135.5. The value is the maximum weight of the attribute keys matched in a weighted set attribute.
Free Lists for Attribute Multivalue Mapping
Since Vespa-7.141.8, multivalue attributes use a free list to improve performance. This reduces CPU usage (no compaction jobs) and cuts memory use by approximately 10%. This primarily benefits applications with a high update rate to such attributes.
Faster Updates for Out-of-Sync Documents
Vespa handles replica consistency using bucket checksums. Updating documents can be cheaper than putting a new document, due to fewer updates to posting lists. For updates to documents in inconsistent buckets, a GET-UPDATE is now used instead of a GET-PUT whenever the document to update is consistent across replicas. This is the common case when only a subset of the documents in the bucket are out of sync. This is useful for applications with high update rates that update multi-value fields with large sets. Explore details here.
ZooKeeper 3.5.6
Vespa now uses Apache ZooKeeper 3.5.6 and can encrypt communication between ZooKeeper servers.
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
Vespa Product Updates, December 2019: Improved ONNX Support, New Rank Feature attributeMatch().maxWeight, Free Lists for Attribute Multivalue Mapping, Faster Updates for Out-of-Sync Documents, and ZooKeeper 3.5.6 SupportDecember 18, 2019
|
|
December 4, 2019 |
December 4, 2019
Learning to Rank with Vespa – Getting started with Text Search
Vespa.ai has just published two tutorials to help people get started with text search applications by building scalable solutions with Vespa. The tutorials are based on the full document ranking task released by Microsoft's MS MARCO dataset team.
The first tutorial helps you create and deploy a basic text search application with Vespa, as well as download, parse, and feed the dataset to a running Vespa instance. It also shows how easy it is to experiment with ranking functions based on built-in ranking features available in Vespa.
The second tutorial shows how to create a training dataset containing Vespa ranking features that allow you to start training ML models to improve the app’s ranking function. It also illustrates the importance of going beyond pointwise loss functions when training models in a learning to rank context.
Both tutorials are detailed and come with code available to reproduce the steps. Here are the highlights.
Basic text search app in a nutshell
The main task when creating a basic app with Vespa is to write a search definition file containing information about the data you want to feed to the application and how Vespa should match and order the results returned in response to a query.
Apart from some additional details described in the tutorial, the search definition for our text search engine looks like the code snippet below. We have a title and body field containing information about the documents available to be searched. The fieldset keyword indicates that our query will match documents by searching query words in both title and body fields. Finally, we have defined two rank-profiles, which control how the matched documents will be ranked. The default rank-profile uses nativeRank, which is one of many built-in rank features available in Vespa. The bm25 rank-profile uses the widely known BM25 rank feature.
search msmarco {
    document msmarco {
        field title type string
        field body type string
    }
    fieldset default {
        fields: title, body
    }
    rank-profile default {
        first-phase {
            expression: nativeRank(title, body)
        }
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(body)
        }
    }
}
When we have more than one rank-profile defined, we can choose which one to use at query time by including the ranking parameter in the query:
curl -s "
Learning to Rank with Vespa – Getting started with Text SearchDecember 4, 2019
|
|
December 3, 2019 |
December 3, 2019
Dash Open 14: How Verizon Media’s Data Platforms and Systems Engineering Team Uses and Contributes to Open SourceBy Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Tom Miller, Director of Software Development Engineering on the Data Platforms and Systems Engineering Team at Verizon Media. Tom shares how his team uses and contributes to open source. Tom also chats about empowering his team to do great work and what it’s like to live and work in Champaign, IL.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
P.S. If you enjoyed this podcast then you might be interested in this Software Development Engineer position in Champaign!
Dash Open 14: How Verizon Media’s Data Platforms and Systems Engineering Team Uses and Contributes to Open SourceDecember 3, 2019
|
|
November 29, 2019 |
November 29, 2019
E-commerce search and recommendation with Vespa.ai
Introduction
Holiday shopping season is upon us and it’s time for a blog post on E-commerce search and recommendation using Vespa.ai. Vespa.ai is used as the search and recommendation backend at multiple Yahoo e-commerce sites in Asia, like tw.buy.yahoo.com.
This blog post discusses some of the challenges in e-commerce search and recommendation, and shows how they can be solved using the features of Vespa.ai.
Photo by Jonas Leupe on Unsplash
Text matching and ranking in e-commerce search
E-commerce search has text ranking requirements where traditional text ranking features like BM25 or TF-IDF might produce poor results. For an introduction to some of the issues with TF-IDF/BM25, see the influence of TF-IDF algorithms in e-commerce search. One example from that blog post is a search for ipad 2: with traditional TF-IDF ranking, 'black mini ipad cover, compatible with ipad 2' will rank higher than 'Ipad 2', as the former product description has several occurrences of the query terms Ipad and 2.
Vespa allows developers and relevancy engineers to fine tune the text ranking features to meet domain specific ranking challenges. For example, developers can control whether multiple occurrences of a query term in the matched text should impact the relevance score. See text ranking occurrence tables and Vespa text ranking types for in-depth details. The Vespa text ranking features also take text proximity into account in the relevancy calculation, i.e. how close the query terms appear in the matched text. BM25/TF-IDF, on the other hand, does not take query term proximity into account at all. Vespa also implements BM25, but it's up to the relevancy engineer to choose which of the rich set of built-in text ranking features in Vespa to use.
Vespa uses OpenNLP for linguistic processing like tokenization and stemming with support for multiple languages (as supported by OpenNLP).
Custom ranking business logic in e-commerce search
Your manager might tell you that these items of the product catalog should be prominent in the search results. How do you tackle this with your existing search solution? Maybe by adding some synthetic query terms to the original user query, maybe by using separate indexes with federated search, or even with a key value store which is rarely in sync with the product catalog search index?
With Vespa it’s easy to promote content, as Vespa’s ranking framework is just math and allows the developer to formulate the relevancy scoring function explicitly without having to rewrite the query formulation. Vespa controls ranking through ranking expressions configured in rank profiles, which enable full control through the expressive Vespa ranking expression language. The rank profile to use is chosen at query time, so developers can design multiple ranking profiles to rank documents differently based on query intent classification. See the later section on query classification for more details on how query classification can be done with Vespa.
A sample ranking profile which implements a tiered relevance scoring function, where sponsored or promoted items are always ranked above non-sponsored documents, is shown below. The ranking profile is applied to all documents which match the query formulation, and the relevance score of the hit is assigned the value of the first-phase expression. Vespa also supports multi-phase ranking.
Sample hand crafted ranking profile defined in the Vespa application package.
The above example is hand crafted but for optimal relevance we do recommend looking at learning to rank (LTR) methods. See learning to Rank using TensorFlow Ranking and learning to Rank using XGBoost. The trained MLR models can be used in combination with the specific business ranking logic. In the example above we could replace the default-ranking function with the trained MLR model, hence combining business logic with MLR models.
Facets and grouping in e-commerce search
Guiding the user through the product catalog by guided navigation or faceted search is a feature which users expect from an e-commerce search solution today, and with Vespa, facets and guided navigation are easily implemented using the powerful Vespa Grouping Language.
Sample screenshot from Vespa e-commerce sample application UI demonstrating search facets using Vespa Grouping Language.
The Vespa grouping language supports deeply nested grouping and aggregation operations over the matched content. The language also allows pagination within the group(s). For example, if grouping hits by category and displaying the top 3 ranking hits per category, the language allows paginating to render more hits from a specified category group.
The vocabulary mismatch problem in e-commerce search
Studies (e.g. this study from FlipKart) find that there is a significant fraction of queries in e-commerce search which suffer from a vocabulary mismatch between the user query formulation and the relevant product descriptions in the product catalog. For example, the query “ladies pregnancy dress” would not match a product with the description “women maternity gown” due to the vocabulary mismatch between the query and the product description. Traditional Information Retrieval (IR) methods like TF-IDF/BM25 would fail to retrieve the relevant product right off the bat.
Most techniques currently used to try to tackle the vocabulary mismatch problem are built around query expansion. With the recent advances in NLP using transfer learning with large pre-trained language models, we believe that future solutions will be built around multilingual semantic retrieval using text embeddings from pre-trained deep neural network language models. Vespa has recently announced a sample application on semantic retrieval which addresses the vocabulary mismatch problem as the retrieval is not based on query terms alone, but instead based on the dense text tensor embedding representation of the query and the document. The mentioned sample app reproduces the accuracy of the retrieval model described in the Google blog post about Semantic Retrieval.
Using our query and product title example from the section above, which suffers from the vocabulary mismatch, and moving away from the textual representation to the respective dense tensor embedding representation, we find that the semantic similarity between them is high (0.93). The high semantic similarity means that the relevant product would be retrieved when using semantic retrieval. The semantic similarity is in this case defined as the cosine similarity between the dense tensor embedding representations of the query and the product description. Vespa has strong support for expressing and storing tensor fields which one can perform tensor operations (e.g. cosine similarity) over for ranking; this functionality is demonstrated in the mentioned sample application.
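As a minimal sketch of that computation (the vectors below are random placeholders, not real Universal Sentence Encoder output), cosine similarity between two dense embeddings reduces to a few lines of Python:
# Cosine similarity between a query embedding and a product-description embedding.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding   = np.random.rand(512)   # e.g. embedding of "ladies pregnancy dress"
product_embedding = np.random.rand(512)   # e.g. embedding of "women maternity gown"
print(cosine_similarity(query_embedding, product_embedding))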
Below is a simple matrix comparing the semantic similarity of three pairs of (query, product description). The tensor embeddings of the textual representations are obtained with the Universal Sentence Encoder from Google.
Semantic similarity matrix of different queries and product descriptions.
The Universal Sentence Encoder Model from Google is multilingual as it was trained on text from multiple languages. Using these text embeddings enables multilingual retrieval, so searches written in Chinese can retrieve relevant products by descriptions written in multiple languages. This is another nice property of semantic retrieval models which is particularly useful in e-commerce search applications with global reach.
Query classification and query rewriting in e-commerce search
Vespa supports deploying stateless machine learned (ML) models, which come in handy when doing query classification. Machine learned models which classify the query are commonly used in e-commerce search solutions, and the recent advances in natural language processing (NLP) using pre-trained deep neural language models have improved the accuracy of text classification models significantly. See e.g. text classification using BERT for an illustrated guide to text classification using BERT. Vespa supports deploying ML models built with TensorFlow, XGBoost, and PyTorch through the Open Neural Network Exchange (ONNX) format. ML models trained with the mentioned tools can successfully be used for various query classification tasks with high accuracy.
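As a rough sketch of that export path (the tiny linear model below stands in for a real classifier such as a fine-tuned BERT, and the input shape and file name are assumptions), a PyTorch query-intent classifier can be serialized to ONNX like this:
# Export a toy query-intent classifier to ONNX for stateless model evaluation.
import torch
import torch.nn as nn

VOCAB_SIZE, NUM_INTENTS = 10000, 4                 # assumed sizes

model = nn.Sequential(nn.Linear(VOCAB_SIZE, 128), nn.ReLU(), nn.Linear(128, NUM_INTENTS))
model.eval()

dummy_query = torch.zeros(1, VOCAB_SIZE)           # bag-of-words query representation
torch.onnx.export(model, dummy_query, "query_intent.onnx",
                  input_names=["query"], output_names=["intent_scores"])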
In e-commerce search, classifying the intent of the query or query session can help rank the results by using an intent specific ranking profile which is tailored to the specific query intent. The intent classification can also determine how the result page is displayed and organised.
Consider a category browse intent query like ‘shoes for men’. Such a query intent might benefit from a query rewrite which limits the result set to items that match the unambiguous category id instead of just searching the product description or category fields for ‘shoes for men’. Ranking could also change based on the query classification by using a ranking profile which gives higher weight to signals like popularity or price than to text ranking features.
Vespa also features a powerful query rewriting language which supports rule based query rewrites, synonym expansion and query phrasing.
Product recommendation in e-commerce search
Vespa is commonly used for recommendation use cases and e-commerce is no exception.
Vespa is able to evaluate complex Machine Learned (ML) models over many data points (documents, products) in user time, which allows the ML model to use real time signals derived from the current user’s online shopping session (e.g. products browsed, queries performed, time of day) as model features. An offline batch oriented inference architecture would not be able to use these important real time signals. By batch oriented inference architecture we mean pre-computing the inference offline for a set of users or products and storing the model inference results in a key-value store for online retrieval.
In our blog recommendation tutorial we demonstrate how to apply a collaborative filtering model for content recommendation, and in part 2 of the blog recommendation tutorial we show how to use a neural network trained with TensorFlow to serve recommendations in user time. Similar recommendation approaches are used with success in e-commerce.
Keeping your e-commerce index up to date with real time updates
Vespa is designed for horizontal scaling with high sustainable write and read throughput and low, predictable latency. Updating the product catalog in real time is of critical importance for e-commerce applications, as the real time information is used in retrieval filters and as ranking signals. The product description or product title rarely changes, but meta information like inventory status, price, and popularity are real time signals which will improve relevance when used in ranking. Having the inventory status reflected in the search index also avoids retrieving content which is out of stock.
Vespa has true native support for partial updates where there is no need to re-index the entire document but only a subset of the document (i.e. fields in the document). Real time partial updates can be done at scale against attribute fields which are stored and updated in memory. Attribute fields in Vespa can be updated at rates up to about 40-50K updates/s per content node.
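A small sketch of such an update against Vespa's /document/v1 API is shown below; the namespace, document type, and field names are illustrative assumptions, and only the changed fields are sent.
# Real-time partial update: assign new values to two attribute fields of one document.
import requests

doc_url = "http://localhost:8080/document/v1/mystore/product/docid/sku-12345"
update = {"fields": {"inventory": {"assign": 42}, "price": {"assign": 129.0}}}

resp = requests.put(doc_url, json=update)
resp.raise_for_status()
print(resp.json())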
Campaigns in e-commerce search
Using Vespa’s support for predicate fields it’s easy to control when content is surfaced in search results and when it is not. The predicate field type allows the content (e.g. a document) to express whether it should match the query, instead of the other way around. For e-commerce search and recommendation we can use predicate expressions to control how product campaigns are surfaced in search results. Some examples of what predicate fields can be used for:
- Only match and retrieve the document if time of day is in the range 8–16 or range 19–20 and the user is a member. This could be used for promoting content for certain users, controlled by the predicate expression stored in the document. The time of day and member status is passed with the query.
- Represent recurring campaigns with multiple time ranges.
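To make the first bullet concrete, the document could store a predicate expression along these lines (the attribute names and values are illustrative only; see the Vespa predicate field documentation for the exact syntax), while the query supplies the current hour and the user's membership status as attributes:

(hour in [8..16] or hour in [19..20]) and membership in [member]

Only documents whose stored expression evaluates to true for the supplied query attributes are matched.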
The above examples are by no means exhaustive; predicates can be used for many campaign-related use cases where the filtering logic is expressed in the content.
Scaling & performance for high availability in e-commerce search
Are you worried that your current search installation will break under the traffic surge associated with the holiday shopping season? Are your cloud VMs already running high on disk-busy metrics? What about those long GC pauses in the JVM old generation causing your 95th percentile latency to go through the roof? Needless to say, any downtime due to a slow search backend causing a denial of service situation in the middle of the holiday shopping season will have a catastrophic impact on revenue and customer experience.
The heart of the Vespa serving stack is written in C++ and doesn’t suffer from issues related to long JVM GC pauses. The indexing and search components in Vespa are significantly different from Lucene-based engines like Solr and Elasticsearch, which are IO intensive due to the many Lucene segments within an index shard. A query in a Lucene-based engine needs to perform lookups in dictionaries and posting lists across all segments in all shards, and optimising the search access pattern by merging the Lucene segments further increases the IO load during the merge operations.
With Vespa you don’t need to define the number of shards for your index before indexing a single document: Vespa allows adaptive scaling of the content cluster(s), and there is no shard concept in Vespa. Content nodes can be added and removed as you wish, and Vespa will re-balance the data in the background without having to re-feed the content from the source of truth.
In Elasticsearch, changing the number of shards to scale with changes in data volume requires an operator to perform a multi-step procedure that sets the index into read-only mode and splits it into an entirely new index. Vespa is designed to allow cluster resizing while remaining fully available for reads and writes. Vespa splits, joins and moves parts of the data space to ensure an even distribution, with no intervention needed.
At the scale we operate Vespa at in Verizon Media, requiring more than 2x the footprint during content volume expansion or reduction would be prohibitively expensive. Vespa was designed to allow content cluster resizing while serving traffic, without noticeable serving impact. Adding or removing content nodes is handled by adjusting the node count in the application package and re-deploying the application package.
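A rough sketch of the relevant part of services.xml in the application package (the cluster id, document type and host aliases are illustrative); growing the cluster amounts to adding another node element and re-deploying:

<content id="product" version="1.0">
  <redundancy>2</redundancy>
  <documents>
    <document type="product" mode="index"/>
  </documents>
  <nodes>
    <node hostalias="content0" distribution-key="0"/>
    <node hostalias="content1" distribution-key="1"/>
    <!-- add e.g. content2 here and redeploy; data is rebalanced in the background -->
  </nodes>
</content>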
The shard concept in Elasticsearch and Solr also impacts the search latency incurred by CPU processing in the matching and ranking loops, as the concurrency model in Elasticsearch/Solr is one thread per search per shard. Vespa, on the other hand, allows a single search to use multiple threads per node, and the number of threads can be controlled at query time by a rank-profile setting: num-threads-per-search. Partitioning the matching and ranking by dividing the document volume between searcher threads reduces the overall latency at the cost of more CPU threads, but makes better use of multi-core CPU architectures. If your search servers’ CPU usage is low and search latency is still high, you now know the reason.
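For illustration, a sketch of a rank profile carrying this setting in a document schema; the profile name, fields and ranking expression are made up, while num-threads-per-search is the setting referenced above:

rank-profile browse_latency inherits default {
    num-threads-per-search: 4
    first-phase {
        expression: nativeRank(title, description) * attribute(popularity)
    }
}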
In a recently published benchmark comparing the performance of Vespa versus Elasticsearch for dense vector ranking, Vespa was 5x faster than Elasticsearch. The benchmark used 2 shards for Elasticsearch and 2 threads per search in Vespa.
Holiday season online query traffic can be very spiky, a traffic pattern which can be difficult to predict and plan for. For instance, price comparison sites might unexpectedly direct more user traffic to your site at times you did not plan for. Vespa supports graceful quality-of-search degradation, which comes in handy for cases where traffic spikes reach levels not anticipated in the capacity planning. These soft degradation features allow the search service to operate within acceptable latency levels, but with less accuracy and coverage. They help avoid a denial of service situation where all searches become slow due to overload caused by unexpected traffic spikes. See details in the Vespa graceful degradation documentation.
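As a minimal, hedged sketch (the ranking.softtimeout.enable parameter is part of the Vespa query API; the query itself is arbitrary), a query can ask for the best partial result found within its time budget rather than failing outright:

curl "http://localhost:8080/search/?query=running+shoes&ranking.softtimeout.enable=true"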
Summary
In this post we have explored some of the challenges in e-commerce search and recommendation, and highlighted some of the features of Vespa which can be used to tackle them. If you want to try Vespa for your e-commerce application, check out our e-commerce sample application found here. The sample application can be scaled to full production size using our hosted Vespa Cloud service at https://cloud.vespa.ai/. Happy Holiday Shopping Season!
E-commerce search and recommendation with Vespa.ai, November 29, 2019
November 26, 2019
YAML tip: Using anchors for shared steps & jobs
Sheridan Rawlins, Architect, Verizon Media
Overview
Occasionally, a pipeline needs several similar but different jobs. When these jobs are specific to a single pipeline, it would not make much sense to create a Screwdriver template. The tips shared in this post help reduce copy/paste issues and make it easier to share jobs and steps within a single YAML; hopefully they will be as helpful to you as they were to us.
Below is a condensed example showcasing some techniques or patterns that can be used for sharing steps.
Example of desired use
jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
Complete working example at the end of this post.
Defining shared steps
What is a step?
First, let us define a step.
Steps of a job look something like the following: each step is an array element containing an object with only one key and its corresponding value. The key is the step name and the value is the cmd to be run. More details can be found in the SD Guide.
jobs:
  job1:
    steps:
      - step1: echo "do step 1"
      - step2: echo "do step 2"

What are anchors and aliases?
Second, let me describe YAML anchors and aliases. An anchor may only be placed between an object key and its value. An alias may be used to copy or merge the anchor value.
Recommendation for defining shared steps and jobs
While an anchor can be defined anywhere in a YAML file, defining shared things in the shared section makes intuitive sense. Because annotations can contain freeform objects in addition to the documented ones, we recommend defining these anchors under annotations in the shared section.
Now, I’ll show an example and explain the details of how it works:
shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"

Explanation of how the step anchor declaration patterns work:
In order to reduce redundancy, annotations allow users to define one shared configuration with an “alias” that can be referenced multiple times, such as *some-step in the following example, used by job1 and job2.
jobs:
  job1:
    steps:
      - *some-step
  job2:
    steps:
      - *some-step
To use the alias, the anchor &some-step must result in an object with a single key (also some-step) whose value is the shell code to execute.
Because an anchor can only be declared between a key and a value, we use an array containing a single object with the single key . (short to type). The array allows us to use . again without conflict; if it were in an object, we would need to repeat some-step three times, such as:
# Anti-pattern: do not use as it is too redundant.
some-step: &some-step
  some-step: |
    # shell instructions
The following is an example of a reasonably short pattern that can be used to define the steps, with the only redundancy being the anchor name and the step name:
shared:
  annotations:
    steps:
      - .: &some-step
          some-step: |
            echo "do some step"
When using *some-step, you alias to the anchor, which is an object with the single key some-step and the value echo "do some step", which is exactly what you want/need.
FAQ
Why the | character after some-step:?
While you could just write some-step: echo "do some step", I prefer to use the | notation for describing shell code because it allows multiline shell scripting. Even for one-liners, you don’t have to reason about escape rules: as long as the commands are indented properly, they will be passed to the $USER_SHELL_BIN correctly, allowing your shell to deal with escaping naturally.
set-dryrun: |
  DRYRUN=false
  if [[ -n $SD_PULL_REQUEST ]]; then
    DRYRUN=true
  fi

Why that syntax for environment variables?
1. Using environment variables in shared steps allows the variables to be altered by the specific jobs that invoke them.
2. The syntax "${VARIABLE:?}" is useful for a step that needs a value: it will cause an error if the variable is undefined or empty.
Why split CMD into array assignment and invocation?
Defining an array and then invoking it helps readability by putting each logical flag on its own line. It can be digested by a human easily, and a flag can be copy/pasted to other commands or deleted with ease as a single line. Assigning to an array allows multiple lines, as bash will not complete the statement until the closing parenthesis.
Why does one flag have --flag=value and another have --flag value?
Most CLI parsers treat boolean flags as flags without an expected value: omission of the flag is false, presence is true. However, many CLI parsers also accept the --flag=value syntax for boolean flags and, in my opinion, it is far easier to debug and reason about an explicit value (such as false) than to know that the flag is false when not provided.
Defining shared jobs
What is a job?
A job in Screwdriver is an object with many fields, described in the SD Guide.
Job anchor declaration patterns
To use a shared job effectively, it is helpful to use a feature of YAML that is documented outside of the YAML 1.2 Spec called Merge Key.
The syntax <<: *some-object-anchor lets you merge the keys of an anchor that has an object as its value into another object, and then add or override keys as necessary.
Recommendation for defining shared jobs
shared:
  annotations:
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy
If you browse back to the previous example of desired use (also copied here), you can see use of the <<: *deploy-job to start with the deploy-job keys/values, and then add requires and environment overrides to customize the concrete instances of the deploy job.
jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2

FAQ
Why is environment put in the shared section and not included with the shared job?
The answer is quite subtle. The Merge Key merges top-level keys; if you were to put defaults in a shared job, overriding environment: would end up clobbering all of the provided values. However, Screwdriver follows up the YAML parsing phase with its own logic to merge things from the shared section at the appropriate depth.
Why not just use shared.steps?
As noted above, Screwdriver does additional work to merge annotations, environment, and steps into each job after the YAML parsing phase. The logic for steps goes like this:
1. If a job has NO steps key, then it inherits ALL shared steps.
2. If a job has at least one step, then only matching wrapping steps (shared steps starting with pre or post) are copied in at the right place: before or after the job's step whose name matches the remainder of the step name after pre or post (see the sketch below).
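A hedged sketch of rule 2 (the job and step names are made up): the shared premain step wraps the job's own main step, while a shared step that is not a pre/post match is not copied in:

shared:
  steps:
    - premain: echo "shared wrapping step, runs right before each job's main"
    - cleanup: echo "not copied in, because job1 defines its own steps"
jobs:
  job1:
    steps:
      - main: echo "job-specific main step"
      # effective step order at runtime: premain, then main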
While the above pattern might be useful for some pipelines, complex pipelines typically have a few job types and may want to share some but not all steps.
Complete Example
Copy and paste the following into the validator.
shared:
  environment:
    ANOTHER_ARG: another_arg_value
  annotations:
    steps:
      - .: &set-dryrun
          set-dryrun: |
            DRYRUN=false
            if [[ -n $SD_PULL_REQUEST ]]; then
              DRYRUN=true
            fi
      - .: &deploy
          deploy: |
            CMD=(
              my big deploy tool
              --dry-run="${DRYRUN:?}"
              --location "${LOCATION:?}"
              --another-arg "${ANOTHER_ARG:?}"
            )
            "${CMD[@]}"
    jobs:
      deploy-job: &deploy-job
        image: the-deploy-image
        steps:
          - *set-dryrun
          - *deploy

jobs:
  deploy-prod:
    template: sd/noop
  deploy-location1:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location1
  deploy-location2:
    <<: *deploy-job
    requires: [deploy-prod]
    environment:
      LOCATION: location2
November 24, 2019
Dash Open 13: Using and Contributing to Hadoop at Verizon Media
By Ashley Wolf, Open Source Program Manager, Verizon Media
In this episode, Rosalie Bartlett, Sr. Open Source Community Manager, interviews Eric Badger, Software Development Engineer, about using and contributing to Hadoop at Verizon Media.
Audio and transcript available here.
You can listen to this episode of Dash Open on iTunes, SoundCloud, and Spotify.
November 13, 2019
Build Parameters
Alan Dong, Software Engineer, Verizon Media
The Screwdriver team is constantly evolving and building new features for its users. Today, we are announcing a new feature: Build Parameters, aka Parameterized Builds, which gives users more control over build pipelines.
Purpose
The Build Parameters feature allows users to define a set of parameters at the pipeline level; users can customize runtime parameters through the UI or the API when kicking off builds. This means users can now implement reactive behaviors based on the parameters passed in.
Definition
There are two ways of defining parameters:
parameters:
  nameA: "value1"
  nameB:
    value: "value2"
    description: "description of nameB"
Parameters is a dictionary which expects key:value pairs.
nameA: "value1"
The key: string form is shorthand for writing out the expanded key: value form:
nameA:
  value: "value1"
  description: ""
These two are identical, with description being an empty string.
Example
See this Screwdriver pipeline:
shared:
  image: node:8
parameters:
  region: "us-west-1"
  az:
    value: "1"
    description: "default availability zone"
jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - step1: 'echo "Region: $(meta get parameters.region.value)"'
      - step2: 'echo "AZ: $(meta get parameters.az.value)"'
You can also preview the parameters being used during a build in the Setup -> sd-setup-init step.
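Building on the example above, a step can branch on a parameter at runtime. This is a hedged sketch: the region values and deploy logic are invented, while meta get parameters.region.value is the lookup already shown above:

jobs:
  main:
    requires: [~pr, ~commit]
    steps:
      - deploy: |
          REGION="$(meta get parameters.region.value)"
          if [[ "$REGION" == "us-west-1" ]]; then
            echo "deploying to the default region ($REGION)"
          else
            echo "deploying to override region ($REGION)"
          fi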
Pipeline preview screenshot
Compatibility List
In order to use this feature, you will need these minimum versions:
- API - v0.5.780
- UI - v1.0.466
Contributors
Thanks to the following contributors for making this feature possible:
- adong
Questions & Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
November 5, 2019
Vespa Product Updates, October/November 2019: Nearest Neighbor and Tensor Ranking, Optimized JSON Tensor Feed Format, Matched Elements in Complex Multi-value Fields, Large Weighted Set Update Performance, and Datadog Monitoring Support
Kristian Aune, Tech Product Manager, Verizon Media
In the September Vespa product update, we mentioned Tensor Float Support, Reduced Memory Use for Text Attributes, Prometheus Monitoring Support, and Query Dispatch Integrated in Container.
This month, we’re excited to share the following updates:
Nearest Neighbor and Tensor Ranking
Tensors are native to Vespa. We compared elastic.co to vespa.ai testing nearest neighbor ranking using dense tensor dot product. The result of an out-of-the-box configuration demonstrated that Vespa performed 5 times faster than Elastic. View the test results.
Optimized JSON Tensor Feed Format
A tensor is a data type used for advanced ranking and recommendation use cases in Vespa. This month, we released an optimized tensor format, enabling a more than 10x improvement in feed rate. Read more.
Matched Elements in Complex Multi-value Fields
Vespa is used in many use cases with structured data - documents can have arrays of structs or maps. Such arrays and maps can grow large, and often only the entries matching the query are relevant. You can now use the recently released matched-elements-only setting to return matches only. This increases performance and simplifies front-end code.
Large Weighted Set Update Performance
Weighted sets in documents are used to store a large number of elements used in ranking. Such sets are often updated at high volume, in real-time, enabling online big data serving. Vespa-7.129 includes a performance optimization for updating large sets. E.g. a set with 10K elements, without fast-search, is 86.5% faster to update.
Datadog Monitoring Support
Vespa is often used in large scale mission-critical applications. For easy integration into dashboards, Vespa is now in Datadog’s integrations-extras GitHub repository. Existing Datadog users will now find it easy to monitor Vespa. Read more.
About Vespa: Largely developed by Yahoo engineers, Vespa is an open source big data processing and serving engine. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance, and the Verizon Media Ad Platform. Thanks to feedback and contributions from the community, Vespa continues to grow.
We welcome your contributions and feedback (tweet or email) about any of these new features or future improvements you’d like to request.
November 4, 2019
Collection Page Redesign
Yufeng Gao, Software Engineer Intern, Verizon Media
We would like to introduce our new collections dashboard page. Users can now see more about the status of their pipelines and have more flexibility when managing pipelines within a collection.
Main Features
View Modes
The new collection dashboard provides two view options - card mode and list mode. Both modes display pipeline repo names, branches, histories, and latest event info (such as commit sha, status, start date, duration). However, card mode also shows the latest events while the list mode doesn’t. Users can switch between the two modes using the toggle on the top right corner of the dashboard’s main panel.
Collection Operations
To create or delete a collection, users can use the left sidebar of the new collections page.
For a specific existing collection, the dashboard offers three operations which can be found to the right of the title of the current collection:
1. Search all pipelines that the current collection doesn’t contain, then select and add some of them into the current collection;
2. Change the name and description of the current collection;
3. Copy and share the link of the current collection.
Additionally, the dashboard also provides useful pipeline management operations:
1. Easily remove a single pipeline from the collection;
2. Remove multiple pipelines from the collection;
3. Copy and add multiple pipelines of the current collection to another collection.
Default Collection
Another new feature is the default collection, a collection where users can find all pipelines they have created. Note: users have limited abilities when it comes to the default collection; they cannot perform most of the operations available on normal collections, only copy and share the default collection’s link.
Compatibility List
In order to see the collection page redesign, you will need these minimum versions:
- API: v0.5.781
- UI: v1.0.466
Contributors
- code-beast
- adong
- jithine
Questions & Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on GitHub and Slack.
October 21, 2019
Recent Updates
Jithin Emmanuel, Engineering Manager, Verizon Media
Recent bug fixes in Screwdriver:
Meta
- skip-store option to prevent caching external meta.
- meta cli is now concurrency safe.
- When caching external metadata, meta-cli will not store existing cached data present in external metadata.
API
- Users can use SD_COVERAGE_PLUGIN_ENABLED environment variable to skip Sonarqube coverage bookend.
- Screwdriver admins can now update build status to FAILURE through the API.
- New API endpoint for fetching latest build for a job is now available.
- Fix for branch filtering not working for PR builds.
- Fix for branch filtering not working for triggered jobs.
Compatibility List
In order to have these improvements, you will need these minimum versions:
- API - v0.5.773
- Launcher - v6.0.23
Contributors
Thanks to the following contributors for making this feature possible:
- adong
- klu909
- scr-oath
- kumada626
- tk3fftk
Questions and Suggestions
We’d love to hear from you. If you have any questions, please feel free to reach out here. You can also visit us on Github and Slack.
October 18, 2019
Database schema migrations
Lakshminarasimhan Parthasarathy, Verizon Media
Screwdriver now supports database schema migrations using sequelize-cli migrations. When adding any fields to models in the data-schema, you will need to add a migration file. Sequelize-cli migrations keep track of changes to the database, helping with adding and/or reverting the changes to the DB. They also ensure models and migration files are in sync.
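As a hedged sketch of what that workflow can look like with sequelize-cli (the migration name is invented; exact paths and config depend on the data-schema repository setup):

# generate a skeleton migration file under migrations/
npx sequelize-cli migration:generate --name add-mycolumn-to-builds

# after editing the generated file, apply all pending migrations
npx sequelize-cli db:migrate

# revert the most recent migration if something goes wrong
npx sequelize-cli db:migrate:undo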
Why schema migrations?
Database schema migrations help manage the state of schemas. Screwdriver originally did schema deployments during API deployments; while this was fine for low-scale deployments, it led to unexpected issues for high-scale deployments. For such high-scale deployments, migrations are more effective, as they ensure quicker and more consistent schema deployment outside of API deployments. Moreover, API traffic is not served until database schema changes are applied and ready.
Cluster Admins
In order to run schema migrations, DDL sync via API should be disabled using the DATASTORE_DDL_SYNC_ENABLED environment variable, since this option is enabled by default.
- Schema migrations and DDL sync via API should not be run together; either option should suffice, depending on the scale of the Screwdriver deployment.
- Always create new migration files for any new DDL changes.
- Do not edit or remove migration files, even after they have been migrated and are available in the database.
Screwdriver cluster admins can refer to the following documentation for more details on database schema migrations:
- README: https://github.com/screwdriver-cd/data-schema/blob/master/CONTRIBUTING.md#migrations
- Issue: https://github.com/screwdriver-cd/screwdriver/issues/1664
- Disable DDL sync via API: https://github.com/screwdriver-cd/screwdriver/pull/1756
Compatibility List
In order to use this feature, you will need these minimum versions:
- [API](https://hub.docker.com/r/screwdrivercd/screwdriver) - v0.5.752
Contributors
Thanks to the following people for making this feature possible:
- parthasl
- catto
- dekus
Screwdriver is an open-source build automation platform designed for Continuous Delivery. It is built (and used) by Yahoo. Don’t hesitate to reach out if you have questions or would like to contribute: http://docs.screwdriver.cd/about/support.
October 2, 2019
Improving Screwdriver’s meta tool
Sheridan Rawlins, Architect, Verizon Media
Over the past month there have been a few changes to the meta tool mostly focused on using external metadata, but also on helping to identify and self-diagnose a few silent gotchas we found.
Metadata is a structured key/value data store that gives you access to details about your build. Metadata is carried over to all builds that are part of the same event, and at the end of each build it is merged back into the event the build belongs to. This allows builds to share their metadata with other builds in the same event, or externally.
External metadata
External metadata can be populated for a job in your pipeline using the requires clause that refers to it in the form sd@${pipelineID}:${jobName} (such as sd@12345:main).
If sd@12345:main runs to completion and “triggers” a job or jobs in your pipeline, a file will be created with meta from that build in /sd/meta/sd@12345:main.json, and you can refer to any of its data with commands such as meta get someKey --external sd@12345:main.
The above feature has existed for some time, but there were several corner cases that made it challenging to use external metadata:
1. External metadata was not provided when the build was not triggered from the external pipeline, such as when clicking the “Start” button or via a scheduled trigger (i.e., using the buildPeriodically annotation).
2. Restarting a previously externally-triggered build would not provide the external metadata. The notion of rollback can be facilitated by retriggering a deployment job, but if that deployment relies on metadata from an external trigger, it wouldn’t be there before these improvements.
Fetching from lastSuccessfulMeta
Screwdriver has an API endpoint called lastSuccessfulMeta. While it is possible to use this endpoint directly, providing this ability directly in the meta tool makes it a little easier to just “make it so”. By default now, if external metadata does not exist in the file /sd/meta/sd@12345:main.json, it is fetched from that job’s lastSuccessfulMeta via the API call. Should this behavior not be desired, the flag --skip-fetch can be used to skip fetching.
For rollback behavior, however, this feature by itself wasn’t enough - consider a good deployment followed by a bad deployment. The “bad” deployment would most likely have deployed what, from a build standpoint, was “successful”. When retriggering the previous job, because it is a manual trigger, there will be no external metadata and the lastSuccessfulMeta will most likely be fetched and the newer, “bad” code would ju |