Yahoo Knowledge Graph Announces COVID-19 Dataset, API, and Dashboard with Source Attribution
<p><a href="https://www.linkedin.com/in/amitnagpal09/">Amit Nagpal</a>, Sr. Director, Software Development Engineering, Verizon Media</p><p>Among the many interesting teams at Verizon Media is the Yahoo Knowledge (YK) team. We build the <a href="https://www.researchgate.net/publication/322899161_The_Yahoo_Knowledge_Graph">Yahoo Knowledge Graph</a>, one of the few web scale knowledge graphs in the world. Our graph contains billions of facts and entities that enrich user experiences and power AI across Verizon Media properties. At the onset of the COVID-19 pandemic, we felt the need and responsibility to put our web scale extraction technologies to work and see how we could help. We have started to extract COVID-19 statistics from hundreds of sources around the globe into what we call the YK-COVID-19 dataset. The YK-COVID-19 dataset provides data and knowledge that help inform our readers on <a href="https://news.yahoo.com/">Yahoo News</a>, <a href="https://finance.yahoo.com/">Yahoo Finance</a>, <a href="https://mobile.yahoo.com/weather">Yahoo Weather</a>, and <a href="https://search.yahoo.com/">Yahoo Search</a>. We created this dataset by carefully combining and normalizing raw data provided entirely by <a href="https://github.com/yahoo/covid-19-data/blob/master/data-sources.md">government and public health authorities</a>. We provide website-level provenance for every single statistic in our dataset, so our community has the confidence it needs to use the data scientifically and report on it with transparency. After weeks of hard work, we are ready to make this data public in an easily consumable format at <a href="https://github.com/yahoo/covid-19-data">the YK-COVID-19-Data GitHub repo</a>.</p><p>A dataset alone does not always tell the full story. We reached out to teams across Verizon Media to get their help in building a set of tools that can help us, and you, build dashboards and analyze the data. 
Engineers from the Verizon Media Data team in Champaign, Illinois volunteered to build an API and dashboard. <a href="https://github.com/yahoo/covid-19-api/blob/master/README.md">The API</a> was constructed using a previously published Verizon Media open source platform called <a href="https://elide.io/">Elide</a>. The dashboard was constructed using <a href="https://emberjs.com/">Ember.js</a>, <a href="https://leafletjs.com/">Leaflet</a>, and the <a href="https://denali.design/">Denali design system</a>. We still needed a map tile server and were able to use the Verizon Location Technology team’s map tile service powered by <a href="https://www.here.com/products/mapping/map-data">HERE</a>. We leveraged <a href="http://screwdriver.cd/">Screwdriver.cd</a>, our open source CI/CD platform, to build our code assets, and our open source <a href="https://www.athenz.io/">Athenz.io</a> platform to secure our applications running in our Kubernetes environment. We did this using our open source <a href="https://github.com/yahoo/k8s-athenz-identity/blob/master/ATHENZ.md">K8s-athenz-identity</a> control plane project. You can see the result of this incredible team effort today at <a href="https://yahoo.github.io/covid-19-dashboard">https://yahoo.github.io/covid-19-dashboard</a>.</p><p><b>Build With Us</b></p><p>You can build applications that take advantage of the YK-COVID-19 dataset and API yourself. The YK-COVID-19 dataset is made available under a Creative Commons CC-BY-NC 4.0 license. Anyone seeking to use the YK-COVID-19 dataset for purposes that license does not cover is encouraged to <a href="https://docs.google.com/forms/d/e/1FAIpQLSdINfXR6S0ZmOGSvdvg4WUKzhqvDxltLoa4q4btQ4gkJokTPw/viewform">submit a request</a>.</p><p><b>Feature Roadmap</b></p><p>Updated multiple times a day, the YK-COVID-19 dataset provides country-, state-, and county-level data, depending on what is available from our many sources. 
We plan to offer more coverage, granularity, and metadata in the coming weeks.</p><p><b>Why a Knowledge Graph?</b></p><p>A knowledge graph is information about real world entities, such as people, places, organizations, and events, along with their relations, organized as a graph. We at Yahoo Knowledge have the capability to crawl, extract, combine, and organize information from thousands of sources. We create refined information used by our brands and our readers on Yahoo Finance, Yahoo News, Yahoo Search, and other sites. </p><p>We built our web scale knowledge graph by extracting information from web pages around the globe. We apply information retrieval techniques, natural language processing, and computer vision to extract facts from a variety of formats such as HTML, tables, PDF, images, and videos. These facts are then reconciled and integrated into our core knowledge graph, which gets richer every day. We applied the techniques and processes that are relevant in the COVID-19 context to gather information from hundreds of authoritative public and government websites. We then blend and normalize this information into a single combined COVID-19-specific dataset, with some human oversight for stability and accuracy. In the process, we preserve provenance information, so our users know where each statistic comes from and have the confidence to use it for scientific and reporting purposes with attribution. We then pull basic metadata such as latitude, longitude, and population for each location from our core knowledge graph. We also include a Wikipedia ID for each location, so it is easy for our community to attach additional metadata, as needed, from public knowledge bases such as Wikimedia or Wikipedia.</p><p>We’re in this together, so we are publishing our data along with a set of tools that we’re contributing to the open source community. 
We offer these tools, data, and an invitation to work together on getting past the raw numbers.</p><p>Yahoo, Verizon Media, and Verizon Location Technology are all part of the family at Verizon.</p>