Mixing Up Great Data Without the Hassle of Scraping

There is an awful lot of great data online which has the potential to power awesome next-generation apps and services. Open data, government data services, and many APIs provide relatively easy access to data which can be integrated with sufficient time and effort.

The Problem

However, the data your app or service needs is not necessarily available in a useful format. In fact, the vast majority of online data is locked up in web pages, with no data dumps or APIs for accessing it programmatically.

Frequently, the solution to this problem has been the development of so-called "scrapers". While scrapers may initially solve the problem of data retrieval, they come with a number of issues. First of all, they are technically difficult and expensive to implement, and even more so to maintain. When a website's layout changes, even subtly, the scraper breaks, and there is no data access until a maintainer fixes it. Additionally, scrapers are specific to one page or website: abstracting the technology to apply to multiple sites is a major investment of time and resources.

A Solution

Import.io solves the problem of obtaining data from multiple sources on the web by providing APIs which query and extract data from websites into consistent JSON data formats.

The import.io extractor tool allows developers to train our platform to recognise semi-structured information in web pages, so that it can extract that information on demand over REST APIs. The extracted information is returned in a structured JSON document which adheres to the schema that you define when you create the extractor.
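
To make the idea of a schema-conforming JSON result concrete, here is a minimal sketch of how a client might check rows against the schema it expects. The column names, the `results` key, and the sample response are all hypothetical and chosen for illustration only; the actual payload returned by the platform may be shaped differently.

```python
import json

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"title": str, "price": float, "url": str}


def conforms(row, schema):
    """Return True if the row has exactly the schema's columns with matching types."""
    return row.keys() == schema.keys() and all(
        isinstance(row[name], expected) for name, expected in schema.items()
    )


# Illustrative response body only; not a real import.io payload.
SAMPLE_RESPONSE = (
    '{"results": [{"title": "Widget", "price": 9.99, '
    '"url": "http://example.com/widget"}]}'
)

rows = json.loads(SAMPLE_RESPONSE)["results"]
valid_rows = [row for row in rows if conforms(row, SCHEMA)]
```

Because every row follows the same schema, client code can treat data from a web page like data from any other typed API.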

For scenarios where a more complex process is required, an import.io connector can be created. A connector allows a developer to train the platform to find the correct information for a specific query. For example, a website which provides search functionality can be converted into a query API: developers send a query with specific inputs, which the platform executes, returning the result data in the same structured format as described earlier. The technology can also handle the retrieval of multiple pages of results, and can perform optimisations on the query to improve response times.
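
Conceptually, a connector behaves like a function from query inputs to rows, with paging handled behind it. The sketch below illustrates that idea only; `query_all_pages` and `fake_connector` are hypothetical names, and the stand-in connector returns canned data rather than calling any real service.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def query_all_pages(
    run_query: Callable[[Dict[str, str], int], List[Row]],
    inputs: Dict[str, str],
    max_pages: int = 10,
) -> Iterator[Row]:
    """Yield rows page by page until a page is empty or max_pages is reached."""
    for page in range(1, max_pages + 1):
        rows = run_query(inputs, page)
        if not rows:
            return
        yield from rows


def fake_connector(inputs: Dict[str, str], page: int) -> List[Row]:
    """Stand-in for a trained connector: two pages of results, then nothing."""
    term = inputs["query"]
    pages = {
        1: [{"name": term + " A"}, {"name": term + " B"}],
        2: [{"name": term + " C"}],
    }
    return pages.get(page, [])


results = list(query_all_pages(fake_connector, {"query": "laptop"}))
```

The caller supplies only the query inputs and receives a single flat sequence of rows, regardless of how many result pages the underlying site needed.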

The extractor and connector use cases describe retrieving data from one specific website. The power of the import.io platform is the ability to combine multiple data sources, through the creation of import.io mixes. A mix is simply a named collection of connectors and extractors. The main difference with a mix is that when you query it, the same input is federated to all of the data sources in the mix, and the results are streamed to the client as they become available. This lets developers adjust which sources of data are integrated with their applications and services, without having to recompile or redeploy the app.
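
The federate-and-stream behaviour of a mix can be sketched with a thread pool that sends the same input to every source and hands back each source's rows as soon as that source finishes. This is an illustration of the pattern, not the platform's implementation: `query_mix`, `shop_a`, and `shop_b` are hypothetical, and the sources are local stand-in functions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_mix(sources, inputs):
    """Send the same inputs to every source; yield (name, rows) as each finishes."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {pool.submit(fn, inputs): name for name, fn in sources.items()}
        for future in as_completed(futures):
            yield futures[future], future.result()


# Stand-in data sources; a real mix would contain connectors and extractors.
def shop_a(inputs):
    return [{"title": inputs["query"] + " from A"}]


def shop_b(inputs):
    return [{"title": inputs["query"] + " from B"}]


# The mix is just a named collection, so sources can be swapped without
# touching the code that consumes the results.
MIX = {"shop_a": shop_a, "shop_b": shop_b}

collected = dict(query_mix(MIX, {"query": "laptop"}))
```

The order in which sources are yielded depends on which finishes first, so a client should not rely on it.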

For more information on import.io and to sign up (required for querying), visit http://import.io. Our developer documentation is at http://docs.import.io, and our 2-line quick query (get data from a connector with 2 curl commands) is at http://go.import.io/start. Finally, the code that I demoed live at the Yahoo! Hack Europe event is available on GitHub at: http://go.import.io/yhack. If you have any questions, comments or feedback, then please drop us a line at hello@import.io. Every e-mail is read by a human and we'll do our best to help you out.