SKYSCANNER IS A TRAVEL SERVICE that provides price comparisons for flights, car hire and hotels, and it has faced challenges in storing and managing data so that it can deliver up-to-date information to its users.
The website has been operating for more than a decade. While Skyscanner began by listing only budget flights in Europe, it now includes a broad range of carriers and destinations around the globe. From its inception, the nature of the business meant dealing with complex data queries that involve timetables, ticketing rules and prices for multiple airlines, which has led to some challenging data management issues.
Skyscanner CTO Alistair Hann explained, "If someone runs a search, then we need to be able to look at timetables and the rules about what tickets can be sold, as different airlines fly different routes and you can only fly certain tickets in certain directions."
The firm holds data such as timetable details in traditional relational data stores, but has moved to a NoSQL solution for caching pricing information, which changes frequently but needs to be held for a certain period to speed up queries.
"We serve a few hundred million live flight queries a month, and each of those involve queries to many providers, so a lot of data comes back and we cache that temporarily so we can reuse it in case someone runs a similar query," Hann said.
That data was traditionally held in a SQL database as well, but the cache caused "progressively more problems", according to Hann, until the firm sought an alternative. It started experimenting with the Couchbase NoSQL document-oriented database, which offered a clustered alternative to Memcached, the tool Skyscanner was using to cache the database.
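The temporary caching Hann describes, where provider responses are held for a limited period and reused when a similar query arrives, can be illustrated with a minimal in-memory sketch. This is not Skyscanner's implementation: the class `TTLCache`, the helpers `query_key` and `search`, and the `fetch_quotes` callback are hypothetical names, and a single-process dictionary stands in for a clustered store such as Couchbase.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str, str]  # (origin, destination, travel date); an assumed key shape


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[Key, Tuple[float, List[dict]]] = {}

    def get(self, key: Key) -> Optional[List[dict]]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Prices are volatile, so stale entries are dropped rather than served.
            del self._store[key]
            return None
        return value

    def put(self, key: Key, value: List[dict]) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def query_key(origin: str, destination: str, date: str) -> Key:
    """Normalise a query so that equivalent searches share one cache entry."""
    return (origin.strip().upper(), destination.strip().upper(), date)


def search(cache: TTLCache, origin: str, destination: str, date: str,
           fetch_quotes: Callable[[str, str, str], List[dict]]) -> List[dict]:
    """Return cached quotes if still fresh; otherwise query providers and cache."""
    key = query_key(origin, destination, date)
    cached = cache.get(key)
    if cached is not None:
        return cached
    quotes = fetch_quotes(*key)  # hypothetical call out to the flight providers
    cache.put(key, quotes)
    return quotes
```

A single dictionary like this can only grow within one machine's memory, which is the scaling limit that a clustered store is meant to remove.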
"We were looking for a solution that would allow us to scale out horizontally instead of keeping on applying more fixes to scaling the store in memory," he said.
Since then, as Skyscanner has added new applications, the firm has increasingly found that non-relational data stores are the best solution going forwards, according to Hann.
"We have one service that needs to calculate a lot of aggregate statistics about the prices we have been seeing to give heuristics to help predict whether we think a particular carrier will give us a good price on a route, for example, and we ended up building that in Couchbase using the MapReduce views because it was much less painful that way, development was quick and performance is good too," he said.
Hann stressed that Skyscanner is "not on a mission to eliminate the relational database", but that it quite naturally makes more sense to use the best tool for the job, and Couchbase tends to fit the problem.
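The aggregate price statistics Hann mentions follow the general map and reduce pattern, which the following Python sketch illustrates. It is an assumption-laden illustration rather than Skyscanner's code: the document fields (`type`, `carrier`, `route`, `price`) and the function names are hypothetical, and the grouping that a database would perform is done here in a plain loop.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (carrier, route); an assumed grouping key


def map_quote(doc: dict) -> Iterator[Tuple[Key, float]]:
    """Map step: emit ((carrier, route), price) for each price-quote document."""
    if doc.get("type") == "quote":
        yield (doc["carrier"], doc["route"]), doc["price"]


def reduce_stats(prices: List[float]) -> dict:
    """Reduce step: collapse all prices emitted for one key into summary stats."""
    count = len(prices)
    return {
        "count": count,
        "min": min(prices),
        "max": max(prices),
        "mean": sum(prices) / count,
    }


def run_view(docs: Iterable[dict]) -> Dict[Key, dict]:
    """Group mapped values by key, then reduce each group."""
    grouped: Dict[Key, List[float]] = defaultdict(list)
    for doc in docs:
        for key, value in map_quote(doc):
            grouped[key].append(value)
    return {key: reduce_stats(values) for key, values in grouped.items()}
```

In Couchbase itself, view map functions are written in JavaScript and can be paired with built-in reduce functions such as `_stats`, so the equivalent view would be far shorter than this sketch.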
For the most part, Skyscanner runs its platform on its own infrastructure sited in data centres across Europe and Asia, although the firm does make use of public cloud resources, chiefly to store archived flight data on Amazon Web Services (AWS), because this is the most cost-effective route.
Since Skyscanner started using Couchbase, developments in data management have seen the SQL and NoSQL approaches increasingly being fused, in order to allow companies to make use of familiar tools to query unstructured data sets. For example, Oracle's Big Data SQL and Actian's Analytics Platform-Hadoop SQL Edition both allow SQL queries to be run against data in Hadoop.
Hann said this approach is interesting, but his firm would likely have still chosen Couchbase.
"It's interesting, as [SQL] is a way that we are familiar with, but I believe we would have stuck to the same solution, because if you look at the way we developed, we have started using more things such as search history, and the ability to have a document store and the extensibility of a schema-less approach makes a lot of sense," he said.
Skyscanner still faces challenges regarding the volume of data it has to handle, and the complexity of the queries, according to Hann.
"We need to be in a position where all of our databases are horizontally scalable, and we can overcome the challenges over replication of data, especially between data centres in different continents. It's the sheer volume of data we create and the volatility of it," he said.
"But it's a good time to be having those challenges as there are a range of technologies now available."