Last week I attended Strata + Hadoop World 2014 in New York. Here are some of my notes.
Spark was everywhere at the conference, and was mentioned in many of the talks I went to. It really seems to be the tool to use for writing batch processing jobs.
The Spark talks were very popular; in fact, I couldn’t even get into one of them because it was full.
One of the talks I did attend was about Spark SQL, which allows you to run SQL queries against a Spark RDD (a Resilient Distributed Dataset, Spark’s core representation of data). It doesn’t yet fully support any particular flavour of SQL, but you can register your own user-defined functions (UDFs) to extend it.
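To show what the UDF idea looks like in practice without assuming a Spark cluster, here is the same “register your own function so SQL can call it” pattern sketched with the standard library’s sqlite3 (the table and function names are made up for illustration; in Spark 1.x the equivalent call was `sqlContext.registerFunction`):

```python
import sqlite3

# Illustrative stand-in for Spark SQL's UDF registration, using sqlite3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE talks (title TEXT)")
conn.executemany("INSERT INTO talks VALUES (?)",
                 [("Spark SQL",), ("Scaling Storm",)])

# Register our own function so plain SQL can call into it:
conn.create_function("str_len", 1, lambda s: len(s))

rows = conn.execute(
    "SELECT title, str_len(title) FROM talks ORDER BY title").fetchall()
print(rows)  # → [('Scaling Storm', 13), ('Spark SQL', 9)]
```

The point is the shape of the API, not SQLite itself: the SQL dialect stays limited, but you extend it by dropping down into your own code.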
Spark also provides a JDBC server, so you can connect to it from traditional BI applications. Last week, Tableau announced they will support connecting to Spark SQL using JDBC.
It was announced future Strata conferences will have a dedicated Spark track. Maybe one day the conference will be called “Strata + Spark World”.
SQL on Hadoop
There are a number of projects providing SQL support for Hadoop. Presumably the aim is to make the data in Hadoop accessible to traditional ETL authors and their tools.
Impala was the one I heard most about. It can load data from HDFS itself, so you don’t need to write a separate load job. It can also infer a table schema from structured data files; currently this works only with Parquet, but JSON, Avro, Thrift and XML support is being worked on.
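To illustrate the schema-inference idea only (this is not Impala’s implementation, and the type names are just SQL-ish labels), a toy inference pass over JSON records might look like:

```python
import json

def infer_schema(records):
    # Map Python value types to SQL-ish column types (illustrative).
    type_names = {int: "BIGINT", float: "DOUBLE", str: "STRING", bool: "BOOLEAN"}
    schema = {}
    for rec in records:
        for col, val in rec.items():
            # First value seen for a column decides its type.
            schema.setdefault(col, type_names.get(type(val), "STRING"))
    return schema

sample = [json.loads(line) for line in [
    '{"id": 1, "name": "alice", "score": 9.5}',
    '{"id": 2, "name": "bob", "score": 7.0}',
]]
print(infer_schema(sample))  # → {'id': 'BIGINT', 'name': 'STRING', 'score': 'DOUBLE'}
```

Parquet makes this much easier in practice because the schema is stored in the file’s own metadata rather than being guessed from values.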
Impala is open source and backed by Cloudera, the lead sponsor of the conference. Cloudera claims it is the fastest database on Hadoop.
Data services
A data service is an application that exposes data over an API. Often the data will come from a relational database, but it could be anywhere.
Routing all data access through the data service brings a number of benefits. You can change the database without making any changes to the many applications that use the data, and you can implement caching or failover in the service rather than in each application.
Data services are also a good place to implement governance for the flow of data in the organisation.
At Amazon, there is no direct access to their databases. All requests for data go through a data service.
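A minimal sketch of the pattern, assuming nothing beyond the Python standard library (the user store and routes are hypothetical): a tiny HTTP service that owns all access to its data, with a read-through cache that clients never see.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical backing store; in a real data service this would be the
# relational database (or whatever else) hidden behind the API.
USERS = {"1": {"id": "1", "name": "Ada"}, "2": {"id": "2", "name": "Grace"}}

_cache = {}  # caching lives in the service, not in every client

def fetch_user(user_id):
    # Read-through cache: swap the dict for a real database, add failover,
    # or change the cache policy, and no client application has to change.
    if user_id not in _cache:
        _cache[user_id] = USERS.get(user_id)
    return _cache[user_id]

class DataService(BaseHTTPRequestHandler):
    def do_GET(self):
        # Routes like /users/1 -- the API is the contract, not the schema.
        parts = self.path.strip("/").split("/")
        user = fetch_user(parts[1]) if parts[:1] == ["users"] and len(parts) == 2 else None
        if user is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(user).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), DataService)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/users/1" % server.server_port
print(urllib.request.urlopen(url).read().decode())  # → {"id": "1", "name": "Ada"}
```

Because clients only ever see `/users/<id>`, the service is free to change everything behind that URL.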
Storm
I went to a talk on scaling Storm. It was interesting, even though I haven’t (yet) used Storm myself. Some of the key points:
- Keep the per-tuple execute code tight (fast)
- Use a local cache (e.g. Guava)
- Expose metrics
- Externalise configuration, for tuning
- If you have slow sinks, parallelise them
- Storm uses a lot of threads, so more CPUs are better
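The first two tips combine nicely. Sketched in Python rather than Java (a real Storm bolt would typically use a Guava cache; the store and keys here are made up): keep the per-tuple path tight by caching lookups locally instead of hitting a remote store on every tuple.

```python
from functools import lru_cache

REMOTE_STORE = {"user:1": "Ada", "user:2": "Grace"}  # stand-in for a slow remote lookup
store_calls = {"n": 0}

@lru_cache(maxsize=10_000)
def lookup(key):
    store_calls["n"] += 1  # count trips to the "remote" store
    return REMOTE_STORE.get(key)

def execute(tuple_key):
    # The hot path a bolt's execute() should keep tight:
    return lookup(tuple_key)

for _ in range(1000):
    execute("user:1")

print(store_calls["n"])  # → 1: a thousand tuples, one store lookup
```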
I also heard about Trident, a higher-level API that essentially compiles down to Storm’s spouts and bolts. One key feature is that it gives you global consistency for free. According to the presenter, you should probably be using Trident rather than raw Storm unless you have a very good reason not to.
Real-time BI at PayPal
PayPal described their internal real-time BI system. The most interesting part for me was the D3.js-based interface they have built, which lets users explore the data and build their own queries. By making this easy and accessible they are democratising their data.
This is similar to what we are trying to do with Tableau. I’m not sure the tool they built is any easier to use than Tableau, so they might have been better off adopting Tableau or something similar rather than building their own.
Slides for most of the talks are available on the Strata + Hadoop World website. Videos will be available to purchase in a couple of weeks.
Cover image by O’Reilly Conferences.