SQL on Hadoop - Comparison and Key Facts


#1

In general SQL on Hadoop databases share these qualities (with the exception of HAWQ):

  • They have immature or incomplete SQL support and require some workarounds
  • They require a separate connection to a “Hive metastore” for the generator to work (See the section Querying data schema below for more detail.)

Pivotal HAWQ

  • H=Hadoop, Q=Query
  • HAWQ is the Greenplum MPP database running on the Hadoop file system, which is called HDFS (Hadoop distributed file system)
  • It has PostgreSQL syntax, just like Greenplum
  • HAWQ is the only software on this list that is not open source

Impala

  • Cloudera’s SQL-on-Hadoop offering. Other Hadoop distributions like MapR’s distro also include it.
  • Written entirely in C++ (unlike other Hadoop projects, which are in Java)
  • Impala and the Parquet file format are directly inspired by Google’s Dremel project, the successor to Google’s MapReduce project.
  • Impala uses the Hive metastore for storing information about table schema, column types, etc. (See below)

Hive

  • Hive started out as a SQL-to-MapReduce abstraction layer. (It takes a SQL statement and converts it to several MapReduce jobs.)
  • HortonWorks put considerable work into speeding up Hive, and this is HW’s SQL-on-Hadoop offering.
  • For more detail look up: Tez on YARN, ORC file format
  • Another version of Hive is under development which will use Spark as the backend.
  • Hive stores its data in a regular HDFS directory structure. Meta information about the data (i.e. schema, table information, column types, etc.) is stored in a separate database called the Hive Metastore. This is usually a MySQL or PostgreSQL database.

Shark

  • Shark is a fork of Hive that uses Spark as the back end instead of Map Reduce.
  • It has been deprecated by Data Bricks in favor of Spark SQL.

Spark SQL

  • A native SQL engine inside of Spark
  • Shark is being deprecated in favor of Spark SQL

Other

Google BigQuery

  • Google’s on-line data warehouse offering
  • Launched October 2010

NoSQL or “Not only SQL”

This is a general term referring to data stores that don’t use tabular (relational) relationships in RDBMSes.

The most popular ones are Hadoop-related, and a popular use case is storing and retrieving large amounts of machine-generated data.

Many of them will have a “SQL-like” query language for retrieving data.

Most NoSQL databases can fit in roughly 4 categories (thanks Wikipedia):

  1. Column stores (BigTables): Accumulo, Cassandra, HBase
  2. Document stores: CouchBase, MongoDB
  3. Key-value stores: Redis, Memcache
  4. Graph: Allegro, Neo4J

Wikipedia also has this nice graph showing the tradeoffs of the different types of showing why you might use one over a traditional relational database:

These are the NoSQL databases that you are most likely to encounter or be asked about:

Cassandra

  • DataStax is the main company behind it
  • CQL (Cassandra Query Language) is too functionally weak. For example it only supports COUNT(*) or COUNT(1). Also, it doesn’t support joins or subqueries.
  • Has tunable “eventual consistency” because row data is stored in multiple replicas.

MongoDB

  • “Most popular NoSQL database system”
  • Document-oriented database that stores in JSON-like dynamic schemas
  • Some customers use something called MoSQL to export MongoDB into PostgreSQL

HBase

  • Open source implementation of Google BigTable
  • Included in most Hadoop distributions
  • Uses one key to read or write one or more of a wide number of columns

Accumulo

  • The NSA version of HBase
  • Spun into a company called Sqrrl

unlisted #2

listed #3