In general SQL on Hadoop databases share these qualities (with the exception of HAWQ):
- They have immature or incomplete SQL support and require some workarounds
- They require a separate connection to a “Hive metastore” for the generator to work (See the section Querying data schema below for more detail.)
- H=Hadoop, Q=Query
- HAWQ is the Greenplum MPP database running on the Hadoop file system, which is called HDFS (Hadoop distributed file system)
- It has PostgreSQL syntax, just like Greenplum
- HAWQ is the only software on this list that is not open source
- Cloudera’s SQL-on-Hadoop offering. Other Hadoop distributions like MapR’s distro also include it.
- Written entirely in C++ (unlike other Hadoop projects, which are in Java)
- Impala and the Parquet file format are directly inspired by Google’s Dremel project, the successor to Google’s MapReduce project.
- Impala uses the Hive metastore for storing information about table schema, column types, etc. (See below)
- Hive started out as a SQL-to-MapReduce abstraction layer. (It takes a SQL statement and converts it to several MapReduce jobs.)
- HortonWorks put considerable work into speeding up Hive, and this is HW’s SQL-on-Hadoop offering.
- For more detail look up: Tez on YARN, ORC file format
- Another version of Hive is under development which will use Spark as the backend.
- Hive stores its data in a regular HDFS directory structure. Meta information about the data (i.e. schema, table information, column types, etc.) is stored in a separate database called the Hive Metastore. This is usually a MySQL or PostgreSQL database.
- Shark is a fork of Hive that uses Spark as the back end instead of Map Reduce.
- It has been deprecated by Data Bricks in favor of Spark SQL.
- A native SQL engine inside of Spark
- Shark is being deprecated in favor of Spark SQL
- Google’s on-line data warehouse offering
- Launched October 2010
NoSQL or “Not only SQL”
This is a general term referring to data stores that don’t use tabular (relational) relationships in RDBMSes.
The most popular ones are Hadoop-related, and a popular use case is storing and retrieving large amounts of machine-generated data.
Many of them will have a “SQL-like” query language for retrieving data.
Most NoSQL databases can fit in roughly 4 categories (thanks Wikipedia):
- Column stores (BigTables): Accumulo, Cassandra, HBase
- Document stores: CouchBase, MongoDB
- Key-value stores: Redis, Memcache
- Graph: Allegro, Neo4J
Wikipedia also has this nice graph showing the tradeoffs of the different types of showing why you might use one over a traditional relational database:
These are the NoSQL databases that you are most likely to encounter or be asked about:
- DataStax is the main company behind it
- CQL (Cassandra Query Language) is too functionally weak. For example it only supports COUNT(*) or COUNT(1). Also, it doesn’t support joins or subqueries.
- Has tunable “eventual consistency” because row data is stored in multiple replicas.
- “Most popular NoSQL database system”
- Document-oriented database that stores in JSON-like dynamic schemas
- Some customers use something called MoSQL to export MongoDB into PostgreSQL
- Open source implementation of Google BigTable
- Included in most Hadoop distributions
- Uses one key to read or write one or more of a wide number of columns
- The NSA version of HBase
- Spun into a company called Sqrrl