Amazon Redshift is a fully managed, petabyte-scale data warehouse service that AWS has been offering since Feb 2013. You can just start with a few GBs of data and scale to a PB or more. Redshift is ANSI-SQL compliant and uses industry-standard ODBC and JDBC connections which enable you to use your existing SQL Clients and Business Intelligence (BI) Tools. You can get started with Redshift with no Up-Front Costs or any long term commitments, and according to Amazon’s pricing you can create a cluster as low as $0.250 per hour to $6.800 per hour depending on the storage and compute that you select. These are On-Demand prices and you can save even more if you go for a 1 or a 3 years contract. Being a managed service offering, AWS will handle the backup and patching of the instances, and also recover from disk failures. When you load data into your Redshift cluster, it is automatically replicated and you also have the option to continuously back up your data to Amazon S3. So at any given point in time, you will at least have 3 copies of your data. The backups to S3 are incremental and if you need disaster recovery, then you can synchronously replicate your snapshots to S3 buckets in other regions. Scaling your cluster is as easy as creating it in the first place. You can scale up your cluster without any read access downtime. Redshift supports data encryption both at rest and in-transit. And once you create your cluster, you can retrieve the JDBC and ODBC connection strings using the management console or APIs, and connect to the cluster using your existing clients and tools. From where I stand, I see Redshift as a very lucrative solution for organizations who want to get started with a Data Warehousing solution, but at the same time, don’t want to make that huge initial investment.
I would now like to pivot our discussion and talk about the cluster architecture for Redshift. It is a massively parallel and shared nothing architecture, where we have one leader node and multiple compute nodes.
The Leader node acts as the SQL endpoint, it stores the metadata and it coordinates parallel SQL processing. On the other hand, Compute nodes are the ones that execute these queries in parallel. All the data sits in the compute nodes, they execute the queries and then send the results back to the leader node. The main selling point for AWS Redshift is its Columnar storage technology which enables parallel processing of data by distributing data and queries across multiple nodes. This increases the analytic query processing performance and it also reduces the amount of data that needs to be loaded from the disk. Say for example I only want the details of all the users on my website, who are located in Germany. I can just search for ‘Germany’ in the location column and then return those rows, which is way better than having to go through all the rows and then filtering out the rows where location==Germany. Having columnar storage also helps with data compression. We get higher compression ratios due to the same type of data that is stored in a column, and this, in turn, reduces the amount of storage that is required to store our data.
Redshift also has the concept of Zone maps, which is basically in-memory block metadata. It contains the MIN and MAX values for the blocks. A block is just a 1MB chunk of data.
Having zone maps reduces the required I/O even more as your queries can be targeted, which makes it more efficient and you get results faster.
Next, let’s talk about the concept of Slices. A slice can be thought of as a virtual compute node. It’s a unit of data partitioning and each compute node can have either 2, 16 or 32 slices. A slice only processes its own data so you can select one of the following Data Distribution styles, to enable parallel processing and minimize data movement during query processing:
- Key: Going back to the previous example of users with their locations. In this case, all the users with location == Germany will be stored on one node and all the users with location == France will be stored on the other node.
- Even: This is a basic round-robin algorithm so it will distribute the data evenly across all the nodes, irrespective of the values.
- All: This makes a full copy of the table on each node in the cluster. Obviously, this isn’t very efficient and will only work for very specific use cases.
As you might have already guessed, it is very important to select the appropriate data distribution style to get the best performance out of our cluster. And if you aren’t able to decide, the recommendation is to just go with Even.
I feel that there is a lot more that we can discuss, but since this is just a 101 post, we will keep it short. If you want to learn more, you can refer to the following links: