Datacoral supports several database engines as part of its transformation technology. Using existing databases allows us to leverage the significant R&D effort that specialized vendors have invested in functionality, performance, and scalability. As database technology continues to evolve, it is important for us and our customers to understand the new capabilities that are being developed and take advantage of them.
Datacoral supports Amazon Redshift and was in fact showcased as a Redshift Ready Partner in the Global Partner Keynote at the AWS re:Invent 2019 event. Redshift is changing at a rate of hundreds of new features and enhancements per year. Keeping up with these changes takes real work, but that work is crucial if we are to help our customers get the maximum value from a product that is growing in complexity.
The annual AWS conference re:Invent returned to Las Vegas in early December, and there were plenty of announcements and sessions on numerous topics. Much has changed between the first conference in 2012, which had about 6,000 attendees, and the 2019 edition, which was at least an order of magnitude larger.
Amazon Redshift was announced at the 2012 re:Invent and became an instant hit. It was the fastest-growing AWS service until it was surpassed by Aurora a couple of years ago. Much of Redshift’s success was due to its pricing slogan of $1,000 per TB per year for certain instance types, which was dramatically lower than what people were used to spending on databases and a very good deal for a managed database service. At this year’s conference, it was claimed that Redshift is the most popular cloud data warehouse technology, with tens of thousands of customers.
Highlighted at the conference were recent and upcoming Redshift features as well as trends in Redshift usage, like customers moving to data-lake architectures. There were three particular areas of customer demand:
- Cloud migration
- Exponential growth of event data
- End-to-end analysis of data
Among the existing benefits that were highlighted were integration with a variety of other AWS technologies, scalability and performance, the fact that it is a managed service, and its variety of security features and compliance certifications. Also highlighted was its cost, which is hard for anyone to beat unless they gain major benefits from separating compute and storage.
In addition, a major focus was features and functionality that are making Redshift increasingly data-lake friendly. When Redshift was first rolled out, it was a traditional MPP database where the data was stored on the local disks of the nodes of a cluster. Over the years, features have been added to facilitate queries against other sources of data, like S3, and to separate compute and storage, which were tightly coupled in the original architecture. Features like Spectrum fit into this general data-lake trend.
As for newer capabilities, it was claimed that Redshift has had 200+ new features and enhancements in the last 18 months. Obviously, not every single one was presented in detail. Some of the newer features have been rolled out very recently, while others are in beta preview and won’t be publicly available until next year.
Below are some of the more prominent features discussed at the conference that we believe will help our customers who use Redshift.
RA3 is a new instance type that promises an unprecedented combination of scalability, performance, and price while allowing compute and storage costs to be separated. It seems like the Redshift team is finally taking the separation of compute and storage really seriously, with Concurrency Scaling as a key feature. According to AWS, the key ingredients for RA3 are managed storage, high-speed caching, and high-bandwidth networking. RA3 instances are already generally available, and some happy beta customers (including Western Digital and Yelp) were mentioned at the presentation.
AQUA query acceleration through FPGAs
One of the new features due to become available next year is Redshift AQUA, the “Advanced QUery Accelerator.” There is a more detailed description of the architecture here. This is hardware acceleration based on FPGAs sitting between the Redshift clusters and storage. When you think of hardware acceleration through FPGAs, it’s hard not to think about Netezza, a data warehouse appliance vendor that IBM bought in 2010. Using FPGAs can be an inexpensive way to provide processing power to CPU-intensive workloads, and AWS claims that AQUA can be “up to” 10x faster. However, the benefits of hardware acceleration are highly workload dependent. Hardware acceleration can also reduce the data traffic from storage to the DB servers through storage-level predicate evaluation and projection. Redshift already has that ability when using range-restricted scans (aka zone maps) on local disks, where the columnar storage provides projections. On S3, the Parquet storage format provides similar benefits. So it will be interesting to see what impact this feature will have on various workloads. In any case, the technology will be available with RA3 instances.
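To make the idea of a range-restricted scan concrete, here is a minimal sketch of how zone maps let a scan skip blocks. It is a toy model, not Redshift's implementation: the `Block` class, the `range_scan` function, and the sample data are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """One storage block of a single column plus its min/max metadata."""
    values: List[int]
    zmin: int = field(init=False)
    zmax: int = field(init=False)

    def __post_init__(self) -> None:
        # The "zone map" is just the smallest and largest value in the block.
        self.zmin = min(self.values)
        self.zmax = max(self.values)


def range_scan(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return the values in [lo, hi] and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        if block.zmax < lo or block.zmin > hi:
            continue  # the zone map proves that no value here can match
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read


# Hypothetical column split into three blocks of sorted data.
blocks = [Block([1, 3, 5]), Block([10, 12, 19]), Block([20, 25, 31])]
matches, blocks_read = range_scan(blocks, 11, 21)
print(matches, blocks_read)
```

In this toy example the first block is never read, because its maximum value is below the lower bound of the predicate. The better the data is sorted on the filtered column, the more blocks a scan like this can skip.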
AZ64 is a proprietary compression encoding that promises high degrees of compression and fast decompression for numeric and time-related data types. It was originally announced in October. AWS claims that it uses 35 percent less storage and is 40 percent faster than LZO. It was listed as one reason that Redshift has improved by more than 2x on the TPC-DS benchmark, along with other performance enhancements like:
- Bloom filters
- Planner enhancements
- HLL (HyperLogLog) for statistics
- Cache-optimized aggregation and join processing
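The presentation did not go into how Redshift uses Bloom filters internally, so the following is only a sketch of the general technique, under the assumption that the filter is built from the join keys of the smaller table and used to discard non-matching rows of the larger table early. The `BloomFilter` class, its parameters, and the sample tables are hypothetical.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Build the filter from the (small) dimension side of a join...
dim_keys = ["us", "de", "jp"]
bloom = BloomFilter()
for key in dim_keys:
    bloom.add(key)

# ...and use it to drop fact rows that cannot possibly join.
fact_rows = [("us", 10), ("fr", 7), ("jp", 3), ("br", 5)]
candidates = [row for row in fact_rows if bloom.might_contain(row[0])]
print(candidates)
```

Rows that survive the filter still have to go through the real join, since a Bloom filter can produce false positives. The benefit is that most non-matching rows are discarded cheaply before the join does any work on them.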
Another performance feature currently in beta preview is materialized views. It seems like a fairly standard MV feature, with query rewrite and both complete and incremental refresh. It currently has quite a few limitations.
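For readers less familiar with the two refresh modes, the difference can be sketched in a few lines. This is a conceptual toy, not how Redshift implements materialized views: it assumes a view that computes a per-key sum and a base table that only receives inserts, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group key, amount)


def complete_refresh(base_rows: Iterable[Row]) -> Dict[str, int]:
    """Recompute the whole view (SUM(amount) GROUP BY key) from scratch."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in base_rows:
        totals[key] += amount
    return dict(totals)


def incremental_refresh(view: Dict[str, int], new_rows: Iterable[Row]) -> Dict[str, int]:
    """Apply only the rows inserted since the last refresh to the stored view."""
    refreshed = dict(view)
    for key, amount in new_rows:
        refreshed[key] = refreshed.get(key, 0) + amount
    return refreshed


base: List[Row] = [("a", 1), ("b", 2), ("a", 3)]
view = complete_refresh(base)

delta: List[Row] = [("b", 5), ("c", 7)]  # rows inserted after the first refresh
view_incremental = incremental_refresh(view, delta)
view_complete = complete_refresh(base + delta)
print(view_incremental == view_complete)
```

Both paths end with the same view contents, but the incremental path touches only the new rows. Updates, deletes, and non-additive aggregates make incremental maintenance considerably harder, which may be one reason such features tend to ship with limitations.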
A new data type, Geometry, has been introduced to support ingesting, analyzing, and storing spatial data. It comes with 40+ spatial SQL functions and predicates, like ST_Covers, ST_Within, and ST_Distance.
Another feature in preview is federated queries, which can access data in Postgres databases on RDS or Aurora. Apparently, it’s part of the story of Redshift becoming more data-lake friendly. The ability to access data in other databases is highly useful, so this is a welcome addition. They gave an example of creating a UNION ALL view in Redshift with branches combining hot data stored in Aurora, recent data stored in Redshift, and archived data stored in S3.
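The UNION ALL example might look roughly like the statement generated below. This is a sketch under several assumptions: the schema names `aurora_live` and `spectrum_archive`, the table `orders`, and the view name are all hypothetical, the external schemas are assumed to have been created beforehand, and the `WITH NO SCHEMA BINDING` clause reflects our understanding that Redshift requires a late-binding view when a view references external tables.

```python
from typing import Sequence


def union_all_view(view_name: str, columns: Sequence[str], sources: Sequence[str]) -> str:
    """Render a late-binding view that unions the same columns from each source.

    Identifiers are interpolated as-is, so they must come from trusted input.
    """
    select_list = ", ".join(columns)
    branches = [f"SELECT {select_list} FROM {source}" for source in sources]
    body = "\nUNION ALL\n".join(branches)
    return f"CREATE VIEW {view_name} AS\n{body}\nWITH NO SCHEMA BINDING;"


sql = union_all_view(
    "all_orders",
    ["order_id", "customer_id", "order_ts", "amount"],
    [
        "aurora_live.orders",       # hot data, federated from Aurora (hypothetical)
        "public.orders",            # recent data stored in Redshift
        "spectrum_archive.orders",  # archived data in S3 (hypothetical)
    ],
)
print(sql)
```

Queries against a view like this see one logical table, while each branch is served by the system that holds that slice of the data.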
Data export in Parquet
Another feature is that data can be unloaded from Redshift to S3 in Parquet format, which is another data-lake friendly feature. Once in S3, the data can be accessed by a variety of services.
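As a rough illustration, an unload to Parquet could be issued with a statement like the one built below. The helper name, the query, the bucket path, and the IAM role ARN are all placeholders made up for this sketch; please check the `UNLOAD` documentation for the options that apply to your cluster.

```python
def unload_to_parquet(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Render a Redshift UNLOAD statement that writes Parquet files to S3."""
    # UNLOAD takes the query as a quoted string, so any single quotes
    # inside the query are doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


stmt = unload_to_parquet(
    "SELECT * FROM orders WHERE order_ts >= '2019-01-01'",   # hypothetical query
    "s3://example-bucket/exports/orders_",                    # placeholder path
    "arn:aws:iam::123456789012:role/ExampleRedshiftUnload",   # placeholder role
)
print(stmt)
```

The generated statement can then be run through any SQL client connected to the cluster.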
Performance features based on ML
Another item was the optimization of table design and maintenance based on usage patterns using ML algorithms. It included automation of analyze, vacuum and sorting as well as advisors for sort keys and distribution styles.
Other highlighted management features included a new and improved management console, Auto WLM, and a scheduler for elastic resizing of clusters.
Stored procedures have made their way back to Redshift. They existed in the original source code for ParAccel that AWS had acquired as the basis for Redshift but were removed because of security concerns. It’s about time they came back, since stored procedures are highly useful. Redshift uses the PL/pgSQL procedural language, which is no surprise since it has a Postgres-based front end.
At Datacoral, we can’t wait to try out all of these features! We take it as our responsibility to figure out which of these features will truly benefit our customers. We will report back our findings in a future blog post.
Please reach out to us at email@example.com if you have comments or want to try out Datacoral.