Azure Synapse Analytics

Intro

Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, CosmosDB, and AzureML.

Documentation

Tips and Tidbits

  • Azure Synapse Analytics is an analytics service that brings together enterprise data warehousing and big data analytics.

  • Dedicated SQL pool (formerly Azure SQL Data Warehouse) refers to the enterprise data warehousing features that are available in Azure Synapse Analytics.

  • Integrates Apache Spark, the most popular open-source big data engine, used for data preparation, data engineering, ETL, and machine learning.

Hash-distributed tables

  • A hash-distributed table can deliver the highest query performance for joins and aggregations on large tables.

  • To shard data into a hash-distributed table, a hash function is used to deterministically assign each row to one distribution, as sketched below.

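As a minimal T-SQL sketch (the table and column names here are invented for illustration), a hash-distributed table in a dedicated SQL pool is declared with the DISTRIBUTION = HASH option, naming the column whose hashed value assigns each row to a distribution:

    -- Hypothetical fact table, hash-distributed on CustomerId so that
    -- rows with the same customer always land in the same distribution.
    CREATE TABLE dbo.FactSales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerId),
        CLUSTERED COLUMNSTORE INDEX
    );

Picking a distribution column with many distinct values that is frequently used in joins and aggregations helps avoid data skew and data movement.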

Round-robin distributed tables

A round-robin table is the simplest table to create and delivers fast performance when used as a staging table for loads.

A round-robin distributed table distributes rows evenly across all distributions, without any further optimization. A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. Loading data into a round-robin table is quick, but query performance can often be better with hash-distributed tables, because joins on round-robin tables require reshuffling data, which takes additional time.

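A minimal sketch of a round-robin staging table (the names are invented for illustration; HEAP is one common choice for fast staging loads, though a columnstore index would also work):

    -- Hypothetical staging table; ROUND_ROBIN spreads incoming rows
    -- evenly across distributions, with no distribution column to choose.
    CREATE TABLE dbo.StageSales
    (
        SaleId     BIGINT        NOT NULL,
        CustomerId INT           NOT NULL,
        Amount     DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = ROUND_ROBIN,
        HEAP
    );

A common pattern is to load into a table like this quickly, then move the data into a hash-distributed table (for example, with CREATE TABLE AS SELECT) for query workloads.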

Replicated tables

A replicated table provides the fastest query performance for small tables.

A replicated table caches a full copy of the table on each compute node. Consequently, replicating a table removes the need to transfer data among compute nodes before a join or aggregation. Replicated tables work best with small tables: extra storage is required, and there is additional overhead when writing data, both of which make replicating large tables impractical.
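
A minimal sketch of a replicated dimension table (the names are invented for illustration):

    -- Hypothetical small dimension table; REPLICATE keeps a full copy
    -- on each compute node, so joins against it need no data movement.
    CREATE TABLE dbo.DimCurrency
    (
        CurrencyKey  INT          NOT NULL,
        CurrencyCode CHAR(3)      NOT NULL,
        CurrencyName NVARCHAR(50) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED COLUMNSTORE INDEX
    );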