## Azure Databricks Overview
Databricks is "managed Spark"? So what is [Spark](spark.md)?
This repo is meant to be the starting point for demos and presentations around Azure Databricks.
When running Spark you have a few options:
| Deployment Option | Description | Pros | Cons |
|---|---|---|---|
| IaaS | You build the Spark cluster yourself using VMs in Azure and manage the infrastructure, upgrades, etc. | <ul><li>Has the most flexibility</li><li>If your workload is 24x7 this is likely the cheapest from an Azure spend perspective</li></ul> | Has the highest Ops overhead |
| PaaS/HDInsight | Within the hour you are running a Spark cluster (in VMs), configured however you like, with the working Hadoop tooling you have come to expect | <ul><li>Cheaper if your workload is not 24x7</li><li>Does not require AS MUCH Ops overhead as IaaS</li></ul> | <ul><li>Since it is PaaS it is slightly more expensive than IaaS</li></ul> |
| Azure Databricks | Pure PaaS Spark offering created by the inventors of Spark. Connects to ADLS for ad-hoc analytics workloads that are better developed with Python or Scala in a notebook experience. | <ul><li>Zero Ops requirement</li><li>Data engineers and data scientists are immediately productive</li><li>Perfect for getting started with Spark quickly, or quick ADLS analytics with a Jupyter-style experience</li></ul> | Not many configuration/extensibility options. |
## Content
* [Presentation Notes and Documentation](notes/README.md) is the "Getting Started" guide/100-level content. This can be converted into a pptx if needed.
* [Installation](notes/installation.md) covers how to demo the installation of Azure Databricks using the portal.
* The [assets folder](assets/) contains the screenshots that can be incorporated into a pptx later.
Databricks is a cloud-based Spark platform that removes a lot of the Ops burden of running a Spark cluster, allowing you to focus on your ETL streams and analytics. Originally http://databricks.com was solely an AWS offering; the Azure version recently GA'd with native integration with other Azure services such as ADLS and WASB.
The Databricks service provides you with a master, workers, and executors, just like regular Spark, but the process is configuration-free and automated.
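In a Databricks notebook the Spark entry points are already created for you, so there is no session or cluster plumbing to write. A minimal sketch (the data path and column name are placeholders, not values from this repo):
```python
# In a Databricks notebook, `spark` (SparkSession) and `sc` (SparkContext)
# are pre-created by the platform -- no SparkSession.builder boilerplate is needed.

# Hypothetical path; point this at data you have uploaded or mounted.
df = spark.read.json("/mnt/demo/people.json")

df.printSchema()                  # inspect the inferred schema
df.groupBy("age").count().show()  # a distributed aggregation runs on the managed cluster
```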
The typical use case is a customer who has an immediate need for Spark-based analytics for a defined period of time. With a PaaS offering you must terminate the service to reduce your costs when it is not being utilized.
## Azure Databricks Workloads and Pricing
As with HDI, VMs are still being provisioned for you under-the-covers.
### Workload "Customizaton"
Databricks is optimized for two workloads:
* Data Engineering
    * This is generally an "automated job" that starts the cluster, does its task, then terminates the cluster (a hedged example job definition follows below).
    * These workloads are meant for ETL and defined analytics batch jobs that you want to run in a Spark environment instead of HDI, ADF, or something else.
    * These clusters spin up much faster than equivalent HDI clusters and can scale in a similar fashion.
* Data Analytics
    * Any ad-hoc workload.
    * Can support multiple users and can scale elastically.
    * The user is responsible for terminating the cluster.
In general, a Data Analytics (ad-hoc) workload costs about twice as much as a Data Engineering (automated job) workload. Because automated jobs are scheduled, Databricks knows under-the-covers which workloads will be required at which times of day, can optimize for those future workloads more efficiently, and passes the cost savings on to the customer.
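To make the Data Engineering pattern concrete, here is a hedged sketch that registers a scheduled job through the Databricks Jobs REST API (2.0): the job spins up its own ephemeral cluster, runs a notebook, and tears the cluster down when the task finishes. The workspace URL, token, notebook path, runtime version, and sizing are placeholder assumptions, not values from this repo.
```python
import requests

# Placeholders -- substitute your own workspace URL and personal access token.
HOST = "https://eastus2.azuredatabricks.net"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "nightly-etl",                      # hypothetical job name
    "new_cluster": {                            # ephemeral cluster: created per run, terminated after
        "spark_version": "4.0.x-scala2.11",     # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    },
    "notebook_task": {"notebook_path": "/Shared/etl/nightly"},  # hypothetical notebook
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # every night at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())  # e.g. {"job_id": 42}
```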
### Databricks Serverless Option
If you have a Data Analytics (adhoc) workload you can choose the Databricks Serverless option. Azure maintains always-on Databricks servers for adhoc analytics. This option allows you to select how much you are willing to scale your workload and then you are billed for exactly what you use. You choose a min/max worker nodes (say, 2/8) and then if your workload only uses 8 nodes for 1 hour and 2 nodes for the remaining hours the cluster is active, you pay for what you consume.
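A back-of-the-envelope illustration of the 2/8 scenario above, assuming purely for illustration that each worker consumes 1 DBU per hour and the cluster is active for 8 hours (real DBU rates are on the Azure Databricks pricing page):
```python
# Hypothetical rates -- check the Azure Databricks pricing page for real DBU figures.
DBU_PER_NODE_HOUR = 1.0      # assumed DBUs one worker consumes per hour
PRICE_PER_DBU = 0.40         # assumed $ per DBU for this tier/workload

# Autoscaling profile: 8 workers for 1 hour, then 2 workers for the remaining 7 hours.
usage = [(8, 1), (2, 7)]     # (workers, hours)

dbu_hours = sum(workers * hours * DBU_PER_NODE_HOUR for workers, hours in usage)
cost = dbu_hours * PRICE_PER_DBU

print(f"DBU-hours consumed: {dbu_hours}")   # 8*1 + 2*7 = 22 DBU-hours
print(f"Estimated DBU cost: ${cost:.2f}")   # 22 * 0.40 = $8.80
# Note: VM infrastructure cost is billed separately and is not included here.
```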
### Pricing Tiers
Like other PaaS offerings, you are billed by a "consumption unit", in this case the "Databricks Unit" (DBU), which represents a unit of processing capability per hour.
Standard:
* Provides a full Spark environment with AAD authentication
Premium:
* RBAC for notebooks, jobs, clusters, tables
* JDBC/ODBC endpoint authentication
For most new users Standard will provide exactly what they need.
Premium overcomes some **severe** limitations that have always existed in the Hadoop world. For instance, LDAP authentication is possible but must be configured for each individual Hadoop service. Services such as job scheduling run under a system-level account, meaning anyone who could schedule a job could potentially read data they should not. Python notebooks were another offender: the Jupyter service logs into the Hadoop cluster with a single account, so anyone with access to a Jupyter notebook has access to any of the underlying data. The Premium tier overcomes this by preconfiguring ALL Spark services to be secure-by-default.
In general, Premium is about 2x the price of Standard.
## Notebooks
A Notebook is a special type of Databricks folder that can be used to create Spark scripts. Notebooks can call other Notebooks to build a hierarchy of functionality. When a Notebook is created its language must be specified (Python, Scala, or SQL), and it is then attached to a cluster that runs it.
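A minimal sketch of that composition from a Python notebook; the child notebook paths and parameters are hypothetical. `%run` inlines another notebook into the current one, while `dbutils.notebook.run` executes it as a separate run and returns its exit value.
```python
# Cell 1: inline a helper notebook so its functions and variables are available here.
# %run ./setup_common_functions   (hypothetical sibling notebook; %run must be alone in its cell)

# Cell 2: run another notebook as its own ephemeral run, with a timeout and parameters.
result = dbutils.notebook.run(
    "/Shared/etl/load_sales",        # hypothetical notebook path
    600,                             # timeout in seconds
    {"run_date": "2018-06-01"},      # parameters the child reads via dbutils.widgets
)
print(result)  # whatever the child returned via dbutils.notebook.exit(...)
```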
## Get Running Quickly and Next Steps
[Installation](installation.md) and [Navigating Your Cluster Dashboard](navigating.md) below are a good demo for installing and getting familiar with the product.
### Create a base Azure Databricks Namespace
An organization can have 1 or more Databricks namespaces. A namespace is tied to a pricing tier but can include 1 or more Databricks clusters at that pricing tier. **Creating a Databricks namespace incurs NO COST**.
* Log in to [portal.azure.com](https://portal.azure.com)
* `+ Create a resource`
* Search for `Azure Databricks`
* Click `Create`
* Pick the options you need in the `Azure Databricks Service` blade
It takes about 10 minutes to deploy a Databricks namespace. The resulting portal blade will look like this:
<img src="/assets/install01.PNG">
Note that all this really provides is:
* the link to launch the workspace to build Databricks services
* links to documentation
* links to methods to import data
* the link to the notebook server
### Logging in to the Workspace
Databricks is a multi-tenant PaaS offering. To log in to OUR namespace we use the link provided in the Azure Databricks Service blade (https://eastus2.azuredatabricks.net) and then log in with our AAD credentials. Based on our AAD user, the service knows which Databricks workspace(s) we have access to.
After logging in you should see the Databricks workspace portal.
<img src="/assets/install02.PNG">
This can be bookmarked so that in the future you can log in directly to the workspace and bypass the Azure portal.
### Creating a Cluster
Before we can do anything else with Databricks we have to create a cluster.
<img src="/assets/install03.PNG">
Since this is a new workspace we have no clusters available. Click `Clusters >> Create Cluster`
You have a choice between a Standard Cluster and a Serverless Pool.
#### Databricks Serverless Option
If you have a Data Analytics (ad-hoc) workload you can choose the Databricks Serverless option. Azure maintains always-on Databricks servers for ad-hoc analytics. This option lets you select how far you are willing to scale your workload, and you are then billed for exactly what you use. The downside is that the Serverless option is a "pool" of resources you share with other, potentially "noisy", neighbors. The service is optimized to avoid multi-tenant problems, so you should not see any issues.
Serverless pools are ideal for SQL, Python, and R workloads and require very little configuration (low Ops). For most data exploration use cases this works very well and can save you a lot of money. Serverless pools are NOT a good option if you need Scala, a specific Spark version, or a custom Spark configuration for your workload.
> Our advice: if you are new to Spark and Databricks, choose the Serverless Pool Option. It is much easier to configure.
#### Serverless Pools vs Standard Clusters
Let's compare the two offerings by configuring *similar* Databricks clusters using both options:
Screenshot of Serverless Pools options:
<img src="/assets/install04.PNG">
Screenshot of Standard Cluster options:
<img src="/assets/install05.PNG">
Note the following differences:
* configuring a standard cluster has far more options
* a standard cluster can be set to "auto-terminate" after a period of inactivity. While this sounds like a good idea, you may find your cluster (and the data cached on it) gone if the inactivity threshold is set too low. Be sure to set it high.
* in reality, Serverless Pools do NOT need this setting because with the Serverless option you are billed for the ACTUAL DBUs you consume. So if you leave for lunch you will be billed only about 2 DBUs in the above configuration, because the Driver must always be ready to service a new request.
* Note that the same cost is quoted at the top for both cluster "types".
#### Finishing Cluster Setup
Your cluster will be in a pending state until it is fully provisioned:
<img src="/assets/install06.PNG">
It takes about 10 minutes to provision a Databricks cluster. The final cluster screen:
<img src="/assets/install07.PNG">
### Cluster Configuration Best Practices
* Always choose Python 3 unless you have a specific need. Python 2 is EOL in less than 2 years.
* The "Driver" node handles Spark node coordination and connectivity. It can generally be sized the same as the worker nodes.
* If you are unsure how much memory or how many workers you need, simply set a wide spread for Min Workers/Max Workers and Enable Autoscaling if using a Standard Cluster. Databricks provides monitoring that can be used to fine-tune the settings.
* Worker Type advice:
    * Spark is generally more memory-constrained than compute-constrained. You will want to default to `Standard_DSx_v2` instances, which are memory-optimized. (A hedged example cluster spec follows this list.)
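As a hedged illustration of the advice above, the sketch below creates such a cluster through the Databricks Clusters REST API (2.0): a wide autoscaling spread, a DS-series worker type, and a generous auto-terminate threshold. The workspace URL, token, runtime version, and sizing numbers are placeholders.
```python
import requests

HOST = "https://eastus2.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"              # placeholder PAT

cluster_spec = {
    "cluster_name": "adhoc-analytics",                   # hypothetical name
    "spark_version": "4.0.x-scala2.11",                  # placeholder Databricks runtime
    "node_type_id": "Standard_DS3_v2",                   # DS-series node, per the Worker Type advice
    "autoscale": {"min_workers": 2, "max_workers": 8},   # wide spread; let autoscaling find the level
    "autotermination_minutes": 120,                      # auto-terminate, but with a generous threshold
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # e.g. {"cluster_id": "0601-...-abc123"}
```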
[Navigating Your Cluster Dashboard](navigating.md)
## Cluster Dashboard
Your cluster dashboard can be launched from the Azure Portal:
<img src="/assets/install01.PNG">
and then clicking your cluster in the dashboard:
<img src="/assets/install06.PNG">
Your screen may look different based on the cluster setup configuration.
<img src="/assets/nav01.PNG">
`Edit` allows you to reconfigure the underlying Spark VMs.
`Clone` allows you to create a replica Databricks cluster.
`Restart` will allow you to "reboot the cluster" if it is unresponsive or a Driver job is hung and cannot be restarted. (This is rare).
`Terminate` will deprovision the cluster. You will lose any data stored in the Databricks File System (DBFS) that has not been saved to ADLS or WASB (a sketch of persisting data follows below). This option stops billing and consumption spend.
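A minimal sketch of persisting results to WASB from a notebook before terminating; the storage account, container, and key are placeholders (in practice the key should come from a secret store, not inline code), and `results_df` stands for a DataFrame built earlier in the notebook:
```python
# Placeholders -- substitute your own storage account, container, and access key.
account = "mystorageaccount"
container = "analytics"

# Give Spark the storage key so it can write to the WASB endpoint.
spark.conf.set(
    f"fs.azure.account.key.{account}.blob.core.windows.net",
    "<storage-account-key>",
)

# results_df is assumed to be a DataFrame built earlier in the notebook.
# Write it out of DBFS so it survives cluster termination.
results_df.write.mode("overwrite").parquet(
    f"wasbs://{container}@{account}.blob.core.windows.net/output/results"
)
```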
`Spark UI` will allow you to see any running jobs. From here you can see the Spark equivalent of an "explain" plan, known as a DAG, that can help you determine when a job will finish, if it needs more resources, or where the bottleneck is in your code.
<img src="/assets/nav02.PNG">
`Spark cluster UI` will show you running programs and an overview of their consumed resources.
<img src="/assets/nav03.PNG">