## Azure Databricks Overview
Databricks is "managed Spark". So what is [Spark](spark.md)?
When running Spark you have a few options:
| Deployment Option | Description | Pros | Cons |
|---|---|---|---|
| IaaS | You build the Spark cluster yourself using VMs in Azure and manage the infrastructure, upgrades, etc. | <ul><li>Has the most flexibility</li><li>If your workload is 24x7 this is likely the cheapest from an Azure spend perspective</li></ul> |Has the highest Ops overhead|
| PaaS/HDInsight | Within the hour you are running a Spark cluster (on VMs), configured however you like, with the working Hadoop tooling you have come to expect |<ul><li>Cheaper if your workload is not 24x7</li><li>Does not require AS MUCH Ops overhead as IaaS</li></ul> |<ul><li>Since it is PaaS it is slightly more expensive than IaaS</li></ul> |
| Azure Databricks | Pure PaaS Spark offering created by the inventors of Spark. Connects to ADLS for ad-hoc analytics workloads that are better developed with Python or Scala in a notebook experience. | <ul><li>Zero Ops requirement</li><li>Data engineers and data scientists are immediately productive</li><li>Perfect for getting started with Spark quickly or quick ADLS analytics with a Jupyter experience</li></ul> |Not many configuration/extensibility options.|
Databricks is a cloud-based Spark platform that removes much of the Ops burden of running a Spark cluster, allowing you to focus on your ETL streams and analytics. Originally http://databricks.com was solely an AWS offering; recently the Azure version GA'd with native integration with other Azure services such as ADLS and WASB.
The Databricks service provides you with a master, workers, and executors, just like regular Spark, but the process is configuration-free and automated.
The typical use case is a customer who has an immediate need for Spark-based analytics for a defined period of time. With a PaaS offering you must terminate the service to reduce your costs when the service is not being utilized.
## Azure Databricks Workloads and Pricing
Like HDI, VMs are still being provisioned for you under-the-covers.
### Workload "Customization"
Databricks is optimized for two workloads:
* Data Engineering
    * This is generally an "automated job" that starts the cluster, does its task, then terminates the cluster (see the sketch after this list).
    * These workloads are meant for ETL and defined analytics batch jobs that you want to run in a Spark environment instead of HDI, ADF, or something else.
    * These clusters spin up much faster than equivalent HDI clusters and can scale in a similar fashion.
* Data Analytics
    * Any ad-hoc workload
    * Can support multiple users and can scale elastically
    * The user is responsible for terminating the cluster
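To make the "automated job" pattern concrete, here is a minimal sketch that defines a scheduled job through the Databricks Jobs REST API (`POST /api/2.0/jobs/create`). The workspace URL, token, notebook path, and runtime version are placeholders rather than values from this document; the job creates its own cluster when a run starts and that cluster is torn down when the run finishes.
```python
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

job_spec = {
    "name": "nightly-etl",
    # Job-scoped cluster: created when the run starts, removed when it ends.
    "new_cluster": {
        "spark_version": "<runtime-version>",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",     # Azure VM size for the workers
        "num_workers": 4,
    },
    # The task itself: run a notebook stored in the workspace.
    "notebook_task": {"notebook_path": "/ETL/nightly-load"},
    # Run every night at 02:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```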
In general, a Data Analytics (ad-hoc) workload costs roughly twice as much as a Data Engineering (automated job) workload. This is because, for automated jobs, Databricks knows under-the-covers what workloads will be required and when, so the service can be provisioned for future workloads more efficiently, and those cost savings are passed on to the customer.
### Databricks Serverless Option
If you have a Data Analytics (ad-hoc) workload you can choose the Databricks Serverless option, where Azure maintains always-on Databricks servers for ad-hoc analytics. This option lets you choose how far you are willing to scale your workload, and you are then billed for exactly what you use: you pick min/max worker counts (say, 2/8), and if your workload uses 8 nodes for 1 hour and only 2 nodes for the remaining hours the cluster is active, you pay only for what you consume.
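The same min/max idea is also how ordinary autoscaling clusters are expressed, which is the easiest way to show it in code. Below is a minimal sketch against the Databricks Clusters REST API (`POST /api/2.0/clusters/create`); the workspace URL, token, and runtime version are placeholders, and the exact options available depend on your workspace and tier.
```python
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

cluster_spec = {
    "cluster_name": "adhoc-analytics",
    "spark_version": "<runtime-version>",  # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",     # Azure VM size for the workers
    # Scale between 2 and 8 workers depending on load.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Shut the cluster down after an hour of inactivity so you stop paying for it.
    "autotermination_minutes": 60,
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```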
### Pricing Tiers
Like other PaaS offerings, you are billed by a "consumption unit", in this case the Databricks Unit (DBU), a unit of processing capability consumed per hour.
Standard:
* A full Spark environment with AAD authentication
Premium:
* RBAC for notebooks, jobs, clusters, and tables
* JDBC/ODBC endpoint authentication
For most new users Standard will provide exactly what they need.
Premium overcomes some **severe** limitations that have always existed in the Hadoop world. For instance, LDAP authentication is possible but needs to be configured for each individual Hadoop service. Worse, services such as job scheduling run under a system-level account, meaning anyone who could schedule a job could potentially read data they should not. Python notebooks were another offender: the Jupyter service logs into the Hadoop cluster with a single account, so anyone with access to a Jupyter notebook has access to any of the underlying data. The Premium tier overcomes this by preconfiguring ALL Spark services to be secure-by-default.
In general, Premium is about 2x the price of Standard.
## Get Running Quickly
### Create a Base Azure Databricks Workspace
An organization can have one or more Databricks workspaces. A workspace is tied to a pricing tier but can include one or more Databricks clusters at that pricing tier.
* Log in to [portal.azure.com](https://portal.azure.com)
* `+ Create a resource`
* Search for `Azure Databricks`
* Click `Create`
* Pick the options you need in the `Azure Databricks Service` blade
## Overview of Spark
Traditionally working with Big Data in Hadoop meant MapReduce. MapReduce is a functional programming paradigm where you write programs as small functions that are "mapped" (executed like a loop) over small subsets of huge datasets. Each mapper's output (think temp table data) is written to disk and transferred to a single node where it is "reduced" (think aggregations like sum or avg) further to get the desired output.
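To make the paradigm concrete, here is a tiny in-memory sketch of the map/shuffle/reduce flow written in plain Python. It is only an illustration of the programming model; real Hadoop MapReduce jobs are typically written in Java against the Hadoop APIs, and the shuffle happens across disks and machines.
```python
from collections import defaultdict

# A tiny "dataset" split into chunks, standing in for file blocks in HDFS.
chunks = [
    ["the quick brown fox"],
    ["the lazy dog", "the end"],
]

def mapper(lines):
    """Map phase: each mapper sees one chunk and emits (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle: group every value for the same key together.
# In Hadoop this is where intermediate data hits disk and the network.
grouped = defaultdict(list)
for chunk in chunks:
    for key, value in mapper(chunk):
        grouped[key].append(value)

def reducer(key, values):
    """Reduce phase: aggregate all values for one key."""
    return key, sum(values)

counts = dict(reducer(k, v) for k, v in grouped.items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}
```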
MapReduce has a few problems:
* it is slow. Each map operation needs to persist its output to disk; there are no in-memory operations.
* it is resource-intensive. Intermediate result sets need to be transmitted to other nodes for reduction.
* it requires low-level coding skills. You must write your own mappers and reducers, which have historically been buggy and difficult to unit test.
* it has no higher-level APIs; everything must be coded against the raw MapReduce framework.
The Hadoop community solved the last problem first. Since most developers know SQL, a SQL-to-MapReduce layer, known as Hive, was invented.
But the bigger problems remained, chief among them the inability to share intermediate datasets without persisting them to HDFS and replicating them to each node in the cluster.
## Spark's Goal
The creators of Spark (and also Databricks) wanted to fix these problems. In the process a lot of new features were added to solve other nagging Big Data problems such as creating Streaming applications.
## What is Spark?
* written in Scala (a concise, functional language)
* Resilient Distributed Data Sets (RDDs) are the core building block (discussed next)
* by default, and whenever possible, RDDs remain in RAM and are not written to disk.
### Resilient Distributed Data Sets (RDDs)
An RDD is:
* an immutable, partitioned set of data distributed across a cluster, so many machines that individually have insufficient RAM can still solve big data problems together
* able to live in RAM or be persisted to HDFS
* automatically rebuilt if a node fails
* an RDD is *immutable* in that the original data is persisted (usually in RAM) and then only the transformation *lineage* is maintained. The lineage allows portions of RDDs to be rebuilt quickly
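A short pyspark sketch of those two properties, assuming an interactive shell or notebook where the SparkContext `sc` already exists: the lineage of a derived RDD can be inspected before anything runs, and an RDD can be pinned in RAM so it is not recomputed for every action.
```python
# Assumes a running SparkContext `sc` (e.g. inside a pyspark shell or notebook).
rdd = sc.parallelize(range(1000000), numSlices=8)  # partitioned across the cluster

# A derived RDD: nothing executes yet, Spark only records the lineage.
evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
print(evens.toDebugString())  # shows the chain of transformations (the lineage)

# Keep the derived RDD in RAM so later actions reuse it instead of recomputing it.
evens.cache()
print(evens.count())  # 500000 -- this action triggers the actual execution
```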
There are two "operations" on an RDD:
* Transformations: examples include map, filter, and groupBy. A transformation is "declared" but is not executed until an "action" occurs. This means that transformations can be optimized and rewritten by the Spark engine without the developer needing intimate knowledge of Spark.
* Actions: examples include count, save, and collect. An "action" such as count requires all pending transformations to be executed.
### Supported Languages
Spark supports applications written in Scala (most complete implementation), Java for standalone applications, and Python.
### Scala
* runs in the JVM (i.e., runs natively in Spark containers)
* full interoperation with Java
* Launch on the Spark cluster via `spark-shell` which provides a scala interactive prompt `scala>`
#### Sample Code
```scala
// generate an RDD
val rdd = sc.parallelize(List(1,2,3))
//square each number. This is a transformation so no "action" is done yet
val squares = rdd.map(x => x * x) // {1, 4, 9}
//retrieve the RDD contents on the head node. These are actions
squares.collect()
squares.count()
squares.saveAsTextFile("hdfs://myfile.txt")
```
### Python
* The Spark Python implementation (pyspark) is not as feature-complete as the Scala one.
    * For example, data partitioning is more flexible and performant when written in Scala than in Python.
* Syntactically, Python is very similar to Scala.
* Python data scientists who are comfortable with pandas dataframes need to re-learn Spark dataframes; pandas dataframes are not performant in pySpark (see the sketch after this list).
* Launch on the Spark cluster via `pyspark` or use a Jupyter notebook
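As a small sketch of the pandas point above (standard PySpark DataFrame calls, shown purely as an illustration): keep the data in a Spark DataFrame while it is large and distributed, and only convert to pandas once it has been reduced to something that fits on the driver.
```python
import pandas as pd

# Assumes a SparkSession `spark` (created by default in pyspark shells and notebooks).
pdf = pd.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 5, 2]})

# A local pandas dataframe becomes a distributed Spark DataFrame...
sdf = spark.createDataFrame(pdf)

# ...the heavy lifting happens in Spark, across the cluster...
per_user = sdf.groupBy("user").sum("clicks")

# ...and only the (small) aggregated result is pulled back to the driver as pandas.
print(per_user.toPandas())
```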
#### Sample Code
This is the Python equivalent of the Scala code above:
```python
rdd = sc.parallelize([1,2,3])
# square each number. This is a transformation so no "action" is done yet
squares = rdd.map(lambda x: x*x)  # {1, 4, 9}
# retrieve the RDD contents on the head node. These are actions
squares.collect()
squares.count()
squares.saveAsTextFile("hdfs://myfile.txt")
```
Other sample code:
```python
# load a directory of text files into an RDD (one element per line)
lines = sc.textFile("hdfs://namenode:9000/path")
```
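A typical follow-on to loading text like this is the classic word count; the sketch below reuses the `lines` RDD from the snippet above and writes its result back to HDFS (the output path is a placeholder). Compare it with the hand-rolled map/reduce sketch earlier: intermediate results stay in memory and no shuffle code has to be written by hand.
```python
# Classic word count over the lines loaded above.
counts = (lines.flatMap(lambda line: line.split())  # one element per word
               .map(lambda word: (word, 1))         # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # sum the counts per word

counts.saveAsTextFile("hdfs://namenode:9000/word-counts")  # placeholder output path
```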