Commit 60ddad86 authored by Carp

prelim demo

parent 28326bc3
@@ -5,7 +5,12 @@ This repo is meant to be the starting point for demos and presentations around A
## Content
[Presentation Notes and Documentation](notes/) is the "Getting Started" guide/100-level content. This can be converted into a pptx if needed.
[Installation](notes/) covers how to demo the installation of Azure Databricks using the portal.
[Navigating Your Workspace](notes/) gives a tour of the Databricks interface.
[assets folder](assets/) contains the screenshots that can be incorporated into a pptx later.
[Demo1](demo1/) is a very brief demo that shows how to quickly navigate around notebooks using SparkSQL, pySpark, md, and the DBFS.
{"cells":[{"cell_type":"markdown","source":["## Getting to Know Databricks Notebooks\n\nThis is a \"documentation cell\". Double-click it to edit it. \n\nThe `%md` is known as a \"cell magic\". A cell magic alters the behavior of a cell. \n\nClick the keyboard icon above to see notebook keyboard shortcuts."],"metadata":{}},{"cell_type":"markdown","source":["## Notebooks have some Apache Spark variables already defined\n`SparkContext` is `sc` \n`SQLContext` and `HiveContext` are `sqlContext` \n`SparkSession` is `spark`"],"metadata":{}},{"cell_type":"code","source":["# to run a cell simply click it and press Shift+Enter\n# This is a python notebook, so unless you have a cell magic that alters the state of the cell you must write python. Notebooks can \n# be created with various default interpreters. \n\n# this command will run and show the SparkContext\nsc"],"metadata":{},"outputs":[],"execution_count":3},{"cell_type":"markdown","source":["The `Out[1]:` entry shows any output from executing a cell."],"metadata":{}},{"cell_type":"code","source":["%sh \n# this cell uses a bash cell magic and will allow you to run shell code\n# the Databricks \"user\" is ubuntu. We can see any data we may have uploaded for our account\nls -alF /home/ubuntu/databricks"],"metadata":{},"outputs":[],"execution_count":5},{"cell_type":"markdown","source":["However, files we upload to that location will not be distributed to all worker nodes, only the \"driver\" node. \nThe correct way is to use the \"databricks filesystem\" (DBFS), which is available to all worker nodes. Note the different cell magic for the next cell. %fs is meant to be equivalent to hdfs commands but for the DBFS. 
\n\n\nDatabricks ships with tutorial datasets."],"metadata":{}},{"cell_type":"code","source":["%fs ls /\n"],"metadata":{},"outputs":[],"execution_count":7},{"cell_type":"code","source":["%fs ls /databricks-datasets/"],"metadata":{},"outputs":[],"execution_count":8},{"cell_type":"code","source":["# we can also run Databricks commands using python and not a cell magic. We simply use the databricks python classes. Here is one example\ndisplay(dbutils.fs.ls(\"dbfs:/databricks-datasets/samples/\"))\n# display is a function that renders the results in notebook table format which makes copy/paste and downloading results easier. "],"metadata":{},"outputs":[],"execution_count":9},{"cell_type":"code","source":["# let's read a sample file into an RDD (it's actually a pyspark dataframe since this is a python cell) and \"infer\" its schema\n# the trailing \\ makes the command easier to read\n# Note that nothing will really happen yet. If you are quick you can view the DAG (how Spark is going to execute the request)\n\n# this is a demo and thus a small file. In the big data world you would simply set filepath to a directory of MANY datafiles and Spark would distribute the \n# workload to all worker nodes in the cluster. \n\nfilepath = \"/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv\"\ndf_diamonds =\"csv\")\\\n .option(\"header\",\"true\")\\\n .option(\"inferSchema\", \"true\")\\\n .load(filepath)"],"metadata":{},"outputs":[],"execution_count":10},{"cell_type":"code","source":["# we inferred the schema, but what does this dataset really look like? \ndf_diamonds.printSchema()"],"metadata":{},"outputs":[],"execution_count":11},{"cell_type":"code","source":["type(df_diamonds) # this confirms that we have created a dataframe (note we can also use this from SparkSQL)\ndisplay(df_diamonds) # this is similar to \"select * from table LIMIT 1000\". 
In general it's better to use the \"head\" or \"tail\" commands for speed, but display \n# has the benefits noted above, plus I can graph the data. \n# Try it. Aggregation=AVG, Keys=cut, Groupings=color, values=price\n"],"metadata":{},"outputs":[],"execution_count":12},{"cell_type":"code","source":["# basic data exploration can be done with the describe command which is the pyspark equivalent of R's summary function\n# we always wrap in \"display\" so we can see the output in a pretty HTML table\ndisplay(df_diamonds.describe())"],"metadata":{},"outputs":[],"execution_count":13},{"cell_type":"code","source":["# that chart looks interesting, let's group the data to see it raw. Here we use python to do the same grouping. Note the syntax is VERY similar to SQL or pandas. We also create 2 additional dataframes. This does NOT mean that we've duplicated the data. RDDs (and thus dataframes) are only pointers to the original data plus the metadata needed to generate it on demand. \n\ndf_diamonds_grouped = df_diamonds.groupBy(\"cut\", \"color\").avg(\"price\") \n\n# we are joining our \"grouped df\" back to the original df, building an on-the-fly aggregation\ndf_diamonds_final = df_diamonds_grouped\\\n .join(df_diamonds, on='color', how='inner')\\\n .select(\"`avg(price)`\", \"carat\")\n"],"metadata":{},"outputs":[],"execution_count":14},{"cell_type":"markdown","source":["Note: that was almost instantaneous. That's because no \"action\" was done, only a \"transformation\". Spark defers all work until an action is requested, waiting for you to code up all needed transformations first. This optimization is a key to Spark being so performant, especially over huge datasets. \n\nAn action is anything that requires the transformations to actually be executed. During development you often want to see if your transformation code actually works. The easiest way to do that is with a `take` action. 
This will take a few seconds to execute."],"metadata":{}},{"cell_type":"code","source":["df_diamonds_final.take(10)\n# display(df_diamonds_final.take(10)) #notice the difference"],"metadata":{},"outputs":[],"execution_count":16},{"cell_type":"code","source":["# this is a tiny dataset but with real-world data you'll find that Spark and Databricks can be slow when accessing csv and json raw data. \n# caching the df is one trick, but it's often better to save the data in parquet format, which is a columnar format optimized for Spark. This \n# is a really good trick if you are going to be running many queries over the same df or raw data files. \n\n# tab completion *should* work for this (filed a bug that this doesn't always work)\ndf_diamonds.write.mode(\"overwrite\").format(\"parquet\").saveAsTable(\"diamonds_pqt\")"],"metadata":{},"outputs":[],"execution_count":17},{"cell_type":"markdown","source":["Notebooks are known as REPL environments (read, evaluate, print, loop). Not every programming language supports REPL environments, but all Spark-supported languages do. With a REPL we write small pieces of code alongside explanations of our thought processes. If we find we made a mistake higher up in the notebook we can change a cell and choose the `Run All Below` option. This lets you change your code and quickly re-execute only what is needed.\n\nTry it: Let's assume we want the sum instead of avg and the `carat` to be displayed as the first column\n\n* change avg(price) TWICE to sum(price)\n* change the select to `.select(\"carat\",\"`sum(price)`\")`"],"metadata":{}},{"cell_type":"markdown","source":["Unless you know Python this can be very tedious. The above demo might be easier for most people using SparkSQL. Let's try that. \n\nRemember that `df_diamonds` was our original dataframe from data we read from a csv. Let's convert that to a SQL table. \nLet's use autocompletion. Type `df_diamonds.creat` and then `tab`. Note, this works inconsistently in Azure Databricks. 
This is a known issue."],"metadata":{}},{"cell_type":"code","source":["df_diamonds.createOrReplaceTempView(\"diamonds\")"],"metadata":{},"outputs":[],"execution_count":20},{"cell_type":"code","source":["%sql show tables\n\n"],"metadata":{},"outputs":[],"execution_count":21},{"cell_type":"code","source":["%sql \nSELECT * FROM diamonds LIMIT 10;"],"metadata":{},"outputs":[],"execution_count":22},{"cell_type":"code","source":["%sql \nDROP TABLE IF EXISTS AvgPriceByCarat;\n\nCREATE TABLE AvgPriceByCarat AS\nSELECT \n avg(price) AS AvgPrice,\n carat\nFROM diamonds\nGROUP BY carat;\n\nSELECT * FROM AvgPriceByCarat\n/* we don't need the CTAS above unless we may want to use that data in the future */"],"metadata":{},"outputs":[],"execution_count":23},{"cell_type":"code","source":["%sql\n/* \nIn the cells above I first \"referenced\" the data in a pySpark df and then \"mixed-and-matched\" that with SparkSQL. \n\nBut you CAN have a pure SparkSQL solution without an intermediate pySpark df.\n\nOne problem with this is that you can no longer \"infer\" the schema. \n*/\n\nDROP TABLE IF EXISTS diamonds_direct;\n\nCREATE TABLE diamonds_direct (\n id int,\n carat double,\n cut string,\n color string,\n clarity string,\n depth double,\n tbl double,\n price integer,\n x double,\n y double,\n z double\n)\nusing csv options (\n path '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv',\n delimiter ',',\n header 'true'\n);\n\nSELECT * FROM diamonds_direct LIMIT 20;\n"],"metadata":{},"outputs":[],"execution_count":24},{"cell_type":"markdown","source":["* Notebooks are automatically saved but can be exported so they can be run on other Spark clusters or Databricks instances outside of Azure\n* When the notebook is closed the memory is freed (it may be freed sooner if the Spark cluster is auto-terminated), but if you re-open the notebook you will see all of the result cells. Notebooks are saved with the result cells. 
This allows you to share your notebooks with others without them having to re-execute all of your code (and wait). \n * There is a `Clear` option on the menu bar. \n * You can also perform a `Run All` to get the latest data at any time."],"metadata":{}},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":26}],"metadata":{"name":"IntroPy","notebookId":2516761319109765},"nbformat":4,"nbformat_minor":0}
This is a very brief demo for basic familiarity with Spark, notebooks, and Databricks.
We are going to import a Python notebook that also has some SparkSQL cells. The actual .ipynb file should be imported and tested the day before. This will build all of the `Out` cells and cache the data.
All instructions are in the notebook file. Feel free to edit accordingly.
## build and start a new cluster
* it doesn't matter if the cluster is Premium, Standard, or serverless
* node size doesn't matter
* it's probably best to build the cluster a day before the demo, then terminate it, then restart it an hour before the demo
* in fact, you can use the "Start" process as a part of the demo to show how fast it is to restart a terminated cluster
* it takes about 5 mins to restart a cluster, so plan accordingly
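The restart wait above can also be scripted so you know the cluster is ready before walking on stage. A minimal sketch — `get_cluster_state` is a hypothetical stand-in for whatever you use to query the cluster's state (e.g. a wrapper around the Databricks Clusters API):

```python
import time

def wait_for_cluster(get_cluster_state, timeout_s=600, poll_s=15):
    """Poll until the cluster reports RUNNING, or give up after timeout_s.

    get_cluster_state is any zero-argument callable returning the cluster's
    current state as a string (hypothetical; supply your own API wrapper).
    """
    waited = 0
    while waited < timeout_s:
        if get_cluster_state() == "RUNNING":
            return True
        time.sleep(poll_s)
        waited += poll_s
    return False
```

Since restarts take around five minutes, a 600-second timeout with 15-second polling is a reasonable default.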
## Import the demo notebook
* Download it locally
* Click the Workspace button
* Right click your user and choose import
@@ -25,6 +25,26 @@ The typical use case is a customer that has an immediate need for Spark-based an
<img src="/assets/intro02.PNG">
## Where does this fit in my Big Data pipeline?
<img src="/assets/intro03.PNG">
**As either an ELT tool or as the method to generate ML model data that is fed to an EDW**
<img src="/assets/intro05.PNG">
## Databricks Use Cases
Any use case suitable for Spark ([learn more about Spark]( is suitable for Databricks. Databricks is simply "managed Spark" where the operations and administration burden is handled by Microsoft.
|Use Case |Description |Databricks/Spark Component |
|---|---|---|
|Streaming data |Ingest real-time data changes and feed the data into data lakes, EDWs, or run ML algorithms against the data |Spark Streaming/Structured Streaming |
|Querying |Query massive amounts of data quickly |SparkSQL and notebooks |
|Machine Learning |Build, train, and deploy ML models |MLlib and SparkML |
## Azure Databricks Workloads and Pricing
Like HDI, VMs are still being provisioned for you under-the-covers.
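Because billing combines the underlying VM charge with a per-DBU (Databricks Unit) charge, a back-of-the-envelope estimate is simple arithmetic. A sketch with entirely made-up rates — look up current Azure Databricks pricing for real numbers:

```python
def databricks_cost(nodes, hours, vm_rate, dbu_per_node_hour, dbu_rate):
    """Estimate total cost: VM charge plus DBU charge.

    All rates are hypothetical placeholders, not real Azure prices.
    """
    vm_cost = nodes * hours * vm_rate                       # infrastructure
    dbu_cost = nodes * hours * dbu_per_node_hour * dbu_rate  # Databricks units
    return vm_cost + dbu_cost

# e.g. 4 nodes for 10 hours with placeholder rates
print(databricks_cost(4, 10, vm_rate=0.50, dbu_per_node_hour=1.0, dbu_rate=0.40))
```

Doubling `dbu_rate` in this sketch shows why a Premium workspace roughly doubles the Databricks portion of the bill while the VM portion stays the same.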
@@ -67,6 +87,10 @@ In general, Premium is about 2x the price of Standard.
## Get Running Quickly and Next Steps
See [Installation](
[Navigating Your Cluster Dashboard](
[Databricks Workspace Deeper Dive](
## Getting Started
@@ -27,4 +27,8 @@ Your screen may look different based on the cluster setup configuration.
`Spark cluster UI` will show you running programs and an overview of their consumed resources.
<img src="/assets/nav03.PNG">
\ No newline at end of file
<img src="/assets/nav03.PNG">
## Navigating the Workspace
[Workspace Navigation Deeper Dive](
\ No newline at end of file
## Overview of Spark
## Why do I need Spark?
* You need more flexible analytics and ELT than you can get with existing U-SQL and ADF
* You are using Hadoop but it is too slow and expensive
* Spark is essentially an in-memory, optimized take on Hadoop MapReduce and in many cases is 100x faster
* You want a highly extensible, supportable platform that your ETL developers and data scientists can learn quickly
* Spark is suitable for batch processing, real-time data streaming, ML, and interactive SQL
<img src="/assets/intro04.PNG">
## History of Spark
Traditionally working with Big Data in Hadoop meant MapReduce. MapReduce is a functional programming paradigm where you write programs as small functions that are "mapped" (executed like a loop) over small subsets of huge datasets. Each mapper's output (think temp table data) is written to disk and transferred to a single node where it is "reduced" (think aggregations like sum or avg) further to get the desired output.
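The map/reduce flow just described can be sketched in plain Python — a toy, single-machine illustration of the paradigm, not how Hadoop or Spark actually distribute work:

```python
from collections import defaultdict

def map_phase(lines):
    # each "mapper" emits (word, 1) pairs for its subset of the data
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # the "reducer" aggregates all emitted counts for each key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase(["Spark is fast", "Hadoop is durable"]))
print(counts["is"])  # → 2
```

In real MapReduce each mapper's output is spilled to disk and shuffled across the network before reduction, which is exactly the overhead Spark's in-memory model avoids.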