Commit a6dbc2f3 authored by Your Name

nonbookcontent update

parent 77dc7c3f
......@@ -2,6 +2,13 @@
This book will be THE go-to book for data scientists and DBAs to reference for all new data science projects using SQL Server and R.
# Mission Statement
The mission of this book is to get both parties to use this new technology successfully and communicate using a lingua franca. These are brand new technologies (R Server/Services especially) and we want to make sure people understand how best to use them.
This is much more important than having a lot of recipes and code samples that can be reproduced. In other words, the mission is to show “patterns” to good solutions.
# Pitch
SQL Server 2016 introduced a new feature where R can be integrated directly into the database engine. This is actually a mature feature: Microsoft bought Revolution Analytics' R Server product about a year ago and is integrating it into many of its products. The RevoR product is light-years ahead of open source R in that it handles low-memory conditions and parallelism, where open source R does not.
......@@ -16,5 +23,20 @@ I want to have a lot of these examples.
# Audience
There are two typical readers. The first half of the book will be focused on DBAs, data developers, and those with a traditional SQL background. The second half is geared toward data scientists who want to learn how to integrate their R models with SQL Server. However, I think most data scientists will WANT to read the first half just to learn “how the other half lives.”
No advanced R skills will be needed, nor will they be taught in this book. It is assumed that data scientists will either learn R skills elsewhere or bring their existing skills to learn how to integrate with SQL Server. Having said that, we will code some basic R as part of demos and recipes. This R code will be basic enough that anyone should be able to understand it without code annotations.
This format is important to both the data professional and the data scientist. The goal of this book is to get both parties to collaborate on solutions. Traditionally a data scientist got data in CSV files from an ETL developer and then did their magic. There is no reason why a data scientist can’t learn some SQL and run their models embedded in a stored procedure. Likewise, there is no reason why a data professional can’t help a data scientist do data wrangling.
This is not meant to be a comprehensive manual of SQL R Services. Rather, it is a place where both parties can come together to get just enough information to work together.
# Objectives and Achievements
- For the data professional, understand basic data science terminology
- For the data scientist, understand how SQL R Services can improve your ability to create predictive solutions
- Understand best practices and patterns around SQL and R integration
- Understand when to use SQL vs R in a solution
1. Why you should care?
# Why SQL Server R Services?
* This will cover the traditional data scientist/DBA impedance mismatch and how the product solves this.
......@@ -52,3 +52,78 @@ a. Taking a business problem and working it from start to finish using R client/
b. We’ll monitor performance at each step of the way
# More Detail
Chapter 1: Why SQL Server R Services - 10 pages
This first chapter introduces the history and reasons why you should understand SQL and R integration. We’ll discuss some of the problems this solves.
Topics covered
1. What is R
2. The impedance mismatch -- data scientists and data professionals.
3. History of RevoR and SQL Server
4. Why this may initially scare a DBA, and why it should NOT (security, performance)
5. What if you don’t have a data scientist?
Skills learned
This is an overview chapter. This should whet the appetite of both parties to want to learn more.
Chapter 2: Basic Data Science Terminology - 10 pages
The second chapter will be mostly interesting to data professionals (DPs). Data scientists (DSs) can be advised to skip over this chapter if they want (maybe by giving a little up-front quiz). We want to cover those terms that are scary and misunderstood by non-data scientists: supervised/unsupervised learning, features, labels, over/underfitting, etc. This will be extremely over-generalized. We don’t want to scare off the DPs; we just want them to be able to understand the jargon a little better.
Topics covered
1. Different types of models/algorithms and when to choose each – (un)supervised
2. Labels, features/factors
3. Other terms
Skills learned
Basic terminology and level setting. This can be thought of as another intro chapter with no specific skills being learned.
Chapter 3: Installing SQL Server R Services - 20 pages
We cover all of the topics and issues around preparing for, and installing, SQL Server R Services. We’ll then verify the installation by running some basic stack queries to ensure everything is working. We’ll do this using a free Azure account to get started quickly, but this can be done using a laptop or an existing on-premises server.
This is geared more toward the DP than the DS, but DSs will probably find this interesting as well, especially the last portion where we test the installation.
We’ll also provide a file with sample data that can be loaded and used for the remaining chapters and their recipes. Finally, we’ll install RStudio so we have a working R environment familiar to data scientists.
Topics covered
1. Prerequisites and licensing
2. Security and Performance Considerations
3. Installation using an existing server (we’ll demo Azure)
4. Some quick queries to test that everything is working
5. Install a set of files and data for future recipes.
6. Install RStudio
Skills learned
Installing the product and running basic queries to ensure everything is working correctly.
Chapter 4: Traditional Data Access with R – 30 pages
The objective is to show how a data scientist traditionally accesses and works with data. We’ll start with loading data from a CSV into a data frame, which is the most common use case. We’ll briefly talk about the scalability limits of doing this. Next we’ll connect to the R Server and change the compute context to demonstrate how we can achieve scalability using the ScaleR components. Finally, we’ll load our sample data into a SQL table and show how we can achieve even better scalability by using R where the data lives, without marshalling it to a CSV first.
This chapter is equally relevant to both DSs and DPs. Both will find it illuminating.
Topics covered
1. Use RStudio to analyze a csv
2. Change the execution context to use the R Server libraries for remote execution
3. Demonstrate RODBC, which is the traditional method to marshal data back to R
4. Load and query the data using SQL and R natively.
Skills learned
R data connectivity using various methods
Chapter 5: Iterative Solution Development – Predicting Loan Charge-offs – 40 pages
In this chapter we’ll put together everything we’ve learned so far. We’ll start with a real-world business problem and real data, which we’ll first query locally using RStudio. Once the data and R code are in reasonably good shape, we’ll migrate the model to SQL R Services and finally create a stored procedure that can do either batch classification or a single prediction.
This chapter is equally relevant to both DSs and DPs.
Topics covered
1. Use RStudio to analyze a csv and develop basic R code
2. Load the csv to SQL and have a DP create a basic view to perform data wrangling
3. The DS uses the new view to access the data and continue model development
4. When the model is developed, the DS works with the DP to operationalize the code in two stored procedures.
Skills learned
The data science process, iterative development, solution lifecycle management.