
Commit d7438ce

First rev
committed
1 parent 847d80b commit d7438ce

File tree

1 file changed: +259 -0 lines changed

Dataframe_SQL_query.ipynb

Lines changed: 259 additions & 0 deletions
@@ -0,0 +1,259 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Running SQL queries on a DataFrame\n",
    "### Dr. Tirthajyoti Sarkar, Fremont, CA 94536\n",
    "### Relational data with flexible, powerful analytics\n",
    "Relational data stores are easy to build and query. Users and developers often prefer writing easy-to-interpret, declarative queries in a human-readable language such as SQL. However, as data grows in volume and variety, the relational approach does not scale well enough for building Big Data applications and analytical systems. The following are some major challenges.\n",
    "\n",
    "* Dealing with different types and sources of data, which can be structured, semi-structured, and unstructured\n",
    "* Building ETL pipelines to and from various data sources, which may lead to developing a lot of specific custom code, thereby increasing technical debt over time\n",
    "* Having the capability to perform both traditional business intelligence (BI)-based analytics and advanced analytics (machine learning, statistical modeling, etc.), the latter of which is challenging to perform in relational systems\n",
    "\n",
    "Big Data analytics found early success with Hadoop and the MapReduce paradigm. MapReduce was powerful but often slow, and it gave users a low-level, procedural programming interface that required writing a lot of code for even very simple data transformations. Once Spark was released, it revolutionized the way Big Data analytics was done, with a focus on in-memory computing, fault tolerance, high-level abstractions, and ease of use.\n",
    "\n",
    "Since then, several frameworks and systems such as Hive, Pig, and Shark (which evolved into Spark SQL) have provided rich relational interfaces and declarative querying mechanisms on top of Big Data stores. The challenge remained that these tools were either relational or procedural, and we couldn't have the best of both worlds.\n",
    "\n",
    "![spark-1](https://opensource.com/sites/default/files/uploads/2_hadoop-vs-spark.png)\n",
    "\n",
    "However, in the real world, most data and analytical pipelines involve a combination of relational and procedural code. Forcing users to choose one or the other complicates things and increases the effort of developing, building, and maintaining different applications and systems. Apache Spark SQL builds on the previously mentioned SQL-on-Spark effort called Shark. Instead of forcing users to pick between a relational or a procedural API, Spark SQL lets users seamlessly intermix the two and perform data querying, retrieval, and analysis at scale on Big Data.\n",
    "\n",
    "### Spark SQL\n",
    "Spark SQL essentially tries to bridge the gap between the two models we mentioned previously—the relational and procedural models.\n",
    "\n",
    "Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!\n",
    "\n",
    "To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning.\n",
    "Essentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on Big Data.\n",
    "\n",
    "Spark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework), including data formats, user-defined functions (UDFs), and the metastore. Besides this, it also helps in ingesting a wide variety of data formats from Big Data sources and enterprise data warehouses, such as JSON, Hive, and Parquet, and in performing a combination of relational and procedural operations for more complex, advanced analytics.\n",
    "\n",
    "![Spark-2](https://cdn-images-1.medium.com/max/2000/1*OY41hGbe4IB9-hHLRPuCHQ.png)\n",
    "\n",
    "### Spark SQL with the DataFrame API is fast\n",
    "Spark SQL has been shown to be extremely fast, even comparable to C++-based engines such as Impala.\n",
    "\n",
    "![spark_speed](https://opensource.com/sites/default/files/uploads/9_spark-dataframes-vs-rdds-and-sql.png)\n",
    "\n",
    "The following graph shows a nice benchmark result of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.\n",
    "\n",
    "![spark-speed-2](https://opensource.com/sites/default/files/uploads/10_comparing-spark-dataframes-and-rdds.png)\n",
    "\n",
    "Why is Spark SQL so fast and optimized? The reason is its new extensible optimizer, **Catalyst**, which is based on functional programming constructs in Scala.\n",
    "\n",
    "Catalyst's extensible design has two purposes.\n",
    "\n",
    "* It makes it easy to add new optimization techniques and features to Spark SQL, especially to tackle diverse problems around Big Data, semi-structured data, and advanced analytics\n",
    "* It lets external developers extend the optimizer—for example, by adding data source-specific rules that can push filtering or aggregation into external storage systems, or support for new data types\n",
    "\n",
    "Catalyst supports both rule-based and cost-based optimization. While extensible optimizers have been proposed in the past, they have typically required a complex domain-specific language to specify rules. Usually, this leads to a significant learning curve and maintenance burden. In contrast, Catalyst uses standard features of the Scala programming language, such as pattern matching, to let developers use the full programming language while still making rules easy to specify.\n",
    "\n",
    "![catalyst-2](https://cdn-images-1.medium.com/max/1500/1*81ZOMxCci-tM2b-HNUX6Ww.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a SparkSession and read the stock price dataset CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# SparkSession is the single entry point needed for the DataFrame and SQL APIs\n",
    "from pyspark.sql import SparkSession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "spark1 = SparkSession.builder.appName('SQL').getOrCreate()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark1.read.csv('Data/appl_stock.csv', inferSchema=True, header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- Date: timestamp (nullable = true)\n",
      " |-- Open: double (nullable = true)\n",
      " |-- High: double (nullable = true)\n",
      " |-- Low: double (nullable = true)\n",
      " |-- Close: double (nullable = true)\n",
      " |-- Volume: integer (nullable = true)\n",
      " |-- Adj Close: double (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a temporary view"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.createOrReplaceTempView('stock')"
   ]
  },
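  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (an illustrative cell added here, not part of the original run), the session catalog can list the tables and views registered with this SparkSession; the `stock` temporary view created above should appear in the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List tables and views registered in the current SparkSession;\n",
    "# the temporary view 'stock' should be among them (output not captured here)\n",
    "spark1.catalog.listTables()"
   ]
  },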
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now run a simple SQL query directly on this view. It returns a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DataFrame[Date: timestamp, Open: double, High: double, Low: double, Close: double, Volume: int, Adj Close: double]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = spark1.sql(\"SELECT * FROM stock LIMIT 5\")\n",
    "result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+----------+----------+------------------+------------------+---------+------------------+\n",
      "|               Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|\n",
      "+-------------------+----------+----------+------------------+------------------+---------+------------------+\n",
      "|2010-01-04 00:00:00|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|\n",
      "|2010-01-05 00:00:00|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|\n",
      "|2010-01-06 00:00:00|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|\n",
      "|2010-01-07 00:00:00|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|\n",
      "|2010-01-08 00:00:00|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|\n",
      "+-------------------+----------+----------+------------------+------------------+---------+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "result.show()"
   ]
  },
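  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To connect this back to the Catalyst optimizer described at the top, `explain()` prints the query plan Spark builds for a DataFrame. This cell is an illustrative addition; the exact plan text varies by Spark version and is not captured here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the parsed, analyzed, and optimized logical plans plus the physical plan\n",
    "# that Catalyst generates for this query\n",
    "result.explain(True)"
   ]
  },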
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run a slightly more complex query\n",
    "How many entries in the `Close` field are higher than 500?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------+\n",
      "|count(Close)|\n",
      "+------------+\n",
      "|         403|\n",
      "+------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "count_greater_500 = spark1.sql(\"SELECT COUNT(Close) FROM stock WHERE Close > 500\")\n",
    "count_greater_500.show()"
   ]
  },
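  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The same count via the DataFrame API\n",
    "As discussed in the introduction, Spark SQL lets relational and procedural code intermix. The cell below is an illustrative sketch added to the notebook: it expresses the same count with DataFrame methods instead of SQL and should return the same number as the query above (assuming `Close` has no null values)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Procedural (DataFrame API) equivalent of the SQL count above;\n",
    "# counts the rows with Close > 500 (matches COUNT(Close) when Close has no nulls)\n",
    "df.filter(df['Close'] > 500).count()"
   ]
  }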
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
