|
35 | 35 | "outputs": [],
|
36 | 36 | "source": [
|
37 | 37 | "import pyspark\n",
|
38 |
| - "from pyspark import SparkContext as sc\n", |
39 |
| - "from pyspark.sql import Row" |
| 38 | + "from pyspark import SparkContext as sc" |
40 | 39 | ]
|
41 | 40 | },
|
42 | 41 | {
|
|
68 | 67 | "cell_type": "markdown",
|
69 | 68 | "metadata": {},
|
70 | 69 | "source": [
|
71 |
| - "### Read in a JSON file" |
| 70 | + "### Read in a JSON file and examine the data" |
72 | 71 | ]
|
73 | 72 | },
|
74 | 73 | {
|
|
142 | 141 | "cell_type": "markdown",
|
143 | 142 | "metadata": {},
|
144 | 143 | "source": [
|
145 |
| - "#### Use `printSchema()` to show he schema of the data. Note, how tightly it is integrated to the SQL-like framework. You can even see that the schema accepts `null` values because nullable property is set `True`." |
| 144 | + "### The data schema\n", |
| 145 | + "Use `printSchema()` to show he schema of the data. Note, how tightly it is integrated to the SQL-like framework. You can even see that the schema accepts `null` values because nullable property is set `True`." |
146 | 146 | ]
|
147 | 147 | },
|
148 | 148 | {
|
|
174 | 174 | },
|
175 | 175 | {
|
176 | 176 | "cell_type": "code",
|
177 |
| - "execution_count": 9, |
| 177 | + "execution_count": 8, |
178 | 178 | "metadata": {},
|
179 | 179 | "outputs": [
|
180 | 180 | {
|
|
183 | 183 | "['age', 'name']"
|
184 | 184 | ]
|
185 | 185 | },
|
186 |
| - "execution_count": 9, |
| 186 | + "execution_count": 8, |
187 | 187 | "metadata": {},
|
188 | 188 | "output_type": "execute_result"
|
189 | 189 | }
|
|
201 | 201 | },
|
202 | 202 | {
|
203 | 203 | "cell_type": "code",
|
204 |
| - "execution_count": 10, |
| 204 | + "execution_count": 9, |
205 | 205 | "metadata": {},
|
206 | 206 | "outputs": [
|
207 | 207 | {
|
|
210 | 210 | "DataFrame[summary: string, age: string, name: string]"
|
211 | 211 | ]
|
212 | 212 | },
|
213 |
| - "execution_count": 10, |
| 213 | + "execution_count": 9, |
214 | 214 | "metadata": {},
|
215 | 215 | "output_type": "execute_result"
|
216 | 216 | }
|
|
228 | 228 | },
|
229 | 229 | {
|
230 | 230 | "cell_type": "code",
|
231 |
| - "execution_count": 11, |
| 231 | + "execution_count": 10, |
232 | 232 | "metadata": {},
|
233 | 233 | "outputs": [
|
234 | 234 | {
|
|
252 | 252 | "df.describe().show()"
|
253 | 253 | ]
|
254 | 254 | },
|
| 255 | + { |
| 256 | + "cell_type": "markdown", |
| 257 | + "metadata": {}, |
| 258 | + "source": [ |
| 259 | + "#### There is `summary` method for more stats" |
| 260 | + ] |
| 261 | + }, |
| 262 | + { |
| 263 | + "cell_type": "code", |
| 264 | + "execution_count": 11, |
| 265 | + "metadata": {}, |
| 266 | + "outputs": [ |
| 267 | + { |
| 268 | + "name": "stdout", |
| 269 | + "output_type": "stream", |
| 270 | + "text": [ |
| 271 | + "+-------+------------------+-------+\n", |
| 272 | + "|summary| age| name|\n", |
| 273 | + "+-------+------------------+-------+\n", |
| 274 | + "| count| 2| 3|\n", |
| 275 | + "| mean| 24.5| null|\n", |
| 276 | + "| stddev|7.7781745930520225| null|\n", |
| 277 | + "| min| 19| Andy|\n", |
| 278 | + "| 25%| 19| null|\n", |
| 279 | + "| 50%| 19| null|\n", |
| 280 | + "| 75%| 30| null|\n", |
| 281 | + "| max| 30|Michael|\n", |
| 282 | + "+-------+------------------+-------+\n", |
| 283 | + "\n" |
| 284 | + ] |
| 285 | + } |
| 286 | + ], |
| 287 | + "source": [ |
| 288 | + "df.summary().show()" |
| 289 | + ] |
| 290 | + }, |
| 291 | + { |
| 292 | + "cell_type": "markdown", |
| 293 | + "metadata": {}, |
| 294 | + "source": [ |
| 295 | + "### How you can define your own Data Schema\n", |
| 296 | + "#### Import data types and structure types to build the data schema yourself" |
| 297 | + ] |
| 298 | + }, |
255 | 299 | {
|
256 | 300 | "cell_type": "code",
|
257 |
| - "execution_count": null, |
| 301 | + "execution_count": 12, |
258 | 302 | "metadata": {},
|
259 | 303 | "outputs": [],
|
260 |
| - "source": [] |
| 304 | + "source": [ |
| 305 | + "from pyspark.sql.types import StructField, IntegerType, StringType, StructType" |
| 306 | + ] |
| 307 | + }, |
| 308 | + { |
| 309 | + "cell_type": "markdown", |
| 310 | + "metadata": {}, |
| 311 | + "source": [ |
| 312 | + "#### Define your data schema by supplying name and data types to the structure fields you will be importing" |
| 313 | + ] |
| 314 | + }, |
| 315 | + { |
| 316 | + "cell_type": "code", |
| 317 | + "execution_count": 13, |
| 318 | + "metadata": {}, |
| 319 | + "outputs": [], |
| 320 | + "source": [ |
| 321 | + "data_schema = [StructField('age',IntegerType(),True),\n", |
| 322 | + "StructField('name',StringType(),True)]" |
| 323 | + ] |
| 324 | + }, |
| 325 | + { |
| 326 | + "cell_type": "markdown", |
| 327 | + "metadata": {}, |
| 328 | + "source": [ |
| 329 | + "#### Now create a `StrucType` with this schema as field" |
| 330 | + ] |
| 331 | + }, |
| 332 | + { |
| 333 | + "cell_type": "code", |
| 334 | + "execution_count": 14, |
| 335 | + "metadata": {}, |
| 336 | + "outputs": [], |
| 337 | + "source": [ |
| 338 | + "final_struc = StructType(fields=data_schema)" |
| 339 | + ] |
| 340 | + }, |
| 341 | + { |
| 342 | + "cell_type": "markdown", |
| 343 | + "metadata": {}, |
| 344 | + "source": [ |
| 345 | + "#### Now read in the same old JSON with this new schema" |
| 346 | + ] |
| 347 | + }, |
| 348 | + { |
| 349 | + "cell_type": "code", |
| 350 | + "execution_count": 15, |
| 351 | + "metadata": {}, |
| 352 | + "outputs": [ |
| 353 | + { |
| 354 | + "name": "stdout", |
| 355 | + "output_type": "stream", |
| 356 | + "text": [ |
| 357 | + "+----+-------+\n", |
| 358 | + "| age| name|\n", |
| 359 | + "+----+-------+\n", |
| 360 | + "|null|Michael|\n", |
| 361 | + "| 30| Andy|\n", |
| 362 | + "| 19| Justin|\n", |
| 363 | + "+----+-------+\n", |
| 364 | + "\n" |
| 365 | + ] |
| 366 | + } |
| 367 | + ], |
| 368 | + "source": [ |
| 369 | + "df = spark1.read.json('Data/people.json',schema=final_struc)\n", |
| 370 | + "df.show()" |
| 371 | + ] |
| 372 | + }, |
| 373 | + { |
| 374 | + "cell_type": "markdown", |
| 375 | + "metadata": {}, |
| 376 | + "source": [ |
| 377 | + "Now when you print the schema, you will see that the `age` is read as int and not long. By default Spark could not figure out for this column the exact data type that you wanted, so it went with long. But this is how you can build your own schema and instruct Spark to read the data accoridngly." |
| 378 | + ] |
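| + }, |
| + { |
| + "cell_type": "markdown", |
| + "metadata": {}, |
| + "source": [ |
| + "#### Verify the schema (a minimal check, left unexecuted)\n", |
| + "Assuming the `df` created above with `schema=final_struc`, `printSchema()` should now report `age` as `integer` rather than `long`." |
| + ] |
| + }, |
| + { |
| + "cell_type": "code", |
| + "execution_count": null, |
| + "metadata": {}, |
| + "outputs": [], |
| + "source": [ |
| + "# Confirm the user-defined schema was applied: `age` should now be integer\n", |
| + "df.printSchema()" |
| + ] |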
261 | 379 | }
|
262 | 380 | ],
|
263 | 381 | "metadata": {
|
|