Commit 19d837a

authored
Update README.md
1 parent 10b5aa8 commit 19d837a

1 file changed

+19
-3
lines changed

README.md

Lines changed: 19 additions & 3 deletions
@@ -76,7 +76,7 @@ There are two ways to create RDDs,
7676
* parallelizing an existing collection in your driver program,
7777
* referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.
7878

79-
## Basics of the `DataFrame`
79+
## Basics of the `Dataframe`
8080
<p align='center'><img src="https://cdn-images-1.medium.com/max/1202/1*wiXLNwwMyWdyyBuzZnGrWA.png" width="600" height="400"></p>
8181

8282
### DataFrame
@@ -87,15 +87,15 @@ In Apache Spark, a DataFrame is a distributed collection of rows under named col
8787
* __Lazy Evaluations__: A task is not executed until an action is performed.
8888
* __Distributed__: Both RDDs and DataFrames are distributed in nature.
8989

90-
### Advantages of the DataFrame
90+
### Advantages of the Dataframe
9191

9292
* DataFrames are designed for processing large collections of structured or semi-structured data.
9393
* Observations in a Spark DataFrame are organised under named columns, which lets Apache Spark understand the schema of the DataFrame and optimize the execution plan for queries on it.
9494
* DataFrame in Apache Spark has the ability to handle petabytes of data.
9595
* DataFrame supports a wide range of data formats and sources.
9696
* It has API support for different languages like Python, R, Scala, Java.
9797

98-
### Spark SQL
98+
## Spark SQL
9999
Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!
100100

101101
To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning.
@@ -104,3 +104,19 @@ Essentially, Spark SQL leverages the power of Spark to perform distributed, robu
104104
Spark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework) including data formats, user-defined functions (UDFs), and the metastore. Besides this, it also helps in ingesting a wide variety of data formats from Big Data sources and enterprise data warehouses like JSON, Hive, Parquet, and so on, and performing a combination of relational and procedural operations for more complex, advanced analytics.
105105

106106
![Spark-2](https://cdn-images-1.medium.com/max/2000/1*OY41hGbe4IB9-hHLRPuCHQ.png)
107+
108+
### Speed of Spark SQL
109+
Spark SQL has been shown to be extremely fast, even comparable to C++ based engines such as Impala.
110+
111+
![spark_speed](https://opensource.com/sites/default/files/uploads/9_spark-dataframes-vs-rdds-and-sql.png)
112+
113+
The following graph shows a benchmark of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.
114+
115+
![spark-speed-2](https://opensource.com/sites/default/files/uploads/10_comparing-spark-dataframes-and-rdds.png)
116+
117+
Why is Spark SQL so fast and optimized? The reason is a new extensible optimizer, **Catalyst**, based on functional programming constructs in Scala.
118+
119+
Catalyst's extensible design has two purposes.
120+
121+
* Makes it easy to add new optimization techniques and features to Spark SQL, especially to tackle diverse problems around Big Data, semi-structured data, and advanced analytics
122+
* Makes it easy to extend the optimizer, for example by adding data source-specific rules that can push filtering or aggregation into external storage systems, or by adding support for new data types
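
The idea of an extensible, rule-based optimizer described above can be illustrated with a small toy in plain Python. This is only a sketch of the general technique (rewriting an expression tree with pluggable rules). It is not Catalyst's actual API, which is written in Scala, and every name in it (`Lit`, `Col`, `Add`, `fold_constants`, `drop_add_zero`, `optimize`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


# A tiny expression tree (hypothetical; not Spark's real plan classes).
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Col, Add]
# A rule returns a rewritten node, or None when it does not apply.
Rule = Callable[[Expr], Optional[Expr]]


def fold_constants(e: Expr) -> Optional[Expr]:
    """Rewrite Add(Lit(a), Lit(b)) into Lit(a + b)."""
    if isinstance(e, Add) and isinstance(e.left, Lit) and isinstance(e.right, Lit):
        return Lit(e.left.value + e.right.value)
    return None


def drop_add_zero(e: Expr) -> Optional[Expr]:
    """Rewrite x + 0 or 0 + x into x."""
    if isinstance(e, Add):
        if e.left == Lit(0):
            return e.right
        if e.right == Lit(0):
            return e.left
    return None


def optimize(e: Expr, rules: List[Rule]) -> Expr:
    """Apply the rules bottom-up, repeating at each node until none fires."""
    if isinstance(e, Add):
        e = Add(optimize(e.left, rules), optimize(e.right, rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(e)
            if rewritten is not None:
                e = rewritten
                changed = True
    return e


# (0 + x) + (1 + 2)
plan = Add(Add(Lit(0), Col("x")), Add(Lit(1), Lit(2)))
basic = optimize(plan, [fold_constants])
extended = optimize(plan, [fold_constants, drop_add_zero])
```

In this sketch, extending the optimizer means appending one more function to the rule list: `basic` only folds the constants, while `extended` also removes the redundant `0 +`. The traversal code is unchanged in both cases.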
