In Apache Spark, a DataFrame is a distributed collection of rows under named columns. DataFrames also share some common characteristics with RDDs:

* __Lazy Evaluation__: A task is not executed until an action is performed (see the sketch below).
* __Distributed__: Both RDDs and DataFrames are distributed in nature.
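
To make lazy evaluation concrete, here is a minimal PySpark sketch (assuming a local Spark installation; the data and names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# A small DataFrame: a distributed collection of rows under named columns.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)],
    ["name", "age"],
)

# Transformations such as filter() and select() only build up a logical plan;
# nothing is computed at this point.
adults = df.filter(df.age > 30).select("name")

# An action such as show() or count() triggers the actual distributed execution.
adults.show()

spark.stop()
```
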

### Advantages of the DataFrame

* DataFrames are designed for processing large collections of structured or semi-structured data.
* Observations in a Spark DataFrame are organised under named columns, which helps Apache Spark understand the schema of the DataFrame. This, in turn, helps Spark optimize the execution plans of queries against it.
* A DataFrame in Apache Spark can handle petabytes of data.
* DataFrames support a wide range of data formats and sources (see the sketch after this list).
* DataFrames offer API support for multiple languages, including Python, R, Scala, and Java.
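
As a brief illustration of the points above, the sketch below reads a structured file and lets Spark infer its schema (the file name `people.json` is a hypothetical example, not from the original text):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sources-demo").getOrCreate()

# Spark ships with built-in readers for JSON, Parquet, CSV, ORC, and more.
people = spark.read.json("people.json")  # hypothetical example file

# The named columns give Spark a schema it can use to optimize query plans.
people.printSchema()
people.select("name", "age").show()

spark.stop()
```
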
## Spark SQL
Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!
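
As a minimal sketch of this relational capability (the column names and data are illustrative), a DataFrame can be registered as a temporary view and queried with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")

# The SQL query and the equivalent DataFrame operations compile down to the
# same optimized execution plan.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```
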
To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning.
Essentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on structured data.
Spark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework), including data formats, user-defined functions (UDFs), and the metastore. Besides this, it supports ingesting a wide variety of data formats, such as JSON, Hive, and Parquet, from Big Data sources and enterprise data warehouses, and performing a combination of relational and procedural operations for more complex, advanced analytics.
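
To give a flavour of the UDF support mentioned above, here is a small sketch of a Python UDF (the function and column names are illustrative, not from the original text):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap an ordinary Python function as a UDF usable in DataFrame expressions.
shout = udf(lambda s: s.upper(), StringType())

df.select(shout(df.name).alias("name_upper")).show()

spark.stop()
```
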
The following graph shows a benchmark of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.
Why is Spark SQL so fast and optimized? The reason is a new extensible optimizer, **Catalyst**, based on functional programming constructs in Scala.
Catalyst's extensible design has two purposes:
* It makes it easy to add new optimization techniques and features to Spark SQL, especially for tackling diverse problems around Big Data, semi-structured data, and advanced analytics.
* It makes the optimizer easy to extend, for example by adding data source-specific rules that can push filtering or aggregation into external storage systems, or by adding support for new data types.
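
One convenient way to watch Catalyst at work is `explain()`, which prints the plans the optimizer produces for a query (the tiny dataset below is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# extended=True prints the parsed, analyzed, optimized, and physical plans
# that Catalyst derives for this query.
df.filter(df.age > 30).select("name").explain(extended=True)

spark.stop()
```
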