Skip to content

Files

Latest commit

88796f2 · Jan 12, 2020

History

History

Project 3 - Spark Data Processing and Trend Detection from Airbnb data

README.md

Project 3: Spark Data Processing and Trend Detection from Airbnb data

by Sebastian Sbirna, Yingrui Li and Aijie Shu


This is the third of three mandatory projects to be handed in as part of the assessment for the course 02807 Computational Tools for Data Science at Technical University of Denmark, autumn 2019.

Introduction

Airbnb is an online marketplace for arranging or offering lodgings.

In this project you will use Spark to analyze data obtained from the Airbnb website. The purpose of the analysis is to extract information about trends and patterns from the data.

The project has two parts.

Part 1: Loading, describing and preparing the data

There's quite a lot of data. Make sure that you can load and correctly parse the data, and that you understand what the dataset contains. You should also prepare the data for the analysis in part two. This means cleaning it and staging it so that subsequent queries are fast.

Par 2: Analysis

In this part your goal is to learn about trends and usage patterns from the data. You should give solutions to the tasks defined in this notebook, and you should use Spark to do the data processing. You may use other libraries like for instance Pandas and matplotlib for visualisation.

Guidelines

  • Processing data should be done using Spark. Once data has been reduced to aggregate form, you may use collect to extract it into Python for visualisation.
  • Your solutions will be evaluated by correctness, code quality and interpretability of the output. This means that you have to write clean and efficient Spark code that will generate sensible execution plans, and that the tables and visualisations that you produce are meaningful and easy to read.
  • You may add more cells for your solutions, but you should not modify the notebook otherwise.

Create Spark session and define imports

from pyspark.sql import *
from pyspark.sql import functions as f
from pyspark.sql.types import *
from pyspark.sql.functions import split, explode
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from bokeh.io import output_notebook
from bokeh.io import show
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, RangeTool
from bokeh.plotting import figure
import string
# !jupyter nbextension enable --py --sys-prefix widgetsnbextension
spark = SparkSession.builder.appName("Project3").getOrCreate()

Part 1: Loading, describing and preparing the data

The data comes in two files. Start by downloading the files and putting them in your data/ folder.

Load the data

The data has multiline rows (rows that span multiple lines in the file). To correctly parse these you should use the multiline option and set the escape character to be ".

path1 = '/data/listings.csv'
path2 = '/data/reviews.csv'
listings = spark.read\
  .option("multiLine", True)\
  .option("header", True)\
  .option("escape", "\"")\
  .option("inferSchema", True)\
  .csv(path1)
reviews = spark.read\
  .option("multiLine", True)\
  .option("header", True)\
  .option("escape", "\"")\
  .option("inferSchema", True)\
  .csv(path2)

Describe the data

List the features (schema) and sizes of the datasets.

Listings
listings.printSchema()    
root
 |-- id: string (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- scrape_id: string (nullable = true)
 |-- last_scraped: string (nullable = true)
 |-- name: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- space: string (nullable = true)
 |-- description: string (nullable = true)
 |-- experiences_offered: string (nullable = true)
 |-- neighborhood_overview: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- transit: string (nullable = true)
 |-- access: string (nullable = true)
 |-- interaction: string (nullable = true)
 |-- house_rules: string (nullable = true)
 |-- thumbnail_url: string (nullable = true)
 |-- medium_url: string (nullable = true)
 |-- picture_url: string (nullable = true)
 |-- xl_picture_url: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_url: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- host_since: string (nullable = true)
 |-- host_location: string (nullable = true)
 |-- host_about: string (nullable = true)
 |-- host_response_time: string (nullable = true)
 |-- host_response_rate: string (nullable = true)
 |-- host_acceptance_rate: string (nullable = true)
 |-- host_is_superhost: string (nullable = true)
 |-- host_thumbnail_url: string (nullable = true)
 |-- host_picture_url: string (nullable = true)
 |-- host_neighbourhood: string (nullable = true)
 |-- host_listings_count: string (nullable = true)
 |-- host_total_listings_count: string (nullable = true)
 |-- host_verifications: string (nullable = true)
 |-- host_has_profile_pic: string (nullable = true)
 |-- host_identity_verified: string (nullable = true)
 |-- street: string (nullable = true)
 |-- neighbourhood: string (nullable = true)
 |-- neighbourhood_cleansed: string (nullable = true)
 |-- neighbourhood_group_cleansed: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zipcode: string (nullable = true)
 |-- market: string (nullable = true)
 |-- smart_location: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- country: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- is_location_exact: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- room_type: string (nullable = true)
 |-- accommodates: string (nullable = true)
 |-- bathrooms: string (nullable = true)
 |-- bedrooms: string (nullable = true)
 |-- beds: string (nullable = true)
 |-- bed_type: string (nullable = true)
 |-- amenities: string (nullable = true)
 |-- square_feet: string (nullable = true)
 |-- price: string (nullable = true)
 |-- weekly_price: string (nullable = true)
 |-- monthly_price: string (nullable = true)
 |-- security_deposit: string (nullable = true)
 |-- cleaning_fee: string (nullable = true)
 |-- guests_included: string (nullable = true)
 |-- extra_people: string (nullable = true)
 |-- minimum_nights: string (nullable = true)
 |-- maximum_nights: string (nullable = true)
 |-- minimum_minimum_nights: string (nullable = true)
 |-- maximum_minimum_nights: string (nullable = true)
 |-- minimum_maximum_nights: string (nullable = true)
 |-- maximum_maximum_nights: string (nullable = true)
 |-- minimum_nights_avg_ntm: string (nullable = true)
 |-- maximum_nights_avg_ntm: string (nullable = true)
 |-- calendar_updated: string (nullable = true)
 |-- has_availability: string (nullable = true)
 |-- availability_30: string (nullable = true)
 |-- availability_60: string (nullable = true)
 |-- availability_90: string (nullable = true)
 |-- availability_365: string (nullable = true)
 |-- calendar_last_scraped: string (nullable = true)
 |-- number_of_reviews: string (nullable = true)
 |-- number_of_reviews_ltm: string (nullable = true)
 |-- first_review: string (nullable = true)
 |-- last_review: string (nullable = true)
 |-- review_scores_rating: string (nullable = true)
 |-- review_scores_accuracy: string (nullable = true)
 |-- review_scores_cleanliness: string (nullable = true)
 |-- review_scores_checkin: string (nullable = true)
 |-- review_scores_communication: string (nullable = true)
 |-- review_scores_location: string (nullable = true)
 |-- review_scores_value: string (nullable = true)
 |-- requires_license: string (nullable = true)
 |-- license: string (nullable = true)
 |-- jurisdiction_names: string (nullable = true)
 |-- instant_bookable: string (nullable = true)
 |-- is_business_travel_ready: string (nullable = true)
 |-- cancellation_policy: string (nullable = true)
 |-- require_guest_profile_picture: string (nullable = true)
 |-- require_guest_phone_verification: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- calculated_host_listings_count_entire_homes: string (nullable = true)
 |-- calculated_host_listings_count_private_rooms: string (nullable = true)
 |-- calculated_host_listings_count_shared_rooms: string (nullable = true)
 |-- reviews_per_month: string (nullable = true)
listings.count()
1330480
Reviews
reviews.printSchema()
root
 |-- listing_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- reviewer_id: string (nullable = true)
 |-- reviewer_name: string (nullable = true)
 |-- comments: string (nullable = true)
reviews.count()
32297300

Prepare the data for analysis

You should prepare two dataframes to be used in the analysis part of the project. You should not be concerned with cleaning the data. There's a lot of it, so it will be sufficient to drop rows that have bad values. You may want to go back and refine this step at a later point when doing the analysis.

You may also want to consider if you can stage your data so that subsequent processing is more efficient (this is not strictly necessary for Spark to run, but you may be able to decrease the time you sit around waiting for Spark to finish things)

1. filter out necessary columns which will be involved in following questions. 2. Keep the `NaN` entries and only drop them when necessary
listings_sub = listings.select(f.col('id').alias('listing_id'), 'city', 'neighbourhood', 'property_type', 'price', 'review_scores_rating')\
               .drop_duplicates()
listings_sub.show()
+----------+---------------+--------------------+-------------+---------+--------------------+
|listing_id|           city|       neighbourhood|property_type|    price|review_scores_rating|
+----------+---------------+--------------------+-------------+---------+--------------------+
|   1170802|      Stockholm|           Skarpnäck|    Apartment|  $352.00|                null|
|   1867260|      Stockholm|            Norrmalm|    Apartment|$1,301.00|                 100|
|   3108437|     Stockholm |Hägersten-Liljeho...|    Apartment|  $196.00|                  73|
|   3259719|      Stockholm|           Södermalm|    Apartment|$1,996.00|                  97|
|   5218468|      Stockholm|          Skärholmen|        House|  $949.00|                 100|
|   5360266|      Stockholm|           Skarpnäck|    Townhouse|$1,771.00|                  97|
|   5706203|      Stockholm|         Kungsholmen|    Apartment|  $890.00|                  93|
|   6340350|      Stockholm|           Östermalm|    Apartment|  $998.00|                  96|
|   7063087|      Stockholm|Hägersten-Liljeho...|    Apartment|  $900.00|                  97|
|   7614052|      Stockholm|         Kungsholmen|    Apartment|  $695.00|                  91|
|   7616523|      Stockholm|              Farsta|    Apartment|  $274.00|                  93|
|  13255147|      Stockholm|         Kungsholmen|    Apartment|$1,154.00|                 100|
|  14018055|      Stockholm|Hägersten-Liljeho...|    Apartment|  $597.00|                  73|
|  16190458|      Stockholm|            Norrmalm|    Apartment|$1,800.00|                null|
|  16654904|      Stockholm|            Norrmalm|    Apartment|  $998.00|                  96|
|  18290399|      Stockholm|            Norrmalm|    Apartment|$1,497.00|                null|
|  18840650|      Stockholm|           Skarpnäck|    Apartment|  $949.00|                 100|
|  19522814|Ladugårdsgärdet|           Östermalm|    Apartment|  $479.00|                  93|
|  19966118|       Norrmalm|            Norrmalm|    Apartment|$1,096.00|                  88|
|  20633450|Skarpnäcks Gård|           Skarpnäck|    Apartment|  $597.00|                null|
+----------+---------------+--------------------+-------------+---------+--------------------+
only showing top 20 rows
listings_sub.count()
1330380

Part 2: Analysis

Use Spark and your favorite tool for data visualization to solve the following tasks.

The basics

Compute and show a dataframe with the number of listings and neighbourhoods per city.

city_neigh_listing = listings_sub.groupBy('city')\
                     .agg(f.countDistinct('listing_id').alias('listing_counts'), f.countDistinct('neighbourhood').alias('neighbourhood_counts'))\
                     .filter(f.col('city').isNotNull())\
                     .orderBy(f.desc('listing_counts'))
city_neigh_listing.show(5)
+--------------+--------------+--------------------+
|          city|listing_counts|neighbourhood_counts|
+--------------+--------------+--------------------+
|         Paris|         61923|                  63|
|Greater London|         46521|                 149|
|        London|         33100|                 148|
|       Beijing|         32338|                  61|
|   Los Angeles|         27763|                  97|
+--------------+--------------+--------------------+
only showing top 5 rows

Based on the table above, you should choose a city that you want to continue your analysis for. The city should have mulitple neighbourhoods with listings in them.

We choose Beijing for furthur analysis. It contains 32338 listings and 61 neighbourhood

Compute and visualize the number of listings of different property types per neighbourhood in your city.

beijing_neigh_listing = listings_sub.groupBy('city', 'neighbourhood', 'property_type')\
                        .agg(f.countDistinct('listing_id').alias('listing_counts_by_type'))\
                        .filter(f.col('city')=='Beijing')\
                        .filter(f.col('neighbourhood').isNotNull())\
                        .filter(f.col('property_type').isNotNull())\
                        .orderBy(f.desc('listing_counts_by_type'))
beijing_neigh_listing.show()
+-------+------------------+------------------+----------------------+
|   city|     neighbourhood|     property_type|listing_counts_by_type|
+-------+------------------+------------------+----------------------+
|Beijing|          Chaoyang|         Apartment|                  3238|
|Beijing|          Chaoyang|       Condominium|                  1462|
|Beijing|           Haidian|         Apartment|                   821|
|Beijing|           Fengtai|         Apartment|                   777|
|Beijing|          Chaoyang|             House|                   701|
|Beijing|         Dongcheng|         Apartment|                   636|
|Beijing|           Haidian|       Condominium|                   436|
|Beijing|          Chaoyang|Serviced apartment|                   402|
|Beijing|          Sanlitun|         Apartment|                   378|
|Beijing|          Chaoyang|              Loft|                   352|
|Beijing|           Xicheng|         Apartment|                   304|
|Beijing|           Fengtai|       Condominium|                   291|
|Beijing|Jinsong/Panjiayuan|         Apartment|                   281|
|Beijing|           Xicheng|             House|                   279|
|Beijing|           Haidian|             House|                   271|
|Beijing|         Dongcheng|             House|                   255|
|Beijing|         Dongcheng|       Condominium|                   255|
|Beijing|          Wangjing|         Apartment|                   233|
|Beijing|           Fengtai|             House|                   221|
|Beijing|       Chongwenmen|         Apartment|                   190|
+-------+------------------+------------------+----------------------+
only showing top 20 rows
# convert to pandas dataframe for visualization
beijing_df = beijing_neigh_listing.toPandas()
# visualize the number of differnt property type in different neighbourhood
plot = sns.FacetGrid(beijing_df, col='neighbourhood', sharex=False, sharey=False, col_wrap=3, size=6)
plot.map(sns.barplot, 'property_type', 'listing_counts_by_type')
plot.set_xticklabels(rotation=90)
plot.fig.tight_layout()
for ax in plot.axes.flat:
    ax.set_ylabel('listing number of each property type')
/anaconda3/lib/python3.6/site-packages/seaborn/axisgrid.py:230: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
/anaconda3/lib/python3.6/site-packages/seaborn/axisgrid.py:715: UserWarning: Using the barplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)

png

Prices

Compute the minimum, maximum and average listing price in your city.

In this part, we first check the price format and clean/transfer it. Then find the min, max, avg.
# filter out Beijing
beijing_price = listings_sub.filter(f.col('city')=='Beijing')                         
# define a function to get the currency sign of a price
extract_currency = f.udf(lambda x : x[0])   
# check if there is only one type of currency sign -- $
beijing_price.withColumn('currency', extract_currency(f.col('price')))\
             .select('currency').distinct()\
             .show()
+--------+
|currency|
+--------+
|       $|
+--------+
from re import sub
from decimal import Decimal

# define a function to clean the price. only keep numbers and drop $ and . ,
extract_price = f.udf(lambda price : Decimal(sub(r'[^\d.]', '', price)))
# apply the function on column price and make a new column
beijing_price = beijing_price.withColumn('price', extract_price(f.col('price')).cast(DoubleType()))
# sort the price
beijing_price.orderBy(f.desc('price')).show()
+----------+-------+--------------------+------------------+-------+--------------------+
|listing_id|   city|       neighbourhood|     property_type|  price|review_scores_rating|
+----------+-------+--------------------+------------------+-------+--------------------+
|  15666822|Beijing|                null|             House|71597.0|                null|
|  12689987|Beijing|            Chaoyang|         Apartment|71110.0|                 100|
|  28134193|Beijing|             Haidian|         Apartment|68984.0|                  60|
|  22382855|Beijing|            Chaoyang|       Condominium|68980.0|                 100|
|  35658816|Beijing|            Chaoyang|         Apartment|68088.0|                null|
|  34071825|Beijing|            Chaoyang|             House|68002.0|                null|
|  27587044|Beijing|                null|       Condominium|66665.0|                null|
|  28803519|Beijing|           Dongcheng|       Condominium|65973.0|                  80|
|  15488817|Beijing|              Zhuang|Serviced apartment|63345.0|                  86|
|  21942314|Beijing|             Fengtai|             House|60002.0|                 100|
|  20748712|Beijing|            Chaoyang|       Condominium|60002.0|                  60|
|  29138170|Beijing|            Chaoyang|       Condominium|60002.0|                null|
|  31432793|Beijing|Liang Ma Qiao/San...|         Apartment|60002.0|                 100|
|  32905425|Beijing|                null|             House|55003.0|                 100|
|  35353869|Beijing|                null|       Condominium|46164.0|                  95|
|  29520327|Beijing|            Chaoyang|             House|43798.0|                null|
|  15526505|Beijing|             Shilipu|         Apartment|39358.0|                 100|
|  31543015|Beijing|                null|         Apartment|36771.0|                null|
|  30355531|Beijing|             Fengtai|             House|30001.0|                null|
|  25434257|Beijing|                null|            Hostel|28501.0|                null|
+----------+-------+--------------------+------------------+-------+--------------------+
only showing top 20 rows
beijing_price.count()
32338
beijing_price.filter(f.col('neighbourhood').isNull()).count()
11416
According to above counts, 1/3 of the `neighbourhood` entries are null. So we decide to drop them.
# drop null value
beijing_price = beijing_price.filter(f.col('neighbourhood').isNotNull())
beijing_price.orderBy(f.desc('price')).show()
+----------+-------+--------------------+------------------+-------+--------------------+
|listing_id|   city|       neighbourhood|     property_type|  price|review_scores_rating|
+----------+-------+--------------------+------------------+-------+--------------------+
|  12689987|Beijing|            Chaoyang|         Apartment|71110.0|                 100|
|  28134193|Beijing|             Haidian|         Apartment|68984.0|                  60|
|  22382855|Beijing|            Chaoyang|       Condominium|68980.0|                 100|
|  35658816|Beijing|            Chaoyang|         Apartment|68088.0|                null|
|  34071825|Beijing|            Chaoyang|             House|68002.0|                null|
|  28803519|Beijing|           Dongcheng|       Condominium|65973.0|                  80|
|  15488817|Beijing|              Zhuang|Serviced apartment|63345.0|                  86|
|  20748712|Beijing|            Chaoyang|       Condominium|60002.0|                  60|
|  21942314|Beijing|             Fengtai|             House|60002.0|                 100|
|  31432793|Beijing|Liang Ma Qiao/San...|         Apartment|60002.0|                 100|
|  29138170|Beijing|            Chaoyang|       Condominium|60002.0|                null|
|  29520327|Beijing|            Chaoyang|             House|43798.0|                null|
|  15526505|Beijing|             Shilipu|         Apartment|39358.0|                 100|
|  30355531|Beijing|             Fengtai|             House|30001.0|                null|
|  22035885|Beijing|             Xicheng|    Boutique hotel|26531.0|                 100|
|  32824344|Beijing|            Chaoyang|Serviced apartment|25998.0|                null|
|  32825776|Beijing|            Chaoyang|Serviced apartment|25998.0|                null|
|  33539654|Beijing|             Shilipu|         Apartment|25998.0|                  80|
|  33943498|Beijing|  Wangfujing/Dongdan|Serviced apartment|25002.0|                null|
|  33751384|Beijing|             Shilipu|             House|24000.0|                null|
+----------+-------+--------------------+------------------+-------+--------------------+
only showing top 20 rows
The most expensive listings in beijing is $711,110. It locates in `Chaoyang` district and it is an `Apartment`.
# calculate the min, max and average
beijing_price.select(f.min('price'), f.max('price'), f.avg('price')).show()
+----------+----------+-----------------+
|min(price)|max(price)|       avg(price)|
+----------+----------+-----------------+
|       0.0|   71110.0|565.5259535417264|
+----------+----------+-----------------+

Compute and visualize the distribution of listing prices in your city.

A good way to show the distribution of listing prices is the histogram.
beijing_price_df = beijing_price.toPandas()
# histogram show the price distribution
plt.figure(figsize = [20,12])
plot = sns.distplot(beijing_price_df.price, kde=False, bins = 4000, hist_kws = {'alpha': 0.9}, color = sns.color_palette()[0])
plt.xticks(np.arange(0, beijing_price_df.price.max(), 100))
plt.xlim(0, 1000)
plt.xlabel('Listing price distribution(frequency)', fontsize=13)
plt.show()

png

The value of a listing is its rating divided by its price.

Compute and show a dataframe with the 3 highest valued listings in each neighbourhood.

We compute the 3 highest valued listings in each neighbourhood in 4 steps :

  1. cast the review ratings into integers in order to caculate the value.

  2. sort the value in descending way.

  3. make a window filter to rank the value in each neigbourhood.

  4. slect lisitings with rank number less than 3 in each neighbourhood.

# cast string to integer
beijing_price = beijing_price.withColumn('review_scores_rating', (f.col('review_scores_rating')).cast(IntegerType()))
# calculate value
beijing_listing_value = beijing_price.withColumn('value', (f.col('review_scores_rating')/f.col('price')).alias('value'))
beijing_listing_value.show()
+----------+-------+--------------------+------------------+-----+--------------------+-------------------+
|listing_id|   city|       neighbourhood|     property_type|price|review_scores_rating|              value|
+----------+-------+--------------------+------------------+-----+--------------------+-------------------+
|   6317579|Beijing|           Dongcheng| Bed and breakfast|220.0|                  99|               0.45|
|   6622351|Beijing|           Dongcheng|        Guesthouse|555.0|                  95|0.17117117117117117|
|  12597460|Beijing|             Haidian|         Apartment|525.0|                  97|0.18476190476190477|
|  13495703|Beijing|            Chaoyang|             House|164.0|                 100| 0.6097560975609756|
|  13564871|Beijing|Liang Ma Qiao/San...|    Boutique hotel|640.0|                null|               null|
|  14473988|Beijing|          Shuangjing|         Apartment|525.0|                  94|0.17904761904761904|
|  14997784|Beijing|           Dongcheng|Serviced apartment|668.0|                  98| 0.1467065868263473|
|  16430595|Beijing|            Sanlitun|         Apartment|853.0|                 100|0.11723329425556858|
|  18940394|Beijing|              Xuanwu|          Bungalow|498.0|                  95|0.19076305220883535|
|  19025271|Beijing|             Haidian|             House|468.0|                  90|0.19230769230769232|
|  19086556|Beijing|             Haidian|         Apartment|498.0|                  50|0.10040160642570281|
|  19088267|Beijing|  Jinsong/Panjiayuan|         Apartment|170.0|                null|               null|
|  19102060|Beijing|            Ahn Jung|         Apartment|121.0|                null|               null|
|  19506317|Beijing|            Chaoyang|         Apartment|191.0|                 100| 0.5235602094240838|
|  20929761|Beijing|            Chaoyang|         Apartment|177.0|                  99|  0.559322033898305|
|  21103240|Beijing|            Chaoyang|         Apartment|910.0|                  93| 0.1021978021978022|
|  21642531|Beijing|            Chaoyang|Serviced apartment|391.0|                  80|0.20460358056265984|
|  22005612|Beijing|           Dongcheng|             House|740.0|                  90|0.12162162162162163|
|  22986438|Beijing|             Beiyuan|         Apartment|199.0|                  50|0.25125628140703515|
|  23172274|Beijing|            Chaoyang|         Apartment|697.0|                  98| 0.1406025824964132|
+----------+-------+--------------------+------------------+-----+--------------------+-------------------+
only showing top 20 rows
# build a window
sorted_by_value = Window.partitionBy('neighbourhood').orderBy(f.desc('value'))

# rank with window and filter by rank number
beijing_ranked_neigh = beijing_listing_value.withColumn('value_rank', f.rank().over(sorted_by_value))
beijing_ranked_neigh.filter(f.col('value_rank') <= 3).drop('value_rank').orderBy('neighbourhood', f.desc('value')).show()
+----------+-------+-------------------+-------------+-----+--------------------+------------------+
|listing_id|   city|      neighbourhood|property_type|price|review_scores_rating|             value|
+----------+-------+-------------------+-------------+-----+--------------------+------------------+
|  20195162|Beijing|           Ahn Jung|    Apartment| 85.0|                 100|1.1764705882352942|
|  16695169|Beijing|           Ahn Jung|    Apartment| 85.0|                  95|1.1176470588235294|
|  17971300|Beijing|           Ahn Jung|        House| 92.0|                  89| 0.967391304347826|
|  23209150|Beijing|Bei Tai Ping Zhuang|    Apartment| 85.0|                 100|1.1764705882352942|
|  15697701|Beijing|Bei Tai Ping Zhuang|    Apartment| 85.0|                  99|1.1647058823529413|
|  16318798|Beijing|Bei Tai Ping Zhuang|    Apartment| 92.0|                 100|1.0869565217391304|
|  33866823|Beijing|Bei Tai Ping Zhuang|       Hostel| 92.0|                 100|1.0869565217391304|
|  31556177|Beijing| Beijing University|        House|100.0|                  93|              0.93|
|  18222384|Beijing| Beijing University|   Guesthouse|135.0|                 100|0.7407407407407407|
|  35962784|Beijing| Beijing University|        House|142.0|                 100| 0.704225352112676|
|  34515565|Beijing|            Beiyuan|        House| 71.0|                 100| 1.408450704225352|
|  38238385|Beijing|            Beiyuan|    Apartment| 92.0|                 100|1.0869565217391304|
|  25457982|Beijing|            Beiyuan|    Apartment|107.0|                 100|0.9345794392523364|
|  12190198|Beijing|            Caoqiao|  Condominium| 78.0|                  90|1.1538461538461537|
|  13922593|Beijing|            Caoqiao|    Apartment|100.0|                 100|               1.0|
|  33304596|Beijing|            Caoqiao|  Condominium|100.0|                 100|               1.0|
|  32769153|Beijing|           Chaoyang|  Condominium| 64.0|                 100|            1.5625|
|  34194662|Beijing|           Chaoyang|        House| 64.0|                 100|            1.5625|
|  34243820|Beijing|           Chaoyang|        House| 64.0|                 100|            1.5625|
|  14392809|Beijing|           Chaoyang|    Apartment| 64.0|                 100|            1.5625|
+----------+-------+-------------------+-------------+-----+--------------------+------------------+
only showing top 20 rows

Trends

Now we want to analyze the "popularity" of your city. The data does not contain the number of bookings per listing, but we have a large number of reviews, and we will assume that this is a good indicator of activity on listings.

Compute and visualize the popularity(i.e. number of reviews) of your city over time.

# check the data type
reviews.printSchema()
root
 |-- listing_id: string (nullable = true)
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- reviewer_id: string (nullable = true)
 |-- reviewer_name: string (nullable = true)
 |-- comments: string (nullable = true)
# join reviews with beijing's listings data
beijing_popularity = listings_sub.filter(f.col('city') == 'Beijing')\
                                 .join(reviews, ['listing_id'], 'inner')\
                                 .withColumn('date', f.col('date').cast(DateType()))

Compute and visualize the popularity (i.e., number of reviews) of your city over time.

# groupby day
beijing_popularity_by_city = beijing_popularity.groupBy('date')\
                                               .agg(f.count('id').alias('review_counts'))\
                                               .orderBy('date')\
                                               .toPandas()
beijing_popularity_by_city.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
date review_counts
0 2010-08-25 1
1 2010-10-13 1
2 2011-06-02 1
3 2011-06-05 1
4 2011-08-02 1
# visulize the number of review_counts by time
dates = np.array(beijing_popularity_by_city['date'], dtype=np.datetime64)
source = ColumnDataSource(data=dict(date=dates, close=beijing_popularity_by_city['review_counts']))

p = figure(plot_height=300, plot_width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",
           background_fill_color="#efefef", x_range=(dates[2000], dates[2284]))

p.line('date', 'close', source=source, legend_label='Beijing popularity over time')
p.yaxis.axis_label = 'Number of reviews'

select = figure(title="Drag the middle and edges of the selection box to change the range above",
                plot_height=130, plot_width=800, y_range=p.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None, background_fill_color="#efefef")

range_tool = RangeTool(x_range=p.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2

select.line('date', 'close', source=source)
select.ygrid.grid_line_color = None
select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

output_notebook()
show(column(p, select))
Loading BokehJS ...

Compute and visualize the popularity of neighbourhoods over time.

If there are many neighbourhoods in your city, you should select a few interesting ones for comparison.

We select Chaoyang, Haidian, Fengtai, Dongchen based on the number of listings

beijing_popularity_by_neigh = beijing_popularity.filter((f.col('neighbourhood')=='Chaoyang') | (f.col('neighbourhood')=='Haidian') | (f.col('neighbourhood')=='Fengtai') | (f.col('neighbourhood')=='Dongcheng'))\
                                                .groupBy('neighbourhood', 'date')\
                                                .agg(f.count('id').alias('review_counts'))\
                                                .orderBy('date')\
                                                .toPandas()
beijing_popularity_by_neigh.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
neighbourhood date review_counts
0 Chaoyang 2010-08-25 1
1 Chaoyang 2010-10-13 1
2 Dongcheng 2011-06-02 1
3 Dongcheng 2011-06-05 1
4 Dongcheng 2011-08-02 1
# visualize the popularity of neighbourhoods over time
source1 = beijing_popularity_by_neigh[beijing_popularity_by_neigh['neighbourhood']=='Chaoyang']
source2 = beijing_popularity_by_neigh[beijing_popularity_by_neigh['neighbourhood']=='Haidian']
source3 = beijing_popularity_by_neigh[beijing_popularity_by_neigh['neighbourhood']=='Fengtai']
source4 = beijing_popularity_by_neigh[beijing_popularity_by_neigh['neighbourhood']=='Dongcheng']

p = figure(plot_height=300, plot_width=800, tools="xpan", toolbar_location=None,
           x_axis_type="datetime", x_axis_location="above",
           background_fill_color="#efefef", x_range=(dates[2000], dates[2284]))

p.line(source1['date'], source1['review_counts'], legend_label='Chaoyang popularity over time', line_color='blue')
p.line(source2['date'], source2['review_counts'], legend_label='Haidian popularity over time', line_color='orange')
p.line(source3['date'], source3['review_counts'], legend_label='Fengtai popularity over time', line_color='red')
p.line(source4['date'], source4['review_counts'], legend_label='Dongcheng popularity over time', line_color='green')

p.yaxis.axis_label = 'Number of reviews'

select = figure(title="Drag the middle and edges of the selection box to change the range above",
                plot_height=130, plot_width=800, y_range=p.y_range,
                x_axis_type="datetime", y_axis_type=None,
                tools="", toolbar_location=None, background_fill_color="#efefef")

range_tool = RangeTool(x_range=p.x_range)
range_tool.overlay.fill_color = "navy"
range_tool.overlay.fill_alpha = 0.2

select.line(source1['date'], source1['review_counts'])
select.line(source2['date'], source2['review_counts'])
select.line(source3['date'], source3['review_counts'])
select.line(source4['date'], source4['review_counts'])

select.ygrid.grid_line_color = None
select.add_tools(range_tool)
select.toolbar.active_multi = range_tool

output_notebook()
show(column(p, select))
Loading BokehJS ...

Compute and visualize the popularity of your city by season.

For example, visualize the popularity of your city per month.

# manipulate data into the format "MM-01-yyyy"
beijing_popularity_by_city_seasonal = beijing_popularity.withColumn('month', f.date_format(f.col('date'), 'MM-01-yyyy'))\
                                                        .groupBy('month')\
                                                        .agg(f.count('id').alias('review_counts'))\
                                                        .orderBy(f.col('month'))\
                                                        .toPandas()
beijing_popularity_by_city_seasonal.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
month review_counts
0 01-01-2012 3
1 01-01-2013 7
2 01-01-2014 23
3 01-01-2015 42
4 01-01-2016 311
# Turn month column into datetime to fix the formatting issues and make processing easier
beijing_popularity_by_city_seasonal.month = pd.to_datetime(beijing_popularity_by_city_seasonal.month)
beijing_popularity_by_city_seasonal['Month_Name'] = beijing_popularity_by_city_seasonal.month.apply(lambda row: row.month_name())
beijing_popularity_by_city_seasonal['Year'] = beijing_popularity_by_city_seasonal.month.apply(lambda row: str(row.year))
beijing_popularity_by_city_seasonal = beijing_popularity_by_city_seasonal.set_index(['Month_Name', 'Year'])['review_counts'].unstack().loc[:, ['2012', '2013', '2014', '2015', '2016', '2017', '2018']].astype(int).reset_index()
beijing_popularity_by_city_seasonal
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Year Month_Name 2012 2013 2014 2015 2016 2017 2018
0 April 6 23 33 112 491 2002 4897
1 August 15 24 47 248 1268 3782 8603
2 December 10 23 70 258 1338 3000 7774
3 February 4 7 22 38 248 1180 2874
4 January 3 7 23 42 311 1311 3625
5 July 11 22 46 208 900 3090 7126
6 June 15 29 36 181 682 2397 5819
7 March 11 25 26 64 324 1452 1495
8 May 12 26 36 118 658 2391 5512
9 November 12 29 47 218 892 2521 6616
10 October 12 36 62 314 1281 2618 8419
11 September 9 32 51 251 1169 3268 7321
#visualize the popularity of your city by season
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
years = ['2012', '2013', '2014', '2015', '2016', '2017', '2018']
colors = sns.color_palette().as_hex()[2:len(years)+2]

p = figure(x_range = months, plot_height=250, plot_width = 1000, title="Fruit Counts by Year",
           toolbar_location=None, tools="")

p.vbar_stack(years, x='Month_Name', width=0.9, color=colors, source=beijing_popularity_by_city_seasonal, legend_label=years)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"

output_notebook()
show(column(p))
Loading BokehJS ...

As the Airbnd is not so popular in Beijing before 2016, it is popularity is not shown obviously in above figure.

Reviews

In this part you should determine which words used in reviews that are the most positive.

The individual reviews do not have a rating of the listing, so we will assume that each review gave the average rating to the listing, i.e., the one on the listing.

You should assign a positivity weight to each word seen in reviews and list the words with the highest weight. It is up to you to decide what the weight should be. For example, it can be a function of the rating on the listing on which it occurs, the number of reviews it occurs in, and the number of unique listings for which it was used to review.

Depending on your choice of weight function, you may also want to do some filtering of words. For example, remove words that only occur in a few reviews.

We solve this question in steps:

  1. join the reviews into the listings data based on the listing's id.
  2. split the review comments into words and explode it into rows. Assume the word score euqals to the review score.
  3. groupy by distinct word and calculate the mean of word score as well as the word frequecy.
  4. calculate the weighted score of each word based on the formula below.

$$ weighted_score = average_score * \sqrt{word_frequency} $$ 5. show 5 words with the highest weighted score. 6. show 5 words with the highest average score but drop the word whose frequency are less than 200 million. 7. compare the words list.

# data clean
rated_reviews = listings_sub.join(reviews, ['listing_id'], 'inner')\
                            .withColumn('date', f.col('date').cast(DateType()))\
                            .withColumn('review_scores_rating', (f.col('review_scores_rating')).cast(IntegerType()))\
                            .filter(f.col('review_scores_rating').isNotNull())\
                            .orderBy(f.desc('review_scores_rating'))
# calculate weighted score, average score and frequency
weighted_rated_reviews = rated_reviews.withColumn('words', explode(split(f.col('comments'), '\W+').alias('words')))\
                                      .select('words', 'review_scores_rating')\
                                      .groupBy('words')\
                                      .agg(f.mean('review_scores_rating').alias('avg_scores'), f.count('review_scores_rating').alias('frequency'))\
                                      .withColumn('weighted_scores', f.col('avg_scores')*f.col('frequency')**(1/2))\
                                      .orderBy(f.desc('weighted_scores'))\
weighted_rated_reviews.show(100)
+-----------+-----------------+---------+------------------+
|      words|       avg_scores|frequency|   weighted_scores|
+-----------+-----------------+---------+------------------+
|        and|95.38446260851613| 56651028| 717929.0631763524|
|        the|94.78737839063318| 45500093| 639376.4765151071|
|          a| 94.9671802056149| 37460320| 581245.0728823153|
|         to| 95.1163928305043| 35362264| 565620.8406016177|
|           |94.81618909027294| 29747527| 517139.7600633213|
|        was|94.98309220656003| 26994652|493497.74286926084|
|         is|95.02753095496094| 24109625| 466599.9340859086|
|         in|94.91229512144986| 22275819|  447960.082154477|
|        The|94.94596874803665| 15598380|374987.05716002936|
|       very| 95.1088069819577| 15197113|370767.17711119284|
|         of|95.12319003040311| 14630080| 363839.4250530417|
|        for|94.99795637736182| 14382303| 360270.3153953516|
|          s| 94.9018364609585| 13925262| 354141.0773476358|
|          I|95.12058574757597| 12203482| 332289.3318177492|
|      place|95.24744166661456| 11996775| 329902.4769323235|
|       with|95.29559016574098| 11143412|318113.35198231135|
|       stay| 95.4442867301635| 11069948| 317557.7580453187|
|      great| 95.4029896860036| 10444642|308325.01688483945|
|         we|95.09696650190183| 10011643|300898.02769293496|
|        you|95.09179293254158|  9094698|286772.28525679064|
|  apartment|94.82646076013864|  8509816| 276623.8531842354|
|         it|94.52071476886938|  8400142| 273949.3731020657|
|   location|94.77713941782233|  8275721|272650.63721528917|
|         We|95.53479996573648|  7623269| 263774.1432055377|
|      clean|95.30230785323009|  6873834| 249863.5681926265|
|         us|95.35846774238718|  6248897|238375.13234979034|
|        our|95.47740065772818|  6185534| 237459.3016153283|
|         de|93.74897452229801|  6350455| 236248.4388482613|
|         at|95.18798446582971|  6141300|   235891.49758361|
|        had|95.07513312131093|  5982701| 232549.5970911644|
|       were|95.09261953650125|  5946510|231887.79340753102|
|       host| 95.1651470095354|  5710718|227417.18541364872|
|         as|94.85934620833214|  5478473| 222029.0955895442|
|       nice|94.80211029246783|  5352623|219331.66364097325|
|       from|95.05265281867527|  5186408|216469.93229633317|
|         on|94.78545617625683|  5083945| 213718.5044028016|
|         It|95.10986162385046|  4923175| 211031.9376673511|
|        but|93.90564827589859|  4909746|208075.63451729692|
|       this|94.93024145947574|  4778483|207515.05422092127|
|  recommend|95.64356330749465|  4672322|206738.86633391897|
|       that|94.44120294469836|  4782697|206537.03786968024|
|         so|95.17539793221641|  4542545| 202849.6769625939|
|        are|95.07367498491193|  4488647|201427.14993116292|
|        all|95.13279709109176|  4427228|200168.71919574295|
|      would| 95.2274827012498|  4200737|195175.38742457228|
|          t|94.24757158348358|  4192238|192971.48278786556|
|comfortable|95.72393772561357|  3966960|  190655.556193693|
| everything|95.56136087136093|  3832809| 187085.8384615892|
|      there| 94.5004643538194|  3896813|186547.19587704944|
|      again|95.76117241289091|  3745677|185333.79413434947|
|      Great|94.96335454616879|  3742074|183701.30150652467|
|         an|95.14065317803181|  3720762|183519.43953982703|
|     really|95.07656651409276|  3717761| 183321.8466525952|
|       well|95.35040109951755|  3636504| 181829.5898453963|
|         la|93.43163484651143|  3777480|181591.29882558843|
|       good|94.08479450231043|  3657973| 179944.9605218761|
|         et|94.12369670495197|  3646047|179725.66826057073|
|    perfect| 95.7042850242088|  3475803| 178426.3371079245|
|       room|94.34828743789681|  3540076|177517.14901347584|
|          y|93.98835911243717|  3555399|177222.24776193604|
|       have|95.01385633593384|  3423055|175789.88572218787|
|          e|94.20108522118414|  3448145|174923.70449616283|
|      house|95.71923325060698|  3338718|174899.70636633915|
|      close| 94.9004603649237|  3396436| 174896.0644278762|
|         my|95.19648545993361|  3203321|170380.99317156064|
|       time| 95.2376125554801|  3173435| 169657.5938950928|
|         be|94.70024204006448|  3198644|169369.04647434072|
|        not|93.43461413772452|  3232656|167991.59070206375|
|       home|96.47683896861794|  3004551|167229.48627856103|
|       easy|95.17947182275599|  3050529|166238.21595214322|
| definitely|95.83080206157007|  2772256|159559.17754859792|
|        est| 93.8650850734403|  2877984|159238.56883673763|
|       walk|95.03518309200224|  2800095|159026.97532524943|
|        out| 94.8822404317823|  2744346|157182.56212669692|
|       here|95.69576425772148|  2581531|153755.75759276975|
|    helpful|95.29366891599118|  2600929|153683.87297580077|
|         en|93.90334225927072|  2647341|152786.85559092654|
|       This|  95.239948063156|  2563113|152476.53936398905|
|       just|95.02492837790035|  2572530| 152411.5132089172|
|         un| 93.6565633733043|  2632244|151950.20487581272|
|       also|95.19813965220956|  2537160|151636.02437145615|
|         tr|94.13388111059297|  2581828|151254.95756859318|
|      which|94.70748894043716|  2547343|151156.92042574752|
|restaurants|95.41023770225424|  2498588| 150814.2236322502|
|         if|94.69323012018941|  2483663|149233.13585447246|
|       area|95.23842834502614|  2349167|145971.86797675598|
|       city|95.02954644584405|  2340823|145392.81477611206|
|      super| 95.6033348691873|  2307077|145212.52814732556|
|       Very|95.02841004155232|  2319004|144711.88873668524|
|        has|95.29689232907549|  2303751|144642.69556136854|
|          n| 93.5364786319922|  2389577| 144591.0896892946|
|      space| 95.6577555584908|  2256098|143680.94210296762|
|         by|94.77250781735309|  2289138| 143389.8334229332|
|     lovely|95.96121838963714|  2225101|143143.16576973384|
|         me|94.63210411802447|  2263124|142361.53816218747|
|      quiet|95.62769643965468|  2215375|142333.56248263494|
|        can|94.93997959458645|  2204317|140956.84128817485|
|  beautiful|96.39824455807377|  2133252|140795.96412913894|
|      check|94.41381449813223|  2098491|136769.44978582306|
|       back| 95.8136207053185|  2036385|136727.91674854665|
+-----------+-----------------+---------+------------------+
only showing top 100 rows

In the above table, we can find some words has very weighted score. but they do not show positive or negative emotion in real life. So we will ignore them and concentrate on the adjactives. Then, we list the most 5 postive adjactives as below.

  1. great
  2. clean
  3. nice
  4. recommend
  5. comfortable
weighted_rated_reviews.orderBy(f.desc('avg_scores')).filter(f.col('frequency') > 2000000).show(100)
+-----------+-----------------+---------+------------------+
|      words|       avg_scores|frequency|   weighted_scores|
+-----------+-----------------+---------+------------------+
|       home|96.47683896861794|  3004551|167229.48627856103|
|  beautiful|96.39824455807377|  2133252|140795.96412913894|
|     lovely|95.96121838963714|  2225101|143143.16576973384|
| definitely|95.83080206157007|  2772256|159559.17754859792|
|       back| 95.8136207053185|  2036385|136727.91674854665|
|      again|95.76117241289091|  3745677|185333.79413434947|
|comfortable|95.72393772561357|  3966960|  190655.556193693|
|      house|95.71923325060698|  3338718|174899.70636633915|
|    perfect| 95.7042850242088|  3475803| 178426.3371079245|
|       here|95.69576425772148|  2581531|153755.75759276975|
|      space| 95.6577555584908|  2256098|143680.94210296762|
|  recommend|95.64356330749465|  4672322|206738.86633391897|
|      quiet|95.62769643965468|  2215375|142333.56248263494|
|      super| 95.6033348691873|  2307077|145212.52814732556|
| everything|95.56136087136093|  3832809| 187085.8384615892|
|         We|95.53479996573648|  7623269| 263774.1432055377|
|        our|95.47740065772818|  6185534| 237459.3016153283|
|       stay| 95.4442867301635| 11069948| 317557.7580453187|
|restaurants|95.41023770225424|  2498588| 150814.2236322502|
|      great| 95.4029896860036| 10444642|308325.01688483945|
|        and|95.38446260851613| 56651028| 717929.0631763524|
|         us|95.35846774238718|  6248897|238375.13234979034|
|       well|95.35040109951755|  3636504| 181829.5898453963|
|      clean|95.30230785323009|  6873834| 249863.5681926265|
|        has|95.29689232907549|  2303751|144642.69556136854|
|       with|95.29559016574098| 11143412|318113.35198231135|
|    helpful|95.29366891599118|  2600929|153683.87297580077|
|      place|95.24744166661456| 11996775| 329902.4769323235|
|       This|  95.239948063156|  2563113|152476.53936398905|
|       area|95.23842834502614|  2349167|145971.86797675598|
|       time| 95.2376125554801|  3173435| 169657.5938950928|
|      would| 95.2274827012498|  4200737|195175.38742457228|
|       also|95.19813965220956|  2537160|151636.02437145615|
|         my|95.19648545993361|  3203321|170380.99317156064|
|         at|95.18798446582971|  6141300|   235891.49758361|
|       easy|95.17947182275599|  3050529|166238.21595214322|
|         so|95.17539793221641|  4542545| 202849.6769625939|
|       host| 95.1651470095354|  5710718|227417.18541364872|
|         an|95.14065317803181|  3720762|183519.43953982703|
|        all|95.13279709109176|  4427228|200168.71919574295|
|         of|95.12319003040311| 14630080| 363839.4250530417|
|          I|95.12058574757597| 12203482| 332289.3318177492|
|         to| 95.1163928305043| 35362264| 565620.8406016177|
|         It|95.10986162385046|  4923175| 211031.9376673511|
|       very| 95.1088069819577| 15197113|370767.17711119284|
|         we|95.09696650190183| 10011643|300898.02769293496|
|       were|95.09261953650125|  5946510|231887.79340753102|
|        you|95.09179293254158|  9094698|286772.28525679064|
|     really|95.07656651409276|  3717761| 183321.8466525952|
|        had|95.07513312131093|  5982701| 232549.5970911644|
|        are|95.07367498491193|  4488647|201427.14993116292|
|       from|95.05265281867527|  5186408|216469.93229633317|
|       walk|95.03518309200224|  2800095|159026.97532524943|
|       city|95.02954644584405|  2340823|145392.81477611206|
|       Very|95.02841004155232|  2319004|144711.88873668524|
|         is|95.02753095496094| 24109625| 466599.9340859086|
|       just|95.02492837790035|  2572530| 152411.5132089172|
|       have|95.01385633593384|  3423055|175789.88572218787|
|        for|94.99795637736182| 14382303| 360270.3153953516|
|        was|94.98309220656003| 26994652|493497.74286926084|
|          a| 94.9671802056149| 37460320| 581245.0728823153|
|      Great|94.96335454616879|  3742074|183701.30150652467|
|        The|94.94596874803665| 15598380|374987.05716002936|
|        can|94.93997959458645|  2204317|140956.84128817485|
|       this|94.93024145947574|  4778483|207515.05422092127|
|         in|94.91229512144986| 22275819|  447960.082154477|
|          s| 94.9018364609585| 13925262| 354141.0773476358|
|      close| 94.9004603649237|  3396436| 174896.0644278762|
|        out| 94.8822404317823|  2744346|157182.56212669692|
|         as|94.85934620833214|  5478473| 222029.0955895442|
|  apartment|94.82646076013864|  8509816| 276623.8531842354|
|           |94.81618909027294| 29747527| 517139.7600633213|
|       nice|94.80211029246783|  5352623|219331.66364097325|
|        the|94.78737839063318| 45500093| 639376.4765151071|
|         on|94.78545617625683|  5083945| 213718.5044028016|
|   location|94.77713941782233|  8275721|272650.63721528917|
|         by|94.77250781735309|  2289138| 143389.8334229332|
|      which|94.70748894043716|  2547343|151156.92042574752|
|        get|94.70643457843138|  2081737|136644.58236776834|
|         be|94.70024204006448|  3198644|169369.04647434072|
|         if|94.69323012018941|  2483663|149233.13585447246|
|         me|94.63210411802447|  2263124|142361.53816218747|
|         it|94.52071476886938|  8400142| 273949.3731020657|
|      there| 94.5004643538194|  3896813|186547.19587704944|
|       that|94.44120294469836|  4782697|206537.03786968024|
|      check|94.41381449813223|  2098491|136769.44978582306|
|       room|94.34828743789681|  3540076|177517.14901347584|
|          t|94.24757158348358|  4192238|192971.48278786556|
|          e|94.20108522118414|  3448145|174923.70449616283|
|          d| 94.1569644563994|  2026497| 134037.2260400446|
|         tr|94.13388111059297|  2581828|151254.95756859318|
|         et|94.12369670495197|  3646047|179725.66826057073|
|       good|94.08479450231043|  3657973| 179944.9605218761|
|        muy|94.05540382498624|  2035040|134174.57471043413|
|          y|93.98835911243717|  3555399|177222.24776193604|
|        but|93.90564827589859|  4909746|208075.63451729692|
|         en|93.90334225927072|  2647341|152786.85559092654|
|        est| 93.8650850734403|  2877984|159238.56883673763|
|         de|93.74897452229801|  6350455| 236248.4388482613|
|         un| 93.6565633733043|  2632244|151950.20487581272|
+-----------+-----------------+---------+------------------+
only showing top 100 rows

Then, we sort the average score and drop the words whose frequencies are less than 2 million, the most positive adjactives could be as below:

  1. beautiful
  2. lovely
  3. comfortable
  4. perfect
  5. recommend