The overall goal of the project is to analyze and mine a Yelp review text data set to discover useful knowledge and solve a real-world data mining problem. I will work on a restaurant review data set provided by Yelp.com. The project builds on several data science and computer science topics, such as pattern discovery, clustering, text retrieval, text mining, and data visualization, all of which are needed to mine the data effectively, discover interesting insights, and meet the requirements described below.
Explore and visualize the review content to understand what people have said in those reviews.
This requirement is met by producing a visualization of a topic model. A topic model helps convey what reviewers have written about in these reviews, and it can also convey the similarities between topics.
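As a rough illustration of what fitting a topic model involves, the sketch below runs a small collapsed Gibbs sampler for LDA using only the Python standard library. The choice of LDA, the function names `fit_lda` and `top_words`, the hyperparameters, and the toy reviews are all assumptions made for illustration; they are not taken from the project itself, and the actual plotting step is left out.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts. All parameter values are illustrative.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, a simple text stand-in for a plot."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


# Hypothetical toy reviews, already tokenized.
sample_docs = [
    "pizza pasta pizza sauce".split(),
    "sushi rice fish sushi".split(),
    "pasta sauce cheese".split(),
    "fish rice roll".split(),
]
_, sample_topics = fit_lda(sample_docs, num_topics=2, iters=50)
for k, words in enumerate(top_words(sample_topics)):
    print(f"topic {k}: {', '.join(words)}")
```

The top words per topic are the kind of summary a topic visualization would display; on a real data set a library implementation would likely be a better choice than this hand-written sampler.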
Mine the data set to understand the landscape of different types of cuisines and their similarities.
This requirement is met by constructing a cuisine map. A cuisine map conveys the landscape of different types of cuisines and their similarities, including which cuisines are available and how they relate to one another. A visualization that incorporates clustering can also convey the major categories of cuisines.
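One way the similarities behind a cuisine map might be computed is sketched below: each cuisine is represented by the word counts of its reviews, and cuisines are compared with cosine similarity. The use of raw term counts, the function names `cosine` and `cuisine_similarity`, and the sample reviews are assumptions for illustration only; the project may use a different representation or similarity measure.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cuisine_similarity(reviews_by_cuisine):
    """Build a cuisine-by-cuisine similarity matrix.

    reviews_by_cuisine maps a cuisine name to a list of review strings.
    Returns (names, matrix) where matrix[i][j] compares names[i], names[j].
    """
    vectors = {
        cuisine: Counter(w for text in reviews for w in text.lower().split())
        for cuisine, reviews in reviews_by_cuisine.items()
    }
    names = sorted(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical toy data standing in for reviews grouped by cuisine.
sample = {
    "italian": ["pasta and pizza with tomato sauce"],
    "pizza": ["pizza with cheese and tomato"],
    "sushi": ["fresh fish and rice rolls"],
}
names, matrix = cuisine_similarity(sample)
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

The resulting matrix could be drawn as a heat map, and a clustering step over its rows could then reorder the cuisines into groups.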
Mine the data set to rank restaurants for a specific dish.
This requirement is met by ranking the dish names of a particular cuisine based on the reviews that mention them, e.g., by placing popular dishes with positive comments at the top. I can also rank the restaurants that offer a particular dish based on reviews about that dish, so that restaurants whose reviews contain many positive comments about the dish are ranked highly. The result can be presented as a visualization showing the ranking of the recommended restaurants.
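The restaurant ranking described above can be sketched as follows. This is a minimal illustration that assumes each review is a (restaurant, stars, text) tuple and that a review with at least four stars counts as a positive comment; the function name `rank_restaurants`, the star threshold, and the sample reviews are hypothetical and not taken from the project.

```python
from collections import defaultdict


def rank_restaurants(reviews, dish, positive_stars=4):
    """Rank restaurants by the number of positive reviews mentioning a dish.

    reviews is an iterable of (restaurant, stars, text) tuples. A review is
    treated as positive when stars >= positive_stars (an assumed threshold).
    Returns a list of (restaurant, positive_mentions, total_mentions).
    """
    dish = dish.lower()
    stats = defaultdict(lambda: [0, 0])  # restaurant -> [positive, total]
    for restaurant, stars, text in reviews:
        if dish in text.lower():
            stats[restaurant][1] += 1
            if stars >= positive_stars:
                stats[restaurant][0] += 1
    ranked = [(name, pos, total) for name, (pos, total) in stats.items()]
    # Most positive mentions first; ties are broken alphabetically.
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Hypothetical sample reviews.
sample_reviews = [
    ("Thai Spoon", 5, "Great pad thai, will come back"),
    ("Thai Spoon", 4, "Pad Thai was tasty"),
    ("Noodle Bar", 5, "Loved the pad thai"),
    ("Noodle Bar", 2, "The pad thai was bland"),
    ("Curry Hut", 5, "Nice green curry"),
]
for name, positive, total in rank_restaurants(sample_reviews, "pad thai"):
    print(f"{name}: {positive} positive of {total} mentions")
```

The same counts could feed a bar chart of recommended restaurants. Substring matching on the dish name is a simplification; a sentiment model applied to the sentences mentioning the dish would be a more careful signal than star ratings.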
Use machine learning on hygiene data to predict the hygiene condition of a restaurant.
This requirement is met by training a classifier on the hygiene data to make binary predictions about a restaurant's hygiene condition.
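To make the idea of a binary hygiene classifier concrete, the sketch below trains a small multinomial Naive Bayes model on review text using only the Python standard library. The choice of Naive Bayes, the class name `NaiveBayes`, the label convention (1 for a hygiene problem, 0 otherwise), and the training sentences are all assumptions for illustration; the source does not say which classifier or features are used.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[c] / total_docs)
            for w in text.lower().split():
                if w in self.vocab:  # words unseen in training are skipped
                    score += math.log(
                        (self.word_counts[c][w] + 1)
                        / (total_words + len(self.vocab))
                    )
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Hypothetical training data: 1 = hygiene problem, 0 = no problem.
train_texts = [
    "dirty kitchen and pests found",
    "clean kitchen and friendly staff",
    "pests near dirty floor",
    "spotless clean dining room",
]
train_labels = [1, 0, 1, 0]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("dirty floor with pests"))
print(model.predict("clean friendly staff"))
```

On the real data set the classifier would be evaluated on held-out restaurants, and additional features beyond review text could be added.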