Introduction

So you want to be a Data Scientist and don’t know where to start? Well, you’ve come to the right place.

Today you’ll learn a little about the economy, a whole bunch of data science principles and practice, and where to stay on your next trip to New York City!

New York

Data Science Pipeline

First, let’s look at the data science pipeline.

The data science pipeline starts with defining what major questions one wants to answer and subsequently acquiring and importing the relevant data to be analyzed.

Then the data is viewed and tidied. Tidying assumes a rectangular data structure and requires three things: each observation (called an entity) forms a row, each variable (called an attribute) forms a column, and each observational unit (type of entity) forms a table.

This leads to exploratory data analysis, where the data is transformed and visualized. Data cleaning may be necessary to handle missing data: the missing values may be removed, encoded, or, for a numeric variable, imputed (for example, replaced with the mean of the non-missing values).
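As a small illustration of the imputation option, the sketch below replaces the missing values of a numeric variable with the mean of its non-missing values. The `price` vector is made-up toy data, not part of the Airbnb dataset used later.

```r
# Toy numeric variable with two missing values (hypothetical data)
price <- c(100, NA, 150, 200, NA)

# Mean of the non-missing values
price_mean <- mean(price, na.rm = TRUE)

# Replace each NA with that mean, leave the other values untouched
price_imputed <- ifelse(is.na(price), price_mean, price)

price_imputed
```

Mean imputation keeps every row but shrinks the variance of the variable, so it is only one of several reasonable choices.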

Hypothesis testing and machine learning (ML) modeling are the final steps before the data and its results can be communicated.

Image of Pipeline

Source: https://r4ds.had.co.nz/explore-intro.html

Economy & Vacations

Companies like Airbnb have given rise to the sharing economy, a model in which goods and services are exchanged on a peer-to-peer level, usually through online community platforms. This new model has made it possible for a great many people to gain another source of income, and for you to have an affordable vacation.

As more sharing economy companies have opened, like Airbnb and Uber, the way we vacation has changed. This change has been documented and open data on it is available.

DataSet Used

The data we will be using in this tutorial is New York City Airbnb Open Data from Kaggle. We will use this data to look at the relationships between types of housing and location.

Preparing Data

Download the dataset.

In this section, we will learn how to load in our dataset, view the data in our dataset, and clean it up so it’s easy for us to work with.

First, let’s load in the following libraries so we can use certain functions:

# for data wrangling
library(tidyverse)
library(dplyr)

# for data analysis
library(geosphere)
library(ggplot2)
library(broom)

Loading Data

CSV files are files that include data which are “comma-separated values”, meaning that data values are literally separated by commas.

After we’ve downloaded our CSV file from Kaggle into our working directory, we can use the read.csv function to load the CSV file into a data frame, which is a table of the data.

# create a dataframe from our CSV file
airbnb_tab <- read.csv("AB_NYC_2019.csv", header=TRUE)

There are some attributes that we don’t need for our purposes, like host_id, host_name, minimum_nights, number_of_reviews, last_review, reviews_per_month, and calculated_host_listings_count. So, let’s remove these from our data frame:

# a vector called "to_remove" that has the names of the attributes we don't want
to_remove <- c('host_id', 
              'host_name', 
               'minimum_nights', 
               'number_of_reviews', 
               'last_review', 
               'reviews_per_month', 
               'calculated_host_listings_count')

# removing attributes from data frame using "to_remove"
airbnb_tab = airbnb_tab[ , !(names(airbnb_tab) %in% to_remove)]
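The same column removal can also be written with dplyr, which is already loaded above. This is a minimal sketch on a made-up stand-in for airbnb_tab; `toy_tab` and its values are hypothetical.

```r
library(dplyr)

# Toy data frame standing in for airbnb_tab (hypothetical values)
toy_tab <- data.frame(id = 1:2,
                      host_id = c(10, 20),
                      price = c(149, 225))

# Names of the attributes we don't want
to_remove <- c("host_id")

# Drop the unwanted columns; all_of() raises an error if a name is misspelled
toy_tab <- toy_tab %>%
  select(-all_of(to_remove))

names(toy_tab)
```

Either form works. The select version reads a little closer to the rest of the pipeline-style code in this tutorial.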

Viewing Data

Here, we see the first 10 rows in our dataset:

knitr::kable(head(airbnb_tab, n=10))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365
2539	Clean & quiet apt home by the park	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	365
2595	Skylit Midtown Castle	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	355
3647	THE VILLAGE OF HARLEM….NEW YORK !	Manhattan	Harlem	40.80902	-73.94190	Private room	150	365
3831	Cozy Entire Floor of Brownstone	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	194
5022	Entire Apt: Spacious Studio/Loft by central park	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80	0
5099	Large Cozy 1 BR Apartment In Midtown East	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	129
5121	BlissArtsSpace!	Brooklyn	Bedford-Stuyvesant	40.68688	-73.95596	Private room	60	0
5178	Large Furnished Room Near B’way	Manhattan	Hell’s Kitchen	40.76489	-73.98493	Private room	79	220
5203	Cozy Clean Guest Room - Family Apt	Manhattan	Upper West Side	40.80178	-73.96723	Private room	79	0
5238	Cute & Cozy Lower East Side 1 bdrm	Manhattan	Chinatown	40.71344	-73.99037	Entire home/apt	150	188

Some Notes:

knitr::kable() is used to make the table “pretty” and easier to read
head(df, n=10) is used to view the dataframe with a specific number of rows (head() is not always necessary, you can just list the data frame for it to render)
the first argument is where the dataframe goes, in this case, airbnb_tab
n = determines the number of rows visible, in this case, 10

The following is a list of descriptions for the attributes of our data set:

Attribute	Description/Unit
`id`	Unique ID for each Airbnb listing
`name`	Name or description of the Airbnb listing
`neighbourhood_group`	Boroughs of New York (Manhattan, Brooklyn, Queens, Bronx, Staten Island)
`neighbourhood`	Neighborhoods of New York
`latitude`	Degrees of latitude, measures distance North and South from Equator
`longitude`	Degrees of longitude, measures distance East and West of Prime Meridian
`room_type`	Type of space offered (Entire home/apt, Private room, Shared room)
`price`	Price of listing, in US Dollars
`availability_365`	Number of days in a year when the listing is available for booking

Tidying Data

A tidy dataset has the elements listed below.

Elements of a tidy dataset:

Each observation/entity forms a row
Each variable/attribute forms a column
Each observational unit (type of entity) forms a table

Our dataset is already tidy and meets the criteria above. Each entity is a row and each attribute is a column, and the table holds a single type of entity (Airbnb listings), with no entity dependent on another.

However, if your data set is untidy, below is an example on a different small dataset, to show you what to do.

Sample Tidying

Let’s tidy up this small messy-airlines.csv dataset containing data about the arrival status of two airlines across five destinations.

airline_schedule <- read.csv("messy-airlines.csv", header=TRUE) # read downloaded CSV file into a dataframe (array of data)
knitr::kable(airline_schedule)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
Alaska	on time	497	221	212	503	1841
Alaska	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Let’s rename the first 2 columns from “X” to “airline” and “X.1” to “status” in order to clarify what the values are representing:

airline_schedule = rename(airline_schedule, c("airline"="X", "status"="X.1"))
knitr::kable(airline_schedule)

airline	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
Alaska	on time	497	221	212	503	1841
Alaska	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Now, notice that the headers for the rest of the columns are data values, e.g. Los Angeles, Phoenix, San Diego, San Francisco, and Seattle! Column headers should only be variable names that describe the values; otherwise, the dataset is considered “untidy”.

1. Create a new column called “destination” to describe where each airline is heading. We use the gather function from tidyr, which takes a set of column names and places them into a single key column, “destination”, and collects the cells of those columns into a single value column, “count”.
2. Make the values of the “status” attribute their own attributes. We do this to decrease the number of entries containing the same airline and destination information. We use the spread function from tidyr, which does the inverse of gather by spreading the values of “count” into separate columns named after the values of “status”.
3. Get rid of the full stop (.) in destination names for readability. We use the mutate function to reassign every value in the “destination” column to a value without the full stop character, i.e. “Los.Angeles” becomes “Los Angeles”. The gsub function matches a regular expression (regex) and replaces all occurrences of the full stop character with a space character.
4. Rearrange the entities to sort by destination first, and then by airline. The arrange function helps us with this.

tidy_data <- gather(airline_schedule, "destination", "count", 3:7) %>% # Step 1
  spread(status, count) %>% # Step 2
  mutate(destination=gsub("\\.", " ", destination)) %>% # Step 3
  arrange(destination, airline) # Step 4

knitr::kable(tidy_data)

airline	destination	delayed	on time
Alaska	Los Angeles	62	497
AM WEST	Los Angeles	117	694
Alaska	Phoenix	12	221
AM WEST	Phoenix	415	4840
Alaska	San Diego	20	212
AM WEST	San Diego	65	383
Alaska	San Francisco	102	503
AM WEST	San Francisco	129	320
Alaska	Seattle	305	1841
AM WEST	Seattle	61	201

Now this data is tidy!
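In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The sketch below redoes the four steps with the newer functions on a cut-down copy of the airline table (two destinations only, values taken from the table above). The order of rows within a destination may depend on your dplyr version and locale.

```r
library(dplyr)
library(tidyr)

# Cut-down copy of the renamed airline_schedule table (two destinations only)
airline_schedule <- data.frame(
  airline = c("Alaska", "Alaska", "AM WEST", "AM WEST"),
  status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415)
)

tidy_data <- airline_schedule %>%
  # Step 1: city columns become a "destination" key and a "count" value
  pivot_longer(cols = -c(airline, status),
               names_to = "destination",
               values_to = "count") %>%
  # Step 2: each status value becomes its own column
  pivot_wider(names_from = status, values_from = count) %>%
  # Step 3: replace the literal full stop with a space
  mutate(destination = gsub(".", " ", destination, fixed = TRUE)) %>%
  # Step 4: sort by destination, then airline
  arrange(destination, airline)

tidy_data
```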

Remember to avoid these common problems that are found in messy data:

column headers are values, not variable names
multiple variables stored in one column
variables stored in both rows and columns
multiple types of observational units are stored in the same table
single observational unit stored in multiple tables

You can read more about how to fix these problems at CMSC 320 Tidying Data Lecture Notes by Professor Hector Corrada Bravo.

Exploratory Data Analysis

In this section, we begin exploring what our data can tell us using visualizations. This will help us to better understand our data and help us make decisions about how we may want to further manipulate the data to see something specific, or decide which methods are best for modelling and Machine Learning!

The main reason for exploratory data analysis, or EDA, is to help us find any problems in our data preparation and gain a sense of variable properties, such as central trends (mean), spread (variance), skew, outliers, and relationships between pairs of variables, like their correlation or covariance.

You can read more about EDA at CMSC 320 EDA Lecture Notes by Professor Hector Corrada Bravo.
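Before plotting anything, the variable properties mentioned above can be checked with a few base R functions. The sketch below uses two short made-up vectors; `price` and `dist` here are hypothetical and are not taken from the Airbnb data.

```r
# Toy prices and distances (hypothetical values, for illustration only)
price <- c(60, 79, 89, 150, 225)
dist <- c(6.1, 4.3, 5.2, 3.0, 0.3)

mean(price)       # central trend
var(price)        # spread
summary(price)    # min, quartiles and max, useful for spotting skew and outliers
cor(price, dist)  # relationship between a pair of variables
```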

Handling Missing Data

Recall that the attribute availability_365 tells us how many days in the year that this particular listing is available for people to book.

Notice that 0 is a value for some of the entities (Airbnb listings). It doesn’t make much sense for us to look at entities that aren’t available at all during the year. In fact, more than 17,000 entities are listed as being available for 0 days out of the year! That’s about 1/3 of our dataset.

We’ll call this “missing data”, and remove these entities from our dataset:

airbnb_tab <- airbnb_tab %>%
  filter(availability_365 > 0) # filter() is used to filter the dataframe via specific conditions

knitr::kable(head(airbnb_tab, n=10))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365
2539	Clean & quiet apt home by the park	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	365
2595	Skylit Midtown Castle	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	355
3647	THE VILLAGE OF HARLEM….NEW YORK !	Manhattan	Harlem	40.80902	-73.94190	Private room	150	365
3831	Cozy Entire Floor of Brownstone	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	194
5099	Large Cozy 1 BR Apartment In Midtown East	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	129
5178	Large Furnished Room Near B’way	Manhattan	Hell’s Kitchen	40.76489	-73.98493	Private room	79	220
5238	Cute & Cozy Lower East Side 1 bdrm	Manhattan	Chinatown	40.71344	-73.99037	Entire home/apt	150	188
5295	Beautiful 1br on Upper West Side	Manhattan	Upper West Side	40.80316	-73.96545	Entire home/apt	135	6
5441	Central Manhattan/near Broadway	Manhattan	Hell’s Kitchen	40.76076	-73.98867	Private room	85	39
5803	Lovely Room 1, Garden, Best Area, Legal rental	Brooklyn	South Slope	40.66829	-73.98779	Private room	89	314

Note that one way to handle missing data, as mentioned in the data science pipeline section (data cleaning), is to remove it altogether. Here we treat a value of 0 for availability_365 as a form of missing data.
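It can help to count how much data a filter will drop before applying it. The sketch below does this on a made-up stand-in for airbnb_tab; `toy_tab` and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for airbnb_tab (hypothetical values)
toy_tab <- data.frame(id = 1:6,
                      availability_365 = c(365, 0, 194, 0, 129, 220))

# Count and share of listings that are never available
n_zero <- sum(toy_tab$availability_365 == 0)
share_zero <- mean(toy_tab$availability_365 == 0)

# Keep only listings available at least one day a year
toy_tab <- toy_tab %>%
  filter(availability_365 > 0)

nrow(toy_tab)
```

Running the same two counting lines on the real data frame before the filter is how a figure like “about 1/3 of our dataset” can be checked.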

Data Visualizations

What are data visualizations? Data visualizations are representations of data and/or information in a visual format (like a graph or chart). These visualizations allow patterns and large amounts of data to be presented and communicated straightforwardly. Data is valuable and is used to gain insight, so the way it’s presented is important. Humans are visual; it’s how the brain works!

There are many different forms of data visualization, each of which has its own advantages.

The type of data to be analyzed (categorical and/or numeric) is an important factor in deciding which data visualization is applicable and pertinent.

Want to know more about picking the best data visualization for your message? Click here.

Interactive Map

Another layer of data visualization is the addition of interactions. Interactive data visualizations add more functionality and allow users to learn more about the massive amounts of data presented and about how the data points relate to one another.

There are many types of interactive data visualizations and each type has its own benefits. Here are some examples of captivating visualizations and how they are effective.

Let’s first create the map using the leaflet package. The map will be centered in NYC.

# Load the library used to build and control maps
library(leaflet)

# Creating NYC Map
nyc_map <- leaflet(airbnb_tab) %>%
  addTiles() %>%
  setView(
    lat=40.730610, #set latitude of NYC
    lng=-73.935242, #set longitude of NYC
    zoom=11)

nyc_map #outputs the map

Now, let’s add our data via location icons. The latitude and longitude coordinate for each listing will be used for icon placement on the map.

leaflet(airbnb_tab) %>% # pass in the dataframe
  addTiles() %>%
    addAwesomeMarkers(
      #pass in given longitude of entity
      lng = ~longitude,
      
      #pass in given latitude of entity
      lat = ~latitude,
      
      # Setting up icon for entity on map
      icon = awesomeIcons(
              icon = 'ios-close',
              iconColor = 'black',
              library = 'ion',
              # Determines color of the icon on map with nested if else statements
              markerColor = ~ifelse(room_type == 'Entire home/apt', "green", 
                                    ifelse(room_type =='Private room', "orange", 
                                           "red"
                                    )
                            )
            ),
      # Price Label
      label = ~paste("$", as.character(price), "per night"),
    
      # Clustering for identifying listing density
      clusterOptions = markerClusterOptions()
    ) %>%
  addLegend(
    position = 'bottomright', 
    # Color keys correspond to labels, respectively
    colors = c("green", "orange", "red"),
    labels = c("Entire Home/Apt", "Private Room", "Shared Room"),
    title = 'Types of Rentals'
  )

The icon color depends on the room_type attribute, which has three categories: entire homes or apartments are green, private rooms are orange, and shared rooms are red. The icons also show a label when hovered over; the label contains the price of the listing, which makes it easier to view and compare listings.
The map contains a legend so that the user knows how to interpret the map’s colors.
A clustering function was added for density analysis based on the coordinates of the listings on the map; it can be omitted.

This map could help you plan your next trip to NYC and save money just by staying across the street. Who knew!
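The nested ifelse calls used for markerColor work, but they get harder to read as categories are added. One alternative, sketched here on a made-up vector, is a named lookup vector. The names `room_colors` and `marker_colors` are hypothetical; the colors are the ones used on the map.

```r
# Lookup table from room type to marker color (same colors as the map above)
room_colors <- c("Entire home/apt" = "green",
                 "Private room" = "orange",
                 "Shared room" = "red")

# Toy vector of room types (hypothetical values)
room_type <- c("Private room", "Entire home/apt", "Shared room")

# as.character() guards against room_type being a factor, which would
# otherwise index by integer code; unname() keeps only the colors
marker_colors <- unname(room_colors[as.character(room_type)])

marker_colors
```

The same vector can also supply the legend, via `names(room_colors)` for the labels and `unname(room_colors)` for the colors, so the map and legend cannot drift apart.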

Boxplots

Boxplots, although simple, are very useful for viewing the distribution of a numeric variable, creating insight into the range and summary statistics of the data. If multiple boxplots are used, we can view the relationship between categorical and numeric attributes as well.

Here we look at the range of listing prices based on the room_type attribute in the dataset. We will split up the listings by room type and look at each group in turn to see how their ranges compare to one another.

The three step process is:

Filter the data into a new dataframe based on the room type
Graph the data via a boxplot
Section off the y-axis range of the boxplot to create an effective and interpretable visualization

Purpose: We want to see the price ranges of listings by neighborhood group, to see how they compare to one another. (We will separate the data by room type, as the price for an entire house versus a shared room on the same street will vary greatly.)

First, let’s graph the Entire home/apt room type.

We filter the data.

# Load the libraries used for plotting
library(ggplot2)  #data visualization package for the statistical programming
library(ggthemes) #package for themes

# Filter all the Entire home/apt listings into a new dataframe
airbnb_home <- airbnb_tab %>%
  filter(room_type == 'Entire home/apt')

# View new table
knitr::kable(head(airbnb_home, n=10))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365
2595	Skylit Midtown Castle	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	355
3831	Cozy Entire Floor of Brownstone	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	194
5099	Large Cozy 1 BR Apartment In Midtown East	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	129
5238	Cute & Cozy Lower East Side 1 bdrm	Manhattan	Chinatown	40.71344	-73.99037	Entire home/apt	150	188
5295	Beautiful 1br on Upper West Side	Manhattan	Upper West Side	40.80316	-73.96545	Entire home/apt	135	6
6848	Only 2 stops to Manhattan studio	Brooklyn	Williamsburg	40.70837	-73.95352	Entire home/apt	140	46
7097	Perfect for Your Parents + Garden	Brooklyn	Fort Greene	40.69169	-73.97185	Entire home/apt	215	321
7726	Hip Historic Brownstone Apartment with Backyard	Brooklyn	Crown Heights	40.67592	-73.94694	Entire home/apt	99	21
7750	Huge 2 BR Upper East Cental Park	Manhattan	East Harlem	40.79685	-73.94872	Entire home/apt	190	249
8490	MAISON DES SIRENES1,bohemian apartment	Brooklyn	Bedford-Stuyvesant	40.68371	-73.94028	Entire home/apt	120	233

Now, we create a boxplot to see the range:

airbnb_home %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Entire Homes & Apts. Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

These boxplots are ineffective: the outliers are so far out of range that the boxes themselves are hardly viewable. So what do we do?

We now limit the range of the y-axis to make the visualization useful. Note that setting limits with scale_y_continuous drops listings priced above the limit before the boxes are drawn, so some outliers are cut off. The heart of our data is still present, and limiting the range does not disrupt the purpose of this visualization.

airbnb_home %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 1500)) +
  
  coord_flip() +  #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Homes & Apts. Prices (Up to $1500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

Here we have it! A visualization that shows the price ranges of listings of entire homes and apartments, by neighborhood group.

Let’s hope you’re not getting ripped off.
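One thing to keep in mind about the plot above: `scale_y_continuous(limits = ...)` removes out-of-range rows before the boxplot statistics are computed, whereas `coord_cartesian(ylim = ...)` only zooms the view. The sketch below shows the difference on a tiny made-up data frame (`toy` is hypothetical) by reading the computed median back out of each plot with `layer_data()`.

```r
library(ggplot2)

# Toy data: four ordinary prices and one extreme outlier (hypothetical values)
toy <- data.frame(group = "A", price = c(50, 100, 150, 200, 5000))

p <- ggplot(toy, aes(x = group, y = price)) +
  geom_boxplot()

# Scale limits: the 5000 row is dropped before the statistics are computed
clipped <- p + scale_y_continuous(limits = c(0, 1500))

# Coordinate limits: all rows are kept, only the visible window changes
zoomed <- p + coord_cartesian(ylim = c(0, 1500))

# Median drawn in each boxplot (the warning is about the dropped row)
clipped_median <- suppressWarnings(layer_data(clipped)$middle)
zoomed_median <- layer_data(zoomed)$middle

c(clipped = clipped_median, zoomed = zoomed_median)
```

With many thousands of listings and only a few above the limit, the two approaches give nearly the same boxes, which is why the plots in this tutorial remain useful. `coord_flip()` also accepts `xlim` and `ylim` arguments, which may be an option if the statistics should reflect every listing.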

Second, let’s graph the Private room type.

We filter the data, again, into a new dataframe:

# Filter out all the Private room listings into new dataframe
airbnb_room <- airbnb_tab %>%
  filter(room_type == 'Private room')

# View new table
knitr::kable(head(airbnb_room, n=10))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365
2539	Clean & quiet apt home by the park	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	365
3647	THE VILLAGE OF HARLEM….NEW YORK !	Manhattan	Harlem	40.80902	-73.94190	Private room	150	365
5178	Large Furnished Room Near B’way	Manhattan	Hell’s Kitchen	40.76489	-73.98493	Private room	79	220
5441	Central Manhattan/near Broadway	Manhattan	Hell’s Kitchen	40.76076	-73.98867	Private room	85	39
5803	Lovely Room 1, Garden, Best Area, Legal rental	Brooklyn	South Slope	40.66829	-73.98779	Private room	89	314
6021	Wonderful Guest Bedroom in Manhattan for SINGLES	Manhattan	Upper West Side	40.79826	-73.96113	Private room	85	333
7322	Chelsea Perfect	Manhattan	Chelsea	40.74192	-73.99501	Private room	140	12
8024	CBG CtyBGd HelpsHaiti rm#1:1-4	Brooklyn	Park Slope	40.68069	-73.97706	Private room	130	347
8025	CBG Helps Haiti Room#2.5	Brooklyn	Park Slope	40.67989	-73.97798	Private room	80	364
8110	CBG Helps Haiti Rm #2	Brooklyn	Park Slope	40.68001	-73.97865	Private room	110	304

Again, we create a boxplot to see the range:

airbnb_room %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Private Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

These boxplots are ineffective as well.

So this time, we pull in the range of y even more and reduce it to 500 dollars, as private rooms usually cost much less per night than entire homes.

airbnb_room %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 500)) +
  
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Private Room Prices (Up to $500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

Now this graph is much more effective than the previous one.

We’re almost done, just one more to go.

Lastly, let’s graph the Shared room type.

We filter the data for the last time:

# Filter out all the Shared room listings into new dataframe
airbnb_sroom <- airbnb_tab %>%
  filter(room_type == 'Shared room')

# View new table
knitr::kable(head(airbnb_sroom, n=10))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365
12048	LowerEastSide apt share shortterm 1	Manhattan	Lower East Side	40.71401	-73.98917	Shared room	40	188
54453	MIDTOWN WEST - Large alcove studio	Manhattan	Hell’s Kitchen	40.76548	-73.98474	Shared room	105	363
173072	Cozy Pre-War Harlem Apartment	Manhattan	Harlem	40.80827	-73.95329	Shared room	49	248
391948	Single Room	Queens	Ozone Park	40.68581	-73.84642	Shared room	45	364
467634	yahmanscrashpads	Queens	Jamaica	40.67747	-73.76493	Shared room	39	353
564751	Artist space for creative nomads.	Manhattan	Upper West Side	40.80165	-73.96287	Shared room	76	324
737126	Williamsburg Loft!! Bedford L 1blk!	Brooklyn	Williamsburg	40.71714	-73.95447	Shared room	195	364
765203	Art Lover’s Abode Brooklyn	Brooklyn	Williamsburg	40.70745	-73.94307	Shared room	52	88
773497	Great spot in Brooklyn	Brooklyn	Bedford-Stuyvesant	40.69407	-73.94551	Shared room	200	365
819206	Cute shared studio apartment	Manhattan	East Harlem	40.79106	-73.95058	Shared room	45	313

Again, we create a boxplot to see the range:

airbnb_sroom %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Shared Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

And again the y-axis values need to be trimmed. This time we will trim it down to $200.

airbnb_sroom %>%  #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 200)) +
  
  coord_flip() +   #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Shared Room Prices (Up to $200/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

We’re all done now. So what have we learned?

We learned that Manhattan has the priciest listings regardless of room type. What other conclusions can we make from these visualizations?
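A conclusion read off a boxplot can be double-checked with a grouped summary. The sketch below computes the median price per room type and neighborhood group on a made-up stand-in for airbnb_tab. The values in `toy_tab` are hypothetical, so the numbers it prints say nothing about the real data.

```r
library(dplyr)

# Toy stand-in for airbnb_tab (hypothetical values)
toy_tab <- data.frame(
  neighbourhood_group = c("Manhattan", "Manhattan", "Brooklyn",
                          "Brooklyn", "Manhattan", "Brooklyn"),
  room_type = c("Private room", "Private room", "Private room",
                "Private room", "Shared room", "Shared room"),
  price = c(150, 79, 149, 60, 40, 52)
)

# Median price and listing count for every room type / borough pair
price_summary <- toy_tab %>%
  group_by(room_type, neighbourhood_group) %>%
  summarise(median_price = median(price), n = n(), .groups = "drop")

price_summary
```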

Hypothesis Testing & Machine Learning

What is Hypothesis Testing?

Okay, so we have data. But what does it all mean? We need to first interpret our data to make assumptions about it, and test if our assumptions are valid! We refer to an assumption as a “hypothesis”. Conducting “hypothesis testing” will help us quantify the validity of our assumptions to certain questions about the data (Machine Learning Mastery).

To conduct hypothesis testing, we will first plot linear regressions over the distribution of our dataset and generate statistics about the regressions. These statistics will help us answer questions about the data!

What is Machine Learning?

We’ve all heard of it, but what is it really? Machine learning uses algorithms and statistics in order to find patterns in substantial amounts of data (meaning anything that can be digitally stored). In machine learning, a model is created using mathematical and statistical functions that are able to be modified (either manually or automatically, dependent on the type of ML) till it can make accurate predictions with new data.

Linear and logistic regression are both machine learning algorithms: the former is a supervised regression method and the latter is a supervised classification method.

The patterns found are often put to use in, for example, the recommendation systems of many of today’s technologies.

ML & Hypothesis Testing Walk Through

With datasets that are large, it can be very useful to generate a linear regression, or a line of “best fit”, for an easier interpretation of the data. This data analysis technique is also an effective way to learn about general trends of our data set and lets us construct confidence intervals and do hypothesis testing, which analyzes and tests for relationships between variables.
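To make the idea concrete before turning to the Airbnb data, here is a minimal sketch that fits a line of best fit to simulated data and reads off the statistics used for hypothesis testing. The data is generated inside the block (it is not the Airbnb data) and the variable names are hypothetical.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data: price falls by about $10 per mile, plus random noise
dist <- runif(100, min = 0, max = 10)
price <- 200 - 10 * dist + rnorm(100, sd = 20)

# Fit the linear regression price = intercept + slope * dist
fit <- lm(price ~ dist)

coef(summary(fit))          # estimate, standard error, t value, p-value
confint(fit, level = 0.95)  # 95% confidence intervals for both coefficients
```

The null hypothesis here is that the slope on dist is zero, meaning no linear relationship. A small p-value in the dist row, or a confidence interval that excludes zero, is evidence against it.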

We want to look at the relationship between price and distance away from Times Square in New York City, the most populous city in the United States. We are looking at Times Square since it is a major commercial intersection, tourist destination, entertainment center, and neighborhood in the Midtown Manhattan section of NYC (Wikipedia).

For these reasons, we would like to see if Airbnb listing prices increase as the distance to Times Square (latitude 40.757, longitude -73.986) decreases, and vice versa. We will be using functions from the geosphere library to calculate distance between coordinates.

First, let’s add an attribute called distToTimesSquare in our dataset. This will contain the distance (in miles) between each listing and Times Square.

coordsTimeSquare <- c(-73.986, 40.757) # vector of Times Square coordinates, first longitude, second latitude

airbnb_tab <- airbnb_tab %>%
  # distHaversine is vectorized: pass a two-column (longitude, latitude) matrix
  # to get the distance in meters from every listing to Times Square
  mutate(distToTimesSquare = distHaversine(cbind(longitude, latitude),
                                           coordsTimeSquare) / 1609) # divide by 1609 to convert meters to miles

knitr::kable(head(airbnb_tab))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365	distToTimesSquare
2539	Clean & quiet apt home by the park	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	365	7.6101584
2595	Skylit Midtown Castle	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	355	0.2614253
3647	THE VILLAGE OF HARLEM….NEW YORK !	Manhattan	Harlem	40.80902	-73.94190	Private room	150	365	4.2767099
3831	Cozy Entire Floor of Brownstone	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	194	5.1585482
5099	Large Cozy 1 BR Apartment In Midtown East	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	129	0.8654731
5178	Large Furnished Room Near B’way	Manhattan	Hell’s Kitchen	40.76489	-73.98493	Private room	79	220	0.5487460
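For readers curious about what the haversine distance computes, the sketch below writes the formula out by hand in base R. The function name `haversine_miles` is hypothetical, and the sphere radius of 6378137 meters is an assumption chosen to line up with the distances in the table above.

```r
# Great-circle distance in miles between two (longitude, latitude) points.
# r is the sphere radius in meters; 6378137 is an assumed value.
haversine_miles <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  meters <- 2 * r * asin(sqrt(a))
  meters / 1609  # same meters-to-miles factor as used above
}

# Listing 2595 (Skylit Midtown Castle) to Times Square: roughly 0.26 miles
haversine_miles(-73.98377, 40.75362, -73.986, 40.757)
```

The formula treats the Earth as a perfect sphere, which is a reasonable simplification over the few miles that separate listings within one city.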

Second, let’s split our current airbnb_tab data frame into three data frames, one with room_type == "Entire home/apt", one with room_type == "Private room", and one with room_type == "Shared room". This is because prices are much higher for “Entire home/apt” listings, so we don’t want to get confused when regressing against distance. We only want to see the relation between distance and prices, not between prices and size of the space being listed!

# create new dataframe of listings where room_type=="Entire home/apt"
entire_tab <- airbnb_tab %>%
  filter(room_type == "Entire home/apt")

# create new dataframe of listings where room_type=="Private room"
private_tab <- airbnb_tab %>%
  filter(room_type == "Private room")

# create new dataframe of listings where room_type=="Shared room"
shared_tab <- airbnb_tab %>%
  filter(room_type == "Shared room")

knitr::kable(head(entire_tab))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365	distToTimesSquare
2595	Skylit Midtown Castle	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	355	0.2614253
3831	Cozy Entire Floor of Brownstone	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	194	5.1585482
5099	Large Cozy 1 BR Apartment In Midtown East	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	129	0.8654731
5238	Cute & Cozy Lower East Side 1 bdrm	Manhattan	Chinatown	40.71344	-73.99037	Entire home/apt	150	188	3.0224159
5295	Beautiful 1br on Upper West Side	Manhattan	Upper West Side	40.80316	-73.96545	Entire home/apt	135	6	3.3701851
6848	Only 2 stops to Manhattan studio	Brooklyn	Williamsburg	40.70837	-73.95352	Entire home/apt	140	46	3.7708536

knitr::kable(head(private_tab))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365	distToTimesSquare
2539	Clean & quiet apt home by the park	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	365	7.610158
3647	THE VILLAGE OF HARLEM….NEW YORK !	Manhattan	Harlem	40.80902	-73.94190	Private room	150	365	4.276710
5178	Large Furnished Room Near B’way	Manhattan	Hell’s Kitchen	40.76489	-73.98493	Private room	79	220	0.548746
5441	Central Manhattan/near Broadway	Manhattan	Hell’s Kitchen	40.76076	-73.98867	Private room	85	39	0.295381
5803	Lovely Room 1, Garden, Best Area, Legal rental	Brooklyn	South Slope	40.66829	-73.98779	Private room	89	314	6.138165
6021	Wonderful Guest Bedroom in Manhattan for SINGLES	Manhattan	Upper West Side	40.79826	-73.96113	Private room	85	333	3.137898

knitr::kable(head(shared_tab))

id	name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	availability_365	distToTimesSquare
12048	LowerEastSide apt share shortterm 1	Manhattan	Lower East Side	40.71401	-73.98917	Shared room	40	188	2.978924
54453	MIDTOWN WEST - Large alcove studio	Manhattan	Hell’s Kitchen	40.76548	-73.98474	Shared room	105	363	0.590397
173072	Cozy Pre-War Harlem Apartment	Manhattan	Harlem	40.80827	-73.95329	Shared room	49	248	3.939358
391948	Single Room	Queens	Ozone Park	40.68581	-73.84642	Shared room	45	364	8.821836
467634	yahmanscrashpads	Queens	Jamaica	40.67747	-73.76493	Shared room	39	353	12.832089
564751	Artist space for creative nomads.	Manhattan	Upper West Side	40.80165	-73.96287	Shared room	76	324	3.318301
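
The three filter() calls above could also be collapsed into a single step. As a minimal sketch, base R's split() returns a named list with one data frame per room_type. Here toy_tab is a small made-up stand-in for airbnb_tab, not the real data.

```r
# toy_tab is a hypothetical stand-in for airbnb_tab (made-up values)
toy_tab <- data.frame(
  id        = 1:6,
  room_type = c("Entire home/apt", "Private room", "Shared room",
                "Entire home/apt", "Private room", "Entire home/apt"),
  price     = c(225, 79, 40, 150, 85, 135),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per distinct room_type
by_room <- split(toy_tab, toy_tab$room_type)

names(by_room)                               # the room types found in the data
entire_toy <- by_room[["Entire home/apt"]]   # pull out one piece by name
sapply(by_room, nrow)                        # number of listings in each piece
```

Separate filter() calls, as used in this tutorial, are more explicit; split() is handy when the number of groups is not known in advance.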

Third, we want to create a scatter plot of the prices of listings against their distance to Times Square. We’ll also add a regression line to this scatter plot to show the general increasing or decreasing trend in our data! Let’s do this three times, once for each room_type we are interested in.

entire_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) + # columns are looked up in entire_tab
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 1500) + # show prices from $0 to $1500; listings outside this range are dropped
    labs(title="Homes & Apts. Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

private_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) + # columns are looked up in private_tab
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 500) + # show prices from $0 to $500; listings outside this range are dropped
    labs(title="Private Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

shared_tab %>%
    ggplot(aes(x=distToTimesSquare, y=price)) + # columns are looked up in shared_tab
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 200) + # show prices from $0 to $200; listings outside this range are dropped
    labs(title="Shared Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")
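
One detail worth knowing about these plots: ylim() discards listings whose price falls outside the limits before geom_smooth() fits its line, so the drawn line is based only on the listings that remain. A possible alternative, sketched here on made-up data (toy_tab is hypothetical), is coord_cartesian(), which zooms the view without dropping any rows.

```r
library(ggplot2)

# toy_tab is a hypothetical stand-in for entire_tab (made-up values)
toy_tab <- data.frame(
  distToTimesSquare = c(0.3, 0.9, 3.0, 3.4, 3.8, 5.2),
  price             = c(225, 200, 150, 135, 140, 2000)  # one extreme price
)

p <- ggplot(toy_tab, aes(x = distToTimesSquare, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +  # fitted on all six rows
  coord_cartesian(ylim = c(0, 1500)) +           # zoom only; no rows are dropped
  labs(x = "Distance to Times Square (miles)", y = "Price (USD)")

p
```

Which behavior is preferable depends on whether the extreme listings should influence the trend line; neither choice is wrong, but it helps to know which one is in effect.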

Lastly, let’s analyze the resulting models quantitatively using broom::tidy.

entire_fit <- lm(distToTimesSquare~price, data=entire_tab) # simple regression with distance as the response and price as the predictor
entire_fit %>%
  tidy() # turn model into a tibble with information about the model

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  4.43    0.0284        156.  0.      
## 2 price       -0.00149 0.0000760     -19.6 6.45e-85

private_fit <- lm(distToTimesSquare~price, data=private_tab) # simple regression with distance as the response and price as the predictor
private_fit %>%
  tidy() # turn model into a tibble with information about the model

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  5.44     0.0277       197.  0.      
## 2 price       -0.00237  0.000141     -16.8 7.26e-63

shared_fit <- lm(distToTimesSquare~price, data=shared_tab) # simple regression with distance as the response and price as the predictor
shared_fit %>%
  tidy() # turn model into a tibble with information about the model

## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  5.41      0.128       42.3  2.95e-212
## 2 price       -0.00555   0.00109     -5.11 3.93e-  7

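As we can see in all three of these linear regression plots, the price of every type of listing decreases slowly as the listing gets further away from Times Square. Note that the models above are fit as distToTimesSquare ~ price, so distance is the response and each price coefficient is in miles per dollar: on average, a listing that costs one dollar more sits about 0.00149 miles (homes and apts), 0.00237 miles (private rooms), and 0.00555 miles (shared rooms) closer to Times Square. To read off the change in price for each additional mile, the model would need to be fit the other way around, as price ~ distToTimesSquare.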
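
The direction of the formula matters when reading a slope. As a minimal sketch on made-up numbers (toy_tab is hypothetical, not the Airbnb data), fitting price ~ distToTimesSquare gives a slope in dollars per mile, while distToTimesSquare ~ price gives one in miles per dollar:

```r
# toy_tab is a hypothetical data frame with made-up values
toy_tab <- data.frame(
  distToTimesSquare = c(1, 2, 3, 4, 5),
  price             = c(200, 170, 160, 120, 100)
)

# price as the response: the slope is in dollars per mile
price_on_dist <- lm(price ~ distToTimesSquare, data = toy_tab)
# distance as the response: the slope is in miles per dollar
dist_on_price <- lm(distToTimesSquare ~ price, data = toy_tab)

slope_usd_per_mile <- coef(price_on_dist)[["distToTimesSquare"]]
slope_mile_per_usd <- coef(dist_on_price)[["price"]]

slope_usd_per_mile  # -25 for these made-up numbers
slope_mile_per_usd

# The two slopes are not reciprocals in general: their product equals r^2
r <- cor(toy_tab$distToTimesSquare, toy_tab$price)
all.equal(slope_usd_per_mile * slope_mile_per_usd, r^2)
```

Both fits share the same sign and the same p-value for the slope, so the hypothesis test is unaffected; only the units and the wording of the interpretation change.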

Even though we can clearly see a trend in our linear regressions, it is best to conduct hypothesis testing to determine whether there is a statistically significant relationship between Airbnb prices and their distance from high-traffic locations, such as Times Square in New York City (Statistics How To).

Let’s ask the question: Do we reject the null hypothesis of no relationship between price and distance from Times Square?

Our answer: Yes, we reject the null hypothesis, since the p-values for the price coefficient in all three linear regressions are far smaller than 0.05. A p-value at or below 0.05 means that, if there were truly no relationship, a slope at least this far from zero would be expected at most 5% of the time, so our results are unlikely to be due to chance alone (Statistics How To). It does not by itself show that the effect is large or that the results are repeatable.
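
To make the test concrete, the p-value and a 95% confidence interval for a slope can be pulled straight from a fitted model. The sketch below uses base R's summary() and confint() on made-up data (toy_tab is hypothetical), so the numbers are illustrative only.

```r
# toy_tab is a hypothetical data frame with made-up values
toy_tab <- data.frame(
  distToTimesSquare = c(1, 2, 3, 4, 5, 6),
  price             = c(210, 170, 165, 120, 110, 80)
)
fit <- lm(price ~ distToTimesSquare, data = toy_tab)

# coefficient table: estimate, standard error, t statistic, p-value
coef_table <- coef(summary(fit))
slope_p <- coef_table["distToTimesSquare", "Pr(>|t|)"]

# 95% confidence interval for the slope
slope_ci <- confint(fit, "distToTimesSquare", level = 0.95)

# reject the null hypothesis of zero slope at the 0.05 level?
reject_null <- slope_p <= 0.05
reject_null
slope_ci
```

A confidence interval that excludes zero tells the same story as a p-value below 0.05, and it also shows how large the slope plausibly is. With the real models, broom::tidy(entire_fit, conf.int = TRUE) should add the interval columns to the tibble shown earlier.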

You can read more about Linear Regression at CMSC 320 Linear Regression Lecture Notes by Professor Hector Corrada Bravo.

Conclusion

Through this dataset of 2019 Airbnb listings in New York City, we can conclude that the distance between a listed Airbnb and a high-traffic location, such as Times Square in NYC, is negatively correlated with the listing’s price per night. We saw this through linear regressions and hypothesis tests!

We also saw that the price range of listings varies greatly across the five boroughs of NYC (independent of room type) and that room type is directly related to the median listing price in each borough.

In this tutorial, we have only scratched the surface of what we can do with data sets using techniques frequently found in the field of data science. There is so much more to learn! We encourage you to visit the references throughout this tutorial to learn more, and to download and experiment with different data sets. You can find data from these various repositories and more: