Introduction

So you want to be a Data Scientist and don’t know where to start? Well, you’ve come to the right place.

Today you’ll learn a little about the economy, a whole bunch of data science principles and practice and where to stay on your next trip to New York City!

New York

Data Science Pipeline

First, let’s look at the data science pipeline.

The data science pipeline starts with defining what major questions one wants to answer and subsequently acquiring and importing the relevant data to be analyzed.

Then, the data is viewed and data tidying must occur; where a rectangular data structure model is assumed and three requirements must be met. Each observation (called an entity) forms a row, each variable (called an attribute) forms a column and each observational unit (type of entity) forms a table.

Leading to the exploratory data analysis process, where the data is transformed and visualized. Data cleaning may be necessary for missing data. When handling missing data, the missing data may be removed, encoded or imputation (replace missing values with the mean of non-missing values) of a numeric variable may be necessary.

Hypothesis testing and machine learning (ML) modeling are the final steps before the data and its results can be communicated.

Image of Pipeline

Source: https://r4ds.had.co.nz/explore-intro.html

Economy & Vacations

The rise of companies like Airbnb has given rise to The Sharing Economy. The sharing economy is a model defined as the facilitation of goods and services on a peer-to-peer level usually through online community platforms. This new model has made it possible for a great deal of people to gain another source of income and for you to have an affordable vacation.

As more sharing economy companies have opened, like Airbnb and Uber, the way we vacation has changed. This change has been documented and open data on it is available.

DataSet Used

The data we will be using in this tutorial is New York City Airbnb Open Data from Kaggle. We will use this data to look at the relationships between types of housing and location.

Preparing Data

Download the dataset.

In this section, we will learn how to load in our dataset, view the data in our dataset, and clean it up so it’s easy for us to work with.

First, let’s load in the following libraries so we can use certain functions:

# for data wranging
library(tidyverse)
library(dplyr)

# for data analysis
library(geosphere)
library(ggplot2)
library(broom)

Loading Data

CSV files are files that include data which are “comma-separated values”, meaning that data values are literally separated by commas.

After we’ve downloaded our CSV file from Kaggle into our working directory, we can use the read_csv function to load the CSV file data into our program’s data frame, which is a table of the data.

# create a dataframe from our CSV file
airbnb_tab <- read.csv("AB_NYC_2019.csv", header=TRUE)

There are some attributes that we don’t need for our purposes, like host_id, host_name, minimum_nights, number_of_reviews, last_review, reviews_per_month, and calculated_host_listings_count. So, let’s remove these from our data frame:

# a vector called "to_remove" that has the names of the attributes we don't want
to_remove <- c('host_id', 
              'host_name', 
               'minimum_nights', 
               'number_of_reviews', 
               'last_review', 
               'reviews_per_month', 
               'calculated_host_listings_count')

# removing attributes from data frame using "to_remove"
airbnb_tab = airbnb_tab[ , !(names(airbnb_tab) %in% to_remove)]

Viewing Data

Here, we see the first 10 rows in our dataset:

knitr::kable(head(airbnb_tab, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5022 Entire Apt: Spacious Studio/Loft by central park Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 0
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5121 BlissArtsSpace! Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 0
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5203 Cozy Clean Guest Room - Family Apt Manhattan Upper West Side 40.80178 -73.96723 Private room 79 0
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188

Some Notes:

  • knitr::kable() is used to make the table “pretty” and easier to read

  • head(df, n=10) is used to view the dataframe with a specific number of rows (head() is not always necessary, you can just list the data frame for it to render)

  • the first argument is where the dataframe goes, in this case, airbnb_tab

  • n = determines the number of rows visible, in this case, 10

The following is a list of descriptions for the attributes of our data set:

Attribute Description/Unit
id Unique ID for each Airbnb listing
name Name or description of the Airbnb listing
neighbourhood_group Boroughs of New York (Manhattan, Brooklyn, Queens, Bronx, Staten Island)
neighbourhood Neighborhoods of New York
latitude Degrees of latitude, measures distance North and South from Equator
longitude Degrees of longitude, measure distance East and West of Prime Meridian
room_type Type of space offered (Entire home/apt, Private room, Shared room)
price Price of listing, in US Dollars
availability_365 Number of days in a year when the listing is available for booking

Tidying Data

Tidying Data entails the elements listed in the list below.

Elements of a tidy dataset:

  1. Each observation/entity forms a row
  2. Each variable/attribute forms a column
  3. Each observational unit (type of entity) forms a column (i.e. not dependent on one another)

Our dataset is already tidy and meets the criteria above. Each entity is a row and each attribute is a column, where no entity is dependent on another.

However, if your data set is untidy, below is an example on a different small dataset, to show you what to do.

Sample Tidying

Let’s tidy up this small messy-airlines.csv dataset containing data about the arrival status of two airlines across five destinations.

airline_schedule <- read.csv("messy-airlines.csv", header=TRUE) # read downloaded CSV file into a dataframe (array of data)
knitr::kable(airline_schedule)
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
Alaska on time 497 221 212 503 1841
Alaska delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Let’s rename the first 2 columns from “X” to “airline” and “X.1” to “status” in order to clarify what the values are representing:

airline_schedule = rename(airline_schedule, c("airline"="X", "status"="X.1"))
knitr::kable(airline_schedule)
airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
Alaska on time 497 221 212 503 1841
Alaska delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Now, notice that there are data values are the headers for the rest of the columns, e.g. Los Angeles, Phoenix, San Diego, San Francisco, and Seattle! We have to make sure that column headers are only variable names that describe the values, otherwise, the dataset is considered “untidy”.

  1. Create a new column called “destination” to describe where each airline is heading. We use the gather dplyr function, which takes a set of column names and places them into a single key column, “destination”, and collects the cells of those columns into a single value column, “count”.
  2. Make values of “status” attribute as their own attributes. We do this in order to decrease the number of entries containing the same airline and destination information. We use the spread dplyr function, which does the inverse of gather by spreading columns, “status” and “count”, into separate columns.
  3. Get rid of the full stop (.) in destination names for readability. We use the mutate function to modify the destination column. We are reassigning all the values in the “destination” column to values without the full stop character in city names, ie. “Los.Angeles” to “Los Angeles”. We use the gsub function for pattern matching with regular expressions (regex) and replaces all occurrences of the full stop character with a space character.
  4. Rearrange the entities to sort by destination first, and then airlines. The arrange function helps us with this.
tidy_data <- gather(airline_schedule, "destination", "count", 3:7) %>% # Step 1
  spread(status, count) %>% # Step 2
  mutate(destination=gsub("\\.", " ", destination)) %>% # Step 3
  arrange(destination, airline) # Step 4

knitr::kable(tidy_data)
airline destination delayed on time
Alaska Los Angeles 62 497
AM WEST Los Angeles 117 694
Alaska Phoenix 12 221
AM WEST Phoenix 415 4840
Alaska San Diego 20 212
AM WEST San Diego 65 383
Alaska San Francisco 102 503
AM WEST San Francisco 129 320
Alaska Seattle 305 1841
AM WEST Seattle 61 201

Now this data is tidy!

Remember to avoid these common problems that is found with messy data:

  • column headers are values, not variable names
  • multiple variables stored in one column
  • variables stored in both rows and columns
  • multiple types of observational units are stored in the same table
  • single observational unit stored in multiple tables

You can read more about how to fix these problems at CMSC 320 Tidying Data Lecture Notes by Professor Hector Corrada Bravo.

Exploratory Data Analysis

In this section, we begin exploring what our data can tell us using visualizations. This will help us to better understand our data and help us make decisions about how we may want to further manipulate the data to see something specific, or decide which methods are best for modelling and Machine Learning!

The main reason for exploratory data analysis, or EDA, is to help us find any problems in our data preparation and gain a sense of variable properties, such as central trends (mean), spread (variance), skew, outliers, and relationships between pairs of variables, like their correlation or covariance.

You can read more about EDA at CMSC 320 EDA Lecture Notes by Professor Hector Corrado Bravo.

Handling Missing Data

Recall that the attribute availability_365 tells us how many days in the year that this particular listing is available for people to book.

Notice that 0 is a value for some of the entities (Airbnb listings). It doesn’t make much sense for us to look at entities that aren’t available at all during the year. In fact, more than 17000 entities are listed at being available for 0 days out of the year! That’s about 1/3 of our dataset.

We’ll call this “missing data”, and remove these entities from our dataset:

airbnb_tab <- airbnb_tab %>%
  filter(availability_365 > 0) # filter() is used to filter the dataframe via specific conditions

knitr::kable(head(airbnb_tab, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314

Note that a way to handle missing data, as mentioned in the data science pipline section (data cleaning), is removing missing data altogether. Having 0 as a value for availability_365 is a form of missing data.

Data Visualizations

What are data visualizations? Data visualizations are representations of data and/or information in visual format (like a graph or chart). These visualizations allow for patterns and larger data to be presented and communicated straightforwardly. Data is used to gain insight and is valuable so the way it’s presented is important. Humans are visual, it’s how the brain works!

There are many different forms of data visualization, each of which have their own advantages.

The types of data to be analyzed (categorical and/or numeric) is an important factor in deciding what data visualization is applicable and pertinent.

Want to know more about picking the best data visualization for your message, click here.

Interactive Map

Another layer of data visualizations is the addition of interactions. Interactive data visulizations add more functionality and allow the users to learn more about the massive amounts of data presented and the data its relationship relative to itself.

There are many types of interactive data visualizations and each type has there own benefits. Here are some examples of captivating visualizations and how they are effective.

Let’s first create the map using the leaflet package. The map will be centered in NYC.

# Download necessary library to integrate and control maps
library(leaflet)

# Creating NYC Map
nyc_map <- leaflet(airbnb_tab) %>%
  addTiles() %>%
  setView(
    lat=40.730610, #set latitude of NYC
    lng=-73.935242, #set longitude of NYC
    zoom=11)

nyc_map #outputs the map

Now, let’s add our data via location icons. The latitude and longitude coordinate for each listing will be used for icon placement on the map.

leaflet(airbnb_tab) %>% #pass in 
  addTiles() %>%
    addAwesomeMarkers(
      #pass in given longitude of entity
      lng = ~longitude,
      
      #pass in given latitude of entity
      lat = ~latitude,
      
      # Setting up icon for entity on map
      icon = awesomeIcons(
              icon = 'ios-close',
              iconColor = 'black',
              library = 'ion',
              # Determines color of the icon on map with nested if else statements
              markerColor = ~ifelse(room_type == 'Entire home/apt', "green", 
                                    ifelse(room_type =='Private room', "orange", 
                                           "red"
                                    )
                            )
            ),
      # Price Label
      label = ~paste("$", as.character(price), "per night"),
    
      # Clustering for identifying arrest density
      clusterOptions = markerClusterOptions()
    ) %>%
  addLegend(
    position = 'bottomright', 
    # Color keys correspond to values, respectively
    colors= c("green", "orange", "red"), labels=c("Entire Home/Apt", "Private Room", "Shared Room"), 
    title='Types of Rentals', 
  )
  • The icon color is dependent on the room_type attribute that has three categories; enitre home or appt. is green, private rooms are orange and shared rooms are red. Another functionality is that the icons have labels when they are hovered over, the labels contain the price of the listing as to contribute to the ease of viewing and comparing listings.

  • The map contains a legend so that the user knows how to interpret the map’s colors.

  • A clustering function was added and can be omitted, it was just added for density analysis based on coordinates of listing on the map.

This map could help you plan your next trip to NYC and save money just by staying across the street. Who knew!

Boxplots

Boxplots, although simple, are very useful to view the relationship of numeric variables relationship of numeric variables relative to each other, creating insight into the range and stats of the data. If multiple boxplots are used, we can view the relationship between the categorical and numeric attributes, as well.

Here we look at the complete range listing prices based on the room_type attribute in the dataset. We will split up the listing and look at them subsequently to see how their ranges compare to one another.

The three step process is:

  1. Filter the data into a new dataframe based on the room type
  2. Graph the data via a boxplot
  3. Section off the y-axis range of the boxplot to create an effective and interpretable visulization

Purpose: We want to see the listing’s price ranges of places based on neighborhood groups, to see how they compare to one another. (We will separate the data by room type, as the price for an entire house versus a shared room on the same street will vary greatly.)

First, let’s graph the Entire home/appt. room type.

We filter the data.

# Download subsequent libraries
library(ggplot2)  #data visualization package for the statistical programming
library(ggthemes) #package for themes

# Filter out all the Enitre home/appt listings into new dataframe
airbnb_home <- airbnb_tab %>%
  filter(room_type == 'Entire home/apt')

# View new table
knitr::kable(head(airbnb_home, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6
6848 Only 2 stops to Manhattan studio Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 46
7097 Perfect for Your Parents + Garden Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 321
7726 Hip Historic Brownstone Apartment with Backyard Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 21
7750 Huge 2 BR Upper East Cental Park Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 249
8490 MAISON DES SIRENES1,bohemian apartment Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 233

Now, we create a boxplot to see the range:

airbnb_home %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Entire Homes & Appts. Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

These boxplots are ineffective as the data is hardly viewable, since the outliers are so far out of the range. So what do we do?

We now filter the range of y-axis accordingly to make the visulaization useful. While some outliers will be cut off, the heart of our data is still present, and limiting of the range does not disrupt the purpose of this visualization.

airbnb_home %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 1500)) +
  
  coord_flip() +  #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Homes & Appts. Prices (Up to $1500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

Here we have it! A visualization that shows the price ranges of listing of entire homes and apartments, by neighboorhood groups.

Let’s hope you’re not getting ripped off.

Second, let’s graph the Private room type.

We filter the data, again, into a new dataframe:

# Filter out all the Private room listings into new dataframe
airbnb_room <- airbnb_tab %>%
  filter(room_type == 'Private room')

# View new table
knitr::kable(head(airbnb_room, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314
6021 Wonderful Guest Bedroom in Manhattan for SINGLES Manhattan Upper West Side 40.79826 -73.96113 Private room 85 333
7322 Chelsea Perfect Manhattan Chelsea 40.74192 -73.99501 Private room 140 12
8024 CBG CtyBGd HelpsHaiti rm#1:1-4 Brooklyn Park Slope 40.68069 -73.97706 Private room 130 347
8025 CBG Helps Haiti Room#2.5 Brooklyn Park Slope 40.67989 -73.97798 Private room 80 364
8110 CBG Helps Haiti Rm #2 Brooklyn Park Slope 40.68001 -73.97865 Private room 110 304

Again, we create a boxplot to see the range:

airbnb_room %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Private Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

These boxplots are ineffective as well.

So this time, we pull in the range of y, even more and reduce it down to 500 dollars, as private rooms usually cost much less per night than entire homes.

airbnb_room %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 500)) +
  
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Private Room Prices (Up to $500/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

Now this graph is much more effective than the previous one.

We’re almost done, just one more to go.

Lastly, lets graph the Shared room type

We filter the data for the last time:

# Filter out all the Shared room listings into new dataframe
airbnb_sroom <- airbnb_tab %>%
  filter(room_type == 'Shared room')

# View new table
knitr::kable(head(airbnb_sroom, n=10))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365
12048 LowerEastSide apt share shortterm 1 Manhattan Lower East Side 40.71401 -73.98917 Shared room 40 188
54453 MIDTOWN WEST - Large alcove studio Manhattan Hell’s Kitchen 40.76548 -73.98474 Shared room 105 363
173072 Cozy Pre-War Harlem Apartment Manhattan Harlem 40.80827 -73.95329 Shared room 49 248
391948 Single Room Queens Ozone Park 40.68581 -73.84642 Shared room 45 364
467634 yahmanscrashpads Queens Jamaica 40.67747 -73.76493 Shared room 39 353
564751 Artist space for creative nomads. Manhattan Upper West Side 40.80165 -73.96287 Shared room 76 324
737126 Williamsburg Loft!! Bedford L 1blk! Brooklyn Williamsburg 40.71714 -73.95447 Shared room 195 364
765203 Art Lover’s Abode Brooklyn Brooklyn Williamsburg 40.70745 -73.94307 Shared room 52 88
773497 Great spot in Brooklyn Brooklyn Bedford-Stuyvesant 40.69407 -73.94551 Shared room 200 365
819206 Cute shared studio apartment Manhattan East Harlem 40.79106 -73.95058 Shared room 45 313

Again, we create a boxplot to see the range:

airbnb_sroom %>%   #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  geom_boxplot()+   #creating a boxplot
  coord_flip() +    #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "Shared Room Price By Neighborhood in 2019",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

And again the y-axis values need to be trimmed. This time we will trim it down to $200.

airbnb_sroom %>%  #send in dataframe
  # Setting up what data is used from the dataframe
  ggplot(aes(x = neighbourhood_group, y = price)) +
  
  geom_boxplot()+   #creating a boxplot
  
  # Limiting the y-axis to get a better view of data
  scale_y_continuous(limits = c(0, 200)) +
  
  coord_flip() +   #flipping the coordinates to have horizontal view
  
  # Themes for graph
  theme_economist() + 
  scale_fill_economist() +
  
  # Setting up title and axis labels
  labs(title = "2019 NYC Shared Room Prices (Up to $200/night)",
       x = "Major Neighborhood Groups",
       y = "Price(USD)")

We’re all done now. So what have we learned?

We learned that Manhattan has the pricest listing independent of the room type of the listing. What other conclusions can we make from these visualizations?

Hypothesis Testing & Machine Learning

What is Hypothesis Testing?

Okay, so we have data. But what does it all mean? We need to first interpret our data to make assumptions about it, and test if our assumptions are valid! We refer to an assumption as a “hypothesis”. Conducting “hypothesis testing” will help us quantify the validity of our assumptions to certain questions about the data (Machine Learning Mastery).

To conduct hypothesis testing, we will first plot linear regressions over the distribution of our dataset and generate statistics about the regressions. These statistics will help us answer questions about the data!

What is Machine Learning?

We’ve all heard of it, but what is it really? Machine learning uses algorithms and statistics in order to find patterns in substantial amounts of data (meaning anything that can be digitally stored). In machine learning, a model is created using mathematical and statistical functions that are able to be modified (either manually or automatically, dependent on the type of ML) till it can make accurate predictions with new data.

Linear and logistic regressions are both machine learning algorithms, one based on supervised regression and the other is based on supervised classification, respectively.

The patterns found are usually used for recommendation systems in many of today’s technologies.

ML & Hypothesis Testing Walk Through

With datasets that are large, it can be very useful to generate a linear regression, or a line of “best fit”, for an easier interpretation of the data. This data analysis technique is also an effective way to learn about general trends of our data set and lets us construct confidence intervals and do hypothesis testing, which analyzes and tests for relationships between variables.

We want to look at the relationship between price and distance away from Times Square in New York City, one of the largest populated cities in New York. We are looking at Time Square since it is a major commercial intersection, tourist destination, entertainment center, and neighborhood in the Midtown Manhattan section of NYC (Wikipedia).

For these reasons, we would like to see if Airbnb listings would increase as their distance to Times Square (latitude 40.757, longitude -73.986) decreases, and vice versa. We will be using functions from the geosphere library to calculate distance between coordinates.

First, let’s add an attribute called distToTimesSquare in our dataset. This will contain the distance (in miles) between each listing and Times Square.

coordsTimeSquare <- c(-73.986, 40.757) # vector of Times Square coordinates, first longitude, second latitude

airbnb_tab <- airbnb_tab %>%
  mutate(distToTimesSquare = by(airbnb_tab, 1:nrow(airbnb_tab), # calculate distance using distHaversine function
                                function(row) { 
                                  distHaversine(c(row$longitude, row$latitude), coordsTimeSquare)
                                }) / 1609) # divide by 1609 to convert meters to miles

knitr::kable(head(airbnb_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365 7.6101584
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355 0.2614253
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365 4.2767099
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194 5.1585482
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129 0.8654731
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220 0.5487460

Second, let’s split our current airbnb_tab data frame into two data frames, one with room_type == "Entire home/apt" and one with room_type == "Private room". This is because prices are much more expensive for “Entire home/apt” listings, so we don’t want to get confused when regressing against distance. We only want to see the relation between distance and prices, not between prices and size of the space being listed!

# create new dataframe of listings where room_type=="Entire home/apt"
entire_tab <- airbnb_tab %>%
  filter(room_type == "Entire home/apt")

# create new dataframe of listings where room_type=="Private room"
private_tab <- airbnb_tab %>%
  filter(room_type == "Private room")

# create new dataframe of listings where room_type=="Shared room"
shared_tab <- airbnb_tab %>%
  filter(room_type == "Shared room")

knitr::kable(head(entire_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2595 Skylit Midtown Castle Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 355 0.2614253
3831 Cozy Entire Floor of Brownstone Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 194 5.1585482
5099 Large Cozy 1 BR Apartment In Midtown East Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 129 0.8654731
5238 Cute & Cozy Lower East Side 1 bdrm Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 188 3.0224159
5295 Beautiful 1br on Upper West Side Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 6 3.3701851
6848 Only 2 stops to Manhattan studio Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 46 3.7708536
knitr::kable(head(private_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
2539 Clean & quiet apt home by the park Brooklyn Kensington 40.64749 -73.97237 Private room 149 365 7.610158
3647 THE VILLAGE OF HARLEM….NEW YORK ! Manhattan Harlem 40.80902 -73.94190 Private room 150 365 4.276710
5178 Large Furnished Room Near B’way Manhattan Hell’s Kitchen 40.76489 -73.98493 Private room 79 220 0.548746
5441 Central Manhattan/near Broadway Manhattan Hell’s Kitchen 40.76076 -73.98867 Private room 85 39 0.295381
5803 Lovely Room 1, Garden, Best Area, Legal rental Brooklyn South Slope 40.66829 -73.98779 Private room 89 314 6.138165
6021 Wonderful Guest Bedroom in Manhattan for SINGLES Manhattan Upper West Side 40.79826 -73.96113 Private room 85 333 3.137898
knitr::kable(head(shared_tab))
id name neighbourhood_group neighbourhood latitude longitude room_type price availability_365 distToTimesSquare
12048 LowerEastSide apt share shortterm 1 Manhattan Lower East Side 40.71401 -73.98917 Shared room 40 188 2.978924
54453 MIDTOWN WEST - Large alcove studio Manhattan Hell’s Kitchen 40.76548 -73.98474 Shared room 105 363 0.590397
173072 Cozy Pre-War Harlem Apartment Manhattan Harlem 40.80827 -73.95329 Shared room 49 248 3.939358
391948 Single Room Queens Ozone Park 40.68581 -73.84642 Shared room 45 364 8.821836
467634 yahmanscrashpads Queens Jamaica 40.67747 -73.76493 Shared room 39 353 12.832089
564751 Artist space for creative nomads. Manhattan Upper West Side 40.80165 -73.96287 Shared room 76 324 3.318301

Third, we want to create a scatter plot of the prices of listings against their distance to Times Square. We’ll also add a regression line to this scatter plot to the general increasing or decreasing trend in our data! Let’s do this three times, once for each room_type we are interested in.

entire_tab %>%
    ggplot(aes(x=entire_tab$distToTimesSquare,y=entire_tab$price)) +
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 1500) + # set the upper limit of prices to $1500
    labs(title="Homes & Appts. Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

private_tab %>%
    ggplot(aes(x=private_tab$distToTimesSquare,y=private_tab$price)) +
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 500) + # set the upper limit of prices to $500
    labs(title="Private Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

shared_tab %>%
    ggplot(aes(x=shared_tab$distToTimesSquare,y=shared_tab$price)) +
    geom_point() + # plot points for scatter plot
    geom_smooth(method=lm) + # plot linear regression line or line of best fit
    ylim(0, 200) + # set the upper limit of prices to $200
    labs(title="Shared Room Prices vs Distance to Times Square", x="Distance to Times Square (miles)", y="Price (USD)")

Lastly, let’s analyze the resulting models quantitatively using broom::tidy.

entire_fit <- lm(distToTimesSquare~price, data=entire_tab) # create the simple regression model
entire_fit %>%
  tidy() # turn model into a tibble with information about the model
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  4.43    0.0284        156.  0.      
## 2 price       -0.00149 0.0000760     -19.6 6.45e-85
private_fit <- lm(distToTimesSquare~price, data=private_tab) # create the simple regression model
private_fit %>%
  tidy() # turn model into a tibble with information about the model
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  5.44     0.0277       197.  0.      
## 2 price       -0.00237  0.000141     -16.8 7.26e-63
shared_fit <- lm(distToTimesSquare~price, data=shared_tab) # create the simple regression model
shared_fit %>%
  tidy() # turn model into a tibble with information about the model
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  5.41      0.128       42.3  2.95e-212
## 2 price       -0.00555   0.00109     -5.11 3.93e-  7

As we can see in all three of these linear regression plots, the prices of all the types of listing decreases slowly as the location of the listing gets further away from Times Square. From the models, it is clear that prices of Airbnb listings decrease by 0.00149 (homes and apts), 0.00237 (private rooms), and 0.00555 (shared rooms) on average each mile further away from Times Square.

Even though we can clearly see a trend in our linear regressions, it is best to conduct hypothesis testing in order to determine if our results are valid and there is a significantly meaningful relationship between Airbnb prices and their distance away from high traffic locations, such as Times Square in New York City (Statistics How To).

Let’s ask the question: Do we reject the null hypothesis of no relationship between price and distance from Times Square?

Our answer: Yes, we reject the null hypothesis since the p-values for all three linear regressions are significantly smaller than 0.05. A p-value less than or equal to 0.05 means that the results for our data holds, that our data is repeatable, and that our results didn’t just happen by chance (Statistics How To).

You can read more about Linear Regression at CMSC 320 Linear Regression Lecture Notes by Professor Hector Corrada Bravo.

Conclusion

Through this dataset of 2019 Airbnb listings in New York City, we can conclude that the distance between a listed Airbnb and a highly populated location, such as Times Square in NYC, is negatively correlated to the listing’s price per night. We saw this through linear regressions and conducting hypothesis tests!

We also saw the price range of listings vary greatly in the five boroughs of NYC (independent of room type) and that room type is directly related to the median pricing listing of rooms in different boroughs in NYC.

In this tutorial, we only have scraped the surface of what we can do with data sets using techniques frequently found in the field of data science. There is so much more to learn! We encourage you to visit the references throughout this tutorial to learn more, and to download and mess with different data sets. You can find data from tjese various repositories and more: