Statistical Learning Project

This is a short description of my class project and its findings.

Boston Real Estate Predictor

Knowing the statistics of a neighborhood is very important for realtors. As the population grows and more people move into homes, real estate agents are in greater demand than ever. They need to know what their clients want and how to find a house that fits. If a client wants a home just outside a city, in an area with low homelessness and a low crime rate, the realtor needs to be able to give an accurate estimate of how much that house will cost. For my data set, I chose to predict housing costs based on different characteristics of suburbs in Boston. This analysis can help realtors see which factors can drastically change the price of a home.

Data Description

This data set is not large, with just over 500 entries. It has thirteen continuous attributes: crime rate, proportion of residential land zoned for lots over 25,000 sq. ft., proportion of non-retail businesses, nitric oxide concentration, average number of rooms per house, proportion of houses built prior to 1940, distance to Boston employment centers, accessibility to radial highways, property tax rate, student-teacher ratio, proportion of Black residents, median value of owner-occupied homes, and percentage of lower-status population. It also has one binary variable indicating whether the suburb borders the Charles River. The data set was taken from the StatLib library, which is maintained at Carnegie Mellon University. Please take the results with a grain of salt: the StatLib copy dates from 1993, the underlying figures are older still (they come from the 1970 census), and the housing market has changed greatly since then.

Exploratory Data Analysis

For my analysis, I focused on predicting two of the attributes: crime rate and median value of houses. For median value, I found that a simple linear model gave the best fit, and I then removed any variables that did not help the model. I tried other models, but they did not work as well. The median value depended on too many variables for another model to fit well and still be easily interpretable. For crime rate, I found only a few variables that could affect it, but I think the results were very interesting. I did multiple exploratory analysis runs, including linear fits, polynomial fits, and a tree. All my results are illustrated below.

One correlation that stood out in the pairs plot was between nitric oxide levels and the proportion of non-retail businesses. This is likely because many non-retail businesses are located in large cities, where there is more traffic and thus higher nitric oxide levels. This is a good example of correlation without causation. I plotted two fitted lines here: blue is a linear fit and red is a polynomial fit.

The correlation between distance to employment centers and nitric oxide levels probably has a similar explanation. Busy, growing cities also have a large amount of traffic, so they have not only employment centers but also higher nitric oxide levels.

Graph 1

Graph 2
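As a rough illustration of the blue (linear) and red (polynomial) fits described above, here is a minimal Python sketch. The original analysis was done in R, so this is not the code behind the graphs. The data are synthetic stand-ins, the names `indus` and `nox` are assumed labels for the two attributes, and degree 2 is an assumption because the degree of the polynomial fit is not stated.

```python
import random

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def rss(xs, ys, coeffs):
    """Residual sum of squares of the fitted polynomial."""
    return sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

# Synthetic stand-in data (NOT the Boston values): a curved trend plus noise.
random.seed(0)
indus = [random.uniform(0, 25) for _ in range(200)]
nox = [0.4 + 0.02 * x - 0.0004 * x ** 2 + random.gauss(0, 0.03) for x in indus]

linear = polyfit(indus, nox, 1)      # counterpart of the blue line
quadratic = polyfit(indus, nox, 2)   # counterpart of the red curve (degree assumed)
rss_linear = rss(indus, nox, linear)
rss_quadratic = rss(indus, nox, quadratic)
print(f"RSS linear: {rss_linear:.4f}, RSS quadratic: {rss_quadratic:.4f}")
```

A polynomial fit can never have a larger residual sum of squares than the linear fit on the same data, so a smaller RSS alone does not show that the curve is the better description.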

Before we move on to the main point of the project, predicting the median value of houses, I want to take a look at the crime rate and find what predicts it. After a lot of exploratory analysis, I arrived at the following results.

I split the data into high and low criminality, based purely on whether a suburb's crime rate was above or below the median crime rate. In the high-crime group, house values were about 4000 lower in both mean and median. This is important for realtors to know.
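The median split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic, the names `crime` and `medv` are assumed labels, and the size of the gap it prints has nothing to do with the 4000 figure above.

```python
import random
import statistics

# Synthetic stand-in data (NOT the Boston values): higher crime, lower value.
random.seed(1)
crime = [random.expovariate(0.3) for _ in range(500)]
medv = [25 - 0.8 * c + random.gauss(0, 4) for c in crime]

# Split at the median crime rate into a high-crime and a low-crime group.
cutoff = statistics.median(crime)
high = [v for c, v in zip(crime, medv) if c > cutoff]
low = [v for c, v in zip(crime, medv) if c <= cutoff]

# Compare home values between the two groups in both mean and median.
mean_gap = statistics.mean(low) - statistics.mean(high)
median_gap = statistics.median(low) - statistics.median(high)
print(f"cutoff={cutoff:.2f}  mean gap={mean_gap:.2f}  median gap={median_gap:.2f}")
```

Splitting at the median gives two groups of roughly equal size, which keeps the comparison balanced but throws away how far each suburb is from the cutoff.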

Besides the effect crime has on house prices, I also looked at the factors that could predict the crime rate directly. I used a binomial model to predict high versus low crime and used exploratory analysis to find the best parameters. Nitric oxide level was the biggest contributor, but access to radial highways and the proportion of non-retail businesses also contributed. All three seem to be associated with cities, which is typically where there is the most crime, so I am not surprised by this output.

Binomial model predicting high or low crime rate (split at the median).
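For readers who want to see what a binomial (logistic) model of this kind does, here is a minimal Python sketch fitted by gradient descent. It is not the R model behind the output above. The three predictors are synthetic, standardized stand-ins, and the feature order (nitric oxide, highway access, non-retail proportion), the helper names, and the "true" weights used to generate labels are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp z so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Fit weights [bias, w1, ...] by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x))) - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

def predict(weights, x):
    """Return True when the predicted probability of high crime is at least 0.5."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x))) >= 0.5

# Synthetic stand-ins for nitric oxide, highway access, non-retail proportion.
random.seed(2)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(300)]
# Hypothetical ground truth: the first feature matters most, the third least.
labels = [1 if 2.0 * r[0] + 1.0 * r[1] + 0.5 * r[2] + random.gauss(0, 0.5) > 0 else 0
          for r in rows]

weights = fit_logistic(rows, labels)
accuracy = sum(predict(weights, x) == bool(y) for x, y in zip(rows, labels)) / len(rows)
print("weights [bias, nox, rad, indus]:", [round(w, 2) for w in weights])
print(f"training accuracy: {accuracy:.2f}")
```

Because the stand-in predictors are on the same scale, the size of each fitted weight gives a rough sense of which one contributes most. With unscaled real data, the raw coefficients would not be directly comparable in this way.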

Finally, here is my most important look at the data. I wanted to see which factors were most important in determining the median house value, and I found that the best fit was a linear model. I ran other models on the data but failed to find one that worked any better than a simple linear model. I also narrowed down the set of attributes to find the one that gave the highest R-squared value. My R code produced the following output.

Linear model predicting the median value of houses.

Including every available attribute except industry (the proportion of non-retail businesses) gave the highest R-squared: 0.6163. For as little data as I was given, I think this is successful. This is also the model in which most of the attributes were fit with high confidence.
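Since R-squared is the number used to compare models here, a small Python sketch of how it is computed may help. It is a simplification under stated assumptions: it fits one predictor instead of the full set of attributes, the data are synthetic, and the names `rooms` and `medv` are assumed labels. It will not reproduce 0.6163.

```python
import random
import statistics

def r_squared(actual, predicted):
    """R^2 = 1 - RSS / TSS: the share of variance the predictions explain."""
    mean_y = statistics.mean(actual)
    rss = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    tss = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - rss / tss

# Synthetic stand-in data (NOT the Boston values): value rises with room count.
random.seed(3)
rooms = [random.gauss(6.3, 0.7) for _ in range(500)]
medv = [-30 + 8.5 * r + random.gauss(0, 5) for r in rooms]

# Closed-form simple linear regression: slope = cov(x, y) / var(x).
mean_x, mean_y = statistics.mean(rooms), statistics.mean(medv)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(rooms, medv))
         / sum((x - mean_x) ** 2 for x in rooms))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in rooms]

r2 = r_squared(medv, fitted)
print(f"R-squared: {r2:.4f}")
```

An R-squared of 1 means the fitted values match the data exactly, and 0 means the model does no better than always predicting the mean. It measures the share of variance explained, which is not the same thing as prediction accuracy.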


I was not expecting so many of the parameters to be important in predicting house values. I do wish the model explained more than about 62% of the variance in median value (an R-squared of 0.6163), but maybe with more data I would be more successful. I am curious to see how these correlations look in other data sets. I wonder whether they would be different, more precise, or not follow this pattern at all.