Big Data Project

This is a short description of my class project and its findings.

Autoimmune App Analysis

The data I chose for this project comes from an app named Flaredown. It helps patients with chronic illnesses, mainly autoimmune ones, track their symptoms. The app helps them analyze their “triggers” and which treatments work for their symptoms. Triggers are foods or activities that may cause a flare-up of the disorder or illness. Users track anything out of the ordinary in the app, such as more milk than normal, stress, riding a bike, or being in the sun a lot. The app focuses on symptoms rather than illnesses. However, the data is quite messy in the way it is reported in the file.

The questions I will focus on are: Does weather affect certain illnesses more than others? Are there particular tags that cause a major change in symptom severity? Can we predict a condition based on symptoms? Which tags are the most common? Does the patient’s age increase the severity or number of symptoms?

Data Description

The data has several columns. Some of the ones I will be focusing on are condition, symptom and its severity, treatment, “tag” (trigger possibilities), and weather. I won’t be focusing on some things such as HBI and food. HBI is a metric specific to Crohn’s and I want to focus on all illnesses. Food was added manually quite often so the data is difficult to sort and needs cleaning. Treatment, symptom, and tag also have this issue of a few manual entries, so I will focus on the ones that were preset and could be chosen rather than trying to clean up such a large file. I think those particular columns are too important to lose.

Exploratory Data Analysis

For this project I decided to use Spark map functions to count instances. I first kept only records that had entries for age and gender, and removed any ages under 18 or above 100 to keep outliers out. Then I paired the data in different ways. For example, I keyed the data by user to do individual user analysis. I also paired all instances of important conditions with age and sex for analysis on those as well. After these were all paired, I created bar graphs with Spark to get a loose idea of the correlations. Then I exported the results to Excel to create different graphs and better understand the data.
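The filtering and counting described above can be sketched in plain Python. This is only an illustration of the pattern, not the code used in the project: the record layout (user id, age, sex, condition) and the sample rows are assumptions made up for the example. In Spark the same shape would typically be a `filter`, then a `map` to (key, 1) pairs, then a `reduceByKey` that sums the ones.

```python
from collections import Counter

# Hypothetical records: (user_id, age, sex, condition).
# The field layout and values are assumptions for illustration only.
records = [
    ("u1", 34, "female", "Fibromyalgia"),
    ("u2", None, "male", "Crohn's disease"),  # missing age, dropped
    ("u3", 12, "female", "Lupus"),            # under 18, dropped
    ("u4", 2017, "female", "Lupus"),          # above 100, dropped
    ("u5", 51, None, "Fibromyalgia"),         # missing sex, dropped
    ("u6", 29, "male", "Fibromyalgia"),
]


def is_valid(record):
    """Keep rows with a known sex and an age between 18 and 100."""
    _, age, sex, _ = record
    return age is not None and sex is not None and 18 <= age <= 100


valid = [r for r in records if is_valid(r)]

# "Map" step: turn each record into a (key, 1) pair keyed by condition.
pairs = [(condition, 1) for _, _, _, condition in valid]

# "Reduce" step: sum the ones for each key.
counts = Counter()
for key, one in pairs:
    counts[key] += one

print(dict(counts))
```

Changing the key in the map step (user id, age, or sex instead of condition) gives the other pairings mentioned above without touching the rest of the code.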

From a brief look at the data, I noticed that users did not seem to use the app for very long. I decided to count the number of entries/days people used the app. Sorted by age, I found that the highest average for any age was around 47. This app was designed to be used for several months to find trends and help people manage their symptoms more easily. Unfortunately, users did not seem to use it for very long, some for only 1 or 2 days. The graph below shows this.
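One way to compute this kind of usage figure is sketched below. It assumes each check-in carries a user id, an age, and a date, and it counts distinct days per user before averaging by age. Whether the project counted days or raw entries is not stated, so counting distinct days is an assumption, as are the field names and sample rows.

```python
from collections import defaultdict

# Hypothetical check-ins: (user_id, age, date). Layout is an assumption.
checkins = [
    ("u1", 30, "2017-01-01"),
    ("u1", 30, "2017-01-02"),
    ("u1", 30, "2017-01-02"),  # second entry on the same day
    ("u2", 30, "2017-01-05"),
    ("u3", 45, "2017-02-01"),
    ("u3", 45, "2017-02-02"),
]

# Step 1: collect the distinct days on which each user logged anything.
days_by_user = defaultdict(set)
age_by_user = {}
for user_id, age, date in checkins:
    days_by_user[user_id].add(date)
    age_by_user[user_id] = age

# Step 2: group the per-user day counts by the user's age.
day_counts_by_age = defaultdict(list)
for user_id, days in days_by_user.items():
    day_counts_by_age[age_by_user[user_id]].append(len(days))

# Step 3: average the day counts within each age.
average_days_by_age = {
    age: sum(day_counts) / len(day_counts)
    for age, day_counts in day_counts_by_age.items()
}

for age in sorted(average_days_by_age):
    print(age, average_days_by_age[age])
```

Averaging per user first, rather than over raw rows, keeps one very active user from dominating the figure for their age.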

Average Time by User Age


Another part of the data I was able to analyze was the proportion of users by age group. I wanted to see whether some age groups were more likely to use an app like this, and whether I could find patterns in the average age of users and their conditions. I displayed the number of people by age in both a regular bar graph and a histogram to get a better idea of the data. I omitted anything below the age of 18, as children are too different from adults. A lot of the data also had 0 or 1 as the age, likely because users didn’t want to provide their real age. I also omitted a value of 2017, where I assume the user confused age with a year. The resulting graphs and analysis are below.
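The age cleaning and grouping described above might look like the following minimal sketch. The sample ages, the ten-year bin width, and the helper name `age_group` are all assumptions chosen for illustration. The 18 to 100 range comes from the filtering described earlier in this report.

```python
from collections import Counter

# Hypothetical raw ages, including the kinds of bad values described above.
ages = [0, 1, 17, 18, 24, 29, 35, 41, 67, 2017]


def age_group(age, width=10):
    """Label the bin an age falls into, e.g. 24 -> '20-29'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


# Drop placeholder ages (0, 1), minors, and implausible values such as 2017.
adult_ages = [a for a in ages if 18 <= a <= 100]

# Count how many users fall into each bin.
# Note: after filtering, the '10-19' bin only contains ages 18 and 19.
histogram = Counter(age_group(a) for a in adult_ages)

# Print a rough text histogram, ordered by the start of each bin.
for group in sorted(histogram, key=lambda g: int(g.split("-")[0])):
    print(group, "#" * histogram[group])
```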

Graph 1

Graph 2

Here we can see that most of the users are fairly young, with counts descending as we get to older ages. Depending on the autoimmune disease, the average age of onset can vary. However, I think this obvious decline can be explained simply by familiarity with technology. I don’t know many people over the age of 60 who own a smartphone, and of those who do, I don’t know many who can use one easily enough to download a new app and learn to use it without help. Younger generations are probably more likely to try an app like this to help manage their symptoms and disease.

Below I also looked at the average age for some of the most common autoimmune conditions. We can see that it stayed pretty much in the middle of our data, as expected from our previous graphs. There is no significant variation here.

Average Age by Condition

However, after looking at the data and trying to narrow down the most common autoimmune conditions, I became aware of how extremely disorganized the data was. User input caused serious issues. The most common “conditions” I found in the data included things like “walking”, “coffee”, “Ibuprofen”, and many others. Free-form user input can cause many problems when sorting data like this, such as misspellings, misclassification (as in the examples just given), or entries that are simply wrong. I will talk more about this in my conclusion.
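A small sketch of how such entries could be counted and flagged is given below. It is not the method used in this project. The light normalization (trimming whitespace and lowercasing) and the allow-list of known conditions are both assumptions added for illustration, and the sample entries are made up.

```python
from collections import Counter

# Hypothetical free-text "condition" entries, similar to those described above.
condition_entries = [
    "Fibromyalgia",
    "fibromyalgia ",
    "coffee",
    "walking",
    "Lupus",
    "FIBROMYALGIA",
    "lupus",
]


def normalize(name):
    """Collapse whitespace and lowercase so trivial variants match."""
    return " ".join(name.split()).lower()


# Count entries after normalization.
counts = Counter(normalize(name) for name in condition_entries)

# Hypothetical allow-list of real conditions (an assumption for this sketch).
known_conditions = {"fibromyalgia", "lupus"}

# Anything outside the allow-list is flagged for manual review.
suspect = {name: n for name, n in counts.items() if name not in known_conditions}

print(counts.most_common(2))
print(sorted(suspect))
```

This only catches case and spacing differences. Real misspellings would still need manual review or some form of fuzzy matching.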

Another thing I wanted to look at was the proportion of women versus men using this app. There was an “other” option that I kept in, but it is hard to use in a scientific sense until more research is done on those who identify as other. Of the users on the app, 2.5 million were women, just under 200 thousand were men, and just under 180 thousand were other. This outcome doesn’t surprise me. Research has shown that autoimmune disorders occur mostly in women, and this data is consistent with that. I was happy with the outcome, since it matched other scientific studies. Some autoimmune disorders affect men somewhat more than others do, so I wanted to see the proportions for each condition in a graph.

Users and Gender by Condition

In this graph we can see how many users, and of what gender, were in each major autoimmune condition. For the most part, the conditions look fairly consistent, with mostly female users and other/male much farther down. The graph also shows another example of data entry errors, since we are relying on user input and not a doctor’s diagnosis. There is a slight bar for male under endometriosis: 106 users who identify as men entered in the app that they had endometriosis. This could be explained by data entry error or by transgender users. The problem is that people born male do not have a uterus and would not be expected to have endometriosis. Without knowing more about the users, we cannot tell why they entered this condition.
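The counts behind a graph like this can be sketched as a tally keyed by (condition, sex). The rows, field layout, and the step of removing duplicate rows so each user counts once per condition are assumptions for illustration, not a description of the project's actual code.

```python
from collections import Counter

# Hypothetical rows: (user_id, sex, condition). Layout is an assumption.
rows = [
    ("u1", "female", "Endometriosis"),
    ("u2", "male", "Endometriosis"),
    ("u3", "female", "Fibromyalgia"),
    ("u4", "other", "Fibromyalgia"),
    ("u5", "female", "Endometriosis"),
    ("u5", "female", "Endometriosis"),  # repeated check-in by the same user
]

# Count each user once per condition by removing duplicate rows.
unique_rows = set(rows)

# Tally users for every (condition, sex) combination.
by_condition_sex = Counter((condition, sex) for _, sex, condition in unique_rows)

# Total users per condition, used to turn counts into shares.
totals = Counter()
for (condition, _), n in by_condition_sex.items():
    totals[condition] += n

# Share of each sex within each condition.
shares = {
    (condition, sex): n / totals[condition]
    for (condition, sex), n in by_condition_sex.items()
}

for key in sorted(by_condition_sex):
    print(key, by_condition_sex[key], round(shares[key], 2))
```

Looking at shares within each condition, rather than raw counts, makes unexpected combinations easier to spot even when one group is much larger overall.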

Conclusion

This data was really interesting to look into. I was able to see some patterns that matched my hypotheses, and to explore data that relates closely to my interest in helping those with chronic illnesses in the future. I also got to see the frustration of dealing with user-input data. The sorting that needed to be done was 50% of this project. I found that gathering data in bulk like this is easy and significantly cheaper, but that little can be said about its accuracy.

Another item I wanted to mention was the app’s poor usability. After reading its reviews and downloading the app myself, I noticed how difficult it was to use. I am not surprised that users abandoned the app after a short time to find a better one. This could have contributed to the lack of good data if users struggled to enter what they wanted. In the future I would be interested in looking at another data set like this and digging a bit deeper. Or perhaps I could create my own app that is more user-friendly and offers more analysis.